bug#40407: [PATCH] slow ENCODE_FILE and DECODE

unofficial mirror of bug-gnu-emacs@gnu.org 
 help / color / mirror / code / Atom feed

* bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
@ 2020-04-03 14:18 Mattias Engdegård
  2020-04-03 16:24 ` Eli Zaretskii
  0 siblings, 1 reply; 42+ messages in thread
From: Mattias Engdegård @ 2020-04-03 14:18 UTC (permalink / raw)
  To: 40407

[-- Attachment #1: Type: text/plain, Size: 1280 bytes --]

ENCODE_FILE and DECODE_FILE turn out to be surprisingly slow, and allocate copious amounts of memory, to the point that they often turn up in both memory and cpu profiles. (This is on macOS; I haven't checked the situation elsewhere.)

For instance, a single call to file-relative-name, with ASCII-only arguments, manages to allocate 140 KiB. There are several conversion steps each involving creating temporary buffers as well as the compilation and execution of very large "quick-check" regexps. Example:

(progn
  (require 'profiler)
  (profiler-reset)
  (garbage-collect)
  (profiler-start 'mem)
  (file-relative-name "abc")
  (profiler-stop)
  (profiler-report))

This applies to just about every function dealing with files or file names.

The attached patch is somewhat conservatively written but at least a starting point. It reduces the memory consumption by file-relative-name in the example above to zero. Perhaps we can assume that file names codings are always ASCII-compatible; if so, the shortcut can be taken in encode_file_name and decode_file_name directly.

There is already a hack in encode_file_name that assumes that no unibyte string ever needs encoding; if so, the shortcut could perhaps be extended to decode_file_name and simplified.

[-- Attachment #2: 0001-Avoid-expensive-recoding-for-ASCII-identity-cases.patch --]
[-- Type: application/octet-stream, Size: 2106 bytes --]

From dca8b997d3e7c36667e12f1c77fc6ffed7d8f555 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mattias=20Engdeg=C3=A5rd?= <mattiase@acm.org>
Date: Fri, 3 Apr 2020 16:01:01 +0200
Subject: [PATCH] Avoid expensive recoding for ASCII identity cases

Optimise for the common case of encoding or decoding an ASCII-only
string using an ASCII-compatible coding, for file names in particular.

* src/coding.c (string_ascii_p): New function.
(code_convert_string): Return the input string for ASCII-only inputs
and ASCII-compatible codings.
---
 src/coding.c | 23 ++++++++++++++++++++++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/src/coding.c b/src/coding.c
index 0bea2a0c2b..9a17fafb05 100644
--- a/src/coding.c
+++ b/src/coding.c
@@ -9471,6 +9471,17 @@ used (which may be different from CODING-SYSTEM if CODING-SYSTEM is
   return code_convert_region (start, end, coding_system, destination, 1, 0);
 }

+/* Whether a (unibyte) string only contains chars in the 0..127 range.  */
+static bool
+string_ascii_p (Lisp_Object str)
+{
+  ptrdiff_t nbytes = SBYTES (str);
+  for (ptrdiff_t i = 0; i < nbytes; i++)
+    if (SREF (str, i) > 127)
+      return false;
+  return true;
+}
+
 Lisp_Object
 code_convert_string (Lisp_Object string, Lisp_Object coding_system,
 		     Lisp_Object dst_object, bool encodep, bool nocopy,
@@ -9502,7 +9513,17 @@ code_convert_string (Lisp_Object string, Lisp_Object coding_system,
   chars = SCHARS (string);
   bytes = SBYTES (string);

-  if (BUFFERP (dst_object))
+  if (EQ (dst_object, Qt))
+    {
+      /* Fast path for ASCII-only input and an ASCII-compatible coding:
+         act as identity.  */
+      Lisp_Object attrs = CODING_ID_ATTRS (coding.id);
+      if (! NILP (CODING_ATTR_ASCII_COMPAT (attrs))
+          && (STRING_MULTIBYTE (string)
+              ? (chars == bytes) : string_ascii_p (string)))
+        return string;
+    }
+  else if (BUFFERP (dst_object))
     {
       struct buffer *buf = XBUFFER (dst_object);
       ptrdiff_t buf_pt = BUF_PT (buf);
-- 
2.21.1 (Apple Git-122.3)

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
  2020-04-03 14:18 bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE Mattias Engdegård
@ 2020-04-03 16:24 ` Eli Zaretskii
  2020-04-03 22:32   ` Mattias Engdegård
  2020-04-06 10:10   ` OGAWA Hirofumi
  0 siblings, 2 replies; 42+ messages in thread
From: Eli Zaretskii @ 2020-04-03 16:24 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 40407

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Fri, 3 Apr 2020 16:18:43 +0200
> 
> ENCODE_FILE and DECODE_FILE turn out to be surprisingly slow, and allocate copious amounts of memory, to the point that they often turn up in both memory and cpu profiles. (This is on macOS; I haven't checked the situation elsewhere.)

AFAIR, on macOS the situation is worse than elsewhere, because of the
normalization thing.

> For instance, a single call to file-relative-name, with ASCII-only arguments, manages to allocate 140 KiB. There are several conversion steps each involving creating temporary buffers as well as the compilation and execution of very large "quick-check" regexps. Example:
> 
> (progn
>   (require 'profiler)
>   (profiler-reset)
>   (garbage-collect)
>   (profiler-start 'mem)
>   (file-relative-name "abc")
>   (profiler-stop)
>   (profiler-report))

Can you tell more about the conversion steps and the memory each one
allocates?

> Perhaps we can assume that file names codings are always ASCII-compatible

I don't think every encoding is ASCII compatible, so I don't see how
we can assume that in general.  But the check whether an encoding is
ASCII-compatible takes a negligible amount of time, so why bother with
such an assumption?

> There is already a hack in encode_file_name that assumes that no unibyte string ever needs encoding; if so, the shortcut could perhaps be extended to decode_file_name and simplified.

I'm not sure I understand what you mean by extending the shortcut to
decode_file_name.  Please elaborate.

> -  if (BUFFERP (dst_object))
> +  if (EQ (dst_object, Qt))
> +    {
> +      /* Fast path for ASCII-only input and an ASCII-compatible coding:
> +         act as identity.  */
> +      Lisp_Object attrs = CODING_ID_ATTRS (coding.id);
> +      if (! NILP (CODING_ATTR_ASCII_COMPAT (attrs))
> +          && (STRING_MULTIBYTE (string)
> +              ? (chars == bytes) : string_ascii_p (string)))
> +        return string;

I don't think we can return the same string if NOCOPY is non-zero.
The callers might not expect that, and you might inadvertently cause
the original string be modified behind the caller's back.

But if NOCOPY is 'false', I think this change is OK.  Just make sure
the test suite doesn't start failing, maybe there's something else we
are missing.

Thanks.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
  2020-04-03 16:24 ` Eli Zaretskii
@ 2020-04-03 22:32   ` Mattias Engdegård
  2020-04-04  9:26     ` Eli Zaretskii
  2020-04-04 10:26     ` Eli Zaretskii
  2020-04-06 10:10   ` OGAWA Hirofumi
  1 sibling, 2 replies; 42+ messages in thread
From: Mattias Engdegård @ 2020-04-03 22:32 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 40407

3 apr. 2020 kl. 18.24 skrev Eli Zaretskii <eliz@gnu.org>:

> AFAIR, on macOS the situation is worse than elsewhere, because of the
> normalization thing.

Very likely. It's just what I had in my lap.

> Can you tell more about the conversion steps and the memory each one
> allocates?

Courtesy the memory profiler:

         - file-relative-name                                 141,551  15%
          - file-name-case-insensitive-p                      100,613  11%
           - ucs-normalize-hfs-nfd-pre-write-conversion       100,613  11%
            - ucs-normalize-HFS-NFD-region                    100,613  11%
               ucs-normalize-region                           100,613  11%
          - expand-file-name                                   40,828   4%
           - ucs-normalize-hfs-nfd-post-read-conversion        40,828   4%
            - ucs-normalize-HFS-NFC-region                     40,828   4%
               ucs-normalize-region                            40,828   4%

where file_name_case_insensitive_p calls ENCODE_FILE and expand_file_name calls DECODE_FILE. I'm not sure how much each part of ucs-normalize-region actually consumes, but I think we can agree that we don't want it called on any platform unless strictly necessary.

> I don't think every encoding is ASCII compatible, so I don't see how
> we can assume that in general.  But the check whether an encoding is
> ASCII-compatible takes a negligible amount of time, so why bother with
> such an assumption?

Quite, I just thought I'd ask in case there were some unwritten invariant that you knew about.

> I'm not sure I understand what you mean by extending the shortcut to
> decode_file_name.  Please elaborate.

Never mind, it was an under-thought idea. The existing bootstrap hack making encode_file_name identity for any unibyte string does not seem to need or allow any symmetry in decode_file_name.

> I don't think we can return the same string if NOCOPY is non-zero.
> The callers might not expect that, and you might inadvertently cause
> the original string be modified behind the caller's back.

You are no doubt correct, but doesn't it look like the sense of NOCOPY has been inverted here? It runs contrary to the intuitive meaning and to the doc string of {encode,decode}-coding-string. In fact:

(let* ((nocopy nil)
       (x "abc")
       (y (decode-coding-string x nil nocopy nil)))
  (eq x y))
=> t

Looks like we suddenly got more work on our hands. What a surprise.

Since string mutation is so rare, I doubt it has caused any real trouble. Now, do we fix it by inverting the sense of the argument, or by renaming it to COPY? I'm fairly neutral, but there are arguments in either way, both in terms of performance and correctness. And what about internal calls to code_convert_string?

There are 193 calls to {encode, decode}-coding-string in the Emacs tree, and only 14 of them pass a non-nil value to NOCOPY. I'd be inclined to keep the semantics but rename the argument to COPY, on the grounds that no-copy is a better default; then change those 14 calls to pass nil instead, since that obviously was the intent.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
  2020-04-03 22:32   ` Mattias Engdegård
@ 2020-04-04  9:26     ` Eli Zaretskii
  2020-04-04 16:41       ` Mattias Engdegård
  2020-04-04 10:26     ` Eli Zaretskii
  1 sibling, 1 reply; 42+ messages in thread
From: Eli Zaretskii @ 2020-04-04  9:26 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 40407

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Sat, 4 Apr 2020 00:32:21 +0200
> Cc: 40407@debbugs.gnu.org
> 
>          - file-relative-name                                 141,551  15%
>           - file-name-case-insensitive-p                      100,613  11%
>            - ucs-normalize-hfs-nfd-pre-write-conversion       100,613  11%
>             - ucs-normalize-HFS-NFD-region                    100,613  11%
>                ucs-normalize-region                           100,613  11%
>           - expand-file-name                                   40,828   4%
>            - ucs-normalize-hfs-nfd-post-read-conversion        40,828   4%
>             - ucs-normalize-HFS-NFC-region                     40,828   4%
>                ucs-normalize-region                            40,828   4%
> 
> where file_name_case_insensitive_p calls ENCODE_FILE and expand_file_name calls DECODE_FILE.

DECODE_FILE is called because the file name in question starts with a
"~"?  Otherwise, I don't think I understand why would expand-file-name
need to decode a file name.

> I'm not sure how much each part of ucs-normalize-region actually consumes, but I think we can agree that we don't want it called on any platform unless strictly necessary.

Any expensive code should be avoided if it isn't necessary, so yes, I
agree.  And yes, Unicode normalization is expensive.  If we consider
the macOS filesystem idiosyncrasies important to support efficiently,
perhaps we should rewrite the normalization code in C.

> > I don't think every encoding is ASCII compatible, so I don't see how
> > we can assume that in general.  But the check whether an encoding is
> > ASCII-compatible takes a negligible amount of time, so why bother with
> > such an assumption?
> 
> Quite, I just thought I'd ask in case there were some unwritten invariant that you knew about.

Whether a coding-system is ASCII-compatible is determined by the
definition of that coding-system.  Look in mule-conf.el, and you will
see there several that aren't ASCII-compatible.  UTF-16 is one
example, but there are others.

> > I don't think we can return the same string if NOCOPY is non-zero.
> > The callers might not expect that, and you might inadvertently cause
> > the original string be modified behind the caller's back.
> 
> You are no doubt correct, but doesn't it look like the sense of NOCOPY has been inverted here?

That ship has sailed long ago (I could explain how this "inverted"
meaning could make sense, but I don't think it's relevant to the issue
at hand), and there are several other internal functions that use a
similar argument in the same "inverted" sense.  This is a separate
issue, anyway.

> Since string mutation is so rare, I doubt it has caused any real trouble.

You are wrong here, it can happen very easily, especially when you
manipulate the encoded string in C.  The simplest use case is that you
encode a file name, and then make some change to the encoded string,
like change the letter-case or remove the trailing slash.  Suddenly
the original string is changed as well, and the Lisp caller of the
high-level function might be mightily surprised by the result.

IME, the cases where we can safely assume it's OK to return the same
string are actually very rare.  It is no accident that you saw so few
calls of these functions where we use that optional behavior.

> Now, do we fix it by inverting the sense of the argument, or by renaming it to COPY?

Neither, IMO.  Again, it's a separate problem, and let's keep our
sights squarely on the original issue you wanted to fix.  Let's tackle
the NOCOPY issue in a separate discussion, OK?

Thanks.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
  2020-04-04  9:26     ` Eli Zaretskii
@ 2020-04-04 16:41       ` Mattias Engdegård
  2020-04-04 17:22         ` Eli Zaretskii
  0 siblings, 1 reply; 42+ messages in thread
From: Mattias Engdegård @ 2020-04-04 16:41 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 40407

[-- Attachment #1: Type: text/plain, Size: 1158 bytes --]

4 apr. 2020 kl. 11.26 skrev Eli Zaretskii <eliz@gnu.org>:

> DECODE_FILE is called because the file name in question starts with a
> "~"?  Otherwise, I don't think I understand why would expand-file-name
> need to decode a file name.

Maybe it's because default-directory started with a tilde. It doesn't really matter; it's a common case, and the profiler tells us as much.

> IME, the cases where we can safely assume it's OK to return the same
> string are actually very rare.  It is no accident that you saw so few
> calls of these functions where we use that optional behavior.

This does not mean that the remaining 179 calls require a copy; they just use the default value of the parameter.

> Neither, IMO.  Again, it's a separate problem, and let's keep our
> sights squarely on the original issue you wanted to fix.  Let's tackle
> the NOCOPY issue in a separate discussion, OK?

Thank you, a separate bug for it is fine.

Here is a revised patch which takes the nocopy parameter into account (in its inverted sense). Obviously it needs to be adapted if the nocopy inversion is dealt with first; the two bugs do not commute.


[-- Attachment #2: 0001-Avoid-expensive-recoding-for-ASCII-identity-cases-bu.patch --]
[-- Type: application/octet-stream, Size: 2146 bytes --]

From 0c6139ab490733f3c1257665535fc4ed2ad0dbe7 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mattias=20Engdeg=C3=A5rd?= <mattiase@acm.org>
Date: Fri, 3 Apr 2020 16:01:01 +0200
Subject: [PATCH] Avoid expensive recoding for ASCII identity cases (bug#40407)

Optimise for the common case of encoding or decoding an ASCII-only
string using an ASCII-compatible coding, for file names in particular.

* src/coding.c (string_ascii_p): New function.
(code_convert_string): Return the input string for ASCII-only inputs
and ASCII-compatible codings.
---
 src/coding.c | 23 ++++++++++++++++++++++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/src/coding.c b/src/coding.c
index 0bea2a0c2b..0fdbc95939 100644
--- a/src/coding.c
+++ b/src/coding.c
@@ -9471,6 +9471,17 @@ used (which may be different from CODING-SYSTEM if CODING-SYSTEM is
   return code_convert_region (start, end, coding_system, destination, 1, 0);
 }
 
+/* Whether a (unibyte) string only contains chars in the 0..127 range.  */
+static bool
+string_ascii_p (Lisp_Object str)
+{
+  ptrdiff_t nbytes = SBYTES (str);
+  for (ptrdiff_t i = 0; i < nbytes; i++)
+    if (SREF (str, i) > 127)
+      return false;
+  return true;
+}
+
 Lisp_Object
 code_convert_string (Lisp_Object string, Lisp_Object coding_system,
 		     Lisp_Object dst_object, bool encodep, bool nocopy,
@@ -9502,7 +9513,17 @@ code_convert_string (Lisp_Object string, Lisp_Object coding_system,
   chars = SCHARS (string);
   bytes = SBYTES (string);
 
-  if (BUFFERP (dst_object))
+  if (EQ (dst_object, Qt))
+    {
+      /* Fast path for ASCII-only input and an ASCII-compatible coding:
+         act as identity.  */
+      Lisp_Object attrs = CODING_ID_ATTRS (coding.id);
+      if (! NILP (CODING_ATTR_ASCII_COMPAT (attrs))
+          && (STRING_MULTIBYTE (string)
+              ? (chars == bytes) : string_ascii_p (string)))
+	return nocopy ? Fcopy_sequence (string) : string;
+    }
+  else if (BUFFERP (dst_object))
     {
       struct buffer *buf = XBUFFER (dst_object);
       ptrdiff_t buf_pt = BUF_PT (buf);
-- 
2.21.1 (Apple Git-122.3)


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
  2020-04-04 16:41       ` Mattias Engdegård
@ 2020-04-04 17:22         ` Eli Zaretskii
  2020-04-04 17:37           ` Eli Zaretskii
  2020-04-05 10:14           ` Mattias Engdegård
  0 siblings, 2 replies; 42+ messages in thread
From: Eli Zaretskii @ 2020-04-04 17:22 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 40407

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Sat, 4 Apr 2020 18:41:39 +0200
> Cc: 40407@debbugs.gnu.org
> 
> > DECODE_FILE is called because the file name in question starts with a
> > "~"?  Otherwise, I don't think I understand why would expand-file-name
> > need to decode a file name.
> 
> Maybe it's because default-directory started with a tilde. It doesn't really matter; it's a common case, and the profiler tells us as much.

I think it's important that we understand what happens here to the
last detail, but okay.

> > IME, the cases where we can safely assume it's OK to return the same
> > string are actually very rare.  It is no accident that you saw so few
> > calls of these functions where we use that optional behavior.
> 
> This does not mean that the remaining 179 calls require a copy; they just use the default value of the parameter.

And IMO the default must stay that a copy is returned, except when the
caller says otherwise.

> +  if (EQ (dst_object, Qt))
> +    {
> +      /* Fast path for ASCII-only input and an ASCII-compatible coding:
> +         act as identity.  */
> +      Lisp_Object attrs = CODING_ID_ATTRS (coding.id);
> +      if (! NILP (CODING_ATTR_ASCII_COMPAT (attrs))
> +          && (STRING_MULTIBYTE (string)
> +              ? (chars == bytes) : string_ascii_p (string)))
> +	return nocopy ? Fcopy_sequence (string) : string;

I think in the use case where we return a copy, we should make sure
the return value is unibyte when encoding and multibyte when decoding.
Otherwise, I think this is OK (for the master branch, obviously).

Thanks.





^ permalink raw reply	[flat|nested] 42+ messages in thread

* bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
  2020-04-04 17:22         ` Eli Zaretskii
@ 2020-04-04 17:37           ` Eli Zaretskii
  2020-04-04 18:06             ` Mattias Engdegård
  2020-04-05 10:14           ` Mattias Engdegård
  1 sibling, 1 reply; 42+ messages in thread
From: Eli Zaretskii @ 2020-04-04 17:37 UTC (permalink / raw)
  To: mattiase; +Cc: 40407

> Date: Sat, 04 Apr 2020 20:22:37 +0300
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: 40407@debbugs.gnu.org
> 
> > +  if (EQ (dst_object, Qt))
> > +    {
> > +      /* Fast path for ASCII-only input and an ASCII-compatible coding:
> > +         act as identity.  */
> > +      Lisp_Object attrs = CODING_ID_ATTRS (coding.id);
> > +      if (! NILP (CODING_ATTR_ASCII_COMPAT (attrs))
> > +          && (STRING_MULTIBYTE (string)
> > +              ? (chars == bytes) : string_ascii_p (string)))
> > +	return nocopy ? Fcopy_sequence (string) : string;
> 
> I think in the use case where we return a copy, we should make sure
> the return value is unibyte when encoding and multibyte when decoding.
> Otherwise, I think this is OK (for the master branch, obviously).

Btw, if we want this particular use case to be as fast as possible,
then Fcopy_sequence is not the best way, because it is not optimized
for the case of copying a single string.  We could do better by
calling make_uninit_multibyte/unibyte_string and memcpy directly.





^ permalink raw reply	[flat|nested] 42+ messages in thread

* bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
  2020-04-04 17:37           ` Eli Zaretskii
@ 2020-04-04 18:06             ` Mattias Engdegård
  2020-04-05  2:37               ` Eli Zaretskii
  0 siblings, 1 reply; 42+ messages in thread
From: Mattias Engdegård @ 2020-04-04 18:06 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 40407

4 apr. 2020 kl. 19.37 skrev Eli Zaretskii <eliz@gnu.org>:

> Btw, if we want this particular use case to be as fast as possible,
> then Fcopy_sequence is not the best way, because it is not optimized
> for the case of copying a single string.  We could do better by
> calling make_uninit_multibyte/unibyte_string and memcpy directly.

Yes, if that would provide a benefit. The pattern should probably be encapsulated in copy_string or similar, if it doesn't already exist. (Should it copy properties? Probably not.)






^ permalink raw reply	[flat|nested] 42+ messages in thread

* bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
  2020-04-04 18:06             ` Mattias Engdegård
@ 2020-04-05  2:37               ` Eli Zaretskii
  2020-04-05  3:42                 ` Eli Zaretskii
  0 siblings, 1 reply; 42+ messages in thread
From: Eli Zaretskii @ 2020-04-05  2:37 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 40407

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Sat, 4 Apr 2020 20:06:23 +0200
> Cc: 40407@debbugs.gnu.org
> 
> 4 apr. 2020 kl. 19.37 skrev Eli Zaretskii <eliz@gnu.org>:
> 
> > Btw, if we want this particular use case to be as fast as possible,
> > then Fcopy_sequence is not the best way, because it is not optimized
> > for the case of copying a single string.  We could do better by
> > calling make_uninit_multibyte/unibyte_string and memcpy directly.
> 
> Yes, if that would provide a benefit. The pattern should probably be encapsulated in copy_string or similar, if it doesn't already exist.

I wouldn't make it a separate function for the benefit of just one
caller.  Every function call is a slowdown, albeit a small one.

> (Should it copy properties? Probably not.)

Definitely not.





^ permalink raw reply	[flat|nested] 42+ messages in thread

* bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
  2020-04-05  2:37               ` Eli Zaretskii
@ 2020-04-05  3:42                 ` Eli Zaretskii
  0 siblings, 0 replies; 42+ messages in thread
From: Eli Zaretskii @ 2020-04-05  3:42 UTC (permalink / raw)
  To: 40407, mattiase

On April 5, 2020 5:37:03 AM GMT+03:00, Eli Zaretskii <eliz@gnu.org> wrote:
> > From: Mattias Engdegård <mattiase@acm.org>
> > Date: Sat, 4 Apr 2020 20:06:23 +0200
> > Cc: 40407@debbugs.gnu.org
> > 
> > 4 apr. 2020 kl. 19.37 skrev Eli Zaretskii <eliz@gnu.org>:
> > 
> > > Btw, if we want this particular use case to be as fast as
> possible,
> > > then Fcopy_sequence is not the best way, because it is not
> optimized
> > > for the case of copying a single string.  We could do better by
> > > calling make_uninit_multibyte/unibyte_string and memcpy directly.
> > 
> > Yes, if that would provide a benefit. The pattern should probably be
> encapsulated in copy_string or similar, if it doesn't already exist.
> 
> I wouldn't make it a separate function for the benefit of just one
> caller.  Every function call is a slowdown, albeit a small one.

However, we already have those functions ready, see make_unibyte_string and make_multibyte_string.





^ permalink raw reply	[flat|nested] 42+ messages in thread

* bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
  2020-04-04 17:22         ` Eli Zaretskii
  2020-04-04 17:37           ` Eli Zaretskii
@ 2020-04-05 10:14           ` Mattias Engdegård
  2020-04-05 13:28             ` Eli Zaretskii
  1 sibling, 1 reply; 42+ messages in thread
From: Mattias Engdegård @ 2020-04-05 10:14 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 40407

[-- Attachment #1: Type: text/plain, Size: 1014 bytes --]

4 apr. 2020 kl. 19.22 skrev Eli Zaretskii <eliz@gnu.org>:

>> This does not mean that the remaining 179 calls require a copy; they just use the default value of the parameter.
> 
> And IMO the default must stay that a copy is returned, except when the
> caller says otherwise.

Yes, those can be dealt with piecemeal, and we are in no hurry to do so.

> I think in the use case where we return a copy, we should make sure
> the return value is unibyte when encoding and multibyte when decoding.

I'm not necessarily opposed to the suggestion, but why not return a unibyte string in both cases, simplifying the code? In addition, some operations (aref) are faster on unibyte. Either way, it's nothing that a caller could rely on, is there? (In particular when taking NOCOPY into account.)

> Otherwise, I think this is OK (for the master branch, obviously).

Indeed the intention, thanks.

Here is what I would commit, unless you think the string copy should really be multibyte when decoding.


[-- Attachment #2: 0001-Avoid-expensive-recoding-for-ASCII-identity-cases-bu.patch --]
[-- Type: application/octet-stream, Size: 3246 bytes --]

From 63400cc62506d2c3d9d5f2f27e7bb3bfe7f8f877 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mattias=20Engdeg=C3=A5rd?= <mattiase@acm.org>
Date: Fri, 3 Apr 2020 16:01:01 +0200
Subject: [PATCH] Avoid expensive recoding for ASCII identity cases (bug#40407)

Optimise for the common case of encoding or decoding an ASCII-only
string using an ASCII-compatible coding, for file names in particular.

* src/coding.c (string_ascii_p): New function.
(code_convert_string): Return the input string for ASCII-only inputs
and ASCII-compatible codings.
* test/src/coding-tests.el (coding-nocopy-ascii): New test.
---
 src/coding.c             | 23 ++++++++++++++++++++++-
 test/src/coding-tests.el | 11 +++++++++++
 2 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/src/coding.c b/src/coding.c
index 1049f1b755..2425f5952f 100644
--- a/src/coding.c
+++ b/src/coding.c
@@ -9471,6 +9471,17 @@ used (which may be different from CODING-SYSTEM if CODING-SYSTEM is
   return code_convert_region (start, end, coding_system, destination, 1, 0);
 }
 
+/* Whether a (unibyte) string only contains chars in the 0..127 range.  */
+static bool
+string_ascii_p (Lisp_Object str)
+{
+  ptrdiff_t nbytes = SBYTES (str);
+  for (ptrdiff_t i = 0; i < nbytes; i++)
+    if (SREF (str, i) > 127)
+      return false;
+  return true;
+}
+
 Lisp_Object
 code_convert_string (Lisp_Object string, Lisp_Object coding_system,
 		     Lisp_Object dst_object, bool encodep, bool nocopy,
@@ -9502,7 +9513,17 @@ code_convert_string (Lisp_Object string, Lisp_Object coding_system,
   chars = SCHARS (string);
   bytes = SBYTES (string);
 
-  if (BUFFERP (dst_object))
+  if (EQ (dst_object, Qt))
+    {
+      /* Fast path for ASCII-only input and an ASCII-compatible coding:
+         act as identity.  */
+      Lisp_Object attrs = CODING_ID_ATTRS (coding.id);
+      if (! NILP (CODING_ATTR_ASCII_COMPAT (attrs))
+          && (STRING_MULTIBYTE (string)
+              ? (chars == bytes) : string_ascii_p (string)))
+	return nocopy ? string : make_unibyte_string (SDATA (string), bytes);
+    }
+  else if (BUFFERP (dst_object))
     {
       struct buffer *buf = XBUFFER (dst_object);
       ptrdiff_t buf_pt = BUF_PT (buf);
diff --git a/test/src/coding-tests.el b/test/src/coding-tests.el
index 110ff12696..93e6709d44 100644
--- a/test/src/coding-tests.el
+++ b/test/src/coding-tests.el
@@ -383,6 +383,17 @@ coding-nocopy-trivial
     (should-not (eq (encode-coding-string s nil nil) s))
     (should (eq (encode-coding-string s nil t) s))))
 
+(ert-deftest coding-nocopy-ascii ()
+  "Check that the NOCOPY parameter works for ASCII-only strings."
+  (let* ((uni (apply #'string (number-sequence 0 127)))
+         (multi (string-to-multibyte uni)))
+    (dolist (s (list uni multi))
+      (dolist (coding '(us-ascii iso-latin-1 utf-8))
+        (should-not (eq (decode-coding-string s coding nil) s))
+        (should-not (eq (encode-coding-string s coding nil) s))
+        (should (eq (decode-coding-string s coding t) s))
+        (should (eq (encode-coding-string s coding t) s))))))
+
 ;; Local Variables:
 ;; byte-compile-warnings: (not obsolete)
 ;; End:
-- 
2.21.1 (Apple Git-122.3)


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
  2020-04-05 10:14           ` Mattias Engdegård
@ 2020-04-05 13:28             ` Eli Zaretskii
  2020-04-05 13:40               ` Mattias Engdegård
  0 siblings, 1 reply; 42+ messages in thread
From: Eli Zaretskii @ 2020-04-05 13:28 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 40407

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Sun, 5 Apr 2020 12:14:59 +0200
> Cc: 40407@debbugs.gnu.org
> 
> > I think in the use case where we return a copy, we should make sure
> > the return value is unibyte when encoding and multibyte when decoding.
> 
> I'm not necessarily opposed to the suggestion, but why not return a unibyte string in both cases, simplifying the code?

For compatibility with what happens now:

  (multibyte-string-p (decode-coding-string "abc" 'utf-8)) => t

> In addition, some operations (aref) are faster on unibyte. Either way, it's nothing that a caller could rely on, is there? (In particular when taking NOCOPY into account.)

That is true, of course, but many/most of our strings are multibyte
nowadays, even if they are ASCII.  Suddenly getting a unibyte string
instead would be surprising, I think, even if no one should depend on
it not happening.  (NOCOPY case is different: then it's the caller's
responsibility to deal with the issue.)  So I'd rather we produced a
multibyte string when "decoding" by copying.

> +/* Whether a (unibyte) string only contains chars in the 0..127 range.  */

One subtle point regarding this comment: I'd remove the "unibyte"
part, because (1) you apply this test to multibyte strings as well,
and (2) strings encoded in iso-2022 will look "pure-ASCII", but they
aren't.  The latter subtlety doesn't interfere with the caller,
because iso-2022 is not ASCII-compatible, but it's something I'd
mention in the comment, lest someone uses this function for some
other use case.

The patch is OK otherwise.  Thanks.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
  2020-04-05 13:28             ` Eli Zaretskii
@ 2020-04-05 13:40               ` Mattias Engdegård
  0 siblings, 0 replies; 42+ messages in thread
From: Mattias Engdegård @ 2020-04-05 13:40 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 40407

5 apr. 2020 kl. 15.28 skrev Eli Zaretskii <eliz@gnu.org>:

> That is true, of course, but many/most of our strings are multibyte
> nowadays, even if they are ASCII.  Suddenly getting a unibyte string
> instead would be surprising, I think, even if no one should depend on
> it not happening.  (NOCOPY case is different: then it's the caller's
> responsibility to deal with the issue.)  So I'd rather we produced a
> multibyte string when "decoding" by copying.

I don't agree fully but it is definitely not a strongly held opinion, so I followed your suggestion.

> One subtle point regarding this comment: I'd remove the "unibyte"
> part

Right, done. Thanks for the reviews!






^ permalink raw reply	[flat|nested] 42+ messages in thread

* bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
  2020-04-03 22:32   ` Mattias Engdegård
  2020-04-04  9:26     ` Eli Zaretskii
@ 2020-04-04 10:26     ` Eli Zaretskii
  2020-04-04 16:55       ` Mattias Engdegård
  1 sibling, 1 reply; 42+ messages in thread
From: Eli Zaretskii @ 2020-04-04 10:26 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 40407

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Sat, 4 Apr 2020 00:32:21 +0200
> Cc: 40407@debbugs.gnu.org
> 
> Since string mutation is so rare, I doubt it has caused any real trouble. Now, do we fix it by inverting the sense of the argument, or by renaming it to COPY? I'm fairly neutral, but there are arguments in either way, both in terms of performance and correctness. And what about internal calls to code_convert_string?
> 
> There are 193 calls to {encode, decode}-coding-string in the Emacs tree, and only 14 of them pass a non-nil value to NOCOPY. I'd be inclined to keep the semantics but rename the argument to COPY, on the grounds that no-copy is a better default; then change those 14 calls to pass nil instead, since that obviously was the intent.

After looking at this for some time, I think the problem is rarely if
ever seen.  The only function which has the NOCOPY sense inverted is
code_convert_string, and it only does that when the CODING_SYSTEM
argument is nil, which should almost never happen.  So I think it's OK
to change code_convert_string on master to use NOCOPY in its correct
sense.





^ permalink raw reply	[flat|nested] 42+ messages in thread

* bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
  2020-04-04 10:26     ` Eli Zaretskii
@ 2020-04-04 16:55       ` Mattias Engdegård
  2020-04-04 17:04         ` Eli Zaretskii
  0 siblings, 1 reply; 42+ messages in thread
From: Mattias Engdegård @ 2020-04-04 16:55 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 40407

4 apr. 2020 kl. 12.26 skrev Eli Zaretskii <eliz@gnu.org>:

> After looking at this for some time, I think the problem is rarely if
> ever seen.  The only function which has the NOCOPY sense inverted is
> code_convert_string, and it only does that when the CODING_SYSTEM
> argument is nil, which should almost never happen.

Oh, that's the easy part. It's the ASCII optimisation that makes it interesting.







^ permalink raw reply	[flat|nested] 42+ messages in thread

* bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
  2020-04-04 16:55       ` Mattias Engdegård
@ 2020-04-04 17:04         ` Eli Zaretskii
  2020-04-04 18:01           ` Mattias Engdegård
  0 siblings, 1 reply; 42+ messages in thread
From: Eli Zaretskii @ 2020-04-04 17:04 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 40407

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Sat, 4 Apr 2020 18:55:26 +0200
> Cc: 40407@debbugs.gnu.org
> 
> 4 apr. 2020 kl. 12.26 skrev Eli Zaretskii <eliz@gnu.org>:
> 
> > After looking at this for some time, I think the problem is rarely if
> > ever seen.  The only function which has the NOCOPY sense inverted is
> > code_convert_string, and it only does that when the CODING_SYSTEM
> > argument is nil, which should almost never happen.
> 
> Oh, that's the easy part. It's the ASCII optimisation that makes it interesting.

How so?





^ permalink raw reply	[flat|nested] 42+ messages in thread

* bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
  2020-04-04 17:04         ` Eli Zaretskii
@ 2020-04-04 18:01           ` Mattias Engdegård
  2020-04-04 18:25             ` Eli Zaretskii
  0 siblings, 1 reply; 42+ messages in thread
From: Mattias Engdegård @ 2020-04-04 18:01 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 40407

4 apr. 2020 kl. 19.04 skrev Eli Zaretskii <eliz@gnu.org>:

> How so?

Because then NOCOPY suddenly matters for almost all coding systems, not just nil. After all, an all-ASCII input string and ASCII-compatible coding is not an unusual combination. This forces us to be careful when correcting the NOCOPY sense, and may expose latent bugs.

But you are right: we should trust calls to {en,de}code-coding-system with NOCOPY=t, and the rest can also remain as they are until someone cares enough to change them.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
  2020-04-04 18:01           ` Mattias Engdegård
@ 2020-04-04 18:25             ` Eli Zaretskii
  2020-04-05 10:48               ` Mattias Engdegård
  0 siblings, 1 reply; 42+ messages in thread
From: Eli Zaretskii @ 2020-04-04 18:25 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 40407

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Sat, 4 Apr 2020 20:01:12 +0200
> Cc: 40407@debbugs.gnu.org
> 
> 4 apr. 2020 kl. 19.04 skrev Eli Zaretskii <eliz@gnu.org>:
> 
> > How so?
> 
> Because then NOCOPY suddenly matters for almost all coding systems, not just nil. After all, an all-ASCII input string and ASCII-compatible coding is not an unusual combination. This forces us to be careful when correcting the NOCOPY sense, and may expose latent bugs.

That's true.  But as a matter of fact, I don't see any calls to
code_convert_string with NOCOPY non-zero, they all pass zero or false
to it.  So none of the existing direct calls from C wants or expects
to get the same string.

> But you are right: we should trust calls to {en,de}code-coding-system with NOCOPY=t, and the rest can also remain as they are until someone cares enough to change them.

Agreed.

Btw, this bug was introduced in commit 4031e2b, 18 years ago.





^ permalink raw reply	[flat|nested] 42+ messages in thread

* bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
  2020-04-04 18:25             ` Eli Zaretskii
@ 2020-04-05 10:48               ` Mattias Engdegård
  2020-04-05 13:39                 ` Eli Zaretskii
  0 siblings, 1 reply; 42+ messages in thread
From: Mattias Engdegård @ 2020-04-05 10:48 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 40407

4 apr. 2020 kl. 20.25 skrev Eli Zaretskii <eliz@gnu.org>:

> That's true.  But as a matter of fact, I don't see any calls to
> code_convert_string with NOCOPY non-zero, they all pass zero or false
> to it.  So none of the existing direct calls from C wants or expects
> to get the same string.

Right. However, I did some reading and believe that nocopy=true is actually correct for all uses of {EN,DE}CODE_FILE, and in fact all calls to code_convert_string_norecord. One of the reasons is that the callers need to be careful with mutation wrt GC anyway; any post-recoding mutation is done on copies. (Not being able to change the length of strings also helps.)

I pushed what we agreed on in part for the pleasure of resolving such an old-standing bug) to master (962562cde4).
Given the limited scope of the change, would you agree to a backport of that to emacs-27?

For the reasons above, I think it's correct and proper to do (on master)

--- a/src/coding.c
+++ b/src/coding.c
@@ -9554,7 +9554,7 @@ code_convert_string (Lisp_Object string, Lisp_Object coding_system,
 code_convert_string_norecord (Lisp_Object string, Lisp_Object coding_system,
                              bool encodep)
 {
-  return code_convert_string (string, coding_system, Qt, encodep, 0, 1);
+  return code_convert_string (string, coding_system, Qt, encodep, 1, 1);
 }

^ permalink raw reply	[flat|nested] 42+ messages in thread

* bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
  2020-04-05 10:48               ` Mattias Engdegård
@ 2020-04-05 13:39                 ` Eli Zaretskii
  2020-04-05 15:03                   ` Mattias Engdegård
  0 siblings, 1 reply; 42+ messages in thread
From: Eli Zaretskii @ 2020-04-05 13:39 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 40407

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Sun, 5 Apr 2020 12:48:51 +0200
> Cc: 40407@debbugs.gnu.org
> 
> 4 apr. 2020 kl. 20.25 skrev Eli Zaretskii <eliz@gnu.org>:
> 
> > That's true.  But as a matter of fact, I don't see any calls to
> > code_convert_string with NOCOPY non-zero, they all pass zero or false
> > to it.  So none of the existing direct calls from C wants or expects
> > to get the same string.
> 
> Right. However, I did some reading and believe that nocopy=true is actually correct for all uses of {EN,DE}CODE_FILE, and in fact all calls to code_convert_string_norecord.

I don't think I follow.  We call code_convert_string_norecord, which
invokes code_convert_string with NOCOPY set to 'false'.  So all those
users should NOT receive the same string as the argument, and I don't
believe they expect that and can cope with it.

Perhaps what you meant was that NOCOPY = false was actually acting as
if the value were 'true', due to the bug.  But if so, that didn't
affect ENCODE_FILE and DECODE_FILE, because the bug is only visible
when the CODING_SYSTEM argument is nil, and both ENCODE_FILE and
DECODE_FILE never let that happen:

  if (! NILP (Vfile_name_coding_system))
    return code_convert_string_norecord (fname, Vfile_name_coding_system, 1);
  else if (! NILP (Vdefault_file_name_coding_system))
    return code_convert_string_norecord (fname,
					 Vdefault_file_name_coding_system, 1);
  else
    return fname;

So in practice this bug was probably never seen until now.

> One of the reasons is that the callers need to be careful with mutation wrt GC anyway; any post-recoding mutation is done on copies. (Not being able to change the length of strings also helps.)

I don't think I understand your line of reasoning here.  I don't think
GC is relevant, and as long as we are talking about file names, the
first null byte terminates it even though the Lisp string's length
could be larger.

> Given the limited scope of the change, would you agree to a backport of that to emacs-27?

That'd be a mistake, I think.  My reasoning goes like this: If I'm
right that this bug was never seen, fixing it on emacs-27 will have no
visible effect; and if I'm wrong, then we will break the release
branch.  The danger of breakage in the latter case is much more severe
than the gain from the fix in the former case.

> For the reasons above, I think it's correct and proper to do (on master)
> 
> --- a/src/coding.c
> +++ b/src/coding.c
> @@ -9554,7 +9554,7 @@ code_convert_string (Lisp_Object string, Lisp_Object coding_system,
>  code_convert_string_norecord (Lisp_Object string, Lisp_Object coding_system,
>                               bool encodep)
>  {
> -  return code_convert_string (string, coding_system, Qt, encodep, 0, 1);
> +  return code_convert_string (string, coding_system, Qt, encodep, 1, 1);
>  }

I hope you now agree with me that we should not do this.  The default
should stay NOCOPY = false, and any caller that wants otherwise must
explicitly request that by calling code_convert_string.

Thanks.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
  2020-04-05 13:39                 ` Eli Zaretskii
@ 2020-04-05 15:03                   ` Mattias Engdegård
  2020-04-05 15:35                     ` Mattias Engdegård
  2020-04-05 16:00                     ` Eli Zaretskii
  0 siblings, 2 replies; 42+ messages in thread
From: Mattias Engdegård @ 2020-04-05 15:03 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 40407

5 apr. 2020 kl. 15.39 skrev Eli Zaretskii <eliz@gnu.org>:

> I don't think I follow.  We call code_convert_string_norecord, which
> invokes code_convert_string with NOCOPY set to 'false'.  So all those
> users should NOT receive the same string as the argument, and I don't
> believe they expect that and can cope with it.

Actually they can, as far as I can tell. Have a look yourself.

> I don't think I understand your line of reasoning here.  I don't think
> GC is relevant, and as long as we are talking about file names, the
> first null byte terminates it even though the Lisp string's length
> could be larger.

It is stated as a reason in Fexpand_file_name for working on copies of strings; see comments therein. But that is not really important in itself.

>> Given the limited scope of the change, would you agree to a backport of that to emacs-27?
> 
> That'd be a mistake, I think.  My reasoning goes like this: If I'm
> right that this bug was never seen, fixing it on emacs-27 will have no
> visible effect; and if I'm wrong, then we will break the release
> branch.  The danger of breakage in the latter case is much more severe
> than the gain from the fix in the former case.

We do fix clear bugs on emacs-27 even when nobody complained about them, but you are right that it's not that important in this case. Let's leave it on master.

> I hope you now agree with me that we should not do this.  The default
> should stay NOCOPY = false, and any caller that wants otherwise must
> explicitly request that by calling code_convert_string.

I disagree -- if the callers handle the situation safely, there is no reason not to to do the change, saving some consing. We do this sort of code improvement all the time; nothing special about this one.

Of course, if you prefer the scenic route, we could add {en,de}code_file_nocopy and replace {EN,DE}CODE_FILE calls one by one until they all are done, and arrive at essentially the same code.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
  2020-04-05 15:03                   ` Mattias Engdegård
@ 2020-04-05 15:35                     ` Mattias Engdegård
  2020-04-05 15:56                       ` Eli Zaretskii
  2020-04-05 16:00                     ` Eli Zaretskii
  1 sibling, 1 reply; 42+ messages in thread
From: Mattias Engdegård @ 2020-04-05 15:35 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 40407

5 apr. 2020 kl. 17.03 skrev Mattias Engdegård <mattiase@acm.org>:

> Actually they can, as far as I can tell. Have a look yourself.

To clarify, calls to {EN,DE}CODE_FILE are probably safe by design since they already return their argument in several cases. Some calls to code_convert_string_norecord are not safe because they are used with auto lisp strings; I'm going through them to find out just which ones. This can be done piecemeal.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
  2020-04-05 15:35                     ` Mattias Engdegård
@ 2020-04-05 15:56                       ` Eli Zaretskii
  2020-04-06 18:13                         ` Mattias Engdegård
  0 siblings, 1 reply; 42+ messages in thread
From: Eli Zaretskii @ 2020-04-05 15:56 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 40407

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Sun, 5 Apr 2020 17:35:55 +0200
> Cc: 40407@debbugs.gnu.org
> 
> 5 apr. 2020 kl. 17.03 skrev Mattias Engdegård <mattiase@acm.org>:
> 
> > Actually they can, as far as I can tell. Have a look yourself.
> 
> To clarify, calls to {EN,DE}CODE_FILE are probably safe by design since they already return their argument in several cases.

In theory, yes.  In practice, not really.  The cases where they return
their argument are those when we didn't yet set up any encoding for
file names.  This happens only very early into startup, and frankly,
we have nothing else to do at that point.

Once we do set up default-file-name-coding-system, these macros will
never return their argument (unless someone forcefully sets the
encoding to nil, in which case they deserve what they get).  Do you
agree?

> Some calls to code_convert_string_norecord are not safe because they are used with auto lisp strings; I'm going through them to find out just which ones. This can be done piecemeal.

I'm okay with making the safe cases faster, but we'd need to clearly
comment each one, because later changes might make them unsafe.  Any
code that uses the un-encoded string after encoding it, or the
un-decoded string after decoding it, could become broken if these two
are the same string.

And let's please keep in mind that on most modern platforms in most
use cases default-file-name-coding-system is utf-8, so encoding is
fast, and thus we don't need to go overboard here.  IOW, if there's a
doubt, there's no doubt.

Thanks.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
  2020-04-05 15:56                       ` Eli Zaretskii
@ 2020-04-06 18:13                         ` Mattias Engdegård
  0 siblings, 0 replies; 42+ messages in thread
From: Mattias Engdegård @ 2020-04-06 18:13 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 40407

[-- Attachment #1: Type: text/plain, Size: 1119 bytes --]

5 apr. 2020 kl. 17.56 skrev Eli Zaretskii <eliz@gnu.org>:

> Once we do set up default-file-name-coding-system, these macros will
> never return their argument (unless someone forcefully sets the
> encoding to nil, in which case they deserve what they get).  Do you
> agree?

Thank you, and yes, I do agree partly: ENCODE_FILE is the identity for all unibyte strings no matter the coding system in use.

However, my point (which I didn't do a very good job explaining) was that if either ENCODE_FILE or DECODE_FILE are called with the assumption that they return a new string, that is at least a latent bug.

Thus I went through them all once again, and found a few questionable calls that I'd like to fix. They rely on Fexpand_file_name returning a new string, which may or may not be true now but we would be better without such assumptions. (I also stumbled on a potential GC-related bug.) Patch attached!

With these fixed, nothing prevents those two functions from using no-copy semantics. I agree this approach is better and safer than going straight for code_convert_string_norecord in one pass.


[-- Attachment #2: 0001-Don-t-rely-on-copying-in-EN-DE-CODE_FILE.patch --]
[-- Type: application/octet-stream, Size: 2496 bytes --]

From ff62a3874890810823f79dac1273ebdd214ba529 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mattias=20Engdeg=C3=A5rd?= <mattiase@acm.org>
Date: Mon, 6 Apr 2020 15:20:08 +0200
Subject: [PATCH] Don't rely on copying in {EN,DE}CODE_FILE

Callers of ENCODE_FILE and DECODE_FILE should not assume that these
functions always return a new string (bug#40407).

* src/w32fns.c (Fw32_shell_execute):
* src/w32proc.c (Fw32_application_type):
Sink taking the address of a Lisp string past GC points.
Copy values returned from ENCODE_FILE before mutating them.
---
 src/w32fns.c  | 4 ++--
 src/w32proc.c | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/w32fns.c b/src/w32fns.c
index 9bb4e27b01..8d714f0b8d 100644
--- a/src/w32fns.c
+++ b/src/w32fns.c
@@ -8258,7 +8258,6 @@ parameters (e.g., \"printto\" requires the printer address).  Otherwise,
   /* Encode filename, current directory and parameters.  */
   current_dir = GUI_ENCODE_FILE (current_dir);
   document = GUI_ENCODE_FILE (document);
-  doc_w = GUI_SDATA (document);
   if (STRINGP (parameters))
     {
       parameters = GUI_ENCODE_SYSTEM (parameters);
@@ -8269,6 +8268,7 @@ parameters (e.g., \"printto\" requires the printer address).  Otherwise,
       operation = GUI_ENCODE_SYSTEM (operation);
       ops_w = GUI_SDATA (operation);
     }
+  doc_w = GUI_SDATA (document);
   result = (intptr_t) ShellExecuteW (NULL, ops_w, doc_w, params_w,
 				     GUI_SDATA (current_dir),
 				     (FIXNUMP (show_flag)
@@ -8353,7 +8353,7 @@ parameters (e.g., \"printto\" requires the printer address).  Otherwise,
   handler = Ffind_file_name_handler (absdoc, Qfile_exists_p);
   if (NILP (handler))
     {
-      Lisp_Object absdoc_encoded = ENCODE_FILE (absdoc);
+      Lisp_Object absdoc_encoded = Fcopy_sequence (ENCODE_FILE (absdoc));
 
       if (faccessat (AT_FDCWD, SSDATA (absdoc_encoded), F_OK, AT_EACCESS) == 0)
 	{
diff --git a/src/w32proc.c b/src/w32proc.c
index de33726905..16e32e4c58 100644
--- a/src/w32proc.c
+++ b/src/w32proc.c
@@ -3231,7 +3231,7 @@ DEFUN ("w32-application-type", Fw32_application_type,
   char *progname, progname_a[MAX_PATH];
 
   program = Fexpand_file_name (program, Qnil);
-  encoded_progname = ENCODE_FILE (program);
+  encoded_progname = Fcopy_sequence (ENCODE_FILE (program));
   progname = SSDATA (encoded_progname);
   unixtodos_filename (progname);
   filename_to_ansi (progname, progname_a);
-- 
2.21.1 (Apple Git-122.3)


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
  2020-04-05 15:03                   ` Mattias Engdegård
  2020-04-05 15:35                     ` Mattias Engdegård
@ 2020-04-05 16:00                     ` Eli Zaretskii
  1 sibling, 0 replies; 42+ messages in thread
From: Eli Zaretskii @ 2020-04-05 16:00 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 40407

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Sun, 5 Apr 2020 17:03:49 +0200
> Cc: 40407@debbugs.gnu.org
> 
> > I don't think I understand your line of reasoning here.  I don't think
> > GC is relevant, and as long as we are talking about file names, the
> > first null byte terminates it even though the Lisp string's length
> > could be larger.
> 
> It is stated as a reason in Fexpand_file_name for working on copies
> of strings; see comments therein.

That refers to code that keeps C pointers into string text.  This is
not our case here: we are talking about Lisp strings, not C pointers
into them.

> > I hope you now agree with me that we should not do this.  The default
> > should stay NOCOPY = false, and any caller that wants otherwise must
> > explicitly request that by calling code_convert_string.
> 
> I disagree -- if the callers handle the situation safely, there is no reason not to to do the change, saving some consing. We do this sort of code improvement all the time; nothing special about this one.

"If the callers handle the situation safely" is not a trivial
condition.  The programmer will more often than not be unaware of this
subtlety, and may not write such safe code.  Moreover, the callers
will have to handle it safely in the future, or be sure to insist on a
copy if not, and these two macros don't give the callers knobs to
request that.





^ permalink raw reply	[flat|nested] 42+ messages in thread

* bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
  2020-04-03 16:24 ` Eli Zaretskii
  2020-04-03 22:32   ` Mattias Engdegård
@ 2020-04-06 10:10   ` OGAWA Hirofumi
  2020-04-06 14:21     ` Eli Zaretskii
  1 sibling, 1 reply; 42+ messages in thread
From: OGAWA Hirofumi @ 2020-04-06 10:10 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 40407, Mattias Engdegård

Eli Zaretskii <eliz@gnu.org> writes:

>> -  if (BUFFERP (dst_object))
>> +  if (EQ (dst_object, Qt))
>> +    {
>> +      /* Fast path for ASCII-only input and an ASCII-compatible coding:
>> +         act as identity.  */
>> +      Lisp_Object attrs = CODING_ID_ATTRS (coding.id);
>> +      if (! NILP (CODING_ATTR_ASCII_COMPAT (attrs))
>> +          && (STRING_MULTIBYTE (string)
>> +              ? (chars == bytes) : string_ascii_p (string)))
>> +        return string;

While using the latest master branch, I noticed this became the cause of
decoding error.

The simple reproducible test is,

	(decode-coding-string "&abc" 'utf-7-imap)
        => "&abc"

like the above result, decoding utf-7-imap didn't work.

Because (coding-system-get 'utf-7-imap :ascii-compatible-p) => t.

I'm not sure, 'utf-7* should be fixed as non ascii-compatible, or
string_ascii_p() should check more strictly.

[BTW,

(define-coding-system 'utf-7-imap
  "UTF-7 encoding of Unicode, IMAP version (RFC 2060)"
  :coding-type 'utf-8
  :mnemonic ?u
  :charset-list '(unicode)
  :ascii-compatible-p nil		;; <=== added this line
  :pre-write-conversion 'utf-7-imap-pre-write-conversion
  :post-read-conversion 'utf-7-imap-post-read-conversion)

doesn't work. Because define-coding-system-internal overwrites
ascii-compatible-p.]

Thanks.
-- 
OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>





^ permalink raw reply	[flat|nested] 42+ messages in thread

* bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
  2020-04-06 10:10   ` OGAWA Hirofumi
@ 2020-04-06 14:21     ` Eli Zaretskii
  2020-04-06 15:56       ` Mattias Engdegård
  2020-04-16 13:11       ` handa
  0 siblings, 2 replies; 42+ messages in thread
From: Eli Zaretskii @ 2020-04-06 14:21 UTC (permalink / raw)
  To: OGAWA Hirofumi, Kenichi Handa; +Cc: 40407, mattiase

> From: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
> Cc: Mattias Engdegård <mattiase@acm.org>,
>         40407@debbugs.gnu.org
> Date: Mon, 06 Apr 2020 19:10:48 +0900
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> 
> >> -  if (BUFFERP (dst_object))
> >> +  if (EQ (dst_object, Qt))
> >> +    {
> >> +      /* Fast path for ASCII-only input and an ASCII-compatible coding:
> >> +         act as identity.  */
> >> +      Lisp_Object attrs = CODING_ID_ATTRS (coding.id);
> >> +      if (! NILP (CODING_ATTR_ASCII_COMPAT (attrs))
> >> +          && (STRING_MULTIBYTE (string)
> >> +              ? (chars == bytes) : string_ascii_p (string)))
> >> +        return string;
> 
> While using the latest master branch, I noticed this became the cause of
> decoding error.
> 
> The simple reproducible test is,
> 
> 	(decode-coding-string "&abc" 'utf-7-imap)
>         => "&abc"
> 
> like the above result, decoding utf-7-imap didn't work.
> 
> Because (coding-system-get 'utf-7-imap :ascii-compatible-p) => t.

Thanks.

> I'm not sure, 'utf-7* should be fixed as non ascii-compatible, or
> string_ascii_p() should check more strictly.

The former, since UTF-7 is definitely *not* ASCII-compatible.  Does
the patch below produce good results?

Kenichi, why was coding-type of UTF-7 systems set to 'utf-8'?
Wouldn't it be better to set it to 'utf-16'?  Or is there some
subtlety here that we should be aware of?  Do you have any comments on
the patch below?

Thanks.

diff --git a/src/coding.c b/src/coding.c
index 97a6eb9..71ff93c 100644
--- a/src/coding.c
+++ b/src/coding.c
@@ -11301,7 +11301,10 @@ DEFUN ("define-coding-system-internal", Fdefine_coding_system_internal,
 	  CHECK_CODING_SYSTEM (val);
 	}
       ASET (attrs, coding_attr_utf_bom, bom);
-      if (NILP (bom))
+      if (NILP (bom)
+	  /* UTF-7 has :coding-type set to 'utf-8' (why not
+	     'utf-16'?), but it is definitely NOT ASCII-compatible.  */
+	  && !EQ (name, Qutf_7) && !EQ (name, Qutf_7_imap))
 	ASET (attrs, coding_attr_ascii_compat, Qt);
 
       category = (CONSP (bom) ? coding_category_utf_8_auto
@@ -11673,6 +11676,9 @@ syms_of_coding (void)
   DEFSYM (Qutf_8_unix, "utf-8-unix");
   DEFSYM (Qutf_8_emacs, "utf-8-emacs");
 
+  DEFSYM (Qutf_7, "utf-7");
+  DEFSYM (Qutf_7_imap, "utf-7-imap");
+
 #if defined (WINDOWSNT) || defined (CYGWIN)
   /* No, not utf-16-le: that one has a BOM.  */
   DEFSYM (Qutf_16le, "utf-16le");





^ permalink raw reply related	[flat|nested] 42+ messages in thread

* bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
  2020-04-06 14:21     ` Eli Zaretskii
@ 2020-04-06 15:56       ` Mattias Engdegård
  2020-04-06 16:33         ` Eli Zaretskii
  2020-04-16 13:11       ` handa
  1 sibling, 1 reply; 42+ messages in thread
From: Mattias Engdegård @ 2020-04-06 15:56 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 40407, OGAWA Hirofumi

[-- Attachment #1: Type: text/plain, Size: 743 bytes --]

6 apr. 2020 kl. 16.21 skrev Eli Zaretskii <eliz@gnu.org>:

> Kenichi, why was coding-type of UTF-7 systems set to 'utf-8'?
> Wouldn't it be better to set it to 'utf-16'?  Or is there some
> subtlety here that we should be aware of?  Do you have any comments on
> the patch below?

There is no reason why utf-7[-imap] should have utf-8 as coding-type, is there? utf-16 is definitely wrong (utf-7* are encoded in ASCII). What about the patch below instead?

By the way, there appears to be another, unrelated bug in utf-7-imap: According to RFC 2060, all C0 controls are base64-encoded, but in Emacs some of them are passed through unchanged (CR, LF and TAB). This is permitted by plain UTF-7 (RFC 1642) but not in the IMAP variant.


[-- Attachment #2: utf-7.diff --]
[-- Type: application/octet-stream, Size: 2474 bytes --]

diff --git a/lisp/international/mule-conf.el b/lisp/international/mule-conf.el
index e6e6135243..c5cfbaeb87 100644
--- a/lisp/international/mule-conf.el
+++ b/lisp/international/mule-conf.el
@@ -1511,20 +1511,25 @@ 'iso-safe
 
 (define-coding-system 'utf-7
   "UTF-7 encoding of Unicode (RFC 2152)."
-  :coding-type 'utf-8
+  :coding-type 'charset
+  :charset-list '(ascii)
   :mnemonic ?U
   :mime-charset 'utf-7
-  :charset-list '(unicode)
   :pre-write-conversion 'utf-7-pre-write-conversion
   :post-read-conversion 'utf-7-post-read-conversion)
+;; Having `ascii' in :charset-list automatically sets :ascii-compatible-p,
+;; but UTF-7 is not ASCII compatible; disable.
+(coding-system-put 'utf-7 :ascii-compatible-p nil)
 
 (define-coding-system 'utf-7-imap
   "UTF-7 encoding of Unicode, IMAP version (RFC 2060)"
-  :coding-type 'utf-8
+  :coding-type 'charset
+  :charset-list '(ascii)
   :mnemonic ?u
-  :charset-list '(unicode)
   :pre-write-conversion 'utf-7-imap-pre-write-conversion
   :post-read-conversion 'utf-7-imap-post-read-conversion)
+;; See comment for utf-7 above.
+(coding-system-put 'utf-7-imap :ascii-compatible-p nil)
 
 ;; Use us-ascii for terminal output if some other coding system is not
 ;; specified explicitly.
diff --git a/test/lisp/international/mule-tests.el b/test/lisp/international/mule-tests.el
index 91e3c2279f..b5fbb4ab8e 100644
--- a/test/lisp/international/mule-tests.el
+++ b/test/lisp/international/mule-tests.el
@@ -48,6 +48,19 @@ mule-cmds--test-universal-coding-system-argument
                         (append (kbd "C-x RET c u t f - 8 RET C-u C-u c a b RET") nil)))
                    (read-string "prompt:")))))
 
+(ert-deftest mule-utf-7 ()
+  ;; utf-7 and utf-7-imap are not ASCII-compatible.
+  (should-not (coding-system-get 'utf-7 :ascii-compatible-p))
+  (should-not (coding-system-get 'utf-7-imap :ascii-compatible-p))
+  ;; Invariant ASCII subset.
+  (let ((s (apply #'string (append (number-sequence #x20 #x25)
+                                   (number-sequence #x27 #x7e)))))
+    (should (equal (encode-coding-string s 'utf-7-imap) s))
+    (should (equal (decode-coding-string s 'utf-7-imap) s)))
+  ;; Escaped ampersand.
+  (should (equal (encode-coding-string "a&bcd" 'utf-7-imap) "a&-bcd"))
+  (should (equal (decode-coding-string "a&-bcd" 'utf-7-imap) "a&bcd")))
+
 ;; Stop "Local Variables" above causing confusion when visiting this file.
 \f
 

[-- Attachment #3: Type: text/plain, Size: 2 bytes --]




^ permalink raw reply related	[flat|nested] 42+ messages in thread

* bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
  2020-04-06 15:56       ` Mattias Engdegård
@ 2020-04-06 16:33         ` Eli Zaretskii
  2020-04-06 16:55           ` Mattias Engdegård
  0 siblings, 1 reply; 42+ messages in thread
From: Eli Zaretskii @ 2020-04-06 16:33 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 40407, hirofumi

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Mon, 6 Apr 2020 17:56:27 +0200
> Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>,
>         Kenichi Handa <handa@gnu.org>, 40407@debbugs.gnu.org
> 
> > Kenichi, why was coding-type of UTF-7 systems set to 'utf-8'?
> > Wouldn't it be better to set it to 'utf-16'?  Or is there some
> > subtlety here that we should be aware of?  Do you have any comments on
> > the patch below?
> 
> There is no reason why utf-7[-imap] should have utf-8 as coding-type, is there?

I think it might be just some convenience thing: utf-7 and utf-8 have
something in common that made it convenient to treat them the same in
the internal routines.  Or maybe it's just an accident.

> utf-16 is definitely wrong (utf-7* are encoded in ASCII).

Why do you think the ASCII encoding contradicts the utf-16
coding-type?

> What about the patch below instead?

I don't think 'charset' is the right type for this encoding (any
reason why you've chosen it?), but I will let Handa-san comment.
Defining coding-systems is a black art which I don't think I ever
mastered.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
  2020-04-06 16:33         ` Eli Zaretskii
@ 2020-04-06 16:55           ` Mattias Engdegård
  2020-04-06 17:18             ` Eli Zaretskii
  0 siblings, 1 reply; 42+ messages in thread
From: Mattias Engdegård @ 2020-04-06 16:55 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 40407, hirofumi

6 apr. 2020 kl. 18.33 skrev Eli Zaretskii <eliz@gnu.org>:

> I think it might be just some convenience thing: utf-7 and utf-8 have
> something in common that made it convenient to treat them the same in
> the internal routines.  Or maybe it's just an accident.

There is nothing common between utf-7 and utf-8 at all (apart from a subset of ASCII being encoded in the same way, and the fact that both encode the Unicode repertoire).

> Why do you think the ASCII encoding contradicts the utf-16
> coding-type?

Because :coding type is the first stage of decoding, or the last stage of encoding. It reflects the low-level structure of the encoded data: using utf-16 as :coding-type implies that utf-7 is encoded into 16-bit parcels, but it's not -- the result of utf-7-imap encoding is a sequence of ASCII bytes. (UTF-16 plays a part in an intermediary step for some values before they are base64-encoded, but that's not visible in the final byte stream.)

> I don't think 'charset' is the right type for this encoding (any
> reason why you've chosen it?), but I will let Handa-san comment.

We could use 'raw-text' as well but that implies that any byte value could be part of an utf-7[-imap] text, which is incorrect.
In fact, utf-7-imap only uses codes 0x20-0x7e (utf-7 is allowed to use a few C0 controls too, as mentioned).

Arguably the heuristics of define-coding-system-internal are somewhat inscrutable. There seems to be leaks between layers -- ascii-compatible-p is an end-to-end property and cannot really be set the way it is by that function. But since it is, fixing it afterwards should be the correct way.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
  2020-04-06 16:55           ` Mattias Engdegård
@ 2020-04-06 17:18             ` Eli Zaretskii
  2020-04-06 17:49               ` Mattias Engdegård
  0 siblings, 1 reply; 42+ messages in thread
From: Eli Zaretskii @ 2020-04-06 17:18 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 40407, hirofumi

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Mon, 6 Apr 2020 18:55:30 +0200
> Cc: hirofumi@mail.parknet.co.jp, handa@gnu.org, 40407@debbugs.gnu.org
> 
> 6 apr. 2020 kl. 18.33 skrev Eli Zaretskii <eliz@gnu.org>:
> 
> > I think it might be just some convenience thing: utf-7 and utf-8 have
> > something in common that made it convenient to treat them the same in
> > the internal routines.  Or maybe it's just an accident.
> 
> There is nothing common between utf-7 and utf-8 at all (apart from a subset of ASCII being encoded in the same way, and the fact that both encode the Unicode repertoire).

By "in common" in this context I meant from the POV of internal
treating of the two encodings.

> > I don't think 'charset' is the right type for this encoding (any
> > reason why you've chosen it?), but I will let Handa-san comment.
> 
> We could use 'raw-text' as well but that implies that any byte value could be part of an utf-7[-imap] text, which is incorrect.
> In fact, utf-7-imap only uses codes 0x20-0x7e (utf-7 is allowed to use a few C0 controls too, as mentioned).
> 
> Arguably the heuristics of define-coding-system-internal are somewhat inscrutable. There seems to be leaks between layers -- ascii-compatible-p is an end-to-end property and cannot really be set the way it is by that function. But since it is, fixing it afterwards should be the correct way.

I prefer to wait for Handa-san's response, and meanwhile install the
least disruptive change, which just fixes the one aspect that got
broken.  Call me a coward, if you wish.





^ permalink raw reply	[flat|nested] 42+ messages in thread

* bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
  2020-04-06 17:18             ` Eli Zaretskii
@ 2020-04-06 17:49               ` Mattias Engdegård
  2020-04-06 18:20                 ` Eli Zaretskii
  0 siblings, 1 reply; 42+ messages in thread
From: Mattias Engdegård @ 2020-04-06 17:49 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 40407, hirofumi

6 apr. 2020 kl. 19.18 skrev Eli Zaretskii <eliz@gnu.org>:

> By "in common" in this context I meant from the POV of internal
> treating of the two encodings.

So did I.

> I prefer to wait for Handa-san's response, and meanwhile install the
> least disruptive change, which just fixes the one aspect that got
> broken.  Call me a coward, if you wish.

If so, the least disruptive change by far would be the two calls to coding-system-put in my patch (along with the tests, of course).






^ permalink raw reply	[flat|nested] 42+ messages in thread

* bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
  2020-04-06 17:49               ` Mattias Engdegård
@ 2020-04-06 18:20                 ` Eli Zaretskii
  2020-04-06 18:34                   ` OGAWA Hirofumi
  0 siblings, 1 reply; 42+ messages in thread
From: Eli Zaretskii @ 2020-04-06 18:20 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 40407, hirofumi

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Mon, 6 Apr 2020 19:49:19 +0200
> Cc: hirofumi@mail.parknet.co.jp, handa@gnu.org, 40407@debbugs.gnu.org
> 
> > I prefer to wait for Handa-san's response, and meanwhile install the
> > least disruptive change, which just fixes the one aspect that got
> > broken.  Call me a coward, if you wish.
> 
> If so, the least disruptive change by far would be the two calls to coding-system-put in my patch (along with the tests, of course).

Fine with me, but please say something about why we use 'put' instead
of specifying :ascii-compatible-p directly in the coding-system's
definition, and also add a FIXME there, since I hope this is not the
final word on that matter.

Thanks.





^ permalink raw reply	[flat|nested] 42+ messages in thread

* bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
  2020-04-06 18:20                 ` Eli Zaretskii
@ 2020-04-06 18:34                   ` OGAWA Hirofumi
  2020-04-06 21:57                     ` Mattias Engdegård
  2020-04-09 11:03                     ` Mattias Engdegård
  0 siblings, 2 replies; 42+ messages in thread
From: OGAWA Hirofumi @ 2020-04-06 18:34 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 40407, Mattias Engdegård

Eli Zaretskii <eliz@gnu.org> writes:

>> > I prefer to wait for Handa-san's response, and meanwhile install the
>> > least disruptive change, which just fixes the one aspect that got
>> > broken.  Call me a coward, if you wish.
>> 
>> If so, the least disruptive change by far would be the two calls to coding-system-put in my patch (along with the tests, of course).
>
> Fine with me, but please say something about why we use 'put' instead
> of specifying :ascii-compatible-p directly in the coding-system's
> definition, and also add a FIXME there, since I hope this is not the
> final word on that matter.

BTW, 

Mattias Engdegård <mattiase@acm.org> writes:

>  (define-coding-system 'utf-7-imap
>    "UTF-7 encoding of Unicode, IMAP version (RFC 2060)"
> -  :coding-type 'utf-8
> +  :coding-type 'charset
> +  :charset-list '(ascii)
>    :mnemonic ?u
> -  :charset-list '(unicode)
>    :pre-write-conversion 'utf-7-imap-pre-write-conversion
>    :post-read-conversion 'utf-7-imap-post-read-conversion)
> +;; See comment for utf-7 above.
> +(coding-system-put 'utf-7-imap :ascii-compatible-p nil)

(check-coding-systems-region "あ" nil '(utf-7-imap))
=> ((utf-7-imap 0))

It says "cannot encodable by utf-7-imap", so looks like ":charset-list
'(ascii)" doesn't work at least.

Thanks.
-- 
OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>





^ permalink raw reply	[flat|nested] 42+ messages in thread

* bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
  2020-04-06 18:34                   ` OGAWA Hirofumi
@ 2020-04-06 21:57                     ` Mattias Engdegård
  2020-04-09 11:03                     ` Mattias Engdegård
  1 sibling, 0 replies; 42+ messages in thread
From: Mattias Engdegård @ 2020-04-06 21:57 UTC (permalink / raw)
  To: OGAWA Hirofumi; +Cc: 40407

6 apr. 2020 kl. 20.34 skrev OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>:

> (check-coding-systems-region "あ" nil '(utf-7-imap))
> => ((utf-7-imap 0))
> 
> It says "cannot encodable by utf-7-imap", so looks like ":charset-list
> '(ascii)" doesn't work at least.

Thanks for catching that! The documentation doesn't explain the role of :charset-list for a :coding-type of 'utf-8, but a minimal patch seemed to work. Your example has been added to the test.






^ permalink raw reply	[flat|nested] 42+ messages in thread

* bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
  2020-04-06 18:34                   ` OGAWA Hirofumi
  2020-04-06 21:57                     ` Mattias Engdegård
@ 2020-04-09 11:03                     ` Mattias Engdegård
  2020-04-09 14:09                       ` Kazuhiro Ito
  2020-04-11 15:09                       ` Mattias Engdegård
  1 sibling, 2 replies; 42+ messages in thread
From: Mattias Engdegård @ 2020-04-09 11:03 UTC (permalink / raw)
  To: OGAWA Hirofumi; +Cc: 40407

Eli, thank you very much for thinking about EOL conversion; obviously I didn't. I fixed a couple of glitches: a variable was used uninitialised, the logic didn't quite work for both unibyte and multibyte, and unless I'm mistaken it's LF that we should look for when encoding, not CR? Anyway, hope you don't mind.

The alternative would be to skip the no-conversion shortcut whenever EOL conversion applies, but memchr is quite fast and it's all likely to be in D1$ by that time.

I also need to thank Ogawa-san again for drawing my attention to check-coding-systems-region which crashes Emacs (SIGABRT) if given an invalid encoding; fixed.

An exhaustive search for other encodings that erroneously were marked as ASCII compatible found only one (chinese-hz); now fixed.

The calls to {ENCODE,DECODE}_FILE messily mutating the returned string have now also been fixed, along with a potential GC bug.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
  2020-04-09 11:03                     ` Mattias Engdegård
@ 2020-04-09 14:09                       ` Kazuhiro Ito
  2020-04-09 14:22                         ` Mattias Engdegård
  2020-04-11 15:09                       ` Mattias Engdegård
  1 sibling, 1 reply; 42+ messages in thread
From: Kazuhiro Ito @ 2020-04-09 14:09 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 40407

On Thu, 09 Apr 2020 20:03:51 +0900,
Mattias Engdegård wrote:
> 
> Eli, thank you very much for thinking about EOL conversion;
> obviously I didn't. I fixed a couple of glitches: a variable was
> used uninitialised, the logic didn't quite work for both unibyte and
> multibyte, and unless I'm mistaken it's LF that we should look for
> when encoding, not CR? Anyway, hope you don't mind.

I noticed that last-coding-system-used was not set when the fast path
was used.

(let ((string "ABCD\r\nEFGH")
      inhibit-eol-conversion)
  (decode-coding-string string 'raw-text-dos)
  (decode-coding-string string 'raw-text-unix)
  last-coding-system-used)

-> raw-text-dos

-- 
Kazuhiro Ito





^ permalink raw reply	[flat|nested] 42+ messages in thread

* bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
  2020-04-09 14:09                       ` Kazuhiro Ito
@ 2020-04-09 14:22                         ` Mattias Engdegård
  0 siblings, 0 replies; 42+ messages in thread
From: Mattias Engdegård @ 2020-04-09 14:22 UTC (permalink / raw)
  To: Kazuhiro Ito; +Cc: 40407

9 apr. 2020 kl. 16.09 skrev Kazuhiro Ito <kzhr@d1.dion.ne.jp>:

> I noticed that last-coding-system-used was not set when the fast path
> was used.

A keen observation -- thank you! Now fixed.






^ permalink raw reply	[flat|nested] 42+ messages in thread

* bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
  2020-04-09 11:03                     ` Mattias Engdegård
  2020-04-09 14:09                       ` Kazuhiro Ito
@ 2020-04-11 15:09                       ` Mattias Engdegård
  1 sibling, 0 replies; 42+ messages in thread
From: Mattias Engdegård @ 2020-04-11 15:09 UTC (permalink / raw)
  To: 40407-done; +Cc: OGAWA Hirofumi

I think we are done here -- now that all calls to ENCODE_FILE and DECODE_FILE have been checked to be safe for no-copy semantics, there is no need to copy in the ASCII identity case; pushed to master.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
  2020-04-06 14:21     ` Eli Zaretskii
  2020-04-06 15:56       ` Mattias Engdegård
@ 2020-04-16 13:11       ` handa
  2020-04-16 13:44         ` Eli Zaretskii
  1 sibling, 1 reply; 42+ messages in thread
From: handa @ 2020-04-16 13:11 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 40407, mattiase, hirofumi

In article <83wo6sr0s2.fsf@gnu.org>, Eli Zaretskii <eliz@gnu.org> writes:

> > > I don't think 'charset' is the right type for this encoding (any
> > > reason why you've chosen it?), but I will let Handa-san comment.
> > 
> > We could use 'raw-text' as well but that implies that any byte value could be part of an utf-7[-imap] text, which is incorrect.
> > In fact, utf-7-imap only uses codes 0x20-0x7e (utf-7 is allowed to use a few C0 controls too, as mentioned).

I don't remember why utf-7 has coding type utf-8.  As main
decoding/encoding routines of utf-7 are by Lisp (in utf-7.el which was
contributed not by me), perhaps, any other ASCII transparent types was
ok.  It seems that we should introduce a new type for such a coding
system.

> > Arguably the heuristics of define-coding-system-internal are somewhat inscrutable. There seems to be leaks between layers -- ascii-compatible-p is an end-to-end property and cannot really be set the way it is by that function. But since it is, fixing it afterwards should be the correct way.

> I prefer to wait for Handa-san's response, and meanwhile install the
> least disruptive change, which just fixes the one aspect that got
> broken.  Call me a coward, if you wish.

I think Mattias' patch is good.

---
K. Handa
handa@gnu.org





^ permalink raw reply	[flat|nested] 42+ messages in thread

* bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
  2020-04-16 13:11       ` handa
@ 2020-04-16 13:44         ` Eli Zaretskii
  2020-04-16 13:59           ` Mattias Engdegård
  0 siblings, 1 reply; 42+ messages in thread
From: Eli Zaretskii @ 2020-04-16 13:44 UTC (permalink / raw)
  To: handa; +Cc: 40407, mattiase, hirofumi

> From: handa <handa@gnu.org>
> Cc: hirofumi@mail.parknet.co.jp, mattiase@acm.org, 40407@debbugs.gnu.org,
>  handa@gnu.org
> Date: Thu, 16 Apr 2020 22:11:37 +0900
> 
> > I prefer to wait for Handa-san's response, and meanwhile install the
> > least disruptive change, which just fixes the one aspect that got
> > broken.  Call me a coward, if you wish.
> 
> I think Mattias' patch is good.

Including making the coding-type of utf-7 'charset'?  I think that
didn't work, see an earlier message in this discussion:

  https://debbugs.gnu.org/cgi/bugreport.cgi?bug=40407#107

And could you please tell more about the conditions for a
coding-system to be a candidate for 'charset' coding-type? what
exactly is such a coding-system supposed to do/provide/support?  I
don't think this is documented anywhere, and most/all of the
coding-systems that have this type are simple single-byte encodings
that support just 256 codepoints (which is not what utf-7 is).

Thanks.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE
  2020-04-16 13:44         ` Eli Zaretskii
@ 2020-04-16 13:59           ` Mattias Engdegård
  0 siblings, 0 replies; 42+ messages in thread
From: Mattias Engdegård @ 2020-04-16 13:59 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 40407, hirofumi

16 apr. 2020 kl. 15.44 skrev Eli Zaretskii <eliz@gnu.org>:

> Including making the coding-type of utf-7 'charset'?  I think that
> didn't work, see an earlier message in this discussion:
> 
>  https://debbugs.gnu.org/cgi/bugreport.cgi?bug=40407#107

Indeed, 'charset' was probably not the right choice. Perhaps we should revisit the decision to set :ascii-compatible-p automatically, since it is an end-to-end property that depends on the semantics of all parts in the coding chain, including the provided conversion functions and translation tables. The caller is in a much better position to know whether that property should be set.

^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2020-04-16 13:59 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2020-04-03 14:18 bug#40407: [PATCH] slow ENCODE_FILE and DECODE_FILE Mattias Engdegård
2020-04-03 16:24 ` Eli Zaretskii
2020-04-03 22:32   ` Mattias Engdegård
2020-04-04  9:26     ` Eli Zaretskii
2020-04-04 16:41       ` Mattias Engdegård
2020-04-04 17:22         ` Eli Zaretskii
2020-04-04 17:37           ` Eli Zaretskii
2020-04-04 18:06             ` Mattias Engdegård
2020-04-05  2:37               ` Eli Zaretskii
2020-04-05  3:42                 ` Eli Zaretskii
2020-04-05 10:14           ` Mattias Engdegård
2020-04-05 13:28             ` Eli Zaretskii
2020-04-05 13:40               ` Mattias Engdegård
2020-04-04 10:26     ` Eli Zaretskii
2020-04-04 16:55       ` Mattias Engdegård
2020-04-04 17:04         ` Eli Zaretskii
2020-04-04 18:01           ` Mattias Engdegård
2020-04-04 18:25             ` Eli Zaretskii
2020-04-05 10:48               ` Mattias Engdegård
2020-04-05 13:39                 ` Eli Zaretskii
2020-04-05 15:03                   ` Mattias Engdegård
2020-04-05 15:35                     ` Mattias Engdegård
2020-04-05 15:56                       ` Eli Zaretskii
2020-04-06 18:13                         ` Mattias Engdegård
2020-04-05 16:00                     ` Eli Zaretskii
2020-04-06 10:10   ` OGAWA Hirofumi
2020-04-06 14:21     ` Eli Zaretskii
2020-04-06 15:56       ` Mattias Engdegård
2020-04-06 16:33         ` Eli Zaretskii
2020-04-06 16:55           ` Mattias Engdegård
2020-04-06 17:18             ` Eli Zaretskii
2020-04-06 17:49               ` Mattias Engdegård
2020-04-06 18:20                 ` Eli Zaretskii
2020-04-06 18:34                   ` OGAWA Hirofumi
2020-04-06 21:57                     ` Mattias Engdegård
2020-04-09 11:03                     ` Mattias Engdegård
2020-04-09 14:09                       ` Kazuhiro Ito
2020-04-09 14:22                         ` Mattias Engdegård
2020-04-11 15:09                       ` Mattias Engdegård
2020-04-16 13:11       ` handa
2020-04-16 13:44         ` Eli Zaretskii
2020-04-16 13:59           ` Mattias Engdegård

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).