Re: [Emacs-diffs] emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer

unofficial mirror of emacs-devel@gnu.org 

* Re: [Emacs-diffs] emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
       [not found] ` <20190525191040.CCD6C207F5@vcs0.savannah.gnu.org>
@ 2019-05-25 19:41   ` Stefan Monnier
  2019-05-25 19:59     ` Eli Zaretskii
  2019-05-27  9:47   ` Robert Pluim
  1 sibling, 1 reply; 50+ messages in thread
From: Stefan Monnier @ 2019-05-25 19:41 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

> +length of the decoded text.  If that buffer is a unibyte buffer
> +(@pxref{Selecting a Representations}), the internal representation of
> +the decoded text (@pxref{Text Representations}) is inserted into the
> +buffer as individual bytes.

If the decoded char is a byte between 128-255, is it inserted as
a single byte or as the two-byte sequence used internally for those
"eight-bit" chars?


        Stefan




^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [Emacs-diffs] emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
  2019-05-25 19:41   ` [Emacs-diffs] emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer Stefan Monnier
@ 2019-05-25 19:59     ` Eli Zaretskii
  2019-05-25 20:15       ` Eli Zaretskii
  2019-05-25 21:11       ` Stefan Monnier
  0 siblings, 2 replies; 50+ messages in thread
From: Eli Zaretskii @ 2019-05-25 19:59 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: emacs-devel@gnu.org
> Date: Sat, 25 May 2019 15:41:46 -0400
> 
> > +length of the decoded text.  If that buffer is a unibyte buffer
> > +(@pxref{Selecting a Representations}), the internal representation of
> > +the decoded text (@pxref{Text Representations}) is inserted into the
> > +buffer as individual bytes.
> 
> If the decoded char is a byte between 128-255, is it inserted as
> a single byte or as the two-byte sequence used internally for those
> "eight-bit" chars?

The internal representation of the decoded text could include both.
If some of the bytes in the original byte stream couldn't be decoded
using the specified coding-system, they will be represented as raw
bytes, using 2-byte sequences.  OTOH, Latin characters successfully
decoded into codepoints less than 256 will take 1 byte.

Again, this is just the internal representation of what was decoded.




* Re: [Emacs-diffs] emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
  2019-05-25 19:59     ` Eli Zaretskii
@ 2019-05-25 20:15       ` Eli Zaretskii
  2019-05-25 21:11       ` Stefan Monnier
  1 sibling, 0 replies; 50+ messages in thread
From: Eli Zaretskii @ 2019-05-25 20:15 UTC (permalink / raw)
  To: monnier; +Cc: emacs-devel

> Date: Sat, 25 May 2019 22:59:02 +0300
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: emacs-devel@gnu.org
> 
> The internal representation of the decoded text could include both.
> If some of the bytes in the original byte stream couldn't be decoded
> using the specified coding-system, they will be represented as raw
> bytes, using 2-byte sequences.  OTOH, Latin characters successfully
> decoded into codepoints less than 256 will take 1 byte.
                          ^^^^^^^^^^^^^
Oops, I meant less than 128.  Characters between 128 and 255 will be
represented as 2-byte UTF-8 sequences, of course.

> Again, this is just the internal representation of what was decoded.

Right.
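
For anyone who wants to see those byte counts, here is a minimal sketch
that should be evaluable in a recent Emacs.  It relies only on the
built-ins `string', `string-bytes' and `unibyte-char-to-multibyte'; the
function name `my-internal-byte-counts' is made up for illustration and
is not part of the commit.

```elisp
(defun my-internal-byte-counts ()
  "Illustration only: bytes used internally by three kinds of characters."
  (let ((ascii    (string ?a))      ; codepoint below 128
        (latin    (string #xE9))    ; U+00E9, a character in 128..255
        ;; The raw byte #xE9, as left behind by a failed decode.
        (raw-byte (string (unibyte-char-to-multibyte #xE9))))
    (list (string-bytes ascii)       ; one byte
          (string-bytes latin)       ; 2-byte UTF-8 sequence
          (string-bytes raw-byte)))) ; 2-byte sequence for a raw byte

(my-internal-byte-counts)
;; => (1 2 2)
```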




* Re: [Emacs-diffs] emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
  2019-05-25 19:59     ` Eli Zaretskii
  2019-05-25 20:15       ` Eli Zaretskii
@ 2019-05-25 21:11       ` Stefan Monnier
  2019-05-25 21:27         ` Stefan Monnier
  2019-05-26  2:37         ` Eli Zaretskii
  1 sibling, 2 replies; 50+ messages in thread
From: Stefan Monnier @ 2019-05-25 21:11 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

> The internal representation of the decoded text could include both.
> If some of the bytes in the original byte stream couldn't be decoded
> using the specified coding-system, they will be represented as raw
> bytes, using 2-byte sequences.  OTOH, Latin characters successfully
> decoded into codepoints less than 256 will take 1 byte.
> Again, this is just the internal representation of what was decoded.

Great, thanks.

But now I wonder, what can we do with this representation.
I guess set-buffer-multibyte will convert it to the intended chars, but
that bugs the question "why bother deciding into the unibyte buffer and
call set-buffer-multibyte afterwards rather than do the reverse"?
Anything else we can do with it?


        Stefan





* Re: [Emacs-diffs] emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
  2019-05-25 21:11       ` Stefan Monnier
@ 2019-05-25 21:27         ` Stefan Monnier
  2019-05-26  2:37         ` Eli Zaretskii
  1 sibling, 0 replies; 50+ messages in thread
From: Stefan Monnier @ 2019-05-25 21:27 UTC (permalink / raw)
  To: emacs-devel

> But now I wonder, what can we do with this representation.
> I guess set-buffer-multibyte will convert it to the intended chars, but
> that bugs the question "why bother deciding into the unibyte buffer and
       ^^^^                          ^^^^^^^^
       begs                          decoding

The first typo is probably telling of my state of mind, tho :-)


        Stefan





* Re: [Emacs-diffs] emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
  2019-05-25 21:11       ` Stefan Monnier
  2019-05-25 21:27         ` Stefan Monnier
@ 2019-05-26  2:37         ` Eli Zaretskii
  1 sibling, 0 replies; 50+ messages in thread
From: Eli Zaretskii @ 2019-05-26  2:37 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: emacs-devel@gnu.org
> Date: Sat, 25 May 2019 17:11:00 -0400
> 
> But now I wonder, what can we do with this representation.
> I guess set-buffer-multibyte will convert it to the intended chars, but
> that bugs the question "why bother deciding into the unibyte buffer and
> call set-buffer-multibyte afterwards rather than do the reverse"?
> Anything else we can do with it?

I don't know.  It isn't a usual thing to do, to be sure.  But it isn't
nonsensical, either.




* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
       [not found] ` <20190525191040.CCD6C207F5@vcs0.savannah.gnu.org>
  2019-05-25 19:41   ` [Emacs-diffs] emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer Stefan Monnier
@ 2019-05-27  9:47   ` Robert Pluim
  2019-05-27 12:24     ` Stefan Monnier
  2019-05-27 16:40     ` Eli Zaretskii
  1 sibling, 2 replies; 50+ messages in thread
From: Robert Pluim @ 2019-05-27  9:47 UTC (permalink / raw)
  To: emacs-devel; +Cc: Eli Zaretskii

>>>>> On Sat, 25 May 2019 15:10:40 -0400 (EDT), eliz@gnu.org (Eli Zaretskii) said:

    Eli> branch: emacs-26
    Eli> commit 8f18d121210aa27dc05555140ab21a8489f0de50
    Eli> Author: Eli Zaretskii <eliz@gnu.org>
    Eli> Commit: Eli Zaretskii <eliz@gnu.org>

    Eli>     Improve documentation of decoding into a unibyte buffer
    
    Eli>     * doc/lispref/nonascii.texi (Explicit Encoding): Document what
    Eli>     happens when DESTINATION of decoding is a unibyte buffer.
    
    Eli>     * src/coding.c (Fdecode_coding_region)
    Eli>     (Fdecode_coding_string): Document what happens if DESTINATION
    Eli>     is a unibyte buffer.

A related issue: C-h f string-as-unibyte

    string-as-unibyte is a built-in function in `src/fns.c'.

    (string-as-unibyte STRING)

      This function is obsolete since 26.1;
      use `encode-coding-string'.
      Probably introduced at or before Emacs version 20.3.
      This function does not change global state, including the match data.

Having trawled through the elisp manual, for the life of me itʼs not
clear which coding system I should use. 'raw-text'? 'us-ascii'?
Something Else?

Robert




* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
  2019-05-27  9:47   ` Robert Pluim
@ 2019-05-27 12:24     ` Stefan Monnier
  2019-05-27 13:02       ` Robert Pluim
  2019-05-27 16:42       ` Eli Zaretskii
  2019-05-27 16:40     ` Eli Zaretskii
  1 sibling, 2 replies; 50+ messages in thread
From: Stefan Monnier @ 2019-05-27 12:24 UTC (permalink / raw)
  To: emacs-devel

> A related issue: C-h f string-as-unibyte
>
>     string-as-unibyte is a built-in function in `src/fns.c'.
>
>     (string-as-unibyte STRING)
>
>       This function is obsolete since 26.1;
>       use `encode-coding-string'.
>       Probably introduced at or before Emacs version 20.3.
>       This function does not change global state, including the match data.
>
> Having trawled through the elisp manual, for the life of me itʼs not
> clear which coding system I should use. 'raw-text'? 'us-ascii'?
> Something Else?

The coding that most closely corresponds to what string-as-unibyte does
is `emacs-internal`.  In 90% of the cases, it's not what you want, tho
because the code shouldn't have used string-as-unibyte in the
first place, so you'll need to find out what the code *really* needs.


        Stefan





* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
  2019-05-27 12:24     ` Stefan Monnier
@ 2019-05-27 13:02       ` Robert Pluim
  2019-05-27 13:32         ` Stefan Monnier
  2019-05-27 16:43         ` Eli Zaretskii
  2019-05-27 16:42       ` Eli Zaretskii
  1 sibling, 2 replies; 50+ messages in thread
From: Robert Pluim @ 2019-05-27 13:02 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: emacs-devel

>>>>> On Mon, 27 May 2019 08:24:46 -0400, Stefan Monnier <monnier@iro.umontreal.ca> said:

    >> A related issue: C-h f string-as-unibyte
    >> 
    >> string-as-unibyte is a built-in function in `src/fns.c'.
    >> 
    >> (string-as-unibyte STRING)
    >> 
    >> This function is obsolete since 26.1;
    >> use `encode-coding-string'.
    >> Probably introduced at or before Emacs version 20.3.
    >> This function does not change global state, including the match data.
    >> 
    >> Having trawled through the elisp manual, for the life of me itʼs not
    >> clear which coding system I should use. 'raw-text'? 'us-ascii'?
    >> Something Else?

    Stefan> The coding that most closely corresponds to what string-as-unibyte does
    Stefan> is `emacs-internal`.  In 90% of the cases, it's not what you want, tho
    Stefan> because the code shouldn't have used string-as-unibyte in the
    Stefan> first place, so you'll need to find out what the code *really* needs.

Almost all uses of string-as-unibyte are gone now, but the one I was
looking at is this one in international/mule-cmds.el:

    (defun encoded-string-description (str coding-system)
      "Return a pretty description of STR that is encoded by CODING-SYSTEM."
      (setq str (string-as-unibyte str))
      (mapconcat
       (if (and coding-system (eq (coding-system-type coding-system) 'iso-2022))
           ;; Try to get a pretty description for ISO 2022 escape sequences.
           (function (lambda (x) (or (cdr (assq x iso-2022-control-alist))
                                     (format "#x%02X" x))))
         (function (lambda (x) (format "#x%02X" x))))
       str " "))

If I take a string of say "β", and replace string-as-unibyte with
(encode-coding-string 'emacs-internal), `encoded-string-description'
prints "#xCE #xB2", which is the correct UTF-8 encoded
value. 'raw-text works too. Iʼm certain that there are subtle
differences between the two that I donʼt understand.
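
The experiment can be reproduced outside mule-cmds.el with a small
sketch like the one below.  The helper `my-hex-bytes' is a hypothetical
name that only mirrors the formatting lambda of
`encoded-string-description'; "β" is built from its codepoint so that
the result does not depend on the coding system of the file.

```elisp
(defun my-hex-bytes (unibyte-str)
  "Return the bytes of UNIBYTE-STR as space-separated hex numbers."
  (mapconcat (lambda (byte) (format "#x%02X" byte)) unibyte-str " "))

(let ((beta (string #x3B2)))            ; "β" (U+03B2)
  (list (my-hex-bytes (encode-coding-string beta 'emacs-internal))
        (my-hex-bytes (encode-coding-string beta 'utf-8-emacs-unix))))
;; => ("#xCE #xB2" "#xCE #xB2")
```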

Robert




* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
  2019-05-27 13:02       ` Robert Pluim
@ 2019-05-27 13:32         ` Stefan Monnier
  2019-05-27 13:49           ` Robert Pluim
  2019-05-27 16:51           ` Eli Zaretskii
  2019-05-27 16:43         ` Eli Zaretskii
  1 sibling, 2 replies; 50+ messages in thread
From: Stefan Monnier @ 2019-05-27 13:32 UTC (permalink / raw)
  To: emacs-devel

> Almost all uses of string-as-unibyte are gone now, but the one I was
> looking at is this one in international/mule-cmds.el:
>
>     (defun encoded-string-description (str coding-system)
>       "Return a pretty description of STR that is encoded by CODING-SYSTEM."
>       (setq str (string-as-unibyte str))
>       (mapconcat
>        (if (and coding-system (eq (coding-system-type coding-system) 'iso-2022))
>            ;; Try to get a pretty description for ISO 2022 escape sequences.
>            (function (lambda (x) (or (cdr (assq x iso-2022-control-alist))
>                                      (format "#x%02X" x))))
>          (function (lambda (x) (format "#x%02X" x))))
>        str " "))
>
> If I take a string of say "β", and replace string-as-unibyte with
> (encode-coding-string 'emacs-internal), `encoded-string-description'
> prints "#xCE #xB2", which is the correct UTF-8 encoded
> value. 'raw-text works too. Iʼm certain that there are subtle
> differences between the two that I donʼt understand.

But "β" is not a "STR that is encoded by CODING-SYSTEM", so this output
is neither correct nor incorrect in any case.

I think the right thing to do here is one of:
- signal an error if `str` is multibyte.
- signal an error if `str` is multibyte and contains non-byte chars.
- if multibyte, encode `str` with `coding-system`.
- just don't bother looking at whether `str` is unibyte or not, just
  pass it as is to `mapconcat`.
- just don't bother looking at whether `str` is unibyte or not, just
  pass it as is to `mapconcat` but in the lambda, do catch the case
  where `x` is an "eight bit raw-byte char" and if so pass it to
  multibyte-char-to-unibyte.
- ...

But encoding `str` with any coding system like raw-text or
emacs-internal doesn't seem to make much sense.
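
As a rough sketch of the last concrete option in the list (pass the
string through unchanged and handle raw-byte chars in the lambda),
something like the following might do.  The name `my-describe-bytes' is
hypothetical, the ISO 2022 branch is left out for brevity, and the range
test assumes that raw bytes occupy codepoints #x3FFF80..#x3FFFFF
internally.

```elisp
(defun my-describe-bytes (str)
  "Hypothetical sketch: describe the elements of STR as hex numbers.
STR is passed to `mapconcat' as is, whether unibyte or multibyte."
  (mapconcat
   (lambda (x)
     (format "#x%02X"
             ;; Raw-byte characters live in #x3FFF80..#x3FFFFF; map
             ;; them back to the byte they stand for.
             (if (<= #x3FFF80 x #x3FFFFF)
                 (multibyte-char-to-unibyte x)
               x)))
   str " "))

(my-describe-bytes "\316\262")                       ; unibyte input
(my-describe-bytes (string-to-multibyte "\316\262")) ; raw-byte chars
;; both => "#xCE #xB2"
```

A genuinely multibyte character such as "β" would come out as its
codepoint ("#x3B2") under this sketch, which may or may not be what a
caller wants.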


        Stefan






* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
  2019-05-27 13:32         ` Stefan Monnier
@ 2019-05-27 13:49           ` Robert Pluim
  2019-05-27 16:53             ` Eli Zaretskii
  2019-05-28  3:08             ` Stefan Monnier
  2019-05-27 16:51           ` Eli Zaretskii
  1 sibling, 2 replies; 50+ messages in thread
From: Robert Pluim @ 2019-05-27 13:49 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: emacs-devel

>>>>> On Mon, 27 May 2019 09:32:11 -0400, Stefan Monnier <monnier@iro.umontreal.ca> said:
    >> If I take a string of say "β", and replace string-as-unibyte with
    >> (encode-coding-string 'emacs-internal), `encoded-string-description'
    >> prints "#xCE #xB2", which is the correct UTF-8 encoded
    >> value. 'raw-text works too. Iʼm certain that there are subtle
    >> differences between the two that I donʼt understand.

    Stefan> But "β" is not a "STR that is encoded by CODING-SYSTEM", so this output
    Stefan> is neither correct nor incorrect in any case.

It matches the current output of encoded-string-description, though.

    Stefan> I think the right thing to do here is one of:
    Stefan> - signal an error if `str` is multibyte.
    Stefan> - signal an error if `str` is multibyte and contains non-byte chars.
    Stefan> - if multibyte, encode `str` with `coding-system`.
    Stefan> - just don't bother looking at whether `str` is unibyte or not, just
    Stefan>   pass it as is to `mapconcat`.
    Stefan> - just don't bother looking at whether `str` is unibyte or not, just
    Stefan>   pass it as is to `mapconcat` but in the lambda, do catch the case
    Stefan>   where `x` is an "eight bit raw-byte char" and if so pass it to
    Stefan>   multibyte-char-to-unibyte.
    Stefan> - ...

Since this is the underlying code that displays the 'buffer code'
section of 'C-u C-x =', I donʼt think barfing on multibyte is the
right thing to do. Nor is passing it on as is.

    Stefan> But encoding `str` with any coding system like raw-text or
    Stefan> emacs-internal doesn't seem to make much sense.

Then what is the correct way to say 'give me the raw byte version
of this character'? (or maybe we should just let sleeping encodings
lie :-) )

Robert




* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
  2019-05-27  9:47   ` Robert Pluim
  2019-05-27 12:24     ` Stefan Monnier
@ 2019-05-27 16:40     ` Eli Zaretskii
  2019-05-27 20:17       ` Richard Stallman
  1 sibling, 1 reply; 50+ messages in thread
From: Eli Zaretskii @ 2019-05-27 16:40 UTC (permalink / raw)
  To: Robert Pluim; +Cc: emacs-devel

> From: Robert Pluim <rpluim@gmail.com>
> Cc: Eli Zaretskii <eliz@gnu.org>
> Date: Mon, 27 May 2019 11:47:35 +0200
> 
>     string-as-unibyte is a built-in function in `src/fns.c'.
> 
>     (string-as-unibyte STRING)
> 
>       This function is obsolete since 26.1;
>       use `encode-coding-string'.
>       Probably introduced at or before Emacs version 20.3.
>       This function does not change global state, including the match data.
> 
> Having trawled through the elisp manual, for the life of me itʼs not
> clear which coding system I should use. 'raw-text'? 'us-ascii'?
> Something Else?

You should use utf-8-emacs-unix.




* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
  2019-05-27 12:24     ` Stefan Monnier
  2019-05-27 13:02       ` Robert Pluim
@ 2019-05-27 16:42       ` Eli Zaretskii
  2019-05-27 19:13         ` Stefan Monnier
  1 sibling, 1 reply; 50+ messages in thread
From: Eli Zaretskii @ 2019-05-27 16:42 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Date: Mon, 27 May 2019 08:24:46 -0400
> 
> >     (string-as-unibyte STRING)
> >
> >       This function is obsolete since 26.1;
> >       use `encode-coding-string'.
> >       Probably introduced at or before Emacs version 20.3.
> >       This function does not change global state, including the match data.
> >
> > Having trawled through the elisp manual, for the life of me itʼs not
> > clear which coding system I should use. 'raw-text'? 'us-ascii'?
> > Something Else?
> 
> The coding that most closely corresponds to what string-as-unibyte does
> is `emacs-internal`.

Why "most closely"?  Maybe I'm missing something, but I think it
corresponds exactly.  (I actually prefer using utf-8-emacs-unix, since
emacs-internal is an alias, and deceptively doesn't mention its EOL
conversion.)




* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
  2019-05-27 13:02       ` Robert Pluim
  2019-05-27 13:32         ` Stefan Monnier
@ 2019-05-27 16:43         ` Eli Zaretskii
  1 sibling, 0 replies; 50+ messages in thread
From: Eli Zaretskii @ 2019-05-27 16:43 UTC (permalink / raw)
  To: Robert Pluim; +Cc: emacs-devel

> From: Robert Pluim <rpluim@gmail.com>
> Date: Mon, 27 May 2019 15:02:42 +0200
> Cc: emacs-devel@gnu.org
> 
> If I take a string of say "β", and replace string-as-unibyte with
> (encode-coding-string 'emacs-internal), `encoded-string-description'
> prints "#xCE #xB2", which is the correct UTF-8 encoded
> value.

emacs-internal produces the internal representation, which is a
superset of UTF-8.  It doesn't produce a 100% valid UTF-8.  That's an
important distinction that should be always kept in mind.




* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
  2019-05-27 13:32         ` Stefan Monnier
  2019-05-27 13:49           ` Robert Pluim
@ 2019-05-27 16:51           ` Eli Zaretskii
  2019-05-27 19:17             ` Stefan Monnier
  1 sibling, 1 reply; 50+ messages in thread
From: Eli Zaretskii @ 2019-05-27 16:51 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Date: Mon, 27 May 2019 09:32:11 -0400
> 
> encoding `str` with any coding system like raw-text or
> emacs-internal doesn't seem to make much sense.

For a multibyte string that was encoded already, encoding by
utf-8-emacs-unix is IMO the _only_ thing that makes sense.  That's how
you convert raw bytes in their internal representation to the
single-byte external representation you want to see in the output.

Also note that encoded-string-description supports codepoints outside
the Unicode space, and utf-8-emacs-unix is the only sane way to
produce the overlong UTF-8 sequences in those cases.
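
A minimal sketch of the raw-byte case, for illustration only: the
function name `my-raw-byte-roundtrip' is made up, and the result shown
is what the description above leads one to expect, not something taken
from the Emacs test suite.

```elisp
(defun my-raw-byte-roundtrip (byte)
  "Sketch: wrap BYTE (128..255) as a raw-byte char, then encode it.
Return the internal size and the list of encoded bytes."
  (let ((mb (string (unibyte-char-to-multibyte byte))))
    (list (string-bytes mb)   ; size of the internal representation
          ;; `append' with nil turns the unibyte result into a byte list.
          (append (encode-coding-string mb 'utf-8-emacs-unix) nil))))

(my-raw-byte-roundtrip #xE9)
;; expected: (2 (233)), i.e. two bytes internally, one byte #xE9 encoded
```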


* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
  2019-05-27 13:49           ` Robert Pluim
@ 2019-05-27 16:53             ` Eli Zaretskii
  2019-05-28  6:23               ` Robert Pluim
  2019-05-28  3:08             ` Stefan Monnier
  1 sibling, 1 reply; 50+ messages in thread
From: Eli Zaretskii @ 2019-05-27 16:53 UTC (permalink / raw)
  To: Robert Pluim; +Cc: emacs-devel

> From: Robert Pluim <rpluim@gmail.com>
> Date: Mon, 27 May 2019 15:49:50 +0200
> Cc: emacs-devel@gnu.org
> 
> Then what is the correct way to say 'give me the raw byte version
> of this character'?

I suspect that I already answered that question, but if not, please
explain what you mean by "the raw byte version of a character".




* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
  2019-05-27 16:42       ` Eli Zaretskii
@ 2019-05-27 19:13         ` Stefan Monnier
  0 siblings, 0 replies; 50+ messages in thread
From: Stefan Monnier @ 2019-05-27 19:13 UTC (permalink / raw)
  To: emacs-devel

> Why "most closely"?  Maybe I'm missing something, but I think it
> corresponds exactly.

I'm never sure if it corresponds 100%.

> (I actually prefer using utf-8-emacs-unix, since emacs-internal is an
> alias, and deceptively doesn't mention its EOL conversion.)

`emacs-internal` works both in Emacs-22 and Emacs-25 (and presumably in
Emacs-36 where we'll be using the new quantum-character encoding).
IOW it says precisely what we want: the internal encoding used in the
current version of Emacs.  Whereas utf-8-emacs-unix reads to me as "some
Emacs-specific encoding related to utf-8", so the intention is
less clear.

But either way works, it's a matter of taste,

        Stefan


* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
  2019-05-27 16:51           ` Eli Zaretskii
@ 2019-05-27 19:17             ` Stefan Monnier
  2019-05-28  2:30               ` Eli Zaretskii
  0 siblings, 1 reply; 50+ messages in thread
From: Stefan Monnier @ 2019-05-27 19:17 UTC (permalink / raw)
  To: emacs-devel

> For a multibyte string that was encoded already, encoding by
> utf-8-emacs-unix is IMO the _only_ thing that makes sense.

I disagree: a multibyte string that was encoded already should only
contain ASCII and "eight-bit raw byte chars" (and BTW, would be much
better represented as a unibyte string to start with) so it can be
converted with string-to-unibyte.


        Stefan





* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
  2019-05-27 16:40     ` Eli Zaretskii
@ 2019-05-27 20:17       ` Richard Stallman
  2019-05-28  2:36         ` Eli Zaretskii
  0 siblings, 1 reply; 50+ messages in thread
From: Richard Stallman @ 2019-05-27 20:17 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: rpluim, emacs-devel

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

  > > Having trawled through the elisp manual, for the life of me itʼs not
  > > clear which coding system I should use. 'raw-text'? 'us-ascii'?
  > > Something Else?

  > You should use utf-8-emacs-unix.

Have we got clear documentation of how raw-text and utf-8-emacs-unix differ
and when to use each one?

Does the Emacs Lisp manual refer to that documentation from the places
it would be useful to do so?

-- 
Dr Richard Stallman
President, Free Software Foundation (https://gnu.org, https://fsf.org)
Internet Hall-of-Famer (https://internethalloffame.org)


* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
  2019-05-27 19:17             ` Stefan Monnier
@ 2019-05-28  2:30               ` Eli Zaretskii
  2019-05-28  2:56                 ` Stefan Monnier
  0 siblings, 1 reply; 50+ messages in thread
From: Eli Zaretskii @ 2019-05-28  2:30 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Date: Mon, 27 May 2019 15:17:12 -0400
> 
> > For a multibyte string that was encoded already, encoding by
> > utf-8-emacs-unix is IMO the _only_ thing that makes sense.
> 
> I disagree: a multibyte string that was encoded already should only
> contain ASCII and "eight-bit raw byte chars"

But both ASCII and raw bytes have multibyte representation.  If the
destination of the encoding is a multibyte buffer, that is what you
get there, and then taking buffer-substring will give you a multibyte
string with encoded text.  I don't see why we shouldn't support this
scenario.




* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
  2019-05-27 20:17       ` Richard Stallman
@ 2019-05-28  2:36         ` Eli Zaretskii
  2019-05-28  7:06           ` Robert Pluim
  0 siblings, 1 reply; 50+ messages in thread
From: Eli Zaretskii @ 2019-05-28  2:36 UTC (permalink / raw)
  To: rms; +Cc: rpluim, emacs-devel

> From: Richard Stallman <rms@gnu.org>
> Cc: rpluim@gmail.com, emacs-devel@gnu.org
> Date: Mon, 27 May 2019 16:17:31 -0400
> 
>   > > Having trawled through the elisp manual, for the life of me itʼs not
>   > > clear which coding system I should use. 'raw-text'? 'us-ascii'?
>   > > Something Else?
> 
>   > You should use utf-8-emacs-unix.
> 
> Have we got clear documentation of how raw-text and utf-8-emacs-unix differ
> and when to use each one?
> 
> Does the Emacs Lisp manual refer to that documentation from the places
> it would be useful to do so?

We have these two documented.  I'm not the right person to say whether
it's clear enough; people should feel free to complain about unclear
aspects and missing references.




* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
  2019-05-28  2:30               ` Eli Zaretskii
@ 2019-05-28  2:56                 ` Stefan Monnier
  2019-05-28  4:17                   ` Eli Zaretskii
  0 siblings, 1 reply; 50+ messages in thread
From: Stefan Monnier @ 2019-05-28  2:56 UTC (permalink / raw)
  To: emacs-devel

>> > For a multibyte string that was encoded already, encoding by
>> > utf-8-emacs-unix is IMO the _only_ thing that makes sense.
>> I disagree: a multibyte string that was encoded already should only
>> contain ASCII and "eight-bit raw byte chars"
> But both ASCII and raw bytes have multibyte representation.

Not sure why you say "but" here: I was also talking about a multibyte
string, so I obviously agree.

> If the destination of the encoding is a multibyte buffer, that is what
> you get there, and then taking buffer-substring will give you
> a multibyte string with encoded text.  I don't see why we shouldn't
> support this scenario.

I'm not sure where you read that I was arguing we shouldn't support
this scenario.  I was just pointing out that utf-8-emacs-unix is not the
only thing that makes sense: string-to-unibyte should work just as well
(if not better).
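
Here is a small sketch comparing the two conversions on "already
encoded" text, assuming such text contains only ASCII and raw-byte
chars.  `my-same-unibyte-result-p' is a hypothetical helper name.

```elisp
(defun my-same-unibyte-result-p (str)
  "Sketch: non-nil if both conversions of STR yield the same bytes."
  (equal (string-to-unibyte str)
         (encode-coding-string str 'utf-8-emacs-unix)))

;; A multibyte string holding ASCII plus the raw bytes #xCE #xB2.
(my-same-unibyte-result-p
 (string ?a (unibyte-char-to-multibyte #xCE) (unibyte-char-to-multibyte #xB2)))

;; With a genuine non-ASCII character, `string-to-unibyte' signals an
;; error instead of returning a string.
(condition-case nil
    (string-to-unibyte (string #x3B2))
  (error 'cannot-convert))
```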


        Stefan





* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
  2019-05-27 13:49           ` Robert Pluim
  2019-05-27 16:53             ` Eli Zaretskii
@ 2019-05-28  3:08             ` Stefan Monnier
  2019-05-28  4:40               ` Eli Zaretskii
  1 sibling, 1 reply; 50+ messages in thread
From: Stefan Monnier @ 2019-05-28  3:08 UTC (permalink / raw)
  To: emacs-devel

> Since this is the underlying code that displays the 'buffer code'
> section of 'C-u C-x =', I donʼt think barfing on multibyte is the
> right thing to do. Nor is passing it on as is.

grep gives me:

    % grep --color -nH --null -e encoded-string-description **/*.el
    descr-text.el\0350:	  (encoded-string-description encoded coding)))))
    descr-text.el\0645:                    (encoded-string-description
    descr-text.el\0655:                           (list (encoded-string-description encoded coding)
    international/mule-cmds.el\02900:(defun encoded-string-description (str coding-system)
    simple.el\01481:				(encoded-string-description encoded coding)))

and according to my quick investigation, all callers pass a string
that's coming straight from encode-coding-string or encode-coding-char,
so the argument should *always* be unibyte, AFAICT.

Hence according to my reading of the code, this call to
string-as-unibyte will always just return its argument unchanged.
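
The `encode-coding-string' half of that claim is easy to spot-check
with a sketch like this one (the helper name is made up, and it says
nothing about `encode-coding-char'):

```elisp
(defun my-encoded-is-unibyte-p (str coding)
  "Sketch: non-nil if encoding STR with CODING yields a unibyte string."
  (not (multibyte-string-p (encode-coding-string str coding))))

(my-encoded-is-unibyte-p (string #x3B2) 'utf-8)      ; "β" as UTF-8
(my-encoded-is-unibyte-p (string #x3B2) 'iso-8859-7) ; "β" as Greek ISO
;; both => t
```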


        Stefan





* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
  2019-05-28  2:56                 ` Stefan Monnier
@ 2019-05-28  4:17                   ` Eli Zaretskii
  2019-05-28  6:21                     ` Robert Pluim
  2019-05-28 11:54                     ` Stefan Monnier
  0 siblings, 2 replies; 50+ messages in thread
From: Eli Zaretskii @ 2019-05-28  4:17 UTC (permalink / raw)
  To: emacs-devel, Stefan Monnier

On May 28, 2019 5:56:18 AM GMT+03:00, Stefan Monnier <monnier@iro.umontreal.ca> wrote:
> >> > For a multibyte string that was encoded already, encoding by
> >> > utf-8-emacs-unix is IMO the _only_ thing that makes sense.
> >> I disagree: a multibyte string that was encoded already should only
> >> contain ASCII and "eight-bit raw byte chars"
> > But both ASCII and raw bytes have multibyte representation.
> 
> Not sure why you say "but" here: I was also talking about a multibyte
> string, so I obviously agree.
> 
> > If the destination of the encoding is a multibyte buffer, that is what
> > you get there, and then taking buffer-substring will give you
> > a multibyte string with encoded text.  I don't see why we shouldn't
> > support this scenario.
> 
> I'm not sure where you read that I was arguing we shouldn't support
> this scenario.  I was just pointing out that utf-8-emacs-unix is not the
> only thing that makes sense: string-to-unibyte should work just as well
> (if not better).
> 
> 
>         Stefan

string-to-unibyte is also marked obsolete.  Robert was asking how to use encoding functions instead of the obsolete string-as-unibyte; advising to use yet another obsolete function makes little sense to me.




* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
  2019-05-28  3:08             ` Stefan Monnier
@ 2019-05-28  4:40               ` Eli Zaretskii
  2019-05-28 11:55                 ` Stefan Monnier
  0 siblings, 1 reply; 50+ messages in thread
From: Eli Zaretskii @ 2019-05-28  4:40 UTC (permalink / raw)
  To: emacs-devel, Stefan Monnier

On May 28, 2019 6:08:36 AM GMT+03:00, Stefan Monnier <monnier@iro.umontreal.ca> wrote:

> and according to my quick investigation, all callers pass a string
> that's coming straight from encode-coding-string or encode-coding-char,
> so the argument should *always* be unibyte, AFAICT.
> 
> Hence according to my reading of the code, this call to
> string-as-unibyte will always just return its argument unchanged.


That's not entirely true, because encode-coding-char can return a multibyte string.  Depending on what you wanted to suggest based on your conclusion, that factoid may or may not be important.




* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
  2019-05-28  4:17                   ` Eli Zaretskii
@ 2019-05-28  6:21                     ` Robert Pluim
  2019-05-28 11:53                       ` Stefan Monnier
  2019-05-28 11:54                     ` Stefan Monnier
  1 sibling, 1 reply; 50+ messages in thread
From: Robert Pluim @ 2019-05-28  6:21 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Stefan Monnier, emacs-devel

>>>>> On Tue, 28 May 2019 07:17:35 +0300, Eli Zaretskii <eliz@gnu.org> said:

    >> I'm not sure where you read that I was arguing we shouldn't support
    >> this scenario.  I was just pointing out that utf-8-emacs-unix is not the
    >> only thing that makes sense: string-to-unibyte should work just as well
    >> (if not better).
    >> 
    >> 
    >> Stefan

    Eli> string-to-unibyte is also marked obsolete.  Robert was asking how to
    Eli> use encoding functions instead of the obsolete string-as-unibyte;
    Eli> advising to use yet another obsolete function makes little sense to
    Eli> me.

Plus calling string-to-unibyte on a multi-byte string will signal an
error.

(encode-coding-string str 'emacs-internal)

seems to be what I want, as Eli pointed out upthread.
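A minimal sketch of that call, for illustration only: the helper name `my-internal-bytes' is hypothetical, and the example assumes that `emacs-internal' is available as an alias of utf-8-emacs-unix.

```elisp
;; Hypothetical helper: return the bytes of TEXT's internal
;; representation as a list of integers.
(defun my-internal-bytes (text)
  "Encode TEXT with `emacs-internal' and return its bytes as a list."
  (append (encode-coding-string text 'emacs-internal) nil))

;; (string #xe9) is the one-character multibyte string "é".
(my-internal-bytes (string #xe9))   ; => (195 169)

;; The encoded result is a unibyte string.
(multibyte-string-p (encode-coding-string (string #xe9) 'emacs-internal))
;; => nil
```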

Robert




* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
  2019-05-27 16:53             ` Eli Zaretskii
@ 2019-05-28  6:23               ` Robert Pluim
  2019-05-28 14:57                 ` Eli Zaretskii
  0 siblings, 1 reply; 50+ messages in thread
From: Robert Pluim @ 2019-05-28  6:23 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

>>>>> On Mon, 27 May 2019 19:53:18 +0300, Eli Zaretskii <eliz@gnu.org> said:

    >> From: Robert Pluim <rpluim@gmail.com>
    >> Date: Mon, 27 May 2019 15:49:50 +0200
    >> Cc: emacs-devel@gnu.org
    >> 
    >> Then what is the correct way to say 'give me the raw byte version
    >> of this character'?

    Eli> I suspect that I already answered that question, but if not, please
    Eli> explain what you mean by "the raw byte version of a character".

Yes, you did. 'the bytes corresponding to the internal representation'
is what I was looking for.

(this whole area is pretty easy to get confused about, especially with
the various *-unibyte functions. Perhaps I should just pretend they
donʼt exist).

Robert




* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
  2019-05-28  2:36         ` Eli Zaretskii
@ 2019-05-28  7:06           ` Robert Pluim
  2019-05-28 14:59             ` Eli Zaretskii
  0 siblings, 1 reply; 50+ messages in thread
From: Robert Pluim @ 2019-05-28  7:06 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: rms, emacs-devel

>>>>> On Tue, 28 May 2019 05:36:42 +0300, Eli Zaretskii <eliz@gnu.org> said:

    >> From: Richard Stallman <rms@gnu.org>
    >> Cc: rpluim@gmail.com, emacs-devel@gnu.org
    >> Date: Mon, 27 May 2019 16:17:31 -0400
    >> 
    >> > > Having trawled through the elisp manual, for the life of me itʼs not
    >> > > clear which coding system I should use. 'raw-text'? 'us-ascii'?
    >> > > Something Else?
    >> 
    >> > You should use utf-8-emacs-unix.
    >> 
    >> Have we got clear documentation of how raw-text and utf-8-emacs-unix differ
    >> and when to use each one?
    >> 
    >> Does the Emacs Lisp manual refer to that documentation from the places
    >> it would be useful to do so?

    Eli> We have these two documented.  I'm not the right person to say whether
    Eli> it's clear enough, people should feel free to complain about unclear
    Eli> aspects and missing references.

Itʼs clear enough to me now. The only thing that trips me up is this
kind of phrase (from (info "(elisp)Coding System Basics")):

    When you use ‘raw-text’ to encode multibyte text, it does perform
    one character code conversion: it converts eight-bit characters to
    their single-byte external representation.

because I always forget that eight-bit characters have a multi-byte
representation in the emacs-internal coding system.
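A small sketch of that point, with a hypothetical helper name (`my-raw-text-demo'): a multibyte string holding an eight-bit character occupies two bytes internally for that character, while encoding with raw-text yields the single external byte.

```elisp
;; Hypothetical helper, for illustration only.
(defun my-raw-text-demo ()
  "Return (CHARS INTERNAL-BYTES ENCODED-BYTES) for \"a\" plus raw byte 255."
  (let ((raw (string-to-multibyte (unibyte-string ?a #xff))))
    (list (length raw)                  ; number of characters
          (string-bytes raw)            ; bytes in the internal representation
          ;; raw-text converts the eight-bit character back to one byte.
          (append (encode-coding-string raw 'raw-text) nil))))

(my-raw-text-demo)   ; => (2 3 (97 255))
```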

Robert




* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
  2019-05-28  6:21                     ` Robert Pluim
@ 2019-05-28 11:53                       ` Stefan Monnier
  0 siblings, 0 replies; 50+ messages in thread
From: Stefan Monnier @ 2019-05-28 11:53 UTC (permalink / raw)
  To: emacs-devel

> Plus calling string-to-unibyte on a multi-byte string will signal an
> error.

Only if that string was not encoded.  Which means it's good because it
will tell you when you called encoded-string-description incorrectly
rather than silently outputting garbage.
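To make the distinction concrete, here is a hedged sketch; `my-try-string-to-unibyte' is a hypothetical wrapper used only to capture the error.

```elisp
;; Hypothetical wrapper: report success or failure instead of signaling.
(defun my-try-string-to-unibyte (s)
  "Return (ok . BYTES) if S converts to unibyte, else (error . MESSAGE)."
  (condition-case err
      (cons 'ok (string-to-unibyte s))
    (error (cons 'error (error-message-string err)))))

;; An "encoded" multibyte string: only ASCII and eight-bit raw bytes.
(car (my-try-string-to-unibyte
      (string-to-multibyte (unibyte-string #xc3 #xa9))))   ; => ok

;; A decoded multibyte string holding the character "é" is rejected.
(car (my-try-string-to-unibyte (string #xe9)))             ; => error
```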


        Stefan





* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
  2019-05-28  4:17                   ` Eli Zaretskii
  2019-05-28  6:21                     ` Robert Pluim
@ 2019-05-28 11:54                     ` Stefan Monnier
  2019-05-28 15:11                       ` Eli Zaretskii
  1 sibling, 1 reply; 50+ messages in thread
From: Stefan Monnier @ 2019-05-28 11:54 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

> string-to-unibyte is also marked obsolete.

Not any more:

    ;; We used to declare string-to-unibyte obsolete, but it is a valid
    ;; (make-obsolete 'string-to-unibyte   "use `encode-coding-string'." "26.1")


        Stefan





* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
  2019-05-28  4:40               ` Eli Zaretskii
@ 2019-05-28 11:55                 ` Stefan Monnier
  2019-05-28 15:18                   ` Eli Zaretskii
  0 siblings, 1 reply; 50+ messages in thread
From: Stefan Monnier @ 2019-05-28 11:55 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

>> and according to my quick investigation, all callers pass a string
>> that's coming straight from encode-coding-string or encode-coding-char,
>> so the argument should *always* be unibyte, AFAICT.
>> 
>> Hence according to my reading of the code, this call to
>> string-as-unibyte will always just return its argument unchanged.
> That's not entirely true, because encode-coding-char can return a multibyte
> string.

That's weird.  When would that happen?


        Stefan





* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
  2019-05-28  6:23               ` Robert Pluim
@ 2019-05-28 14:57                 ` Eli Zaretskii
  0 siblings, 0 replies; 50+ messages in thread
From: Eli Zaretskii @ 2019-05-28 14:57 UTC (permalink / raw)
  To: Robert Pluim; +Cc: emacs-devel

> From: Robert Pluim <rpluim@gmail.com>
> Cc: emacs-devel@gnu.org
> Date: Tue, 28 May 2019 08:23:37 +0200
> 
> (this whole area is pretty easy to get confused about, especially with
> the various *-unibyte functions. Perhaps I should just pretend they
> donʼt exist).

Welcome to the club.  I myself _never_ remember what each one of them
does, so every time they pop up in discussions or in code, I need to
look up their code to figure that out anew.




* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
  2019-05-28  7:06           ` Robert Pluim
@ 2019-05-28 14:59             ` Eli Zaretskii
  2019-05-28 15:11               ` Robert Pluim
  0 siblings, 1 reply; 50+ messages in thread
From: Eli Zaretskii @ 2019-05-28 14:59 UTC (permalink / raw)
  To: Robert Pluim; +Cc: emacs-devel

> From: Robert Pluim <rpluim@gmail.com>
> Cc: rms@gnu.org,  emacs-devel@gnu.org
> Date: Tue, 28 May 2019 09:06:37 +0200
> 
> Itʼs clear enough to me now. The only thing that trips me up is this
> kind of phrase (from (info "(elisp)Coding System Basics")
> 
>     When you use ‘raw-text’ to encode multibyte text, it does perform
>     one character code conversion: it converts eight-bit characters to
>     their single-byte external representation.
> 
> because I always forget that eight-bit characters have a multi-byte
> representation in the emacs-internal coding system.

So why does that trip you up?  It reminds you of what you tend to forget,
so it's good documentation, right?




* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
  2019-05-28 14:59             ` Eli Zaretskii
@ 2019-05-28 15:11               ` Robert Pluim
  0 siblings, 0 replies; 50+ messages in thread
From: Robert Pluim @ 2019-05-28 15:11 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

>>>>> On Tue, 28 May 2019 17:59:39 +0300, Eli Zaretskii <eliz@gnu.org> said:

    >> From: Robert Pluim <rpluim@gmail.com>
    >> Cc: rms@gnu.org,  emacs-devel@gnu.org
    >> Date: Tue, 28 May 2019 09:06:37 +0200
    >> 
    >> Itʼs clear enough to me now. The only thing that trips me up is this
    >> kind of phrase (from (info "(elisp)Coding System Basics")
    >> 
    >> When you use ‘raw-text’ to encode multibyte text, it does perform
    >> one character code conversion: it converts eight-bit characters to
    >> their single-byte external representation.
    >> 
    >> because I always forget that eight-bit characters have a multi-byte
    >> representation in the emacs-internal coding system.

    Eli> So why does that trip you?  It reminds you what you tend to forget, so
    Eli> it's a good documentation, right?

Itʼs just a moment of cognitive dissonance: "why does it convert
eight-bit characters? They're one byte in size! Oh actually they're
not". So yes, in that sense itʼs good, it prevents the propagation of
my misconception.

Robert




* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
  2019-05-28 11:54                     ` Stefan Monnier
@ 2019-05-28 15:11                       ` Eli Zaretskii
  2019-05-28 17:25                         ` Stefan Monnier
  0 siblings, 1 reply; 50+ messages in thread
From: Eli Zaretskii @ 2019-05-28 15:11 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: emacs-devel@gnu.org
> Date: Tue, 28 May 2019 07:54:41 -0400
> 
> > string-to-unibyte is also marked obsolete.
> 
> Not any more:
> 
>     ;; We used to declare string-to-unibyte obsolete, but it is a valid
>     ;; (make-obsolete 'string-to-unibyte   "use `encode-coding-string'." "26.1")

OK, but how does this affect the issue at hand?  We want to replace
string-as-unibyte, or remove it, and the obsolescence message says to
replace it with encode-coding-string.  So if we cannot remove it for
some reason, it makes more sense to use encode-coding-string.  Either
way, using string-to-unibyte instead sounds less desirable to me.




* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
  2019-05-28 11:55                 ` Stefan Monnier
@ 2019-05-28 15:18                   ` Eli Zaretskii
  2019-05-28 17:43                     ` Stefan Monnier
  0 siblings, 1 reply; 50+ messages in thread
From: Eli Zaretskii @ 2019-05-28 15:18 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: emacs-devel@gnu.org
> Date: Tue, 28 May 2019 07:55:40 -0400
> 
> >> Hence according to my reading of the code, this call to
> >> string-as-unibyte will always just return its argument unchanged.
> > That's not entirely true, because encode-coding-char can return a multibyte
> > string.
> 
> That's weird.  When would that happen?

"Use the source, Luke!"

  (let* ((str1 (string-as-multibyte (string char)))
	 (str2 (string-as-multibyte (string char char)))
	 (found (find-coding-systems-string str1))
	enc1 enc2 i1 i2)
    (if (and (consp found)
	     (eq (car found) 'undecided))
	str1  <<<<<<<<<<<<<<<<<<<<<<<<<

If we return here, the value is str1, which is a multibyte string, see
how it was calculated.

The easiest use case is this:

  (multibyte-string-p (encode-coding-char ?a 'utf-8))
    => t

I didn't think enough about this to figure out if there can be less
trivial use cases.  If you can describe all the cases where
find-coding-systems-string will return a list whose 'car' is
'undecided', my hat off to you.




* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
  2019-05-28 15:11                       ` Eli Zaretskii
@ 2019-05-28 17:25                         ` Stefan Monnier
  2019-05-28 18:51                           ` Eli Zaretskii
  0 siblings, 1 reply; 50+ messages in thread
From: Stefan Monnier @ 2019-05-28 17:25 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

> OK, but how does this affect the issue at hand?  We want to replace
> string-as-unibyte, or remove it, and the obsolescence message says to
> replace it with encode-coding-string.

But this obsolescence message assumes that the call to string-as-unibyte
was used because of a need to encode the strings using Emacs's internal
coding-system.  In this case, the string is already encoded (as stated
by the function's name, the docstring, ...), so using
encode-coding-string is rather odd.

Also using string-to-unibyte will correctly signal an error if the
caller forgot to send an *encoded* string.

> Either way, using string-to-unibyte instead sounds less desirable
> to me.

I agree that removing the call altogether is the better option.

        Stefan


* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
  2019-05-28 15:18                   ` Eli Zaretskii
@ 2019-05-28 17:43                     ` Stefan Monnier
  2019-05-28 18:58                       ` Eli Zaretskii
  0 siblings, 1 reply; 50+ messages in thread
From: Stefan Monnier @ 2019-05-28 17:43 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

> "Use the source, Luke!"

But the dark side is so enticing!

>   (let* ((str1 (string-as-multibyte (string char)))
> 	 (str2 (string-as-multibyte (string char char)))

Why on earth do we call string-as-multibyte here?  AFAIK, the only cases
where `string` returns a unibyte string is when char <128 (it could make
sense to also do that for char ≥128 and <160, but we don't seem to do
that currently) and these are better turned into multibyte via
string-TO-multibyte (tho here we don't even need that, since the unibyte
string works just as well for what we do) than string-AS-multibyte.

I think this is an error.  The patch below seems in order.

> 	 (found (find-coding-systems-string str1))
> 	enc1 enc2 i1 i2)
>     (if (and (consp found)
> 	     (eq (car found) 'undecided))
> 	str1  <<<<<<<<<<<<<<<<<<<<<<<<<
>
> If we return here, the value is str1, which is a multibyte string, see
> how it was calculated.

I think it's a bug.  Largely harmless since it only applies to ASCII
chars for which we conflate the char/byte status, but still, it's a wart.

> I didn't think enough about this to figure out if there can be less
> trivial use cases.  If you can describe all the cases where
> find-coding-systems-string will return a list whose 'car' is
> 'undecided', my hat off to you.

AFAIK it only happens for pure-ASCII strings.


        Stefan


diff --git a/lisp/international/mule-cmds.el b/lisp/international/mule-cmds.el
index 2b0aaca664..391efbedc8 100644
--- a/lisp/international/mule-cmds.el
+++ b/lisp/international/mule-cmds.el
@@ -2926,12 +2926,11 @@ encode-coding-char
 If CODING-SYSTEM can't safely encode CHAR, return nil.
 The 3rd optional argument CHARSET, if non-nil, is a charset preferred
 on encoding."
-  (let* ((str1 (string-as-multibyte (string char)))
-	 (str2 (string-as-multibyte (string char char)))
+  (let* ((str1 (string char))
+	 (str2 (string char char))
 	 (found (find-coding-systems-string str1))
 	enc1 enc2 i1 i2)
-    (if (and (consp found)
-	     (eq (car found) 'undecided))
+    (if (not (multibyte-string-p str1))
 	str1
       (when (memq (coding-system-base coding-system) found)
 	;; We must find the encoded string of CHAR.  But, just encoding





* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
  2019-05-28 17:25                         ` Stefan Monnier
@ 2019-05-28 18:51                           ` Eli Zaretskii
  2019-05-28 23:39                             ` Stefan Monnier
  0 siblings, 1 reply; 50+ messages in thread
From: Eli Zaretskii @ 2019-05-28 18:51 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: emacs-devel@gnu.org
> Date: Tue, 28 May 2019 13:25:17 -0400
> 
> > OK, but how does this affect the issue at hand?  We want to replace
> > string-as-unibyte, or remove it, and the obsolescence message says to
> > replace it with encode-coding-string.
> 
> But this obsolescence message assumes that the call to string-as-unibyte
> was used because of a need to encode the strings using Emacs's internal
> coding-system.  In this case, the string is already encoded (as stated
> by the function's name, the docstring, ...), so using
> encode-coding-string is rather odd.

If the input string is unibyte, then using string-to-unibyte will be
also odd.  And if it's multibyte, then using encode-coding-string is
not really odd, is it?

> I agree that removing the call altogether is the better option.

Right.  In that case we need to document that the function expects as
input either a unibyte string or a pure-ASCII string.




* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
  2019-05-28 17:43                     ` Stefan Monnier
@ 2019-05-28 18:58                       ` Eli Zaretskii
  2019-05-28 19:35                         ` Eli Zaretskii
  2019-05-28 23:44                         ` Stefan Monnier
  0 siblings, 2 replies; 50+ messages in thread
From: Eli Zaretskii @ 2019-05-28 18:58 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: emacs-devel@gnu.org
> Date: Tue, 28 May 2019 13:43:47 -0400
> 
> >   (let* ((str1 (string-as-multibyte (string char)))
> > 	 (str2 (string-as-multibyte (string char char)))
> 
> Why on earth do we call string-as-multibyte here?  AFAIK, the only cases
> where `string` returns a unibyte string is when char <128 (it could make
> sense to also do that for char ≥128 and <160, but we don't seem to do
> that currently) and these are better turned into multibyte via
> string-TO-multibyte (tho here we don't even need that, since the unibyte
> string works just as well for what we do) than string-AS-multibyte.
> 
> I think this is an error.  The patch below seems in order.

I'm not sure.  Be sure to read the comments about the tricky business
of this function, and the method it employs to solve it, and be sure
you understand all of the subtleties there.

> > I didn't think enough about this to figure out if there can be less
> > trivial use cases.  If you can describe all the cases where
> > find-coding-systems-string will return a list whose 'car' is
> > 'undecided', my hat off to you.
> 
> AFAIK it only happens for pure-ASCII strings.

What is your reasoning?




* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
  2019-05-28 18:58                       ` Eli Zaretskii
@ 2019-05-28 19:35                         ` Eli Zaretskii
  2019-05-28 23:44                         ` Stefan Monnier
  1 sibling, 0 replies; 50+ messages in thread
From: Eli Zaretskii @ 2019-05-28 19:35 UTC (permalink / raw)
  To: monnier; +Cc: emacs-devel

> Date: Tue, 28 May 2019 21:58:03 +0300
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: emacs-devel@gnu.org
> 
> > From: Stefan Monnier <monnier@iro.umontreal.ca>
> > Cc: emacs-devel@gnu.org
> > Date: Tue, 28 May 2019 13:43:47 -0400
> > 
> > >   (let* ((str1 (string-as-multibyte (string char)))
> > > 	 (str2 (string-as-multibyte (string char char)))
> > 
> > Why on earth do we call string-as-multibyte here?  AFAIK, the only cases
> > where `string` returns a unibyte string is when char <128 (it could make
> > sense to also do that for char ≥128 and <160, but we don't seem to do
> > that currently) and these are better turned into multibyte via
> > string-TO-multibyte (tho here we don't even need that, since the unibyte
> > string works just as well for what we do) than string-AS-multibyte.
> > 
> > I think this is an error.  The patch below seems in order.
> 
> I'm not sure.  Be sure to read the comments about the tricky business
> of this function, and the method it employs to solve it, and be sure
> you understand all of the subtleties there.

Btw, that function has a bug:

  (encode-coding-char ?a 'ebcdic-us) => "a"

It assumes that any ASCII character is encoded into itself, i.e. that
every coding-system is ASCII-compatible, which is false.
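For comparison, a sketch of what the string-level encoder produces; `my-encoded-byte-list' is a hypothetical helper, and the example assumes the ebcdic-us coding system is defined in the running Emacs, as in the call above.

```elisp
;; Hypothetical helper: encode TEXT and return the bytes as a list.
(defun my-encoded-byte-list (text coding-system)
  "Return the bytes of TEXT encoded with CODING-SYSTEM as a list."
  (append (encode-coding-string text coding-system) nil))

;; utf-8 is ASCII-compatible, so ?a stays byte 97.
(my-encoded-byte-list "a" 'utf-8)       ; => (97)

;; EBCDIC is not ASCII-compatible, so the byte for ?a is expected to
;; differ from 97 here.
(my-encoded-byte-list "a" 'ebcdic-us)
```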




* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
  2019-05-28 18:51                           ` Eli Zaretskii
@ 2019-05-28 23:39                             ` Stefan Monnier
  2019-05-29  2:45                               ` Eli Zaretskii
  0 siblings, 1 reply; 50+ messages in thread
From: Stefan Monnier @ 2019-05-28 23:39 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

> If the input string is unibyte, then using string-to-unibyte will be
> also odd.  And if it's multibyte, then using encode-coding-string is
> not really odd, is it?

It's odd that the input string be multibyte since it's supposed to be
encoded, yes.  And it's also odd to call encode-coding-string on
a string that we assume to be encoded (just because the string is
multibyte doesn't make it less odd).

In any case, I think we should strive to avoid using "encoded" multibyte
strings.  I can't remember ever having had a need for those, but when
needed the way to turn a unibyte string into an equivalent multibyte
string (without changing the fact that it's encoded) is
string-to-multibyte.

>> I agree that removing the call altogether is the better option.
>
> Right.  In that case we need to document that the function expects as
> input either a unibyte string or a pure-ASCII string.

OK, I'll do that.

        Stefan


* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
  2019-05-28 18:58                       ` Eli Zaretskii
  2019-05-28 19:35                         ` Eli Zaretskii
@ 2019-05-28 23:44                         ` Stefan Monnier
  2019-05-29 14:33                           ` Eli Zaretskii
  1 sibling, 1 reply; 50+ messages in thread
From: Stefan Monnier @ 2019-05-28 23:44 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

>> I think this is an error.  The patch below seems in order.
>
> I'm not sure.  Be sure to read the comments about the tricky business
> of this function, and the method it employs to solve it, and be sure
> you understand all of the subtleties there.

This only applies to the case where `char` is not ASCII.
I installed a slightly more conservative patch which should make sure
the returned string is always unibyte and that also fixes the ebcdic
case at the same occasion.

>> AFAIK it only happens for pure-ASCII strings.
> What is your reasoning?

For one, the docstring says that, pretty much.  But also the fact that
`undecided` implies that any coding system should be applicable, IOW
`char` is in the intersection of all the coding systems we have, so it
can only happen if the string is pure ASCII (since one of the coding
systems is `us-ascii`, the intersection cannot be larger than that).
That doesn't preclude a non-`undecided` return value for some pure ASCII
strings, admittedly, tho the docstring suggests that any ASCII string
just returns `undecided`.
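A quick way to probe this claim, as a sketch only; `my-undecided-p' is a hypothetical helper, and the two inputs below do not cover every possible string.

```elisp
;; Hypothetical helper: does `find-coding-systems-string' answer
;; with the single-element list (undecided) for STRING?
(defun my-undecided-p (string)
  "Return non-nil if STRING yields (undecided) from `find-coding-systems-string'."
  (equal (find-coding-systems-string string) '(undecided)))

(my-undecided-p "abc")            ; => t   (pure ASCII)
(my-undecided-p (string #xe9))    ; => nil ("é" restricts the choices)
```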

        Stefan


* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
  2019-05-28 23:39                             ` Stefan Monnier
@ 2019-05-29  2:45                               ` Eli Zaretskii
  2019-05-29 16:28                                 ` Stefan Monnier
  0 siblings, 1 reply; 50+ messages in thread
From: Eli Zaretskii @ 2019-05-29  2:45 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: emacs-devel@gnu.org
> Date: Tue, 28 May 2019 19:39:15 -0400
> 
> In any case, I think we should strive to avoid using "encoded" multibyte
> strings.

I don't think it's possible, because buffers are by default
multibyte.  (And I don't really see a need for avoiding that, but
that's an old disagreement between us.)




* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
  2019-05-28 23:44                         ` Stefan Monnier
@ 2019-05-29 14:33                           ` Eli Zaretskii
  0 siblings, 0 replies; 50+ messages in thread
From: Eli Zaretskii @ 2019-05-29 14:33 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: emacs-devel@gnu.org
> Date: Tue, 28 May 2019 19:44:15 -0400
> 
> I installed a slightly more conservative patch

Ugh! why unilaterally? why not show the patch and wait for agreement?
what's the rush?




* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
  2019-05-29  2:45                               ` Eli Zaretskii
@ 2019-05-29 16:28                                 ` Stefan Monnier
  2019-05-29 18:19                                   ` Eli Zaretskii
  0 siblings, 1 reply; 50+ messages in thread
From: Stefan Monnier @ 2019-05-29 16:28 UTC (permalink / raw)
  To: emacs-devel

>> In any case, I think we should strive to avoid using "encoded" multibyte
>> strings.
> I don't think it's possible,

"strive to avoid" is always possible.  I didn't say we should completely
disallow it (which might be possible, but it's too far from where we are
to be able to tell).

> because buffers are by default multibyte.

And those contain chars 99.99% of the time.
And buffers that contain bytes are unibyte in most cases.
This is the sane way to work.  It makes it easy to know what is what.

Also, not only is it possible, but it's pretty much the case already.
Whether we'll be able to eliminate all cases, I don't know.  But I think
we should try to make the cases of "decoded text in unibyte" and "encoded
text in multibyte" as rare as possible.

[ Similarly, set-buffer-multibyte should only ever be called in an
  empty buffer.  ]

        Stefan

PS: I added checks in encoding/decoding functions to signal errors when
decoding from multibyte and encoding from unibyte (in my local Emacs),
and that's been tremendously useful to track down and fix encoding bugs
in Gnus.


* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
  2019-05-29 16:28                                 ` Stefan Monnier
@ 2019-05-29 18:19                                   ` Eli Zaretskii
  2019-05-29 18:58                                     ` Stefan Monnier
  0 siblings, 1 reply; 50+ messages in thread
From: Eli Zaretskii @ 2019-05-29 18:19 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Date: Wed, 29 May 2019 12:28:30 -0400
> 
> PS: I added checks in encoding/decoding functions to signal errors when
> decoding from multibyte and encoding from unibyte (in my local Emacs),
> and that's been tremendously useful to track down and fix encoding bugs
> in Gnus.

I have nothing against you making these changes locally to catch
bugs.  I sometimes do similar things in my local version of the code.




* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
  2019-05-29 18:19                                   ` Eli Zaretskii
@ 2019-05-29 18:58                                     ` Stefan Monnier
  2019-05-29 19:09                                       ` Eli Zaretskii
  0 siblings, 1 reply; 50+ messages in thread
From: Stefan Monnier @ 2019-05-29 18:58 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

> > I don't think there was much to discuss about it.
> Unilaterally.

Yes, it was a clear and simple bug-fix, AFAIK.

>> PS: I added checks in encoding/decoding functions to signal errors when
>> decoding from multibyte and encoding from unibyte (in my local Emacs),
>> and that's been tremendously useful to track down and fix encoding bugs
>> in Gnus.
> I have nothing against you making these changes locally to catch
> bugs.  I sometimes do similar things in my local version of the code.

Yes, and this experience showed me that it's a good practice to follow
(I don't mean "adding checks" but "align unibyte/multibyte with
encoded/decoded").

BTW, I think we should add such checks in `master` (but make them
conditional on a variable like
`check-encoding/decoding-strictly-for-debugging-purposes` which would
of course default to nil).
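A rough sketch of what such opt-in checks could look like at the Lisp level. Every name here is hypothetical, the real checks discussed above live in a local build and are not shown in this thread, and pure-ASCII unibyte input is deliberately allowed through as an assumption of this sketch.

```elisp
;; Hypothetical names throughout; not an existing Emacs API.
(defvar my-strict-coding-checks nil
  "When non-nil, signal an error on suspicious encode/decode input.")

(defun my-unibyte-non-ascii-p (string)
  "Return non-nil if STRING is unibyte and holds a byte above 127."
  (and (not (multibyte-string-p string))
       (let ((found nil))
         (dolist (byte (append string nil) found)
           (when (> byte 127) (setq found t))))))

(defun my-strict-decode (bytes coding-system)
  "Decode BYTES, rejecting multibyte input when checks are enabled."
  (when (and my-strict-coding-checks (multibyte-string-p bytes))
    (error "Decoding a multibyte string"))
  (decode-coding-string bytes coding-system))

(defun my-strict-encode (text coding-system)
  "Encode TEXT, rejecting non-ASCII unibyte input when checks are enabled."
  (when (and my-strict-coding-checks (my-unibyte-non-ascii-p text))
    (error "Encoding a unibyte string with non-ASCII bytes"))
  (encode-coding-string text coding-system))
```

With the variable left at nil, both wrappers behave exactly like the functions they call, so nothing changes for users who have not asked for the checks.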


        Stefan





* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
  2019-05-29 18:58                                     ` Stefan Monnier
@ 2019-05-29 19:09                                       ` Eli Zaretskii
  2019-05-29 19:50                                         ` Stefan Monnier
  0 siblings, 1 reply; 50+ messages in thread
From: Eli Zaretskii @ 2019-05-29 19:09 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: emacs-devel@gnu.org
> Date: Wed, 29 May 2019 14:58:57 -0400
> 
> > > I don't think there was much to discuss about it.
> > Unilaterally.
> 
> Yes, it was a clear and simple bug-fix, AFAIK.

We were in the middle of discussing what should be done, so fairness
would have it that a patch should be proposed, not pushed.  Or so I
thought, turns out naïvely.

> > I have nothing against you making these changes locally to catch
> > bugs.  I sometimes do similar things in my local version of the code.
> 
> Yes, and this experience showed me that it's a good practice to follow
> (I don't mean "adding checks" but "align unibyte/multibyte with
> encoded/decoded").
> 
> BTW, I think we should add such checks in `master` (but make them
> conditional on a variable like
> `check-encoding/decoding-strictly-for-debugging-purposes` which would
> of course default to nil).

I'm okay with enabling such checks in an Emacs configured with
'--enable-checking', but not in the production version.  The main
purpose of a released Emacs is not to expose bugs, it's to allow users
to do whatever they need to do in Emacs, even if it means tolerating some
bugs.




* Re: emacs-26 8f18d12: Improve documentation of decoding into a unibyte buffer
  2019-05-29 19:09                                       ` Eli Zaretskii
@ 2019-05-29 19:50                                         ` Stefan Monnier
  0 siblings, 0 replies; 50+ messages in thread
From: Stefan Monnier @ 2019-05-29 19:50 UTC (permalink / raw)
  To: emacs-devel

> I'm okay with enabling such checks in an Emacs configured with
> '--enable-checking', but not in the production version.  The main
> purpose of a released Emacs is not to expose bugs, it's to allow users
> to do whatever they need to do in Emacs, even if it means tolerating some
> bugs.

Yes, as mentioned, the default would be nil: there's no point signaling
such errors to an end-user who hasn't specifically asked for such
sanity checks.

I hadn't (and still haven't) considered enabling those checks
when --enable-checking is specified: it might turn up too many false
positives currently.  But I'll keep it in mind.

        Stefan


