bug#55777: [PATCH] Improve documentation of `string-to-multibyte', `string-to-unibyte'

* bug#55777: [PATCH] Improve documentation of `string-to-multibyte', `string-to-unibyte'
@ 2022-06-03  6:20 Richard Hansen
  2022-06-03  7:02 ` Eli Zaretskii
  0 siblings, 1 reply; 9+ messages in thread
From: Richard Hansen @ 2022-06-03  6:20 UTC (permalink / raw)
  To: 55777


[-- Attachment #1.1.1: Type: text/plain, Size: 15 bytes --]

See attached.

[-- Attachment #1.1.2: 0001-Improve-documentation-of-string-to-multibyte-string-.patch --]
[-- Type: text/x-patch, Size: 2576 bytes --]

From 2e0e944840de65936a979b075aa2ea4177f49854 Mon Sep 17 00:00:00 2001
From: Richard Hansen <rhansen@rhansen.org>
Date: Fri, 3 Jun 2022 01:04:41 -0400
Subject: [PATCH] Improve documentation of `string-to-multibyte',
 `string-to-unibyte'

* doc/lispref/nonascii.texi (Converting Representations): Fix
erroneous description of `string-to-unibyte' (it does not signal an
error on eight-bit characters) and clarify its behavior. Update
documentation of `string-to-multibyte' to match.
---
 doc/lispref/nonascii.texi | 24 ++++++++++++++----------
 1 file changed, 14 insertions(+), 10 deletions(-)

diff --git a/doc/lispref/nonascii.texi b/doc/lispref/nonascii.texi
index d7d25dc36a..8746b79de8 100644
--- a/doc/lispref/nonascii.texi
+++ b/doc/lispref/nonascii.texi
@@ -271,20 +271,24 @@ Converting Representations
 @defun string-to-multibyte string
 This function returns a multibyte string containing the same sequence
 of characters as @var{string}.  If @var{string} is a multibyte string,
-it is returned unchanged.  The function assumes that @var{string}
-includes only @acronym{ASCII} characters and raw 8-bit bytes; the
-latter are converted to their multibyte representation corresponding
-to the codepoints @code{#x3FFF80} through @code{#x3FFFFF}, inclusive
-(@pxref{Text Representations, codepoints}).
+it is returned unchanged.  Otherwise, byte values @code{#x00} through
+@code{#x7F} (@acronym{ASCII} characters) are mapped to their
+corresponding codepoints, and byte values @code{#x80} through
+@code{#xFF} (eight-bit characters) are mapped to codepoints
+@code{#x3FFF80} through @code{#x3FFFFF} (@pxref{Text Representations,
+codepoints}).
 @end defun
 
 @defun string-to-unibyte string
 This function returns a unibyte string containing the same sequence of
-characters as @var{string}.  It signals an error if @var{string}
-contains a non-@acronym{ASCII} character.  If @var{string} is a
-unibyte string, it is returned unchanged.  Use this function for
-@var{string} arguments that contain only @acronym{ASCII} and eight-bit
-characters.
+characters as @var{string}.  If @var{string} is a unibyte string, it
+is returned unchanged.  Otherwise, codepoints @code{#x00} through
+@code{#x7F} (@acronym{ASCII} characters) are mapped to their
+corresponding byte values, and codepoints @code{#x3FFF80} through
+@code{#x3FFFFF} (eight-bit characters) are mapped to byte values
+@code{#x80} through @code{#xFF} (@pxref{Text Representations,
+codepoints}).  It signals an error if any other codepoint is
+encountered.
 @end defun
 
 @defun byte-to-string byte
-- 
2.36.1



* bug#55777: [PATCH] Improve documentation of `string-to-multibyte', `string-to-unibyte'
  2022-06-03  6:20 bug#55777: [PATCH] Improve documentation of `string-to-multibyte', `string-to-unibyte' Richard Hansen
@ 2022-06-03  7:02 ` Eli Zaretskii
  2022-06-04  3:28   ` Richard Hansen
  0 siblings, 1 reply; 9+ messages in thread
From: Eli Zaretskii @ 2022-06-03  7:02 UTC (permalink / raw)
  To: Richard Hansen; +Cc: 55777

> Date: Fri, 3 Jun 2022 02:20:35 -0400
> From: Richard Hansen <rhansen@rhansen.org>
> 
> See attached.

Thanks, but please explain the motivation for these changes.  In
particular, why would we need to describe in a doc string such
intimate details of our current implementation?  If there was some
situation where you needed these details for some Lisp program, please
describe that situation.


* bug#55777: [PATCH] Improve documentation of `string-to-multibyte', `string-to-unibyte'
  2022-06-03  7:02 ` Eli Zaretskii
@ 2022-06-04  3:28   ` Richard Hansen
  2022-06-04  7:09     ` Eli Zaretskii
  0 siblings, 1 reply; 9+ messages in thread
From: Richard Hansen @ 2022-06-04  3:28 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 55777


[-- Attachment #1.1.1: Type: text/plain, Size: 2081 bytes --]

On 6/3/22 03:02, Eli Zaretskii wrote:
> Thanks, but please explain the motivation for these changes.

The motivation is in the commit message, which I revised in the
attached patch to hopefully make it more clear.

> In particular, why would we need to describe in a doc string such 
> intimate details of our current implementation?
There is a fair amount of implementation detail right now; the patch
doesn't significantly change that. But I revised the patch to remove
some of the detail.

> If there was some situation where you needed these details for some 
> Lisp program, please describe that situation.
I'm trying to understand some inconsistent behavior I'm observing
while writing code to process binary data, and I found the existing
documentation lacking.

     ;; Unibyte vs. multibyte characters:
     (eq ?\xff ?\x3fffff)                           ; t (ok)
     (eq (aref "\x3fffff" 0) (aref "\xff" 0))       ; t (ok)
     (eq (aref "\x3fffff 😀" 0) (aref "\xff 😀" 0)) ; t (ok)
     (eq (aref "\xff" 0) (aref "\xff 😀" 0))        ; nil (expected t)

     ;; Unibyte vs. multibyte strings:
     (multibyte-string-p "\xff")                    ; nil (ok)
     (multibyte-string-p "\x3fffff")                ; nil (ok???)
     (string= "\xff" (string-to-multibyte "\xff"))  ; nil (expected t)

     ;; Char code vs. Unicode codepoint:
     (string= "😀\xff" "😀\x3fffff")                ; t (ok)
     (string= "😀\N{U+ff}" "😀\xff")                ; nil (ok)
     (string= "😀\N{U+ff}" "😀\x3fffff")            ; nil (ok)
     (string= "😀ÿ" "😀\N{U+ff}")                   ; t (ok)
     (string= "😀ÿ" "😀\xff")                       ; nil (ok)
     (string= "😀ÿ" "😀\x3fffff")                   ; nil (ok)
     (eq ?\N{U+ff} ?\xff)                           ; t (expected nil)
     (eq ?\N{U+ff} ?\x3fffff)                       ; t (expected nil)
     (eq ?ÿ ?\xff)                                  ; t (expected nil)
     (eq ?ÿ ?\x3fffff)                              ; t (expected nil)

[-- Attachment #1.1.2: 0001-Improve-documentation-of-string-to-multibyte-string-.patch --]
[-- Type: text/x-patch, Size: 2678 bytes --]

From 6813b0a43250c9633d84d72418904025e973f1c8 Mon Sep 17 00:00:00 2001
From: Richard Hansen <rhansen@rhansen.org>
Date: Fri, 3 Jun 2022 01:04:41 -0400
Subject: [PATCH] Improve documentation of `string-to-multibyte',
 `string-to-unibyte'

* doc/lispref/nonascii.texi (Converting Representations): Remove
confusing sentence from `string-to-multibyte' documentation (by
definition unibyte strings can only contain ASCII and eight-bit
characters, so there's no need to assume that unibyte strings only
contain those characters).  Fix description of the characters that
will cause `string-to-unibyte' to signal an error (`eight-bit'
characters are OK).  Remove some implementation details that are
discussed in the xrefed section.  Word the documentation for the two
functions similarly so that it is clear they are inverses of each
other.
---
 doc/lispref/nonascii.texi | 19 +++++++++----------
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/doc/lispref/nonascii.texi b/doc/lispref/nonascii.texi
index d7d25dc36a..e8f02d8f2f 100644
--- a/doc/lispref/nonascii.texi
+++ b/doc/lispref/nonascii.texi
@@ -271,20 +271,19 @@ Converting Representations
 @defun string-to-multibyte string
 This function returns a multibyte string containing the same sequence
 of characters as @var{string}.  If @var{string} is a multibyte string,
-it is returned unchanged.  The function assumes that @var{string}
-includes only @acronym{ASCII} characters and raw 8-bit bytes; the
-latter are converted to their multibyte representation corresponding
-to the codepoints @code{#x3FFF80} through @code{#x3FFFFF}, inclusive
-(@pxref{Text Representations, codepoints}).
+it is returned unchanged.  Otherwise, byte values are transformed to
+their corresponding multibyte codepoints (@acronym{ASCII} characters
+and characters in the @code{eight-bit} charset).  @xref{Text
+Representations, codepoints}.
 @end defun
 
 @defun string-to-unibyte string
 This function returns a unibyte string containing the same sequence of
-characters as @var{string}.  It signals an error if @var{string}
-contains a non-@acronym{ASCII} character.  If @var{string} is a
-unibyte string, it is returned unchanged.  Use this function for
-@var{string} arguments that contain only @acronym{ASCII} and eight-bit
-characters.
+characters as @var{string}.  If @var{string} is a unibyte string, it
+is returned unchanged.  Otherwise, @acronym{ASCII} characters and
+characters in the @code{eight-bit} charset are converted to their
+corresponding byte values.  It signals an error if any other character
+is encountered.  @xref{Text Representations, codepoints}.
 @end defun
 
 @defun byte-to-string byte
-- 
2.36.1



* bug#55777: [PATCH] Improve documentation of `string-to-multibyte', `string-to-unibyte'
  2022-06-04  3:28   ` Richard Hansen
@ 2022-06-04  7:09     ` Eli Zaretskii
  2022-06-05  0:16       ` Richard Hansen
  0 siblings, 1 reply; 9+ messages in thread
From: Eli Zaretskii @ 2022-06-04  7:09 UTC (permalink / raw)
  To: Richard Hansen; +Cc: 55777

> Date: Fri, 3 Jun 2022 23:28:51 -0400
> Cc: 55777@debbugs.gnu.org
> From: Richard Hansen <rhansen@rhansen.org>
> 
> > If there was some situation where you needed these details for some 
> > Lisp program, please describe that situation.
> I'm trying to understand some inconsistent behavior I'm observing
> while writing code to process binary data, and I found the existing
> documentation lacking.

You are digging into low-level details of how Emacs keeps strings in
memory, and the higher-level context of _why_ you need to understand
these details is left untold.

In general, Lisp programs are well advised to stay away from
manipulating unibyte strings, and definitely to refrain from comparing
unibyte and multibyte strings -- because these are supposed to be
never needed in Lisp applications, and because doing TRT with those
requires non-trivial knowledge of the Emacs internals.

I see no reason to complicate the documentation for the very rare
occasions where these issues unfortunately leak to
higher-than-expected levels.

>      ;; Unibyte vs. multibyte characters:
>      (eq ?\xff ?\x3fffff)                           ; t (ok)
>      (eq (aref "\x3fffff" 0) (aref "\xff" 0))       ; t (ok)
>      (eq (aref "\x3fffff 😀" 0) (aref "\xff 😀" 0)) ; t (ok)
>      (eq (aref "\xff" 0) (aref "\xff 😀" 0))        ; nil (expected t)
> 
>      ;; Unibyte vs. multibyte strings:
>      (multibyte-string-p "\xff")                    ; nil (ok)
>      (multibyte-string-p "\x3fffff")                ; nil (ok???)
>      (string= "\xff" (string-to-multibyte "\xff"))  ; nil (expected t)
> 
>      ;; Char code vs. Unicode codepoint:
>      (string= "😀\xff" "😀\x3fffff")                ; t (ok)
>      (string= "😀\N{U+ff}" "😀\xff")                ; nil (ok)
>      (string= "😀\N{U+ff}" "😀\x3fffff")            ; nil (ok)
>      (string= "😀ÿ" "😀\N{U+ff}")                   ; t (ok)
>      (string= "😀ÿ" "😀\xff")                       ; nil (ok)
>      (string= "😀ÿ" "😀\x3fffff")                   ; nil (ok)
>      (eq ?\N{U+ff} ?\xff)                           ; t (expected nil)
>      (eq ?\N{U+ff} ?\x3fffff)                       ; t (expected nil)
>      (eq ?ÿ ?\xff)                                  ; t (expected nil)
>      (eq ?ÿ ?\x3fffff)                              ; t (expected nil)

If you still don't understand some of these, please feel free to ask
questions, and we will gladly answer them.  But I see no reason to
change the documentation on that behalf.

> @@ -271,20 +271,19 @@ Converting Representations
>  @defun string-to-multibyte string
>  This function returns a multibyte string containing the same sequence
>  of characters as @var{string}.  If @var{string} is a multibyte string,
> -it is returned unchanged.  The function assumes that @var{string}
> -includes only @acronym{ASCII} characters and raw 8-bit bytes; the
> -latter are converted to their multibyte representation corresponding
> -to the codepoints @code{#x3FFF80} through @code{#x3FFFFF}, inclusive
> -(@pxref{Text Representations, codepoints}).
> +it is returned unchanged.  Otherwise, byte values are transformed to
> +their corresponding multibyte codepoints (@acronym{ASCII} characters
> +and characters in the @code{eight-bit} charset).  @xref{Text
> +Representations, codepoints}.

This loses information, so I don't think we should make this change.
It might be trivially clear to you that a unibyte string can only
contain ASCII and raw bytes, but it isn't necessarily clear to
everyone.

>  @defun string-to-unibyte string
>  This function returns a unibyte string containing the same sequence of
> -characters as @var{string}.  It signals an error if @var{string}
> -contains a non-@acronym{ASCII} character.  If @var{string} is a
> -unibyte string, it is returned unchanged.  Use this function for
> -@var{string} arguments that contain only @acronym{ASCII} and eight-bit
> -characters.
> +characters as @var{string}.  If @var{string} is a unibyte string, it
> +is returned unchanged.  Otherwise, @acronym{ASCII} characters and
> +characters in the @code{eight-bit} charset are converted to their
> +corresponding byte values.  It signals an error if any other character
> +is encountered.  @xref{Text Representations, codepoints}.

This basically rearranges the existing text, and adds just one
sentence:

  Otherwise, @acronym{ASCII} characters and characters in the
  @code{eight-bit} charset are converted to their corresponding byte
  values.

The cross-reference is identical to the one we already have a few
lines above this text, so it is redundant.  I've made a change to add
the above sentence, and slightly rearranged the text to be more clear
and logically complete.

Here's how this text looks now on the emacs-28 branch (and will appear
in Emacs 28.2 and later):

  @defun string-to-unibyte string
  This function returns a unibyte string containing the same sequence of
  characters as @var{string}.  If @var{string} is a unibyte string, it
  is returned unchanged.  Otherwise, @acronym{ASCII} characters and
  characters in the @code{eight-bit} charset are converted to their
  corresponding byte values.  Use this function for @var{string}
  arguments that contain only @acronym{ASCII} and eight-bit characters;
  the function signals an error if any other characters are encountered.
  @end defun
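
For illustration, here is a minimal sketch of the behavior this text
describes.  The function name is hypothetical and the snippet is not
part of the manual; it only assumes the standard `unibyte-string',
`string-to-multibyte' and `string-to-unibyte' functions:

```elisp
;; Hypothetical demo function; a sketch, not text from the manual.
(defun demo-unibyte-round-trip ()
  "Round-trip a unibyte string, then try an unconvertible character."
  (let* ((bytes (unibyte-string #x41 #xff))   ; ASCII "A" plus one raw byte
         (multi (string-to-multibyte bytes))  ; same content, multibyte form
         (back  (string-to-unibyte multi)))   ; ASCII and eight-bit convert back
    (list (multibyte-string-p multi)
          (equal back bytes)
          ;; U+00FF is an ordinary character, neither ASCII nor
          ;; eight-bit, so the conversion signals an error.
          (condition-case nil
              (progn (string-to-unibyte (string #xff)) 'no-error)
            (error 'signaled)))))

(demo-unibyte-round-trip)   ; => (t t signaled)
```

The last element shows the case the new wording covers: a character
outside ASCII and the `eight-bit' charset makes the function signal an
error rather than return a string.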

Thanks.


* bug#55777: [PATCH] Improve documentation of `string-to-multibyte', `string-to-unibyte'
  2022-06-04  7:09     ` Eli Zaretskii
@ 2022-06-05  0:16       ` Richard Hansen
  2022-06-05  5:37         ` Eli Zaretskii
  0 siblings, 1 reply; 9+ messages in thread
From: Richard Hansen @ 2022-06-05  0:16 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 55777


[-- Attachment #1.1.1: Type: text/plain, Size: 3677 bytes --]

On 6/4/22 03:09, Eli Zaretskii wrote:
>>> If there was some situation where you needed these details for some
>>> Lisp program, please describe that situation.
>>
>> I'm trying to understand some inconsistent behavior I'm observing
>> while writing code to process binary data, and I found the existing
>> documentation lacking.
> 
> You are digging into low-level details of how Emacs keeps strings in
> memory, and the higher-level context of _why_ you need to understand
> these details is left untold.

Readers either think the documentation is confusing or they don't; why
they need to understand the documentation is mostly irrelevant. I
find the documentation to be confusing, and I suspect I am not the
only one.

> In general, Lisp programs are well advised to stay away from
> manipulating unibyte strings, and definitely to refrain from comparing
> unibyte and multibyte strings -- because these are supposed to be
> never needed in Lisp applications, and because doing TRT with those
> requires non-trivial knowledge of the Emacs internals.

I disagree with "well advised". The documentation in 34.1 and 34.3
makes it sound like the representation is merely an internal elisp
implementation detail that programmers don't need to worry about,
unless they are doing something unusually low-level.

I consider binary data processing to be somewhat common, not
"unusually low-level". Yet manipulating byte values 128-255 in unibyte
strings, and characters with Unicode codepoints 128-255 in multibyte
strings, is fraught with peril. For example, it is risky to use `aref'
to read a character or `aset' to write a character unless you either
know the string representation or know that the character is not in
#x80-#xff or #x3fff80-#x3fffff.
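
A minimal sketch of that hazard, assuming only the standard
`unibyte-string', `string-to-multibyte', `string' and `aref' functions
(the function name is hypothetical):

```elisp
;; Hypothetical demo function illustrating what `aref' returns for the
;; "same" byte in each representation.
(defun demo-first-char-codes ()
  "Return the first element of three one-element strings."
  (let* ((raw    (unibyte-string #xff))       ; unibyte: one raw byte
         (raw-mb (string-to-multibyte raw))   ; multibyte: eight-bit character
         (latin  (string #xff)))              ; multibyte: U+00FF
    (list (aref raw 0) (aref raw-mb 0) (aref latin 0))))

(demo-first-char-codes)   ; => (255 4194303 255)
```

The integer 255 comes back both for the raw byte in the unibyte string
and for U+00FF in a multibyte string, while the converted raw byte
reads as #x3fffff (4194303), so the number alone does not say which
one you have.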

> 
> I see no reason to complicate the documentation for the very rare
> occasions where these issues unfortunately leak to
> higher-than-expected levels.

I don't think the occasions are all that rare.  But even if they are,
the precise behavior should be documented somewhere so that
programmers who need low-level string manipulation can do so
correctly.  I would argue that programmers using `string-to-unibyte'
or `string-to-multibyte' fall into that category.

>> @@ -271,20 +271,19 @@ Converting Representations
>>   @defun string-to-multibyte string
>>   This function returns a multibyte string containing the same sequence
>>   of characters as @var{string}.  If @var{string} is a multibyte string,
>> -it is returned unchanged.  The function assumes that @var{string}
>> -includes only @acronym{ASCII} characters and raw 8-bit bytes; the
>> -latter are converted to their multibyte representation corresponding
>> -to the codepoints @code{#x3FFF80} through @code{#x3FFFFF}, inclusive
>> -(@pxref{Text Representations, codepoints}).
>> +it is returned unchanged.  Otherwise, byte values are transformed to
>> +their corresponding multibyte codepoints (@acronym{ASCII} characters
>> +and characters in the @code{eight-bit} charset).  @xref{Text
>> +Representations, codepoints}.
> 
> This loses information, so I don't think we should make this change.
> It might be trivially clear to you that a unibyte string can only
> contain ASCII and raw bytes, but it isn't necessarily clear to
> everyone.

I still find the current wording to be confusing. To me, all bytes
have 8 bits so "raw 8-bit bytes" sounds bizarrely redundant. Also,
ASCII characters are encoded to bytes, yet "raw 8-bit bytes" is meant
to refer only to non-ASCII values. I have attached another revision
that I think is complete, correct, and easier to understand.

Thanks,
Richard

[-- Attachment #1.1.2: 0001-Clarify-documentation-of-string-to-multibyte.patch --]
[-- Type: text/x-patch, Size: 1359 bytes --]

From 0d7481239056e2d701591728f240094c7a939d3a Mon Sep 17 00:00:00 2001
From: Richard Hansen <rhansen@rhansen.org>
Date: Fri, 3 Jun 2022 01:04:41 -0400
Subject: [PATCH] Clarify documentation of `string-to-multibyte'

* doc/lispref/nonascii.texi (Converting Representations): Clarify
what `string-to-multibyte' does.  (Bug#55777)
---
 doc/lispref/nonascii.texi | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/doc/lispref/nonascii.texi b/doc/lispref/nonascii.texi
index 6dc23637a7..7a3c3c63e7 100644
--- a/doc/lispref/nonascii.texi
+++ b/doc/lispref/nonascii.texi
@@ -271,11 +271,9 @@ Converting Representations
 @defun string-to-multibyte string
 This function returns a multibyte string containing the same sequence
 of characters as @var{string}.  If @var{string} is a multibyte string,
-it is returned unchanged.  The function assumes that @var{string}
-includes only @acronym{ASCII} characters and raw 8-bit bytes; the
-latter are converted to their multibyte representation corresponding
-to the codepoints @code{#x3FFF80} through @code{#x3FFFFF}, inclusive
-(@pxref{Text Representations, codepoints}).
+it is returned unchanged.  Otherwise, bytes with values 128 to 255 are
+converted to their corresponding multibyte representations in the
+@code{eight-bit} charset.
 @end defun
 
 @defun string-to-unibyte string
-- 
2.36.1



* bug#55777: [PATCH] Improve documentation of `string-to-multibyte', `string-to-unibyte'
  2022-06-05  0:16       ` Richard Hansen
@ 2022-06-05  5:37         ` Eli Zaretskii
  2022-06-06  2:00           ` Richard Hansen
  0 siblings, 1 reply; 9+ messages in thread
From: Eli Zaretskii @ 2022-06-05  5:37 UTC (permalink / raw)
  To: Richard Hansen; +Cc: 55777

> Date: Sat, 4 Jun 2022 20:16:47 -0400
> Cc: 55777@debbugs.gnu.org
> From: Richard Hansen <rhansen@rhansen.org>
> 
> > You are digging into low-level details of how Emacs keeps strings in
> > memory, and the higher-level context of _why_ you need to understand
> > these details is left untold.
> 
> Readers either think the documentation is confusing or they don't; why
> they need to understand the documentation is mostly irrelevant. I
> find the documentation to be confusing, and I suspect I am not the
> only one.

I said "understand the details", not "understand the documentation".
The latter is a no-brainer: documentation should be understandable,
and I think what we have now is.  See below regarding the
parts you say confused you.

> > In general, Lisp programs are well advised to stay away from
> > manipulating unibyte strings, and definitely to refrain from comparing
> > unibyte and multibyte strings -- because these are supposed to be
> > never needed in Lisp applications, and because doing TRT with those
> > requires non-trivial knowledge of the Emacs internals.
> 
> I disagree with "well advised". The documentation in 34.1 and 34.3
> makes it sound like the representation is merely an internal elisp
> implementation detail that programmers don't need to worry about,
> unless they are doing something unusually low-level.

That is exactly the intent.

The recommendation not to deal with non-text data directly (as opposed
to via, say, packages like bindat.el) is based on experience, both mine
and that of others.

> I consider binary data processing to be somewhat common, not
> "unusually low-level". Yet manipulating byte values 128-255 in unibyte
> strings, and characters with Unicode codepoints 128-255 in multibyte
> strings, is fraught with peril. For example, it is risky to use `aref'
> to read a character or `aset' to write a character unless you either
> know the string representation or know that the character is not in
> #x80-#xff or #x3fff80-#x3fffff.

You are describing some of the known difficulties that arise when
manipulating binary data in Emacs strings and buffers, which are the
reasons for the above recommendation.  Emacs can do all this, but not
easily, since it isn't its main design goal.  For comparison, some
other text-processing environments simply reject any non-character
data in strings.

> > I see no reason to complicate the documentation for the very rare
> > occasions where these issues unfortunately leak to
> > higher-than-expected levels.
> 
> I don't think the occasions are all that rare.  But even if they are,
> the precise behavior should be documented somewhere so that
> programmers who need low-level string manipulation can do so
> correctly.

Documenting every aspect of the Emacs behavior for the rare chance
that someone some day will find it useful would make our documentation
too large.  The Emacs Lisp Reference manual already prints in 2 very
thick volumes.  So our policy is not to document the aspects that are
too obscure to be useful to many.

> I would argue that programmers using `string-to-unibyte'
> or `string-to-multibyte' fall into that category.

I disagree.  First, these functions should be used very rarely, and we
generally try to avoid them entirely.  And if they do need to be used,
the current documentation is IMO adequate.  It still has to be
understandable, of course, but it doesn't need to describe every
possible detail of how Emacs handles raw bytes and conversions between
them and readable text.

> I still find the current wording to be confusing. To me, all bytes
> have 8 bits so "raw 8-bit bytes" sounds bizarrely redundant. Also,
> ASCII characters are encoded to bytes, yet "raw 8-bit bytes" is meant
> to refer only to non-ASCII values.

What "raw bytes" are is explained in one of the previous sections of
this chapter.

> I have attached another revision that I think is complete, correct,
> and easier to understand.

I think it muddies the water by talking about numerical values 128 to
255, which also match some Latin characters.  It also removes the
reference to the codepoints Emacs uses to represent these bytes, which
is important in some situations.  So I think your proposal would
change this text for the worse.

Could you please state what is confusing in the current wording?  If
it's only the "raw 8-bit bytes" thing, it is explained earlier in the
manual; if needed, we could add a cross-reference there to that
section.  If it's something else, please say so.  But mentioning the
single-byte numerical values here actually increases the confusion,
IME, due to overlap with valid Unicode codepoints, which is why we
should and do deliberately refrain from doing that.

Thanks.


* bug#55777: [PATCH] Improve documentation of `string-to-multibyte', `string-to-unibyte'
  2022-06-05  5:37         ` Eli Zaretskii
@ 2022-06-06  2:00           ` Richard Hansen
  2022-06-06 11:29             ` Eli Zaretskii
  0 siblings, 1 reply; 9+ messages in thread
From: Richard Hansen @ 2022-06-06  2:00 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 55777

On 6/5/22 01:37, Eli Zaretskii wrote:
> Could you please state what is confusing in the current wording?

   * "Raw 8-bit bytes" isn't really defined. It's mentioned earlier in
     the chapter -- the term is even in a @dfn{} -- but there's no
     definition there.

   * The term "raw 8-bit bytes" is misleading. It suggests binary data
     (bytes with values 0-255) but it's actually meant to only cover
     128-255.

   * The term "raw 8-bit bytes" is not used consistently. Sometimes "8"
     is spelled out as "eight", sometimes "raw" comes after "8-bit",
     and sometimes it refers to all byte values 0-255 (see the first
     sentence under `@cindex unibyte text`).

   * It's not clear whether "raw 8-bit bytes" is meant to refer to
     bytes with values 128-255, or to the *characters* that map to
     those byte values.

   * The following phrasing is weird: "The function assumes that
     @var{string} includes ASCII characters and raw 8-bit bytes". The
     purpose of "raw 8-bit bytes" is to cover non-ASCII byte values, so
     by definition that assumption is always true. By saying "the
     function assumes", the reader is left wondering about the cases
     where that assumption is not true, which in turn causes the reader
     to question whether "raw 8-bit bytes" fully covers non-ASCII byte
     values, which in turn causes the reader to wonder how to handle
     those non-covered values (whatever they are).

     Maybe something like this:

         By definition, unibyte strings contain only @acronym{ASCII}
         characters (bytes with values 0-127) and raw 8-bit bytes
         (bytes with values 128-255); the latter are converted to their
         corresponding multibyte representations in the
         @code{eight-bit} character set (@pxref{Text Representations,
         codepoints}).






* bug#55777: [PATCH] Improve documentation of `string-to-multibyte', `string-to-unibyte'
  2022-06-06  2:00           ` Richard Hansen
@ 2022-06-06 11:29             ` Eli Zaretskii
  2022-08-17 23:21               ` Stefan Kangas
  0 siblings, 1 reply; 9+ messages in thread
From: Eli Zaretskii @ 2022-06-06 11:29 UTC (permalink / raw)
  To: Richard Hansen; +Cc: 55777

> Date: Sun, 5 Jun 2022 22:00:35 -0400
> Cc: 55777@debbugs.gnu.org
> From: Richard Hansen <rhansen@rhansen.org>
> 
> On 6/5/22 01:37, Eli Zaretskii wrote:
> > Could you please state what is confusing in the current wording?
> 
>    * "Raw 8-bit bytes" isn't really defined. It's mentioned earlier in
>      the chapter -- the term is even in a @dfn{} -- but there's no
>      definition there.

It is defined as best we could without confusing the readers:

     Occasionally, Emacs needs to hold and manipulate encoded text or
  binary non-text data in its buffers or strings.  For example, when Emacs
  visits a file, it first reads the file’s text verbatim into a buffer,
  and only then converts it to the internal representation.  Before the
  conversion, the buffer holds encoded text.

     Encoded text is not really text, as far as Emacs is concerned, but
  rather a sequence of raw 8-bit bytes.  We call buffers and strings that
  hold encoded text “unibyte” buffers and strings, because Emacs treats
  them as a sequence of individual bytes. [...]

(The @dfn part is markup used whenever new terminology is first used,
it doesn't imply "definition".)

You are welcome to propose a better explanation, but one thing is a
non-starter: mentioning the numerical codes of those bytes, certainly
as part of their "definition".  This is because their numerical codes
overlap Latin characters, and people were very confused about that
when we mentioned them in the documentation in the past.  So now we
deliberately don't mention the values.  The definition is effectively
"bytes that have no meaning as human-readable text".

>    * The term "raw 8-bit bytes" is misleading. It suggests binary data
>      (bytes with values 0-255) but it's actually meant to only cover
>      128-255.

It indeed could potentially mislead.  But not necessarily: it is
customary to use "eight-bit" to mean "with the 8th bit set".

Once again, you don't have to convince me that this area is confusing
and notoriously hard to document.  The challenge is to come up with
something that is better than what we have and yet doesn't trigger
confusion which we already had in the past.

>    * The term "raw 8-bit bytes" is not used consistently. Sometimes "8"
>      is spelled out as "eight", sometimes "raw" comes after "8-bit",
>      and sometimes it refers to all byte values 0-255 (see the first
>      sentence under `@cindex unibyte text`).

I see no problem here, none at all.  This is a manual, not a
mathematical treatise.

>    * It's not clear whether "raw 8-bit bytes" is meant to refer to
>      bytes with values 128-255, or to the *characters* that map to
>      those byte values.

We specifically say they are NOT characters.  From the above-cited
description:

     Encoded text is not really text, as far as Emacs is concerned, but
  rather a sequence of raw 8-bit bytes.

>    * The following phrasing is weird: "The function assumes that
>      @var{string} includes ASCII characters and raw 8-bit bytes". The
>      purpose of "raw 8-bit bytes" is to cover non-ASCII byte values, so
>      by definition that assumption is always true.

No, it isn't true "by definition".  We are trying to make it very
clear that we distinguish between "characters" and "raw bytes".
"Characters" are units of human-readable text, and each character has
a set of attributes that Emacs uses when processing text.  Characters
have letter-case, general category, directionality, numerical value,
etc.  By contrast, "raw bytes" don't have any such attributes: it is
meaningless to ask whether a given raw byte is upper- or lower-case,
or if its directionality is right-to-left, etc.
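
One observable consequence of this distinction can be sketched in Emacs
Lisp.  This is only an illustration, assuming a reasonably recent Emacs;
the helper name `my-unibyte-conversion-ok-p' is hypothetical and not
part of Emacs.  It checks whether `string-to-unibyte' accepts a string:
raw bytes convert back, while a genuine non-ASCII character does not.

```elisp
;; Hypothetical helper, for illustration only.
(defun my-unibyte-conversion-ok-p (string)
  "Return non-nil if `string-to-unibyte' accepts STRING without an error."
  (condition-case nil
      (progn (string-to-unibyte string) t)
    (error nil)))

;; ASCII character `a' followed by the raw byte #xE9, made multibyte:
;; the conversion back to unibyte succeeds.
(my-unibyte-conversion-ok-p
 (string-to-multibyte (unibyte-string ?a #xE9)))

;; ASCII character `a' followed by the *character* whose code is #xE9:
;; the conversion signals an error, so the helper returns nil.
(my-unibyte-conversion-ok-p (string ?a #xE9))
```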

I hope you now better understand what the sentence above attempts to
say; it doesn't say things that are trivially true.

>                                                       By saying "the
>      function assumes", the reader is left wondering about the cases
>      where that assumption is not true,

Those other cases are multibyte strings, of course.  We could add that
in parentheses, e.g.:

  The function assumes that @var{string} includes ASCII characters and
  raw 8-bit bytes (as opposed to multibyte text).

>      Maybe something like this:
> 
>          By definition, unibyte strings contain only @acronym{ASCII}
>          characters (bytes with values 0-127) and raw 8-bit bytes
>          (bytes with values 128-255); the latter are converted to their
>          corresponding multibyte representations in the
>          @code{eight-bit} character set (@pxref{Text Representations,
>          codepoints}).

As I tried to explain above, using the numerical codes of the bytes is
a step backward: we've been there and done that, and found that people
get confused by that, because the byte codes overlap the Unicode
codepoints of Latin characters.  Explaining the difference rigorously
is IME impossible without delving into the internal representation of
each one of them, since that is how Emacs _really_ distinguishes
between them.  But having all that in the ELisp Reference manual is
completely unjustified (let alone not future-proof, since the internal
representation can change).
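
The overlap can still be demonstrated without spelling out the internal
representation.  The following is a minimal sketch, assuming a
reasonably recent Emacs; the variable names are hypothetical.  It asks
Emacs which charset a converted raw byte belongs to instead of relying
on its internal codepoint.

```elisp
;; A unibyte string holding the single raw byte #xE9 (hypothetical name).
(defvar my-raw-byte-string (unibyte-string #xE9))

;; Its multibyte representation: one element of the `eight-bit' charset.
(defvar my-raw-byte-char
  (aref (string-to-multibyte my-raw-byte-string) 0))

;; The converted byte belongs to the `eight-bit' charset ...
(eq (char-charset my-raw-byte-char) 'eight-bit)

;; ... and is not the Latin character whose code is also #xE9 ...
(= my-raw-byte-char #xE9)

;; ... yet the original byte value can be recovered from it.
(= (multibyte-char-to-unibyte my-raw-byte-char) #xE9)
```

The second comparison returns nil; the other two return t.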

Another problem with the above text is that it implies ASCII
characters are bytes: we don't want to call them that, to maintain the
fundamental difference between characters and bytes.

Yet another problem there is that you can have a multibyte string that
is pure-ASCII, so "by definition" is also problematic.
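
The pure-ASCII case is easy to check.  A small sketch, with hypothetical
variable names, assuming a reasonably recent Emacs:

```elisp
;; A unibyte string containing only ASCII characters.
(defvar my-ascii-unibyte (unibyte-string ?a ?b ?c))

;; The same text as a multibyte string.
(defvar my-ascii-multibyte (string-to-multibyte my-ascii-unibyte))

(multibyte-string-p my-ascii-unibyte)          ; nil
(multibyte-string-p my-ascii-multibyte)        ; t
(string= my-ascii-unibyte my-ascii-multibyte)  ; t: same characters
```

So "contains only ASCII" does not by itself tell you whether a string
is unibyte.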

Bottom line: I think the manual describes this reasonably well, and,
given the past experience, any change will have to be tangibly better
before we make it.

Thanks.


* bug#55777: [PATCH] Improve documentation of `string-to-multibyte', `string-to-unibyte'
  2022-06-06 11:29             ` Eli Zaretskii
@ 2022-08-17 23:21               ` Stefan Kangas
  0 siblings, 0 replies; 9+ messages in thread
From: Stefan Kangas @ 2022-08-17 23:21 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 55777-done, Richard Hansen

Eli Zaretskii <eliz@gnu.org> writes:

> Bottom line: I think the manual describes this reasonably well, and,
> given the past experience, any change will have to be tangibly better
> before we make it.

No further comments within 10 weeks, but it sounds like the conclusion
here is that we don't want to make the suggested changes.  I'm therefore
closing this bug report.






Thread overview: 9+ messages
2022-06-03  6:20 bug#55777: [PATCH] Improve documentation of `string-to-multibyte', `string-to-unibyte' Richard Hansen
2022-06-03  7:02 ` Eli Zaretskii
2022-06-04  3:28   ` Richard Hansen
2022-06-04  7:09     ` Eli Zaretskii
2022-06-05  0:16       ` Richard Hansen
2022-06-05  5:37         ` Eli Zaretskii
2022-06-06  2:00           ` Richard Hansen
2022-06-06 11:29             ` Eli Zaretskii
2022-08-17 23:21               ` Stefan Kangas
