* bug#55777: [PATCH] Improve documentation of `string-to-multibyte', `string-to-unibyte'
From: Richard Hansen @ 2022-06-03 6:20 UTC
To: 55777

See attached.

[-- Attachment #1.1.2: 0001-Improve-documentation-of-string-to-multibyte-string-.patch --]
[-- Type: text/x-patch, Size: 2576 bytes --]

From 2e0e944840de65936a979b075aa2ea4177f49854 Mon Sep 17 00:00:00 2001
From: Richard Hansen <rhansen@rhansen.org>
Date: Fri, 3 Jun 2022 01:04:41 -0400
Subject: [PATCH] Improve documentation of `string-to-multibyte',
 `string-to-unibyte'

* doc/lispref/nonascii.texi (Converting Representations): Fix
erroneous description of `string-to-unibyte' (it does not signal an
error on eight-bit characters) and clarify its behavior. Update
documentation of `string-to-multibyte' to match.
---
 doc/lispref/nonascii.texi | 24 ++++++++++++++----------
 1 file changed, 14 insertions(+), 10 deletions(-)

diff --git a/doc/lispref/nonascii.texi b/doc/lispref/nonascii.texi
index d7d25dc36a..8746b79de8 100644
--- a/doc/lispref/nonascii.texi
+++ b/doc/lispref/nonascii.texi
@@ -271,20 +271,24 @@ Converting Representations
 @defun string-to-multibyte string
 This function returns a multibyte string containing the same sequence
 of characters as @var{string}. If @var{string} is a multibyte string,
-it is returned unchanged. The function assumes that @var{string}
-includes only @acronym{ASCII} characters and raw 8-bit bytes; the
-latter are converted to their multibyte representation corresponding
-to the codepoints @code{#x3FFF80} through @code{#x3FFFFF}, inclusive
-(@pxref{Text Representations, codepoints}).
+it is returned unchanged. Otherwise, byte values @code{#x00} through
+@code{#x7F} (@acronym{ASCII} characters) are mapped to their
+corresponding codepoints, and byte values @code{#x80} through
+@code{#xFF} (eight-bit characters) are mapped to codepoints
+@code{#x3FFF80} through @code{#x3FFFFF} (@pxref{Text Representations,
+codepoints}).
 @end defun
 
 @defun string-to-unibyte string
 This function returns a unibyte string containing the same sequence of
-characters as @var{string}. It signals an error if @var{string}
-contains a non-@acronym{ASCII} character. If @var{string} is a
-unibyte string, it is returned unchanged. Use this function for
-@var{string} arguments that contain only @acronym{ASCII} and eight-bit
-characters.
+characters as @var{string}. If @var{string} is a unibyte string, it
+is returned unchanged. Otherwise, codepoints @code{#x00} through
+@code{#x7F} (@acronym{ASCII} characters) are mapped to their
+corresponding byte values, and codepoints @code{#x3FFF80} through
+@code{#x3FFFFF} (eight-bit characters) are mapped to byte values
+@code{#x80} through @code{#xFF} (@pxref{Text Representations,
+codepoints}). It signals an error if any other codepoint is
+encountered.
 @end defun
 
 @defun byte-to-string byte
-- 
2.36.1
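
The mapping described in the patch above can be tried out with a short sketch. The helper name `demo-round-trip' is hypothetical and is not part of the patch or of Emacs; it only uses `string-to-multibyte', `string-to-unibyte', `multibyte-string-p' and `unibyte-string', and it assumes the argument is a unibyte string.

```elisp
;; Illustrative sketch only; `demo-round-trip' is a hypothetical helper.
(defun demo-round-trip (raw)
  "Convert unibyte string RAW to multibyte and back.
Return a list (MULTIBYTE-P CODES SAME-P), where CODES are the
character codes of the multibyte copy."
  (let ((multi (string-to-multibyte raw)))
    (list (multibyte-string-p multi)
          (append multi nil)                      ; character codes of MULTI
          (equal (string-to-unibyte multi) raw))))

;; "A" followed by the byte #xFF, built explicitly as a unibyte string.
(demo-round-trip (unibyte-string #x41 #xFF))
;; => (t (65 4194303) t)      ; 4194303 is #x3FFFFF
```

The input is built with `unibyte-string' rather than a string literal so that the representation of the test data does not depend on how the Lisp reader treats escapes.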
* bug#55777: [PATCH] Improve documentation of `string-to-multibyte', `string-to-unibyte'
From: Eli Zaretskii @ 2022-06-03 7:02 UTC
To: Richard Hansen; +Cc: 55777

> Date: Fri, 3 Jun 2022 02:20:35 -0400
> From: Richard Hansen <rhansen@rhansen.org>
>
> See attached.

Thanks, but please explain the motivation for these changes. In
particular, why would we need to describe in a doc string such
intimate details of our current implementation?

If there was some situation where you needed these details for some
Lisp program, please describe that situation.
* bug#55777: [PATCH] Improve documentation of `string-to-multibyte', `string-to-unibyte'
From: Richard Hansen @ 2022-06-04 3:28 UTC
To: Eli Zaretskii; +Cc: 55777

On 6/3/22 03:02, Eli Zaretskii wrote:
> Thanks, but please explain the motivation for these changes.

The motivation is in the commit message, which I revised in the
attached patch to hopefully make it more clear.

> In particular, why would we need to describe in a doc string such
> intimate details of our current implementation?

There is a fair amount of implementation detail right now; the patch
doesn't significantly change that. But I revised the patch to remove
some of the detail.

> If there was some situation where you needed these details for some
> Lisp program, please describe that situation.

I'm trying to understand some inconsistent behavior I'm observing
while writing code to process binary data, and I found the existing
documentation lacking.

    ;; Unibyte vs. multibyte characters:
    (eq ?\xff ?\x3fffff)                            ; t (ok)
    (eq (aref "\x3fffff" 0) (aref "\xff" 0))        ; t (ok)
    (eq (aref "\x3fffff 😀" 0) (aref "\xff 😀" 0))  ; t (ok)
    (eq (aref "\xff" 0) (aref "\xff 😀" 0))         ; nil (expected t)

    ;; Unibyte vs. multibyte strings:
    (multibyte-string-p "\xff")                     ; nil (ok)
    (multibyte-string-p "\x3fffff")                 ; nil (ok???)
    (string= "\xff" (string-to-multibyte "\xff"))   ; nil (expected t)

    ;; Char code vs. Unicode codepoint:
    (string= "😀\xff" "😀\x3fffff")                 ; t (ok)
    (string= "😀\N{U+ff}" "😀\xff")                 ; nil (ok)
    (string= "😀\N{U+ff}" "😀\x3fffff")             ; nil (ok)
    (string= "😀ÿ" "😀\N{U+ff}")                    ; t (ok)
    (string= "😀ÿ" "😀\xff")                        ; nil (ok)
    (string= "😀ÿ" "😀\x3fffff")                    ; nil (ok)
    (eq ?\N{U+ff} ?\xff)                            ; t (expected nil)
    (eq ?\N{U+ff} ?\x3fffff)                        ; t (expected nil)
    (eq ?ÿ ?\xff)                                   ; t (expected nil)
    (eq ?ÿ ?\x3fffff)                               ; t (expected nil)

[-- Attachment #1.1.2: 0001-Improve-documentation-of-string-to-multibyte-string-.patch --]
[-- Type: text/x-patch, Size: 2678 bytes --]

From 6813b0a43250c9633d84d72418904025e973f1c8 Mon Sep 17 00:00:00 2001
From: Richard Hansen <rhansen@rhansen.org>
Date: Fri, 3 Jun 2022 01:04:41 -0400
Subject: [PATCH] Improve documentation of `string-to-multibyte',
 `string-to-unibyte'

* doc/lispref/nonascii.texi (Converting Representations): Remove
confusing sentence from `string-to-multibyte' documentation (by
definition unibyte strings can only contain ASCII and eight-bit
characters, so there's no need to assume that unibyte strings only
contain those characters). Fix description of the characters that
will cause `string-to-unibyte' to signal an error (`eight-bit'
characters are OK). Remove some implementation details that are
discussed in the xrefed section. Word the documentation for the two
functions similarly so that it is clear they are inverses of each
other.
---
 doc/lispref/nonascii.texi | 19 +++++++++----------
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/doc/lispref/nonascii.texi b/doc/lispref/nonascii.texi
index d7d25dc36a..e8f02d8f2f 100644
--- a/doc/lispref/nonascii.texi
+++ b/doc/lispref/nonascii.texi
@@ -271,20 +271,19 @@ Converting Representations
 @defun string-to-multibyte string
 This function returns a multibyte string containing the same sequence
 of characters as @var{string}. If @var{string} is a multibyte string,
-it is returned unchanged. The function assumes that @var{string}
-includes only @acronym{ASCII} characters and raw 8-bit bytes; the
-latter are converted to their multibyte representation corresponding
-to the codepoints @code{#x3FFF80} through @code{#x3FFFFF}, inclusive
-(@pxref{Text Representations, codepoints}).
+it is returned unchanged. Otherwise, byte values are transformed to
+their corresponding multibyte codepoints (@acronym{ASCII} characters
+and characters in the @code{eight-bit} charset). @xref{Text
+Representations, codepoints}.
 @end defun
 
 @defun string-to-unibyte string
 This function returns a unibyte string containing the same sequence of
-characters as @var{string}. It signals an error if @var{string}
-contains a non-@acronym{ASCII} character. If @var{string} is a
-unibyte string, it is returned unchanged. Use this function for
-@var{string} arguments that contain only @acronym{ASCII} and eight-bit
-characters.
+characters as @var{string}. If @var{string} is a unibyte string, it
+is returned unchanged. Otherwise, @acronym{ASCII} characters and
+characters in the @code{eight-bit} charset are converted to their
+corresponding byte values. It signals an error if any other character
+is encountered. @xref{Text Representations, codepoints}.
 @end defun
 
 @defun byte-to-string byte
-- 
2.36.1
* bug#55777: [PATCH] Improve documentation of `string-to-multibyte', `string-to-unibyte'
From: Eli Zaretskii @ 2022-06-04 7:09 UTC
To: Richard Hansen; +Cc: 55777

> Date: Fri, 3 Jun 2022 23:28:51 -0400
> Cc: 55777@debbugs.gnu.org
> From: Richard Hansen <rhansen@rhansen.org>
>
> > If there was some situation where you needed these details for some
> > Lisp program, please describe that situation.
>
> I'm trying to understand some inconsistent behavior I'm observing
> while writing code to process binary data, and I found the existing
> documentation lacking.

You are digging into low-level details of how Emacs keeps strings in
memory, and the higher-level context of _why_ you need to understand
these details is left untold.

In general, Lisp programs are well advised to stay away from
manipulating unibyte strings, and definitely to refrain from comparing
unibyte and multibyte strings -- because these are supposed to be
never needed in Lisp applications, and because doing TRT with those
requires non-trivial knowledge of the Emacs internals.

I see no reason to complicate the documentation for the very rare
occasions where these issues unfortunately leak to
higher-than-expected levels.

>     ;; Unibyte vs. multibyte characters:
>     (eq ?\xff ?\x3fffff)                            ; t (ok)
>     (eq (aref "\x3fffff" 0) (aref "\xff" 0))        ; t (ok)
>     (eq (aref "\x3fffff 😀" 0) (aref "\xff 😀" 0))  ; t (ok)
>     (eq (aref "\xff" 0) (aref "\xff 😀" 0))         ; nil (expected t)
>
>     ;; Unibyte vs. multibyte strings:
>     (multibyte-string-p "\xff")                     ; nil (ok)
>     (multibyte-string-p "\x3fffff")                 ; nil (ok???)
>     (string= "\xff" (string-to-multibyte "\xff"))   ; nil (expected t)
>
>     ;; Char code vs. Unicode codepoint:
>     (string= "😀\xff" "😀\x3fffff")                 ; t (ok)
>     (string= "😀\N{U+ff}" "😀\xff")                 ; nil (ok)
>     (string= "😀\N{U+ff}" "😀\x3fffff")             ; nil (ok)
>     (string= "😀ÿ" "😀\N{U+ff}")                    ; t (ok)
>     (string= "😀ÿ" "😀\xff")                        ; nil (ok)
>     (string= "😀ÿ" "😀\x3fffff")                    ; nil (ok)
>     (eq ?\N{U+ff} ?\xff)                            ; t (expected nil)
>     (eq ?\N{U+ff} ?\x3fffff)                        ; t (expected nil)
>     (eq ?ÿ ?\xff)                                   ; t (expected nil)
>     (eq ?ÿ ?\x3fffff)                               ; t (expected nil)

If you still don't understand some of these, please feel free to ask
questions, and we will gladly answer them. But I see no reason to
change the documentation on that behalf.

> @@ -271,20 +271,19 @@ Converting Representations
>  @defun string-to-multibyte string
>  This function returns a multibyte string containing the same sequence
>  of characters as @var{string}. If @var{string} is a multibyte string,
> -it is returned unchanged. The function assumes that @var{string}
> -includes only @acronym{ASCII} characters and raw 8-bit bytes; the
> -latter are converted to their multibyte representation corresponding
> -to the codepoints @code{#x3FFF80} through @code{#x3FFFFF}, inclusive
> -(@pxref{Text Representations, codepoints}).
> +it is returned unchanged. Otherwise, byte values are transformed to
> +their corresponding multibyte codepoints (@acronym{ASCII} characters
> +and characters in the @code{eight-bit} charset). @xref{Text
> +Representations, codepoints}.

This loses information, so I don't think we should make this change.
It might be trivially clear to you that unibyte string can only
contain ASCII and raw bytes, but it isn't necessarily clear to
everyone.

>  @defun string-to-unibyte string
>  This function returns a unibyte string containing the same sequence of
> -characters as @var{string}. It signals an error if @var{string}
> -contains a non-@acronym{ASCII} character. If @var{string} is a
> -unibyte string, it is returned unchanged. Use this function for
> -@var{string} arguments that contain only @acronym{ASCII} and eight-bit
> -characters.
> +characters as @var{string}. If @var{string} is a unibyte string, it
> +is returned unchanged. Otherwise, @acronym{ASCII} characters and
> +characters in the @code{eight-bit} charset are converted to their
> +corresponding byte values. It signals an error if any other character
> +is encountered. @xref{Text Representations, codepoints}.

This basically rearranges the existing text, and adds just one
sentence:

  Otherwise, @acronym{ASCII} characters and characters in the
  @code{eight-bit} charset are converted to their corresponding byte
  values.

The cross-reference is identical to the one we already have a few
lines above this text, so it is redundant.

I've made a change to add the above sentence, and slightly rearranged
the text to be more clear and logically complete. Here's how this
text looks now on the emacs-28 branch (and will appear in Emacs 28.2
and later):

  @defun string-to-unibyte string
  This function returns a unibyte string containing the same sequence of
  characters as @var{string}. If @var{string} is a unibyte string, it
  is returned unchanged. Otherwise, @acronym{ASCII} characters and
  characters in the @code{eight-bit} charset are converted to their
  corresponding byte values. Use this function for @var{string}
  arguments that contain only @acronym{ASCII} and eight-bit characters;
  the function signals an error if any other characters are encountered.
  @end defun

Thanks.
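
The error case in the `string-to-unibyte' text quoted above can be illustrated with a small sketch. The wrapper `demo-to-unibyte-or-nil' is a hypothetical name, not an Emacs function; it simply turns the signalled error into a nil return value so the three cases can be compared side by side.

```elisp
;; Illustrative sketch only; `demo-to-unibyte-or-nil' is a hypothetical helper.
(defun demo-to-unibyte-or-nil (string)
  "Return STRING converted by `string-to-unibyte', or nil on error."
  (condition-case nil
      (string-to-unibyte string)
    (error nil)))

;; Multibyte string holding only ASCII: converts.
(demo-to-unibyte-or-nil (string-to-multibyte "abc"))

;; Multibyte string holding the eight-bit character for byte #xFF: converts.
(demo-to-unibyte-or-nil (string-to-multibyte (unibyte-string #xFF)))

;; Multibyte string holding U+00FF (a Latin character, not a raw byte):
;; `string-to-unibyte' signals an error, so the wrapper returns nil.
(demo-to-unibyte-or-nil "\N{U+FF}")
```

Code that must accept arbitrary text would normally encode it with a coding system instead of relying on `string-to-unibyte'; the wrapper here only serves to make the error case visible.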
* bug#55777: [PATCH] Improve documentation of `string-to-multibyte', `string-to-unibyte'
From: Richard Hansen @ 2022-06-05 0:16 UTC
To: Eli Zaretskii; +Cc: 55777

On 6/4/22 03:09, Eli Zaretskii wrote:
>>> If there was some situation where you needed these details for some
>>> Lisp program, please describe that situation.
>>
>> I'm trying to understand some inconsistent behavior I'm observing
>> while writing code to process binary data, and I found the existing
>> documentation lacking.
>
> You are digging into low-level details of how Emacs keeps strings in
> memory, and the higher-level context of _why_ you need to understand
> these details is left untold.

Readers either think the documentation is confusing or they don't; why
they need to understand the documentation is mostly irrelevant. I
find the documentation to be confusing, and I suspect I am not the
only one.

> In general, Lisp programs are well advised to stay away from
> manipulating unibyte strings, and definitely to refrain from comparing
> unibyte and multibyte strings -- because these are supposed to be
> never needed in Lisp applications, and because doing TRT with those
> requires non-trivial knowledge of the Emacs internals.

I disagree with "well advised". The documentation in 34.1 and 34.3
make it sound like the representation is merely an internal elisp
implementation detail that programmers don't need to worry about,
unless they are doing something unusually low-level. I consider
binary data processing to be somewhat common, not "unusually
low-level". Yet manipulating byte values 128-255 in unibyte strings,
and characters with Unicode codepoints 128-255 in multibyte strings,
is fraught with peril. For example, it is risky to use `aref' to read
a character or `aset' to write a character unless you either know the
string representation or know that the character is not in #x80-#xff
or #x3fff80-#x3fffff.

> I see no reason to complicate the documentation for the very rare
> occasions where these issues unfortunately leak to
> higher-than-expected levels.

I don't think the occasions are all that rare. But even if they are,
the precise behavior should be documented somewhere so that
programmers who need low-level string manipulation can do so
correctly. I would argue that programmers using `string-to-unibyte'
or `string-to-multibyte' fall into that category.

>> @@ -271,20 +271,19 @@ Converting Representations
>>  @defun string-to-multibyte string
>>  This function returns a multibyte string containing the same sequence
>>  of characters as @var{string}. If @var{string} is a multibyte string,
>> -it is returned unchanged. The function assumes that @var{string}
>> -includes only @acronym{ASCII} characters and raw 8-bit bytes; the
>> -latter are converted to their multibyte representation corresponding
>> -to the codepoints @code{#x3FFF80} through @code{#x3FFFFF}, inclusive
>> -(@pxref{Text Representations, codepoints}).
>> +it is returned unchanged. Otherwise, byte values are transformed to
>> +their corresponding multibyte codepoints (@acronym{ASCII} characters
>> +and characters in the @code{eight-bit} charset). @xref{Text
>> +Representations, codepoints}.
>
> This loses information, so I don't think we should make this change.
> It might be trivially clear to you that unibyte string can only
> contain ASCII and raw bytes, but it isn't necessarily clear to
> everyone.

I still find the current wording to be confusing. To me, all bytes
have 8 bits so "raw 8-bit bytes" sounds bizarrely redundant. Also,
ASCII characters are encoded to bytes, yet "raw 8-bit bytes" is meant
to refer only to non-ASCII values.

I have attached another revision that I think is complete, correct,
and easier to understand.

Thanks,
Richard

[-- Attachment #1.1.2: 0001-Clarify-documentation-of-string-to-multibyte.patch --]
[-- Type: text/x-patch, Size: 1359 bytes --]

From 0d7481239056e2d701591728f240094c7a939d3a Mon Sep 17 00:00:00 2001
From: Richard Hansen <rhansen@rhansen.org>
Date: Fri, 3 Jun 2022 01:04:41 -0400
Subject: [PATCH] Clarify documentation of `string-to-multibyte'

* doc/lispref/nonascii.texi (Converting Representations): Clarify what
`string-to-multibyte' does. (Bug#55777)
---
 doc/lispref/nonascii.texi | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/doc/lispref/nonascii.texi b/doc/lispref/nonascii.texi
index 6dc23637a7..7a3c3c63e7 100644
--- a/doc/lispref/nonascii.texi
+++ b/doc/lispref/nonascii.texi
@@ -271,11 +271,9 @@ Converting Representations
 @defun string-to-multibyte string
 This function returns a multibyte string containing the same sequence
 of characters as @var{string}. If @var{string} is a multibyte string,
-it is returned unchanged. The function assumes that @var{string}
-includes only @acronym{ASCII} characters and raw 8-bit bytes; the
-latter are converted to their multibyte representation corresponding
-to the codepoints @code{#x3FFF80} through @code{#x3FFFFF}, inclusive
-(@pxref{Text Representations, codepoints}).
+it is returned unchanged. Otherwise, bytes with values 128 to 255 are
+converted to their corresponding multibyte representations in the
+@code{eight-bit} charset.
 @end defun
 
 @defun string-to-unibyte string
-- 
2.36.1
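
The `aref' concern raised in the message above can be made concrete with a sketch of a representation-independent byte reader. `demo-byte-at' is a hypothetical helper, not an Emacs function, and it relies on the codepoint range #x3FFF80 through #x3FFFFF named in the documentation under discussion; treat the arithmetic as an illustration of the current mapping rather than a guaranteed interface.

```elisp
;; Illustrative sketch only; `demo-byte-at' is a hypothetical helper.
(defun demo-byte-at (string index)
  "Return the byte value (0..255) stored at INDEX of STRING.
Return nil if that element is a non-ASCII character, not a byte."
  (let ((ch (aref string index)))
    (cond ((not (multibyte-string-p string)) ch)         ; unibyte: a byte already
          ((< ch #x80) ch)                               ; ASCII in multibyte text
          ((<= #x3FFF80 ch #x3FFFFF) (- ch #x3FFF00))    ; eight-bit character
          (t nil))))                                     ; ordinary character

(demo-byte-at (unibyte-string #xFF) 0)                        ; => 255
(demo-byte-at (string-to-multibyte (unibyte-string #xFF)) 0)  ; => 255
(demo-byte-at "\N{U+FF}" 0)                                   ; => nil
```

With a helper like this, code that reads binary data gets the same answer for either representation and can detect when a string holds Latin text instead of bytes.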
* bug#55777: [PATCH] Improve documentation of `string-to-multibyte', `string-to-unibyte'
From: Eli Zaretskii @ 2022-06-05 5:37 UTC
To: Richard Hansen; +Cc: 55777

> Date: Sat, 4 Jun 2022 20:16:47 -0400
> Cc: 55777@debbugs.gnu.org
> From: Richard Hansen <rhansen@rhansen.org>
>
> > You are digging into low-level details of how Emacs keeps strings in
> > memory, and the higher-level context of _why_ you need to understand
> > these details is left untold.
>
> Readers either think the documentation is confusing or they don't; why
> they need to understand the documentation is mostly irrelevant. I
> find the documentation to be confusing, and I suspect I am not the
> only one.

I said "understand the details", not "understand the documentation".
The latter is a no-brainer: documentation should be understandable,
and I don't think what we have now isn't. See below regarding the
parts you say confused you.

> > In general, Lisp programs are well advised to stay away from
> > manipulating unibyte strings, and definitely to refrain from comparing
> > unibyte and multibyte strings -- because these are supposed to be
> > never needed in Lisp applications, and because doing TRT with those
> > requires non-trivial knowledge of the Emacs internals.
>
> I disagree with "well advised". The documentation in 34.1 and 34.3
> make it sound like the representation is merely an internal elisp
> implementation detail that programmers don't need to worry about,
> unless they are doing something unusually low-level.

That is exactly the intent. The recommendation not to deal with
non-text data directly (as opposed via, say, packages like bindat.el)
is based on experience, both mine and that of others.

> I consider binary data processing to be somewhat common, not
> "unusually low-level". Yet manipulating byte values 128-255 in unibyte
> strings, and characters with Unicode codepoints 128-255 in multibyte
> strings, is fraught with peril. For example, it is risky to use `aref'
> to read a character or `aset' to write a character unless you either
> know the string representation or know that the character is not in
> #x80-#xff or #x3fff80-#x3fffff.

You are describing some of the known difficulties that arise when
manipulating binary data in Emacs strings and buffers, which are the
reasons for the above recommendation. Emacs can do all this, but not
easily, since it isn't its main design goal. For comparison, some
other text-processing environments simply reject any non-character
data in strings.

> > I see no reason to complicate the documentation for the very rare
> > occasions where these issues unfortunately leak to
> > higher-than-expected levels.
>
> I don't think the occasions are all that rare. But even if they are,
> the precise behavior should be documented somewhere so that
> programmers who need low-level string manipulation can do so
> correctly.

Documenting every aspect of the Emacs behavior for the rare chance
that someone some day will find it useful would make our documentation
too large. The Emacs Lisp Reference manual already prints in 2 very
thick volumes. So our policy is not to document the aspects that are
too obscure to be useful to many.

> I would argue that programmers using `string-to-unibyte'
> or `string-to-multibyte' fall into that category.

I disagree. First, these functions should be used very rarely, and we
generally try to avoid them entirely. And if they do need to be used,
the current documentation is IMO adequate. It still has to be
understandable, of course, but it doesn't need to describe every
possible detail of how Emacs handles raw bytes and conversions between
them and readable text.

> I still find the current wording to be confusing. To me, all bytes
> have 8 bits so "raw 8-bit bytes" sounds bizarrely redundant. Also,
> ASCII characters are encoded to bytes, yet "raw 8-bit bytes" is meant
> to refer only to non-ASCII values.

What are "raw bytes" is explained in one of the previous sections of
this chapter.

> I have attached another revision that I think is complete, correct,
> and easier to understand.

I think it muddies the water by talking about numerical values 128 to
255, which also match some Latin characters. It also removes the
reference to the codepoints Emacs uses to represent these bytes, which
is important in some situations. So I think your proposal would
change this text for the worse.

Could you please state what is confusing in the current wording? If
it's only the "raw 8-bit bytes" thing, it is explained earlier in the
manual; if needed, we could add a cross-reference there to that
section. If it's something else, please tell. But mentioning the
single-byte numerical values here actually increases the confusion,
IME, due to overlap with valid Unicode codepoints, which is why we
should and do deliberately refrain from doing that.

Thanks.
* bug#55777: [PATCH] Improve documentation of `string-to-multibyte', `string-to-unibyte'
From: Richard Hansen @ 2022-06-06 2:00 UTC
To: Eli Zaretskii; +Cc: 55777

On 6/5/22 01:37, Eli Zaretskii wrote:
> Could you please state what is confusing in the current wording?

  * "Raw 8-bit bytes" isn't really defined. It's mentioned earlier in
    the chapter -- the term is even in a @dfn{} -- but there's no
    definition there.

  * The term "raw 8-bit bytes" is misleading. It suggests binary data
    (bytes with values 0-255) but it's actually meant to only cover
    128-255.

  * The term "raw 8-bit bytes" is not used consistently. Sometimes "8"
    is spelled out as "eight", sometimes "raw" comes after "8-bit",
    and sometimes it refers to all byte values 0-255 (see the first
    sentence under `@cindex unibyte text`).

  * It's not clear whether "raw 8-bit bytes" is meant to refer to
    bytes with values 128-255, or to the *characters* that map to
    those byte values.

  * The following phrasing is weird: "The function assumes that
    @var{string} includes ASCII characters and raw 8-bit bytes". The
    purpose of "raw 8-bit bytes" is to cover non-ASCII byte values, so
    by definition that assumption is always true. By saying "the
    function assumes", the reader is left wondering about the cases
    where that assumption is not true, which in turn causes the reader
    to question whether "raw 8-bit bytes" fully covers non-ASCII byte
    values, which in turn causes the reader to wonder how to handle
    those non-covered values (whatever they are).

Maybe something like this:

    By definition, unibyte strings contain only @acronym{ASCII}
    characters (bytes with values 0-127) and raw 8-bit bytes
    (bytes with values 128-255); the latter are converted to their
    corresponding multibyte representations in the
    @code{eight-bit} character set (@pxref{Text Representations,
    codepoints}).
* bug#55777: [PATCH] Improve documentation of `string-to-multibyte', `string-to-unibyte'
From: Eli Zaretskii @ 2022-06-06 11:29 UTC
To: Richard Hansen; +Cc: 55777

> Date: Sun, 5 Jun 2022 22:00:35 -0400
> Cc: 55777@debbugs.gnu.org
> From: Richard Hansen <rhansen@rhansen.org>
>
> On 6/5/22 01:37, Eli Zaretskii wrote:
> > Could you please state what is confusing in the current wording?
>
>   * "Raw 8-bit bytes" isn't really defined. It's mentioned earlier in
>     the chapter -- the term is even in a @dfn{} -- but there's no
>     definition there.

It is defined as best we could without confusing the readers:

     Occasionally, Emacs needs to hold and manipulate encoded text or
  binary non-text data in its buffers or strings. For example, when
  Emacs visits a file, it first reads the file’s text verbatim into a
  buffer, and only then converts it to the internal representation.
  Before the conversion, the buffer holds encoded text.

     Encoded text is not really text, as far as Emacs is concerned, but
  rather a sequence of raw 8-bit bytes. We call buffers and strings
  that hold encoded text “unibyte” buffers and strings, because Emacs
  treats them as a sequence of individual bytes. [...]

(The @dfn part is markup used whenever new terminology is first used,
it doesn't imply "definition".)

You are welcome to propose a better explanation, but one thing is a
non-starter: mentioning the numerical codes of those bytes, certainly
as part of their "definition". This is because their numerical codes
overlap Latin characters, and people were very confused about that
when we mentioned them in the documentation in the past. So now we
deliberately don't mention the values. The definition is effectively
"bytes that have no meaning as human-readable text".

>   * The term "raw 8-bit bytes" is misleading. It suggests binary data
>     (bytes with values 0-255) but it's actually meant to only cover
>     128-255.

It indeed could potentially mislead. But not necessarily: it is
customary to use "eight-bit" to mean "with the 8th bit set".

Once again, you don't have to convince me that this area is confusing
and notoriously hard to document. The challenge is to come up with
something that is better than what we have and yet doesn't trigger
confusion which we already had in the past.

>   * The term "raw 8-bit bytes" is not used consistently. Sometimes "8"
>     is spelled out as "eight", sometimes "raw" comes after "8-bit",
>     and sometimes it refers to all byte values 0-255 (see the first
>     sentence under `@cindex unibyte text`).

I see no problem here, none at all. This is a manual, not a
mathematical treatise.

>   * It's not clear whether "raw 8-bit bytes" is meant to refer to
>     bytes with values 128-255, or to the *characters* that map to
>     those byte values.

We specifically say they are NOT characters. From the above-cited
description:

  Encoded text is not really text, as far as Emacs is concerned, but
  rather a sequence of raw 8-bit bytes.

>   * The following phrasing is weird: "The function assumes that
>     @var{string} includes ASCII characters and raw 8-bit bytes". The
>     purpose of "raw 8-bit bytes" is to cover non-ASCII byte values, so
>     by definition that assumption is always true.

No, it isn't true "by definition". We are trying to make it very
clear that we distinguish between "characters" and "raw bytes".
"Characters" are units of human-readable text, and each character has
a set of attributes that Emacs uses when processing text. Characters
have letter-case, general category, directionality, numerical value,
etc. By contrast, "raw bytes" don't have any such attributes: it is
meaningless to ask whether a given raw byte is upper- or lower-case,
or if its directionality is right-to-left, etc.

I hope you now better understand what the sentence above attempts to
say; it doesn't say things that are trivially true.

>     By saying "the
>     function assumes", the reader is left wondering about the cases
>     where that assumption is not true,

Those other cases are multibyte strings, of course. We could add that
in parentheses, e.g.:

  The function assumes that @var{string} includes ASCII characters and
  raw 8-bit bytes (as opposed to multibyte text).

> Maybe something like this:
>
>     By definition, unibyte strings contain only @acronym{ASCII}
>     characters (bytes with values 0-127) and raw 8-bit bytes
>     (bytes with values 128-255); the latter are converted to their
>     corresponding multibyte representations in the
>     @code{eight-bit} character set (@pxref{Text Representations,
>     codepoints}).

As I tried to explain above, using the numerical codes of the bytes is
a step backward: we've been there and done that, and found that people
get confused by that, because the byte codes overlap the Unicode
codepoints of Latin characters. Explaining the difference rigorously
is IME impossible without delving into the internal representation of
each one of them, since that is how Emacs _really_ distinguishes
between them. But having all that in the ELisp Reference manual is
completely unjustified (let alone not future-proof, since the internal
representation can change).

Another problem with the above text is that it implies ASCII
characters are bytes: we don't want to call them that, to maintain the
fundamental difference between characters and bytes. Yet another
problem there is that you can have a multibyte string that is
pure-ASCII, so "by definition" is also problematic.

Bottom line: I think the manual describes this reasonably well, and,
given the past experience, any change will have to be tangibly better
before we make it.

Thanks.
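
The overlap described above, where the byte value 255 and the Latin character U+00FF share a numeric code in some contexts, can be seen in a short sketch. `demo-raw-byte-char-p' is a hypothetical predicate, not an Emacs function; it only checks the codepoint range #x3FFF80 through #x3FFFFF mentioned in the current manual text, which describes the present representation and could change.

```elisp
;; Illustrative sketch only; `demo-raw-byte-char-p' is a hypothetical helper.
(defun demo-raw-byte-char-p (ch)
  "Return non-nil if character code CH lies in #x3FFF80..#x3FFFFF."
  (<= #x3FFF80 ch #x3FFFFF))

(let ((latin (aref "\N{U+FF}" 0))                                  ; the letter ÿ
      (raw (aref (string-to-multibyte (unibyte-string #xFF)) 0)))  ; raw byte #xFF
  (list latin                            ; 255
        raw                              ; 4194303, i.e. #x3FFFFF
        (demo-raw-byte-char-p latin)     ; nil
        (demo-raw-byte-char-p raw)       ; t
        (= latin raw)))                  ; nil
```

Inside a multibyte string the two are distinct characters; the ambiguity only arises when a bare number such as 255 is handled without knowing which representation it came from.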
* bug#55777: [PATCH] Improve documentation of `string-to-multibyte', `string-to-unibyte'
From: Stefan Kangas @ 2022-08-17 23:21 UTC
To: Eli Zaretskii; +Cc: 55777-done, Richard Hansen

Eli Zaretskii <eliz@gnu.org> writes:

> Bottom line: I think the manual describes this reasonably well, and,
> given the past experience, any change will have to be tangibly better
> before we make it.

No further comments within 10 weeks, but it sounds like the conclusion
here is that we don't want to make the suggested changes. I'm
therefore closing this bug report.