From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.io!.POSTED.blaine.gmane.org!not-for-mail
From: Eli Zaretskii <eliz@gnu.org>
Newsgroups: gmane.emacs.bugs
Subject: bug#55777: [PATCH] Improve documentation of `string-to-multibyte',
 `string-to-unibyte'
Date: Mon, 06 Jun 2022 14:29:19 +0300
Message-ID: <83fski81yo.fsf@gnu.org>
References: <e55042a8-fc0d-af3d-faed-95a82f730d07@rhansen.org>
 <83sfomcjr7.fsf@gnu.org> <d41c6629-b9c6-a813-5023-7da42f7a95ca@rhansen.org>
 <83ilpgc3bd.fsf@gnu.org> <1c6f61d2-80df-38ab-a895-f73ad4be63a7@rhansen.org>
 <83zgiracxf.fsf@gnu.org> <16ed6ce6-725f-a183-8864-7e9185b14ff4@rhansen.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Injection-Info: ciao.gmane.io; posting-host="blaine.gmane.org:116.202.254.214";
	logging-data="1928"; mail-complaints-to="usenet@ciao.gmane.io"
Cc: 55777@debbugs.gnu.org
To: Richard Hansen <rhansen@rhansen.org>
Original-X-From: bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane-mx.org@gnu.org Mon Jun 06 13:31:09 2022
Return-path: <bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane-mx.org@gnu.org>
Envelope-to: geb-bug-gnu-emacs@m.gmane-mx.org
Original-Received: from lists.gnu.org ([209.51.188.17])
	by ciao.gmane.io with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256)
	(Exim 4.92)
	(envelope-from <bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane-mx.org@gnu.org>)
	id 1nyAwt-0000Jg-B9
	for geb-bug-gnu-emacs@m.gmane-mx.org; Mon, 06 Jun 2022 13:31:07 +0200
Original-Received: from localhost ([::1]:48026 helo=lists1p.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.90_1)
	(envelope-from <bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane-mx.org@gnu.org>)
	id 1nyAws-0000IH-CD
	for geb-bug-gnu-emacs@m.gmane-mx.org; Mon, 06 Jun 2022 07:31:06 -0400
Original-Received: from eggs.gnu.org ([2001:470:142:3::10]:37786)
 by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256)
 (Exim 4.90_1) (envelope-from <Debian-debbugs@debbugs.gnu.org>)
 id 1nyAvr-0000Hr-8P
 for bug-gnu-emacs@gnu.org; Mon, 06 Jun 2022 07:30:04 -0400
Original-Received: from debbugs.gnu.org ([209.51.188.43]:40627)
 by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128)
 (Exim 4.90_1) (envelope-from <Debian-debbugs@debbugs.gnu.org>)
 id 1nyAvq-0000TJ-SJ
 for bug-gnu-emacs@gnu.org; Mon, 06 Jun 2022 07:30:02 -0400
Original-Received: from Debian-debbugs by debbugs.gnu.org with local (Exim 4.84_2)
 (envelope-from <Debian-debbugs@debbugs.gnu.org>) id 1nyAvq-0002PZ-JR
 for bug-gnu-emacs@gnu.org; Mon, 06 Jun 2022 07:30:02 -0400
X-Loop: help-debbugs@gnu.org
Resent-From: Eli Zaretskii <eliz@gnu.org>
Original-Sender: "Debbugs-submit" <debbugs-submit-bounces@debbugs.gnu.org>
Resent-CC: bug-gnu-emacs@gnu.org
Resent-Date: Mon, 06 Jun 2022 11:30:02 +0000
Resent-Message-ID: <handler.55777.B55777.16545149829204@debbugs.gnu.org>
Resent-Sender: help-debbugs@gnu.org
X-GNU-PR-Message: followup 55777
X-GNU-PR-Package: emacs
X-GNU-PR-Keywords: patch
Original-Received: via spool by 55777-submit@debbugs.gnu.org id=B55777.16545149829204
 (code B ref 55777); Mon, 06 Jun 2022 11:30:02 +0000
Original-Received: (at 55777) by debbugs.gnu.org; 6 Jun 2022 11:29:42 +0000
Original-Received: from localhost ([127.0.0.1]:34524 helo=debbugs.gnu.org)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <debbugs-submit-bounces@debbugs.gnu.org>)
 id 1nyAvV-0002ON-RM
 for submit@debbugs.gnu.org; Mon, 06 Jun 2022 07:29:42 -0400
Original-Received: from eggs.gnu.org ([209.51.188.92]:52722)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <eliz@gnu.org>) id 1nyAvS-0002Ny-Tn
 for 55777@debbugs.gnu.org; Mon, 06 Jun 2022 07:29:39 -0400
Original-Received: from fencepost.gnu.org ([2001:470:142:3::e]:34202)
 by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256)
 (Exim 4.90_1) (envelope-from <eliz@gnu.org>)
 id 1nyAvN-0000Mn-Ij; Mon, 06 Jun 2022 07:29:33 -0400
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=gnu.org;
 s=fencepost-gnu-org; h=MIME-version:References:Subject:In-Reply-To:To:From:
 Date; bh=+5ClX2qqUErDidK0lXiD47oymOGFxVEWyfHK3BeSQMg=; b=jacweLZzXQjKQmvikkVK
 t0n7kvGZYaHj6GEGg8UYHJcW4GjV7ZptuNqGQ3MuehmWNo9O92xfPnmlBsb46iwAbYY2o5ZG/CQy+
 Ui0HgzPxEhT16LTKHN8j9ZGz9GpKq4eY1uLMJxmsJOJpODAghVgKAu1NaudKzlJj9A4gzqIwE1RfP
 bogzhwf1TtfS58or1ft81IjELDh5niTsTmDAIVaFhs+xnraDheucttxJzmmPtXwgjJVvcuwvTuAA5
 AZUm63IPL4xFULNjgX97ykUL9KYZ7keSPylKgz+q4K5hfWhvlMPnVo19yMCejMAhyuXZVhuT4637J
 6djiaAro2RRwKw==;
Original-Received: from [87.69.77.57] (port=3745 helo=home-c4e4a596f7)
 by fencepost.gnu.org with esmtpsa (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256)
 (Exim 4.90_1) (envelope-from <eliz@gnu.org>)
 id 1nyAvL-0000yW-Ds; Mon, 06 Jun 2022 07:29:33 -0400
In-Reply-To: <16ed6ce6-725f-a183-8864-7e9185b14ff4@rhansen.org> (message from
 Richard Hansen on Sun, 5 Jun 2022 22:00:35 -0400)
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
X-BeenThere: bug-gnu-emacs@gnu.org
List-Id: "Bug reports for GNU Emacs,
 the Swiss army knife of text editors" <bug-gnu-emacs.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/bug-gnu-emacs>,
 <mailto:bug-gnu-emacs-request@gnu.org?subject=unsubscribe>
List-Archive: <https://lists.gnu.org/archive/html/bug-gnu-emacs>
List-Post: <mailto:bug-gnu-emacs@gnu.org>
List-Help: <mailto:bug-gnu-emacs-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/bug-gnu-emacs>,
 <mailto:bug-gnu-emacs-request@gnu.org?subject=subscribe>
Errors-To: bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane-mx.org@gnu.org
Original-Sender: "bug-gnu-emacs"
 <bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane-mx.org@gnu.org>
Xref: news.gmane.io gmane.emacs.bugs:233770
Archived-At: <http://permalink.gmane.org/gmane.emacs.bugs/233770>

> Date: Sun, 5 Jun 2022 22:00:35 -0400
> Cc: 55777@debbugs.gnu.org
> From: Richard Hansen <rhansen@rhansen.org>
> 
> On 6/5/22 01:37, Eli Zaretskii wrote:
> > Could you please state what is confusing in the current wording?
> 
>    * "Raw 8-bit bytes" isn't really defined. It's mentioned earlier in
>      the chapter -- the term is even in a @dfn{} -- but there's no
>      definition there.

It is defined as best we could without confusing the readers:

     Occasionally, Emacs needs to hold and manipulate encoded text or
  binary non-text data in its buffers or strings.  For example, when Emacs
  visits a file, it first reads the file’s text verbatim into a buffer,
  and only then converts it to the internal representation.  Before the
  conversion, the buffer holds encoded text.

     Encoded text is not really text, as far as Emacs is concerned, but
  rather a sequence of raw 8-bit bytes.  We call buffers and strings that
  hold encoded text “unibyte” buffers and strings, because Emacs treats
  them as a sequence of individual bytes. [...]

(The @dfn part is markup used whenever new terminology is first used,
it doesn't imply "definition".)

You are welcome to propose a better explanation, but one thing is a
non-starter: mentioning the numerical codes of those bytes, especially
as part of their "definition".  This is because their numerical codes
overlap those of Latin characters, and people were very confused about
that when we mentioned them in the documentation in the past.  So now
we deliberately don't mention the values.  The definition is
effectively "bytes that have no meaning as human-readable text".

>    * The term "raw 8-bit bytes" is misleading. It suggests binary data
>      (bytes with values 0-255) but it's actually meant to only cover
>      128-255.

It could indeed mislead, but not necessarily: it is customary to use
"eight-bit" to mean "with the 8th bit set".

Once again, you don't have to convince me that this area is confusing
and notoriously hard to document.  The challenge is to come up with
something that is better than what we have and yet doesn't reintroduce
the confusion we already had in the past.

>    * The term "raw 8-bit bytes" is not used consistently. Sometimes "8"
>      is spelled out as "eight", sometimes "raw" comes after "8-bit",
>      and sometimes it refers to all byte values 0-255 (see the first
>      sentence under `@cindex unibyte text`).

I see no problem here, none at all.  This is a manual, not a
mathematical treatise.

>    * It's not clear whether "raw 8-bit bytes" is meant to refer to
>      bytes with values 128-255, or to the *characters* that map to
>      those byte values.

We specifically say they are NOT characters.  From the above-cited
description:

     Encoded text is not really text, as far as Emacs is concerned, but
  rather a sequence of raw 8-bit bytes.

>    * The following phrasing is weird: "The function assumes that
>      @var{string} includes ASCII characters and raw 8-bit bytes". The
>      purpose of "raw 8-bit bytes" is to cover non-ASCII byte values, so
>      by definition that assumption is always true.

No, it isn't true "by definition".  We are trying to make it very
clear that we distinguish between "characters" and "raw bytes".
"Characters" are units of human-readable text, and each character has
a set of attributes that Emacs uses when processing text.  Characters
have letter-case, general category, directionality, numerical value,
etc.  By contrast, "raw bytes" don't have any such attributes: it is
meaningless to ask whether a given raw byte is upper- or lower-case,
or if its directionality is right-to-left, etc.

I hope you now better understand what the sentence above attempts to
say; it doesn't say things that are trivially true.

>                                                       By saying "the
>      function assumes", the reader is left wondering about the cases
>      where that assumption is not true,

Those other cases are multibyte strings, of course.  We could add that
in parentheses, e.g.:

  The function assumes that @var{string} includes ASCII characters and
  raw 8-bit bytes (as opposed to multibyte text).

>      Maybe something like this:
> 
>          By definition, unibyte strings contain only @acronym{ASCII}
>          characters (bytes with values 0-127) and raw 8-bit bytes
>          (bytes with values 128-255); the latter are converted to their
>          corresponding multibyte representations in the
>          @code{eight-bit} character set (@pxref{Text Representations,
>          codepoints}).

As I tried to explain above, using the numerical codes of the bytes is
a step backward: we've been there and done that, and found that people
get confused by that, because the byte codes overlap the Unicode
codepoints of Latin characters.  Explaining the difference rigorously
is IME impossible without delving into the internal representation of
each one of them, since that is how Emacs _really_ distinguishes
between them.  But having all that in the ELisp Reference manual is
completely unjustified (and not future-proof either, since the
internal representation can change).
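
To make the overlap concrete without spelling out any internal codes,
here is a minimal sketch.  The function name is made up for this
illustration, and none of this is proposed manual text.  It compares a
raw byte with the Latin character that happens to share its code,
before and after conversion:

```elisp
;; `demo-byte-vs-latin' is a hypothetical name used only here.
(defun demo-byte-vs-latin ()
  "Compare a raw byte with the Latin character sharing its code."
  (let ((raw "\351")       ; unibyte string holding one raw 8-bit byte
        (latin ?\u00e9))   ; LATIN SMALL LETTER E WITH ACUTE
    (list
     ;; In the unibyte string the two are numerically indistinguishable.
     (= (aref raw 0) latin)
     ;; After conversion the raw byte is no longer equal to the
     ;; Latin character.
     (= (aref (string-to-multibyte raw) 0) latin)
     ;; The converted raw byte belongs to the `eight-bit' charset.
     (char-charset (aref (string-to-multibyte raw) 0)))))

(demo-byte-vs-latin)   ; => (t nil eight-bit)
```

The first result is exactly the kind of equality that confused people
when the values were mentioned in the documentation.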

Another problem with the above text is that it implies ASCII
characters are bytes: we don't want to call them that, to maintain the
fundamental difference between characters and bytes.

Yet another problem there is that you can have a multibyte string that
is pure-ASCII, so "by definition" is also problematic.
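
A small sketch of that last point (again, the function name is
hypothetical and chosen only for this illustration): converting a
pure-ASCII unibyte string yields a multibyte string with the same
characters, so "contains only ASCII" does not by itself tell you which
representation a string uses.

```elisp
;; `demo-ascii-multibyte' is a hypothetical name used only here.
(defun demo-ascii-multibyte ()
  "Show that a pure-ASCII string can be unibyte or multibyte."
  (let* ((uni "abc")                        ; ASCII-only literal: unibyte
         (multi (string-to-multibyte uni))) ; same text, multibyte
    (list (multibyte-string-p uni)
          (multibyte-string-p multi)
          ;; The characters themselves compare equal.
          (string= uni multi))))

(demo-ascii-multibyte)   ; => (nil t t)
```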

Bottom line: I think the manual describes this reasonably well, and,
given the past experience, any change will have to be tangibly better
before we make it.

Thanks.