From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.io!.POSTED.blaine.gmane.org!not-for-mail
From: Eli Zaretskii <eliz@gnu.org>
Newsgroups: gmane.emacs.bugs
Subject: bug#55777: [PATCH] Improve documentation of `string-to-multibyte',
 `string-to-unibyte'
Date: Sat, 04 Jun 2022 10:09:42 +0300
Message-ID: <83ilpgc3bd.fsf@gnu.org>
References: <e55042a8-fc0d-af3d-faed-95a82f730d07@rhansen.org>
 <83sfomcjr7.fsf@gnu.org> <d41c6629-b9c6-a813-5023-7da42f7a95ca@rhansen.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Injection-Info: ciao.gmane.io; posting-host="blaine.gmane.org:116.202.254.214";
	logging-data="25473"; mail-complaints-to="usenet@ciao.gmane.io"
Cc: 55777@debbugs.gnu.org
To: Richard Hansen <rhansen@rhansen.org>
Original-X-From: bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane-mx.org@gnu.org Sat Jun 04 09:10:39 2022
Return-path: <bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane-mx.org@gnu.org>
Envelope-to: geb-bug-gnu-emacs@m.gmane-mx.org
Original-Received: from lists.gnu.org ([209.51.188.17])
	by ciao.gmane.io with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256)
	(Exim 4.92)
	(envelope-from <bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane-mx.org@gnu.org>)
	id 1nxNvi-0006WC-JO
	for geb-bug-gnu-emacs@m.gmane-mx.org; Sat, 04 Jun 2022 09:10:38 +0200
Original-Received: from localhost ([::1]:57634 helo=lists1p.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.90_1)
	(envelope-from <bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane-mx.org@gnu.org>)
	id 1nxNvh-0001cJ-4R
	for geb-bug-gnu-emacs@m.gmane-mx.org; Sat, 04 Jun 2022 03:10:37 -0400
Original-Received: from eggs.gnu.org ([2001:470:142:3::10]:56798)
 by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256)
 (Exim 4.90_1) (envelope-from <Debian-debbugs@debbugs.gnu.org>)
 id 1nxNv8-0001bv-Lh
 for bug-gnu-emacs@gnu.org; Sat, 04 Jun 2022 03:10:02 -0400
Original-Received: from debbugs.gnu.org ([209.51.188.43]:35400)
 by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128)
 (Exim 4.90_1) (envelope-from <Debian-debbugs@debbugs.gnu.org>)
 id 1nxNv8-0000AS-C8
 for bug-gnu-emacs@gnu.org; Sat, 04 Jun 2022 03:10:02 -0400
Original-Received: from Debian-debbugs by debbugs.gnu.org with local (Exim 4.84_2)
 (envelope-from <Debian-debbugs@debbugs.gnu.org>) id 1nxNv8-0004ZY-5H
 for bug-gnu-emacs@gnu.org; Sat, 04 Jun 2022 03:10:02 -0400
X-Loop: help-debbugs@gnu.org
Resent-From: Eli Zaretskii <eliz@gnu.org>
Original-Sender: "Debbugs-submit" <debbugs-submit-bounces@debbugs.gnu.org>
Resent-CC: bug-gnu-emacs@gnu.org
Resent-Date: Sat, 04 Jun 2022 07:10:02 +0000
Resent-Message-ID: <handler.55777.B55777.165432659617556@debbugs.gnu.org>
Resent-Sender: help-debbugs@gnu.org
X-GNU-PR-Message: followup 55777
X-GNU-PR-Package: emacs
X-GNU-PR-Keywords: patch
Original-Received: via spool by 55777-submit@debbugs.gnu.org id=B55777.165432659617556
 (code B ref 55777); Sat, 04 Jun 2022 07:10:02 +0000
Original-Received: (at 55777) by debbugs.gnu.org; 4 Jun 2022 07:09:56 +0000
Original-Received: from localhost ([127.0.0.1]:57530 helo=debbugs.gnu.org)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <debbugs-submit-bounces@debbugs.gnu.org>)
 id 1nxNv2-0004Z5-AC
 for submit@debbugs.gnu.org; Sat, 04 Jun 2022 03:09:56 -0400
Original-Received: from eggs.gnu.org ([209.51.188.92]:43582)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <eliz@gnu.org>) id 1nxNuy-0004Yl-21
 for 55777@debbugs.gnu.org; Sat, 04 Jun 2022 03:09:54 -0400
Original-Received: from fencepost.gnu.org ([2001:470:142:3::e]:46198)
 by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256)
 (Exim 4.90_1) (envelope-from <eliz@gnu.org>)
 id 1nxNus-00009G-D7; Sat, 04 Jun 2022 03:09:46 -0400
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=gnu.org;
 s=fencepost-gnu-org; h=MIME-version:References:Subject:In-Reply-To:To:From:
 Date; bh=ITaF/gk5lVbG/9bmEi4cIT0c8biuM+zXJN3jZmNfnNc=; b=Cpx5UwM9b5JfHsytfUw9
 Iq5Xy8g7x885KPh/6B0jZsQZIZ8BxC0LMjoE3QdkNlOMxv9x9EXUaDDDGn8iFzpk3Lj0kAdNW0F5P
 UJ8PeBjfNefyb515e9hEAWAEaWVsAa6jWdYoJKobVxbgD4TMwmz1Ez+noFJsnHUDIpEiyesJoG/0D
 MVo3QDpxxTA7lKHoB38n3MTluIoKMmhNJc+YK04dYgFI975q3j4beP9LCTLycaL70+kHeszqqYHdS
 klSMlpuT1QzXJlDEs8qq1+8HLzB/aWImtMIKNqRtrPLOOyk1A/OVSq6UYHFIKF/tfpjyJDxDZWIAg
 a+QMv7D1tYHVBg==;
Original-Received: from [87.69.77.57] (port=3781 helo=home-c4e4a596f7)
 by fencepost.gnu.org with esmtpsa (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256)
 (Exim 4.90_1) (envelope-from <eliz@gnu.org>)
 id 1nxNur-0001oR-Ok; Sat, 04 Jun 2022 03:09:46 -0400
In-Reply-To: <d41c6629-b9c6-a813-5023-7da42f7a95ca@rhansen.org> (message from
 Richard Hansen on Fri, 3 Jun 2022 23:28:51 -0400)
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
X-BeenThere: bug-gnu-emacs@gnu.org
List-Id: "Bug reports for GNU Emacs,
 the Swiss army knife of text editors" <bug-gnu-emacs.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/bug-gnu-emacs>,
 <mailto:bug-gnu-emacs-request@gnu.org?subject=unsubscribe>
List-Archive: <https://lists.gnu.org/archive/html/bug-gnu-emacs>
List-Post: <mailto:bug-gnu-emacs@gnu.org>
List-Help: <mailto:bug-gnu-emacs-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/bug-gnu-emacs>,
 <mailto:bug-gnu-emacs-request@gnu.org?subject=subscribe>
Errors-To: bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane-mx.org@gnu.org
Original-Sender: "bug-gnu-emacs"
 <bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane-mx.org@gnu.org>
Xref: news.gmane.io gmane.emacs.bugs:233639
Archived-At: <http://permalink.gmane.org/gmane.emacs.bugs/233639>

> Date: Fri, 3 Jun 2022 23:28:51 -0400
> Cc: 55777@debbugs.gnu.org
> From: Richard Hansen <rhansen@rhansen.org>
> 
> > If there was some situation where you needed these details for some 
> > Lisp program, please describe that situation.
> I'm trying to understand some inconsistent behavior I'm observing
> while writing code to process binary data, and I found the existing
> documentation lacking.

You are digging into low-level details of how Emacs keeps strings in
memory, and the higher-level context of _why_ you need to understand
these details is left untold.

In general, Lisp programs are well advised to stay away from
manipulating unibyte strings, and definitely to refrain from comparing
unibyte and multibyte strings -- because these should never be needed
in Lisp applications, and because doing TRT with those
requires non-trivial knowledge of the Emacs internals.

I see no reason to complicate the documentation for the very rare
occasions where these issues unfortunately leak to
higher-than-expected levels.

>      ;; Unibyte vs. multibyte characters:
>      (eq ?\xff ?\x3fffff)                           ; t (ok)
>      (eq (aref "\x3fffff" 0) (aref "\xff" 0))       ; t (ok)
>      (eq (aref "\x3fffff 😀" 0) (aref "\xff 😀" 0)) ; t (ok)
>      (eq (aref "\xff" 0) (aref "\xff 😀" 0))        ; nil (expected t)
> 
>      ;; Unibyte vs. multibyte strings:
>      (multibyte-string-p "\xff")                    ; nil (ok)
>      (multibyte-string-p "\x3fffff")                ; nil (ok???)
>      (string= "\xff" (string-to-multibyte "\xff"))  ; nil (expected t)
> 
>      ;; Char code vs. Unicode codepoint:
>      (string= "😀\xff" "😀\x3fffff")                ; t (ok)
>      (string= "😀\N{U+ff}" "😀\xff")                ; nil (ok)
>      (string= "😀\N{U+ff}" "😀\x3fffff")            ; nil (ok)
>      (string= "😀ÿ" "😀\N{U+ff}")                   ; t (ok)
>      (string= "😀ÿ" "😀\xff")                       ; nil (ok)
>      (string= "😀ÿ" "😀\x3fffff")                   ; nil (ok)
>      (eq ?\N{U+ff} ?\xff)                           ; t (expected nil)
>      (eq ?\N{U+ff} ?\x3fffff)                       ; t (expected nil)
>      (eq ?ÿ ?\xff)                                  ; t (expected nil)
>      (eq ?ÿ ?\x3fffff)                              ; t (expected nil)

If you still don't understand some of these, please feel free to ask
questions, and we will gladly answer them.  But I see no reason to
change the documentation on that account.

> @@ -271,20 +271,19 @@ Converting Representations
>  @defun string-to-multibyte string
>  This function returns a multibyte string containing the same sequence
>  of characters as @var{string}.  If @var{string} is a multibyte string,
> -it is returned unchanged.  The function assumes that @var{string}
> -includes only @acronym{ASCII} characters and raw 8-bit bytes; the
> -latter are converted to their multibyte representation corresponding
> -to the codepoints @code{#x3FFF80} through @code{#x3FFFFF}, inclusive
> -(@pxref{Text Representations, codepoints}).
> +it is returned unchanged.  Otherwise, byte values are transformed to
> +their corresponding multibyte codepoints (@acronym{ASCII} characters
> +and characters in the @code{eight-bit} charset).  @xref{Text
> +Representations, codepoints}.

This loses information, so I don't think we should make this change.
It might be trivially clear to you that a unibyte string can only
contain ASCII and raw bytes, but it isn't necessarily clear to
everyone.
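
To illustrate what a unibyte string holds and what the conversion does
with raw bytes, here is a minimal sketch.  The helper name
`demo-raw-byte-codes' is made up for this example and is not part of
Emacs; the sketch assumes only `unibyte-string', `string-to-multibyte',
`multibyte-string-p' and `aref'.

```elisp
;; Hypothetical helper, for illustration only.
(defun demo-raw-byte-codes ()
  "Return a list describing a raw byte before and after conversion."
  (let* ((uni (unibyte-string ?a #xff))      ; ASCII `a' plus raw byte #xFF
         (multi (string-to-multibyte uni)))
    (list (multibyte-string-p uni)           ; nil
          (multibyte-string-p multi)         ; t
          (aref uni 1)                       ; #xFF
          (aref multi 1))))                  ; #x3FFFFF
```

The last two elements show the raw byte #xFF moving to the codepoint
#x3FFFFF, which is the mapping the current text spells out.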

>  @defun string-to-unibyte string
>  This function returns a unibyte string containing the same sequence of
> -characters as @var{string}.  It signals an error if @var{string}
> -contains a non-@acronym{ASCII} character.  If @var{string} is a
> -unibyte string, it is returned unchanged.  Use this function for
> -@var{string} arguments that contain only @acronym{ASCII} and eight-bit
> -characters.
> +characters as @var{string}.  If @var{string} is a unibyte string, it
> +is returned unchanged.  Otherwise, @acronym{ASCII} characters and
> +characters in the @code{eight-bit} charset are converted to their
> +corresponding byte values.  It signals an error if any other character
> +is encountered.  @xref{Text Representations, codepoints}.

This basically rearranges the existing text, and adds just one
sentence:

  Otherwise, @acronym{ASCII} characters and characters in the
  @code{eight-bit} charset are converted to their corresponding byte
  values.

The cross-reference is identical to the one we already have a few
lines above this text, so it is redundant.  I've made a change to add
the above sentence, and slightly rearranged the text to be clearer
and logically complete.

Here's how this text looks now on the emacs-28 branch (and will appear
in Emacs 28.2 and later):

  @defun string-to-unibyte string
  This function returns a unibyte string containing the same sequence of
  characters as @var{string}.  If @var{string} is a unibyte string, it
  is returned unchanged.  Otherwise, @acronym{ASCII} characters and
  characters in the @code{eight-bit} charset are converted to their
  corresponding byte values.  Use this function for @var{string}
  arguments that contain only @acronym{ASCII} and eight-bit characters;
  the function signals an error if any other characters are encountered.
  @end defun
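
A short sketch of the behavior this text describes, in case it helps.
The wrapper `demo-to-unibyte-or-nil' is a hypothetical name used only
here; it turns the signaled error into a nil return so both cases can
be tried side by side.

```elisp
;; Hypothetical helper, for illustration only.
(defun demo-to-unibyte-or-nil (string)
  "Return STRING converted by `string-to-unibyte', or nil if that signals."
  (condition-case nil
      (string-to-unibyte string)
    (error nil)))

;; ASCII plus an eight-bit character: converts back to two bytes,
;; `a' followed by the raw byte #xFF.
(demo-to-unibyte-or-nil (string-to-multibyte (unibyte-string ?a #xff)))

;; U+00FF is neither ASCII nor an eight-bit character, so the
;; conversion signals an error and the wrapper returns nil.
(demo-to-unibyte-or-nil "a\N{U+ff}")
```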

Thanks.