From: Eli Zaretskii
Newsgroups: gmane.emacs.bugs
Subject: bug#55777: [PATCH] Improve documentation of `string-to-multibyte', `string-to-unibyte'
Date: Mon, 06 Jun 2022 14:29:19 +0300
Message-ID: <83fski81yo.fsf@gnu.org>
References: <83sfomcjr7.fsf@gnu.org> <83ilpgc3bd.fsf@gnu.org> <1c6f61d2-80df-38ab-a895-f73ad4be63a7@rhansen.org> <83zgiracxf.fsf@gnu.org> <16ed6ce6-725f-a183-8864-7e9185b14ff4@rhansen.org>
In-Reply-To: <16ed6ce6-725f-a183-8864-7e9185b14ff4@rhansen.org> (message from Richard Hansen on Sun, 5 Jun 2022 22:00:35 -0400)
To: Richard Hansen
Cc: 55777@debbugs.gnu.org

> Date: Sun, 5 Jun 2022 22:00:35 -0400
> Cc: 55777@debbugs.gnu.org
> From: Richard Hansen
>
> On 6/5/22 01:37, Eli Zaretskii wrote:
> > Could you please state what is confusing in the current wording?
>
>   * "Raw 8-bit bytes" isn't really defined. It's mentioned earlier in
>     the chapter -- the term is even in a @dfn{} -- but there's no
>     definition there.

It is defined as best we could without confusing the readers:

    Occasionally, Emacs needs to hold and manipulate encoded text or
    binary non-text data in its buffers or strings. For example, when
    Emacs visits a file, it first reads the file’s text verbatim into
    a buffer, and only then converts it to the internal
    representation. Before the conversion, the buffer holds encoded
    text.

    Encoded text is not really text, as far as Emacs is concerned, but
    rather a sequence of raw 8-bit bytes. We call buffers and strings
    that hold encoded text “unibyte” buffers and strings, because
    Emacs treats them as a sequence of individual bytes. [...]

(The @dfn part is markup used whenever new terminology is first used,
it doesn't imply "definition".)

You are welcome to propose a better explanation, but one thing is a
non-starter: mentioning the numerical codes of those bytes, certainly
as part of their "definition".
This is because their numerical codes overlap Latin characters, and
people were very confused about that when we mentioned them in the
documentation in the past. So now we deliberately don't mention the
values. The definition is effectively "bytes that have no meaning as
human-readable text".

>   * The term "raw 8-bit bytes" is misleading. It suggests binary data
>     (bytes with values 0-255) but it's actually meant to only cover
>     128-255.

It indeed could potentially mislead. But not necessarily: it is
customary to use "eight-bit" to mean "with the 8th bit set".

Once again, you don't have to convince me that this area is confusing
and notoriously hard to document. The challenge is to come up with
something that is better than what we have and yet doesn't trigger
confusion which we already had in the past.

>   * The term "raw 8-bit bytes" is not used consistently. Sometimes "8"
>     is spelled out as "eight", sometimes "raw" comes after "8-bit",
>     and sometimes it refers to all byte values 0-255 (see the first
>     sentence under `@cindex unibyte text`).

I see no problem here, none at all. This is a manual, not a
mathematical treatise.

>   * It's not clear whether "raw 8-bit bytes" is meant to refer to
>     bytes with values 128-255, or to the *characters* that map to
>     those byte values.

We specifically say they are NOT characters. From the above-cited
description:

    Encoded text is not really text, as far as Emacs is concerned, but
    rather a sequence of raw 8-bit bytes.

>   * The following phrasing is weird: "The function assumes that
>     @var{string} includes ASCII characters and raw 8-bit bytes". The
>     purpose of "raw 8-bit bytes" is to cover non-ASCII byte values, so
>     by definition that assumption is always true.

No, it isn't true "by definition". We are trying to make it very
clear that we distinguish between "characters" and "raw bytes".
"Characters" are units of human-readable text, and each character has
a set of attributes that Emacs uses when processing text.
Characters have letter-case, general category, directionality,
numerical value, etc. By contrast, "raw bytes" don't have any such
attributes: it is meaningless to ask whether a given raw byte is
upper- or lower-case, or if its directionality is right-to-left, etc.

I hope you now better understand what the sentence above attempts to
say; it doesn't say things that are trivially true.

>     By saying "the
>     function assumes", the reader is left wondering about the cases
>     where that assumption is not true,

Those other cases are multibyte strings, of course. We could add that
in parentheses, e.g.:

    The function assumes that @var{string} includes ASCII characters
    and raw 8-bit bytes (as opposed to multibyte text).

> Maybe something like this:
>
>     By definition, unibyte strings contain only @acronym{ASCII}
>     characters (bytes with values 0-127) and raw 8-bit bytes
>     (bytes with values 128-255); the latter are converted to their
>     corresponding multibyte representations in the
>     @code{eight-bit} character set (@pxref{Text Representations,
>     codepoints}).

As I tried to explain above, using the numerical codes of the bytes is
a step backward: we've been there and done that, and found that people
get confused by that, because the byte codes overlap the Unicode
codepoints of Latin characters. Explaining the difference rigorously
is IME impossible without delving into the internal representation of
each one of them, since that is how Emacs _really_ distinguishes
between them. But having all that in the ELisp Reference manual is
completely unjustified (let alone not future-proof, since the internal
representation can change).

Another problem with the above text is that it implies ASCII
characters are bytes: we don't want to call them that, to maintain the
fundamental difference between characters and bytes.

Yet another problem there is that you can have a multibyte string that
is pure-ASCII, so "by definition" is also problematic.
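
For illustration only, here is a small sketch of the distinctions
discussed above. It is not proposed manual text. The function name
`my-representation-demo' is made up for this example, and the sketch
assumes the string literals are read by the Lisp reader exactly as
written (an ASCII-only literal is read as unibyte, and a literal
containing a \u escape is read as multibyte).

```elisp
;; Hypothetical helper, for illustration only.
(defun my-representation-demo ()
  "Return a list of four observations about string representations."
  (let* ((ascii-uni "abc")          ; ASCII-only literal, read as unibyte
         (ascii-multi (string-to-multibyte ascii-uni)) ; same text, multibyte
         (raw-multi (string-to-multibyte "\300"))      ; one raw byte, multibyte
         (latin "\u00e9"))          ; one Latin character, multibyte
    (list
     ;; A pure-ASCII string can be unibyte...
     (multibyte-string-p ascii-uni)
     ;; ...or multibyte, so "pure ASCII" does not imply "unibyte".
     (multibyte-string-p ascii-multi)
     ;; A raw byte survives the trip back to a unibyte string.
     (multibyte-string-p (string-to-unibyte raw-multi))
     ;; A character is not a raw byte: converting it signals an error.
     (condition-case nil
         (progn (string-to-unibyte latin) 'converted)
       (error 'not-convertible)))))
```

Evaluating (my-representation-demo) should return
(nil t nil not-convertible): the second element shows a multibyte
string that is pure-ASCII, and the last one shows that a Latin
character is treated differently from a raw byte.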

Bottom line: I think the manual describes this reasonably well, and,
given the past experience, any change will have to be tangibly better
before we make it.

Thanks.