From: Eli Zaretskii
Newsgroups: gmane.emacs.bugs
Subject: bug#55777: [PATCH] Improve documentation of `string-to-multibyte', `string-to-unibyte'
Date: Sun, 05 Jun 2022 08:37:16 +0300
Message-ID: <83zgiracxf.fsf@gnu.org>
References: <83sfomcjr7.fsf@gnu.org> <83ilpgc3bd.fsf@gnu.org> <1c6f61d2-80df-38ab-a895-f73ad4be63a7@rhansen.org>
In-Reply-To: <1c6f61d2-80df-38ab-a895-f73ad4be63a7@rhansen.org> (message from Richard Hansen on Sat, 4 Jun 2022 20:16:47 -0400)
To: Richard Hansen
Cc: 55777@debbugs.gnu.org

> Date: Sat, 4 Jun 2022 20:16:47 -0400
> Cc: 55777@debbugs.gnu.org
> From: Richard Hansen
>
> > You are digging into low-level details of how Emacs keeps strings in
> > memory, and the higher-level context of _why_ you need to understand
> > these details is left untold.
>
> Readers either think the documentation is confusing or they don't; why
> they need to understand the documentation is mostly irrelevant. I
> find the documentation to be confusing, and I suspect I am not the
> only one.

I said "understand the details", not "understand the documentation".
The latter is a no-brainer: documentation should be understandable,
and I don't think what we have now isn't. See below regarding the
parts you say confused you.

> > In general, Lisp programs are well advised to stay away of
> > manipulating unibyte strings, and definitely to refrain from comparing
> > unibyte and multibyte strings -- because these are supposed to be
> > never needed in Lisp applications, and because doing TRT with those
> > requires non-trivial knowledge of the Emacs internals.
>
> I disagree with "well advised". The documentation in 34.1 and 34.3
> make it sound like the representation is merely an internal elisp
> implementation detail that programmers don't need to worry about,
> unless they are doing something unusually low-level.

That is exactly the intent. The recommendation not to deal with
non-text data directly (as opposed to via, say, packages like
bindat.el) is based on experience, both mine and that of others.

> I consider binary data processing to be somewhat common, not
> "unusually low-level". Yet manipulating byte values 128-255 in unibyte
> strings, and characters with Unicode codepoints 128-255 in multibyte
> strings, is fraught with peril. For example, it is risky to use `aref'
> to read a character or `aset' to write a character unless you either
> know the string representation or know that the character is not in
> #x80-#xff or #x3fff80-#x3fffff.

You are describing some of the known difficulties that arise when
manipulating binary data in Emacs strings and buffers, which are the
reasons for the above recommendation. Emacs can do all this, but not
easily, since it isn't its main design goal. For comparison, some
other text-processing environments simply reject any non-character
data in strings.

> > I see no reason to complicate the documentation for the very rare
> > occasions where these issues unfortunately leak to
> > higher-than-expected levels.
>
> I don't think the occasions are all that rare. But even if they are,
> the precise behavior should be documented somewhere so that
> programmers who need low-level string manipulation can do so
> correctly.

Documenting every aspect of the Emacs behavior for the rare chance
that someone some day will find it useful would make our documentation
too large. The Emacs Lisp Reference manual already prints in 2 very
thick volumes. So our policy is not to document the aspects that are
too obscure to be useful to many.

> I would argue that programmers using `string-to-unibyte'
> or `string-to-multibyte' fall into that category.

I disagree. First, these functions should be used very rarely, and we
generally try to avoid them entirely. And if they do need to be used,
the current documentation is IMO adequate. It still has to be
understandable, of course, but it doesn't need to describe every
possible detail of how Emacs handles raw bytes and conversions between
them and readable text.
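
For readers following the thread, the representation issue discussed
above can be seen in a minimal sketch. It is an illustration only, not
part of any proposed patch; `my-raw-byte-demo' is a hypothetical name
introduced here, and the results noted in the comments assume a
reasonably recent Emacs.

```elisp
;; Illustrative sketch; `my-raw-byte-demo' is a hypothetical helper,
;; not an Emacs function.
(defun my-raw-byte-demo ()
  "Return a list showing how one value, #xe9, looks in each representation."
  (let* ((raw (unibyte-string #xe9))        ; one raw byte, unibyte string
         (raw-mb (string-to-multibyte raw)) ; same byte, multibyte string
         (latin (string #xe9)))             ; the character U+00E9
    (list
     (multibyte-string-p raw)               ; nil
     (aref raw 0)                           ; #xe9, the byte itself
     (multibyte-string-p raw-mb)            ; t
     (aref raw-mb 0)                        ; #x3fffe9, the raw-byte codepoint
     (aref latin 0)                         ; #xe9, a Latin-1 character
     ;; A real character outside ASCII cannot be made unibyte this way.
     (condition-case nil
         (string-to-unibyte latin)
       (error 'cannot-convert))
     ;; A raw byte survives the round trip unchanged.
     (equal (string-to-unibyte raw-mb) raw))))
```

Note that `aref' returns #xe9 both for the raw byte in the unibyte
string and for the Latin character in the multibyte string, so the
number alone does not say which of the two is meant.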

> I still find the current wording to be confusing. To me, all bytes
> have 8 bits so "raw 8-bit bytes" sounds bizarrely redundant. Also,
> ASCII characters are encoded to bytes, yet "raw 8-bit bytes" is meant
> to refer only to non-ASCII values.

What "raw bytes" are is explained in one of the previous sections of
this chapter.

> I have attached another revision that I think is complete, correct,
> and easier to understand.

I think it muddies the water by talking about numerical values 128 to
255, which also match some Latin characters. It also removes the
reference to the codepoints Emacs uses to represent these bytes, which
is important in some situations. So I think your proposal would
change this text for the worse.

Could you please state what is confusing in the current wording? If
it's only the "raw 8-bit bytes" thing, it is explained earlier in the
manual; if needed, we could add a cross-reference there to that
section. If it's something else, please tell. But mentioning the
single-byte numerical values here actually increases the confusion,
IME, due to overlap with valid Unicode codepoints, which is why we
should and do deliberately refrain from doing that.

Thanks.