bug#55777: [PATCH] Improve documentation of `string-to-multibyte', `string-to-unibyte'

From: Eli Zaretskii <eliz@gnu.org>
To: Richard Hansen <rhansen@rhansen.org>
Cc: 55777@debbugs.gnu.org
Subject: bug#55777: [PATCH] Improve documentation of `string-to-multibyte', `string-to-unibyte'
Date: Sun, 05 Jun 2022 08:37:16 +0300
Message-ID: <83zgiracxf.fsf@gnu.org>
In-Reply-To: <1c6f61d2-80df-38ab-a895-f73ad4be63a7@rhansen.org> (message from Richard Hansen on Sat, 4 Jun 2022 20:16:47 -0400)

> Date: Sat, 4 Jun 2022 20:16:47 -0400
> Cc: 55777@debbugs.gnu.org
> From: Richard Hansen <rhansen@rhansen.org>
> 
> > You are digging into low-level details of how Emacs keeps strings in
> > memory, and the higher-level context of _why_ you need to understand
> > these details is left untold.
> 
> Readers either think the documentation is confusing or they don't; why
> they need to understand the documentation is mostly irrelevant. I
> find the documentation to be confusing, and I suspect I am not the
> only one.

I said "understand the details", not "understand the documentation".
The latter is a no-brainer: documentation should be understandable,
and I think what we have now is.  See below regarding the
parts you say confused you.

> > In general, Lisp programs are well advised to stay away from
> > manipulating unibyte strings, and definitely to refrain from comparing
> > unibyte and multibyte strings -- because these are supposed to be
> > never needed in Lisp applications, and because doing TRT with those
> > requires non-trivial knowledge of the Emacs internals.
> 
> I disagree with "well advised". The documentation in 34.1 and 34.3
> makes it sound like the representation is merely an internal elisp
> implementation detail that programmers don't need to worry about,
> unless they are doing something unusually low-level.

That is exactly the intent.

The recommendation not to deal with non-text data directly (as opposed
to via, say, packages like bindat.el) is based on experience, both mine
and that of others.

> I consider binary data processing to be somewhat common, not
> "unusually low-level". Yet manipulating byte values 128-255 in unibyte
> strings, and characters with Unicode codepoints 128-255 in multibyte
> strings, is fraught with peril. For example, it is risky to use `aref'
> to read a character or `aset' to write a character unless you either
> know the string representation or know that the character is not in
> #x80-#xff or #x3fff80-#x3fffff.

You are describing some of the known difficulties that arise when
manipulating binary data in Emacs strings and buffers, which are the
reasons for the above recommendation.  Emacs can do all this, but not
easily, since it isn't its main design goal.  For comparison, some
other text-processing environments simply reject any non-character
data in strings.
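
As a rough illustration of the difficulty described above (an editorial
sketch, not text from the manual or the patch): the same raw byte reads
back as a different integer depending on the string's representation.
The function name `my-raw-byte-demo' is hypothetical; everything it
calls is standard Emacs Lisp.

```elisp
;; Hypothetical helper, for illustration only.
(defun my-raw-byte-demo ()
  "Return what `aref' sees for the raw byte #xff in both representations."
  (let* ((uni "\377")                        ; octal escape: unibyte string
         (multi (string-to-multibyte uni)))  ; same byte, multibyte string
    (list (multibyte-string-p uni)           ; nil
          (aref uni 0)                       ; 255
          (multibyte-string-p multi)         ; t
          (aref multi 0))))                  ; #x3fffff, not 255

(my-raw-byte-demo)
;; => (nil 255 t 4194303)
```

Converting back with `string-to-unibyte' recovers the original byte, so
code that knows which representation it holds can round-trip safely.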

> > I see no reason to complicate the documentation for the very rare
> > occasions where these issues unfortunately leak to
> > higher-than-expected levels.
> 
> I don't think the occasions are all that rare.  But even if they are,
> the precise behavior should be documented somewhere so that
> programmers who need low-level string manipulation can do so
> correctly.

Documenting every aspect of the Emacs behavior for the rare chance
that someone some day will find it useful would make our documentation
too large.  The Emacs Lisp Reference manual already prints in 2 very
thick volumes.  So our policy is not to document the aspects that are
too obscure to be useful to many.

> I would argue that programmers using `string-to-unibyte'
> or `string-to-multibyte' fall into that category.

I disagree.  First, these functions should be used very rarely, and we
generally try to avoid them entirely.  And if they do need to be used,
the current documentation is IMO adequate.  It still has to be
understandable, of course, but it doesn't need to describe every
possible detail of how Emacs handles raw bytes and conversions between
them and readable text.

> I still find the current wording to be confusing. To me, all bytes
> have 8 bits so "raw 8-bit bytes" sounds bizarrely redundant. Also,
> ASCII characters are encoded to bytes, yet "raw 8-bit bytes" is meant
> to refer only to non-ASCII values.

What "raw bytes" are is explained in one of the previous sections of
this chapter.

> I have attached another revision that I think is complete, correct,
> and easier to understand.

I think it muddies the water by talking about numerical values 128 to
255, which also match some Latin characters.  It also removes the
reference to the codepoints Emacs uses to represent these bytes, which
is important in some situations.  So I think your proposal would
change this text for the worse.

Could you please state what is confusing in the current wording?  If
it's only the "raw 8-bit bytes" thing, it is explained earlier in the
manual; if needed, we could add a cross-reference there to that
section.  If it's something else, please say so.  But mentioning the
single-byte numerical values here actually increases the confusion,
IME, due to overlap with valid Unicode codepoints, which is why we
should and do deliberately refrain from doing that.

Thanks.

Thread overview: 9+ messages
2022-06-03  6:20 bug#55777: [PATCH] Improve documentation of `string-to-multibyte', `string-to-unibyte' Richard Hansen
2022-06-03  7:02 ` Eli Zaretskii
2022-06-04  3:28   ` Richard Hansen
2022-06-04  7:09     ` Eli Zaretskii
2022-06-05  0:16       ` Richard Hansen
2022-06-05  5:37         ` Eli Zaretskii [this message]
2022-06-06  2:00           ` Richard Hansen
2022-06-06 11:29             ` Eli Zaretskii
2022-08-17 23:21               ` Stefan Kangas
