From: Philipp Stephani
Newsgroups: gmane.emacs.help
Subject: Re: `write-region' writes different bytes than passed to it?
Date: Sun, 10 Feb 2019 20:06:57 +0100
References: <83d0q8136v.fsf@gnu.org> <837eg09sn3.fsf@gnu.org>
In-Reply-To: <837eg09sn3.fsf@gnu.org>
To: Eli Zaretskii
Cc: help-gnu-emacs
List-Id: Users list for the GNU Emacs text editor

On Sun, 23 Dec 2018 at 16:21, Eli Zaretskii wrote:
>
> > From: Philipp Stephani
> > Date: Sat, 22 Dec 2018 23:58:07 +0100
> > Cc: help-gnu-emacs
> >
> > > Yes, because "\xC1\xB2" just happens to be the internal multibyte
> > > representation of a raw-byte F2. Raw bytes are always converted to
> > > their single-byte values on output, regardless of the encoding you
> > > request.
> > >
> >
> > Is that documented somewhere?
>
> Which part(s)?

All of it? ;) Basically, "what is the behavior of write-region".

>
> > Or, in other words, what are the semantics of
> >
> > (let ((coding-system-for-write 'utf-8-unix)) (write-region STRING ...))
> >
> > ?
> >
> > There are two easy cases:
> > 1. STRING is a unibyte string containing only bytes within the ASCII range
> > 2. STRING is a multibyte string containing only Unicode scalar values
> > In those cases the answer is simple: The form writes the UTF-8
> > representation of STRING.
> > However, the interesting cases are as follows:
> > 3. STRING is a unibyte string with at least one byte outside the ASCII range
> > 4. STRING is a multibyte string with at least one element that is not
> > a Unicode scalar value
>
> You are actually asking what code conversion does in these cases, so
> let's limit the discussion to that part. write-region is not really
> relevant here.
>
> One technicality before I answer the question: there are no "Unicode
> scalar values" in Emacs strings and buffers.
> The internal
> representation is a multibyte one, so any non-ASCII character, be it a
> valid Unicode character or a raw byte, is always stored as a multibyte
> sequence. So let's please use a less confusing wording, like
> "strictly valid UTF-8 sequence" or something to that effect.

I don't think we should change the terminology. Emacs multibyte strings
are sequences of integers (in most cases, scalar values), not UTF-8
strings. They are internally represented as byte arrays, but that's a
different story.

>
> > My example is an instance of (3). I admit I haven't read the entire
> > Emacs Lisp reference manual, but quite some parts of it, and I
> > couldn't find a description of the cases (3) and (4). Naively there
> > are a couple options:
> > - Signal an error. That would seem appropriate as such strings can't
> > be encoded as UTF-8. However, evidently Emacs doesn't do this.
> > - For case 3, write the bytes in STRING, ignoring the coding system. I
> > had expected this to happen, but apparently it isn't the case either.
>
> IMO, doing encoding on unibyte strings invokes undefined behavior,
> since encoding is only defined for multibyte strings.

That is very unfortunate. Is there any hope we can get out of that
situation?

> Admittedly, we
> don't say that explicitly (we could if that's deemed important), but
> the entire description in "Coding System Basics" makes no sense
> without this assumption, and even hints on that indirectly:
>
>   The coding system ‘raw-text’ is special in that it prevents character
>   code conversion, and causes the buffer visited with this coding system
>   to be a unibyte buffer. For historical reasons, you can save both
>   unibyte and multibyte text with this coding system.
>
> The last sentence implicitly tells you that coding systems other than
> raw-text (with the exception of no-conversion, described in the very
> next paragraph) can only be meaningfully used when writing multibyte
> text.

That's true, but very subtle. You first have to read the description of
a certain encoding to figure out how other encodings behave.

>
> Since this is undefined behavior, Emacs can do anything that best
> suits the relevant use cases. What it actually does is convert raw
> bytes from their internal two-byte representation to a single byte.
> Emacs jumps through many hoops to avoid exposing the internal
> multibyte representation of raw bytes outside of buffers and strings,
> and this is one of those hoops. That's because exposing that internal
> representation is considered to be corruption of the original byte
> stream, and is not generally useful.

But in this question there is never any internal representation, just a
byte array that happens to match the internal representation of
something else.

>
> Signaling an error in this situation is also not useful, because it
> turns out many Lisp programs did this kind of thing in the past (Gnus
> is a notable example), and undoubtedly quite a few still do.

Well, if the behavior is unspecified, then signaling an error would
absolutely be a legal (and even expected) behavior.

>
> Emacs handles this case like it does because many years of bitter
> experience have taught us that this suits best the use cases we want
> to support. In particular, signaling errors when encountering invalid
> UTF-8 sequences is a bad idea in a text-editing application, where
> users expect an arbitrary byte stream to pass unscathed from input to
> output. This is why Emacs is decades ahead of other similar systems,
> such as Guile, which still throw exceptions in such cases (and claim
> that they are "correct").

I'm not saying that Emacs should necessarily start signaling errors when
visiting files with invalid UTF-8 sequences. That it degrades gracefully
in this case is very welcome and user-friendly. But visiting a file
can't result in a call to write-region with a unibyte string, right?
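
To make the two-byte representation of a raw byte concrete, here is a
minimal sketch. It uses only standard functions (string-to-multibyte,
multibyte-string-p, string-bytes, encode-coding-string); the helper name
my-raw-byte-demo is hypothetical, and the example covers only the
multibyte case, not the unibyte "\xC1\xB2" case this thread is about.

```elisp
(defun my-raw-byte-demo ()
  "Return facts about byte #xF2 as a unibyte and a multibyte string.
Hypothetical helper, for illustration only."
  (let* ((unibyte "\xF2")                          ; unibyte string, one raw byte
         (multibyte (string-to-multibyte unibyte)) ; same byte as a raw-byte character
         (encoded (encode-coding-string multibyte 'utf-8-unix)))
    (list (multibyte-string-p unibyte)   ; is the literal multibyte?
          (multibyte-string-p multibyte) ; is the converted string multibyte?
          (string-bytes multibyte)       ; size of the internal representation
          (string-to-list encoded))))    ; bytes produced by encoding

(my-raw-byte-demo) ; => (nil t 2 (242)), i.e. the single byte #xF2 comes back
```

The raw-byte character occupies two bytes internally, yet encoding it as
UTF-8 emits the original single byte, which is the conversion on output
described in the quoted text.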

>
> > > I'm not sure that single use case is important enough to change
> > > something that was working like that since Emacs 23. Who knows how
> > > many more important use cases this will break?
> >
> > It's important for correctness and for actually describing what "encoding" does.
>
> So does labeling this as undefined behavior, which is what it is. We
> don't really need to describe undefined behavior in detail, because
> Lisp programs shouldn't do that.

It should be removed rather than described in detail. Unspecified
behavior makes a programming system hard to use and reason about.

>
> > Do we expect users to explicitly put the byte sequences for the
> > (undocumented) internal encoding into unibyte strings? Shouldn't we
> > rather expect that users want to write unibyte strings as is, in all
> > cases?
>
> To avoid the undefined behavior, a Lisp program should never try to
> encode a unibyte string with anything other than no-conversion or
> raw-text (the latter also allows the application to convert EOL
> format, if that is desired). IOW, you should have used either
> raw-text-unix or no-conversion in your example, not utf-8.

If Lisp code shouldn't try that, then the encoding functions should
signal an error in such cases.

>
> > > Oh, indeed, especially since it sounds to me like the problem is in the
> > > original code (if you don't want to change bytes, then use a `binary`
> > > encoding rather than utf-8).
> >
> > That wouldn't work with multibyte strings, right? Because they need to
> > be encoded.
>
> You can detect when a string is a unibyte string with
> multibyte-string-p, if your application needs to handle both unibyte
> and multibyte strings. For unibyte strings, use only raw-text or
> no-conversion.

I get that, but this is too subtle and nontrivial.

>
> > > Exactly: I think we should try and get rid of those heuristics
> > > (progressively).
> > > Actually, it's already what we've been doing since
> > > Emacs-20, tho "lately" the progression in this respect has slowed
> > > down I think.
> >
> > I'd definitely welcome any simplification in this area. There seems to
> > be a lot of incidental complexity and undocumented corner cases here.
>
> AFAIK, all of those heuristics are in the undefined behavior
> department. Lisp programs are well advised to stay away from that.
> If Lisp programs do stay away, they will never need to deal with the
> complexity and the undocumented corner cases.

You can't tell programmers to stay away from something. Either it should
work as documented or signal an error. Silently doing the wrong thing is
the worst choice.

>
> We keep the current behavior for backward compatibility, and for this
> reason I would suggest to avoid changes in this area unless we have a
> very good reason for a change. What was the reason you needed to
> write something like the original snippet?
>

I'm writing a function to write an arbitrary string to a file. This
should be trivial, but as you can see, it isn't.
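
For reference, a minimal sketch of such a function, following the advice
quoted above to test with multibyte-string-p and to use no-conversion for
unibyte strings. The name my-write-string-to-file is hypothetical, and
choosing utf-8-unix for multibyte strings is an assumption taken from the
original snippet, not something the manual prescribes.

```elisp
(defun my-write-string-to-file (string file)
  "Write STRING to FILE, replacing any existing contents.
Unibyte strings are written byte for byte; multibyte strings are
encoded as UTF-8 with Unix line endings.  Hypothetical helper."
  (let ((coding-system-for-write
         (if (multibyte-string-p string)
             'utf-8-unix       ; assumption: UTF-8 is the wanted encoding
           'no-conversion)))   ; unibyte: write the bytes unchanged
    ;; With a string as START, `write-region' writes that string and
    ;; ignores END.  A VISIT value other than t, nil or a string
    ;; suppresses the "Wrote file" message.
    (write-region string nil file nil 'silent)))
```

This only sidesteps case (3). A multibyte string that contains raw-byte
characters (case 4) is still written the way the quoted text describes,
with each raw byte emitted as a single byte.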