From: Eli Zaretskii
Newsgroups: gmane.emacs.help
Subject: Re: `write-region' writes different bytes than passed to it?
Date: Sun, 23 Dec 2018 17:20:32 +0200
Message-ID: <837eg09sn3.fsf@gnu.org>
References: <83d0q8136v.fsf@gnu.org>
To: help-gnu-emacs@gnu.org
In-reply-to: (message from Philipp Stephani on Sat, 22 Dec 2018 23:58:07 +0100)

> From: Philipp Stephani
> Date: Sat, 22 Dec 2018 23:58:07 +0100
> Cc: help-gnu-emacs
>
> > Yes, because "\xC1\xB2" just happens to be the internal multibyte
> > representation of a raw-byte F2. Raw bytes are always converted to
> > their single-byte values on output, regardless of the encoding you
> > request.
> >
>
> Is that documented somewhere?

Which part(s)?

> Or, in other words, what are the semantics of
>
> (let ((coding-system-for-write 'utf-8-unix)) (write-region STRING ...))
>
> ?
>
> There are two easy cases:
> 1. STRING is a unibyte string containing only bytes within the ASCII range
> 2. STRING is a multibyte string containing only Unicode scalar values
> In those cases the answer is simple: The form writes the UTF-8
> representation of STRING.
> However, the interesting cases are as follows:
> 3. STRING is a unibyte string with at least one byte outside the ASCII range
> 4. STRING is a multibyte string with at least one elements that is not
> a Unicode scalar value

You are actually asking what code conversion does in these cases, so
let's limit the discussion to that part. write-region is not really
relevant here.

One technicality before I answer the question: there are no "Unicode
scalar values" in Emacs strings and buffers.
The internal representation is a multibyte one, so any non-ASCII
character, be it a valid Unicode character or a raw byte, is always
stored as a multibyte sequence. So let's please use less confusing
wording, like "strictly valid UTF-8 sequence" or something to that
effect.

> My example is an instance of (3). I admit I haven't read the entire
> Emacs Lisp reference manual, but quite some parts of it, and I
> couldn't find a description of the cases (3) and (4). Naively there
> are a couple options:
> - Signal an error. That would seem appropriate as such strings can't
> be encoded as UTF-8. However, evidently Emacs doesn't do this.
> - For case 3, write the bytes in STRING, ignoring the coding system. I
> had expected this to happen, but apparently it isn't the case either.

IMO, encoding unibyte strings invokes undefined behavior, since
encoding is only defined for multibyte strings. Admittedly, we don't
say that explicitly (we could if that's deemed important), but the
entire description in "Coding System Basics" makes no sense without
this assumption, and even hints at it indirectly:

  The coding system ‘raw-text’ is special in that it prevents
  character code conversion, and causes the buffer visited with this
  coding system to be a unibyte buffer. For historical reasons, you
  can save both unibyte and multibyte text with this coding system.

The last sentence implicitly tells you that coding systems other than
raw-text (with the exception of no-conversion, described in the very
next paragraph) can only be meaningfully used when writing multibyte
text.

Since this is undefined behavior, Emacs can do anything that best
suits the relevant use cases. What it actually does is convert raw
bytes from their internal two-byte representation to a single byte.
Emacs jumps through many hoops to avoid exposing the internal
multibyte representation of raw bytes outside of buffers and strings,
and this is one of those hoops.
That's because exposing that internal representation is considered to
be corruption of the original byte stream, and is not generally
useful. Signaling an error in this situation is also not useful,
because it turns out many Lisp programs did this kind of thing in the
past (Gnus is a notable example), and undoubtedly quite a few still
do.

Emacs handles this case the way it does because many years of bitter
experience have taught us that this best suits the use cases we want
to support. In particular, signaling errors when encountering invalid
UTF-8 sequences is a bad idea in a text-editing application, where
users expect an arbitrary byte stream to pass unscathed from input to
output. This is why Emacs is decades ahead of other similar systems,
such as Guile, which still throw exceptions in such cases (and claim
that they are "correct").

> > I'm not sure that single use case is important enough to change
> > something that was working like that since Emacs 23. Who knows how
> > many more important use cases this will break?
>
> It's important for correctness and for actually describing what "encoding" does.

So does labeling this as undefined behavior, which is what it is. We
don't really need to describe undefined behavior in detail, because
Lisp programs shouldn't rely on it.

> Do we expect users to explicitly put the byte sequences for the
> (undocumented) internal encoding into unibyte strings? Shouldn't we
> rather expect that users want to write unibyte strings as is, in all
> cases?

To avoid the undefined behavior, a Lisp program should never try to
encode a unibyte string with anything other than no-conversion or
raw-text (the latter also allows the application to convert EOL
format, if that is desired). IOW, you should have used either
raw-text-unix or no-conversion in your example, not utf-8.

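A minimal sketch of that advice, assuming a hypothetical helper named
my-write-bytes (the name and the error check are illustrative only,
not part of Emacs):

```elisp
;; Hypothetical helper for illustration; not an Emacs function.
(defun my-write-bytes (bytes file)
  "Write the unibyte string BYTES to FILE with no code conversion."
  ;; Refuse multibyte text: it needs a real encoding such as utf-8.
  (when (multibyte-string-p bytes)
    (error "Expected a unibyte string"))
  ;; Binding `coding-system-for-write' to `no-conversion' asks Emacs
  ;; to skip both character encoding and EOL conversion for this call.
  (let ((coding-system-for-write 'no-conversion))
    ;; With a string as the first argument, `write-region' writes that
    ;; string instead of buffer text; a VISIT argument of `silent'
    ;; suppresses the "Wrote file" message.
    (write-region bytes nil file nil 'silent)))
```

Called as (my-write-bytes "\xC1\xB2" FILE), this should leave the two
bytes C1 B2 in FILE unchanged. Binding raw-text with a suitable EOL
variant instead would be the choice if EOL conversion is wanted.
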
> > Oh, indeed, especially since it sounds to me like the problem is in the
> > original code (if you don't want to change bytes, the use a `binary`
> > encoding rather than utf-8).
>
> That wouldn't work with multibyte strings, right? Because they need to
> be encoded.

If your application needs to handle both unibyte and multibyte
strings, you can tell them apart with multibyte-string-p, which
returns nil for a unibyte string. For unibyte strings, use only
raw-text or no-conversion.

> > Exactly: I think we should try and get rid of those heuristics
> > (progressively). Actually, it's already what we've been doing since
> > Emacs-20, tho "lately" the progression in this respect has slowed
> > down I think.
>
> I'd definitely welcome any simplification in this area. There seems to
> be a lot of incidental complexity and undocumented corner cases here.

AFAIK, all of those heuristics are in the undefined behavior
department. Lisp programs are well advised to stay away from that. If
Lisp programs do stay away, they will never need to deal with the
complexity and the undocumented corner cases.

We keep the current behavior for backward compatibility, and for this
reason I would suggest avoiding changes in this area unless we have a
very good reason for a change.

What was the reason you needed to write something like the original
snippet?