From: "Stephen J. Turnbull"
Newsgroups: gmane.emacs.devel
Subject: Re: Unibyte characters, strings, and buffers
Date: Fri, 28 Mar 2014 19:28:56 +0900
Message-ID: <87a9capqfr.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <831txozsqa.fsf@gnu.org> <83ppl7y30l.fsf@gnu.org> <87r45nouvx.fsf@uwakimon.sk.tsukuba.ac.jp> <8361myyac6.fsf@gnu.org>
In-Reply-To: <8361myyac6.fsf@gnu.org>
To: Eli Zaretskii
Cc: monnier@IRO.UMontreal.CA, emacs-devel@gnu.org

Eli Zaretskii writes:

> Let's not talk about Emacs 20 vintage problems,

If they were *only* Emacs 20 vintage, this thread wouldn't exist.

> Likewise examples from XEmacs, since the differences in this area
> between Emacs and XEmacs are substantial, and that precludes useful
> comparison.

"It works fine" isn't useful information?  XEmacs has *two* reasons
to want to change its internal representation.  (1) A Unicode
representation, especially UTF-8, would allow all autosave files to
be readable by other programs.  (2) A PEP 393-like representation
would be way faster for big buffers and strings.  Bytes-character
confusion is just plain not an issue, not for anybody, not at all.

> First, we must have a way to have buffer "text" that represents a
> stream of bytes, not some human-readable text.  (Just as a random
> example, a buffer visiting an mbox file, from which you decode
> portions into another buffer for display.)  Agreed?

No, I disagree.
XEmacs/MULE has never had such a feature, yet we can run all Emacs
programs without changing the buffer representation (modulo inability
to represent all Unicode characters properly, but the JIT charsets
are plenty good enough in practice).

> In such unibyte buffers, we need a way to represent raw bytes, which
> are parts of as yet un-decoded byte sequences that represent encoded
> characters.

Again, I disagree.  Unibyte is a design mistake, and unnecessary.
XEmacs proves it -- we use (essentially) the same code in many
applications (VM, Gnus for two mbox-using examples) as GNU Emacs
does.  The variations for XEmacs and Emacs are due to extents
vs. overlays and such-like, not due to buffer representation.  For
heaven's sake, we've had `buffer-as-{multi,uni}-byte' defined as
no-ops forever, and as far as I can tell nobody's ever needed to
worry about it (of course, maybe the folks who use those are just
more clued than the poor user in my next paragraph).

I agree that having a way to represent "undecodable bytes" in a
string or buffer is extremely convenient.  XEmacs's lack of this
capability is surely a deficiency (Hi, David K!)  But this is a
completely different issue from unibyte buffers.  Emacs doesn't need
unibyte buffers to perform its work, and if they are desirable on the
grounds of space or time efficiency, they should be opaque to Lisp.

> We cannot represent each such byte as a Latin-1 character, because
> Latin-1 characters are stored inside Emacs as 2-byte sequences of
> their UTF-8 encoding.  If you interpret bytes as Latin-1
> characters, functions like string-bytes will return wrong results
> for those raw bytes.  Agreed?

No, I still disagree.  `(defun string-bytes (&rest junk) (error))',
and live happily ever after.  You don't need `string-bytes' unless
you've exposed internal representation to Lisp; once you have, you
desperately need it to write correct code (which some users won't be
able to do anyway without help, cf.
https://groups.google.com/forum/#!topic/comp.emacs/IRKeteTzfbk).  So
*don't expose internal representation* (and the hammer marks on
users' foreheads will disappear in due time, and the headaches even
faster!)

> So here you have already at least 2 valid reasons

No, *you* have them.  XEmacs works perfectly well without them, using
code written for Emacs.

> If we want to get rid of unibyte, Someone(TM) should present a
> complete practical solution to those two problems (and a few
> others), otherwise, this whole discussion leads nowhere.

Complete practical solution: "They are non-problems, forget about
them, and rewrite any code that implies you need to remember them."

Fortunately for me, I am *intimately* familiar with XEmacs internals,
and therefore RMS won't let me write this code for Emacs. :-)

> > If you stick to the interpretation that bytes contain non-negative
> > integers less than 256, you won't have a problem in practice if you
> > think of them as the first 256 Unicode characters, but choose not to use
> > functions that make sense only with characters.
>
> What do you mean by "choose"?  Lisp code is used by many programmers
> out there; sometimes, they aren't even aware if the buffer they work
> on is unibyte, or what that means.

Which is precisely why we're having this thread.  If there were *no*
Lisp-visible unibyte buffers or strings, it couldn't possibly matter.

> Even when they are aware, they just want Emacs to DTRT, for their
> own value of "RT".

Too bad for them, as long as Emacs has unibyte buffers.  They have to
be aware, and write code correctly for the mode of the buffer.
Viz. the poor serial port programmer in comp.emacs.  In XEmacs, they
don't have to; they just use an appropriate network-coding-system,
and it just works.  That may not be *obvious* to a programmer coming
from a different background (say, Python) who expects there to be
both byte streams and text streams, but since there's no other way to
do it, it's not hard to get it right.
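
To make the `string-bytes' point concrete, here is a minimal sketch
in Python (not Emacs Lisp, so it runs without either Emacs).  It
assumes binary data is held as the first 256 Unicode characters and
uses UTF-8 as a stand-in for the internal representation; the
variable names are mine, for illustration only.

```python
# Binary data modelled as Latin-1 characters: byte N <-> code point N.
raw = bytes([0x41, 0xE9, 0xFF])    # three arbitrary bytes
text = raw.decode("latin-1")       # three characters: U+0041 U+00E9 U+00FF

# What Lisp-level code would work with: a count of characters.
char_count = len(text)

# Stand-in for a UTF-8 internal form (assumption): code points above
# 0x7F take two bytes each, so this number differs from char_count.
internal_bytes = len(text.encode("utf-8"))

# Writing with the 'binary codec gives back exactly the original bytes.
round_trip = text.encode("latin-1")
```

In this model the character count is 3 and matches the number of
bytes that get written out, while the internal byte count is 5.  Only
code that asks about the internal form ever sees the 5.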

> And what does "choose not to use" mean, anyway?  How do you choose not
> to use 'insert', for example?  What do you use instead?

Of course you use `insert'.  What I'm saying is that if you don't
want to trash a binary buffer where each byte is represented by an
ISO-8859-1 character in internal representation, you need to avoid
(1) coding-system-for-write other than 'binary (in XEmacs, aliased to
'iso-8859-1-unix), and (2) functions that mutate characters using
properties of characters that bytes don't have (e.g., `upcase').
That's really all there is to it.

> The issue at hand is how do you pull the trick, in practice, of
> doing TRT with the legitimate use cases where Emacs needs to
> manipulate raw bytes.

Follow the Nike advice: Just Do It.  Works fine, I assure you.

I can understand that you're worried by this:

> As long as Emacs exposes the character values to Lisp programs as
> simple integers, I don't think we can take this path.

... but I'm not really sure why not.  I'll grant that after drinking
the Ben Wing Kool-Aid the idea of Emacsen without a character type
gives me hives, but that's because arbitrary integers, if decomposed
into byte-sized fields and inserted into a buffer, can become
non-characters and crash XEmacs.  But surely you have a function like
`char-int-p'[1] that is used (implicitly by `insert') to prevent
non-characters (in Emacs, 0xFFFF and surrogates would be examples, I
suppose) from being inserted in buffers.  Otherwise you'd have
crashes all over the place, I would imagine.  Since you don't, you
must be doing something to prevent arbitrary integers from getting
inserted.

It seems to me that the only real issue, given that you have a way in
Emacs to represent undecodable bytes (XEmacs doesn't, but Emacs does)
is what to do if somebody reads in data as 'binary, then proceeds to
insert non-Latin-1 characters in the buffer.
I can think of three possibilities: (1) don't allow it without
changing the buffer's output codec, (2) treat the existing characters
as Latin-1, or (3) convert all the existing "bytes" to undecodable
bytes representation.  XEmacs implicitly does (2) ((3) can't be
implemented at all, at present).  I tend to prefer (1), but ISTR that
would not have worked very well with some programs, specifically
readmail and VM (whose author had a lot of influence on how XEmacs
internals were designed), because they narrowed the buffer and
converted wire format (including raw multibyte encodings) to
displayed text in-place.

Footnotes:
[1]  `char-int-p' is a built-in function
  (char-int-p OBJECT)

Documentation:
Return t if OBJECT is an integer that can be converted into a character.
See `char-int'.
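
The two things to avoid in a bytes-as-Latin-1 buffer, and possibility
(1) above, can be sketched as follows.  This is Python rather than
Emacs Lisp, and only a model: `to_bytes_or_none' and `insert_strict'
are hypothetical helpers of mine, not functions of Emacs or XEmacs.

```python
raw = bytes([0x61, 0xE9, 0xFF])    # arbitrary binary data
text = raw.decode("latin-1")       # one character per byte

# Hazard (1): writing with any codec other than the binary one.
wrong_codec = text.encode("utf-8")

# Hazard (2): a character operation that has no meaning for bytes.
upcased = text.upper()


def to_bytes_or_none(s):
    """Return the Latin-1 bytes for s, or None if some character is not a byte."""
    try:
        return s.encode("latin-1")
    except UnicodeEncodeError:
        return None


def insert_strict(buffer_text, new_text):
    """Hypothetical policy (1): refuse characters above U+00FF."""
    if any(ord(ch) > 0xFF for ch in new_text):
        raise ValueError("non-Latin-1 character in a binary buffer")
    return buffer_text + new_text
```

Here `wrong_codec' is five bytes long instead of three, and `upcased'
can no longer be written as bytes at all, because upcasing U+00FF
gives U+0178, which is outside Latin-1.  `insert_strict' simply
rejects the insertion; possibility (2) would accept it and leave the
choice of output codec to whoever saves the buffer.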