From: Eli Zaretskii
Newsgroups: gmane.emacs.devel
Subject: Re: Unibyte characters, strings, and buffers
Date: Fri, 28 Mar 2014 20:29:10 +0300
Message-ID: <83eh1mfd09.fsf@gnu.org>
References: <831txozsqa.fsf@gnu.org> <83ppl7y30l.fsf@gnu.org> <87r45nouvx.fsf@uwakimon.sk.tsukuba.ac.jp> <8361myyac6.fsf@gnu.org> <87a9capqfr.fsf@uwakimon.sk.tsukuba.ac.jp>
In-reply-to: <87a9capqfr.fsf@uwakimon.sk.tsukuba.ac.jp>
To: "Stephen J. Turnbull"
Cc: monnier@IRO.UMontreal.CA, emacs-devel@gnu.org

> From: "Stephen J. Turnbull"
> Cc: monnier@IRO.UMontreal.CA,
>     emacs-devel@gnu.org
> Date: Fri, 28 Mar 2014 19:28:56 +0900
>
> Eli Zaretskii writes:
>
> > Let's not talk about Emacs 20 vintage problems,
>
> If they were *only* Emacs 20 vintage, this thread wouldn't exist.

This thread is about different issues.

> > Likewise examples from XEmacs, since the differences in this area
> > between Emacs and XEmacs are substantial, and that precludes useful
> > comparison.
>
> "It works fine" isn't useful information?

No, because it describes a very different implementation.

> > First, we must have a way to have buffer "text" that represents a
> > stream of bytes, not some human-readable text.  (Just as a random
> > example, a buffer visiting an mbox file, from which you decode
> > portions into another buffer for display.)  Agreed?
>
> No, I disagree.

Then I guess you will have to suggest how to implement this without
unibyte buffers.

> > In such unibyte buffers, we need a way to represent raw bytes, which
> > are parts of as yet un-decoded byte sequences that represent encoded
> > characters.
>
> Again, I disagree.
> Unibyte is a design mistake, and unnecessary.

Then what do you call a buffer whose "text" is encoded?

> XEmacs proves it -- we use (essentially) the same code in many
> applications (VM, Gnus for two mbox-using examples) as GNU Emacs does.

I asked you not to bring XEmacs into the discussion, because I cannot
talk intelligently about its implementation.  If you insist on doing
that, this discussion is futile from my POV.

> For heaven's sake, we've had `buffer-as-{multi,uni}-byte' defined as
> no-ops forever

I wasn't talking about those functions.  I was talking about the need
to have unibyte buffers and strings.

> I agree that having a way to represent "undecodable bytes" in a string
> or buffer is extremely convenient.  XEmacs's lack of this capability
> is surely a deficiency (Hi, David K!)  But this is a completely
> different issue from unibyte buffers.

How is it different?  What would be the encoding of a buffer that
contains raw bytes?

> > We cannot represent each such byte as a Latin-1 character, because
> > Latin-1 characters are stored inside Emacs as 2-byte sequences of
> > their UTF-8 encoding.  If you interpret bytes as Latin-1
> > characters, functions like string-bytes will return wrong results
> > for those raw bytes.  Agreed?
>
> No, I still disagree.
>
> `(defun string-bytes (&rest junk) (error))', and live happily ever
> after.

But that's ridiculous: a raw byte is just a single byte, so
string-bytes should return a meaningful value for a string of such
bytes.

> You don't need `string-bytes' unless you've exposed internal
> representation to Lisp, then you desperately need it to write correct
> code (which some users won't be able to do anyway without help, cf.
> https://groups.google.com/forum/#!topic/comp.emacs/IRKeteTzfbk).  So
> *don't expose internal representation* (and the hammer marks on users'
> foreheads will disappear in due time, and the headaches even faster!)

How else would you know how many bytes a string will take on disk?
> > So here you have already at least 2 valid reasons
>
> No, *you* have them.  XEmacs works perfectly well without them, using
> code written for Emacs.

XEmacs also works "perfectly well" without bidi and other stuff.  That
doesn't help at all in this discussion.

> > If we want to get rid of unibyte, Someone(TM) should present a
> > complete practical solution to those two problems (and a few
> > others), otherwise, this whole discussion leads nowhere.
>
> Complete practical solution: "They are non-problems, forget about
> them, and rewrite any code that implies you need to remember them."

That's a slogan, not a solution.

> Fortunately for me, I am *intimately* familiar with XEmacs internals,
> and therefore RMS won't let me write this code for Emacs. :-)

Then perhaps you shouldn't be part of this discussion.

> > > If you stick to the interpretation that bytes contain non-negative
> > > integers less than 256, you won't have a problem in practice if you
> > > think them as the first 256 Unicode characters, but choose not to use
> > > functions that make sense only with characters.
> >
> > What do you mean by "choose"?  Lisp code is used by many programmers
> > out there; sometimes, they aren't even aware if the buffer they work
> > on is unibyte, or what that means.
>
> Which is precisely why we're having this thread.  If there were *no*
> Lisp-visible unibyte buffers or strings, it couldn't possibly matter.

And if I had $5M in my bank account, I'd probably be elsewhere
enjoying myself.  IOW, how are "if there were no..." arguments useful?

> > Even when they are aware, they just want Emacs to DTRT, for their
> > own value of "RT".
>
> Too bad for them, as long as Emacs has unibyte buffers.  They have to
> be aware, and write code correctly for the mode of the buffer.
> Viz. the poor serial port programmer in comp.emacs.
>
> In XEmacs, they don't have to; they just use an appropriate
> network-coding-system, and it just works.
This is not a discussion about whose model is better, Emacs or XEmacs.
This is a discussion of whether and how we can remove unibyte buffers,
strings, and characters from Emacs.  You must start by understanding
how they are used in Emacs 24, and then suggest practical ways to
change that.  Saying "look at XEmacs" doesn't help, because we can't,
and you know it.

I explicitly asked not to bring these arguments into the discussion,
and yet you still insist on doing precisely that.

> > And what does "choose not to use" mean, anyway?  How do you choose not
> > to use 'insert', for example?  What do you use instead?
>
> Of course you use `insert'.

In Emacs, 'insert' does some pretty subtle stuff with unibyte buffers
and characters.  If you use it, you get what it does.

> What I'm saying is that if you don't want to trash a binary buffer
> where each byte is represented by an ISO-8859-1 character in
> internal representation, you need to avoid (1)
> coding-system-for-write other than 'binary (in XEmacs, aliased to
> 'iso-8859-1-unix), and (2) functions that mutate characters using
> properties of characters that bytes don't have (eg, upcase).  That's
> really all there is to it.

If the buffer is not marked specially, how will I know to avoid those?

> But surely you have a function like
> `char-int-p'[1] that is used (implicitly by `insert') to prevent
> non-characters (in Emacs, 0xFFFF and surrogates would be examples, I
> suppose) from being inserted in buffers.  Otherwise you'd have crashes
> all over the place, I would imagine.  Since you don't, you must be
> doing something to prevent arbitrary integers from getting inserted.

There's char-valid-p, but I don't see how that is relevant to the
current discussion.

> It seems to me that the only real issue, given that you have a way in
> Emacs to represent undecodable bytes (XEmacs doesn't, but Emacs does)
> is what to do if somebody reads in data as 'binary, then proceeds to
> insert non-Latin-1 characters in the buffer.
> I can think of three
> possibilities: (1) don't allow it without changing the buffer's output
> codec, (2) treat the existing characters as Latin-1, or (3) convert
> all the existing "bytes" to undecodable bytes representation.
>
> XEmacs implicitly does (2) ((3) can't be implemented at all, at
> present).

Not sure I understand what you describe, but if I do, Emacs does (3).

And I still don't see how this is relevant.  You are describing a
marginally valid use case, while I'm talking about use cases we meet
every day, and which must be supported, e.g. when some Lisp wants to
decode or encode text by hand.
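
As a rough illustration of the distinction argued over above (this is
not code from the thread), the sketch below compares character counts
and byte counts for a multibyte string and for the unibyte strings
produced by encoding it by hand.  The sample character and the
variable names are assumptions chosen for the example; the functions
are standard Emacs Lisp and can be evaluated in Emacs 24 or later,
e.g. in M-x ielm.

```elisp
;; Sketch only: sample text and variable names are illustrative.
(let* ((text    "\u00e9")                            ; e-acute, a multibyte string
       (utf8    (encode-coding-string text 'utf-8))  ; unibyte string, raw bytes
       (latin1  (encode-coding-string text 'iso-latin-1))
       (decoded (decode-coding-string utf8 'utf-8))) ; decoding by hand
  (list
   (multibyte-string-p text)    ; t
   (length text)                ; 1 character
   (string-bytes text)          ; 2: internal representation is UTF-8 based
   (multibyte-string-p utf8)    ; nil: encoded text is a unibyte string
   (string-bytes utf8)          ; 2: bytes this text occupies on disk as UTF-8
   (string-bytes latin1)        ; 1: bytes it occupies on disk as Latin-1
   (string= decoded text)))     ; t: the round trip restores the character
```

The point of the comparison is that `string-bytes' on the multibyte
string reports the size of the internal representation, while
`string-bytes' on an encoded (unibyte) string reports the size of the
bytes that would actually be written out.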