From: "Stephen J. Turnbull"
To: Eli Zaretskii
Cc: monnier@IRO.UMontreal.CA, emacs-devel@gnu.org
Newsgroups: gmane.emacs.devel
Subject: Re: Unibyte characters, strings, and buffers
Date: Sat, 29 Mar 2014 18:23:17 +0900
Message-ID: <87ob0pnyt6.fsf@uwakimon.sk.tsukuba.ac.jp>
In-Reply-To: <83eh1mfd09.fsf@gnu.org>

Eli Zaretskii writes:

> This thread is about different issues.

*sigh*  No, it's about unibyte being a premature pessimization.

> > > Likewise examples from XEmacs, since the differences in this area
> > > between Emacs and XEmacs are substantial, and that precludes useful
> > > comparison.
> >
> > "It works fine" isn't useful information?
>
> No, because it describes a very different implementation.

Not at all.  The implementation of multibyte buffers is very similar.
What's different is that Emacs complifusticates matters by also having
a separate implementation of unibyte buffers, and then basically
making a union out of the two structures called "buffer".  XEmacs
simply implements binary as a particular coding system in and out of
multibyte buffers.

> Then I guess you will have to suggest how to implement this without
> unibyte buffers.

No, I don't.  I already told you how to do it: nuke unibyte buffers
and use iso-8859-1-unix as the binary codec.
Then you're done, except for those applications that actually make the
mistake of using unibyte text explicitly.  If there are cases where
unibyte happens implicitly, and this transformation causes a bug, I
think you'll discover unibyte itself was problematic.

> > > In such unibyte buffers, we need a way to represent raw bytes, which
> > > are parts of as yet un-decoded byte sequences that represent encoded
> > > characters.
> >
> > Again, I disagree.  Unibyte is a design mistake, and unnecessary.
>
> Then what do you call a buffer whose "text" is encoded?

"Binary."

> > XEmacs proves it -- we use (essentially) the same code in many
> > applications (VM, Gnus for two mbox-using examples) as GNU Emacs does.
>
> I asked you not to bring XEmacs into the discussion, because I cannot
> talk intelligently about its implementation.  If you insist on doing
> that, this discussion is futile from my POV.

The whole point here is that exactly what the XEmacs implementation is
is *irrelevant*.  The point is that we implement the same API as GNU
Emacs without unibyte buffers or the annoyances and incoherence that
come with them.

> > For heaven's sake, we've had `buffer-as-{multi,uni}-byte' defined as
> > no-ops forever
>
> I wasn't talking about those functions.  I was talking about the need
> to have unibyte buffers and strings.

There is no "need for unibyte."  You're simply afraid to throw it away.

> How is it different?  What would be the encoding of a buffer that
> contains raw bytes?

Depends.  If it's uninterpreted bytes, "binary."  If those are
undecodable bytes, they'll be the representation of raw bytes that
occurred in an otherwise sane encoded stream, and the buffer's
encoding will be the nominal encoding of that stream.  If you want to
ensure sanity of output, then you will use an output encoding that
errors on rawbytes, and a program that cleans up those rawbytes in a
way appropriate for the application.
If you expect the next program in the pipeline to handle them, then
you use a variant encoding that just encodes them back to the original
undecodable rawbytes.

> But that's ridiculous: a raw byte is just a single byte, so
> string-bytes should return a meaningful value for a string of such
> bytes.

`string-bytes' should not exist.  As I wrote earlier:

> > You don't need `string-bytes' unless you've exposed internal
> > representation to Lisp, then you desperately need it to write correct
> > code (which some users won't be able to do anyway without help, cf.
> > https://groups.google.com/forum/#!topic/comp.emacs/IRKeteTzfbk).  So
> > *don't expose internal representation* (and the hammer marks on users'
> > foreheads will disappear in due time, and the headaches even faster!)
>
> How else would you know how many bytes will a string take on disk?

How does `string-bytes' help?  You don't know what encoding will be
used to write them, and in general it won't be the same number that
they take up in the string.  If you use iso-8859-1-unix as the coding
system, then "bytes on the wire" == "characters in the string".  No
problema, señor.

> > > So here you have already at least 2 valid reasons
> >
> > No, *you* have them.  XEmacs works perfectly well without them, using
> > code written for Emacs.
>
> XEmacs also works "perfectly well" without bidi and other stuff.  That
> doesn't help at all in this discussion.

You're right: because XEmacs doesn't handle bidi, it's irrelevant to
this discussion.  Why did *you* bring it up?

What is relevant is how to represent byte streams in Emacs.  The
obvious non-unibyte way is a one-to-one mapping of bytes to Unicode
characters.  It is *extremely* convenient if the first 128 of those
bytes correspond to the ASCII coded character set, because so many
wire protocols use ASCII "words" syntactically.  The other 128 don't
matter much, so why not just use the extremely convenient Latin-1 set
for them?
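
To make the one-to-one mapping concrete, here is a minimal sketch.  It
is written in Python rather than Emacs Lisp, purely for illustration,
and it assumes that Python's `latin-1' codec behaves like the
iso-8859-1-unix binary codec described above.  The helper names
`bytes_to_text' and `text_to_bytes' are hypothetical, not part of any
Emacs or XEmacs API.

```python
# Sketch only: Python's "latin-1" codec stands in for a binary coding
# system.  Every byte 0..255 maps to the code point with the same number.

def bytes_to_text(raw: bytes) -> str:
    """Read arbitrary bytes as characters, one character per byte."""
    return raw.decode("latin-1")


def text_to_bytes(text: str) -> bytes:
    """Write such a string back out; fails for characters above U+00FF."""
    return text.encode("latin-1")


# The mapping is total and lossless for all 256 byte values.
all_bytes = bytes(range(256))
as_text = bytes_to_text(all_bytes)
roundtrip = text_to_bytes(as_text)

# "Bytes on the wire" == "characters in the string".
same_length = len(as_text) == len(all_bytes)

# ASCII protocol words stay readable even next to non-text bytes.
request = bytes_to_text(b"GET /index.html HTTP/1.0\r\n\xff\xfe")
verb = request.split(" ", 1)[0]
```

Because the character count equals the byte count under this codec,
nothing like `string-bytes' is needed to predict the output size.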
> > > If we want to get rid of unibyte, Someone(TM) should present a
> > > complete practical solution to those two problems (and a few
> > > others), otherwise, this whole discussion leads nowhere.
> >
> > Complete practical solution: "They are non-problems, forget about
> > them, and rewrite any code that implies you need to remember them."
>
> That's a slogan, not a solution.

No, it is a precise high-level design for a solution.  The same design
that XEmacs uses, and which would be quite straightforward for Emacs
to adopt since it already has multibyte buffers of the same power as
XEmacs's, though with (currently) a different internal encoding.

> > Fortunately for me, I am *intimately* familiar with XEmacs internals,
> > and therefore RMS won't let me write this code for Emacs. :-)
>
> Then perhaps you shouldn't be part of this discussion.

Since I've been invited to leave, I will.  My point is sufficiently
well-made for open minds to deal with the details.  I'll finish this
post on the off chance that somewhere in it will be the key that will
unlock yours.

> > Which is precisely why we're having this thread.  If there were *no*
> > Lisp-visible unibyte buffers or strings, it couldn't possibly matter.
>
> And if I had $5M in my bank account, I'd probably be elsewhere
> enjoying myself.  IOW, how are "if there were no..." arguments useful?

Because they point out that this thread wouldn't have happened with a
different design.  I consider that design better, after experience
with two separate implementations of multibyte only (NEmacs,
XEmacs/MULE), an implementation with strict separation of bytes from
characters (Python 2 with PEP 383), an implementation with strict
separation of bytes from characters and space-efficient character
representation (Python 3 with PEPs 383, 393), and one implementation
with unibyte (Emacs).  The first four work fine dealing with bytes and
characters, and there is no confusion.
Both Pythons can handle undecodable bytes in encoded streams (i.e.,
roundtrip).  Only GNU Emacs has issues about dealing with unibyte
vs. multibyte.

> This is not a discussion about whose model is better, Emacs or XEmacs.
> This is a discussion of whether and how can we remove unibyte buffers,
> strings, and characters from Emacs.  You must start by understanding
> how are they used in Emacs 24, and then suggest practical ways to
> change that.

Well, I would have said "tell me about it", but you've asked me to
leave, so I won't.  I will say that nothing you've said so far even
hints at issues with simply removing the whole concept of unibyte.

> In Emacs, 'insert' does some pretty subtle stuff with unibyte buffers
> and characters.  If you use it, you get what it does.

And I'm telling you those subtleties are a *problem* that solves
nothing that an Emacs without a unibyte concept can't handle fine.

> If the buffer is not marked specially, how will I know to avoid
> [inserting non-Latin-1 characters in a "binary" buffer]?

All experience with XEmacs says *you* (the human programmer) *won't*
have any problem avoiding that.  As a programmer, if you're working
with a binary protocol, you will be using binary buffers and strings,
and byte-sized integers.  If you accidentally mix things up, you'll
quickly get an encoding error on output (since the binary codec can't
output non-Latin-1 Unicode characters).  It's just not a problem in
practice, and that's not why unibyte was introduced in Emacs anyway.

Unibyte was introduced because some folks thought working with
variable-width-encoded buffers was too inefficient, so they wanted
access to a flat buffer of bytes.  That's why
buffer-as-{uni,multi}byte type punning was included.

> > But surely you have a function like `char-int-p'[1] [...]
>
> There's char-valid-p, but I don't see how that is relevant to the
> current discussion.

Only insofar as you thought char-int confusion might be an issue.

> And I still don't see how this is relevant.
> You are describing a
> marginally valid use case, while I'm talking about use cases we meet
> every day, and which must be supported, e.g. when some Lisp wants to
> decode or encode text by hand.

You use `encode-coding-region' and `decode-coding-region', same as you
do now.  Do you seriously think that XEmacs doesn't support those use
cases?

o/o
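
As an illustration of decoding and encoding by hand without a separate
unibyte type, the following Python sketch may help.  It is an analogy
under stated assumptions, not Emacs code: `latin-1' plays the role of
the binary codec, the `surrogateescape' error handler is Python's
PEP 383 mechanism for round-tripping undecodable bytes, and the
function names `decode_by_hand' and `encode_by_hand' are hypothetical.

```python
# Sketch only: "binary" text is an ordinary string whose characters are
# all <= U+00FF, one character per byte on the wire.

def decode_by_hand(binary_text: str, codec: str) -> str:
    """Reinterpret a binary string as text in the given encoding."""
    return binary_text.encode("latin-1").decode(codec)


def encode_by_hand(text: str, codec: str) -> str:
    """Encode text and return the result as a binary string."""
    return text.encode(codec).decode("latin-1")


# UTF-8 bytes held as binary text, then decoded by hand.
binary_text = encode_by_hand("na\u00efve", "utf-8")
decoded = decode_by_hand(binary_text, "utf-8")

# Mixing in a non-Latin-1 character is caught when writing as binary.
try:
    (binary_text + "\u3042").encode("latin-1")
    mixing_detected = False
except UnicodeEncodeError:
    mixing_detected = True

# PEP 383: undecodable bytes survive a decode/encode round trip.
damaged = b"ok \xff\xfe end"
as_text = damaged.decode("utf-8", errors="surrogateescape")
restored = as_text.encode("utf-8", errors="surrogateescape")

# A strict output codec refuses the escaped raw bytes instead.
try:
    as_text.encode("utf-8")
    strict_rejects = False
except UnicodeEncodeError:
    strict_rejects = True
```

The two `try' blocks correspond to the two policies described earlier
in the post: an output encoding that errors on raw bytes, and a
variant encoding that writes them back unchanged.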