From: Eli Zaretskii
Newsgroups: gmane.emacs.devel
Subject: Re: Unibyte characters, strings, and buffers
Date: Fri, 28 Mar 2014 11:51:53 +0300
Message-ID: <8361myyac6.fsf@gnu.org>
References: <831txozsqa.fsf@gnu.org> <83ppl7y30l.fsf@gnu.org> <87r45nouvx.fsf@uwakimon.sk.tsukuba.ac.jp>
In-reply-to: <87r45nouvx.fsf@uwakimon.sk.tsukuba.ac.jp>
To: "Stephen J. Turnbull"
Cc: monnier@IRO.UMontreal.CA, emacs-devel@gnu.org

> From: "Stephen J. Turnbull"
> Date: Fri, 28 Mar 2014 12:38:10 +0900
> Cc: Stefan Monnier, emacs-devel@gnu.org
>
> Eli Zaretskii writes:
>
> > Paul seemed to say something more broad: that _all_ behaviors specific
> > to unibyte buffers should go away.  Do you agree?
>
> Yes, please.  XEmacs has never had the unibyte hack with Mule, and
> never has had much trouble with that.  It also has never had an
> instance of the \201 bug since Mule was declared stable -- where Emacs
> has had *many* regressions.

Let's not talk about Emacs 20 vintage problems; that is not useful.
Likewise for examples from XEmacs: the differences in this area
between Emacs and XEmacs are substantial, and that precludes useful
comparison.

> It's arguable that there are performance implications, but simply
> aliasing the binary codec to latin1-unix has *never* caused a bug in
> handling binary files -- all bugs are due to autodetection errors,
> not the buffer representation.

Forget about performance; there are real problems unrelated to it
that need to be solved, and I don't see how you can avoid them by
treating raw bytes as Latin-1 characters.  Let me explain.

First, we must have a way to have buffer "text" that represents a
stream of bytes, not some human-readable text.  (Just as a random
example, a buffer visiting an mbox file, from which you decode
portions into another buffer for display.)  Agreed?

In such unibyte buffers, we need a way to represent raw bytes, which
are parts of as yet un-decoded byte sequences that represent encoded
characters.  We cannot represent each such byte as a Latin-1
character, because Latin-1 characters above 127 are stored inside
Emacs as 2-byte sequences of their UTF-8 encoding.  If you interpret
bytes as Latin-1 characters, functions like string-bytes will return
wrong results for those raw bytes.  Agreed?

So here you already have at least 2 valid reasons why Emacs must be
able to support raw bytes that are distinguishable from Latin-1
characters that have the same byte values, and why we must have
buffers that hold such raw bytes.

If we want to get rid of unibyte, Someone(TM) should present a
complete practical solution to those two problems (and a few others);
otherwise, this whole discussion leads nowhere.  ("Practical" means
that suggestions to introduce a character data type are out of scope,
or at least belong to an entirely different discussion.)

> OTOH Emacs' unibyte buffer toggle is a design bug, pure and simple,
> and it should be backed up against a wall and immersed in
> insecticide.

I might even agree with you about the toggle.  But eliminating the
toggle doesn't solve the bigger issue; see above.

> If you stick to the interpretation that bytes contain non-negative
> integers less than 256, you won't have a problem in practice if you
> think them as the first 256 Unicode characters, but choose not to use
> functions that make sense only with characters.

What do you mean by "choose"?  Lisp code is written by many
programmers out there; sometimes, they aren't even aware whether the
buffer they work on is unibyte, or what that means.
Even when they are aware, they just want Emacs to DTRT, for their own
value of "RT".  Unless each one of those programmers "chooses" not to
use the problematic functions, we are back at square one.

And what does "choose not to use" mean, anyway?  How do you choose not
to use 'insert', for example?  What do you use instead?

The issue at hand is how you pull off the trick, in practice, of
doing TRT with the legitimate use cases where Emacs needs to
manipulate raw bytes.

> Python actually implements many polymorphic functions (ie, they can
> be interpreted as bytes->bytes or characters->characters, etc) by
> converting bytes to characters as Latin-1, then using the character
> implementation of the function.

As long as Emacs exposes the character values to Lisp programs as
simple integers, I don't think we can take this path.
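
The two objections above can be illustrated with a small Python
sketch.  This is not Emacs code and not how Emacs is implemented; the
helper names latin1_view and utf8_size are hypothetical, and the sketch
assumes only what the message states, namely that Latin-1 characters
above 127 occupy 2 bytes when stored as UTF-8.

```python
def latin1_view(raw: bytes) -> str:
    """Map each byte to the code point with the same value (the Latin-1 trick)."""
    return raw.decode("latin-1")


def utf8_size(text: str) -> int:
    """Number of bytes needed to store `text` as UTF-8."""
    return len(text.encode("utf-8"))


# One ASCII byte plus two bytes above 0x7F (hypothetical sample data).
raw = bytes([0x41, 0xE9, 0xFF])
text = latin1_view(raw)

print(len(raw))                             # 3 raw bytes
print(utf8_size(text))                      # 5 bytes once stored as characters
print(text.encode("latin-1") == raw)        # True: the mapping itself is lossless
print([ord(c) for c in text] == list(raw))  # True: same integers either way
```

The round trip is lossless, which is why the approach works in Python,
where bytes and str are distinct types.  The byte count changes,
though (3 versus 5 here), and once the values are handed out as plain
integers nothing distinguishes a raw byte 0xE9 from the character
U+00E9.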