From: Eli Zaretskii
To: David Kastrup
Cc: emacs-devel@gnu.org
Newsgroups: gmane.emacs.devel
Subject: Re: Unibyte characters, strings, and buffers
Date: Sat, 29 Mar 2014 12:25:43 +0300
Message-ID: <83r45le4q0.fsf@gnu.org>
In-reply-to: <8761mxqty4.fsf@fencepost.gnu.org>

> From: David Kastrup
> Cc: emacs-devel@gnu.org
> Date: Sat, 29 Mar 2014 09:40:03 +0100
>
> >> It means a buffer where each _character_ has the same value that the
> >> no-longer-available unibyte buffer would have in its bytes/characters.
> >
> > This doesn't seem to be a complete description of what is suggested.
> > E.g., just by looking at the values of characters, it is impossible to
> > distinguish between Latin characters below 256 and raw bytes.  In a
> > unibyte buffer, we know how to make that distinction,
>
> Uh, what?  The point of a unibyte buffer is that it does not make the
> distinction.

Yes, it does: it treats every character as a raw byte.  So the dilemma
is resolved there by definition.  How to do that without unibyte
buffers remains to be defined, otherwise plans to remove unibyte
buffers are impractical.

> > but if there are no unibyte buffers, something else is needed for
> > doing that.
> >
> >> You can do that whether or not the conceptual array of 0..255 characters
> >> is internally encoded in unibyte or multibyte encodings.
> >
> > What do you mean by "multibyte encodings" in this context?  Are you
> > suggesting to store the bytes 128..255 as Latin-1 characters,
> > i.e. using the 2-byte UTF-8 sequences of the corresponding Latin
> > characters?
>
> That would make the most sense, yes.

Then the above distinction is impossible, and all kinds of subtly
incorrect behaviors creep in.

> > Or are you suggesting something else?
>
> You could also use the "raw byte" character encodings we use for not
> losing information when reading not properly formed utf-8 files into a
> multibyte buffer, but that seems less practical when working with the
> character codes.

Why less practical?
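
To make the two storage schemes discussed above concrete, here is a
rough sketch.  It is only an analogy written in Python, not Emacs
code: Python's "latin-1" codec stands in for storing bytes 128..255 as
Latin-1 characters, and its "surrogateescape" error handler stands in
for a separate "raw byte" character range.  The variable names are
made up for the illustration.

```python
# Analogy only: Python str/bytes, not Emacs buffers or Emacs internals.

RAW = b"\xe9"      # a lone byte 0xE9 (not valid UTF-8 on its own)
TEXT = "\u00e9"    # the Latin character e-acute, code point 0xE9

# Scheme A: store each byte 128..255 as the Latin-1 character of the
# same value.  The raw byte and the character become the same thing.
raw_as_latin1 = RAW.decode("latin-1")
same_under_a = (raw_as_latin1 == TEXT)

# Writing the mix back out as UTF-8 turns the raw byte into two bytes.
lossy = (TEXT + raw_as_latin1).encode("utf-8")

# Scheme B: map undecodable bytes into a reserved code range, so they
# stay distinguishable from real characters (here U+DC80..U+DCFF).
raw_escaped = RAW.decode("utf-8", errors="surrogateescape")
distinct_under_b = (raw_escaped != TEXT)
code_b = ord(raw_escaped)

# The reserved range lets the original byte be restored on output.
round_trip = (TEXT + raw_escaped).encode("utf-8", errors="surrogateescape")

print(same_under_a, distinct_under_b, hex(code_b))
print(lossy, round_trip)
```

Under scheme A the raw byte can no longer be told apart from the
character, and it comes back out as a 2-byte sequence.  Under scheme B
the two stay distinct and the byte survives a round trip, at the price
of a character code that no longer equals the byte value.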