From: David Kastrup
Newsgroups: gmane.emacs.devel
Subject: Re: Unibyte characters, strings, and buffers
Date: Sat, 29 Mar 2014 18:16:39 +0100
Message-ID: <8738i1orgo.fsf@fencepost.gnu.org>
References: <831txozsqa.fsf@gnu.org> <83ppl7y30l.fsf@gnu.org>
 <87r45nouvx.fsf@uwakimon.sk.tsukuba.ac.jp> <8361myyac6.fsf@gnu.org>
 <87a9capqfr.fsf@uwakimon.sk.tsukuba.ac.jp> <83eh1mfd09.fsf@gnu.org>
 <87ob0pnyt6.fsf@uwakimon.sk.tsukuba.ac.jp> <87a9c9aqhu.fsf@nbtrap.com>
To: emacs-devel@gnu.org

Nathan Trapuzzano writes:

> "Stephen J. Turnbull" writes:
>
>> What is relevant is how to represent byte streams in Emacs. The
>> obvious non-unibyte way is a one-to-one mapping of bytes to Unicode
>> characters. It is *extremely* convenient if the first 128 of those
>> bytes correspond to the ASCII coded character set, because so many
>> wire protocols use ASCII "words" syntactically. The other 128 don't
>> matter much, so why not just use the extremely convenient Latin-1 set
>> for them?
>
> Sorry if someone brought this up already, but one reason raw bytes
> shouldn't be represented as Latin-1 characters is that the "raw
> bytes"-ness would be lost when writing them back to disk if the stream
> also contained characters outside the Latin-1 range.

No.

> For example, say we decode a stream of raw bytes as utf8, but that the
> stream contains some non-utf8 sequences. IIUC, Emacs will interpret
> those as "raw bytes", so that when it goes to encode the string to write
> it back, they will be written back verbatim.

"Raw bytes" here are represented as particular characters outside of the
Unicode range. They are representable in multibyte buffers. They never
were representable in unibyte buffers. While it is conceivable to map
characters 128..255 in unibyte strings/buffers to the respective
character codes outside of the Unicode range, that would render
programmatic manipulation of bytes strenuous.

> Whereas, if they had been interpreted as Latin-1 characters, they
> would get written back as the UTF8 equivalents. Hence you have the
> odd situation where you can decode and then encode and end up with a
> different string.

No, you can't unless you decode into a unibyte buffer, and then all bets
are off regarding reencoding.

-- 
David Kastrup
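
For readers who want to see the round-trip argument in running code, here is
a minimal sketch. It is an analogy only: it uses Python rather than Emacs
Lisp, and Python's "surrogateescape" error handler stands in for Emacs's
raw-byte characters (it maps undecodable bytes to code points that valid
text cannot contain, which is the property the reply relies on). The sample
bytes are made up for illustration.

```python
# Analogy only: "surrogateescape" plays the role of Emacs's raw-byte
# characters here; it is not how Emacs implements them.

# Valid UTF-8 for "café", a space, then a stray 0xFF byte (not valid UTF-8).
data = b"caf\xc3\xa9 \xff"

# The stray byte is decoded to a code point that valid text cannot contain
# (a lone surrogate), so it stays distinguishable from any real character.
text = data.decode("utf-8", errors="surrogateescape")
roundtrip = text.encode("utf-8", errors="surrogateescape")

# The hypothetical from the quoted text: treat the stray byte as the Latin-1
# character U+00FF instead. Encoding then writes the UTF-8 form of that
# character (0xC3 0xBF) rather than the original byte.
as_latin1 = "caf\u00e9 " + bytes([0xFF]).decode("latin-1")
reencoded = as_latin1.encode("utf-8")

print(roundtrip == data)  # decode followed by encode restores the input
print(reencoded == data)  # the Latin-1 interpretation does not
```

The first comparison holds because the escaped byte never collides with a
real character; the second fails because U+00FF is an ordinary character
with its own two-byte UTF-8 encoding.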