From: David Kastrup
Newsgroups: gmane.emacs.devel
Subject: Re: Unibyte characters, strings, and buffers
Date: Sat, 29 Mar 2014 11:42:43 +0100
Message-ID: <87y4ztp9p8.fsf@fencepost.gnu.org>
To: "Stephen J. Turnbull"
Cc: Eli Zaretskii, monnier@IRO.UMontreal.CA, emacs-devel@gnu.org
In-Reply-To: <87ob0pnyt6.fsf@uwakimon.sk.tsukuba.ac.jp> (Stephen J. Turnbull's message of "Sat, 29 Mar 2014 18:23:17 +0900")
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/24.4.50 (gnu/linux)

"Stephen J. Turnbull" writes:

> Eli Zaretskii writes:
>
>  > How is it different?  What would be the encoding of a buffer that
>  > contains raw bytes?
>
> Depends.  If it's uninterpreted bytes, "binary."  If those are
> undecodable bytes, they'll be the representation of raw bytes that
> occurred in an otherwise sane encoded stream, and the buffer's
> encoding will be the nominal encoding of that stream.

It's worth pointing out that there is no such thing as a "buffer's
encoding" in general in Emacs.  Buffers are sequences of characters
or, in the case of a unibyte buffer, bytes.  Encodings come into play
only for import and export.  They are not an inherent property of the
buffer as such but rather of, for example, the buffer's file
association.

Emacs has two kinds of internal representation (what one might
actually want to call "buffer encoding"): unibyte and multibyte.
XEmacs, I think, has only one.
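
The point that an encoding belongs to import/export rather than to the
character sequence itself can be sketched by analogy.  What follows is
not Emacs code: it is a minimal Python illustration in which a `str`
stands in for the contents of a multibyte buffer, and the helper names
`export_text` and `import_text` are hypothetical, chosen only for this
example.

```python
# Analogy only, not Emacs internals: a Python str is a sequence of
# code points with no encoding attached, much like buffer contents.

def export_text(chars: str, coding: str) -> bytes:
    """Encode a character sequence when writing it out."""
    return chars.encode(coding)


def import_text(data: bytes, coding: str) -> str:
    """Decode external bytes into a character sequence when reading."""
    return data.decode(coding)


contents = "na\u00efve"  # five characters; the third is U+00EF

# The same contents can be exported under different encodings ...
as_utf8 = export_text(contents, "utf-8")
as_latin1 = export_text(contents, "latin-1")

# ... which yield different bytes, yet both import back to the
# identical character sequence.  The encoding is a property of the
# transfer, not of the characters being held.
same_after_round_trip = (
    import_text(as_utf8, "utf-8") == import_text(as_latin1, "latin-1") == contents
)
```

In this sketch the only place an encoding name appears is in the calls
that cross the boundary between characters and bytes, which mirrors
the claim that encodings matter only at import and export.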
The current point of contention is about making codepoint-based
character operations behave differently depending on the unibyte state
of the current buffer.  I consider that an astonishingly bad idea,
since character and string operations are not tied to a particular
buffer.  The whole point of MULE, from fairly early on, was to deal
with only a single Unicode-based character set in all of Emacs.
Making character operations change meaning based on a buffer's unibyte
status means a return to the character set semantics of Emacs 19.

I am not necessarily of the same opinion as Stephen regarding whether
or not abolishing unibyte buffers is a worthwhile goal.  But I am
pretty sure that "unibyte" should not be bleeding over into character
and string operations.  A unibyte buffer or unibyte string might error
out when trying to insert characters outside the range 0..255.  That's
an obvious consequence of the buffer's representation.

If we want different semantics for case-fold-search in binary buffers,
then the solution is to set case-fold-search buffer-locally when
opening a buffer intended to be manipulated in a binary way.

But the unibyte setting of the buffer should not affect normal
character and string operation semantics.  It is a buffer
implementation detail that should not really have a visible effect
apart from making some buffer operations impossible.

Whether or not we want to abolish unibyte buffer representations, we
don't want their effects to bleed beyond the buffer representation.
If something chooses a unibyte buffer representation for some reason,
it is the responsibility of that same something to switch character
operations, case-fold-search etc. to something that makes sense in the
context of its operation.  That may well be through some buffer-local
setting of case-fold-search etc., but it is not tied to the internal
representation of the buffer contents.

-- 
David Kastrup