From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: Eli Zaretskii <eliz@gnu.org>
Newsgroups: gmane.emacs.devel
Subject: Re: Unibyte characters, strings, and buffers
Date: Fri, 28 Mar 2014 11:51:53 +0300
Message-ID: <8361myyac6.fsf@gnu.org>
References: <831txozsqa.fsf@gnu.org> <jwv4n2j2141.fsf-monnier+emacs@gnu.org>
	<83ppl7y30l.fsf@gnu.org> <87r45nouvx.fsf@uwakimon.sk.tsukuba.ac.jp>
Reply-To: Eli Zaretskii <eliz@gnu.org>
NNTP-Posting-Host: plane.gmane.org
X-Trace: ger.gmane.org 1395996724 18181 80.91.229.3 (28 Mar 2014 08:52:04 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Fri, 28 Mar 2014 08:52:04 +0000 (UTC)
Cc: monnier@IRO.UMontreal.CA, emacs-devel@gnu.org
To: "Stephen J. Turnbull" <stephen@xemacs.org>
In-reply-to: <87r45nouvx.fsf@uwakimon.sk.tsukuba.ac.jp>
List-Id: "Emacs development discussions." <emacs-devel.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/archive/html/emacs-devel>
List-Post: <mailto:emacs-devel@gnu.org>
List-Help: <mailto:emacs-devel-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=subscribe>
Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Xref: news.gmane.org gmane.emacs.devel:171066
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/171066>

> From: "Stephen J. Turnbull" <stephen@xemacs.org>
> Date: Fri, 28 Mar 2014 12:38:10 +0900
> Cc: Stefan Monnier <monnier@IRO.UMontreal.CA>, emacs-devel@gnu.org
> 
> Eli Zaretskii writes:
> 
>  > Paul seemed to say something more broad: that _all_ behaviors specific
>  > to unibyte buffers should go away.  Do you agree?
> 
> Yes, please.  XEmacs has never had the unibyte hack with Mule, and
> never has had much trouble with that.  It also has never had an
> instance of the \201 bug since Mule was declared stable -- where Emacs
> has had *many* regressions.

Let's not talk about Emacs 20 vintage problems; that is not useful.
Likewise for examples from XEmacs: the differences in this area
between Emacs and XEmacs are substantial, and that precludes useful
comparison.

> It's arguable that there are performance implications, but simply
> aliasing the binary codec to latin1-unix has *never* caused a bug in
> handling binary files -- all bugs are due to autodetection errors,
> not the buffer representation.

Forget about performance; there are real problems unrelated to it
that need to be solved, and I don't see how you can avoid them by
treating raw bytes as Latin-1 characters.  Let me explain.

First, we must have a way to have buffer "text" that represents a
stream of bytes, not some human-readable text.  (Just as a random
example, a buffer visiting an mbox file, from which you decode
portions into another buffer for display.)  Agreed?

In such unibyte buffers, we need a way to represent raw bytes, which
are parts of as yet un-decoded byte sequences that represent encoded
characters.  We cannot represent each such byte as a Latin-1
character, because non-ASCII Latin-1 characters are stored inside
Emacs as 2-byte sequences of their UTF-8 encoding.  If you interpret
bytes as Latin-1 characters, functions like string-bytes will return
wrong results for those raw bytes.  Agreed?
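
To make the byte-count problem concrete, here is a minimal sketch in
Python (not Emacs code).  It assumes only what is said above: that a
character in the upper half of Latin-1 occupies 2 bytes when stored as
UTF-8.  The variable names are invented for the illustration.

```python
# Sketch only: Python stands in for the internal UTF-8 storage
# described above; none of this is Emacs API.
raw = bytes([0x41, 0xE9, 0xFF])        # three raw bytes, two of them >= 0x80

as_latin1 = raw.decode("latin-1")      # pretend each byte is a Latin-1 character
internal = as_latin1.encode("utf-8")   # how those characters would be stored

raw_len = len(raw)            # what a byte-oriented caller expects
char_len = len(as_latin1)     # number of characters
internal_len = len(internal)  # bytes actually occupied in UTF-8 storage

print(raw_len, char_len, internal_len)
```

The character count still matches the number of raw bytes, but the
storage size does not, so any function that reports the stored size in
bytes would give an answer that disagrees with the original byte stream.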

So here you have already at least 2 valid reasons why Emacs must be
able to support raw bytes that are distinguishable from Latin-1
characters that have the same byte values, and why we must have
buffers that hold such raw bytes.  If we want to get rid of unibyte,
Someone(TM) should present a complete practical solution to those two
problems (and a few others), otherwise, this whole discussion leads
nowhere.  ("Practical" means that suggestions to introduce a character
data type are out of scope, or at least belong to an entirely
different discussion.)

> OTOH Emacs' unibyte buffer toggle is a design bug, pure and simple,
> and it should be backed up against a wall and immersed in
> insecticide.

I might even agree with you about the toggle.  But eliminating the
toggle doesn't solve the bigger issue, see above.

> If you stick to the interpretation that bytes contain non-negative
> integers less than 256, you won't have a problem in practice if you
> think them as the first 256 Unicode characters, but choose not to use
> functions that make sense only with characters.

What do you mean by "choose"?  Lisp code is written by many programmers
out there; sometimes, they aren't even aware whether the buffer they
work on is unibyte, or what that means.  Even when they are aware, they
just want Emacs to DTRT, for their own value of "RT".  Unless each one
of those programmers "chooses" not to use the problematic functions,
we are back at square one.

And what does "choose not to use" mean, anyway?  How do you choose not
to use 'insert', for example?  What do you use instead?

The issue at hand is how you pull off the trick, in practice, of doing
TRT with the legitimate use cases where Emacs needs to manipulate raw
bytes.

> Python actually implements many polymorphic functions (ie, they can
> be interpreted as bytes->bytes or characters->characters, etc) by
> converting bytes to characters as Latin-1, then using the character
> implementation of the function.

As long as Emacs exposes the character values to Lisp programs as
simple integers, I don't think we can take this path.
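
A rough sketch of that ambiguity, again in Python rather than Emacs
Lisp.  Once a byte has been turned into a Latin-1 character and handed
out as a plain integer, nothing distinguishes it from the real
character.  The helper names and the RAW_BYTE_BASE offset are
hypothetical, picked only so that tagged bytes fall outside the Unicode
range.

```python
# Sketch only: models "characters exposed as plain integers".
def chars_as_ints(text):
    """Return the integer code of every character in TEXT."""
    return [ord(c) for c in text]

# A raw byte 0xE9 read as Latin-1, versus genuine text containing U+00E9.
raw_byte_view = chars_as_ints(bytes([0xE9]).decode("latin-1"))
real_text_view = chars_as_ints("\u00e9")
ambiguous = raw_byte_view == real_text_view   # both are [233]

# Hypothetical fix: give non-ASCII raw bytes their own integer range.
RAW_BYTE_BASE = 0x3FFF00   # assumption for illustration, above U+10FFFF

def tag_raw_bytes(data):
    """Map bytes >= 0x80 into a range no Unicode character can occupy."""
    return [b if b < 0x80 else RAW_BYTE_BASE + b for b in data]

tagged_view = tag_raw_bytes(bytes([0xE9]))
distinguishable = tagged_view != real_text_view

print(ambiguous, distinguishable)
```

With the Latin-1 reading, a Lisp program receiving 233 cannot know
whether it holds a byte or a character; with a separate range it can,
which is the property argued for above.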