From mboxrd@z Thu Jan  1 00:00:00 1970
Path: main.gmane.org!not-for-mail
From: Stefan Monnier <monnier@iro.umontreal.ca>
Newsgroups: gmane.emacs.devel
Subject: Re: Improvements to `(emacs)File Variables'
Date: Mon, 15 Nov 2004 00:15:05 -0500
Message-ID: <87actj1zba.fsf-monnier+emacs@gnu.org>
References: <v9ekiws17i.fsf@marauder.physik.uni-ulm.de>
	<878y942l8h.fsf-monnier+emacs@gnu.org>
	<20041114232600.GA12968@fencepost>
	<87r7mw0zik.fsf-monnier+emacs@gnu.org>
	<20041114235558.GA18975@fencepost>
	<87lld40xve.fsf-monnier+emacs@gnu.org>
	<buo3bzbemrp.fsf@mctpc71.ucom.lsi.nec.co.jp>
NNTP-Posting-Host: deer.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Trace: sea.gmane.org 1100495759 9166 80.91.229.6 (15 Nov 2004 05:15:59 GMT)
X-Complaints-To: usenet@sea.gmane.org
NNTP-Posting-Date: Mon, 15 Nov 2004 05:15:59 +0000 (UTC)
Cc: Simon Krahnke <overlord@gmx.li>, emacs-devel@gnu.org
Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Mon Nov 15 06:15:48 2004
Return-path: <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>
Original-Received: from lists.gnu.org ([199.232.76.165])
	by deer.gmane.org with esmtp (Exim 3.35 #1 (Debian))
	id 1CTZDQ-0001j3-00
	for <ged-emacs-devel@m.gmane.org>; Mon, 15 Nov 2004 06:15:48 +0100
Original-Received: from localhost ([127.0.0.1] helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.33)
	id 1CTZM6-0002yj-LD
	for ged-emacs-devel@m.gmane.org; Mon, 15 Nov 2004 00:24:46 -0500
Original-Received: from mailman by lists.gnu.org with tmda-scanned (Exim 4.33)
	id 1CTZLu-0002yM-36
	for emacs-devel@gnu.org; Mon, 15 Nov 2004 00:24:34 -0500
Original-Received: from exim by lists.gnu.org with spam-scanned (Exim 4.33)
	id 1CTZLt-0002xh-Gw
	for emacs-devel@gnu.org; Mon, 15 Nov 2004 00:24:33 -0500
Original-Received: from [199.232.76.173] (helo=monty-python.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.33) id 1CTZLt-0002xG-5P
	for emacs-devel@gnu.org; Mon, 15 Nov 2004 00:24:33 -0500
Original-Received: from [209.226.175.74] (helo=tomts20-srv.bellnexxia.net)
	by monty-python.gnu.org with esmtp (Exim 4.34)
	id 1CTZCk-00087p-9R; Mon, 15 Nov 2004 00:15:06 -0500
Original-Received: from alfajor ([70.48.82.50]) by tomts20-srv.bellnexxia.net
	(InterMail vM.5.01.06.10 201-253-122-130-110-20040306) with ESMTP
	id <20041115051505.IXLW2034.tomts20-srv.bellnexxia.net@alfajor>;
	Mon, 15 Nov 2004 00:15:05 -0500
Original-Received: by alfajor (Postfix, from userid 1000)
	id 5561A2FD26; Mon, 15 Nov 2004 00:15:05 -0500 (EST)
Original-To: Miles Bader <miles@gnu.org>
In-Reply-To: <buo3bzbemrp.fsf@mctpc71.ucom.lsi.nec.co.jp> (Miles Bader's
	message of "Mon, 15 Nov 2004 13:53:14 +0900")
User-Agent: Gnus/5.11 (Gnus v5.11) Emacs/21.3.50 (gnu/linux)
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: "Emacs development discussions." <emacs-devel.gnu.org>
List-Unsubscribe: <http://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/pipermail/emacs-devel>
List-Post: <mailto:emacs-devel@gnu.org>
List-Help: <mailto:emacs-devel-request@gnu.org?subject=help>
List-Subscribe: <http://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=subscribe>
Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Xref: main.gmane.org gmane.emacs.devel:29857
X-Report-Spam: http://spam.gmane.org/gmane.emacs.devel:29857

> I'm not sure.  "Unibyte" as used in emacs seems (to me) to imply several
> things:  (1) of course, a single byte per character, (2) the concept of
> strings/buffers whose encoding is "unknown".

> If you were to consistently treat (2) as in fact meaning an explicit
> "binary" encoding, maybe it would be useful, but my impression is that
> at least historically, people/code have _not_ always done this, leading
> to lots and lots of confusion.  I suppose much of the reason is that
> people want the efficiency gain of (1), and either don't realize the
> problems caused by (2) or think they can kludge around it.

> As I've posted before, I think "unibyte" strings/buffers should be only
> an optimization, and should have an explicit (8-bit) encoding associated
> with them, so that any conversions to/from multibyte can automatically
> do the correct thing; one of these encoding could of course be "binary",
> which maybe would allow the historical usage of unibyte to be preserved.

I'd tend to disagree with the idea of associating an encoding with
unibyte buffers.  I think a large part of the problem is that people with
a unibyte background (i.e. latin-1 mostly) typically confuse the notions of
character and byte and mix things up hopelessly.

In Emacs-20, automatic conversion between unibyte and multibyte was provided
mostly as a way to work "correctly" even with confused code which didn't
understand that there are more than 256 characters in this world.

It made sense at the time, to avoid alienating too many Emacs coders.
But to get things right, the first thing we need to do is to make it very
clear that there is no way to automatically convert between unibyte
and multibyte.  Such a conversion should only be doable via the
(en|de)code-coding-foo functions (encode-coding-string,
decode-coding-region, ...), thus forcing anyone who wants to go down
that path to provide a coding system explicitly and thus to think
about which coding system should be used.
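
To make the idea concrete, here is a minimal sketch by analogy.  It is
Python 3, not Emacs code: Python's bytes and str stand in for unibyte and
multibyte text, and the helper names (decode_explicit, encode_explicit,
try_autoconvert) are made up for illustration.  The only point is that
every conversion has to name a coding system, and mixing the two kinds of
text without one is refused.

```python
# Analogy only: Python 3 bytes/str play the role of unibyte/multibyte.
# All helper names here are hypothetical, chosen for this illustration.

raw = "caf\u00e9".encode("latin-1")   # four bytes, the last one is 0xE9


def decode_explicit(data: bytes, coding: str) -> str:
    """Bytes -> characters; the caller must say which coding system."""
    return data.decode(coding)


def encode_explicit(text: str, coding: str) -> bytes:
    """Characters -> bytes; again the coding system is mandatory."""
    return text.encode(coding)


text = decode_explicit(raw, "latin-1")
round_trip = encode_explicit(text, "latin-1")


def try_autoconvert() -> bool:
    """Try to mix bytes and characters without naming a coding system."""
    try:
        raw + text          # no implicit conversion exists: TypeError
    except TypeError:
        return False
    return True
```

With this setup, round_trip equals raw, while try_autoconvert() returns
False because the language has no guess to fall back on.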

After all, autoconversion can only work for 8-bit encodings, so any code
which uses autoconversion falls into one of two cases:
1 - the code somehow knows that all the possible encodings it might need to
    use there are 8bit.  Most likely, it's the case where there's only ever
    one encoding used.
2 - the code *doesn't* know, but just assumes (probably without even being
    aware of it) that all encodings are 8bit.  Thus it will break if used
    in China, Japan, ...
Situation 2 is a bug.  Situation 1 seems rather unusual.  My conclusion is
that autoconversion is harmful.
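
Situation 2 can be sketched as follows.  Again this is a Python 3 analogy
rather than Emacs code, and naive_to_bytes and survives are hypothetical
names: the first stands for code that silently assumes an 8-bit encoding
(latin-1 here, as an assumption), the second just reports whether that
assumption held for a given text.

```python
# Analogy only: a hypothetical "autoconversion" that assumes latin-1.

def naive_to_bytes(text: str) -> bytes:
    """Convert characters to bytes, silently assuming an 8-bit encoding."""
    return text.encode("latin-1")


def survives(text: str) -> bool:
    """Report whether the 8-bit assumption holds for this text."""
    try:
        naive_to_bytes(text)
    except UnicodeEncodeError:
        return False
    return True


western = "caf\u00e9"               # fits in latin-1
japanese = "\u65e5\u672c\u8a9e"     # three characters outside latin-1

# One character is not one byte in general: in UTF-8 the three Japanese
# characters above occupy nine bytes.
utf8_length = len(japanese.encode("utf-8"))
```

Here survives(western) is True and survives(japanese) is False, which is
the "breaks if used in China, Japan, ..." case: the code was never wrong
for its author, only for everybody else.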

I've hacked my own local Emacs to "disallow" autoconversion.  More precisely:
autoconversion from unibyte to multibyte is still allowed, but generates
eight-bit-control and eight-bit-graphic chars; autoconversion from
multibyte to unibyte is allowed only for ascii, eight-bit-control, and
eight-bit-graphic chars, and any other char signals an error.  It actually
works fairly well.  The main problems I encounter have to do with regexp
matching where the regexp is multibyte and the text is unibyte.
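
For readers who want to play with a policy of this shape without patching
Emacs, here is a rough analogue in Python 3.  It is not my patch and not
Emacs code: Python's "surrogateescape" error handler is used as a stand-in
for the eight-bit chars (each byte 0x80..0xFF becomes a placeholder
character that can only turn back into that same byte), and the function
names are hypothetical.

```python
import re

# Analogy only: "surrogateescape" placeholders stand in for Emacs's
# eight-bit-control / eight-bit-graphic chars.  Names are hypothetical.


def to_multibyte(data: bytes) -> str:
    """ASCII stays ASCII; every byte >= 0x80 becomes a raw-byte placeholder."""
    return data.decode("ascii", errors="surrogateescape")


def to_unibyte(text: str) -> bytes:
    """Accept only ASCII and raw-byte placeholders; anything else raises."""
    return text.encode("ascii", errors="surrogateescape")


def real_char_rejected() -> bool:
    """A genuine non-ASCII character must not silently become a byte."""
    try:
        to_unibyte("\u00e9")
    except UnicodeEncodeError:
        return True
    return False


def pattern_mismatch_rejected() -> bool:
    """A character pattern applied to byte text is refused outright."""
    try:
        re.search("a", b"abc")
    except TypeError:
        return True
    return False
```

Bytes survive the round trip unchanged (to_unibyte(to_multibyte(b)) gives
back b for any b), real non-ASCII characters are rejected on the way down,
and the last function shows the same kind of pattern/text mismatch as the
regexp problem mentioned above, except that Python refuses it up front.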


        Stefan