From: Stefan Monnier
Newsgroups: gmane.emacs.devel
Subject: Re: Improvements to `(emacs)File Variables'
Date: Mon, 15 Nov 2004 00:15:05 -0500
Message-ID: <87actj1zba.fsf-monnier+emacs@gnu.org>
References: <878y942l8h.fsf-monnier+emacs@gnu.org>
 <20041114232600.GA12968@fencepost>
 <87r7mw0zik.fsf-monnier+emacs@gnu.org>
 <20041114235558.GA18975@fencepost>
 <87lld40xve.fsf-monnier+emacs@gnu.org>
Cc: Simon Krahnke, emacs-devel@gnu.org
Original-To: Miles Bader
In-Reply-To: (Miles Bader's message of "Mon, 15 Nov 2004 13:53:14 +0900")
User-Agent: Gnus/5.11 (Gnus v5.11) Emacs/21.3.50 (gnu/linux)

> I'm not sure. "Unibyte" as used in emacs seems (to me) to imply several
> things: (1) of course, a single byte per character, (2) the concept of
> strings/buffers whose encoding is "unknown".
>
> If you were to consistently treat (2) as in fact meaning an explicit
> "binary" encoding, maybe it would be useful, but my impression is that
> at least historically, people/code have _not_ always done this, leading
> to lots and lots of confusion. I suppose much of the reason is that
> people want the efficiency gain of (1), and either don't realize the
> problems caused by (2) or think they can kludge around it.
>
> As I've posted before, I think "unibyte" strings/buffers should be only
> an optimization, and should have an explicit (8-bit) encoding associated
> with them, so that any conversions to/from multibyte can automatically
> do the correct thing; one of these encoding could of course be "binary",
> which maybe would allow the historical usage of unibyte to be preserved.

I'd tend to disagree on the idea of associating an encoding with unibyte
buffers.
I think a large part of the problem is that people with a unibyte
background (i.e. latin-1 mostly) typically confuse the notions of
character and byte and mix things up hopelessly.

In Emacs-20, automatic conversion between unibyte and multibyte was
provided mostly as a way to work "correctly" even with confused code
which didn't understand that there are more than 256 characters in this
world. It made sense at the time to avoid alienating too many Emacs
coders.

But to get things right, the first thing we need to do is to make it
very clear that there is no way to automatically convert between unibyte
and multibyte. Such a conversion should only be doable via
(en|de)code-coding-foo functions, thus forcing anyone who wants to go
down that path to actually provide a coding system explicitly and thus
to think about which coding system should be used.

After all, autoconversion can only work for 8-bit encodings, so any code
which uses autoconversion is in one of two possible cases:

1 - the code somehow knows that all the possible encodings it might need
    to use there are 8-bit. Most likely, it's the case where there's
    only ever one encoding used.
2 - the code *doesn't* know, but just assumes (probably without even
    being aware of it) that all encodings are 8-bit. Thus it will break
    if used in China, Japan, ...

Situation 2 is a bug. Situation 1 seems rather unusual.

My conclusion is that autoconversion is harmful.

I've hacked my own local Emacs to "disallow" autoconversion (i.e.
auto-conversion from unibyte to multibyte is allowed and generates
eight-bit-control and eight-bit-graphic chars; auto-conversion from
multibyte to unibyte is allowed but only for ascii, eight-bit-graphic,
and eight-bit-control chars; any other char causes an error). It
actually works fairly well. The main problems I encounter have to do
with regexp matching where the regexp is multibyte and the text is
unibyte.


        Stefan
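
The "disallow autoconversion" rule described above can be sketched by
analogy in Python. This is not Emacs code and not the actual patch: it
assumes Python `bytes` stand in for unibyte text, `str` for multibyte
text, and the lone surrogates produced by the `surrogateescape` error
handler stand in for the eight-bit-control and eight-bit-graphic chars.
The helper names `to_multibyte` and `to_unibyte` are hypothetical.

```python
# Analogy only: bytes ~ unibyte, str ~ multibyte, surrogate escapes ~
# Emacs's eight-bit chars. Helper names are made up for illustration.

def to_multibyte(raw: bytes) -> str:
    """Promote without guessing an encoding: ASCII maps to itself,
    bytes 0x80-0xFF become raw-byte placeholders (U+DC80..U+DCFF)."""
    return raw.decode("ascii", errors="surrogateescape")


def to_unibyte(text: str) -> bytes:
    """Demote only ASCII and raw-byte placeholders; any other
    character raises UnicodeEncodeError."""
    return text.encode("ascii", errors="surrogateescape")


raw = b"caf\xe9"                      # could be latin-1, or not
promoted = to_multibyte(raw)

# The round trip is lossless, and 0xE9 was *not* silently read as latin-1.
assert to_unibyte(promoted) == raw
assert promoted != "caf\u00e9"

# A real non-ASCII character cannot be demoted implicitly.
try:
    to_unibyte("caf\u00e9")
    refused = False
except UnicodeEncodeError:
    refused = True
assert refused

# Interpreting the bytes requires naming a coding system explicitly.
assert raw.decode("latin-1") == "caf\u00e9"

# The "all encodings are 8-bit" assumption (situation 2) fails here:
# two Japanese characters occupy four bytes in Shift_JIS.
sjis = "\u65e5\u672c".encode("shift_jis")
assert len(sjis) == 4
assert sjis.decode("shift_jis") == "\u65e5\u672c"
assert sjis.decode("latin-1") != "\u65e5\u672c"
```

In this sketch, code that really wants characters has to pick a coding
system at the point of conversion, while code that only shuffles bytes
around still round-trips without loss.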