From: Stefan Monnier
Newsgroups: gmane.emacs.devel
Subject: Re: eight-bit char handling in emacs-unicode
Date: 17 Nov 2003 16:17:56 -0500
To: Kenichi Handa
Cc: emacs-devel@gnu.org, jas@extundo.com
In-Reply-To: <200311140047.JAA06414@etlken.m17n.org>

> The basic problem is that we don't distinguish a character
> (code) and a number.  So, we introduce a character object

That's one way to look at the problem.  Another is to say that the
problem is instead that we do not distinguish between arrays of chars
and arrays of bytes.  We just use strings and buffers and expect to be
able to mix bytes and chars in them.  Such mixes are admittedly very
rare for strings, but they're pretty common for buffers.

So when we write 192 at a location, we don't know whether we should
put there the byte 192 or the eight-bit-char character that will be
encoded into a 192 byte.

In Emacs-21 we worked around the problem by arranging for "the
eight-bit-char that encodes to 192" to be represented by the integer
192, so as to avoid having to choose.  But with unicode, the 128-255
zone cannot be dedicated to eight-bit-char since it's already used up
for latin-1, so we have to face the problem more directly.

In the places where Emacs-21 still had to choose, we just used
heuristics, so `concat' will sometimes return a unibyte string and
sometimes a multibyte string.

So I think your options 1-3 are better than 4.

BTW, your function `eight-bit-char' should be named `byte-to-char'
instead.

Which of 1 to 3 is the best is not clear, and maybe we can just live
with `make-string-unibyte' and `make-string-multibyte'.  Note that 1-3
are not mutually exclusive so we can use them all.


        Stefan
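
The byte-versus-character ambiguity around 192 described in the message
can be sketched by analogy in Python, which keeps arrays of bytes
(`bytes`) and arrays of chars (`str`) as separate types.  This is only
an illustration, not Emacs code: the helper name `byte_to_char` is
borrowed from the renaming suggested in the message, its body is
hypothetical, and Python's `surrogateescape` error handler merely stands
in for the idea of an eight-bit-char that encodes back to a single raw
byte.

```python
# Analogy only: Python's bytes/str split, not Emacs internals.

raw = bytes([192])   # array of bytes: the single byte 192 (0xC0)
text = chr(192)      # array of chars: U+00C0, the Latin-1 letter at 192

# The same integer 192 yields different bytes depending on which was meant.
raw_encoded = raw                    # stays one byte, 0xC0
text_encoded = text.encode("utf-8")  # becomes the two bytes 0xC3 0x80


def byte_to_char(b: int) -> str:
    """Hypothetical helper: wrap a byte in a one-character string.

    ASCII bytes map to their ordinary characters.  Bytes in the range
    128-255 map to stand-in characters that are distinct from the
    Latin-1 characters at the same code points and that encode back to
    the original byte.
    """
    return bytes([b]).decode("utf-8", errors="surrogateescape")


def char_to_byte_string(s: str) -> bytes:
    """Inverse direction: encode, turning stand-in characters back into bytes."""
    return s.encode("utf-8", errors="surrogateescape")


stand_in = byte_to_char(192)
round_trip = char_to_byte_string(stand_in)
```

Here `stand_in` differs from `chr(192)` even though both came from the
number 192, and `round_trip` equals the original byte string, which is
the kind of distinction that can no longer be avoided once the 128-255
zone is taken by latin-1.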