From: Kenichi Handa
Newsgroups: gmane.emacs.devel
Subject: Re: eight-bit char handling in emacs-unicode
Date: Tue, 18 Nov 2003 16:33:15 +0900 (JST)
Message-ID: <200311180733.QAA13703@etlken.m17n.org>
References: <200311130153.KAA04615@etlken.m17n.org> <200311130610.PAA04983@etlken.m17n.org> <200311130901.SAA05204@etlken.m17n.org> <200311140047.JAA06414@etlken.m17n.org>
Original-To: monnier@IRO.UMontreal.CA
Cc: emacs-devel@gnu.org, jas@extundo.com
In-reply-to: (message from Stefan Monnier on 17 Nov 2003 16:17:56 -0500)

In article , Stefan Monnier writes:

>> The basic problem is that we don't distinguish a character
>> (code) and a number.  So, we introduce a character object

> That's one way to look at the problem.
> Another is to say that the problem is instead that we do not distinguish
> between arrays of chars and arrays of bytes.

I agree that it's possible to grasp the problem in that way,
but I'm not sure which is the better way.  Could you explain
WHY yours is better?

[...]

> In Emacs-21 we worked around the problem by arranging for "the
> eight-bit-char that encodes to 192" to be represented by the integer 192, so
> as to avoid having to choose.
> But with unicode, the 128-255 zone cannot be
> dedicated to eight-bit-char since it's already used up for latin-1, so we
> have to face the problem more directly.

> The places where Emacs-21 still had to choose, we just used heuristics,
> so `concat' will sometimes return a unibyte string, and sometimes
> multibyte string.

> So I think your options 1-3 are better than 4.  BTW, your function
> `eight-bit-char' should be named `byte-to-char' instead.

> Which of 1 to 3 is the best is not clear, and maybe we can just live with
> `make-string-unibyte' and `make-string-multibyte'.

I think you mean string-make-unibyte/multibyte, but we can't
use them for the current problem, because
string-make-unibyte may behave differently in different
language environments.

A lang. env. that gives iso-8859-1 or Unicode the highest
priority for the character `À' is ok:

    (string-make-unibyte (concat '(?a 192))) = "a\300"

But if some lang. env. prefers a charset for `À' that does
not encode it to 192 (e.g. Vietnamese VSCII), we fail.

> Note that 1-3 are not mutually exclusive so we can use
> them all.

Yes, but, at least, I really want to avoid "(3) Make a
series of new functions".

---
Ken'ichi HANDA
handa@m17n.org