From: Simon Josefsson
Newsgroups: gmane.emacs.devel
Subject: Re: eight-bit char handling in emacs-unicode
Date: Sat, 15 Nov 2003 04:04:05 +0100
References: <200311130153.KAA04615@etlken.m17n.org> <200311130610.PAA04983@etlken.m17n.org> <200311130901.SAA05204@etlken.m17n.org> <200311140047.JAA06414@etlken.m17n.org>
Cc: emacs-devel@gnu.org
To: Kenichi Handa
In-Reply-To: <200311140047.JAA06414@etlken.m17n.org> (Kenichi Handa's message of "Fri, 14 Nov 2003 09:47:51 +0900 (JST)")
User-Agent: Gnus/5.1003 (Gnus v5.10.3) Emacs/21.3.50 (gnu/linux)

Kenichi Handa writes:

> In article , Simon Josefsson writes:
>> rfc2104.el now works, thanks.  But does the fix really have to
>> explicitly mention charsets like iso-latin-1?  Is there no way to
>> handle binary octet strings in emacs-unicode?  Preferably in a
>> portable way, that works on old Emacs versions and on XEmacs.
>
>>> This is a typical problem of emacs-unicode in which
>>> characters 128..255 are valid Unicode characters, thus, for
>>> instance, (concat '(?a ?\300)) returns a multibyte string of
>>> `a' and `À'.  But in the current Emacs, it returns a unibyte
>>> string.
>>>
>>> I suspect the similar fix is necessary in several other
>>> places.
>
>> Having a way to deal with data that is a pure single byte, without
>> involving coding systems, seems like a rather important thing to me.
>
> I agree with you.
> Currently, I can think of these methods:

Can you think of one that would work on Emacs 21?  A stable idiom for
dealing with octets would be useful; forcing third-party packages to
try several methods can easily lead to unreadable code.

> (1) Perhaps the easiest way.
>
>   Check `default-enable-multibyte-characters' or a newly
>   introduced variable `byte-as-byte' to decide whether an
>   integer 128..255 must be treated as a Latin-1 char or a
>   byte.  So,
>     (concat '(?a ?\300)) => "aÀ" (multibyte string)
>     (let ((byte-as-byte t))
>       (concat '(?a ?\300))) => "a\300" (unibyte string)
>
> (2) Introduce a new function `eight-bit-char'.
>
>   It converts an argument to ascii or eight-bit-char.
>     (eight-bit-char ?a) => 97
>     (eight-bit-char ?\300) => 4194240
>   Then,
>     (concat '(?a (eight-bit-char ?\300))) => "a\300"

Both would work for me, although superficially both look like quick
hacks.

> (3) Make a series of new functions (I think it's not good)
>
>   concat vs concat-unibyte
>   string vs string-unibyte
>   aset vs aset-unibyte

I agree it isn't good.

> (4) Most drastic way (the cleanest but requires lots of work)
>
>   The basic problem is that we don't distinguish a character
>   (code) and a number.  So, we introduce a character object
>   (like XEmacs).  The function `character' converts a
>   character code into the corresponding character object.  The
>   lisp reader always generates a character object for ?a,
>   ?\300, etc.  So:
>     (concat '(?a ?\300)) => "aÀ"
>     (concat '(?a #o300)) => "a\300"
>     (concat '(?a (character #o300))) => "aÀ"
>     (concat '(?a #o300 (character #o300))) => "a\300À"
>
>   Note: (character X) == (decode-char 'ucs X)

This would be nice.  Characters aren't numbers (except in the internal
representation, which should be hidden), so separating the two types
is useful.  To be consistent with that, I think your `character'
function should be called `ucs-character' or similar.
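
For illustration only, here is a minimal sketch of the octet versus
character distinction discussed above.  It assumes a much more recent
Emacs that provides `unibyte-string'; that function is not available
on Emacs 21, and it is not one of the four proposals quoted above, so
treat it as an aside rather than an answer to the portability
question.

```elisp
;; Sketch only: assumes a recent Emacs in which `unibyte-string' exists.
;; Not available on Emacs 21, and not one of the proposals above.
(let ((chars  (concat '(?a #o300)))         ; 192 is taken as the character U+00C0
      (octets (unibyte-string ?a #o300)))   ; the same two numbers taken as raw bytes
  (list (multibyte-string-p chars)          ; non-nil: a string of characters
        (multibyte-string-p octets)         ; nil: a string of octets
        (string-bytes octets)))             ; 2: one byte per element
```

The point of the sketch is only that the same pair of integers yields
two different kinds of string depending on whether they are read as
characters or as bytes, which is the ambiguity the proposals try to
resolve.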
>> It started now, but when I enter a summary buffer it crashed:
>
>> Program received signal SIGSEGV, Segmentation fault.
>> 0x081a3c81 in skip_chars (forwardp=1, string=160, lim=36) at syntax.c:1591
>> 1591          char_ranges[n_char_ranges++] = c;
>> (gdb) bt
>> #0  0x081a3c81 in skip_chars (forwardp=1, string=160, lim=36) at syntax.c:1591
>
> I just tried gnus but I couldn't reproduce it.  So, I need
> more help.  Could you show me the results of the following?
>
> (gdb) p n_char_ranges
> (gdb) p c
> (gdb) p string
> (gdb) xstring
> (gdb) p *$

I'll try to find time to test emacs-unicode-2 more, but no promises.

Thanks.