From: Kenichi Handa
Newsgroups: gmane.emacs.devel
Subject: eight-bit char handling in emacs-unicode
Date: Fri, 14 Nov 2003 09:47:51 +0900 (JST)
Message-ID: <200311140047.JAA06414@etlken.m17n.org>
References: <200311130153.KAA04615@etlken.m17n.org> <200311130610.PAA04983@etlken.m17n.org> <200311130901.SAA05204@etlken.m17n.org>
To: jas@extundo.com
Cc: emacs-devel@gnu.org
In-reply-to: (message from Simon Josefsson on Thu, 13 Nov 2003 17:34:14 +0100)
Xref: main.gmane.org gmane.emacs.devel:17811

Simon Josefsson writes:
> rfc2104.el now works, thanks.  But does the fix really have to
> explicitly mention charsets like iso-latin-1?  Is there no way to
> handle binary octet strings in emacs-unicode?  Preferably in a
> portable way, that works on old Emacs versions and on XEmacs.

>> This is a typical problem of emacs-unicode in which
>> characters 128..255 are valid Unicode characters, thus, for
>> instance, (concat '(?a ?\300)) returns a multibyte string of
>> `a' and `À'.  But in the current Emacs, it returns a unibyte
>> string.
>>
>> I suspect the similar fix is necessary in several other
>> places.

> Having a way to deal with data that is a pure single byte, without
> involving coding systems, seems like a rather important thing to me.

I agree with you.
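To make the byte-versus-character distinction concrete, here is a
minimal sketch.  It is only an illustration under assumptions: it
uses `unibyte-string', `unibyte-char-to-multibyte' and
`multibyte-char-to-unibyte' as found in later Emacs releases (23 and
newer), which may not match the emacs-unicode branch discussed here,
and the function name `byte-vs-char-demo' is hypothetical.

```elisp
;; Sketch only: assumes Emacs 23 or later.  `byte-vs-char-demo' is a
;; made-up name used for illustration.
(defun byte-vs-char-demo ()
  "Contrast the octet #o300 with the character whose code is #o300."
  (let* ((byte #o300)                             ; 192, the octet in question
         (raw (unibyte-char-to-multibyte byte))   ; raw-byte char for that octet
         (octets (unibyte-string ?a byte))        ; two octets, unibyte string
         (text (string ?a byte)))                 ; 192 taken as a character
    (list raw                                ; char code standing for the octet
          (multibyte-char-to-unibyte raw)    ; and back to the octet
          (multibyte-string-p octets)        ; octet string is not multibyte
          (multibyte-string-p text))))       ; character string is multibyte

(byte-vs-char-demo)   ; => (4194240 192 nil t)
```

The first element is #x3FFF00 + #o300, that is, a character code
reserved for the raw byte, which is distinct from code 192 for `À'.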
Currently, I can think of these methods:

(1) Perhaps the easiest way.

    Check `default-enable-multibyte-characters' or a newly
    introduced variable `byte-as-byte' to decide whether an
    integer 128..255 must be treated as a Latin-1 char or a byte.
    So,
	(concat '(?a ?\300)) => "aÀ" (multibyte string)
	(let ((byte-as-byte t)) (concat '(?a ?\300))) => "a\300" (unibyte string)

(2) Introduce a new function `eight-bit-char'.

    It converts an argument to an ASCII char or an eight-bit char.
	(eight-bit-char ?a) => 97
	(eight-bit-char ?\300) => 4194240
    Then,
	(concat (list ?a (eight-bit-char ?\300))) => "a\300"

(3) Make a series of new functions (I think it's not good)

	concat vs concat-unibyte
	string vs string-unibyte
	aset vs aset-unibyte

(4) Most drastic way (the cleanest but requires lots of work)

    The basic problem is that we don't distinguish between a
    character (code) and a number.  So, we introduce a character
    object (like XEmacs).  The function `character' converts a
    character code into the corresponding character object.  The
    lisp reader always generates a character object for ?a,
    ?\300, etc.  So:
	(concat '(?a ?\300)) => "aÀ"
	(concat '(?a #o300)) => "a\300"
	(concat (list ?a (character #o300))) => "aÀ"
	(concat (list ?a #o300 (character #o300))) => "a\300À"
    Note: (character X) == (decode-char 'ucs X)

> It started now, but when I enter a summary buffer it crashed:

> Program received signal SIGSEGV, Segmentation fault.
> 0x081a3c81 in skip_chars (forwardp=1, string=160, lim=36) at syntax.c:1591
> 1591                  char_ranges[n_char_ranges++] = c;
> (gdb) bt
> #0  0x081a3c81 in skip_chars (forwardp=1, string=160, lim=36) at syntax.c:1591

I just tried Gnus but I couldn't reproduce it.  So, I need
more help.  Could you show me the results of the following?

(gdb) p n_char_ranges
(gdb) p c
(gdb) p string
(gdb) xstring
(gdb) p *$

---
Ken'ichi HANDA
handa@m17n.org