From mboxrd@z Thu Jan 1 00:00:00 1970 Path: main.gmane.org!not-for-mail From: Kenichi Handa Newsgroups: gmane.emacs.devel Subject: Re: utf-8.el Date: Wed, 19 Jan 2005 15:15:05 +0900 (JST) Message-ID: <200501190615.PAA11950@etlken.m17n.org> References: <200501190251.LAA11194@etlken.m17n.org> <87mzv6avqk.fsf-monnier+emacs@gnu.org> NNTP-Posting-Host: deer.gmane.org Mime-Version: 1.0 (generated by SEMI 1.14.3 - "Ushinoya") Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable X-Trace: sea.gmane.org 1106115686 16802 80.91.229.6 (19 Jan 2005 06:21:26 GMT) X-Complaints-To: usenet@sea.gmane.org NNTP-Posting-Date: Wed, 19 Jan 2005 06:21:26 +0000 (UTC) Cc: emacs-devel@gnu.org Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Wed Jan 19 07:21:18 2005 Return-path: Original-Received: from lists.gnu.org ([199.232.76.165]) by deer.gmane.org with esmtp (Exim 3.35 #1 (Debian)) id 1Cr9DR-0006l6-00 for ; Wed, 19 Jan 2005 07:21:17 +0100 Original-Received: from localhost ([127.0.0.1] helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.43) id 1Cr9L2-0007NW-Ek for ged-emacs-devel@m.gmane.org; Wed, 19 Jan 2005 01:29:08 -0500 Original-Received: from mailman by lists.gnu.org with tmda-scanned (Exim 4.43) id 1Cr9Kl-0007Kz-Ge for emacs-devel@gnu.org; Wed, 19 Jan 2005 01:28:54 -0500 Original-Received: from exim by lists.gnu.org with spam-scanned (Exim 4.43) id 1Cr9KY-0007Ba-D6 for emacs-devel@gnu.org; Wed, 19 Jan 2005 01:28:42 -0500 Original-Received: from [199.232.76.173] (helo=monty-python.gnu.org) by lists.gnu.org with esmtp (Exim 4.43) id 1Cr9KX-0007BB-Ig for emacs-devel@gnu.org; Wed, 19 Jan 2005 01:28:37 -0500 Original-Received: from [192.47.44.130] (helo=tsukuba.m17n.org) by monty-python.gnu.org with esmtp (TLSv1:DES-CBC3-SHA:168) (Exim 4.34) id 1Cr97V-0007Jc-Lt for emacs-devel@gnu.org; Wed, 19 Jan 2005 01:15:10 -0500 Original-Received: from fs.m17n.org (fs.m17n.org [192.47.44.2]) by tsukuba.m17n.org (8.12.3/8.12.3/Debian-7.1) with 
ESMTP id j0J6F6vN017007; Wed, 19 Jan 2005 15:15:07 +0900 Original-Received: from etlken.m17n.org (etlken.m17n.org [192.47.44.125]) by fs.m17n.org (8.11.6p2/8.11.6) with ESMTP id j0J6F6u12123; Wed, 19 Jan 2005 15:15:06 +0900 (JST) Original-Received: (from handa@localhost) by etlken.m17n.org (8.8.8+Sun/3.7W-2001040620) id PAA11950; Wed, 19 Jan 2005 15:15:05 +0900 (JST) Original-To: Stefan Monnier In-reply-to: <87mzv6avqk.fsf-monnier+emacs@gnu.org> (message from Stefan Monnier on Tue, 18 Jan 2005 23:37:10 -0500) User-Agent: SEMI/1.14.3 (Ushinoya) FLIM/1.14.2 (Yagi-Nishiguchi) APEL/10.2 Emacs/21.3.50 (sparc-sun-solaris2.6) MULE/5.0 (SAKAKI) X-BeenThere: emacs-devel@gnu.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: "Emacs development discussions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Xref: main.gmane.org gmane.emacs.devel:32364 X-Report-Spam: http://spam.gmane.org/gmane.emacs.devel:32364 In article <87mzv6avqk.fsf-monnier+emacs@gnu.org>, Stefan Monnier writes: >> subst-tables are not preloaded. They are automatically >> loaded in utf-8-post-read-conversion but it runs after >> ccl-decode-mule-utf-8 is executed. And the arg hash-table >> becomes non-nil only when subst-tables are loaded. > Oh, so the elisp code indeed does the same thing. And that means it's on= ly > really used at most once per Emacs session (since after it's executed, the > hash-table will be active directly in ccl-decode-mule-utf-8). Right? Right except for the case that a user turn utf-translate-cjk-mode off once. >>> I also don't understand the following part of >>> the code: >>> (if (=3D l 2) >>> (put-text-property (point) (min (point-max) (+ l (point))) >>> 'display (format "\\%03o" ch)) >>> (compose-region (point) (+ l (point)) ?=EF=BF=BD)) >>> what does it mean for l (the number of bytes) to be equal to 2? 
>> The docstring of ccl-untranslated-to-ucs is not clear. In >> "Set r1 to the byte length", the byte length means how many >> of r0, r1, r2, r3 (each of them contains a byte) contribute ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ >> to a unicode character (or an invalid byte). "^^^^" part is not accuate. "The first few of them that contribute to a unicode character or an invalid byte contain eight-bit characters (thus are byte values)." > So it's the number of bytes used in the buffer's internal representation > (i.e. emacs-mule), not the number of bytes used in the utf-8 representati= on? No, it's the number of characters. r0..r3 are the same as utf-8-ccl-regs[0]..[3] set by utf-8-untranslated-to-ucs. >> If l is 2, that means an invalid byte was converted to >> two-char sequence of eight-bit-graphic (#xC2 or #xC3) and >> eight-bit-control/graphic. > And that's because any other utf-8 char maps to either a 3-byte sequence > (in a mule-unicode-NNNN-MMMM charset) or if it maps to a 2-byte sequence > (like latin-1) it won't pass through this code anyway? Yes. >> In that case, it is better to >> display that sequence by octal instead of showing ?=EF=BF=BD. > Yes, I understand this part. I just have a hard time following the > reasoning that gets us to the point where we know that (=3D l 2) implies = that > it's a single eight-bit-control or eight-bit-graphic char. Not acculate. As I wrote above, (=3D l 2) implies it's an originally invalid byte represented by 2-byte sequence of eight-bit-graphic and eight-bit-control char. >>> - ;; Can't do eval-when-compile to insert a multibyte constant >>> - ;; version of the string in the loop, since it's always loaded = as >>> - ;; unibyte from a byte-compiled file. >>> - (let ((range (string-as-multibyte "^\xc0-\xc3\xe1-\xf7")) >>> + (let ((range "^\xc0-\xc3\xe1-\xf7") >> This change is not good because range is set to a unibyte >> string and regexp search converts it to a multibyte >> string by `make-multibyte-string'. 
>> What we need here is a
>> multibyte string that contains eight-bit-graphic/control
>> chars.

> I know that's what the comment says, but my tests lead me to believe that
> the comment is not correct and that the string's multibyteness is
> correctly preserved.

Ah!  I'd forgotten that "\x" notation in a string forces the
string to be read as multibyte in the latest emacs.  That
wasn't the case in 21.3.  So, yes, now your change is ok.

---
Ken'ichi HANDA
handa@m17n.org