From mboxrd@z Thu Jan  1 00:00:00 1970
Path: main.gmane.org!not-for-mail
From: Kenichi Handa <handa@m17n.org>
Newsgroups: gmane.emacs.devel
Subject: Re: utf-8.el
Date: Wed, 19 Jan 2005 15:15:05 +0900 (JST)
Message-ID: <200501190615.PAA11950@etlken.m17n.org>
References: <jwvpt02zp5h.fsf-monnier+emacs@gnu.org>	<200501190251.LAA11194@etlken.m17n.org>
	<87mzv6avqk.fsf-monnier+emacs@gnu.org>
NNTP-Posting-Host: deer.gmane.org
Mime-Version: 1.0 (generated by SEMI 1.14.3 - "Ushinoya")
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
X-Trace: sea.gmane.org 1106115686 16802 80.91.229.6 (19 Jan 2005 06:21:26 GMT)
X-Complaints-To: usenet@sea.gmane.org
NNTP-Posting-Date: Wed, 19 Jan 2005 06:21:26 +0000 (UTC)
Cc: emacs-devel@gnu.org
Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Wed Jan 19 07:21:18 2005
Return-path: <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>
Original-Received: from lists.gnu.org ([199.232.76.165])
	by deer.gmane.org with esmtp (Exim 3.35 #1 (Debian))
	id 1Cr9DR-0006l6-00
	for <ged-emacs-devel@m.gmane.org>; Wed, 19 Jan 2005 07:21:17 +0100
Original-Received: from localhost ([127.0.0.1] helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43)
	id 1Cr9L2-0007NW-Ek
	for ged-emacs-devel@m.gmane.org; Wed, 19 Jan 2005 01:29:08 -0500
Original-Received: from mailman by lists.gnu.org with tmda-scanned (Exim 4.43)
	id 1Cr9Kl-0007Kz-Ge
	for emacs-devel@gnu.org; Wed, 19 Jan 2005 01:28:54 -0500
Original-Received: from exim by lists.gnu.org with spam-scanned (Exim 4.43)
	id 1Cr9KY-0007Ba-D6
	for emacs-devel@gnu.org; Wed, 19 Jan 2005 01:28:42 -0500
Original-Received: from [199.232.76.173] (helo=monty-python.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43) id 1Cr9KX-0007BB-Ig
	for emacs-devel@gnu.org; Wed, 19 Jan 2005 01:28:37 -0500
Original-Received: from [192.47.44.130] (helo=tsukuba.m17n.org)
	by monty-python.gnu.org with esmtp (TLSv1:DES-CBC3-SHA:168)
	(Exim 4.34) id 1Cr97V-0007Jc-Lt
	for emacs-devel@gnu.org; Wed, 19 Jan 2005 01:15:10 -0500
Original-Received: from fs.m17n.org (fs.m17n.org [192.47.44.2])
	by tsukuba.m17n.org (8.12.3/8.12.3/Debian-7.1) with ESMTP id
	j0J6F6vN017007; Wed, 19 Jan 2005 15:15:07 +0900
Original-Received: from etlken.m17n.org (etlken.m17n.org [192.47.44.125])
	by fs.m17n.org (8.11.6p2/8.11.6) with ESMTP id j0J6F6u12123;
	Wed, 19 Jan 2005 15:15:06 +0900 (JST)
Original-Received: (from handa@localhost)
	by etlken.m17n.org (8.8.8+Sun/3.7W-2001040620) id PAA11950;
	Wed, 19 Jan 2005 15:15:05 +0900 (JST)
Original-To: Stefan Monnier <monnier@iro.umontreal.ca>
In-reply-to: <87mzv6avqk.fsf-monnier+emacs@gnu.org> (message from Stefan
	Monnier on Tue, 18 Jan 2005 23:37:10 -0500)
User-Agent: SEMI/1.14.3 (Ushinoya) FLIM/1.14.2 (Yagi-Nishiguchi) APEL/10.2
	Emacs/21.3.50 (sparc-sun-solaris2.6) MULE/5.0 (SAKAKI)
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: "Emacs development discussions." <emacs-devel.gnu.org>
List-Unsubscribe: <http://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/pipermail/emacs-devel>
List-Post: <mailto:emacs-devel@gnu.org>
List-Help: <mailto:emacs-devel-request@gnu.org?subject=help>
List-Subscribe: <http://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=subscribe>
Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Xref: main.gmane.org gmane.emacs.devel:32364
X-Report-Spam: http://spam.gmane.org/gmane.emacs.devel:32364

In article <87mzv6avqk.fsf-monnier+emacs@gnu.org>,
Stefan Monnier <monnier@iro.umontreal.ca> writes:

>>  subst-tables are not preloaded.  They are automatically
>>  loaded in utf-8-post-read-conversion but it runs after
>>  ccl-decode-mule-utf-8 is executed.  And the arg hash-table
>>  becomes non-nil only when subst-tables are loaded.

> Oh, so the elisp code indeed does the same thing.  And that means it's only
> really used at most once per Emacs session (since after it's executed, the
> hash-table will be active directly in ccl-decode-mule-utf-8).  Right?

Right, except for the case where a user has turned
utf-translate-cjk-mode off once.

>>>  I also don't understand the following part of
>>>  the code:

>>>  (if (=3D l 2)
>>>  (put-text-property (point) (min (point-max) (+ l (point)))
>>>  'display (format "\\%03o" ch))
>>>  (compose-region (point) (+ l (point)) ?=EF=BF=BD))

>>>  what does it mean for l (the number of bytes) to be equal to 2?

>>  The docstring of ccl-untranslated-to-ucs is not clear.  In
>>  "Set r1 to the byte length", the byte length means how many
>>  of r0, r1, r2, r3 (each of them contains a byte) contribute
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>  to a unicode character (or an invalid byte).

The "^^^^" part is not accurate.  More precisely: "The first few
of them that contribute to a unicode character or an invalid byte
contain eight-bit characters (thus are byte values)."

> So it's the number of bytes used in the buffer's internal representation
> (i.e. emacs-mule), not the number of bytes used in the utf-8 representation?

No, it's the number of characters.  r0..r3 are the same as
utf-8-ccl-regs[0]..[3] set by utf-8-untranslated-to-ucs.

>>  If l is 2, that means an invalid byte was converted to
>>  two-char sequence of eight-bit-graphic (#xC2 or #xC3) and
>>  eight-bit-control/graphic.

> And that's because any other utf-8 char maps to either a 3-byte sequence
> (in a mule-unicode-NNNN-MMMM charset) or if it maps to a 2-byte sequence
> (like latin-1) it won't pass through this code anyway?

Yes.

>>  In that case, it is better to
>>  display that sequence by octal instead of showing ?=EF=BF=BD.

> Yes, I understand this part.  I just have a hard time following the
> reasoning that gets us to the point where we know that (= l 2) implies that
> it's a single eight-bit-control or eight-bit-graphic char.

Not accurate.  As I wrote above, (= l 2) implies it's an
originally invalid byte represented by a 2-byte sequence of
an eight-bit-graphic char and an eight-bit-control char.
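
As a minimal sketch of the octal display mentioned above: the helper
name below is hypothetical (it is not part of utf-8.el), and it assumes
`ch' holds the value of the original invalid byte.  It only shows what
string the (format "\\%03o" ch) call in the quoted code produces.

```elisp
;; Hypothetical helper, not part of utf-8.el.  BYTE is assumed to be
;; the value of the original invalid byte (an integer 0..255).
(defun my-utf-8-octal-display (byte)
  "Return a backslash followed by BYTE written as three octal digits."
  (format "\\%03o" byte))

;; #xC0 is 300 in octal, so this yields a backslash followed by "300".
(my-utf-8-octal-display #xC0)
;; #x80 is 200 in octal, so this yields a backslash followed by "200".
(my-utf-8-octal-display #x80)
```

That string is what the `display' text property shows in place of the
two-char sequence.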

>>>  -      ;; Can't do eval-when-compile to insert a multibyte constant
>>>  -      ;; version of the string in the loop, since it's always loaded as
>>>  -      ;; unibyte from a byte-compiled file.
>>>  -      (let ((range (string-as-multibyte "^\xc0-\xc3\xe1-\xf7"))
>>>  +      (let ((range "^\xc0-\xc3\xe1-\xf7")

>>  This change is not good because range is set to a unibyte
>>  string and regexp search converts it to a multibyte
>>  string by `make-multibyte-string'.  Here what we need is a
>>  multibyte string that contains eight-bit-graphic/control
>>  chars.

> I know that's what the comment says, but my tests lead me to believe that
> the comment is not correct and that the string's multibyteness is
> correctly preserved.

Ah!  I'd forgotten that the "\x" notation in a string forces
the string to be read as multibyte in the latest emacs.  That
wasn't the case in 21.3.

So, yes, now your change is ok.
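
For anyone who wants to check this on their own Emacs, here is a small
sketch.  The result of the first form depends on how your Emacs version
reads "\x" escapes, so no result is given for it.  The second form uses
`string-to-multibyte' (an assumption on my part that it is available in
your Emacs; the original code used `string-as-multibyte') to get a
multibyte string however the literal was read.

```elisp
;; Sketch only: see how this Emacs reads the literal from the patch.
;; The value returned here is version-dependent.
(let ((range "^\xc0-\xc3\xe1-\xf7"))
  (multibyte-string-p range))

;; Convert explicitly.  `string-to-multibyte' turns each byte >= #x80
;; of a unibyte string into an eight-bit char, and returns a multibyte
;; string unchanged, so the result here is expected to be multibyte.
(let ((range (string-to-multibyte "^\xc0-\xc3\xe1-\xf7")))
  (multibyte-string-p range))
```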

---
Ken'ichi HANDA
handa@m17n.org