From mboxrd@z Thu Jan  1 00:00:00 1970
Path: main.gmane.org!not-for-mail
From: Stefan Monnier <monnier@iro.umontreal.ca>
Newsgroups: gmane.emacs.devel
Subject: Re: utf-8.el
Date: Tue, 18 Jan 2005 23:37:10 -0500
Message-ID: <87mzv6avqk.fsf-monnier+emacs@gnu.org>
References: <jwvpt02zp5h.fsf-monnier+emacs@gnu.org>
	<200501190251.LAA11194@etlken.m17n.org>
NNTP-Posting-Host: deer.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
X-Trace: sea.gmane.org 1106110124 6527 80.91.229.6 (19 Jan 2005 04:48:44 GMT)
X-Complaints-To: usenet@sea.gmane.org
NNTP-Posting-Date: Wed, 19 Jan 2005 04:48:44 +0000 (UTC)
Cc: emacs-devel@gnu.org
Original-To: Kenichi Handa <handa@m17n.org>
In-Reply-To: <200501190251.LAA11194@etlken.m17n.org> (Kenichi Handa's
	message of "Wed, 19 Jan 2005 11:51:14 +0900 (JST)")
User-Agent: Gnus/5.11 (Gnus v5.11) Emacs/21.3.50 (gnu/linux)

>> Also, could anyone confirm that the docstring of mule-utf-8 is correct in
>> saying that invalid utf-8 sequences are not always correctly preserved?
>> Why is that?  Can't we fix it?

> I remember I fixed ccl-mule-utf-8-encode-untrans to preserve
> invalid utf-8 sequence as far as possible.  So perhaps the
> current version preserves even invalid sequence correctly.

That's also what I remembered, which is why I asked.

>> Also could anyone explain to me why `utf-8-compose' needs to lookup the
>> hashtable (get 'utf-subst-table-for-decode 'translation-hash-table),
>> since it looks to me like ccl-decode-mule-utf-8 already takes care of
>> decoding chars that are in this table.

> subst-tables are not preloaded.  They are automatically
> loaded in utf-8-post-read-conversion but it runs after
> ccl-decode-mule-utf-8 is executed.  And the arg hash-table
> becomes non-nil only when subst-tables are loaded.

Oh, so the elisp code indeed does the same thing.  And that means it's only
really used at most once per Emacs session (since after it's executed, the
hash-table will be active directly in ccl-decode-mule-utf-8).  Right?

>> I also don't understand the following part of
>> the code:

>> (if (=3D l 2)
>> (put-text-property (point) (min (point-max) (+ l (point)))
>> 'display (format "\\%03o" ch))
>> (compose-region (point) (+ l (point)) ?=EF=BF=BD))

>> what does it mean for l (the number of bytes) to be equal to 2?

> The docstring of ccl-untranslated-to-ucs is not clear.  In
> "Set r1 to the byte length", the byte length means how many
> of r0, r1, r2, r3 (each of them contains a byte) contribute
> to a unicode character (or an invalid byte).

So it's the number of bytes used in the buffer's internal representation
(i.e. emacs-mule), not the number of bytes used in the utf-8 representation?

> If l is 2, that means an invalid byte was converted to
> two-char sequence of eight-bit-graphic (#xC2 or #xC3) and
> eight-bit-control/graphic.

And that's because any other utf-8 char either maps to a 3-byte sequence
(in a mule-unicode-NNNN-MMMM charset) or, if it maps to a 2-byte sequence
(like latin-1), won't pass through this code anyway?

> In that case, it is better to
> display that sequence by octal instead of showing ?=EF=BF=BD.

Yes, I understand this part.  I just have a hard time following the
reasoning that gets us to the point where we know that (= l 2) implies
that it's a single eight-bit-control or eight-bit-graphic char.
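
For concreteness, here is a minimal sketch of what the octal branch of the
quoted code puts on the text.  The helper name `my-octal-display-string' and
the two-character buffer contents are made up for illustration; only
`format', `put-text-property' and `get-text-property' are the real
functions.

```elisp
;; Hypothetical helper: builds the same display string as the quoted
;; (format "\\%03o" ch) call.
(defun my-octal-display-string (byte)
  "Return a backslash followed by BYTE as three octal digits."
  (format "\\%03o" byte))

(with-temp-buffer
  ;; Two placeholder characters stand in for the 2-char sequence.
  (insert "ab")
  ;; Cover both characters with one `display' property, as in the (= l 2) case.
  (put-text-property 1 3 'display (my-octal-display-string #xc0))
  ;; Evaluates to the 4-character string \300.
  (get-text-property 1 'display))
```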

>> -      ;; Can't do eval-when-compile to insert a multibyte constant
>> -      ;; version of the string in the loop, since it's always loaded as
>> -      ;; unibyte from a byte-compiled file.
>> -      (let ((range (string-as-multibyte "^\xc0-\xc3\xe1-\xf7"))
>> +      (let ((range "^\xc0-\xc3\xe1-\xf7")

> This change is not good because range is set to a unibyte
> string and regexp search converts it to a multibyte
> string by `make-multibyte-string'.  Here what we need is a
> multibyte string that contains eight-bit-graphic/control
> chars.

I know that's what the comment says, but my tests lead me to believe that
the comment is not correct and that the string's multibyteness is
correctly preserved.

> Anyway it is better to change string-as-multibyte to string-to-multibyte.

Indeed.
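
To make the unibyte/multibyte distinction concrete, here is a minimal
sketch, assuming an Emacs that provides `string-to-multibyte'.  The
`my-utf-8-range-*' names are hypothetical; it only checks how the literal
is read in source form and says nothing about what happens after
byte-compilation, which is the open question above.

```elisp
;; Hypothetical names; `string-to-multibyte' and `multibyte-string-p'
;; are the real built-ins.
(defconst my-utf-8-range-unibyte "^\xc0-\xc3\xe1-\xf7"
  "Range as written: every escape is below 256, so the reader makes it unibyte.")

(defconst my-utf-8-range-multibyte
  (string-to-multibyte my-utf-8-range-unibyte)
  "Same string, with each byte >= #x80 converted to an eight-bit char.")

;; Unibyte-ness of the literal, multibyte-ness of the converted string,
;; and whether the character count is unchanged.
(list (multibyte-string-p my-utf-8-range-unibyte)
      (multibyte-string-p my-utf-8-range-multibyte)
      (= (length my-utf-8-range-unibyte)
         (length my-utf-8-range-multibyte)))
;; => (nil t t)
```

`string-to-multibyte' converts each byte on its own, whereas
`string-as-multibyte' reinterprets the byte sequence as internal multibyte
text, so adjacent bytes may combine into one character.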


        Stefan