From: Kenichi Handa <handa@m17n.org>
Cc: emacs-devel@gnu.org
Subject: Re: utf-8.el
Date: Wed, 19 Jan 2005 15:15:05 +0900 (JST)
Message-ID: <200501190615.PAA11950@etlken.m17n.org>
In-Reply-To: <87mzv6avqk.fsf-monnier+emacs@gnu.org> (message from Stefan Monnier on Tue, 18 Jan 2005 23:37:10 -0500)
In article <87mzv6avqk.fsf-monnier+emacs@gnu.org>, Stefan Monnier <monnier@iro.umontreal.ca> writes:
>> subst-tables are not preloaded. They are automatically
>> loaded in utf-8-post-read-conversion but it runs after
>> ccl-decode-mule-utf-8 is executed. And the arg hash-table
>> becomes non-nil only when subst-tables are loaded.
> Oh, so the elisp code indeed does the same thing. And that means it's only
> really used at most once per Emacs session (since after it's executed, the
> hash-table will be active directly in ccl-decode-mule-utf-8). Right?
Right, except in the case where a user turns
utf-translate-cjk-mode off once.
>>> I also don't understand the following part of
>>> the code:
>>> (if (= l 2)
>>>     (put-text-property (point) (min (point-max) (+ l (point)))
>>>                        'display (format "\\%03o" ch))
>>>   (compose-region (point) (+ l (point)) ?�))
>>> what does it mean for l (the number of bytes) to be equal to 2?
>> The docstring of ccl-untranslated-to-ucs is not clear. In
>> "Set r1 to the byte length", the byte length means how many
>> of r0, r1, r2, r3 (each of them contains a byte) contribute
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> to a unicode character (or an invalid byte).
The "^^^^" part is not accurate. It should read: "The first few of
them that contribute to a unicode character or an invalid byte contain
eight-bit characters (thus are byte values)."
> So it's the number of bytes used in the buffer's internal representation
> (i.e. emacs-mule), not the number of bytes used in the utf-8 representation?
No, it's the number of characters. r0..r3 are the same as
utf-8-ccl-regs[0]..[3] set by utf-8-untranslated-to-ucs.
>> If l is 2, that means an invalid byte was converted to
>> two-char sequence of eight-bit-graphic (#xC2 or #xC3) and
>> eight-bit-control/graphic.
> And that's because any other utf-8 char maps to either a 3-byte sequence
> (in a mule-unicode-NNNN-MMMM charset) or if it maps to a 2-byte sequence
> (like latin-1) it won't pass through this code anyway?
Yes.
>> In that case, it is better to
>> display that sequence by octal instead of showing ?�.
> Yes, I understand this part. I just have a hard time following the
> reasoning that gets us to the point where we know that (= l 2) implies that
> it's a single eight-bit-control or eight-bit-graphic char.
Not accurate. As I wrote above, (= l 2) implies it's an
originally invalid byte represented by a 2-byte sequence of
eight-bit-graphic and eight-bit-control chars.
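
The octal display used in the (= l 2) case can be tried in isolation.
The following is only a sketch: the helper name octal-display-demo and
the placeholder buffer text are assumptions for illustration, not code
from utf-8.el. Only the put-text-property form mirrors the snippet
quoted above.

```elisp
;; Hypothetical helper, not part of utf-8.el.
(defun octal-display-demo (ch)
  "Return the `display' property set for a 2-char run standing for byte CH."
  (with-temp-buffer
    ;; Two placeholder chars stand in for the eight-bit pair.
    (insert "ab")
    (goto-char (point-min))
    (let ((l 2))
      ;; Same form as in the quoted code: show the run as a backslash
      ;; followed by three octal digits.
      (put-text-property (point) (min (point-max) (+ l (point)))
                         'display (format "\\%03o" ch)))
    (get-text-property (point-min) 'display)))

(octal-display-demo #x80)   ; => "\\200" (a backslash followed by 200)
```

With the property in place, the two-char run is shown as a single
octal escape rather than as the composed ?� glyph.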
>>> - ;; Can't do eval-when-compile to insert a multibyte constant
>>> - ;; version of the string in the loop, since it's always loaded as
>>> - ;; unibyte from a byte-compiled file.
>>> - (let ((range (string-as-multibyte "^\xc0-\xc3\xe1-\xf7"))
>>> + (let ((range "^\xc0-\xc3\xe1-\xf7")
>> This change is not good because range is set to a unibyte
>> string and regexp search converts it to a multibyte
>> string by `make-multibyte-string'. Here what we need is a
>> multibyte string that contains eight-bit-graphic/control
>> chars.
> I know that's what the comment says, but my tests lead me to believe that
> the comment is not correct and that the string's multibyteness is
> correctly preserved.
Ah! I'd forgotten that "\x" notation in a string forces
the string to be read as multibyte in the latest Emacs. That
wasn't the case in 21.3.
So, yes, now your change is ok.
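
Whether a given build reads such a literal as multibyte can be checked
directly. This is a sketch only: the function name is hypothetical,
and the result is deliberately not stated here because, as noted
above, it depends on the Emacs version.

```elisp
;; Hypothetical helper for checking how the reader treats "\x" escapes.
(defun range-literal-multibyte-p ()
  "Return non-nil if the range literal is read as a multibyte string."
  (multibyte-string-p "^\xc0-\xc3\xe1-\xf7"))

(range-literal-multibyte-p)
```

If this returns nil on some build, the string is unibyte there and the
concern about conversion during regexp search applies again.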
---
Ken'ichi HANDA
handa@m17n.org