Re: utf-8.el - Kenichi Handa

From: Kenichi Handa <handa@m17n.org>
Cc: emacs-devel@gnu.org
Subject: Re: utf-8.el
Date: Wed, 19 Jan 2005 15:15:05 +0900 (JST)
Message-ID: <200501190615.PAA11950@etlken.m17n.org>
In-Reply-To: <87mzv6avqk.fsf-monnier+emacs@gnu.org> (message from Stefan Monnier on Tue, 18 Jan 2005 23:37:10 -0500)

In article <87mzv6avqk.fsf-monnier+emacs@gnu.org>, Stefan Monnier <monnier@iro.umontreal.ca> writes:

>>  subst-tables are not preloaded.  They are automatically
>>  loaded in utf-8-post-read-conversion but it runs after
>>  ccl-decode-mule-utf-8 is executed.  And the arg hash-table
>>  becomes non-nil only when subst-tables are loaded.

> Oh, so the elisp code indeed does the same thing.  And that means it's only
> really used at most once per Emacs session (since after it's executed, the
> hash-table will be active directly in ccl-decode-mule-utf-8).  Right?

Right, except for the case where a user has turned
utf-translate-cjk-mode off once.

>>>  I also don't understand the following part of
>>>  the code:

>>>  (if (= l 2)
>>>  (put-text-property (point) (min (point-max) (+ l (point)))
>>>  'display (format "\\%03o" ch))
>>>  (compose-region (point) (+ l (point)) ?�))

>>>  what does it mean for l (the number of bytes) to be equal to 2?

>>  The docstring of ccl-untranslated-to-ucs is not clear.  In
>>  "Set r1 to the byte length", the byte length means how many
>>  of r0, r1, r2, r3 (each of them contains a byte) contribute
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>  to a unicode character (or an invalid byte).

The part marked "^^^^" is not accurate.  It should read: "The
first few of them that contribute to a unicode character or an
invalid byte contain eight-bit characters (thus are byte values)."

> So it's the number of bytes used in the buffer's internal representation
> (i.e. emacs-mule), not the number of bytes used in the utf-8 representation?

No, it's the number of characters.  r0..r3 are the same as
utf-8-ccl-regs[0]..[3] set by utf-8-untranslated-to-ucs.

>>  If l is 2, that means an invalid byte was converted to
>>  two-char sequence of eight-bit-graphic (#xC2 or #xC3) and
>>  eight-bit-control/graphic.

> And that's because any other utf-8 char maps to either a 3-byte sequence
> (in a mule-unicode-NNNN-MMMM charset) or if it maps to a 2-byte sequence
> (like latin-1) it won't pass through this code anyway?

Yes.

>>  In that case, it is better to
>>  display that sequence by octal instead of showing ?�.

> Yes, I understand this part.  I just have a hard time following the
> reasoning that gets us to the point where we know that (= l 2) implies that
> it's a single eight-bit-control or eight-bit-graphic char.

Not accurate.  As I wrote above, (= l 2) implies it's an
originally invalid byte represented by a 2-byte sequence of
eight-bit-graphic and eight-bit-control chars.
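
As a small illustration of the octal display discussed above, here is a
minimal sketch of what the `display' property ends up holding for an
invalid byte.  The helper name my-octal-display-string is hypothetical
and is not a function in utf-8.el; only the `format' call is taken from
the quoted code.

```elisp
;; Hypothetical helper for illustration only; not part of utf-8.el.
(defun my-octal-display-string (byte)
  "Return a backslash followed by the three-digit octal form of BYTE."
  ;; Same format string as in the quoted code: a literal backslash,
  ;; then BYTE printed as zero-padded three-digit octal.
  (format "\\%03o" byte))

;; Attach the string as a `display' property in a scratch buffer and
;; read it back, mirroring the `put-text-property' call quoted above.
(with-temp-buffer
  (insert "ab")
  (put-text-property (point-min) (point-max)
                     'display (my-octal-display-string #xe9))
  (get-text-property (point-min) 'display))
;; The property value is the four-character string \351.
```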

>>>  -      ;; Can't do eval-when-compile to insert a multibyte constant
>>>  -      ;; version of the string in the loop, since it's always loaded as
>>>  -      ;; unibyte from a byte-compiled file.
>>>  -      (let ((range (string-as-multibyte "^\xc0-\xc3\xe1-\xf7"))
>>>  +      (let ((range "^\xc0-\xc3\xe1-\xf7")

>>  This change is not good because range is set to a unibyte
>>  string and regexp search converts it to a multibyte
>>  string by `make-multibyte-string'.  Here what we need is a
>>  multibyte string that contains eight-bit-graphic/control
>>  chars.

> I know that's what the comment says, but my tests lead me to believe that
> the comment is not correct and that the string's multibyteness is
> correctly preserved.

Ah!  I had forgotten that "\x" notation in a string forces
the string to be read as multibyte in the latest Emacs.  That
was not the case in 21.3.

So, yes, now your change is ok.
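
For anyone who wants to check this on their own build, here is a
minimal sketch that does not depend on how the reader treats "\x"
escapes.  It assumes a build that provides `string-to-multibyte'; the
helper name my-eight-bit-range is hypothetical and is not in utf-8.el.

```elisp
;; Hypothetical helper for illustration only; not part of utf-8.el.
(defun my-eight-bit-range ()
  "Return the range string as multibyte, however the reader read it."
  ;; `string-to-multibyte' converts each 8-bit byte of a unibyte string
  ;; to the corresponding eight-bit character, and returns a string
  ;; that is already multibyte as is.
  (string-to-multibyte "^\xc0-\xc3\xe1-\xf7"))

;; The literal itself may be unibyte or multibyte depending on the
;; Emacs version; the converted string is multibyte in either case.
(multibyte-string-p "^\xc0-\xc3\xe1-\xf7")
(multibyte-string-p (my-eight-bit-range))
```

The first form shows what the reader did with the bare literal on the
build at hand; the second shows that the converted range is multibyte
regardless.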

---
Ken'ichi HANDA
handa@m17n.org
