utf-8.el

* utf-8.el
@ 2005-01-18 16:37 Stefan Monnier
  2005-01-19  2:51 ` utf-8.el Kenichi Handa
  0 siblings, 1 reply; 10+ messages in thread
From: Stefan Monnier @ 2005-01-18 16:37 UTC


Does anyone see a problem with the simple patch below?
Also, could anyone confirm that the docstring of mule-utf-8 is correct in
saying that invalid utf-8 sequences are not always correctly preserved?
Why is that?  Can't we fix it?

Also could anyone explain to me why `utf-8-compose' needs to lookup the
hashtable (get 'utf-subst-table-for-decode 'translation-hash-table), since
it looks to me like ccl-decode-mule-utf-8 already takes care of decoding
chars that are in this table.  I also don't understand the following part of
the code:

	  (if (= l 2)
	      (put-text-property (point) (min (point-max) (+ l (point)))
				 'display (format "\\%03o" ch))
	    (compose-region (point) (+ l (point)) ?�))

what does it mean for l (the number of bytes) to be equal to 2?


        Stefan


--- orig/lisp/international/utf-8.el
+++ mod/lisp/international/utf-8.el
@@ -2,7 +2,7 @@
 
 ;; Copyright (C) 2001, 2004 Electrotechnical Laboratory, JAPAN.
 ;; Licensed to the Free Software Foundation.
-;; Copyright (C) 2001, 2002 Free Software Foundation, Inc.
+;; Copyright (C) 2001, 2002, 2005  Free Software Foundation, Inc.
 
 ;; Author: TAKAHASHI Naoto  <ntakahas@m17n.org>
 ;; Maintainer: FSF
@@ -259,7 +259,7 @@
 				 (funcall decode-char-no-trans (car x))
 				 (funcall decode-char-no-trans (cdr x))))
 		     ranges "")))
-  ;; These forces loading and settting tables for
+  ;; This forces loading and setting tables for
   ;; utf-translate-cjk-mode.
   (setq utf-translate-cjk-lang-env nil
 	ucs-mule-cjk-to-unicode (make-hash-table :test 'eq)
@@ -951,10 +951,7 @@
   (save-excursion
     (save-restriction
       (narrow-to-region (point) (+ (point) length))
-      ;; Can't do eval-when-compile to insert a multibyte constant
-      ;; version of the string in the loop, since it's always loaded as
-      ;; unibyte from a byte-compiled file.
-      (let ((range (string-as-multibyte "^\xc0-\xc3\xe1-\xf7"))
+      (let ((range "^\xc0-\xc3\xe1-\xf7")
 	    (buffer-multibyte enable-multibyte-characters)
 	    hash-table ch)
 	(set-buffer-multibyte t)
@@ -1036,8 +1033,7 @@
     mule-unicode-0100-24ff
     mule-unicode-2500-33ff
     mule-unicode-e000-ffff
-    ,@(if utf-translate-cjk-mode
-	  utf-translate-cjk-charsets))
+    ,@utf-translate-cjk-charsets)
    (mime-charset . utf-8)
    (coding-category . coding-category-utf-8)
    (valid-codes (0 . 255))
@@ -1054,23 +1050,23 @@
 ;; I think this needs special private charsets defined for the
 ;; untranslated sequences, if it's going to work well.
 
-;;; (defun utf-8-compose-function (pos to pattern &optional string)
-;;;   (let* ((prop (get-char-property pos 'composition string))
-;;; 	 (l (and prop (- (cadr prop) (car prop)))))
-;;;     (cond ((and l (> l (- to pos)))
-;;; 	   (delete-region pos to))
-;;; 	  ((and (> (char-after pos) 224)
-;;; 		(< (char-after pos) 256)
-;;; 		(save-restriction
-;;; 		  (narrow-to-region pos to)
-;;; 		  (utf-8-compose)))
-;;; 	   t))))
-
-;;; (dotimes (i 96)
-;;;   (aset composition-function-table
-;;; 	(+ 128 i)
-;;; 	`((,(string-as-multibyte "[\200-\237\240-\377]")
-;;; 	   . utf-8-compose-function))))
+;; (defun utf-8-compose-function (pos to pattern &optional string)
+;;   (let* ((prop (get-char-property pos 'composition string))
+;; 	 (l (and prop (- (cadr prop) (car prop)))))
+;;     (cond ((and l (> l (- to pos)))
+;; 	   (delete-region pos to))
+;; 	  ((and (> (char-after pos) 224)
+;; 		(< (char-after pos) 256)
+;; 		(save-restriction
+;; 		  (narrow-to-region pos to)
+;; 		  (utf-8-compose)))
+;; 	   t))))
+
+;; (dotimes (i 96)
+;;   (aset composition-function-table
+;; 	(+ 128 i)
+;; 	`((,(string-as-multibyte "[\200-\237\240-\377]")
+;; 	   . utf-8-compose-function))))
 
 ;; arch-tag: b08735b7-753b-4ae6-b754-0f3efe4515c5
 ;;; utf-8.el ends here

* Re: utf-8.el
  2005-01-18 16:37 utf-8.el Stefan Monnier
@ 2005-01-19  2:51 ` Kenichi Handa
  2005-01-19  4:37   ` utf-8.el Stefan Monnier
  2005-01-19 10:51   ` utf-8.el Andreas Schwab
  0 siblings, 2 replies; 10+ messages in thread
From: Kenichi Handa @ 2005-01-19  2:51 UTC
  Cc: emacs-devel

In article <jwvpt02zp5h.fsf-monnier+emacs@gnu.org>, Stefan Monnier <monnier@iro.umontreal.ca> writes:

> Does anyone see a problem with the simple patch below?

See the comment below.

> Also, could anyone confirm that the docstring of mule-utf-8 is correct in
> saying that invalid utf-8 sequences are not always correctly preserved?
> Why is that?  Can't we fix it?

I remember I fixed ccl-mule-utf-8-encode-untrans to preserve
invalid utf-8 sequences as far as possible.  So perhaps the
current version preserves even invalid sequences correctly.

I've just run this code for a fairly long time and saw no error.

(defun temp ()
  (let ((count 0))
    (while t
      (setq count (1+ count))
      (message "%d" count)
      (let* ((len (+ 6 (random 6)))
	     (str (make-string len 0)))
	(dotimes (i len)
	  (aset str i (+ 128 (random 128))))
	(or (equal str
		   (encode-coding-string
		    (decode-coding-string str 'utf-8) 'utf-8))
	    (error "%s caused error" (setq error-string str)))))))

> Also could anyone explain to me why `utf-8-compose' needs to lookup the
> hashtable (get 'utf-subst-table-for-decode 'translation-hash-table), since
> it looks to me like ccl-decode-mule-utf-8 already takes care of decoding
> chars that are in this table.

subst-tables are not preloaded.  They are automatically
loaded in utf-8-post-read-conversion but it runs after
ccl-decode-mule-utf-8 is executed.  And the arg hash-table
becomes non-nil only when subst-tables are loaded.

> I also don't understand the following part of
> the code:

> 	  (if (= l 2)
> 	      (put-text-property (point) (min (point-max) (+ l (point)))
> 				 'display (format "\\%03o" ch))
> 	    (compose-region (point) (+ l (point)) ?�))

> what does it mean for l (the number of bytes) to be equal to 2?

The docstring of ccl-untranslated-to-ucs is not clear.  In
"Set r1 to the byte length", the byte length means how many
of r0, r1, r2, r3 (each of them contains a byte) contribute
to a unicode character (or an invalid byte).

If l is 2, that means an invalid byte was converted to
two-char sequence of eight-bit-graphic (#xC2 or #xC3) and
eight-bit-control/graphic.  In that case, it is better to
display that sequence by octal instead of showing ?�.
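
To make the octal branch concrete, here is a small sketch of what the
`format' call quoted above produces.  The values #x81 and 7 are only
examples, not taken from utf-8.el:

```elisp
;; The `display' string built by (format "\\%03o" ch) is a backslash
;; followed by CH in three-digit, zero-padded octal.
(format "\\%03o" #x81)   ; => "\\201"  (four characters: \ 2 0 1)
(format "\\%03o" 7)      ; => "\\007"
```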

> --- orig/lisp/international/utf-8.el
> +++ mod/lisp/international/utf-8.el
> @@ -2,7 +2,7 @@
 
>  ;; Copyright (C) 2001, 2004 Electrotechnical Laboratory, JAPAN.
>  ;; Licensed to the Free Software Foundation.
> -;; Copyright (C) 2001, 2002 Free Software Foundation, Inc.
> +;; Copyright (C) 2001, 2002, 2005  Free Software Foundation, Inc.
 
>  ;; Author: TAKAHASHI Naoto  <ntakahas@m17n.org>
>  ;; Maintainer: FSF
> @@ -259,7 +259,7 @@
>  				 (funcall decode-char-no-trans (car x))
>  				 (funcall decode-char-no-trans (cdr x))))
>  		     ranges "")))
> -  ;; These forces loading and settting tables for
> +  ;; This forces loading and setting tables for
>    ;; utf-translate-cjk-mode.
>    (setq utf-translate-cjk-lang-env nil
>  	ucs-mule-cjk-to-unicode (make-hash-table :test 'eq)
> @@ -951,10 +951,7 @@
>    (save-excursion
>      (save-restriction
>        (narrow-to-region (point) (+ (point) length))
> -      ;; Can't do eval-when-compile to insert a multibyte constant
> -      ;; version of the string in the loop, since it's always loaded as
> -      ;; unibyte from a byte-compiled file.
> -      (let ((range (string-as-multibyte "^\xc0-\xc3\xe1-\xf7"))
> +      (let ((range "^\xc0-\xc3\xe1-\xf7")

This change is not good because range is set to a unibyte
string and regexp search converts it to a multibyte
string by `make-multibyte-string'.  Here what we need is a
multibyte string that contains eight-bit-graphic/control
chars.  Anyway, it is better to change string-as-multibyte to
string-to-multibyte.

>  	    (buffer-multibyte enable-multibyte-characters)
>  	    hash-table ch)
>  	(set-buffer-multibyte t)
> @@ -1036,8 +1033,7 @@
>      mule-unicode-0100-24ff
>      mule-unicode-2500-33ff
>      mule-unicode-e000-ffff
> -    ,@(if utf-translate-cjk-mode
> -	  utf-translate-cjk-charsets))
> +    ,@utf-translate-cjk-charsets)

This change is ok.

>     (mime-charset . utf-8)
>     (coding-category . coding-category-utf-8)
>     (valid-codes (0 . 255))
> @@ -1054,23 +1050,23 @@
>  ;; I think this needs special private charsets defined for the
>  ;; untranslated sequences, if it's going to work well.
 
> -;;; (defun utf-8-compose-function (pos to pattern &optional string)
> -;;;   (let* ((prop (get-char-property pos 'composition string))
> -;;; 	 (l (and prop (- (cadr prop) (car prop)))))
> -;;;     (cond ((and l (> l (- to pos)))
> -;;; 	   (delete-region pos to))
> -;;; 	  ((and (> (char-after pos) 224)
> -;;; 		(< (char-after pos) 256)
> -;;; 		(save-restriction
> -;;; 		  (narrow-to-region pos to)
> -;;; 		  (utf-8-compose)))
> -;;; 	   t))))
> -
> -;;; (dotimes (i 96)
> -;;;   (aset composition-function-table
> -;;; 	(+ 128 i)
> -;;; 	`((,(string-as-multibyte "[\200-\237\240-\377]")
> -;;; 	   . utf-8-compose-function))))
> +;; (defun utf-8-compose-function (pos to pattern &optional string)
> +;;   (let* ((prop (get-char-property pos 'composition string))
> +;; 	 (l (and prop (- (cadr prop) (car prop)))))
> +;;     (cond ((and l (> l (- to pos)))
> +;; 	   (delete-region pos to))
> +;; 	  ((and (> (char-after pos) 224)
> +;; 		(< (char-after pos) 256)
> +;; 		(save-restriction
> +;; 		  (narrow-to-region pos to)
> +;; 		  (utf-8-compose)))
> +;; 	   t))))
> +
> +;; (dotimes (i 96)
> +;;   (aset composition-function-table
> +;; 	(+ 128 i)
> +;; 	`((,(string-as-multibyte "[\200-\237\240-\377]")
> +;; 	   . utf-8-compose-function))))
 
>  ;; arch-tag: b08735b7-753b-4ae6-b754-0f3efe4515c5
>  ;;; utf-8.el ends here

This change is ok if that is the correct coding style for
comments.

---
Ken'ichi HANDA
handa@m17n.org

* Re: utf-8.el
  2005-01-19  2:51 ` utf-8.el Kenichi Handa
@ 2005-01-19  4:37   ` Stefan Monnier
  2005-01-19  6:15     ` utf-8.el Kenichi Handa
  2005-01-19 10:51   ` utf-8.el Andreas Schwab
  1 sibling, 1 reply; 10+ messages in thread
From: Stefan Monnier @ 2005-01-19  4:37 UTC
  Cc: emacs-devel

>> Also, could anyone confirm that the docstring of mule-utf-8 is correct in
>> saying that invalid utf-8 sequences are not always correctly preserved?
>> Why is that?  Can't we fix it?

> I remember I fixed ccl-mule-utf-8-encode-untrans to preserve
> invalid utf-8 sequences as far as possible.  So perhaps the
> current version preserves even invalid sequences correctly.

That's also what I remembered, which is why I asked.

>> Also could anyone explain to me why `utf-8-compose' needs to lookup the
>> hashtable (get 'utf-subst-table-for-decode 'translation-hash-table), since
>> it looks to me like ccl-decode-mule-utf-8 already takes care of decoding
>> chars that are in this table.

> subst-tables are not preloaded.  They are automatically
> loaded in utf-8-post-read-conversion but it runs after
> ccl-decode-mule-utf-8 is executed.  And the arg hash-table
> becomes non-nil only when subst-tables are loaded.

Oh, so the elisp code indeed does the same thing.  And that means it's only
really used at most once per Emacs session (since after it's executed, the
hash-table will be active directly in ccl-decode-mule-utf-8).  Right?

>> I also don't understand the following part of
>> the code:

>> (if (= l 2)
>> (put-text-property (point) (min (point-max) (+ l (point)))
>> 'display (format "\\%03o" ch))
>> (compose-region (point) (+ l (point)) ?�))

>> what does it mean for l (the number of bytes) to be equal to 2?

> The docstring of ccl-untranslated-to-ucs is not clear.  In
> "Set r1 to the byte length", the byte length means how many
> of r0, r1, r2, r3 (each of them contains a byte) contribute
> to a unicode character (or an invalid byte).

So it's the number of bytes used in the buffer's internal representation
(i.e. emacs-mule), not the number of bytes used in the utf-8 representation?

> If l is 2, that means an invalid byte was converted to
> two-char sequence of eight-bit-graphic (#xC2 or #xC3) and
> eight-bit-control/graphic.

And that's because any other utf-8 char maps to either a 3-byte sequence
(in a mule-unicode-NNNN-MMMM charset) or if it maps to a 2-byte sequence
(like latin-1) it won't pass through this code anyway?

> In that case, it is better to
> display that sequence by octal instead of showing ?�.

Yes, I understand this part.  I just have a hard time following the
reasoning that gets us to the point where we know that (= l 2) implies that
it's a single eight-bit-control or eight-bit-graphic char.

>> -      ;; Can't do eval-when-compile to insert a multibyte constant
>> -      ;; version of the string in the loop, since it's always loaded as
>> -      ;; unibyte from a byte-compiled file.
>> -      (let ((range (string-as-multibyte "^\xc0-\xc3\xe1-\xf7"))
>> +      (let ((range "^\xc0-\xc3\xe1-\xf7")

> This change is not good because range is set to a unibyte
> string and regexp search converts it to a multibyte
> string by `make-multibyte-string'.  Here what we need is a
> multibyte string that contains eight-bit-graphic/control
> chars.

I know that's what the comment says, but my tests lead me to believe that
the comment is not correct and that the string's multibyteness is
correctly preserved.

> Anyway it is better to change string-as-multibyte to string-to-multibyte.

Indeed.


        Stefan

* Re: utf-8.el
  2005-01-19  4:37   ` utf-8.el Stefan Monnier
@ 2005-01-19  6:15     ` Kenichi Handa
  2005-01-19 23:03       ` utf-8.el Stefan Monnier
  0 siblings, 1 reply; 10+ messages in thread
From: Kenichi Handa @ 2005-01-19  6:15 UTC
  Cc: emacs-devel

In article <87mzv6avqk.fsf-monnier+emacs@gnu.org>, Stefan Monnier <monnier@iro.umontreal.ca> writes:

>>  subst-tables are not preloaded.  They are automatically
>>  loaded in utf-8-post-read-conversion but it runs after
>>  ccl-decode-mule-utf-8 is executed.  And the arg hash-table
>>  becomes non-nil only when subst-tables are loaded.

> Oh, so the elisp code indeed does the same thing.  And that means it's only
> really used at most once per Emacs session (since after it's executed, the
> hash-table will be active directly in ccl-decode-mule-utf-8).  Right?

Right, except for the case where a user turns
utf-translate-cjk-mode off once.

>>>  I also don't understand the following part of
>>>  the code:

>>>  (if (= l 2)
>>>  (put-text-property (point) (min (point-max) (+ l (point)))
>>>  'display (format "\\%03o" ch))
>>>  (compose-region (point) (+ l (point)) ?�))

>>>  what does it mean for l (the number of bytes) to be equal to 2?

>>  The docstring of ccl-untranslated-to-ucs is not clear.  In
>>  "Set r1 to the byte length", the byte length means how many
>>  of r0, r1, r2, r3 (each of them contains a byte) contribute
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>  to a unicode character (or an invalid byte).

The "^^^^" part is not accurate.  "The first few of them that
contribute to a unicode character or an invalid byte contain
eight-bit characters (thus are byte values)."

> So it's the number of bytes used in the buffer's internal representation
> (i.e. emacs-mule), not the number of bytes used in the utf-8 representation?

No, it's the number of characters.  r0..r3 are the same as
utf-8-ccl-regs[0]..[3] set by utf-8-untranslated-to-ucs.

>>  If l is 2, that means an invalid byte was converted to
>>  two-char sequence of eight-bit-graphic (#xC2 or #xC3) and
>>  eight-bit-control/graphic.

> And that's because any other utf-8 char maps to either a 3-byte sequence
> (in a mule-unicode-NNNN-MMMM charset) or if it maps to a 2-byte sequence
> (like latin-1) it won't pass through this code anyway?

Yes.

>>  In that case, it is better to
>>  display that sequence by octal instead of showing ?�.

> Yes, I understand this part.  I just have a hard time following the
> reasoning that gets us to the point where we know that (= l 2) implies that
> it's a single eight-bit-control or eight-bit-graphic char.

Not accurate.  As I wrote above, (= l 2) implies it's an
originally invalid byte represented by a 2-byte sequence of
eight-bit-graphic and eight-bit-control chars.
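
For illustration only, the pair can be sketched as follows.  This
assumes the two values are simply the UTF-8 encoding of the invalid
byte's own value taken as a code point, which fits the "#xC2 or #xC3"
lead value mentioned above but is an inference, not something read off
the CCL code.  `my-invalid-byte-pair' is a hypothetical helper:

```elisp
(defun my-invalid-byte-pair (byte)
  "Return (LEAD TRAIL) for an invalid BYTE in the range #x80..#xFF.
Hypothetical helper: assumes the pair is the 2-byte UTF-8 form of
BYTE treated as a code point, so LEAD is always #xC2 or #xC3."
  (list (logior #xC0 (ash byte -6))        ; top 2 bits -> #xC2 or #xC3
        (logior #x80 (logand byte #x3F)))) ; low 6 bits -> #x80..#xBF

(my-invalid-byte-pair #x81)   ; => (194 129), i.e. #xC2 #x81
(my-invalid-byte-pair #xFF)   ; => (195 191), i.e. #xC3 #xBF
```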

>>>  -      ;; Can't do eval-when-compile to insert a multibyte constant
>>>  -      ;; version of the string in the loop, since it's always loaded as
>>>  -      ;; unibyte from a byte-compiled file.
>>>  -      (let ((range (string-as-multibyte "^\xc0-\xc3\xe1-\xf7"))
>>>  +      (let ((range "^\xc0-\xc3\xe1-\xf7")

>>  This change is not good because range is set to a unibyte
>>  string and regexp search converts it to a multibyte
>>  string by `make-multibyte-string'.  Here what we need is a
>>  multibyte string that contains eight-bit-graphic/control
>>  chars.

> I know that's what the comment says, but my tests lead me to believe that
> the comment is not correct and that the string's multibyteness is
> correctly preserved.

Ah!  I've forgotten that "\x" notation in a string forces
the string to be read as multibyte in the latest emacs.  It
wasn't in 21.3.

So, yes, now your change is ok.
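
Since the multibyteness of a literal containing "\x" escapes depends on
the Emacs version reading it, one way to avoid relying on the reader is
to check and convert explicitly.  A minimal sketch;
`my-ensure-multibyte' is a hypothetical helper, not part of utf-8.el:

```elisp
(defun my-ensure-multibyte (string)
  "Return STRING as a multibyte string (hypothetical helper).
A unibyte STRING is converted with `string-to-multibyte', which
turns bytes #x80..#xFF into eight-bit (raw-byte) characters; a
multibyte STRING is returned unchanged."
  (if (multibyte-string-p string)
      string
    (string-to-multibyte string)))

;; The range from the patch, multibyte whichever way it was read.
(let ((range (my-ensure-multibyte "^\xc0-\xc3\xe1-\xf7")))
  (list (multibyte-string-p range) (length range)))
```

The explicit check costs little and keeps the code independent of how a
particular reader or byte-compiler treats the literal.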

---
Ken'ichi HANDA
handa@m17n.org

* Re: utf-8.el
  2005-01-19  2:51 ` utf-8.el Kenichi Handa
  2005-01-19  4:37   ` utf-8.el Stefan Monnier
@ 2005-01-19 10:51   ` Andreas Schwab
  2005-01-19 13:09     ` utf-8.el Kenichi Handa
  1 sibling, 1 reply; 10+ messages in thread
From: Andreas Schwab @ 2005-01-19 10:51 UTC
  Cc: Stefan Monnier, emacs-devel

Kenichi Handa <handa@m17n.org> writes:

> I remember I fixed ccl-mule-utf-8-encode-untrans to preserve
> invalid utf-8 sequences as far as possible.  So perhaps the
> current version preserves even invalid sequences correctly.

I've just tested with Markus Kuhn's UTF-8 test file
(<http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt>) and current
Emacs indeed preserves all invalid UTF-8 sequences.

Andreas.

-- 
Andreas Schwab, SuSE Labs, schwab@suse.de
SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."

* Re: utf-8.el
  2005-01-19 10:51   ` utf-8.el Andreas Schwab
@ 2005-01-19 13:09     ` Kenichi Handa
  0 siblings, 0 replies; 10+ messages in thread
From: Kenichi Handa @ 2005-01-19 13:09 UTC
  Cc: monnier, emacs-devel

In article <jer7khwv35.fsf@sykes.suse.de>, Andreas Schwab <schwab@suse.de> writes:
> Kenichi Handa <handa@m17n.org> writes:
>>  I remember I fixed ccl-mule-utf-8-encode-untrans to preserve
>>  invalid utf-8 sequences as far as possible.  So perhaps the
>>  current version preserves even invalid sequences correctly.

> I've just tested with Markus Kuhn's UTF-8 test file
> (<http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt>) and current
> Emacs indeed preserves all invalid UTF-8 sequences.

Thank you for the test.  That is a relief.

---
Ken'ichi HANDA
handa@m17n.org

* Re: utf-8.el
  2005-01-19  6:15     ` utf-8.el Kenichi Handa
@ 2005-01-19 23:03       ` Stefan Monnier
  2005-01-19 23:47         ` utf-8.el Kenichi Handa
  0 siblings, 1 reply; 10+ messages in thread
From: Stefan Monnier @ 2005-01-19 23:03 UTC
  Cc: emacs-devel

>> Oh, so the elisp code indeed does the same thing.  And that means it's
>> only really used at most once per Emacs session (since after it's
>> executed, the hash-table will be active directly in
>> ccl-decode-mule-utf-8).  Right?

> Right, except for the case where a user turns
> utf-translate-cjk-mode off once.

Of course.

> Not accurate.  As I wrote above, (= l 2) implies it's an
> originally invalid byte represented by a 2-byte sequence of
> eight-bit-graphic and eight-bit-control chars.

Oh, I think I'm beginning to understand: An invalid sequence such as "\201"
is not translated into the single eight-bit-control char \201 but into
a sequence of two eight-bit-* chars: "\302\201".

Hmm... why is that?


        Stefan

* Re: utf-8.el
  2005-01-19 23:03       ` utf-8.el Stefan Monnier
@ 2005-01-19 23:47         ` Kenichi Handa
  2005-01-19 23:52           ` utf-8.el Stefan Monnier
  0 siblings, 1 reply; 10+ messages in thread
From: Kenichi Handa @ 2005-01-19 23:47 UTC
  Cc: emacs-devel

In article <jwv1xchuiz9.fsf-monnier+emacs@gnu.org>, Stefan Monnier <monnier@iro.umontreal.ca> writes:
>>  Not accurate.  As I wrote above, (= l 2) implies it's an
>>  originally invalid byte represented by a 2-byte sequence of
>>  eight-bit-graphic and eight-bit-control chars.

> Oh, I think I'm beginning to understand: An invalid sequence such as "\201"
> is not translated into the single eight-bit-control char \201 but into
> a sequence of two eight-bit-* chars: "\302\201".

> Hmm... why is that?

As far as I remember, that is to distinguish an eight-bit-*
sequence for an untranslated char from that for an invalid
byte.

---
Ken'ichi HANDA
handa@m17n.org

* Re: utf-8.el
  2005-01-19 23:47         ` utf-8.el Kenichi Handa
@ 2005-01-19 23:52           ` Stefan Monnier
  2005-01-20  1:00             ` utf-8.el Kenichi Handa
  0 siblings, 1 reply; 10+ messages in thread
From: Stefan Monnier @ 2005-01-19 23:52 UTC
  Cc: emacs-devel

>> Oh, I think I'm beginning to understand: An invalid sequence such as "\201"
>> is not translated into the single eight-bit-control char \201 but into
>> a sequence of two eight-bit-* chars: "\302\201".

>> Hmm... why is that?

> As far as I remember, that is to distinguish an eight-bit-*
> sequence for an untranslated char from that for an invalid byte.

Hmm... why would it be needed?


        Stefan

* Re: utf-8.el
  2005-01-19 23:52           ` utf-8.el Stefan Monnier
@ 2005-01-20  1:00             ` Kenichi Handa
  0 siblings, 0 replies; 10+ messages in thread
From: Kenichi Handa @ 2005-01-20  1:00 UTC
  Cc: emacs-devel

In article <jwvk6q9t1u7.fsf-monnier+emacs@gnu.org>, Stefan Monnier <monnier@iro.umontreal.ca> writes:
>>  As far as I remember, that is to distinguish an eight-bit-*
>>  sequence for an untranslated char from that for an invalid byte.

> Hmm... why would it be needed?

To decide whether to attach a `display' text property or to
compose it in utf-8-compose.

---
Ken'ichi HANDA
handa@m17n.org
