Re: utf-8.el - Kenichi Handa

unofficial mirror of emacs-devel@gnu.org 

From: Kenichi Handa <handa@m17n.org>
Cc: emacs-devel@gnu.org
Subject: Re: utf-8.el
Date: Wed, 19 Jan 2005 11:51:14 +0900 (JST)
Message-ID: <200501190251.LAA11194@etlken.m17n.org>
In-Reply-To: <jwvpt02zp5h.fsf-monnier+emacs@gnu.org> (message from Stefan Monnier on Tue, 18 Jan 2005 11:37:26 -0500)

In article <jwvpt02zp5h.fsf-monnier+emacs@gnu.org>, Stefan Monnier <monnier@iro.umontreal.ca> writes:

> Does anyone see a problem with the simple patch below?

See the comment below.

> Also, could anyone confirm that the docstring of mule-utf-8 is correct in
> saying that invalid utf-8 sequences are not always correctly preserved?
> Why is that?  Can't we fix it?

I remember fixing ccl-mule-utf-8-encode-untrans to preserve
invalid utf-8 sequences as far as possible.  So perhaps the
current version preserves even invalid sequences correctly.

I've just run this code for a fairly long time and saw no errors.

(defun temp ()
  (let ((count 0))
    (while t
      (setq count (1+ count))
      (message "%d" count)
      (let* ((len (+ 6 (random 6)))
	     (str (make-string len 0)))
	(dotimes (i len)
	  (aset str i (+ 128 (random 128))))
	(or (equal str
		   (encode-coding-string
		    (decode-coding-string str 'utf-8) 'utf-8))
	    (error "%s caused error" (setq error-string str)))))))

> Also could anyone explain to me why `utf-8-compose' needs to lookup the
> hashtable (get 'utf-subst-table-for-decode 'translation-hash-table), since
> it looks to me like ccl-decode-mule-utf-8 already takes care of decoding
> chars that are in this table.

The subst-tables are not preloaded.  They are loaded
automatically in utf-8-post-read-conversion, but that runs
after ccl-decode-mule-utf-8 has executed.  The hash-table
argument becomes non-nil only once the subst-tables are loaded.
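
A minimal sketch of the guarded lookup this implies, assuming the
table is stored under the `translation-hash-table' property of a
symbol, as in the quoted question.  The function name
`my-subst-lookup' is hypothetical and is not part of utf-8.el.

```elisp
;; Hypothetical helper, for illustration only.
;; TABLE-SYMBOL is a symbol such as `utf-subst-table-for-decode'.
(defun my-subst-lookup (table-symbol ch)
  "Return the translation of CH from TABLE-SYMBOL's hash table.
Return nil if the table has not been loaded yet."
  (let ((table (get table-symbol 'translation-hash-table)))
    ;; `get' returns nil while the property is unset, so the
    ;; lookup is skipped until the table has been loaded.
    (and (hash-table-p table)
         (gethash ch table))))
```

Until something stores a hash table on the symbol, the lookup
simply returns nil, which is why a later pass has to repeat it.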

> I also don't understand the following part of
> the code:

> 	  (if (= l 2)
> 	      (put-text-property (point) (min (point-max) (+ l (point)))
> 				 'display (format "\\%03o" ch))
> 	    (compose-region (point) (+ l (point)) ?�))

> what does it mean for l (the number of bytes) to be equal to 2?

The docstring of ccl-untranslated-to-ucs is not clear.  In
"Set r1 to the byte length", the byte length means how many
of r0, r1, r2, r3 (each of which holds one byte) contribute
to a Unicode character (or an invalid byte).

If l is 2, that means an invalid byte was converted to a
two-char sequence of eight-bit-graphic (#xC2 or #xC3) and
eight-bit-control/graphic.  In that case, it is better to
display that sequence in octal instead of showing ?�.
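
As a small illustration of the octal form, the `format' call from
the quoted code can be tried on its own.  The wrapper name
`my-octal-display' is hypothetical; only the format string comes
from the quoted source.

```elisp
;; Hypothetical wrapper around the format call quoted above.
(defun my-octal-display (byte)
  "Return BYTE as a backslash followed by three octal digits."
  (format "\\%03o" byte))

;; (my-octal-display #xC0) returns a four-character string:
;; a backslash followed by 300.
```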

> --- orig/lisp/international/utf-8.el
> +++ mod/lisp/international/utf-8.el
> @@ -2,7 +2,7 @@
 
>  ;; Copyright (C) 2001, 2004 Electrotechnical Laboratory, JAPAN.
>  ;; Licensed to the Free Software Foundation.
> -;; Copyright (C) 2001, 2002 Free Software Foundation, Inc.
> +;; Copyright (C) 2001, 2002, 2005  Free Software Foundation, Inc.
 
>  ;; Author: TAKAHASHI Naoto  <ntakahas@m17n.org>
>  ;; Maintainer: FSF
> @@ -259,7 +259,7 @@
>  				 (funcall decode-char-no-trans (car x))
>  				 (funcall decode-char-no-trans (cdr x))))
>  		     ranges "")))
> -  ;; These forces loading and settting tables for
> +  ;; This forces loading and setting tables for
>    ;; utf-translate-cjk-mode.
>    (setq utf-translate-cjk-lang-env nil
>  	ucs-mule-cjk-to-unicode (make-hash-table :test 'eq)
> @@ -951,10 +951,7 @@
>    (save-excursion
>      (save-restriction
>        (narrow-to-region (point) (+ (point) length))
> -      ;; Can't do eval-when-compile to insert a multibyte constant
> -      ;; version of the string in the loop, since it's always loaded as
> -      ;; unibyte from a byte-compiled file.
> -      (let ((range (string-as-multibyte "^\xc0-\xc3\xe1-\xf7"))
> +      (let ((range "^\xc0-\xc3\xe1-\xf7")

This change is not good because range is then set to a unibyte
string, and regexp search converts it to a multibyte
string by `make-multibyte-string'.  What we need here is a
multibyte string that contains eight-bit-graphic/control
chars.  Anyway, it is better to change string-as-multibyte to
string-to-multibyte.
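
A minimal sketch of the suggested form, assuming an Emacs that
provides `string-to-multibyte'.  The function name
`my-eight-bit-range' is hypothetical; the range string is the one
from the patch.

```elisp
;; Hypothetical helper, for illustration only.
(defun my-eight-bit-range ()
  "Return the skip range from the patch as a multibyte string.
The literal is unibyte; `string-to-multibyte' keeps each
non-ASCII byte as a single eight-bit character rather than
reinterpreting it."
  (string-to-multibyte "^\xc0-\xc3\xe1-\xf7"))

;; The conversion changes the representation, not the number of
;; characters in the string.
(list (multibyte-string-p "^\xc0-\xc3\xe1-\xf7")
      (multibyte-string-p (my-eight-bit-range))
      (length (my-eight-bit-range)))
```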

>  	    (buffer-multibyte enable-multibyte-characters)
>  	    hash-table ch)
>  	(set-buffer-multibyte t)
> @@ -1036,8 +1033,7 @@
>      mule-unicode-0100-24ff
>      mule-unicode-2500-33ff
>      mule-unicode-e000-ffff
> -    ,@(if utf-translate-cjk-mode
> -	  utf-translate-cjk-charsets))
> +    ,@utf-translate-cjk-charsets)

This change is ok.

>     (mime-charset . utf-8)
>     (coding-category . coding-category-utf-8)
>     (valid-codes (0 . 255))
> @@ -1054,23 +1050,23 @@
>  ;; I think this needs special private charsets defined for the
>  ;; untranslated sequences, if it's going to work well.
 
> -;;; (defun utf-8-compose-function (pos to pattern &optional string)
> -;;;   (let* ((prop (get-char-property pos 'composition string))
> -;;; 	 (l (and prop (- (cadr prop) (car prop)))))
> -;;;     (cond ((and l (> l (- to pos)))
> -;;; 	   (delete-region pos to))
> -;;; 	  ((and (> (char-after pos) 224)
> -;;; 		(< (char-after pos) 256)
> -;;; 		(save-restriction
> -;;; 		  (narrow-to-region pos to)
> -;;; 		  (utf-8-compose)))
> -;;; 	   t))))
> -
> -;;; (dotimes (i 96)
> -;;;   (aset composition-function-table
> -;;; 	(+ 128 i)
> -;;; 	`((,(string-as-multibyte "[\200-\237\240-\377]")
> -;;; 	   . utf-8-compose-function))))
> +;; (defun utf-8-compose-function (pos to pattern &optional string)
> +;;   (let* ((prop (get-char-property pos 'composition string))
> +;; 	 (l (and prop (- (cadr prop) (car prop)))))
> +;;     (cond ((and l (> l (- to pos)))
> +;; 	   (delete-region pos to))
> +;; 	  ((and (> (char-after pos) 224)
> +;; 		(< (char-after pos) 256)
> +;; 		(save-restriction
> +;; 		  (narrow-to-region pos to)
> +;; 		  (utf-8-compose)))
> +;; 	   t))))
> +
> +;; (dotimes (i 96)
> +;;   (aset composition-function-table
> +;; 	(+ 128 i)
> +;; 	`((,(string-as-multibyte "[\200-\237\240-\377]")
> +;; 	   . utf-8-compose-function))))
 
>  ;; arch-tag: b08735b7-753b-4ae6-b754-0f3efe4515c5
>  ;;; utf-8.el ends here

This change is ok if that is the correct coding style for
comments.

---
Ken'ichi HANDA
handa@m17n.org
