From mboxrd@z Thu Jan  1 00:00:00 1970
Path: main.gmane.org!not-for-mail
From: Kenichi Handa <handa@etl.go.jp>
Newsgroups: gmane.emacs.devel
Subject: Re: Several serious problems
Date: Mon, 2 Sep 2002 10:28:25 +0900 (JST)
Sender: emacs-devel-admin@gnu.org
Message-ID: <200209020128.KAA08644@etlken.m17n.org>
References: <200208190748.QAA14278@etlken.m17n.org>	<rzqlm6ybz38.fsf@albion.dl.ac.uk>	<200208291325.WAA03596@etlken.m17n.org>	<rzqfzwxz23x.fsf@albion.dl.ac.uk>	<E17kezu-0008LD-00@fencepost.gnu.org> <rzq4rdbx7d4.fsf@albion.dl.ac.uk> <E17lefC-0003IF-00@fencepost.gnu.org>
NNTP-Posting-Host: localhost.gmane.org
Mime-Version: 1.0 (generated by SEMI 1.14.3 - "Ushinoya")
Content-Type: text/plain; charset=US-ASCII
X-Trace: main.gmane.org 1030930100 993 127.0.0.1 (2 Sep 2002 01:28:20 GMT)
X-Complaints-To: usenet@main.gmane.org
NNTP-Posting-Date: Mon, 2 Sep 2002 01:28:20 +0000 (UTC)
Cc: d.love@dl.ac.uk, monnier+gnu/emacs@rum.cs.yale.edu, keichwa@gmx.net,
   emacs-devel@gnu.org
Return-path: <emacs-devel-admin@gnu.org>
Original-Received: from quimby.gnus.org ([80.91.224.244])
	by main.gmane.org with esmtp (Exim 3.35 #1 (Debian))
	id 17lg0j-0000Fq-00
	for <emacs-devel@main.gmane.org>; Mon, 02 Sep 2002 03:28:13 +0200
Original-Received: from monty-python.gnu.org ([199.232.76.173])
	by quimby.gnus.org with esmtp (Exim 3.12 #1 (Debian))
	id 17lgYi-00075L-00
	for <emacs-devel@quimby.gnus.org>; Mon, 02 Sep 2002 04:03:21 +0200
Original-Received: from localhost ([127.0.0.1] helo=monty-python.gnu.org)
	by monty-python.gnu.org with esmtp (Exim 4.10)
	id 17lg2E-00052e-00; Sun, 01 Sep 2002 21:29:46 -0400
Original-Received: from list by monty-python.gnu.org with tmda-scanned (Exim 4.10)
	id 17lg11-00051X-00
	for emacs-devel@gnu.org; Sun, 01 Sep 2002 21:28:31 -0400
Original-Received: from mail by monty-python.gnu.org with spam-scanned (Exim 4.10)
	id 17lg0z-00051L-00
	for emacs-devel@gnu.org; Sun, 01 Sep 2002 21:28:30 -0400
Original-Received: from tsukuba.m17n.org ([192.47.44.130])
	by monty-python.gnu.org with esmtp (Exim 4.10)
	id 17lg0y-00051F-00; Sun, 01 Sep 2002 21:28:28 -0400
Original-Received: from fs.m17n.org (fs.m17n.org [192.47.44.2])
	by tsukuba.m17n.org (8.11.6/3.7W-20010518204228) with ESMTP id g821SPl18028;
	Mon, 2 Sep 2002 10:28:25 +0900 (JST)
	(envelope-from handa@m17n.org)
Original-Received: from etlken.m17n.org (etlken.m17n.org [192.47.44.125])
	by fs.m17n.org (8.11.3/3.7W-20010823150639) with ESMTP id g821SP905548;
	Mon, 2 Sep 2002 10:28:25 +0900 (JST)
Original-Received: (from handa@localhost)
	by etlken.m17n.org (8.8.8+Sun/3.7W-2001040620) id KAA08644;
	Mon, 2 Sep 2002 10:28:25 +0900 (JST)
Original-To: rms@gnu.org
In-Reply-To: <E17lefC-0003IF-00@fencepost.gnu.org> (message from Richard
	Stallman on Sun, 01 Sep 2002 20:01:54 -0400)
User-Agent: SEMI/1.14.3 (Ushinoya) FLIM/1.14.2 (Yagi-Nishiguchi) APEL/10.2 Emacs/21.1.30 (sparc-sun-solaris2.6) MULE/5.0 (SAKAKI)
Errors-To: emacs-devel-admin@gnu.org
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.0.11
Precedence: bulk
List-Help: <mailto:emacs-devel-request@gnu.org?subject=help>
List-Post: <mailto:emacs-devel@gnu.org>
List-Subscribe: <http://mail.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=subscribe>
List-Id: Emacs development discussions. <emacs-devel.gnu.org>
List-Unsubscribe: <http://mail.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://mail.gnu.org/pipermail/emacs-devel/>
Xref: main.gmane.org gmane.emacs.devel:7307
X-Report-Spam: http://spam.gmane.org/gmane.emacs.devel:7307

In article <E17lefC-0003IF-00@fencepost.gnu.org>, Richard Stallman <rms@gnu.org> writes:
>     That depends on whether you include code in utf-8.el that encodes
>     those charsets.  If not, you need that change.

> In that case, I will install that change presently, and then we can
> study the question of whether to include the code in utf-8.el instead.

> What does that code in utf-8.el do, and how safe a change is it?

It defines two CCL programs, one to decode and one to encode
UTF-8 byte sequences, and builds the coding system mule-utf-8
from those CCL programs.

I'm attaching the change needed to let RC's utf-8 encode the
latin-X charsets plus a few others (e.g. Thai).  The docstring
of mule-utf-8 may need improvement.

As the change is very small and that code has been in HEAD
for more than one month, I think the change is quite safe.
I recommend installing it in RC.

I also checked the code to some extent with this test suite:

;; For every charset that mule-utf-8 declares safe (apart from ASCII
;; and the eight-bit charsets), build a two-character sample string
;; and signal an error if mule-utf-8 cannot encode it.
(dolist (charset (delq 'ascii
                       (delq 'eight-bit-control
                             (delq 'eight-bit-graphic
                                   ;; Copy first: delq is destructive.
                                   (copy-sequence
                                    (coding-system-get 'mule-utf-8
                                                       'safe-charsets))))))
  (let ((dimension (charset-dimension charset))
        str)
    (if (= dimension 1)
        (setq str (string (make-char charset 33) (make-char charset 34)))
      (setq str (string (make-char charset 33 33) (make-char charset 33 34))))
    (or (memq 'mule-utf-8 (find-coding-systems-string str))
        (not (string-match "\357\277\275" ; UTF-8 form of U+FFFD
                           (encode-coding-string str 'mule-utf-8)))
        (error "%s is not supported" charset))))
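
The octal string matched in the test is the UTF-8 form of U+FFFD, the
replacement character.  As a small sketch for confirming those bytes
(it assumes that decode-char with the `ucs' argument is available and
that encode-coding-string returns a unibyte string; the variable names
are only illustrative), one could evaluate:

```elisp
;; Sketch only: show the bytes that U+FFFD produces under mule-utf-8,
;; written as octal escapes for comparison with the regexp in the test.
(let* ((ch (decode-char 'ucs #xFFFD))          ; the character U+FFFD
       (bytes (encode-coding-string (string ch) 'mule-utf-8)))
  ;; Format each byte of the encoded string as a backslash-octal escape.
  (mapconcat (lambda (b) (format "\\%o" b)) bytes ""))
```

This should yield the same three octal escapes that appear in the
string-match call of the test.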

---
Ken'ichi HANDA
handa@etl.go.jp

*** utf-8.el.~1.9.4.2.~	Tue Jul 23 13:54:13 2002
--- utf-8.el	Mon Sep  2 10:28:26 2002
***************
*** 269,275 ****
       (loop
        (if (r5 < 0)
  	  ((r1 = -1)
! 	   (read-multibyte-character r0 r1))
  	(;; We have already done read-multibyte-character.
  	 (r0 = r5)
  	 (r1 = r6)
--- 269,277 ----
       (loop
        (if (r5 < 0)
  	  ((r1 = -1)
! 	   (read-multibyte-character r0 r1)
! 	   (translate-character ucs-mule-to-mule-unicode r0 r1))
! 
  	(;; We have already done read-multibyte-character.
  	 (r0 = r5)
  	 (r1 = r6)
***************
*** 392,397 ****
--- 394,423 ----
     mule-unicode-0100-24ff
     mule-unicode-2500-33ff
     mule-unicode-e000-ffff
+    latin-iso8859-2 (*)
+    latin-iso8859-3 (*)
+    latin-iso8859-4 (*)
+    cyrillic-iso8859-5 (*)
+    arabic-iso8859-6 (*)
+    greek-iso8859-7 (*)
+    hebrew-iso8859-8 (*)
+    latin-iso8859-9 (*)
+    latin-iso8859-14 (*)
+    latin-iso8859-15 (*)
+    chinese-sisheng (*)
+    ethiopic (*)
+    ipa (*)
+    lao (*)
+    katakana-jisx0201 (*)
+    thai-tis620 (*)
+    tibetan (*)
+    vietnamese-viscii-lower (*)
+    vietnamese-viscii-upper (*)
+ 
+ Among them, the charsets labeled \"(*)\" are supported only on
+ encoding.  That means, they are correctly encoded to UTF-8, but are
+ decoded back to charsets latin-iso8859-1, mule-unicode-0100-24ff, or
+ mule-unicode-2500-33ff, not to the original charsets.
  
  Unicode characters out of the ranges U+0000-U+33FF and U+E200-U+FFFF
  are decoded into sequences of eight-bit-control and eight-bit-graphic
***************
*** 409,415 ****
      latin-iso8859-1
      mule-unicode-0100-24ff
      mule-unicode-2500-33ff
!     mule-unicode-e000-ffff)
     (mime-charset . utf-8)
     (coding-category . coding-category-utf-8)
     (valid-codes (0 . 255))))
--- 435,460 ----
      latin-iso8859-1
      mule-unicode-0100-24ff
      mule-unicode-2500-33ff
!     mule-unicode-e000-ffff
!     latin-iso8859-2 
!     latin-iso8859-3 
!     latin-iso8859-4 
!     cyrillic-iso8859-5 
!     arabic-iso8859-6 
!     greek-iso8859-7 
!     hebrew-iso8859-8 
!     latin-iso8859-9 
!     latin-iso8859-14 
!     latin-iso8859-15 
!     chinese-sisheng 
!     ethiopic 
!     ipa 
!     lao 
!     katakana-jisx0201 
!     thai-tis620 
!     tibetan 
!     vietnamese-viscii-lower 
!     vietnamese-viscii-upper)
     (mime-charset . utf-8)
     (coding-category . coding-category-utf-8)
     (valid-codes (0 . 255))))