From mboxrd@z Thu Jan 1 00:00:00 1970 Path: main.gmane.org!not-for-mail From: Stefan Monnier Newsgroups: gmane.emacs.devel Subject: utf-8.el Date: Tue, 18 Jan 2005 11:37:26 -0500 Message-ID: NNTP-Posting-Host: deer.gmane.org Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-Trace: sea.gmane.org 1106069923 6181 80.91.229.6 (18 Jan 2005 17:38:43 GMT) X-Complaints-To: usenet@sea.gmane.org NNTP-Posting-Date: Tue, 18 Jan 2005 17:38:43 +0000 (UTC) Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Tue Jan 18 18:38:25 2005 Return-path: Original-Received: from lists.gnu.org ([199.232.76.165]) by deer.gmane.org with esmtp (Exim 3.35 #1 (Debian)) id 1Cqx7m-0007hM-00 for ; Tue, 18 Jan 2005 18:26:38 +0100 Original-Received: from localhost ([127.0.0.1] helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.43) id 1CqxJh-0008U6-6R for ged-emacs-devel@m.gmane.org; Tue, 18 Jan 2005 12:38:57 -0500 Original-Received: from mailman by lists.gnu.org with tmda-scanned (Exim 4.43) id 1CqxDa-0002Om-B5 for emacs-devel@gnu.org; Tue, 18 Jan 2005 12:32:38 -0500 Original-Received: from exim by lists.gnu.org with spam-scanned (Exim 4.43) id 1CqwvV-0005Yr-6A for emacs-devel@gnu.org; Tue, 18 Jan 2005 12:14:03 -0500 Original-Received: from [199.232.76.173] (helo=monty-python.gnu.org) by lists.gnu.org with esmtp (Exim 4.43) id 1CqwvP-0005PJ-Cu for emacs-devel@gnu.org; Tue, 18 Jan 2005 12:13:51 -0500 Original-Received: from [132.204.24.67] (helo=mercure.iro.umontreal.ca) by monty-python.gnu.org with esmtp (Exim 4.34) id 1CqwMF-0006J3-Pm for emacs-devel@gnu.org; Tue, 18 Jan 2005 11:37:32 -0500 Original-Received: from hidalgo.iro.umontreal.ca (hidalgo.iro.umontreal.ca [132.204.27.50]) by mercure.iro.umontreal.ca (Postfix) with ESMTP id 85FC68282BB; Tue, 18 Jan 2005 11:37:30 -0500 (EST) Original-Received: from asado.iro.umontreal.ca (asado.iro.umontreal.ca [132.204.24.84]) by hidalgo.iro.umontreal.ca (Postfix) with ESMTP id 
8C4AC4AC134; Tue, 18 Jan 2005 11:37:26 -0500 (EST) Original-Received: by asado.iro.umontreal.ca (Postfix, from userid 20848) id 4E1DC4BB62; Tue, 18 Jan 2005 11:37:26 -0500 (EST) Original-To: emacs-devel@gnu.org User-Agent: Gnus/5.11 (Gnus v5.11) Emacs/21.3.50 (gnu/linux) X-DIRO-MailScanner-Information: Please contact the ISP for more information X-DIRO-MailScanner: Found to be clean X-DIRO-MailScanner-SpamCheck: n'est pas un polluriel, SpamAssassin (score=-4.744, requis 5, autolearn=not spam, AWL 0.16, BAYES_00 -4.90) X-MailScanner-From: monnier@iro.umontreal.ca X-BeenThere: emacs-devel@gnu.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: "Emacs development discussions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Xref: main.gmane.org gmane.emacs.devel:32341 X-Report-Spam: http://spam.gmane.org/gmane.emacs.devel:32341 Does anyone see a problem with the simple patch below? Also, could anyone confirm that the docstring of mule-utf-8 is correct in saying that invalid utf-8 sequences are not always correctly preserved? Why is that? Can't we fix it? Also could anyone explain to me why `utf-8-compose' needs to lookup the hashtable (get 'utf-subst-table-for-decode 'translation-hash-table), since it looks to me like ccl-decode-mule-utf-8 already takes care of decoding chars that are in this table. I also don't understand the following part of the code: (if (=3D l 2) (put-text-property (point) (min (point-max) (+ l (point))) 'display (format "\\%03o" ch)) (compose-region (point) (+ l (point)) ?=EF=BF=BD)) what does it mean for l (the number of bytes) to be equal to 2? Stefan --- orig/lisp/international/utf-8.el +++ mod/lisp/international/utf-8.el @@ -2,7 +2,7 @@ =20 ;; Copyright (C) 2001, 2004 Electrotechnical Laboratory, JAPAN. ;; Licensed to the Free Software Foundation. 
-;; Copyright (C) 2001, 2002 Free Software Foundation, Inc.
+;; Copyright (C) 2001, 2002, 2005 Free Software Foundation, Inc.
 
 ;; Author: TAKAHASHI Naoto
 ;; Maintainer: FSF
@@ -259,7 +259,7 @@
                        (funcall decode-char-no-trans (car x))
                        (funcall decode-char-no-trans (cdr x))))
              ranges "")))
-    ;; These forces loading and settting tables for
+    ;; This forces loading and setting tables for
     ;; utf-translate-cjk-mode.
     (setq utf-translate-cjk-lang-env nil
           ucs-mule-cjk-to-unicode (make-hash-table :test 'eq)
@@ -951,10 +951,7 @@
   (save-excursion
     (save-restriction
       (narrow-to-region (point) (+ (point) length))
-      ;; Can't do eval-when-compile to insert a multibyte constant
-      ;; version of the string in the loop, since it's always loaded as
-      ;; unibyte from a byte-compiled file.
-      (let ((range (string-as-multibyte "^\xc0-\xc3\xe1-\xf7"))
+      (let ((range "^\xc0-\xc3\xe1-\xf7")
             (buffer-multibyte enable-multibyte-characters)
             hash-table ch)
         (set-buffer-multibyte t)
@@ -1036,8 +1033,7 @@
     mule-unicode-0100-24ff
     mule-unicode-2500-33ff
     mule-unicode-e000-ffff
-    ,@(if utf-translate-cjk-mode
-          utf-translate-cjk-charsets))
+    ,@utf-translate-cjk-charsets)
    (mime-charset . utf-8)
    (coding-category . coding-category-utf-8)
    (valid-codes (0 . 255))
@@ -1054,23 +1050,23 @@
 ;; I think this needs special private charsets defined for the
 ;; untranslated sequences, if it's going to work well.
 
-;;; (defun utf-8-compose-function (pos to pattern &optional string)
-;;;   (let* ((prop (get-char-property pos 'composition string))
-;;;          (l (and prop (- (cadr prop) (car prop)))))
-;;;     (cond ((and l (> l (- to pos)))
-;;;            (delete-region pos to))
-;;;           ((and (> (char-after pos) 224)
-;;;                 (< (char-after pos) 256)
-;;;                 (save-restriction
-;;;                   (narrow-to-region pos to)
-;;;                   (utf-8-compose)))
-;;;            t))))
-
-;;; (dotimes (i 96)
-;;;   (aset composition-function-table
-;;;         (+ 128 i)
-;;;         `((,(string-as-multibyte "[\200-\237\240-\377]")
-;;;            . utf-8-compose-function))))
+;; (defun utf-8-compose-function (pos to pattern &optional string)
+;;   (let* ((prop (get-char-property pos 'composition string))
+;;          (l (and prop (- (cadr prop) (car prop)))))
+;;     (cond ((and l (> l (- to pos)))
+;;            (delete-region pos to))
+;;           ((and (> (char-after pos) 224)
+;;                 (< (char-after pos) 256)
+;;                 (save-restriction
+;;                   (narrow-to-region pos to)
+;;                   (utf-8-compose)))
+;;            t))))
+
+;; (dotimes (i 96)
+;;   (aset composition-function-table
+;;         (+ 128 i)
+;;         `((,(string-as-multibyte "[\200-\237\240-\377]")
+;;            . utf-8-compose-function))))
 
 ;; arch-tag: b08735b7-753b-4ae6-b754-0f3efe4515c5
 ;;; utf-8.el ends here