From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!not-for-mail From: Kenichi Handa Newsgroups: gmane.emacs.devel Subject: Re: normalize.el Date: Wed, 08 Apr 2009 14:39:34 +0900 Message-ID: References: NNTP-Posting-Host: lo.gmane.org X-Trace: ger.gmane.org 1239169148 24546 80.91.229.12 (8 Apr 2009 05:39:08 GMT) X-Complaints-To: usenet@ger.gmane.org NNTP-Posting-Date: Wed, 8 Apr 2009 05:39:08 +0000 (UTC) Cc: kawabata.taichi@gmail.com, emacs-devel@gnu.org To: Stefan Monnier Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Wed Apr 08 07:40:24 2009 Return-path: Envelope-to: ged-emacs-devel@m.gmane.org Original-Received: from lists.gnu.org ([199.232.76.165]) by lo.gmane.org with esmtp (Exim 4.50) id 1LrQWM-0000sf-Li for ged-emacs-devel@m.gmane.org; Wed, 08 Apr 2009 07:40:23 +0200 Original-Received: from localhost ([127.0.0.1]:58980 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.43) id 1LrQUy-00068S-87 for ged-emacs-devel@m.gmane.org; Wed, 08 Apr 2009 01:38:56 -0400 Original-Received: from mailman by lists.gnu.org with tmda-scanned (Exim 4.43) id 1LrQUo-00063N-IL for emacs-devel@gnu.org; Wed, 08 Apr 2009 01:38:46 -0400 Original-Received: from exim by lists.gnu.org with spam-scanned (Exim 4.43) id 1LrQUk-0005vQ-4m for emacs-devel@gnu.org; Wed, 08 Apr 2009 01:38:45 -0400 Original-Received: from [199.232.76.173] (port=54824 helo=monty-python.gnu.org) by lists.gnu.org with esmtp (Exim 4.43) id 1LrQUj-0005vB-9W for emacs-devel@gnu.org; Wed, 08 Apr 2009 01:38:41 -0400 Original-Received: from mx1.aist.go.jp ([150.29.246.133]:42100) by monty-python.gnu.org with esmtp (Exim 4.60) (envelope-from ) id 1LrQUi-0002pm-9R for emacs-devel@gnu.org; Wed, 08 Apr 2009 01:38:41 -0400 Original-Received: from rqsmtp1.aist.go.jp (rqsmtp1.aist.go.jp [150.29.254.115]) by mx1.aist.go.jp with ESMTP id n385cWb8015801; Wed, 8 Apr 2009 14:38:32 +0900 (JST) env-from (handa@m17n.org) Original-Received: from smtp1.aist.go.jp by rqsmtp1.aist.go.jp with ESMTP id n385cW5G003403; Wed, 8 Apr 2009 14:38:32 +0900 (JST) env-from (handa@m17n.org) Original-Received: by smtp1.aist.go.jp with ESMTP id n385cVhA029055; Wed, 8 Apr 2009 14:38:31 +0900 (JST) env-from (handa@m17n.org) Original-Received: from handa by etlken with local (Exim 4.69) (envelope-from ) id 1LrQVa-0003Qz-Er; Wed, 08 Apr 2009 14:39:34 +0900 In-reply-to: (message from Stefan Monnier on Thu, 02 Apr 2009 12:49:53 -0400) X-detected-operating-system: by monty-python.gnu.org: Solaris 9 X-BeenThere: emacs-devel@gnu.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: "Emacs development discussions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Xref: news.gmane.org gmane.emacs.devel:110143 Archived-At: In article , Stefan Monnier writes: > > The attached is an Unicode normalization tool contributed by > > Kawabata-san. It performs all the Unicode normalization > > NFC/NFD/NFKD/NFKC, and provides a coding system utf-8-hfs > > that is suitable to be used for Mac OS 8.1's file names. > Great. Please rename it to uni-normalize.el or ucs-normalize.el. I renamed it to ucs-normalize.el and changed variable/function names as attached. But, byte-compiling shows these warnings and error. As the original normalize.el causes the same error, Kawabata-san, could you please fix it. Compiling file /home/handa/ucs-normalize.el at Wed Apr 8 14:37:56 2009 Entering directory `/home/handa/' ucs-normalize.el:187:46:Warning: assignment to free variable `combining-chars' ucs-normalize.el:198:43:Warning: assignment to free variable `decomp-pair-to-comp-hash' ucs-normalize.el:201:30:Warning: assignment to free variable `nfc-search-chars' ucs-normalize.el:202:30:Warning: assignment to free variable `nfc-singletons' ucs-normalize.el:203:30:Warning: assignment to free variable `nfd-alist' ucs-normalize.el:206:30:Warning: assignment to free variable `hfs-nfc-search-chars' ucs-normalize.el:207:30:Warning: assignment to free variable `hfs-nfc-singletons' ucs-normalize.el:208:30:Warning: assignment to free variable `hfs-nfd-alist' ucs-normalize.el:191:19:Warning: assignment to free variable `nfkd-alist' ucs-normalize.el:192:36:Warning: reference to free variable `nfkd-alist' ucs-normalize.el:209:13:Warning: reference to free variable `decomp-pair-to-comp-hash' ucs-normalize.el:205:12:Warning: reference to free variable `nfc-predicate' ucs-normalize.el:205:12:Warning: reference to free variable `nfc-search-chars' ucs-normalize.el:205:12:Warning: reference to free variable `nfc-singletons' ucs-normalize.el:223:40:Warning: reference to free variable `nfd-alist' ucs-normalize.el:223:40:Warning: reference to free variable `hfs-nfc-predicate' ucs-normalize.el:223:40:Warning: reference to free variable `hfs-nfc-search-chars' ucs-normalize.el:223:40:Warning: reference to free variable `hfs-nfc-singletons' ucs-normalize.el:224:40:Warning: reference to free variable `hfs-nfd-alist' ucs-normalize.el:225:4:Error: Symbol's value as variable is void: nfc-predicate --- Kenichi Handa handa@m17n.org ;;; ucs-normalize.el --- Unicode normalization NFC/NFD/NFKD/NFKC ;; Copyright (C) 2009 ;; Free Software Foundation, Inc. ;; Author: Taichi Kawabata ;; Maintainer: Taichi Kawabata ;; Keywords: unicode, normalization ;; This file is part of GNU Emacs. ;; GNU Emacs is free software: you can redistribute it and/or modify ;; it under the terms of the GNU General Public License as published by ;; the Free Software Foundation, either version 3 of the License, or ;; (at your option) any later version. ;; GNU Emacs is distributed in the hope that it will be useful, ;; but WITHOUT ANY WARRANTY; without even the implied warranty of ;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the ;; GNU General Public License for more details. ;; You should have received a copy of the GNU General Public License ;; along with GNU Emacs. If not, see . ;;; Commentary: ;; The following code will normalize the string or text according to ;; the NFC/NFD/NFKD (single step) NFKC (dual step). ;; This program has passed the NormalizationTest-4.1.0.txt except ;; Hangul normalizations. ;; References: ;; http://www.unicode.org/reports/tr15/ ;; http://www.unicode.org/review/pr-29.html ;; Furthermore, it supports HFS normalization, for people who don't ;; want to normalize the CJK characters. ;;; Note on HFS Normalization. ;; ;; HFS-Normalization is the way to normalize text, as described by HFS ;; Plus Volue format specification ;; (http://developer.apple.com/technotes/tn/tn1150.html). This is ;; useful espcially for CJK people, since it doesn't normlize CJK ;; compatibility characters. ;; ;; Normalization Exclusion Area: ;; U+02000 .. U+02FFF :: Punctuation, symbols, dingbats, arrows, etc. ;; U+0F900 .. U+0FAFF :: CJK compatibility Ideographs. ;; U+2F800 .. U+2FFFF :: CJK compatibility Ideographs. ;;; ;;; Implementation Notes on NFC/HFS-NFC. (ccc = canonical combining class) ;; ;; A. `re-search-forward' for the following characters. ;; (1) characters that should be decomposed anyway. ;; (1-a) composition-exclusion characters ;; (1-b) singleton characters ;; (1-c) characters whose decomposed first char does not have ccc=0. ;; (2) characters that may appear as second of composed character. ;; (2-a) character whose ccc != 0. ;; (2-b) character whose ccc = 0. ;; ;; B. In case of (1) in step A, translate character, go to the ;; beginning position of translated region, and go back to step A. ;; ;; C. In case of (2) in step A, normalize the block around the ;; searched character. (`normalize-block') ;; ;; (1) start of the block ;; In case of (2-a) in step A, if the previous character ;; exists, then the beginning of the block is the previous ;; character. In case of (2-b), if the previous character's ;; ccc = 0, then the beginning of the block is the previous ;; character. Otherwise, the beginning of the block is the ;; searched character. ;; (2) end of the block ;; from the searched character, the end of the block is ;; searched until the character whose ccc=0. ;; ;; D. Normalize the block ;; (1) decompose the block ;; (2) sort the combining characters ;; (3) compose the block. ;;; ;;; Implementation Notes on NFD/HFS-NFD. ;; ;; A. `re-search-forward' for the combining characters sequence. ;; ;; B. Normalize the block around the searched character sequence. ;; (1) Start of the block is the previous character if it exists. ;; If not, start of the block is the searched character. ;; (2) End of the block is the end of the combining character ;; sequence. ;; ;; C. Normalize the block ;; (1) decompose the block ;; (2) sort the combining characters ;; ;; D. Translate all the rest at once. ;;; ;;; Implementation Notes on NFKC/NFKD ;; ;; NFKC/NFKD is realized as applying NFC/NFD to the text which is ;; firstly translated by NFKD(-unique) decomposition patterns. ;;; Code: (eval-and-compile (defconst ucs-normalize-composition-exclusions '(#x0958 #x0959 #x095A #x095B #x095C #x095D #x095E #x095F #x09DC #x09DD #x09DF #x0A33 #x0A36 #x0A59 #x0A5A #x0A5B #x0A5E #x0B5C #x0B5D #x0F43 #x0F4D #x0F52 #x0F57 #x0F5C #x0F69 #x0F76 #x0F78 #x0F93 #x0F9D #x0FA2 #x0FA7 #x0FAC #x0FB9 #xFB1D #xFB1F #xFB2A #xFB2B #xFB2C #xFB2D #xFB2E #xFB2F #xFB30 #xFB31 #xFB32 #xFB33 #xFB34 #xFB35 #xFB36 #xFB38 #xFB39 #xFB3A #xFB3B #xFB3C #xFB3E #xFB40 #xFB41 #xFB43 #xFB44 #xFB46 #xFB47 #xFB48 #xFB49 #xFB4A #xFB4B #xFB4C #xFB4D #xFB4E #x2ADC #x1D15E #x1D15F #x1D160 #x1D161 #x1D162 #x1D163 #x1D164 #x1D1BB #x1D1BC #x1D1BD #x1D1BE #x1D1BF #x1D1C0) "Composition Exclusion Table. The table is taken from http://www.unicode.org/Public/UNIDATA/CompositionExclusions-4.1.0.txt") ) ;; (eval-when-compile ;; generics (setq combining-chars nil) (setq decomp-pair-to-comp-hash (make-hash-table :test 'equal :size 2000)) ; 1003 in Unicode 4.1.0 (except hangul) ;; NFC/NFD (defun nfc-predicate (char) t) (setq nfc-search-chars nil) (setq nfc-singletons nil) (setq nfd-alist nil) ;; HFS-NFC/HFS-NFD (defun hfs-nfc-predicate (char) (or (and (>= char 0) (< char #x2000)) (and (>= char #x3000) (< char #xf900)) (and (>= char #xfb00) (< char #x2f800)) (>= char #x30000))) (setq hfs-nfc-search-chars nil) (setq hfs-nfc-singletons nil) (setq hfs-nfd-alist nil) ;; NFKC/NFKD (setq nfkd-alist nil) ;; macros (defmacro register-database (predicate search-chars singletons decomposition-alist) `(when (apply #',predicate (list char)) (let ((decomp-1st-ccc (get-char-code-property (car decomposition) 'canonical-combining-class))) (setq ,decomposition-alist (cons (cons char (vconcat decomposition)) ,decomposition-alist)) (unless (memq char ucs-normalize-composition-exclusions) (if (= 1 (length decomposition)) (setq ,singletons (cons char ,singletons)) ;; add second char of decomposition for search-chars iff ccc = 0 (if (= 0 (get-char-code-property (cadr decomposition) 'canonical-combining-class)) (add-to-list ',search-chars (cadr decomposition)))) ;; For fast normalization, the case of positive integer ;; `decomp-1st-ccc' means that the char should also be ;; considered as something like singleton. (if (and decomp-1st-ccc (/= 0 decomp-1st-ccc)) (setq ,singletons (cons char ,singletons))))))) (let ((char 0) ccc decomposition) (while (< char #x30000) (setq ccc (get-char-code-property char 'canonical-combining-class)) (setq decomposition (get-char-code-property char 'decomposition)) ;; add all combining-char to fast search list. (if (and ccc (/= 0 ccc)) (add-to-list 'combining-chars char)) (when decomposition (if (symbolp (car decomposition)) ;; compatibiliy (setq nfkd-alist (cons (cons char (vconcat (cdr decomposition))) nfkd-alist)) ;; canonical (setq nfkd-alist (cons (cons char (vconcat decomposition)) nfkd-alist)) (if (and (= 2 (length decomposition)) (null (memq char ucs-normalize-composition-exclusions))) (puthash decomposition char decomp-pair-to-comp-hash)) ;; NFC/NFD (register-database nfc-predicate nfc-search-chars nfc-singletons nfd-alist) ;; HFS-NFC/HFS-NFD (register-database hfs-nfc-predicate hfs-nfc-search-chars hfs-nfc-singletons hfs-nfd-alist))) (setq char (1+ char)))) (defun normalize-repetitively-expand-alist (alist) (let ((flag t) elem hit) (while flag (setq flag nil) (dolist (entry alist) (setq elem (mapcar (lambda (x) (setq hit (assoc x alist)) (if hit (setq flag t)) (if hit (mapcar 'identity (cdr hit)) (list x))) (cdr entry))) (if flag (setcdr entry (vconcat (apply 'append elem)))))))) (normalize-repetitively-expand-alist nfd-alist) (normalize-repetitively-expand-alist hfs-nfd-alist) (normalize-repetitively-expand-alist nfkd-alist) ) (defvar ucs-normalize-combining-chars-regexp nil "Regular expression to match sequence of combining characters.") (setq ucs-normalize-combining-chars-regexp (eval-when-compile (concat (regexp-opt (mapcar 'char-to-string combining-chars)) "+"))) (defvar ucs-normalize-decomp-pair-to-comp-hash nil "Hashtable of decomposed pair to primary composite.") (setq ucs-normalize-decomp-pair-to-comp-hash (eval-when-compile decomp-pair-to-comp-hash)) ;; NFC/NFD (defvar ucs-normalize-nfc-search-regexp nil) (setq ucs-normalize-nfc-search-regexp (eval-when-compile (regexp-opt (append (mapcar 'char-to-string nfc-search-chars) (mapcar 'char-to-string combining-chars) (mapcar 'char-to-string ucs-normalize-composition-exclusions) (mapcar 'char-to-string nfc-singletons))))) (defvar ucs-normalize-nfc-translation-chars nil) (setq ucs-normalize-nfc-translation-chars (eval-when-compile (append ucs-normalize-composition-exclusions nfc-singletons))) (define-translation-table 'ucs-normalize-nfd-table (eval-when-compile (make-translation-table-from-alist nfd-alist))) ;; HFS-NFC/NFD (defvar ucs-normalize-hfs-nfc-search-regexp nil) (setq ucs-normalize-hfs-nfc-search-regexp (eval-when-compile (regexp-opt (append (mapcar 'char-to-string hfs-nfc-search-chars) (mapcar 'char-to-string combining-chars) (mapcar 'char-to-string ucs-normalize-composition-exclusions) (mapcar 'char-to-string hfs-nfc-singletons))))) (defvar ucs-normalize-hfs-nfc-translation-chars nil) (setq ucs-normalize-hfs-nfc-translation-chars (eval-when-compile (append ucs-normalize-composition-exclusions hfs-nfc-singletons))) (define-translation-table 'ucs-normalize-hfs-nfd-table (eval-when-compile (make-translation-table-from-alist hfs-nfd-alist))) ;; NFKC/NFKD (define-translation-table 'ucs-normalize-nfkd-table (eval-when-compile (make-translation-table-from-alist nfkd-alist))) ;;------------------------------------------------------------------------------------------ ;; Normalize local region. (defun ucs-normalize-block (from to &optional translation-table compose) ;; block-string = [starter? diacritics+] (save-restriction (narrow-to-region from to) (goto-char (point-min)) ;; decompose if needed. (if translation-table (translate-region from to translation-table)) (let ((starter (char-after from)) diacritics) (if (= (get-char-code-property starter 'canonical-combining-class) 0) (setq diacritics (string-to-list (buffer-substring (1+ from) (point-max)))) (setq starter nil) (setq diacritics (string-to-list (buffer-substring from (point-max))))) ;; sort diacritical marks if needed. (if (< 1 (length diacritics)) (setq diacritics (sort diacritics (lambda (ch1 ch2) (< (get-char-code-property ch1 'canonical-combining-class) (get-char-code-property ch2 'canonical-combining-class)))))) ;; compose if needed (and both starter and diacritics exists). (if (and compose (and starter diacritics)) (let ((diac diacritics) prev (prev-ccc 0) ccc prim-comp) (while diac (setq ccc (get-char-code-property (car diac) 'canonical-combining-class)) (if (and (or (< prev-ccc ccc) (= ccc 0)) ;; e.g. #x09c7 #x09be -> #x09cb (setq prim-comp (gethash (list starter (car diac)) ucs-normalize-decomp-pair-to-comp-hash))) (progn (setq prev-ccc 0) (setq starter prim-comp) (if prev (setcdr prev (cdr diac)) (setq diacritics (cdr diacritics))) (setq diac diacritics)) (setq prev-ccc ccc prev diac diac (cdr diac)))))) ;; replace buffer (delete-region (point-min) (point-max)) (insert (concat (if starter (list starter)) diacritics))) (point-max))) ;; return the last position. (defun ucs-normalize-composition-region (from to regexp translation-table translation-chars &optional compat) (save-excursion (save-restriction (narrow-to-region from to) (if compat (translate-region from to 'ucs-normalize-nfkd-table)) (goto-char (point-min)) (let (start-pos starter) (while (re-search-forward regexp nil t) (setq starter (string-to-char (match-string 0))) (setq start-pos (match-beginning 0)) (if (memq starter translation-chars) (progn (translate-region start-pos (match-end 0) translation-table) (goto-char start-pos)) (goto-char (ucs-normalize-block ;; from (if (or (= start-pos (point-min)) (and (= 0 (get-char-code-property starter 'canonical-combining-class)) (/= 0 (get-char-code-property (char-after (1- start-pos)) 'canonical-combining-class)))) start-pos (1- start-pos)) ;; to (if (looking-at ucs-normalize-combining-chars-regexp) (match-end 0) (1+ start-pos)) 'ucs-normalize-nfd-table t)))))))) (defun ucs-normalize-decomposition-region (from to translation-table) (save-excursion (save-restriction (narrow-to-region from to) (goto-char (point-min)) (let (start-pos starter) (while (re-search-forward ucs-normalize-combining-chars-regexp nil t) (setq starter (string-to-char (match-string 0))) (setq start-pos (match-beginning 0)) (goto-char (ucs-normalize-block (if (= start-pos (point-min)) start-pos (1- start-pos)) (match-end 0) translation-table nil)))) (translate-region (point-min) (point-max) translation-table)))) ;; -------------------------------------------------------------------------------- ;;;###autoload (defun ucs-normalize-NFC-region (from to) "Normalize the current region by the Unicode NFC." (interactive "r") (ucs-normalize-composition-region from to ucs-normalize-nfc-search-regexp 'ucs-normalize-nfd-table ucs-normalize-nfc-translation-chars)) ;;;###autoload (defun ucs-normalize-NFC-string (str) "Normalize the string STR by the Unicode NFC." (with-temp-buffer (insert str) (ucs-normalize-NFC-region (point-min) (point-max)) (buffer-string))) ;;;###autoload (defun ucs-normalize-NFD-region (from to) "Normalize the current region by the Unicode NFD." (interactive "r") (ucs-normalize-decomposition-region from to 'ucs-normalize-nfd-table)) ;;;###autoload (defun ucs-normalize-NFD-string (str) "Normalize the string STR by the Unicode NFD." (with-temp-buffer (insert str) (ucs-normalize-NFD-region (point-min) (point-max)) (buffer-string))) ;;;###autoload (defun ucs-normalize-HFS-NFC-region (from to) "Normalize the current region by the Unicode NFC and Mac OS's HFS Plus." (interactive "r") (ucs-normalize-composition-region from to ucs-normalize-hfs-nfc-search-regexp 'ucs-normalize-hfs-nfd-table ucs-normalize-hfs-nfc-translation-chars)) ;;;###autoload (defun ucs-normalize-HFS-NFC-string (str) "Normalize the string STR by the Unicode NFC and Mac OS's HFS Plus." (with-temp-buffer (insert str) (ucs-normalize-HFS-NFC-region (point-min) (point-max)) (buffer-string))) ;;;###autoload (defun ucs-normalize-HFS-NFD-region (from to) "Normalize the current region by the Unicode NFD and Mac OS's HFS Plus." (interactive "r") (ucs-normalize-decomposition-region from to 'ucs-normalize-hfs-nfd-table)) ;;;###autoload (defun ucs-normalize-HFS-NFD-string (str) "Normalize the string STR by the Unicode NFD and Mac OS's HFS Plus." (with-temp-buffer (insert str) (ucs-normalize-HFS-NFD-region (point-min) (point-max)) (buffer-string))) ;;;###autoload (defun ucs-normalize-NFKC-region (from to) "Normalize the current region by the Unicode NFKC." (interactive "r") (ucs-normalize-composition-region from to ucs-normalize-nfc-search-regexp 'ucs-normalize-nfd-table ucs-normalize-nfc-translation-chars t)) ;;;###autoload (defun ucs-normalize-NFKC-string (str) "Normalize the string STR by the Unicode NFKC." (with-temp-buffer (insert str) (ucs-normalize-NFKC-region (point-min) (point-max)) (buffer-string))) ;;;###autoload (defun ucs-normalize-NFKD-region (from to) "Normalize the current region by the Unicode NFKD." (interactive "r") (ucs-normalize-decomposition-region from to 'ucs-normalize-nfkd-table)) ;;;###autoload (defun ucs-normalize-NFKD-string (str) "Normalize the string STR by the Unicode NFKD." (with-temp-buffer (insert str) (ucs-normalize-NFKD-region (point-min) (point-max)) (buffer-string))) ;; Post-read-conversion function for `utf-8-hfs'. (defun ucs-normalize-hfs-nfd-post-read-conversion (len) (save-excursion (save-restriction (narrow-to-region (point) (+ (point) len)) (let ((buffer-modified-p (buffer-modified-p))) (normalize-hfs-NFC-region (point-min) (point-max)) (- (point-max) (point-min)))))) ;;; coding-system definition (define-coding-system 'utf-8-hfs "UTF-8 base coding system with normalization on decoding. The characters in HFS normalization exclusion area are not normalized. On encoding, don't perform normalization." :coding-type 'utf-8 :mnemonic ?U :charset-list '(unicode) ;; NFD decoder ;; :decode-translation-table (not necessary) :post-read-conversion 'ucs-normalize-hfs-nfd-post-read-conversion ;; NFD encoder is not necessary because MacOS will automatically do it. ;; :encode-translation-table 'nfd-encode-translation-table ;; :pre-write-conversion 'nfd-encode-function ) ;;; normalize.el ends here