From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!not-for-mail From: Kenichi Handa Newsgroups: gmane.emacs.devel Subject: normalize.el Date: Thu, 02 Apr 2009 15:40:44 +0900 Message-ID: NNTP-Posting-Host: lo.gmane.org X-Trace: ger.gmane.org 1238654426 10220 80.91.229.12 (2 Apr 2009 06:40:26 GMT) X-Complaints-To: usenet@ger.gmane.org NNTP-Posting-Date: Thu, 2 Apr 2009 06:40:26 +0000 (UTC) Cc: kawabata.taichi@gmail.com To: emacs-devel@gnu.org Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Thu Apr 02 08:41:44 2009 Return-path: Envelope-to: ged-emacs-devel@m.gmane.org Original-Received: from lists.gnu.org ([199.232.76.165]) by lo.gmane.org with esmtp (Exim 4.50) id 1LpGcN-0007gY-GT for ged-emacs-devel@m.gmane.org; Thu, 02 Apr 2009 08:41:43 +0200 Original-Received: from localhost ([127.0.0.1]:42485 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.43) id 1LpGaz-00039G-HL for ged-emacs-devel@m.gmane.org; Thu, 02 Apr 2009 02:40:13 -0400 Original-Received: from mailman by lists.gnu.org with tmda-scanned (Exim 4.43) id 1LpGas-00037O-Pm for emacs-devel@gnu.org; Thu, 02 Apr 2009 02:40:06 -0400 Original-Received: from exim by lists.gnu.org with spam-scanned (Exim 4.43) id 1LpGan-00030l-8l for emacs-devel@gnu.org; Thu, 02 Apr 2009 02:40:06 -0400 Original-Received: from [199.232.76.173] (port=35356 helo=monty-python.gnu.org) by lists.gnu.org with esmtp (Exim 4.43) id 1LpGam-00030i-US for emacs-devel@gnu.org; Thu, 02 Apr 2009 02:40:00 -0400 Original-Received: from mx1.aist.go.jp ([150.29.246.133]:38911) by monty-python.gnu.org with esmtp (Exim 4.60) (envelope-from ) id 1LpGal-0007hR-OW for emacs-devel@gnu.org; Thu, 02 Apr 2009 02:40:00 -0400 Original-Received: from rqsmtp2.aist.go.jp (rqsmtp2.aist.go.jp [150.29.254.123]) by mx1.aist.go.jp with ESMTP id n326drYm014504; Thu, 2 Apr 2009 15:39:53 +0900 (JST) env-from (handa@m17n.org) Original-Received: from smtp2.aist.go.jp by rqsmtp2.aist.go.jp with ESMTP id n326dreG007215; Thu, 2 Apr 2009 15:39:53 +0900 (JST) env-from (handa@m17n.org) Original-Received: by smtp2.aist.go.jp with ESMTP id n326dqUk012047; Thu, 2 Apr 2009 15:39:52 +0900 (JST) env-from (handa@m17n.org) Original-Received: from handa by etlken with local (Exim 4.69) (envelope-from ) id 1LpGbU-0001ub-65; Thu, 02 Apr 2009 15:40:44 +0900 X-detected-operating-system: by monty-python.gnu.org: Solaris 9 X-BeenThere: emacs-devel@gnu.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: "Emacs development discussions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Xref: news.gmane.org gmane.emacs.devel:109989 Archived-At: The attached is an Unicode normalization tool contributed by Kawabata-san. It performs all the Unicode normalization NFC/NFD/NFKD/NFKC, and provides a coding system utf-8-hfs that is suitable to be used for Mac OS 8.1's file names. The assignment paper from Kawabata-san is already arrived at FSF. Perhaps committing it to the trunk now is not good considering that we are already pretesting for 23.1. What should I do with it? At least, I want Mac users to test it by setting the default-file-name-coding-system to `utf-8-hfs'. --- Kenichi Handa handa@m17n.org ;;; normalize.el --- Unicode normalization NFC/NFD/NFKD/NFKC ;; Copyright (C) 2009 ;; Free Software Foundation, Inc. ;; Author: Taichi Kawabata ;; Maintainer: Taichi Kawabata ;; Keywords: unicode, normalization ;; This file is part of GNU Emacs. ;; GNU Emacs is free software: you can redistribute it and/or modify ;; it under the terms of the GNU General Public License as published by ;; the Free Software Foundation, either version 3 of the License, or ;; (at your option) any later version. ;; GNU Emacs is distributed in the hope that it will be useful, ;; but WITHOUT ANY WARRANTY; without even the implied warranty of ;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the ;; GNU General Public License for more details. ;; You should have received a copy of the GNU General Public License ;; along with GNU Emacs. If not, see . ;;; Commentary: ;; The following code will normalize the string or text according to ;; the NFC/NFD/NFKD (single step) NFKC (dual step). ;; This program has passed the NormalizationTest-4.1.0.txt except ;; Hangul normalizations. ;; References: ;; http://www.unicode.org/reports/tr15/ ;; http://www.unicode.org/review/pr-29.html ;; Furthermore, it supports HFS normalization, for people who don't ;; want to normalize the CJK characters. ;;; Note on HFS Normalization. ;; ;; HFS-Normalization is the way to normalize text, as described by HFS ;; Plus Volue format specification ;; (http://developer.apple.com/technotes/tn/tn1150.html). This is ;; useful espcially for CJK people, since it doesn't normlize CJK ;; compatibility characters. ;; ;; Normalization Exclusion Area: ;; U+02000 .. U+02FFF :: Punctuation, symbols, dingbats, arrows, etc. ;; U+0F900 .. U+0FAFF :: CJK compatibility Ideographs. ;; U+2F800 .. U+2FFFF :: CJK compatibility Ideographs. ;;; ;;; Implementation Notes on NFC/HFS-NFC. (ccc = canonical combining class) ;; ;; A. `re-search-forward' for the following characters. ;; (1) characters that should be decomposed anyway. ;; (1-a) composition-exclusion characters ;; (1-b) singleton characters ;; (1-c) characters whose decomposed first char does not have ccc=0. ;; (2) characters that may appear as second of composed character. ;; (2-a) character whose ccc != 0. ;; (2-b) character whose ccc = 0. ;; ;; B. In case of (1) in step A, translate character, go to the ;; beginning position of translated region, and go back to step A. ;; ;; C. In case of (2) in step A, normalize the block around the ;; searched character. (`normalize-block') ;; ;; (1) start of the block ;; In case of (2-a) in step A, if the previous character ;; exists, then the beginning of the block is the previous ;; character. In case of (2-b), if the previous character's ;; ccc = 0, then the beginning of the block is the previous ;; character. Otherwise, the beginning of the block is the ;; searched character. ;; (2) end of the block ;; from the searched character, the end of the block is ;; searched until the character whose ccc=0. ;; ;; D. Normalize the block ;; (1) decompose the block ;; (2) sort the combining characters ;; (3) compose the block. ;;; ;;; Implementation Notes on NFD/HFS-NFD. ;; ;; A. `re-search-forward' for the combining characters sequence. ;; ;; B. Normalize the block around the searched character sequence. ;; (1) Start of the block is the previous character if it exists. ;; If not, start of the block is the searched character. ;; (2) End of the block is the end of the combining character ;; sequence. ;; ;; C. Normalize the block ;; (1) decompose the block ;; (2) sort the combining characters ;; ;; D. Translate all the rest at once. ;;; ;;; Implementation Notes on NFKC/NFKD ;; ;; NFKC/NFKD is realized as applying NFC/NFD to the text which is ;; firstly translated by NFKD(-unique) decomposition patterns. ;;; Code: (defconst normalize-composition-exclusions '(#x0958 #x0959 #x095A #x095B #x095C #x095D #x095E #x095F #x09DC #x09DD #x09DF #x0A33 #x0A36 #x0A59 #x0A5A #x0A5B #x0A5E #x0B5C #x0B5D #x0F43 #x0F4D #x0F52 #x0F57 #x0F5C #x0F69 #x0F76 #x0F78 #x0F93 #x0F9D #x0FA2 #x0FA7 #x0FAC #x0FB9 #xFB1D #xFB1F #xFB2A #xFB2B #xFB2C #xFB2D #xFB2E #xFB2F #xFB30 #xFB31 #xFB32 #xFB33 #xFB34 #xFB35 #xFB36 #xFB38 #xFB39 #xFB3A #xFB3B #xFB3C #xFB3E #xFB40 #xFB41 #xFB43 #xFB44 #xFB46 #xFB47 #xFB48 #xFB49 #xFB4A #xFB4B #xFB4C #xFB4D #xFB4E #x2ADC #x1D15E #x1D15F #x1D160 #x1D161 #x1D162 #x1D163 #x1D164 #x1D1BB #x1D1BC #x1D1BD #x1D1BE #x1D1BF #x1D1C0) "Composition Exclusion Table. The table is taken from http://www.unicode.org/Public/UNIDATA/CompositionExclusions-4.1.0.txt") ;; (eval-when-compile ;; generics (setq combining-chars nil) (setq decomp-pair-to-comp-hash (make-hash-table :test 'equal :size 2000)) ; 1003 in Unicode 4.1.0 (except hangul) ;; NFC/NFD (defun nfc-predicate (char) t) (setq nfc-search-chars nil) (setq nfc-singletons nil) (setq nfd-alist nil) ;; HFS-NFC/HFS-NFD (defun hfs-nfc-predicate (char) (or (and (>= char 0) (< char #x2000)) (and (>= char #x3000) (< char #xf900)) (and (>= char #xfb00) (< char #x2f800)) (>= char #x30000))) (setq hfs-nfc-search-chars nil) (setq hfs-nfc-singletons nil) (setq hfs-nfd-alist nil) ;; NFKC/NFKD (setq nfkd-alist nil) ;; macros (defmacro register-database (predicate search-chars singletons decomposition-alist) `(when (apply #',predicate (list char)) (let ((decomp-1st-ccc (get-char-code-property (car decomposition) 'canonical-combining-class))) (setq ,decomposition-alist (cons (cons char (vconcat decomposition)) ,decomposition-alist)) (unless (memq char normalize-composition-exclusions) (if (= 1 (length decomposition)) (setq ,singletons (cons char ,singletons)) ;; add second char of decomposition for search-chars iff ccc = 0 (if (= 0 (get-char-code-property (cadr decomposition) 'canonical-combining-class)) (add-to-list ',search-chars (cadr decomposition)))) ;; For fast normalization, the case of positive integer ;; `decomp-1st-ccc' means that the char should also be ;; considered as something like singleton. (if (and decomp-1st-ccc (/= 0 decomp-1st-ccc)) (setq ,singletons (cons char ,singletons))))))) (let ((char 0) ccc decomposition) (while (< char #x30000) (setq ccc (get-char-code-property char 'canonical-combining-class)) (setq decomposition (get-char-code-property char 'decomposition)) ;; add all combining-char to fast search list. (if (and ccc (/= 0 ccc)) (add-to-list 'combining-chars char)) (when decomposition (if (symbolp (car decomposition)) ;; compatibiliy (setq nfkd-alist (cons (cons char (vconcat (cdr decomposition))) nfkd-alist)) ;; canonical (setq nfkd-alist (cons (cons char (vconcat decomposition)) nfkd-alist)) (if (and (= 2 (length decomposition)) (null (memq char normalize-composition-exclusions))) (puthash decomposition char decomp-pair-to-comp-hash)) ;; NFC/NFD (register-database nfc-predicate nfc-search-chars nfc-singletons nfd-alist) ;; HFS-NFC/HFS-NFD (register-database hfs-nfc-predicate hfs-nfc-search-chars hfs-nfc-singletons hfs-nfd-alist))) (setq char (1+ char)))) (defun normalize-repetitively-expand-alist (alist) (let ((flag t) elem hit) (while flag (setq flag nil) (dolist (entry alist) (setq elem (mapcar (lambda (x) (setq hit (assoc x alist)) (if hit (setq flag t)) (if hit (mapcar 'identity (cdr hit)) (list x))) (cdr entry))) (if flag (setcdr entry (vconcat (apply 'append elem)))))))) (normalize-repetitively-expand-alist nfd-alist) (normalize-repetitively-expand-alist hfs-nfd-alist) (normalize-repetitively-expand-alist nfkd-alist) ) (defvar normalize-combining-chars-regexp nil "Regular expression to match sequence of combining characters.") (setq normalize-combining-chars-regexp (eval-when-compile (concat (regexp-opt (mapcar 'char-to-string combining-chars)) "+"))) (defvar normalize-decomp-pair-to-comp-hash nil "Hashtable of decomposed pair to primary composite.") (setq normalize-decomp-pair-to-comp-hash (eval-when-compile decomp-pair-to-comp-hash)) ;; NFC/NFD (defvar normalize-nfc-search-regexp nil) (setq normalize-nfc-search-regexp (eval-when-compile (regexp-opt (append (mapcar 'char-to-string nfc-search-chars) (mapcar 'char-to-string combining-chars) (mapcar 'char-to-string normalize-composition-exclusions) (mapcar 'char-to-string nfc-singletons))))) (defvar normalize-nfc-translation-chars nil) (setq normalize-nfc-translation-chars (eval-when-compile (append normalize-composition-exclusions nfc-singletons))) (define-translation-table 'normalize-nfd-table (eval-when-compile (make-translation-table-from-alist nfd-alist))) ;; HFS-NFC/NFD (defvar normalize-hfs-nfc-search-regexp nil) (setq normalize-hfs-nfc-search-regexp (eval-when-compile (regexp-opt (append (mapcar 'char-to-string hfs-nfc-search-chars) (mapcar 'char-to-string combining-chars) (mapcar 'char-to-string normalize-composition-exclusions) (mapcar 'char-to-string hfs-nfc-singletons))))) (defvar normalize-hfs-nfc-translation-chars nil) (setq normalize-hfs-nfc-translation-chars (eval-when-compile (append normalize-composition-exclusions hfs-nfc-singletons))) (define-translation-table 'normalize-hfs-nfd-table (eval-when-compile (make-translation-table-from-alist hfs-nfd-alist))) ;; NFKC/NFKD (define-translation-table 'normalize-nfkd-table (eval-when-compile (make-translation-table-from-alist nfkd-alist))) ;;------------------------------------------------------------------------------------------ ;; Normalize local region. (defun normalize-block (from to &optional translation-table compose) ;; block-string = [starter? diacritics+] (save-restriction (narrow-to-region from to) (goto-char (point-min)) ;; decompose if needed. (if translation-table (translate-region from to translation-table)) (let ((starter (char-after from)) diacritics) (if (= (get-char-code-property starter 'canonical-combining-class) 0) (setq diacritics (string-to-list (buffer-substring (1+ from) (point-max)))) (setq starter nil) (setq diacritics (string-to-list (buffer-substring from (point-max))))) ;; sort diacritical marks if needed. (if (< 1 (length diacritics)) (setq diacritics (sort diacritics (lambda (ch1 ch2) (< (get-char-code-property ch1 'canonical-combining-class) (get-char-code-property ch2 'canonical-combining-class)))))) ;; compose if needed (and both starter and diacritics exists). (if (and compose (and starter diacritics)) (let ((diac diacritics) prev (prev-ccc 0) ccc prim-comp) (while diac (setq ccc (get-char-code-property (car diac) 'canonical-combining-class)) (if (and (or (< prev-ccc ccc) (= ccc 0)) ;; e.g. #x09c7 #x09be -> #x09cb (setq prim-comp (gethash (list starter (car diac)) normalize-decomp-pair-to-comp-hash))) (progn (setq prev-ccc 0) (setq starter prim-comp) (if prev (setcdr prev (cdr diac)) (setq diacritics (cdr diacritics))) (setq diac diacritics)) (setq prev-ccc ccc prev diac diac (cdr diac)))))) ;; replace buffer (delete-region (point-min) (point-max)) (insert (concat (if starter (list starter)) diacritics))) (point-max))) ;; return the last position. (defun normalize-composition-region (from to regexp translation-table translation-chars &optional compat) (save-excursion (save-restriction (narrow-to-region from to) (if compat (translate-region from to 'normalize-nfkd-table)) (goto-char (point-min)) (let (start-pos starter) (while (re-search-forward regexp nil t) (setq starter (string-to-char (match-string 0))) (setq start-pos (match-beginning 0)) (if (memq starter translation-chars) (progn (translate-region start-pos (match-end 0) translation-table) (goto-char start-pos)) (goto-char (normalize-block ;; from (if (or (= start-pos (point-min)) (and (= 0 (get-char-code-property starter 'canonical-combining-class)) (/= 0 (get-char-code-property (char-after (1- start-pos)) 'canonical-combining-class)))) start-pos (1- start-pos)) ;; to (if (looking-at normalize-combining-chars-regexp) (match-end 0) (1+ start-pos)) 'normalize-nfd-table t)))))))) (defun normalize-decomposition-region (from to translation-table) (save-excursion (save-restriction (narrow-to-region from to) (goto-char (point-min)) (let (start-pos starter) (while (re-search-forward normalize-combining-chars-regexp nil t) (setq starter (string-to-char (match-string 0))) (setq start-pos (match-beginning 0)) (goto-char (normalize-block (if (= start-pos (point-min)) start-pos (1- start-pos)) (match-end 0) translation-table nil)))) (translate-region (point-min) (point-max) translation-table)))) ;; -------------------------------------------------------------------------------- ;;;###autoload (defun normalize-NFC-region (from to) "Normalize the current region by the Unicode NFC." (interactive "r") (normalize-composition-region from to normalize-nfc-search-regexp 'normalize-nfd-table normalize-nfc-translation-chars)) ;;;###autoload (defun normalize-NFC-string (str) "Normalize the string STR by the Unicode NFC." (with-temp-buffer (insert str) (normalize-NFC-region (point-min) (point-max)) (buffer-string))) ;;;###autoload (defun normalize-NFD-region (from to) "Normalize the current region by the Unicode NFD." (interactive "r") (normalize-decomposition-region from to 'normalize-nfd-table)) ;;;###autoload (defun normalize-NFD-string (str) "Normalize the string STR by the Unicode NFD." (with-temp-buffer (insert str) (normalize-NFD-region (point-min) (point-max)) (buffer-string))) ;;;###autoload (defun normalize-HFS-NFC-region (from to) "Normalize the current region by the Unicode NFC and Mac OS's HFS Plus." (interactive "r") (normalize-composition-region from to normalize-hfs-nfc-search-regexp 'normalize-hfs-nfd-table normalize-hfs-nfc-translation-chars)) ;;;###autoload (defun normalize-HFS-NFC-string (str) "Normalize the string STR by the Unicode NFC and Mac OS's HFS Plus." (with-temp-buffer (insert str) (normalize-HFS-NFC-region (point-min) (point-max)) (buffer-string))) ;;;###autoload (defun normalize-HFS-NFD-region (from to) "Normalize the current region by the Unicode NFD and Mac OS's HFS Plus." (interactive "r") (normalize-decomposition-region from to 'normalize-hfs-nfd-table)) ;;;###autoload (defun normalize-HFS-NFD-string (str) "Normalize the string STR by the Unicode NFD and Mac OS's HFS Plus." (with-temp-buffer (insert str) (normalize-HFS-NFD-region (point-min) (point-max)) (buffer-string))) ;;;###autoload (defun normalize-NFKC-region (from to) "Normalize the current region by the Unicode NFKC." (interactive "r") (normalize-composition-region from to normalize-nfc-search-regexp 'normalize-nfd-table normalize-nfc-translation-chars t)) ;;;###autoload (defun normalize-NFKC-string (str) "Normalize the string STR by the Unicode NFKC." (with-temp-buffer (insert str) (normalize-NFKC-region (point-min) (point-max)) (buffer-string))) ;;;###autoload (defun normalize-NFKD-region (from to) "Normalize the current region by the Unicode NFKD." (interactive "r") (normalize-decomposition-region from to 'normalize-nfkd-table)) ;;;###autoload (defun normalize-NFKD-string (str) "Normalize the string STR by the Unicode NFKD." (with-temp-buffer (insert str) (normalize-NFKD-region (point-min) (point-max)) (buffer-string))) ;; Post-read-conversion function for `utf-8-hfs'. (defun normalize-hfs-nfd-post-read-conversion (len) (save-excursion (save-restriction (narrow-to-region (point) (+ (point) len)) (let ((buffer-modified-p (buffer-modified-p))) (normalize-hfs-NFC-region (point-min) (point-max)) (- (point-max) (point-min)))))) ;;; coding-system definition (define-coding-system 'utf-8-hfs "UTF-8 base coding system with normalization on decoding. The characters in HFS normalization exclusion area are not normalized. On encoding, don't perform normalization." :coding-type 'utf-8 :mnemonic ?U :charset-list '(unicode) ;; NFD decoder ;; :decode-translation-table (not necessary) :post-read-conversion 'normalize-hfs-nfd-post-read-conversion ;; NFD encoder is not necessary because MacOS will automatically do it. ;; :encode-translation-table 'nfd-encode-translation-table ;; :pre-write-conversion 'nfd-encode-function ) ;;; normalize.el ends here