From: Kenichi Handa <handa@m17n.org>
To: emacs-devel@gnu.org
Cc: kawabata.taichi@gmail.com
Subject: normalize.el
Date: Thu, 02 Apr 2009 15:40:44 +0900 [thread overview]
Message-ID: <E1LpGbU-0001ub-65@etlken> (raw)
The attached is an Unicode normalization tool contributed by
Kawabata-san. It performs all the Unicode normalization
NFC/NFD/NFKD/NFKC, and provides a coding system utf-8-hfs
that is suitable to be used for Mac OS 8.1's file names.
The assignment paper from Kawabata-san is already arrived at
FSF.
Perhaps committing it to the trunk now is not good
considering that we are already pretesting for 23.1. What
should I do with it? At least, I want Mac users to test it
by setting the default-file-name-coding-system to
`utf-8-hfs'.
---
Kenichi Handa
handa@m17n.org
;;; normalize.el --- Unicode normalization NFC/NFD/NFKD/NFKC
;; Copyright (C) 2009
;; Free Software Foundation, Inc.
;; Author: Taichi Kawabata <kawabata.taichi@gmail.com>
;; Maintainer: Taichi Kawabata <kawabata.taichi@gmail.com>
;; Keywords: unicode, normalization
;; This file is part of GNU Emacs.
;; GNU Emacs is free software: you can redistribute it and/or modify
;; it under the terms of the GNU General Public License as published by
;; the Free Software Foundation, either version 3 of the License, or
;; (at your option) any later version.
;; GNU Emacs is distributed in the hope that it will be useful,
;; but WITHOUT ANY WARRANTY; without even the implied warranty of
;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
;; GNU General Public License for more details.
;; You should have received a copy of the GNU General Public License
;; along with GNU Emacs. If not, see <http://www.gnu.org/licenses/>.
;;; Commentary:
;; The following code will normalize the string or text according to
;; the NFC/NFD/NFKD (single step) NFKC (dual step).
;; This program has passed the NormalizationTest-4.1.0.txt except
;; Hangul normalizations.
;; References:
;; http://www.unicode.org/reports/tr15/
;; http://www.unicode.org/review/pr-29.html
;; Furthermore, it supports HFS normalization, for people who don't
;; want to normalize the CJK characters.
;;; Note on HFS Normalization.
;;
;; HFS-Normalization is the way to normalize text, as described by HFS
;; Plus Volue format specification
;; (http://developer.apple.com/technotes/tn/tn1150.html). This is
;; useful espcially for CJK people, since it doesn't normlize CJK
;; compatibility characters.
;;
;; Normalization Exclusion Area:
;; U+02000 .. U+02FFF :: Punctuation, symbols, dingbats, arrows, etc.
;; U+0F900 .. U+0FAFF :: CJK compatibility Ideographs.
;; U+2F800 .. U+2FFFF :: CJK compatibility Ideographs.
;;;
;;; Implementation Notes on NFC/HFS-NFC. (ccc = canonical combining class)
;;
;; A. `re-search-forward' for the following characters.
;; (1) characters that should be decomposed anyway.
;; (1-a) composition-exclusion characters
;; (1-b) singleton characters
;; (1-c) characters whose decomposed first char does not have ccc=0.
;; (2) characters that may appear as second of composed character.
;; (2-a) character whose ccc != 0.
;; (2-b) character whose ccc = 0.
;;
;; B. In case of (1) in step A, translate character, go to the
;; beginning position of translated region, and go back to step A.
;;
;; C. In case of (2) in step A, normalize the block around the
;; searched character. (`normalize-block')
;;
;; (1) start of the block
;; In case of (2-a) in step A, if the previous character
;; exists, then the beginning of the block is the previous
;; character. In case of (2-b), if the previous character's
;; ccc = 0, then the beginning of the block is the previous
;; character. Otherwise, the beginning of the block is the
;; searched character.
;; (2) end of the block
;; from the searched character, the end of the block is
;; searched until the character whose ccc=0.
;;
;; D. Normalize the block
;; (1) decompose the block
;; (2) sort the combining characters
;; (3) compose the block.
;;;
;;; Implementation Notes on NFD/HFS-NFD.
;;
;; A. `re-search-forward' for the combining characters sequence.
;;
;; B. Normalize the block around the searched character sequence.
;; (1) Start of the block is the previous character if it exists.
;; If not, start of the block is the searched character.
;; (2) End of the block is the end of the combining character
;; sequence.
;;
;; C. Normalize the block
;; (1) decompose the block
;; (2) sort the combining characters
;;
;; D. Translate all the rest at once.
;;;
;;; Implementation Notes on NFKC/NFKD
;;
;; NFKC/NFKD is realized as applying NFC/NFD to the text which is
;; firstly translated by NFKD(-unique) decomposition patterns.
;;; Code:
(defconst normalize-composition-exclusions
'(#x0958 #x0959 #x095A #x095B #x095C #x095D #x095E #x095F
#x09DC #x09DD #x09DF #x0A33 #x0A36 #x0A59 #x0A5A #x0A5B
#x0A5E #x0B5C #x0B5D #x0F43 #x0F4D #x0F52 #x0F57 #x0F5C
#x0F69 #x0F76 #x0F78 #x0F93 #x0F9D #x0FA2 #x0FA7 #x0FAC
#x0FB9 #xFB1D #xFB1F #xFB2A #xFB2B #xFB2C #xFB2D #xFB2E
#xFB2F #xFB30 #xFB31 #xFB32 #xFB33 #xFB34 #xFB35 #xFB36
#xFB38 #xFB39 #xFB3A #xFB3B #xFB3C #xFB3E #xFB40 #xFB41
#xFB43 #xFB44 #xFB46 #xFB47 #xFB48 #xFB49 #xFB4A #xFB4B
#xFB4C #xFB4D #xFB4E #x2ADC #x1D15E #x1D15F #x1D160 #x1D161
#x1D162 #x1D163 #x1D164 #x1D1BB #x1D1BC #x1D1BD #x1D1BE
#x1D1BF #x1D1C0)
"Composition Exclusion Table.
The table is taken from
http://www.unicode.org/Public/UNIDATA/CompositionExclusions-4.1.0.txt")
;;
(eval-when-compile
;; generics
(setq combining-chars nil)
(setq decomp-pair-to-comp-hash (make-hash-table :test 'equal :size 2000)) ; 1003 in Unicode 4.1.0 (except hangul)
;; NFC/NFD
(defun nfc-predicate (char) t)
(setq nfc-search-chars nil)
(setq nfc-singletons nil)
(setq nfd-alist nil)
;; HFS-NFC/HFS-NFD
(defun hfs-nfc-predicate (char)
(or (and (>= char 0) (< char #x2000))
(and (>= char #x3000) (< char #xf900))
(and (>= char #xfb00) (< char #x2f800))
(>= char #x30000)))
(setq hfs-nfc-search-chars nil)
(setq hfs-nfc-singletons nil)
(setq hfs-nfd-alist nil)
;; NFKC/NFKD
(setq nfkd-alist nil)
;; macros
(defmacro register-database (predicate
search-chars
singletons
decomposition-alist)
`(when (apply #',predicate (list char))
(let ((decomp-1st-ccc
(get-char-code-property
(car decomposition) 'canonical-combining-class)))
(setq ,decomposition-alist (cons (cons char (vconcat decomposition))
,decomposition-alist))
(unless (memq char normalize-composition-exclusions)
(if (= 1 (length decomposition))
(setq ,singletons (cons char ,singletons))
;; add second char of decomposition for search-chars iff ccc = 0
(if (= 0 (get-char-code-property (cadr decomposition)
'canonical-combining-class))
(add-to-list ',search-chars (cadr decomposition))))
;; For fast normalization, the case of positive integer
;; `decomp-1st-ccc' means that the char should also be
;; considered as something like singleton.
(if (and decomp-1st-ccc (/= 0 decomp-1st-ccc))
(setq ,singletons (cons char ,singletons)))))))
(let ((char 0) ccc decomposition)
(while (< char #x30000)
(setq ccc (get-char-code-property
char 'canonical-combining-class))
(setq decomposition (get-char-code-property
char 'decomposition))
;; add all combining-char to fast search list.
(if (and ccc (/= 0 ccc)) (add-to-list 'combining-chars char))
(when decomposition
(if (symbolp (car decomposition))
;; compatibiliy
(setq nfkd-alist (cons (cons char (vconcat (cdr decomposition)))
nfkd-alist))
;; canonical
(setq nfkd-alist (cons (cons char (vconcat decomposition))
nfkd-alist))
(if (and (= 2 (length decomposition))
(null (memq char normalize-composition-exclusions)))
(puthash decomposition char decomp-pair-to-comp-hash))
;; NFC/NFD
(register-database nfc-predicate
nfc-search-chars
nfc-singletons
nfd-alist)
;; HFS-NFC/HFS-NFD
(register-database hfs-nfc-predicate
hfs-nfc-search-chars
hfs-nfc-singletons
hfs-nfd-alist)))
(setq char (1+ char))))
(defun normalize-repetitively-expand-alist (alist)
(let ((flag t) elem hit)
(while flag
(setq flag nil)
(dolist (entry alist)
(setq elem (mapcar (lambda (x)
(setq hit (assoc x alist))
(if hit (setq flag t))
(if hit (mapcar 'identity (cdr hit)) (list x)))
(cdr entry)))
(if flag (setcdr entry (vconcat (apply 'append elem))))))))
(normalize-repetitively-expand-alist nfd-alist)
(normalize-repetitively-expand-alist hfs-nfd-alist)
(normalize-repetitively-expand-alist nfkd-alist)
)
(defvar normalize-combining-chars-regexp nil
"Regular expression to match sequence of combining characters.")
(setq normalize-combining-chars-regexp
(eval-when-compile (concat (regexp-opt (mapcar 'char-to-string combining-chars)) "+")))
(defvar normalize-decomp-pair-to-comp-hash nil
"Hashtable of decomposed pair to primary composite.")
(setq normalize-decomp-pair-to-comp-hash
(eval-when-compile decomp-pair-to-comp-hash))
;; NFC/NFD
(defvar normalize-nfc-search-regexp nil)
(setq normalize-nfc-search-regexp
(eval-when-compile (regexp-opt
(append (mapcar 'char-to-string nfc-search-chars)
(mapcar 'char-to-string combining-chars)
(mapcar 'char-to-string normalize-composition-exclusions)
(mapcar 'char-to-string nfc-singletons)))))
(defvar normalize-nfc-translation-chars nil)
(setq normalize-nfc-translation-chars
(eval-when-compile (append normalize-composition-exclusions nfc-singletons)))
(define-translation-table 'normalize-nfd-table
(eval-when-compile (make-translation-table-from-alist nfd-alist)))
;; HFS-NFC/NFD
(defvar normalize-hfs-nfc-search-regexp nil)
(setq normalize-hfs-nfc-search-regexp
(eval-when-compile (regexp-opt
(append (mapcar 'char-to-string hfs-nfc-search-chars)
(mapcar 'char-to-string combining-chars)
(mapcar 'char-to-string normalize-composition-exclusions)
(mapcar 'char-to-string hfs-nfc-singletons)))))
(defvar normalize-hfs-nfc-translation-chars nil)
(setq normalize-hfs-nfc-translation-chars
(eval-when-compile (append normalize-composition-exclusions hfs-nfc-singletons)))
(define-translation-table 'normalize-hfs-nfd-table
(eval-when-compile (make-translation-table-from-alist hfs-nfd-alist)))
;; NFKC/NFKD
(define-translation-table 'normalize-nfkd-table
(eval-when-compile (make-translation-table-from-alist nfkd-alist)))
;;------------------------------------------------------------------------------------------
;; Normalize local region.
(defun normalize-block (from to &optional translation-table compose)
;; block-string = [starter? diacritics+]
(save-restriction
(narrow-to-region from to)
(goto-char (point-min))
;; decompose if needed.
(if translation-table
(translate-region from to translation-table))
(let ((starter (char-after from)) diacritics)
(if (= (get-char-code-property starter 'canonical-combining-class) 0)
(setq diacritics (string-to-list (buffer-substring (1+ from) (point-max))))
(setq starter nil)
(setq diacritics (string-to-list (buffer-substring from (point-max)))))
;; sort diacritical marks if needed.
(if (< 1 (length diacritics))
(setq diacritics
(sort diacritics
(lambda (ch1 ch2)
(< (get-char-code-property ch1 'canonical-combining-class)
(get-char-code-property ch2 'canonical-combining-class))))))
;; compose if needed (and both starter and diacritics exists).
(if (and compose (and starter diacritics))
(let ((diac diacritics) prev (prev-ccc 0) ccc prim-comp)
(while diac
(setq ccc (get-char-code-property (car diac)
'canonical-combining-class))
(if (and (or (< prev-ccc ccc) (= ccc 0)) ;; e.g. #x09c7 #x09be -> #x09cb
(setq prim-comp (gethash (list starter (car diac))
normalize-decomp-pair-to-comp-hash)))
(progn
(setq prev-ccc 0)
(setq starter prim-comp)
(if prev (setcdr prev (cdr diac)) (setq diacritics (cdr diacritics)))
(setq diac diacritics))
(setq prev-ccc ccc
prev diac
diac (cdr diac))))))
;; replace buffer
(delete-region (point-min) (point-max))
(insert (concat (if starter (list starter)) diacritics)))
(point-max))) ;; return the last position.
(defun normalize-composition-region (from to regexp translation-table translation-chars
&optional compat)
(save-excursion
(save-restriction
(narrow-to-region from to)
(if compat (translate-region from to 'normalize-nfkd-table))
(goto-char (point-min))
(let (start-pos starter)
(while (re-search-forward regexp nil t)
(setq starter (string-to-char (match-string 0)))
(setq start-pos (match-beginning 0))
(if (memq starter translation-chars)
(progn
(translate-region start-pos (match-end 0) translation-table)
(goto-char start-pos))
(goto-char
(normalize-block
;; from
(if (or (= start-pos (point-min))
(and (= 0 (get-char-code-property starter 'canonical-combining-class))
(/= 0 (get-char-code-property (char-after (1- start-pos))
'canonical-combining-class))))
start-pos (1- start-pos))
;; to
(if (looking-at normalize-combining-chars-regexp)
(match-end 0) (1+ start-pos))
'normalize-nfd-table t))))))))
(defun normalize-decomposition-region (from to translation-table)
(save-excursion
(save-restriction
(narrow-to-region from to)
(goto-char (point-min))
(let (start-pos starter)
(while (re-search-forward normalize-combining-chars-regexp nil t)
(setq starter (string-to-char (match-string 0)))
(setq start-pos (match-beginning 0))
(goto-char
(normalize-block
(if (= start-pos (point-min)) start-pos (1- start-pos))
(match-end 0)
translation-table nil))))
(translate-region (point-min) (point-max) translation-table))))
;; --------------------------------------------------------------------------------
;;;###autoload
(defun normalize-NFC-region (from to)
"Normalize the current region by the Unicode NFC."
(interactive "r")
(normalize-composition-region from to
normalize-nfc-search-regexp
'normalize-nfd-table
normalize-nfc-translation-chars))
;;;###autoload
(defun normalize-NFC-string (str)
"Normalize the string STR by the Unicode NFC."
(with-temp-buffer
(insert str)
(normalize-NFC-region (point-min) (point-max))
(buffer-string)))
;;;###autoload
(defun normalize-NFD-region (from to)
"Normalize the current region by the Unicode NFD."
(interactive "r")
(normalize-decomposition-region from to 'normalize-nfd-table))
;;;###autoload
(defun normalize-NFD-string (str)
"Normalize the string STR by the Unicode NFD."
(with-temp-buffer
(insert str)
(normalize-NFD-region (point-min) (point-max))
(buffer-string)))
;;;###autoload
(defun normalize-HFS-NFC-region (from to)
"Normalize the current region by the Unicode NFC and Mac OS's HFS Plus."
(interactive "r")
(normalize-composition-region from to
normalize-hfs-nfc-search-regexp
'normalize-hfs-nfd-table
normalize-hfs-nfc-translation-chars))
;;;###autoload
(defun normalize-HFS-NFC-string (str)
"Normalize the string STR by the Unicode NFC and Mac OS's HFS Plus."
(with-temp-buffer
(insert str)
(normalize-HFS-NFC-region (point-min) (point-max))
(buffer-string)))
;;;###autoload
(defun normalize-HFS-NFD-region (from to)
"Normalize the current region by the Unicode NFD and Mac OS's HFS Plus."
(interactive "r")
(normalize-decomposition-region from to 'normalize-hfs-nfd-table))
;;;###autoload
(defun normalize-HFS-NFD-string (str)
"Normalize the string STR by the Unicode NFD and Mac OS's HFS Plus."
(with-temp-buffer
(insert str)
(normalize-HFS-NFD-region (point-min) (point-max))
(buffer-string)))
;;;###autoload
(defun normalize-NFKC-region (from to)
"Normalize the current region by the Unicode NFKC."
(interactive "r")
(normalize-composition-region from to
normalize-nfc-search-regexp
'normalize-nfd-table
normalize-nfc-translation-chars t))
;;;###autoload
(defun normalize-NFKC-string (str)
"Normalize the string STR by the Unicode NFKC."
(with-temp-buffer
(insert str)
(normalize-NFKC-region (point-min) (point-max))
(buffer-string)))
;;;###autoload
(defun normalize-NFKD-region (from to)
"Normalize the current region by the Unicode NFKD."
(interactive "r")
(normalize-decomposition-region from to 'normalize-nfkd-table))
;;;###autoload
(defun normalize-NFKD-string (str)
"Normalize the string STR by the Unicode NFKD."
(with-temp-buffer
(insert str)
(normalize-NFKD-region (point-min) (point-max))
(buffer-string)))
;; Post-read-conversion function for `utf-8-hfs'.
(defun normalize-hfs-nfd-post-read-conversion (len)
(save-excursion
(save-restriction
(narrow-to-region (point) (+ (point) len))
(let ((buffer-modified-p (buffer-modified-p)))
(normalize-hfs-NFC-region (point-min) (point-max))
(- (point-max) (point-min))))))
;;; coding-system definition
(define-coding-system 'utf-8-hfs
"UTF-8 base coding system with normalization on decoding.
The characters in HFS normalization exclusion area are not normalized.
On encoding, don't perform normalization."
:coding-type 'utf-8
:mnemonic ?U
:charset-list '(unicode)
;; NFD decoder
;; :decode-translation-table (not necessary)
:post-read-conversion 'normalize-hfs-nfd-post-read-conversion
;; NFD encoder is not necessary because MacOS will automatically do it.
;; :encode-translation-table 'nfd-encode-translation-table
;; :pre-write-conversion 'nfd-encode-function
)
;;; normalize.el ends here
next reply other threads:[~2009-04-02 6:40 UTC|newest]
Thread overview: 11+ messages / expand[flat|nested] mbox.gz Atom feed top
2009-04-02 6:40 Kenichi Handa [this message]
2009-04-02 8:10 ` normalize.el Kim F. Storm
2009-04-02 11:54 ` normalize.el Kenichi Handa
2009-04-02 16:49 ` normalize.el Stefan Monnier
2009-04-08 5:39 ` normalize.el Kenichi Handa
2009-04-18 19:06 ` normalize.el 川幡 太一
2009-04-19 3:39 ` normalize.el 川幡 太一
2009-04-21 2:15 ` normalize.el Kenichi Handa
2009-04-21 17:10 ` normalize.el 川幡 太一
2009-04-03 9:10 ` normalize.el Richard M Stallman
2009-04-03 13:24 ` normalize.el Stefan Monnier
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
List information: https://www.gnu.org/software/emacs/
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=E1LpGbU-0001ub-65@etlken \
--to=handa@m17n.org \
--cc=emacs-devel@gnu.org \
--cc=kawabata.taichi@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://git.savannah.gnu.org/cgit/emacs.git
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).