unofficial mirror of emacs-devel@gnu.org 
 help / color / mirror / code / Atom feed
* normalize.el
@ 2009-04-02  6:40 Kenichi Handa
  2009-04-02  8:10 ` normalize.el Kim F. Storm
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Kenichi Handa @ 2009-04-02  6:40 UTC (permalink / raw)
  To: emacs-devel; +Cc: kawabata.taichi

The attached is an Unicode normalization tool contributed by
Kawabata-san.  It performs all the Unicode normalization
NFC/NFD/NFKD/NFKC, and provides a coding system utf-8-hfs
that is suitable to be used for Mac OS 8.1's file names.

The assignment paper from Kawabata-san is already arrived at
FSF.

Perhaps committing it to the trunk now is not good
considering that we are already pretesting for 23.1.  What
should I do with it?  At least, I want Mac users to test it
by setting the default-file-name-coding-system to
`utf-8-hfs'.

---
Kenichi Handa
handa@m17n.org

;;; normalize.el --- Unicode normalization NFC/NFD/NFKD/NFKC

;; Copyright (C) 2009
;;   Free Software Foundation, Inc.

;; Author: Taichi Kawabata <kawabata.taichi@gmail.com>
;; Maintainer: Taichi Kawabata <kawabata.taichi@gmail.com>
;; Keywords: unicode, normalization

;; This file is part of GNU Emacs.

;; GNU Emacs is free software: you can redistribute it and/or modify
;; it under the terms of the GNU General Public License as published by
;; the Free Software Foundation, either version 3 of the License, or
;; (at your option) any later version.

;; GNU Emacs is distributed in the hope that it will be useful,
;; but WITHOUT ANY WARRANTY; without even the implied warranty of
;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
;; GNU General Public License for more details.

;; You should have received a copy of the GNU General Public License
;; along with GNU Emacs.  If not, see <http://www.gnu.org/licenses/>.

;;; Commentary:

;; The following code will normalize the string or text according to
;; the NFC/NFD/NFKD (single step) NFKC (dual step).

;; This program has passed the NormalizationTest-4.1.0.txt except
;; Hangul normalizations.

;; References:
;; http://www.unicode.org/reports/tr15/
;; http://www.unicode.org/review/pr-29.html

;; Furthermore, it supports HFS normalization, for people who don't
;; want to normalize the CJK characters.

;;; Note on HFS Normalization.
;;
;; HFS-Normalization is the way to normalize text, as described by HFS
;; Plus Volue format specification
;; (http://developer.apple.com/technotes/tn/tn1150.html).  This is
;; useful espcially for CJK people, since it doesn't normlize CJK
;; compatibility characters.
;;
;; Normalization Exclusion Area:
;;  U+02000 .. U+02FFF :: Punctuation, symbols, dingbats, arrows, etc.
;;  U+0F900 .. U+0FAFF :: CJK compatibility Ideographs.
;;  U+2F800 .. U+2FFFF :: CJK compatibility Ideographs.

;;;
;;; Implementation Notes on NFC/HFS-NFC. (ccc = canonical combining class)
;;
;; A. `re-search-forward' for the following characters.
;;   (1) characters that should be decomposed anyway.
;;   (1-a) composition-exclusion characters
;;   (1-b) singleton characters
;;   (1-c) characters whose decomposed first char does not have ccc=0.
;;   (2) characters that may appear as second of composed character.
;;   (2-a) character whose ccc != 0.
;;   (2-b) character whose ccc = 0.
;;
;; B. In case of (1) in step A, translate character, go to the
;;    beginning position of translated region, and go back to step A.
;;
;; C. In case of (2) in step A, normalize the block around the
;;    searched character.  (`normalize-block')
;;
;;    (1) start of the block
;;        In case of (2-a) in step A, if the previous character
;;        exists, then the beginning of the block is the previous
;;        character.  In case of (2-b), if the previous character's
;;        ccc = 0, then the beginning of the block is the previous
;;        character.  Otherwise, the beginning of the block is the
;;        searched character.
;;    (2) end of the block
;;        from the searched character, the end of the block is
;;        searched until the character whose ccc=0.
;;
;; D. Normalize the block
;;    (1) decompose the block
;;    (2) sort the combining characters
;;    (3) compose the block.

;;;
;;; Implementation Notes on NFD/HFS-NFD.
;;
;; A. `re-search-forward' for the combining characters sequence.
;;
;; B. Normalize the block around the searched character sequence.
;;    (1) Start of the block is the previous character if it exists.
;;        If not, start of the block is the searched character.
;;    (2) End of the block is the end of the combining character
;;        sequence.
;;
;; C. Normalize the block
;;    (1) decompose the block
;;    (2) sort the combining characters
;;
;; D. Translate all the rest at once.

;;;
;;; Implementation Notes on NFKC/NFKD
;;
;; NFKC/NFKD is realized as applying NFC/NFD to the text which is
;; firstly translated by NFKD(-unique) decomposition patterns.

;;; Code:

(defconst normalize-composition-exclusions
  '(#x0958 #x0959 #x095A #x095B #x095C #x095D #x095E #x095F
    #x09DC #x09DD #x09DF #x0A33 #x0A36 #x0A59 #x0A5A #x0A5B
    #x0A5E #x0B5C #x0B5D #x0F43 #x0F4D #x0F52 #x0F57 #x0F5C
    #x0F69 #x0F76 #x0F78 #x0F93 #x0F9D #x0FA2 #x0FA7 #x0FAC
    #x0FB9 #xFB1D #xFB1F #xFB2A #xFB2B #xFB2C #xFB2D #xFB2E
    #xFB2F #xFB30 #xFB31 #xFB32 #xFB33 #xFB34 #xFB35 #xFB36
    #xFB38 #xFB39 #xFB3A #xFB3B #xFB3C #xFB3E #xFB40 #xFB41
    #xFB43 #xFB44 #xFB46 #xFB47 #xFB48 #xFB49 #xFB4A #xFB4B
    #xFB4C #xFB4D #xFB4E #x2ADC #x1D15E #x1D15F #x1D160 #x1D161
    #x1D162 #x1D163 #x1D164 #x1D1BB #x1D1BC #x1D1BD #x1D1BE
    #x1D1BF #x1D1C0)
 "Composition Exclusion Table.
The table is taken from
  http://www.unicode.org/Public/UNIDATA/CompositionExclusions-4.1.0.txt")

;; 
(eval-when-compile

  ;; generics
  (setq combining-chars nil)
  (setq decomp-pair-to-comp-hash (make-hash-table :test 'equal :size 2000)) ; 1003 in Unicode 4.1.0 (except hangul)

  ;; NFC/NFD
  (defun nfc-predicate (char) t)
  (setq nfc-search-chars nil)
  (setq nfc-singletons nil)
  (setq nfd-alist nil)

  ;; HFS-NFC/HFS-NFD
  (defun hfs-nfc-predicate (char)
    (or (and (>= char 0) (< char #x2000))
             (and (>= char #x3000) (< char #xf900))
             (and (>= char #xfb00) (< char #x2f800))
             (>= char #x30000)))
  (setq hfs-nfc-search-chars nil)
  (setq hfs-nfc-singletons nil)
  (setq hfs-nfd-alist nil)

  ;; NFKC/NFKD
  (setq nfkd-alist nil)

  ;; macros
  (defmacro register-database (predicate 
                               search-chars
                               singletons
                               decomposition-alist)
    `(when (apply #',predicate (list char))
       (let ((decomp-1st-ccc
              (get-char-code-property 
               (car decomposition) 'canonical-combining-class)))
         (setq ,decomposition-alist (cons (cons char (vconcat decomposition))
                                          ,decomposition-alist))
         (unless (memq char normalize-composition-exclusions)
           (if (= 1 (length decomposition))
               (setq ,singletons (cons char ,singletons))
             ;; add second char of decomposition for search-chars iff ccc = 0
             (if (= 0 (get-char-code-property (cadr decomposition)
                                              'canonical-combining-class))
                 (add-to-list ',search-chars (cadr decomposition))))
           ;; For fast normalization, the case of positive integer
           ;; `decomp-1st-ccc' means that the char should also be
           ;; considered as something like singleton.
           (if (and decomp-1st-ccc (/= 0 decomp-1st-ccc))
               (setq ,singletons (cons char ,singletons)))))))

  (let ((char 0) ccc decomposition)
    (while (< char #x30000) 
      (setq ccc (get-char-code-property
                 char 'canonical-combining-class))
      (setq decomposition (get-char-code-property 
                           char 'decomposition))
      ;; add all combining-char to fast search list.
      (if (and ccc (/= 0 ccc)) (add-to-list 'combining-chars char))
      (when decomposition
        (if (symbolp (car decomposition))
            ;; compatibiliy
            (setq nfkd-alist (cons (cons char (vconcat (cdr decomposition)))
                                   nfkd-alist))
          ;; canonical
          (setq nfkd-alist (cons (cons char (vconcat decomposition))
                                 nfkd-alist))
          (if (and (= 2 (length decomposition))
                   (null (memq char normalize-composition-exclusions)))
              (puthash decomposition char decomp-pair-to-comp-hash))
          ;; NFC/NFD
          (register-database nfc-predicate
                             nfc-search-chars
                             nfc-singletons
                             nfd-alist)
          ;; HFS-NFC/HFS-NFD
          (register-database hfs-nfc-predicate
                             hfs-nfc-search-chars
                             hfs-nfc-singletons
                             hfs-nfd-alist)))
      (setq char (1+ char))))

  (defun normalize-repetitively-expand-alist (alist)
    (let ((flag t) elem hit)
      (while flag
        (setq flag nil)
        (dolist (entry alist)
          (setq elem (mapcar (lambda (x) 
                               (setq hit (assoc x alist))
                               (if hit (setq flag t))
                               (if hit (mapcar 'identity (cdr hit)) (list x)))
                             (cdr entry)))
          (if flag (setcdr entry (vconcat (apply 'append elem))))))))

  (normalize-repetitively-expand-alist nfd-alist)
  (normalize-repetitively-expand-alist hfs-nfd-alist)
  (normalize-repetitively-expand-alist nfkd-alist)

)


(defvar normalize-combining-chars-regexp nil
  "Regular expression to match sequence of combining characters.")
  (setq normalize-combining-chars-regexp
  (eval-when-compile (concat (regexp-opt (mapcar 'char-to-string combining-chars)) "+")))

(defvar normalize-decomp-pair-to-comp-hash nil
  "Hashtable of decomposed pair to primary composite.")
  (setq normalize-decomp-pair-to-comp-hash
  (eval-when-compile decomp-pair-to-comp-hash))

;; NFC/NFD
(defvar normalize-nfc-search-regexp nil)
  (setq normalize-nfc-search-regexp
  (eval-when-compile (regexp-opt 
                      (append (mapcar 'char-to-string nfc-search-chars)
                              (mapcar 'char-to-string combining-chars)
                              (mapcar 'char-to-string normalize-composition-exclusions)
                              (mapcar 'char-to-string nfc-singletons)))))
(defvar normalize-nfc-translation-chars nil)
  (setq normalize-nfc-translation-chars
  (eval-when-compile (append normalize-composition-exclusions nfc-singletons)))
(define-translation-table 'normalize-nfd-table 
  (eval-when-compile (make-translation-table-from-alist nfd-alist)))

;; HFS-NFC/NFD
(defvar normalize-hfs-nfc-search-regexp nil)
  (setq normalize-hfs-nfc-search-regexp
  (eval-when-compile (regexp-opt
                      (append (mapcar 'char-to-string hfs-nfc-search-chars)
                              (mapcar 'char-to-string combining-chars)
                              (mapcar 'char-to-string normalize-composition-exclusions)
                              (mapcar 'char-to-string hfs-nfc-singletons)))))
(defvar normalize-hfs-nfc-translation-chars nil)
  (setq normalize-hfs-nfc-translation-chars
  (eval-when-compile (append normalize-composition-exclusions hfs-nfc-singletons)))
(define-translation-table 'normalize-hfs-nfd-table
  (eval-when-compile (make-translation-table-from-alist hfs-nfd-alist)))

;; NFKC/NFKD
(define-translation-table 'normalize-nfkd-table
  (eval-when-compile (make-translation-table-from-alist nfkd-alist)))

;;------------------------------------------------------------------------------------------

;; Normalize local region.
(defun normalize-block (from to &optional translation-table compose)
  ;; block-string = [starter?  diacritics+]
  (save-restriction
    (narrow-to-region from to)
    (goto-char (point-min))
    ;; decompose if needed.
    (if translation-table
        (translate-region from to translation-table))
    (let ((starter (char-after from)) diacritics)
      (if (= (get-char-code-property starter 'canonical-combining-class) 0)
          (setq diacritics (string-to-list (buffer-substring (1+ from) (point-max))))
        (setq starter nil)
        (setq diacritics (string-to-list (buffer-substring from (point-max)))))
      ;; sort diacritical marks if needed.
      (if (< 1 (length diacritics))
          (setq diacritics
                (sort diacritics
                      (lambda (ch1 ch2)
                        (< (get-char-code-property ch1 'canonical-combining-class)
                           (get-char-code-property ch2 'canonical-combining-class))))))
      ;; compose if needed (and both starter and diacritics exists).
      (if (and compose (and starter diacritics))
          (let ((diac diacritics) prev (prev-ccc 0) ccc prim-comp)
            (while diac
              (setq ccc (get-char-code-property (car diac)
                                                     'canonical-combining-class))
              (if (and (or (< prev-ccc ccc) (= ccc 0)) ;; e.g. #x09c7 #x09be -> #x09cb
                       (setq prim-comp (gethash (list starter (car diac))
                                                normalize-decomp-pair-to-comp-hash)))
                  (progn
                    (setq prev-ccc 0)
                    (setq starter prim-comp)
                    (if prev (setcdr prev (cdr diac)) (setq diacritics (cdr diacritics)))
                    (setq diac diacritics))
                (setq prev-ccc ccc
                      prev diac
                      diac (cdr diac))))))
      ;; replace buffer
      (delete-region (point-min) (point-max))
      (insert (concat (if starter (list starter)) diacritics)))
    (point-max))) ;; return the last position.

(defun normalize-composition-region (from to regexp translation-table translation-chars 
                                          &optional compat)
  (save-excursion
    (save-restriction
      (narrow-to-region from to)
      (if compat (translate-region from to 'normalize-nfkd-table))
      (goto-char (point-min))
      (let (start-pos starter)
        (while (re-search-forward regexp nil t)
          (setq starter (string-to-char (match-string 0)))
          (setq start-pos (match-beginning 0))
          (if (memq starter translation-chars)
              (progn
                (translate-region start-pos (match-end 0) translation-table)
                (goto-char start-pos))
            (goto-char
             (normalize-block 
              ;; from
              (if (or (= start-pos (point-min))
                      (and (= 0 (get-char-code-property starter 'canonical-combining-class))
                           (/= 0 (get-char-code-property (char-after (1- start-pos))
                                                        'canonical-combining-class))))
                  start-pos (1- start-pos))
              ;; to
              (if (looking-at normalize-combining-chars-regexp)
                  (match-end 0) (1+ start-pos))
              'normalize-nfd-table t))))))))

(defun normalize-decomposition-region (from to translation-table)
  (save-excursion
    (save-restriction
      (narrow-to-region from to)
      (goto-char (point-min))
      (let (start-pos starter)
        (while (re-search-forward normalize-combining-chars-regexp nil t)
          (setq starter (string-to-char (match-string 0)))
          (setq start-pos (match-beginning 0))
          (goto-char
           (normalize-block
            (if (= start-pos (point-min)) start-pos (1- start-pos))
            (match-end 0)
            translation-table nil))))
      (translate-region (point-min) (point-max) translation-table))))

;; --------------------------------------------------------------------------------

;;;###autoload
(defun normalize-NFC-region (from to)
  "Normalize the current region by the Unicode NFC."
  (interactive "r")
  (normalize-composition-region from to 
                                normalize-nfc-search-regexp
                                'normalize-nfd-table
                                normalize-nfc-translation-chars))
;;;###autoload
(defun normalize-NFC-string (str)
  "Normalize the string STR by the Unicode NFC."
  (with-temp-buffer
    (insert str)
    (normalize-NFC-region (point-min) (point-max))
    (buffer-string)))
;;;###autoload
(defun normalize-NFD-region (from to)
  "Normalize the current region by the Unicode NFD."
  (interactive "r")
  (normalize-decomposition-region from to 'normalize-nfd-table))
;;;###autoload
(defun normalize-NFD-string (str)
  "Normalize the string STR by the Unicode NFD."
  (with-temp-buffer
    (insert str)
    (normalize-NFD-region (point-min) (point-max))
    (buffer-string)))


;;;###autoload
(defun normalize-HFS-NFC-region (from to)
  "Normalize the current region by the Unicode NFC and Mac OS's HFS Plus."
  (interactive "r")
  (normalize-composition-region from to
                                normalize-hfs-nfc-search-regexp
                                'normalize-hfs-nfd-table
                                normalize-hfs-nfc-translation-chars))
;;;###autoload
(defun normalize-HFS-NFC-string (str)
  "Normalize the string STR by the Unicode NFC and Mac OS's HFS Plus."
  (with-temp-buffer
    (insert str)
    (normalize-HFS-NFC-region (point-min) (point-max))
    (buffer-string)))
;;;###autoload
(defun normalize-HFS-NFD-region (from to)
  "Normalize the current region by the Unicode NFD and Mac OS's HFS Plus."
  (interactive "r")
  (normalize-decomposition-region from to 'normalize-hfs-nfd-table))
;;;###autoload
(defun normalize-HFS-NFD-string (str)
  "Normalize the string STR by the Unicode NFD and Mac OS's HFS Plus."
  (with-temp-buffer
    (insert str)
    (normalize-HFS-NFD-region (point-min) (point-max))
    (buffer-string)))


;;;###autoload
(defun normalize-NFKC-region (from to)
  "Normalize the current region by the Unicode NFKC."
  (interactive "r")
  (normalize-composition-region from to
                                normalize-nfc-search-regexp
                                'normalize-nfd-table
                                normalize-nfc-translation-chars t))
;;;###autoload
(defun normalize-NFKC-string (str)
  "Normalize the string STR by the Unicode NFKC."
  (with-temp-buffer
    (insert str)
    (normalize-NFKC-region (point-min) (point-max))
    (buffer-string)))
;;;###autoload
(defun normalize-NFKD-region (from to)
  "Normalize the current region by the Unicode NFKD."
  (interactive "r")
  (normalize-decomposition-region from to 'normalize-nfkd-table))
;;;###autoload
(defun normalize-NFKD-string (str)
  "Normalize the string STR by the Unicode NFKD."
  (with-temp-buffer
    (insert str)
    (normalize-NFKD-region (point-min) (point-max))
    (buffer-string)))

;; Post-read-conversion function for `utf-8-hfs'.
(defun normalize-hfs-nfd-post-read-conversion (len)
  (save-excursion
    (save-restriction
      (narrow-to-region (point) (+ (point) len))
      (let ((buffer-modified-p (buffer-modified-p)))
        (normalize-hfs-NFC-region (point-min) (point-max))
        (- (point-max) (point-min))))))

;;; coding-system definition
(define-coding-system 'utf-8-hfs
  "UTF-8 base coding system with normalization on decoding.
The characters in HFS normalization exclusion area are not normalized.
On encoding, don't perform normalization."
  :coding-type 'utf-8
  :mnemonic ?U
  :charset-list '(unicode)
  ;; NFD decoder
  ;; :decode-translation-table (not necessary)
  :post-read-conversion 'normalize-hfs-nfd-post-read-conversion
  ;; NFD encoder is not necessary because MacOS will automatically do it.
  ;; :encode-translation-table 'nfd-encode-translation-table
  ;; :pre-write-conversion 'nfd-encode-function
  )

;;; normalize.el ends here




^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2009-04-21 17:10 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-04-02  6:40 normalize.el Kenichi Handa
2009-04-02  8:10 ` normalize.el Kim F. Storm
2009-04-02 11:54   ` normalize.el Kenichi Handa
2009-04-02 16:49 ` normalize.el Stefan Monnier
2009-04-08  5:39   ` normalize.el Kenichi Handa
2009-04-18 19:06     ` normalize.el 川幡 太一
2009-04-19  3:39       ` normalize.el 川幡 太一
2009-04-21  2:15         ` normalize.el Kenichi Handa
2009-04-21 17:10           ` normalize.el 川幡 太一
2009-04-03  9:10 ` normalize.el Richard M Stallman
2009-04-03 13:24   ` normalize.el Stefan Monnier

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).