From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!.POSTED!not-for-mail From: Michal Nazarewicz Newsgroups: gmane.emacs.bugs Subject: bug#24603: [RFC 02/18] Generate upcase and downcase tables from Unicode data Date: Tue, 4 Oct 2016 03:10:25 +0200 Message-ID: <1475543441-10493-2-git-send-email-mina86@mina86.com> References: <1475543441-10493-1-git-send-email-mina86@mina86.com> NNTP-Posting-Host: blaine.gmane.org Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-Trace: blaine.gmane.org 1475543705 449 195.159.176.226 (4 Oct 2016 01:15:05 GMT) X-Complaints-To: usenet@blaine.gmane.org NNTP-Posting-Date: Tue, 4 Oct 2016 01:15:05 +0000 (UTC) To: 24603@debbugs.gnu.org Original-X-From: bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane.org@gnu.org Tue Oct 04 03:15:01 2016 Return-path: Envelope-to: geb-bug-gnu-emacs@m.gmane.org Original-Received: from lists.gnu.org ([208.118.235.17]) by blaine.gmane.org with esmtp (Exim 4.84_2) (envelope-from ) id 1brEJw-000848-CE for geb-bug-gnu-emacs@m.gmane.org; Tue, 04 Oct 2016 03:15:00 +0200 Original-Received: from localhost ([::1]:39729 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1brEJu-0000FZ-S3 for geb-bug-gnu-emacs@m.gmane.org; Mon, 03 Oct 2016 21:14:58 -0400 Original-Received: from eggs.gnu.org ([2001:4830:134:3::10]:56515) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1brEHD-0006qF-SB for bug-gnu-emacs@gnu.org; Mon, 03 Oct 2016 21:12:15 -0400 Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1brEH5-0002Qv-OM for bug-gnu-emacs@gnu.org; Mon, 03 Oct 2016 21:12:08 -0400 Original-Received: from debbugs.gnu.org ([208.118.235.43]:37364) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1brEH5-0002Qm-K4 for bug-gnu-emacs@gnu.org; Mon, 03 Oct 2016 21:12:03 -0400 Original-Received: from Debian-debbugs by debbugs.gnu.org with local (Exim 4.84_2) (envelope-from ) id 1brEH5-0006jO-GC for bug-gnu-emacs@gnu.org; Mon, 03 Oct 2016 21:12:03 -0400 X-Loop: help-debbugs@gnu.org Resent-From: Michal Nazarewicz Original-Sender: "Debbugs-submit" Resent-CC: bug-gnu-emacs@gnu.org Resent-Date: Tue, 04 Oct 2016 01:12:03 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 24603 X-GNU-PR-Package: emacs X-GNU-PR-Keywords: Original-Received: via spool by 24603-submit@debbugs.gnu.org id=B24603.147554346725636 (code B ref 24603); Tue, 04 Oct 2016 01:12:03 +0000 Original-Received: (at 24603) by debbugs.gnu.org; 4 Oct 2016 01:11:07 +0000 Original-Received: from localhost ([127.0.0.1]:43524 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1brEG9-0006f7-Jv for submit@debbugs.gnu.org; Mon, 03 Oct 2016 21:11:07 -0400 Original-Received: from mail-wm0-f54.google.com ([74.125.82.54]:36266) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1brEG4-0006cT-At for 24603@debbugs.gnu.org; Mon, 03 Oct 2016 21:11:01 -0400 Original-Received: by mail-wm0-f54.google.com with SMTP id k125so178182001wma.1 for <24603@debbugs.gnu.org>; Mon, 03 Oct 2016 18:11:00 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20120113; h=sender:from:to:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=HGkhygXWN31F0Beqz9lh7bPu1hZ5Iq+M1O0VC55cv9Y=; b=dfwgt6lHWeb4mgHjNFnhAeMcYh2yB6FdgsMWAfB2vVCPvJEUsj4R1hpWNOQqhCi+GV 8RL2VcSiyT8Pr9sG+GT3iaoFyFHZz/5yNCq7WXsIrd7k/JVigEyuQM8EKtgLZlEYzObr 6G1h1PwQdNNCaEQEIvzqfQxM02EgB8du9rZ9SC++pZwYU0s4EeJGFa4B5Wd+DP055k9H /dfgJyNAO6mbwmgqgGKBvhX9pS0ftM89hWWmTjaw3jVVkQmasqbmU85Q/8el1bduSfcA Pj7o8qA1s60H4k6TWqgPyisBYZoLnPJ7ghLwxdQGR2DkvAAWn4PF/Fe6JD4OUq6jx6Xu UzBQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:sender:from:to:subject:date:message-id :in-reply-to:references:mime-version:content-transfer-encoding; bh=HGkhygXWN31F0Beqz9lh7bPu1hZ5Iq+M1O0VC55cv9Y=; b=mAkTC8ZL6owalrUBKHfnmL59FEE9GVP/Qqd6RKS0LnvW0XXLrIRcukTkWpNN6QSui0 PhLTrUIVSYLowxVIp0JaTJySNl5/ytuxbe99wlFBTuoiN7czrcx5YI8HAaiTqb4rrwEa laRHPQ3j/k0L0Mw3tdYb/P10i+nD2VC/sQ4iSWbbnIqUzCjgUsMQI6Tkn+hvnHhYJ8Nj y9ogG+GSPI5ixj8MkubyPggq0r2GOxoShFNqwb4Etru4t4KYjKR67EewG/IDWC8xD5Ak ye3u45k5z7TLi9/Q7EGIgKvglGz4Y4gyRV5O9EGVH/RTnXh7BNbNfk2+oJZzQr0I/TJZ u6aA== X-Gm-Message-State: AA6/9RkhzIcwbrPBRAxfcFwba+DPgpwLBM/zDzQStYpD8IoOJ2zuQtRXBdtVrXDC+TMPIBZY X-Received: by 10.194.95.105 with SMTP id dj9mr724784wjb.20.1475543454226; Mon, 03 Oct 2016 18:10:54 -0700 (PDT) Original-Received: from mpn.zrh.corp.google.com ([172.16.113.135]) by smtp.gmail.com with ESMTPSA id lf9sm742907wjb.22.2016.10.03.18.10.51 for <24603@debbugs.gnu.org> (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 03 Oct 2016 18:10:52 -0700 (PDT) Original-Received: by mpn.zrh.corp.google.com (Postfix, from userid 126942) id C11E51E0297; Tue, 4 Oct 2016 03:10:48 +0200 (CEST) X-Mailer: git-send-email 2.8.0.rc3.226.g39d4020 In-Reply-To: <1475543441-10493-1-git-send-email-mina86@mina86.com> X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic] X-Received-From: 208.118.235.43 X-BeenThere: bug-gnu-emacs@gnu.org List-Id: "Bug reports for GNU Emacs, the Swiss army knife of text editors" List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane.org@gnu.org Original-Sender: "bug-gnu-emacs" Xref: news.gmane.org gmane.emacs.bugs:123996 Archived-At: Use Unicode data to generate case tables instead of mostly repeating them in lisp code. Do that in a way which maps ‘Dz’ (and similar) digraph to ‘dz’ when down- and ‘DZ’ when upcasing. * lisp/international/characters.el: Remove case-pairs defined with explicit Lisp code and instead use Unicode character properties. * test/src/casefiddle-tests.el (casefiddle-tests--characters, casefiddle-tests-casing): Update test cases which are now working as they should. --- lisp/international/characters.el | 338 ++++++++------------------------------- test/src/casefiddle-tests.el | 7 +- 2 files changed, 66 insertions(+), 279 deletions(-) diff --git a/lisp/international/characters.el b/lisp/international/characters.el index 1757d2b..67b0149 100644 --- a/lisp/international/characters.el +++ b/lisp/international/characters.el @@ -543,10 +543,6 @@ ?L (set-case-syntax ?½ "_" tbl) (set-case-syntax ?¾ "_" tbl) (set-case-syntax ?¿ "." tbl) - (let ((c 192)) - (while (<= c 222) - (set-case-syntax-pair c (+ c 32) tbl) - (setq c (1+ c)))) (set-case-syntax ?× "_" tbl) (set-case-syntax ?ß "w" tbl) (set-case-syntax ?÷ "_" tbl) @@ -558,101 +554,8 @@ ?L (modify-category-entry c ?l) (setq c (1+ c))) - (let ((pair-ranges '((#x0100 . #x012F) - (#x0132 . #x0137) - (#x0139 . #x0148) - (#x014a . #x0177) - (#x0179 . #x017E) - (#x0182 . #x0185) - (#x0187 . #x0188) - (#x018B . #x018C) - (#x0191 . #x0192) - (#x0198 . #x0199) - (#x01A0 . #x01A5) - (#x01A7 . #x01A8) - (#x01AC . #x01AD) - (#x01AF . #x01B0) - (#x01B3 . #x01B6) - (#x01B8 . #x01B9) - (#x01BC . #x01BD) - (#x01CD . #x01DC) - (#x01DE . #x01EF) - (#x01F4 . #x01F5) - (#x01F8 . #x021F) - (#x0222 . #x0233) - (#x023B . #x023C) - (#x0241 . #x0242) - (#x0246 . #x024F)))) - (dolist (elt pair-ranges) - (let ((from (car elt)) (to (cdr elt))) - (while (< from to) - (set-case-syntax-pair from (1+ from) tbl) - (setq from (+ from 2)))))) - - (set-case-syntax-pair ?Ÿ ?ÿ tbl) - - ;; In some languages, such as Turkish, U+0049 LATIN CAPITAL LETTER I - ;; and U+0131 LATIN SMALL LETTER DOTLESS I make a case pair, and so - ;; do U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE and U+0069 LATIN - ;; SMALL LETTER I. - - ;; We used to set up half of those correspondence unconditionally, - ;; but that makes searches slow. So now we don't set up either half - ;; of these correspondences by default. - - ;; (set-downcase-syntax ?İ ?i tbl) - ;; (set-upcase-syntax ?I ?ı tbl) - - (set-case-syntax-pair ?Ɓ ?ɓ tbl) - (set-case-syntax-pair ?Ɔ ?ɔ tbl) - (set-case-syntax-pair ?Ɖ ?ɖ tbl) - (set-case-syntax-pair ?Ɗ ?ɗ tbl) - (set-case-syntax-pair ?Ǝ ?ǝ tbl) - (set-case-syntax-pair ?Ə ?ə tbl) - (set-case-syntax-pair ?Ɛ ?ɛ tbl) - (set-case-syntax-pair ?Ɠ ?ɠ tbl) - (set-case-syntax-pair ?Ɣ ?ɣ tbl) - (set-case-syntax-pair ?Ɩ ?ɩ tbl) - (set-case-syntax-pair ?Ɨ ?ɨ tbl) - (set-case-syntax-pair ?Ɯ ?ɯ tbl) - (set-case-syntax-pair ?Ɲ ?ɲ tbl) - (set-case-syntax-pair ?Ɵ ?ɵ tbl) - (set-case-syntax-pair ?Ʀ ?ʀ tbl) - (set-case-syntax-pair ?Ʃ ?ʃ tbl) - (set-case-syntax-pair ?Ʈ ?ʈ tbl) - (set-case-syntax-pair ?Ʊ ?ʊ tbl) - (set-case-syntax-pair ?Ʋ ?ʋ tbl) - (set-case-syntax-pair ?Ʒ ?ʒ tbl) - ;; We use set-downcase-syntax below, since we want upcase of dž - ;; return DŽ, not Dž, and the same for the rest. - (set-case-syntax-pair ?DŽ ?dž tbl) - (set-downcase-syntax ?Dž ?dž tbl) - (set-case-syntax-pair ?LJ ?lj tbl) - (set-downcase-syntax ?Lj ?lj tbl) - (set-case-syntax-pair ?NJ ?nj tbl) - (set-downcase-syntax ?Nj ?nj tbl) - - ;; 01F0; F; 006A 030C; # LATIN SMALL LETTER J WITH CARON - - (set-case-syntax-pair ?DZ ?dz tbl) - (set-downcase-syntax ?Dz ?dz tbl) - (set-case-syntax-pair ?Ƕ ?ƕ tbl) - (set-case-syntax-pair ?Ƿ ?ƿ tbl) - (set-case-syntax-pair ?Ⱥ ?ⱥ tbl) - (set-case-syntax-pair ?Ƚ ?ƚ tbl) - (set-case-syntax-pair ?Ⱦ ?ⱦ tbl) - (set-case-syntax-pair ?Ƀ ?ƀ tbl) - (set-case-syntax-pair ?Ʉ ?ʉ tbl) - (set-case-syntax-pair ?Ʌ ?ʌ tbl) - ;; Latin Extended Additional (modify-category-entry '(#x1e00 . #x1ef9) ?l) - (setq c #x1e00) - (while (<= c #x1ef9) - (and (zerop (% c 2)) - (or (<= c #x1e94) (>= c #x1ea0)) - (set-case-syntax-pair c (1+ c) tbl)) - (setq c (1+ c))) ;; Latin Extended-C (setq c #x2C60) @@ -660,57 +563,12 @@ ?L (modify-category-entry c ?l) (setq c (1+ c))) - (let ((pair-ranges '((#x2C60 . #x2C61) - (#x2C67 . #x2C6C) - (#x2C72 . #x2C73) - (#x2C75 . #x2C76)))) - (dolist (elt pair-ranges) - (let ((from (car elt)) (to (cdr elt))) - (while (< from to) - (set-case-syntax-pair from (1+ from) tbl) - (setq from (+ from 2)))))) - - (set-case-syntax-pair ?Ɫ ?ɫ tbl) - (set-case-syntax-pair ?Ᵽ ?ᵽ tbl) - (set-case-syntax-pair ?Ɽ ?ɽ tbl) - (set-case-syntax-pair ?Ɑ ?ɑ tbl) - (set-case-syntax-pair ?Ɱ ?ɱ tbl) - (set-case-syntax-pair ?Ɐ ?ɐ tbl) - (set-case-syntax-pair ?Ɒ ?ɒ tbl) - (set-case-syntax-pair ?Ȿ ?ȿ tbl) - (set-case-syntax-pair ?Ɀ ?ɀ tbl) - ;; Latin Extended-D (setq c #xA720) (while (<= c #xA7FF) (modify-category-entry c ?l) (setq c (1+ c))) - (let ((pair-ranges '((#xA722 . #xA72F) - (#xA732 . #xA76F) - (#xA779 . #xA77C) - (#xA77E . #xA787) - (#xA78B . #xA78E) - (#xA790 . #xA793) - (#xA796 . #xA7A9) - (#xA7B4 . #xA7B7)))) - (dolist (elt pair-ranges) - (let ((from (car elt)) (to (cdr elt))) - (while (< from to) - (set-case-syntax-pair from (1+ from) tbl) - (setq from (+ from 2)))))) - - (set-case-syntax-pair ?Ᵹ ?ᵹ tbl) - (set-case-syntax-pair ?Ɦ ?ɦ tbl) - (set-case-syntax-pair ?Ɜ ?ɜ tbl) - (set-case-syntax-pair ?Ɡ ?ɡ tbl) - (set-case-syntax-pair ?Ɬ ?ɬ tbl) - (set-case-syntax-pair ?Ɪ ?ɪ tbl) - (set-case-syntax-pair ?Ʞ ?ʞ tbl) - (set-case-syntax-pair ?Ʇ ?ʇ tbl) - (set-case-syntax-pair ?Ʝ ?ʝ tbl) - (set-case-syntax-pair ?Ꭓ ?ꭓ tbl) - ;; Latin Extended-E (setq c #xAB30) (while (<= c #xAB64) @@ -719,102 +577,19 @@ ?L ;; Greek (modify-category-entry '(#x0370 . #x03ff) ?g) - (setq c #x0370) - (while (<= c #x03ff) - (if (or (and (>= c #x0391) (<= c #x03a1)) - (and (>= c #x03a3) (<= c #x03ab))) - (set-case-syntax-pair c (+ c 32) tbl)) - (and (>= c #x03da) - (<= c #x03ee) - (zerop (% c 2)) - (set-case-syntax-pair c (1+ c) tbl)) - (setq c (1+ c))) - (set-case-syntax-pair ?Ά ?ά tbl) - (set-case-syntax-pair ?Έ ?έ tbl) - (set-case-syntax-pair ?Ή ?ή tbl) - (set-case-syntax-pair ?Ί ?ί tbl) - (set-case-syntax-pair ?Ό ?ό tbl) - (set-case-syntax-pair ?Ύ ?ύ tbl) - (set-case-syntax-pair ?Ώ ?ώ tbl) ;; Armenian (setq c #x531) - (while (<= c #x556) - (set-case-syntax-pair c (+ c #x30) tbl) - (setq c (1+ c))) ;; Greek Extended (modify-category-entry '(#x1f00 . #x1fff) ?g) - (setq c #x1f00) - (while (<= c #x1fff) - (and (<= (logand c #x000f) 7) - (<= c #x1fa7) - (not (memq c '(#x1f16 #x1f17 #x1f56 #x1f57 - #x1f50 #x1f52 #x1f54 #x1f56))) - (/= (logand c #x00f0) #x70) - (set-case-syntax-pair (+ c 8) c tbl)) - (setq c (1+ c))) - (set-case-syntax-pair ?Ᾰ ?ᾰ tbl) - (set-case-syntax-pair ?Ᾱ ?ᾱ tbl) - (set-case-syntax-pair ?Ὰ ?ὰ tbl) - (set-case-syntax-pair ?Ά ?ά tbl) - (set-case-syntax-pair ?ᾼ ?ᾳ tbl) - (set-case-syntax-pair ?Ὲ ?ὲ tbl) - (set-case-syntax-pair ?Έ ?έ tbl) - (set-case-syntax-pair ?Ὴ ?ὴ tbl) - (set-case-syntax-pair ?Ή ?ή tbl) - (set-case-syntax-pair ?ῌ ?ῃ tbl) - (set-case-syntax-pair ?Ῐ ?ῐ tbl) - (set-case-syntax-pair ?Ῑ ?ῑ tbl) - (set-case-syntax-pair ?Ὶ ?ὶ tbl) - (set-case-syntax-pair ?Ί ?ί tbl) - (set-case-syntax-pair ?Ῠ ?ῠ tbl) - (set-case-syntax-pair ?Ῡ ?ῡ tbl) - (set-case-syntax-pair ?Ὺ ?ὺ tbl) - (set-case-syntax-pair ?Ύ ?ύ tbl) - (set-case-syntax-pair ?Ῥ ?ῥ tbl) - (set-case-syntax-pair ?Ὸ ?ὸ tbl) - (set-case-syntax-pair ?Ό ?ό tbl) - (set-case-syntax-pair ?Ὼ ?ὼ tbl) - (set-case-syntax-pair ?Ώ ?ώ tbl) - (set-case-syntax-pair ?ῼ ?ῳ tbl) ;; cyrillic (modify-category-entry '(#x0400 . #x04FF) ?y) - (setq c #x0400) - (while (<= c #x04ff) - (and (>= c #x0400) - (<= c #x040f) - (set-case-syntax-pair c (+ c 80) tbl)) - (and (>= c #x0410) - (<= c #x042f) - (set-case-syntax-pair c (+ c 32) tbl)) - (and (zerop (% c 2)) - (or (and (>= c #x0460) (<= c #x0480)) - (and (>= c #x048c) (<= c #x04be)) - (and (>= c #x04d0) (<= c #x052e))) - (set-case-syntax-pair c (1+ c) tbl)) - (setq c (1+ c))) - (set-case-syntax-pair ?Ӂ ?ӂ tbl) - (set-case-syntax-pair ?Ӄ ?ӄ tbl) - (set-case-syntax-pair ?Ӈ ?ӈ tbl) - (set-case-syntax-pair ?Ӌ ?ӌ tbl) - (modify-category-entry '(#xA640 . #xA69F) ?y) - (setq c #xA640) - (while (<= c #xA66C) - (set-case-syntax-pair c (+ c 1) tbl) - (setq c (+ c 2))) - (setq c #xA680) - (while (<= c #xA69A) - (set-case-syntax-pair c (+ c 1) tbl) - (setq c (+ c 2))) ;; Georgian (setq c #x10A0) - (while (<= c #x10CD) - (set-case-syntax-pair c (+ c #x1C60) tbl) - (setq c (1+ c))) ;; Cyrillic Extended-C (modify-category-entry '(#x1C80 . #x1C8F) ?y) @@ -844,12 +619,6 @@ ?L (set-case-syntax c "." tbl) (setq c (1+ c))) - ;; Roman numerals - (setq c #x2160) - (while (<= c #x216f) - (set-case-syntax-pair c (+ c #x10) tbl) - (setq c (1+ c))) - ;; Fixme: The following blocks might be better as symbol rather than ;; punctuation. ;; Arrows @@ -873,25 +642,11 @@ ?L ;; Circled Latin (setq c #x24b6) (while (<= c #x24cf) - (set-case-syntax-pair c (+ c 26) tbl) (modify-category-entry c ?l) (modify-category-entry (+ c 26) ?l) (setq c (1+ c))) - ;; Glagolitic - (setq c #x2C00) - (while (<= c #x2C2E) - (set-case-syntax-pair c (+ c 48) tbl) - (setq c (1+ c))) - ;; Coptic - (let ((pair-ranges '((#x2C80 . #x2CE2) - (#x2CEB . #x2CF2)))) - (dolist (elt pair-ranges) - (let ((from (car elt)) (to (cdr elt))) - (while (< from to) - (set-case-syntax-pair from (1+ from) tbl) - (setq from (+ from 2)))))) ;; There's no Coptic category. However, Coptic letters that are ;; part of the Greek block above get the Greek category, and those ;; in this block are derived from Greek letters, so let's be @@ -901,45 +656,78 @@ ?L ;; Fullwidth Latin (setq c #xff21) (while (<= c #xff3a) - (set-case-syntax-pair c (+ c #x20) tbl) (modify-category-entry c ?l) (modify-category-entry (+ c #x20) ?l) (setq c (1+ c))) - ;; Deseret - (setq c #x10400) - (while (<= c #x10427) - (set-case-syntax-pair c (+ c 28) tbl) - (setq c (1+ c))) + ;; Combining diacritics + (modify-category-entry '(#x300 . #x362) ?^) + ;; Combining marks + (modify-category-entry '(#x20d0 . #x20ff) ?^) - ;; Osage - (setq c #x104B0) - (while (<= c #x104D3) - (set-case-syntax-pair c (+ c 40) tbl) - (setq c (1+ c))) + ;; Set all Letter, uppercase; Letter, lowercase and Letter, titlecase syntax + ;; to word. FIXME: Should this also be done for Letter, modifier and Letter, + ;; other? What about other alphabetic characters? + (let ((syn-tab (standard-syntax-table))) + (map-char-table + (lambda (ch cat) + (when (memq cat '(Lu Ll Lt)) + (modify-syntax-entry ch "w " syn-tab))) + (unicode-property-table-internal 'general-category))) - ;; Old Hungarian - (setq c #x10c80) - (while (<= c #x10cb2) - (set-case-syntax-pair c (+ c #x40) tbl) - (setq c (1+ c))) + ;; Set downcase and upcase from Unicode properties - ;; Warang Citi - (setq c #x118a0) - (while (<= c #x118bf) - (set-case-syntax-pair c (+ c #x20) tbl) - (setq c (1+ c))) + ;; In some languages, such as Turkish, U+0049 LATIN CAPITAL LETTER I and + ;; U+0131 LATIN SMALL LETTER DOTLESS I make a case pair, and so do U+0130 + ;; LATIN CAPITAL LETTER I WITH DOT ABOVE and U+0069 LATIN SMALL LETTER I. - ;; Adlam - (setq c #x1e900) - (while (<= c #x1e921) - (set-case-syntax-pair c (+ c #x22) tbl) - (setq c (1+ c))) + ;; We used to set up half of those correspondence unconditionally, but that + ;; makes searches slow. So now we don't set up either half of these + ;; correspondences by default. - ;; Combining diacritics - (modify-category-entry '(#x300 . #x362) ?^) - ;; Combining marks - (modify-category-entry '(#x20d0 . #x20ff) ?^) + ;; (set-downcase-syntax ?İ ?i tbl) + ;; (set-upcase-syntax ?I ?ı tbl) + + (let ((map-unicode-property + (lambda (property func) + (map-char-table + (lambda (ch cased) + ;; ASCII characters skipped due to reasons outlined above. As of + ;; Unicode 9.0, this exception affects the following: + ;; lc(U+0130 İ) = i + ;; uc(U+0131 ı) = I + ;; uc(U+017F ſ) = S + ;; uc(U+212A K) = k + (when (> cased 127) + (let ((end (if (consp ch) (cdr ch) ch))) + (setq ch (max 128 (if (consp ch) (car ch) ch))) + (while (<= ch end) + (funcall func ch cased) + (setq ch (1+ ch)))))) + (unicode-property-table-internal property)))) + (down tbl) + (up (case-table-get-table tbl 'up))) + + ;; This works on an assumption that if toUpper(x) != x then toLower(x) == + ;; x (and the opposite for toLower/toUpper). This doesn’t hold for title + ;; case characters but those incorrect mappings will be overwritten later. + (funcall map-unicode-property 'uppercase + (lambda (lc uc) (aset down lc lc) (aset up uc uc))) + (funcall map-unicode-property 'lowercase + (lambda (uc lc) (aset down lc lc) (aset up uc uc))) + + ;; Now deal with the actual mapping. This will correctly assign casing for + ;; title-case characters. + (funcall map-unicode-property 'uppercase + (lambda (lc uc) (aset up lc uc) (aset up uc uc))) + (funcall map-unicode-property 'lowercase + (lambda (uc lc) (aset down uc lc) (aset down lc lc)))) + + ;; Clear out the extra slots so that they will be recomputed from the main + ;; (downcase) table and upcase table. Since we’re side-stepping the usual + ;; set-case-syntax-* functions, we need to do it explicitly. + (set-char-table-extra-slot tbl 1 nil) + (set-char-table-extra-slot tbl 2 nil) ;; Fixme: syntax for symbols &c ) diff --git a/test/src/casefiddle-tests.el b/test/src/casefiddle-tests.el index 4b2eeaf..ca3657d 100644 --- a/test/src/casefiddle-tests.el +++ b/test/src/casefiddle-tests.el @@ -72,8 +72,7 @@ casefiddle-tests--characters (?Σ ?Σ ?σ ?Σ) (?σ ?Σ ?σ ?Σ) - ;; FIXME: Another broken one: - ;;(?ς ?Σ ?ς ?Σ) + (?ς ?Σ ?ς ?Σ) (?Ⅷ ?Ⅷ ?ⅷ ?Ⅷ) (?ⅷ ?Ⅷ ?ⅷ ?Ⅷ))) @@ -151,7 +150,6 @@ casefiddle-tests--characters ;;("fish" "FIsh" "fish" "Fish" "Fish") ;;("Straße" "STRASSE" "straße" "Straße" "Straße") ;;("ΌΣΟΣ" "ΌΣΟΣ" "όσος" "Όσος" "Όσος") - ;;("όσος" "ΌΣΟΣ" "όσος" "Όσος" "Όσος") ;; And here’s what is actually happening: ("DŽUNGLA" "DŽUNGLA" "džungla" "DŽungla" "DŽUNGLA") ("Džungla" "DžUNGLA" "džungla" "Džungla" "Džungla") @@ -160,7 +158,8 @@ casefiddle-tests--characters ("fish" "fiSH" "fish" "fish" "fish") ("Straße" "STRAßE" "straße" "Straße" "Straße") ("ΌΣΟΣ" "ΌΣΟΣ" "όσοσ" "Όσοσ" "ΌΣΟΣ") - ("όσος" "ΌΣΟς" "όσος" "Όσος" "Όσος")) + + ("όσος" "ΌΣΟΣ" "όσος" "Όσος" "Όσος")) (nreverse errors)) (let* ((input (car test)) (expected (cdr test)) -- 2.8.0.rc3.226.g39d4020