From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!.POSTED!not-for-mail From: Michal Nazarewicz Newsgroups: gmane.emacs.bugs Subject: bug#24603: [PATCH 2/3] Generate upcase and downcase tables from Unicode data Date: Tue, 18 Oct 2016 00:03:44 +0200 Message-ID: <1476741825-32172-3-git-send-email-mina86@mina86.com> References: <1476741825-32172-1-git-send-email-mina86@mina86.com> NNTP-Posting-Host: blaine.gmane.org Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-Trace: blaine.gmane.org 1476741927 27418 195.159.176.226 (17 Oct 2016 22:05:27 GMT) X-Complaints-To: usenet@blaine.gmane.org NNTP-Posting-Date: Mon, 17 Oct 2016 22:05:27 +0000 (UTC) To: 24603@debbugs.gnu.org Original-X-From: bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane.org@gnu.org Tue Oct 18 00:05:22 2016 Return-path: Envelope-to: geb-bug-gnu-emacs@m.gmane.org Original-Received: from lists.gnu.org ([208.118.235.17]) by blaine.gmane.org with esmtp (Exim 4.84_2) (envelope-from ) id 1bwG1w-0005Ya-Vc for geb-bug-gnu-emacs@m.gmane.org; Tue, 18 Oct 2016 00:05:13 +0200 Original-Received: from localhost ([::1]:36288 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1bwG1z-0001l9-25 for geb-bug-gnu-emacs@m.gmane.org; Mon, 17 Oct 2016 18:05:15 -0400 Original-Received: from eggs.gnu.org ([2001:4830:134:3::10]:38056) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1bwG0s-00014g-94 for bug-gnu-emacs@gnu.org; Mon, 17 Oct 2016 18:04:08 -0400 Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1bwG0p-0007WW-2K for bug-gnu-emacs@gnu.org; Mon, 17 Oct 2016 18:04:06 -0400 Original-Received: from debbugs.gnu.org ([208.118.235.43]:59189) by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_128_CBC_SHA1:16) (Exim 4.71) (envelope-from ) id 1bwG0o-0007WH-Ut for bug-gnu-emacs@gnu.org; Mon, 17 Oct 2016 18:04:02 -0400 Original-Received: from Debian-debbugs by debbugs.gnu.org with local (Exim 4.84_2) (envelope-from ) id 1bwG0o-0000J9-OI for bug-gnu-emacs@gnu.org; Mon, 17 Oct 2016 18:04:02 -0400 X-Loop: help-debbugs@gnu.org Resent-From: Michal Nazarewicz Original-Sender: "Debbugs-submit" Resent-CC: bug-gnu-emacs@gnu.org Resent-Date: Mon, 17 Oct 2016 22:04:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 24603 X-GNU-PR-Package: emacs X-GNU-PR-Keywords: patch Original-Received: via spool by 24603-submit@debbugs.gnu.org id=B24603.14767418381152 (code B ref 24603); Mon, 17 Oct 2016 22:04:02 +0000 Original-Received: (at 24603) by debbugs.gnu.org; 17 Oct 2016 22:03:58 +0000 Original-Received: from localhost ([127.0.0.1]:37142 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1bwG0j-0000IQ-Kc for submit@debbugs.gnu.org; Mon, 17 Oct 2016 18:03:58 -0400 Original-Received: from mail-qk0-f172.google.com ([209.85.220.172]:33684) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1bwG0g-0000Hj-50 for 24603@debbugs.gnu.org; Mon, 17 Oct 2016 18:03:54 -0400 Original-Received: by mail-qk0-f172.google.com with SMTP id n189so261405607qke.0 for <24603@debbugs.gnu.org>; Mon, 17 Oct 2016 15:03:54 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20120113; h=sender:from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=Lr4QvC4LWGYvTQ+GG4zCohe2xdOsxi2nmm112/OM4r8=; b=f9+uQjkSMvbTFJiW2yWqmgSXSbxX+1yUErc9KH81A3nf6JILzQ2f7tbXwKa/CQKLiG hN5l3wjZW1rIiUOZICAo1St9WdIDzGre1hisKNCcBbk2KNRrKzB1kYaEj+KvYn7gSZ9V /eAqNz2qTm3mBSEaMcmbl0po5evpKUYgYqE9jFM15WJuJEUD4rQoGgDA9ABZunal9zan CvGYWgR4RlA5VhOE9S3vDTd8jI2T4KUd+GBUSheRs+ylrvZq2aDX5x0TmXTxbClCF4mV +wQnWx0M2CE4q33eqVzpe63jiHlKb2qSgPCz1L7i6ZAzR+8EhC058YtgFY4y/P+XBYN6 U7vw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:sender:from:to:cc:subject:date:message-id :in-reply-to:references:mime-version:content-transfer-encoding; bh=Lr4QvC4LWGYvTQ+GG4zCohe2xdOsxi2nmm112/OM4r8=; b=hT42g2y8wT4ssBlYh7B78B3Oblzjq3V6Ho9E6i6IEXvARNw/qFTHmfoYaRKrREqtRU Ge6rx4KI0c0Dx23JD4n/7aj91UDSNgob9wdzQxbiSn18bDcF3lG9CvDnzIRSo/Tn0dBi RkrT5Lr+kSZVyJvKlJo0J9SfY6wtki//vgydaKDxrNanBrGxmoqQDicvv96jjsLTtO4T Y2lbn6gueFdOQx/fpaPp7Noug6MzhmifhX54RiVcHsmEVHdLHWwNq4RmFOf9Z16MgY83 P/OA+bczBRI/5WUryxCr4P3yh58/rXcELL3ukh9dN1PNP/irytklpawM+8tl2Q0UXPS0 KEXQ== X-Gm-Message-State: AA6/9Rkags322s7FSzAkAVDApIsfC3zoiMgwNmNPfdSGn6S4dkRyJIgk06HQZASzj2u2OKbd X-Received: by 10.194.200.39 with SMTP id jp7mr14642325wjc.64.1476741828069; Mon, 17 Oct 2016 15:03:48 -0700 (PDT) Original-Received: from mpn.zrh.corp.google.com ([172.16.113.135]) by smtp.gmail.com with ESMTPSA id jt8sm56495708wjc.33.2016.10.17.15.03.47 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 17 Oct 2016 15:03:47 -0700 (PDT) Original-Received: by mpn.zrh.corp.google.com (Postfix, from userid 126942) id 8EB0E1E0209; Tue, 18 Oct 2016 00:03:46 +0200 (CEST) X-Mailer: git-send-email 2.8.0.rc3.226.g39d4020 In-Reply-To: <1476741825-32172-1-git-send-email-mina86@mina86.com> X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic] X-Received-From: 208.118.235.43 X-BeenThere: bug-gnu-emacs@gnu.org List-Id: "Bug reports for GNU Emacs, the Swiss army knife of text editors" List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane.org@gnu.org Original-Sender: "bug-gnu-emacs" Xref: news.gmane.org gmane.emacs.bugs:124610 Archived-At: Use Unicode data to generate case tables instead of mostly repeating them in lisp code. Do that in a way which maps ‘Dz’ (and similar) digraph to ‘dz’ when down- and ‘DZ’ when upcasing. https://debbugs.gnu.org/cgi/bugreport.cgi?msg=89;bug=24603 lists all changes to syntax table and case tables introduced by this commit. * lisp/international/characters.el: Remove case-pairs defined with explicit Lisp code and instead use Unicode character properties. * test/src/casefiddle-tests.el (casefiddle-tests--characters, casefiddle-tests-casing): Update test cases which are now working as they should. --- lisp/international/characters.el | 345 ++++++++------------------------------- test/src/casefiddle-tests.el | 7 +- 2 files changed, 73 insertions(+), 279 deletions(-) diff --git a/lisp/international/characters.el b/lisp/international/characters.el index 1757d2b..8dd9c73 100644 --- a/lisp/international/characters.el +++ b/lisp/international/characters.el @@ -543,10 +543,6 @@ ?L (set-case-syntax ?½ "_" tbl) (set-case-syntax ?¾ "_" tbl) (set-case-syntax ?¿ "." tbl) - (let ((c 192)) - (while (<= c 222) - (set-case-syntax-pair c (+ c 32) tbl) - (setq c (1+ c)))) (set-case-syntax ?× "_" tbl) (set-case-syntax ?ß "w" tbl) (set-case-syntax ?÷ "_" tbl) @@ -558,101 +554,8 @@ ?L (modify-category-entry c ?l) (setq c (1+ c))) - (let ((pair-ranges '((#x0100 . #x012F) - (#x0132 . #x0137) - (#x0139 . #x0148) - (#x014a . #x0177) - (#x0179 . #x017E) - (#x0182 . #x0185) - (#x0187 . #x0188) - (#x018B . #x018C) - (#x0191 . #x0192) - (#x0198 . #x0199) - (#x01A0 . #x01A5) - (#x01A7 . #x01A8) - (#x01AC . #x01AD) - (#x01AF . #x01B0) - (#x01B3 . #x01B6) - (#x01B8 . #x01B9) - (#x01BC . #x01BD) - (#x01CD . #x01DC) - (#x01DE . #x01EF) - (#x01F4 . #x01F5) - (#x01F8 . #x021F) - (#x0222 . #x0233) - (#x023B . #x023C) - (#x0241 . #x0242) - (#x0246 . #x024F)))) - (dolist (elt pair-ranges) - (let ((from (car elt)) (to (cdr elt))) - (while (< from to) - (set-case-syntax-pair from (1+ from) tbl) - (setq from (+ from 2)))))) - - (set-case-syntax-pair ?Ÿ ?ÿ tbl) - - ;; In some languages, such as Turkish, U+0049 LATIN CAPITAL LETTER I - ;; and U+0131 LATIN SMALL LETTER DOTLESS I make a case pair, and so - ;; do U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE and U+0069 LATIN - ;; SMALL LETTER I. - - ;; We used to set up half of those correspondence unconditionally, - ;; but that makes searches slow. So now we don't set up either half - ;; of these correspondences by default. - - ;; (set-downcase-syntax ?İ ?i tbl) - ;; (set-upcase-syntax ?I ?ı tbl) - - (set-case-syntax-pair ?Ɓ ?ɓ tbl) - (set-case-syntax-pair ?Ɔ ?ɔ tbl) - (set-case-syntax-pair ?Ɖ ?ɖ tbl) - (set-case-syntax-pair ?Ɗ ?ɗ tbl) - (set-case-syntax-pair ?Ǝ ?ǝ tbl) - (set-case-syntax-pair ?Ə ?ə tbl) - (set-case-syntax-pair ?Ɛ ?ɛ tbl) - (set-case-syntax-pair ?Ɠ ?ɠ tbl) - (set-case-syntax-pair ?Ɣ ?ɣ tbl) - (set-case-syntax-pair ?Ɩ ?ɩ tbl) - (set-case-syntax-pair ?Ɨ ?ɨ tbl) - (set-case-syntax-pair ?Ɯ ?ɯ tbl) - (set-case-syntax-pair ?Ɲ ?ɲ tbl) - (set-case-syntax-pair ?Ɵ ?ɵ tbl) - (set-case-syntax-pair ?Ʀ ?ʀ tbl) - (set-case-syntax-pair ?Ʃ ?ʃ tbl) - (set-case-syntax-pair ?Ʈ ?ʈ tbl) - (set-case-syntax-pair ?Ʊ ?ʊ tbl) - (set-case-syntax-pair ?Ʋ ?ʋ tbl) - (set-case-syntax-pair ?Ʒ ?ʒ tbl) - ;; We use set-downcase-syntax below, since we want upcase of dž - ;; return DŽ, not Dž, and the same for the rest. - (set-case-syntax-pair ?DŽ ?dž tbl) - (set-downcase-syntax ?Dž ?dž tbl) - (set-case-syntax-pair ?LJ ?lj tbl) - (set-downcase-syntax ?Lj ?lj tbl) - (set-case-syntax-pair ?NJ ?nj tbl) - (set-downcase-syntax ?Nj ?nj tbl) - - ;; 01F0; F; 006A 030C; # LATIN SMALL LETTER J WITH CARON - - (set-case-syntax-pair ?DZ ?dz tbl) - (set-downcase-syntax ?Dz ?dz tbl) - (set-case-syntax-pair ?Ƕ ?ƕ tbl) - (set-case-syntax-pair ?Ƿ ?ƿ tbl) - (set-case-syntax-pair ?Ⱥ ?ⱥ tbl) - (set-case-syntax-pair ?Ƚ ?ƚ tbl) - (set-case-syntax-pair ?Ⱦ ?ⱦ tbl) - (set-case-syntax-pair ?Ƀ ?ƀ tbl) - (set-case-syntax-pair ?Ʉ ?ʉ tbl) - (set-case-syntax-pair ?Ʌ ?ʌ tbl) - ;; Latin Extended Additional (modify-category-entry '(#x1e00 . #x1ef9) ?l) - (setq c #x1e00) - (while (<= c #x1ef9) - (and (zerop (% c 2)) - (or (<= c #x1e94) (>= c #x1ea0)) - (set-case-syntax-pair c (1+ c) tbl)) - (setq c (1+ c))) ;; Latin Extended-C (setq c #x2C60) @@ -660,57 +563,12 @@ ?L (modify-category-entry c ?l) (setq c (1+ c))) - (let ((pair-ranges '((#x2C60 . #x2C61) - (#x2C67 . #x2C6C) - (#x2C72 . #x2C73) - (#x2C75 . #x2C76)))) - (dolist (elt pair-ranges) - (let ((from (car elt)) (to (cdr elt))) - (while (< from to) - (set-case-syntax-pair from (1+ from) tbl) - (setq from (+ from 2)))))) - - (set-case-syntax-pair ?Ɫ ?ɫ tbl) - (set-case-syntax-pair ?Ᵽ ?ᵽ tbl) - (set-case-syntax-pair ?Ɽ ?ɽ tbl) - (set-case-syntax-pair ?Ɑ ?ɑ tbl) - (set-case-syntax-pair ?Ɱ ?ɱ tbl) - (set-case-syntax-pair ?Ɐ ?ɐ tbl) - (set-case-syntax-pair ?Ɒ ?ɒ tbl) - (set-case-syntax-pair ?Ȿ ?ȿ tbl) - (set-case-syntax-pair ?Ɀ ?ɀ tbl) - ;; Latin Extended-D (setq c #xA720) (while (<= c #xA7FF) (modify-category-entry c ?l) (setq c (1+ c))) - (let ((pair-ranges '((#xA722 . #xA72F) - (#xA732 . #xA76F) - (#xA779 . #xA77C) - (#xA77E . #xA787) - (#xA78B . #xA78E) - (#xA790 . #xA793) - (#xA796 . #xA7A9) - (#xA7B4 . #xA7B7)))) - (dolist (elt pair-ranges) - (let ((from (car elt)) (to (cdr elt))) - (while (< from to) - (set-case-syntax-pair from (1+ from) tbl) - (setq from (+ from 2)))))) - - (set-case-syntax-pair ?Ᵹ ?ᵹ tbl) - (set-case-syntax-pair ?Ɦ ?ɦ tbl) - (set-case-syntax-pair ?Ɜ ?ɜ tbl) - (set-case-syntax-pair ?Ɡ ?ɡ tbl) - (set-case-syntax-pair ?Ɬ ?ɬ tbl) - (set-case-syntax-pair ?Ɪ ?ɪ tbl) - (set-case-syntax-pair ?Ʞ ?ʞ tbl) - (set-case-syntax-pair ?Ʇ ?ʇ tbl) - (set-case-syntax-pair ?Ʝ ?ʝ tbl) - (set-case-syntax-pair ?Ꭓ ?ꭓ tbl) - ;; Latin Extended-E (setq c #xAB30) (while (<= c #xAB64) @@ -719,102 +577,19 @@ ?L ;; Greek (modify-category-entry '(#x0370 . #x03ff) ?g) - (setq c #x0370) - (while (<= c #x03ff) - (if (or (and (>= c #x0391) (<= c #x03a1)) - (and (>= c #x03a3) (<= c #x03ab))) - (set-case-syntax-pair c (+ c 32) tbl)) - (and (>= c #x03da) - (<= c #x03ee) - (zerop (% c 2)) - (set-case-syntax-pair c (1+ c) tbl)) - (setq c (1+ c))) - (set-case-syntax-pair ?Ά ?ά tbl) - (set-case-syntax-pair ?Έ ?έ tbl) - (set-case-syntax-pair ?Ή ?ή tbl) - (set-case-syntax-pair ?Ί ?ί tbl) - (set-case-syntax-pair ?Ό ?ό tbl) - (set-case-syntax-pair ?Ύ ?ύ tbl) - (set-case-syntax-pair ?Ώ ?ώ tbl) ;; Armenian (setq c #x531) - (while (<= c #x556) - (set-case-syntax-pair c (+ c #x30) tbl) - (setq c (1+ c))) ;; Greek Extended (modify-category-entry '(#x1f00 . #x1fff) ?g) - (setq c #x1f00) - (while (<= c #x1fff) - (and (<= (logand c #x000f) 7) - (<= c #x1fa7) - (not (memq c '(#x1f16 #x1f17 #x1f56 #x1f57 - #x1f50 #x1f52 #x1f54 #x1f56))) - (/= (logand c #x00f0) #x70) - (set-case-syntax-pair (+ c 8) c tbl)) - (setq c (1+ c))) - (set-case-syntax-pair ?Ᾰ ?ᾰ tbl) - (set-case-syntax-pair ?Ᾱ ?ᾱ tbl) - (set-case-syntax-pair ?Ὰ ?ὰ tbl) - (set-case-syntax-pair ?Ά ?ά tbl) - (set-case-syntax-pair ?ᾼ ?ᾳ tbl) - (set-case-syntax-pair ?Ὲ ?ὲ tbl) - (set-case-syntax-pair ?Έ ?έ tbl) - (set-case-syntax-pair ?Ὴ ?ὴ tbl) - (set-case-syntax-pair ?Ή ?ή tbl) - (set-case-syntax-pair ?ῌ ?ῃ tbl) - (set-case-syntax-pair ?Ῐ ?ῐ tbl) - (set-case-syntax-pair ?Ῑ ?ῑ tbl) - (set-case-syntax-pair ?Ὶ ?ὶ tbl) - (set-case-syntax-pair ?Ί ?ί tbl) - (set-case-syntax-pair ?Ῠ ?ῠ tbl) - (set-case-syntax-pair ?Ῡ ?ῡ tbl) - (set-case-syntax-pair ?Ὺ ?ὺ tbl) - (set-case-syntax-pair ?Ύ ?ύ tbl) - (set-case-syntax-pair ?Ῥ ?ῥ tbl) - (set-case-syntax-pair ?Ὸ ?ὸ tbl) - (set-case-syntax-pair ?Ό ?ό tbl) - (set-case-syntax-pair ?Ὼ ?ὼ tbl) - (set-case-syntax-pair ?Ώ ?ώ tbl) - (set-case-syntax-pair ?ῼ ?ῳ tbl) ;; cyrillic (modify-category-entry '(#x0400 . #x04FF) ?y) - (setq c #x0400) - (while (<= c #x04ff) - (and (>= c #x0400) - (<= c #x040f) - (set-case-syntax-pair c (+ c 80) tbl)) - (and (>= c #x0410) - (<= c #x042f) - (set-case-syntax-pair c (+ c 32) tbl)) - (and (zerop (% c 2)) - (or (and (>= c #x0460) (<= c #x0480)) - (and (>= c #x048c) (<= c #x04be)) - (and (>= c #x04d0) (<= c #x052e))) - (set-case-syntax-pair c (1+ c) tbl)) - (setq c (1+ c))) - (set-case-syntax-pair ?Ӂ ?ӂ tbl) - (set-case-syntax-pair ?Ӄ ?ӄ tbl) - (set-case-syntax-pair ?Ӈ ?ӈ tbl) - (set-case-syntax-pair ?Ӌ ?ӌ tbl) - (modify-category-entry '(#xA640 . #xA69F) ?y) - (setq c #xA640) - (while (<= c #xA66C) - (set-case-syntax-pair c (+ c 1) tbl) - (setq c (+ c 2))) - (setq c #xA680) - (while (<= c #xA69A) - (set-case-syntax-pair c (+ c 1) tbl) - (setq c (+ c 2))) ;; Georgian (setq c #x10A0) - (while (<= c #x10CD) - (set-case-syntax-pair c (+ c #x1C60) tbl) - (setq c (1+ c))) ;; Cyrillic Extended-C (modify-category-entry '(#x1C80 . #x1C8F) ?y) @@ -844,12 +619,6 @@ ?L (set-case-syntax c "." tbl) (setq c (1+ c))) - ;; Roman numerals - (setq c #x2160) - (while (<= c #x216f) - (set-case-syntax-pair c (+ c #x10) tbl) - (setq c (1+ c))) - ;; Fixme: The following blocks might be better as symbol rather than ;; punctuation. ;; Arrows @@ -873,25 +642,11 @@ ?L ;; Circled Latin (setq c #x24b6) (while (<= c #x24cf) - (set-case-syntax-pair c (+ c 26) tbl) (modify-category-entry c ?l) (modify-category-entry (+ c 26) ?l) (setq c (1+ c))) - ;; Glagolitic - (setq c #x2C00) - (while (<= c #x2C2E) - (set-case-syntax-pair c (+ c 48) tbl) - (setq c (1+ c))) - ;; Coptic - (let ((pair-ranges '((#x2C80 . #x2CE2) - (#x2CEB . #x2CF2)))) - (dolist (elt pair-ranges) - (let ((from (car elt)) (to (cdr elt))) - (while (< from to) - (set-case-syntax-pair from (1+ from) tbl) - (setq from (+ from 2)))))) ;; There's no Coptic category. However, Coptic letters that are ;; part of the Greek block above get the Greek category, and those ;; in this block are derived from Greek letters, so let's be @@ -901,45 +656,85 @@ ?L ;; Fullwidth Latin (setq c #xff21) (while (<= c #xff3a) - (set-case-syntax-pair c (+ c #x20) tbl) (modify-category-entry c ?l) (modify-category-entry (+ c #x20) ?l) (setq c (1+ c))) - ;; Deseret - (setq c #x10400) - (while (<= c #x10427) - (set-case-syntax-pair c (+ c 28) tbl) - (setq c (1+ c))) + ;; Combining diacritics + (modify-category-entry '(#x300 . #x362) ?^) + ;; Combining marks + (modify-category-entry '(#x20d0 . #x20ff) ?^) - ;; Osage - (setq c #x104B0) - (while (<= c #x104D3) - (set-case-syntax-pair c (+ c 40) tbl) - (setq c (1+ c))) + ;; Set all Letter, uppercase; Letter, lowercase and Letter, titlecase syntax + ;; to word. + (let ((syn-tab (standard-syntax-table))) + (map-char-table + (lambda (ch cat) + (when (memq cat '(Lu Ll Lt)) + (modify-syntax-entry ch "w " syn-tab))) + (unicode-property-table-internal 'general-category)) - ;; Old Hungarian - (setq c #x10c80) - (while (<= c #x10cb2) - (set-case-syntax-pair c (+ c #x40) tbl) - (setq c (1+ c))) + ;; Ⅰ through Ⅻ had word syntax in the past so set it here as well. + ;; General category of those characers is Number, Letter. + (modify-syntax-entry '(#x2160 . #x216b) "w " syn-tab) - ;; Warang Citi - (setq c #x118a0) - (while (<= c #x118bf) - (set-case-syntax-pair c (+ c #x20) tbl) - (setq c (1+ c))) + ;; ⓐ thourgh ⓩ are symbols, other according to Unicode but Emacs set + ;; their syntax to word in the past so keep backwards compatibility. + (modify-syntax-entry '(#x24D0 . #x24E9) "w " syn-tab)) - ;; Adlam - (setq c #x1e900) - (while (<= c #x1e921) - (set-case-syntax-pair c (+ c #x22) tbl) - (setq c (1+ c))) + ;; Set downcase and upcase from Unicode properties - ;; Combining diacritics - (modify-category-entry '(#x300 . #x362) ?^) - ;; Combining marks - (modify-category-entry '(#x20d0 . #x20ff) ?^) + ;; In some languages, such as Turkish, U+0049 LATIN CAPITAL LETTER I and + ;; U+0131 LATIN SMALL LETTER DOTLESS I make a case pair, and so do U+0130 + ;; LATIN CAPITAL LETTER I WITH DOT ABOVE and U+0069 LATIN SMALL LETTER I. + + ;; We used to set up half of those correspondence unconditionally, but that + ;; makes searches slow. So now we don't set up either half of these + ;; correspondences by default. + + ;; (set-downcase-syntax ?İ ?i tbl) + ;; (set-upcase-syntax ?I ?ı tbl) + + (let ((map-unicode-property + (lambda (property func) + (map-char-table + (lambda (ch cased) + ;; ASCII characters skipped due to reasons outlined above. As of + ;; Unicode 9.0, this exception affects the following: + ;; lc(U+0130 İ) = i + ;; uc(U+0131 ı) = I + ;; uc(U+017F ſ) = S + ;; uc(U+212A K) = k + (when (> cased 127) + (let ((end (if (consp ch) (cdr ch) ch))) + (setq ch (max 128 (if (consp ch) (car ch) ch))) + (while (<= ch end) + (funcall func ch cased) + (setq ch (1+ ch)))))) + (unicode-property-table-internal property)))) + (down tbl) + (up (case-table-get-table tbl 'up))) + + ;; This works on an assumption that if toUpper(x) != x then toLower(x) == + ;; x (and the opposite for toLower/toUpper). This doesn’t hold for title + ;; case characters but those incorrect mappings will be overwritten later. + (funcall map-unicode-property 'uppercase + (lambda (lc uc) (aset down lc lc) (aset up uc uc))) + (funcall map-unicode-property 'lowercase + (lambda (uc lc) (aset down lc lc) (aset up uc uc))) + + ;; Now deal with the actual mapping. This will correctly assign casing for + ;; title-case characters. + (funcall map-unicode-property 'uppercase + (lambda (lc uc) (aset up lc uc) (aset up uc uc))) + (funcall map-unicode-property 'lowercase + (lambda (uc lc) (aset down uc lc) (aset down lc lc)))) + + ;; Clear out the extra slots so that they will be recomputed from the main + ;; (downcase) table and upcase table. Since we’re side-stepping the usual + ;; set-case-syntax-* functions, we need to do it explicitly. + (set-char-table-extra-slot tbl 1 nil) + (set-char-table-extra-slot tbl 2 nil) ;; Fixme: syntax for symbols &c ) diff --git a/test/src/casefiddle-tests.el b/test/src/casefiddle-tests.el index b5a77a1..4142aa0 100644 --- a/test/src/casefiddle-tests.el +++ b/test/src/casefiddle-tests.el @@ -72,8 +72,7 @@ casefiddle-tests--characters (?Σ ?Σ ?σ ?Σ) (?σ ?Σ ?σ ?Σ) - ;; FIXME: Another broken one: - ;;(?ς ?Σ ?ς ?Σ) + (?ς ?Σ ?ς ?Σ) (?Ⅷ ?Ⅷ ?ⅷ ?Ⅷ) (?ⅷ ?Ⅷ ?ⅷ ?Ⅷ))) @@ -195,7 +194,6 @@ casefiddle-tests--test-casing ;;("fish" "FIsh" "fish" "Fish" "Fish") ;;("Straße" "STRASSE" "straße" "Straße" "Straße") ;;("ΌΣΟΣ" "ΌΣΟΣ" "όσος" "Όσος" "Όσος") - ;;("όσος" "ΌΣΟΣ" "όσος" "Όσος" "Όσος") ;; And here’s what is actually happening: ("DŽUNGLA" "DŽUNGLA" "džungla" "DŽungla" "DŽUNGLA") ("Džungla" "DžUNGLA" "džungla" "Džungla" "Džungla") @@ -204,7 +202,8 @@ casefiddle-tests--test-casing ("fish" "fiSH" "fish" "fish" "fish") ("Straße" "STRAßE" "straße" "Straße" "Straße") ("ΌΣΟΣ" "ΌΣΟΣ" "όσοσ" "Όσοσ" "ΌΣΟΣ") - ("όσος" "ΌΣΟς" "όσος" "Όσος" "Όσος")))))) + + ("όσος" "ΌΣΟΣ" "όσος" "Όσος" "Όσος")))))) (ert-deftest casefiddle-tests-casing-byte8 () (should-not -- 2.8.0.rc3.226.g39d4020