From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!.POSTED!not-for-mail From: Michal Nazarewicz Newsgroups: gmane.emacs.bugs Subject: bug#24603: [RFC 02/18] Generate upcase and downcase tables from Unicode data Date: Thu, 06 Oct 2016 22:29:06 +0200 Organization: http://mina86.com/ Message-ID: References: <1475543441-10493-1-git-send-email-mina86@mina86.com> <1475543441-10493-2-git-send-email-mina86@mina86.com> <83a8eko9pl.fsf@gnu.org> <83oa30m9v4.fsf@gnu.org> <83d1jgm3dl.fsf@gnu.org> <838tu4m2kg.fsf@gnu.org> NNTP-Posting-Host: blaine.gmane.org Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="=-=-=" X-Trace: blaine.gmane.org 1475785844 13946 195.159.176.226 (6 Oct 2016 20:30:44 GMT) X-Complaints-To: usenet@blaine.gmane.org NNTP-Posting-Date: Thu, 6 Oct 2016 20:30:44 +0000 (UTC) User-Agent: Notmuch/0.19+53~g2e63a09 (http://notmuchmail.org) Emacs/25.1.50.2 (x86_64-unknown-linux-gnu) Cc: 24603@debbugs.gnu.org To: Eli Zaretskii Original-X-From: bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane.org@gnu.org Thu Oct 06 22:30:38 2016 Return-path: Envelope-to: geb-bug-gnu-emacs@m.gmane.org Original-Received: from lists.gnu.org ([208.118.235.17]) by blaine.gmane.org with esmtp (Exim 4.84_2) (envelope-from ) id 1bsFJ3-0000WL-1c for geb-bug-gnu-emacs@m.gmane.org; Thu, 06 Oct 2016 22:30:17 +0200 Original-Received: from localhost ([::1]:59249 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1bsFJ1-0003tF-Mi for geb-bug-gnu-emacs@m.gmane.org; Thu, 06 Oct 2016 16:30:15 -0400 Original-Received: from eggs.gnu.org ([2001:4830:134:3::10]:56523) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1bsFIt-0003qW-Aa for bug-gnu-emacs@gnu.org; Thu, 06 Oct 2016 16:30:09 -0400 Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1bsFIp-00065W-5k for bug-gnu-emacs@gnu.org; Thu, 06 Oct 2016 16:30:07 -0400 Original-Received: from debbugs.gnu.org ([208.118.235.43]:40749) by eggs.gnu.org with 
esmtp (Exim 4.71) (envelope-from ) id 1bsFIp-00065A-1p for bug-gnu-emacs@gnu.org; Thu, 06 Oct 2016 16:30:03 -0400 Original-Received: from Debian-debbugs by debbugs.gnu.org with local (Exim 4.84_2) (envelope-from ) id 1bsFIo-0002oH-Kl for bug-gnu-emacs@gnu.org; Thu, 06 Oct 2016 16:30:02 -0400 X-Loop: help-debbugs@gnu.org Resent-From: Michal Nazarewicz Original-Sender: "Debbugs-submit" Resent-CC: bug-gnu-emacs@gnu.org Resent-Date: Thu, 06 Oct 2016 20:30:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 24603 X-GNU-PR-Package: emacs X-GNU-PR-Keywords: Original-Received: via spool by 24603-submit@debbugs.gnu.org id=B24603.147578575910713 (code B ref 24603); Thu, 06 Oct 2016 20:30:02 +0000 Original-Received: (at 24603) by debbugs.gnu.org; 6 Oct 2016 20:29:19 +0000 Original-Received: from localhost ([127.0.0.1]:46939 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1bsFI6-0002mi-8k for submit@debbugs.gnu.org; Thu, 06 Oct 2016 16:29:18 -0400 Original-Received: from mail-wm0-f42.google.com ([74.125.82.42]:37195) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1bsFI3-0002mS-Rt for 24603@debbugs.gnu.org; Thu, 06 Oct 2016 16:29:16 -0400 Original-Received: by mail-wm0-f42.google.com with SMTP id b201so72094409wmb.0 for <24603@debbugs.gnu.org>; Thu, 06 Oct 2016 13:29:15 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20120113; h=sender:from:to:cc:subject:in-reply-to:organization:references :user-agent:face:date:message-id:mime-version; bh=jWJEcD7b5ShiWhUVIF7vbwEh9ERm7lDBob7WCTuPJjg=; b=E1lNi1lfDIKIaHsEw6MesFhWjx0szrsIqVYFqFCOSPc4sY+VEwBkILAoWs2BX6AKwx a9Xbq6nx097j6juwQrjQV6jKo4BgpYOSVuAB/aazsOfAJtUBsKyhGwPzR1yEpJTTKAc5 dkErxPUNCH8WLfskDn91ZwfCfSfyhNWUSfk/RCjJ0n4S4CvyFKzkU1CUz35005J6aUw4 vTsiw9bPReSEOB6h6EczxlpLGwzayogo4A+m61Nx4VSCMCC2Tih97nNzxTsOWtraJoBx ACRAlsuSxhI3nZMS20UNxxemfaTzdaE0ZUSYrYEygvd3wU8PwyyQtcvevSLE2aNoQGhu y5GQ== 
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:sender:from:to:cc:subject:in-reply-to :organization:references:user-agent:face:date:message-id :mime-version; bh=jWJEcD7b5ShiWhUVIF7vbwEh9ERm7lDBob7WCTuPJjg=; b=WR2t/Z42KMI31GSEGWkDI9d8a8xXGEQtmfx1TvZL4c8W7bBA80+ePXIjpp4XlnGJIF qO5MS7LbYY5UsxTWU4POJYgqulnKJwEeqaaibtRltmwlmyFJg6D0yhHbcc2w70uVbPhW oX/LneroEA3FrMyj5RWtUK9yZBAFUqvk2/zoABy4jFKeuk/8YU3uF9YfI8juSgg4a6z+ TnlBCzJbX9jQlMYt3IkWgwICJdSSPDVgq7olPTqbz6ypm+9dVz1/24u4A6dWJSkXa5lk K9cXapiesFmI5OE3yVrSMM2s/3rR5Jiq1Vwgutcm3WuWggm/KLwWcKmZ5ntBEAvHOWFm PZpw== X-Gm-Message-State: AA6/9RkjRAHaV3+adnMoQzBYRTS17qT7RZtCidcn3cU72H14Rrsj9L3ip+OcClQB11l795UL X-Received: by 10.28.97.86 with SMTP id v83mr9651073wmb.49.1475785749816; Thu, 06 Oct 2016 13:29:09 -0700 (PDT) Original-Received: from mpn-glaptop ([2620:0:105f:310:d49a:e80a:8f82:e086]) by smtp.gmail.com with ESMTPSA id 123sm16796148wmj.5.2016.10.06.13.29.07 (version=TLS1_2 cipher=AES128-SHA bits=128/128); Thu, 06 Oct 2016 13:29:08 -0700 (PDT) In-Reply-To: <838tu4m2kg.fsf@gnu.org> Face: iVBORw0KGgoAAAANSUhEUgAAADAAAAAwBAMAAAClLOS0AAAAJFBMVEWbfGlUPDDHgE57V0jUupKjgIObY0PLrom9mH4dFRK4gmjPs41MxjOgAAACP0lEQVQ4T23Sv2vbQBQHcBk1xE6WyALX107VUEgmn6+ouUwpEQQ6uRjttkWP4CkBg2M0BQLBdPFZYPsyFYo7qEtKDQ7on+t7+nF2Ux8ahD587717OmNYrOvycHsZ+o2r051wHTHysAvGb8ygvgu4QWT0sCmkgZCIEnlV2X8BtyraazFGDuxhmKSQJMlwHQ7v5MHSNxmz78rfElwAa3ieVD9e+hBhjaPDDG6NgFo2f4wBMNIo5YmRtF0RyDgFjJjlMIWbnuM4x9MMfABGTlN4qgIQB4A1DEyA1BHWtfeWNUMwiVJKoqh97KrkOO+qzgluVYLvFCUKAX73nONeBr7BGMdM6Sg0kuep03VywLaIzRiVr+GAzKlpQIsAFnWAG2e6DT5WmWDiudZMIc6hYrMOmeMQK9WX0B+/RfjzL9DI7Y9/Iayn29Ci0r2i4f9gMimMSZLCDMalgQGU5hnUtqAN0OGvEmO1Wnl0C0wWSCEHnuHBqmygxdxA8oWXwbipoc1EoNR9DqOpBpOJrnr0criQab9ZT4LL+wI+K7GBQH30CrhUruilgP9DRTrhVWZCiAyILP+wiuLeCKGTD6r/nc8LOJcAwR6IBTUs+7CASw3QFZ0MdA2PI3zNziH4ZKVhXCRMBjeZ1DWMekKwDCASwExy+NQ86TaykaDAFHO4aP48y4 
fIcDM5yOG8GcTLbOyp8A8azjJI93JFd1EA6yN8sSxMQJWoABqniRZVykYgRXErzrdqExAoUrRb0xfRp8p2A/4XmfilTtkDZ4cAAAAASUVORK5CYII= X-Face: -TR8(rDTHy/(xl?SfWd1|3:TTgDIatE^t'vop%*gVg[kn$t{EpK(P"VQ=~T2#ysNmJKN$"yTRLB4YQs$4{[.]Fc1)*O]3+XO^oXM>Q#b^ix, O)Zbn)q[y06$`e3?C)`CwR9y5riE=fv^X@x$y?D:XO6L&x4f-}}I4=VRNwiA^t1-ZrVK^07.Pi/57c_du'& X-PGP: 50751FF4 X-PGP-FP: AC1F 5F5C D418 88F8 CC84 5858 2060 4012 5075 1FF4 X-Hashcash: 1:20:161006:24603@debbugs.gnu.org::NuJ5DQnkHJ3qmoQQ:00000000000000000000000000000000000000000bjf X-Hashcash: 1:20:161006:eliz@gnu.org::Yl1efzpSwpoEyxzu:0000008og X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic] X-Received-From: 208.118.235.43 X-BeenThere: bug-gnu-emacs@gnu.org List-Id: "Bug reports for GNU Emacs, the Swiss army knife of text editors" List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane.org@gnu.org Original-Sender: "bug-gnu-emacs" Xref: news.gmane.org gmane.emacs.bugs:124131 Archived-At: --=-=-= Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable On Tue, Oct 04 2016, Eli Zaretskii wrote: >> Date: Tue, 04 Oct 2016 20:27:02 +0300 >> From: Eli Zaretskii >> Cc: 24603@debbugs.gnu.org >>=20 >> > From: Michal Nazarewicz >> > Cc: 24603@debbugs.gnu.org >> > Date: Tue, 04 Oct 2016 18:57:03 +0200 >> >=20 >> > > I think we should document all the changes. >> >=20 >> > I wouldn=E2=80=99t know where to put such documentation. >>=20 >> On a separate file under admin/unidata/, if we cannot find a better >> place. > > Or maybe just mention in the commit log the URL of the message in the > bugtracker's records, where you posted the diffs, it might be good > enough. That=E2=80=99s easy enough. 
--=20 Best regards =E3=83=9F=E3=83=8F=E3=82=A6 =E2=80=9C=F0=9D=93=B6=F0=9D=93=B2=F0=9D=93=B7=F0=9D=93=AA86=E2=80=9D =E3=83=8A=E3=82=B6=E3=83=AC=E3=83=B4=E3=82=A4=E3=83=84 =C2=ABIf at first you don=E2=80=99t succeed, give up skydiving=C2=BB --=-=-= Content-Type: text/x-diff; charset=utf-8 Content-Disposition: inline; filename=0002-Generate-upcase-and-downcase-tables-from-Unicode-dat.patch Content-Transfer-Encoding: quoted-printable >From 9d2fd43c4d442543a650a4d3cb95b0c2aa6a0c4e Mon Sep 17 00:00:00 2001 From: Michal Nazarewicz Date: Mon, 19 Sep 2016 00:23:40 +0200 Subject: [PATCH 02/19] Generate upcase and downcase tables from Unicode data MIME-Version: 1.0 Content-Type: text/plain; charset=3DUTF-8 Content-Transfer-Encoding: 8bit Use Unicode data to generate case tables instead of mostly repeating them in Lisp code. Do that in a way which maps =E2=80=98Dz=E2=80=99 (and similar) digraphs to =E2=80=98dz=E2=80=99 when down- and =E2=80=98DZ=E2=80=99 when upcasing. https://debbugs.gnu.org/cgi/bugreport.cgi?msg=3D89;bug=3D24603 lists all changes to the syntax table and case tables introduced by this commit. * lisp/international/characters.el: Remove case-pairs defined with explicit Lisp code and instead use Unicode character properties. * test/src/casefiddle-tests.el (casefiddle-tests--characters, casefiddle-tests-casing): Update test cases which are now working as they should. --- lisp/international/characters.el | 345 ++++++++------------------------------- test/src/casefiddle-tests.el | 7 +- 2 files changed, 73 insertions(+), 279 deletions(-) diff --git a/lisp/international/characters.el b/lisp/international/characters.el index 1757d2b..8dd9c73 100644 --- a/lisp/international/characters.el +++ b/lisp/international/characters.el @@ -543,10 +543,6 @@ ?L (set-case-syntax ?=C2=BD "_" tbl) (set-case-syntax ?=C2=BE "_" tbl) (set-case-syntax ?=C2=BF "." 
tbl) - (let ((c 192)) - (while (<=3D c 222) - (set-case-syntax-pair c (+ c 32) tbl) - (setq c (1+ c)))) (set-case-syntax ?=C3=97 "_" tbl) (set-case-syntax ?=C3=9F "w" tbl) (set-case-syntax ?=C3=B7 "_" tbl) @@ -558,101 +554,8 @@ ?L (modify-category-entry c ?l) (setq c (1+ c))) =20 - (let ((pair-ranges '((#x0100 . #x012F) - (#x0132 . #x0137) - (#x0139 . #x0148) - (#x014a . #x0177) - (#x0179 . #x017E) - (#x0182 . #x0185) - (#x0187 . #x0188) - (#x018B . #x018C) - (#x0191 . #x0192) - (#x0198 . #x0199) - (#x01A0 . #x01A5) - (#x01A7 . #x01A8) - (#x01AC . #x01AD) - (#x01AF . #x01B0) - (#x01B3 . #x01B6) - (#x01B8 . #x01B9) - (#x01BC . #x01BD) - (#x01CD . #x01DC) - (#x01DE . #x01EF) - (#x01F4 . #x01F5) - (#x01F8 . #x021F) - (#x0222 . #x0233) - (#x023B . #x023C) - (#x0241 . #x0242) - (#x0246 . #x024F)))) - (dolist (elt pair-ranges) - (let ((from (car elt)) (to (cdr elt))) - (while (< from to) - (set-case-syntax-pair from (1+ from) tbl) - (setq from (+ from 2)))))) - - (set-case-syntax-pair ?=C5=B8 ?=C3=BF tbl) - - ;; In some languages, such as Turkish, U+0049 LATIN CAPITAL LETTER I - ;; and U+0131 LATIN SMALL LETTER DOTLESS I make a case pair, and so - ;; do U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE and U+0069 LATIN - ;; SMALL LETTER I. - - ;; We used to set up half of those correspondence unconditionally, - ;; but that makes searches slow. So now we don't set up either half - ;; of these correspondences by default. 
- - ;; (set-downcase-syntax ?=C4=B0 ?i tbl) - ;; (set-upcase-syntax ?I ?=C4=B1 tbl) - - (set-case-syntax-pair ?=C6=81 ?=C9=93 tbl) - (set-case-syntax-pair ?=C6=86 ?=C9=94 tbl) - (set-case-syntax-pair ?=C6=89 ?=C9=96 tbl) - (set-case-syntax-pair ?=C6=8A ?=C9=97 tbl) - (set-case-syntax-pair ?=C6=8E ?=C7=9D tbl) - (set-case-syntax-pair ?=C6=8F ?=C9=99 tbl) - (set-case-syntax-pair ?=C6=90 ?=C9=9B tbl) - (set-case-syntax-pair ?=C6=93 ?=C9=A0 tbl) - (set-case-syntax-pair ?=C6=94 ?=C9=A3 tbl) - (set-case-syntax-pair ?=C6=96 ?=C9=A9 tbl) - (set-case-syntax-pair ?=C6=97 ?=C9=A8 tbl) - (set-case-syntax-pair ?=C6=9C ?=C9=AF tbl) - (set-case-syntax-pair ?=C6=9D ?=C9=B2 tbl) - (set-case-syntax-pair ?=C6=9F ?=C9=B5 tbl) - (set-case-syntax-pair ?=C6=A6 ?=CA=80 tbl) - (set-case-syntax-pair ?=C6=A9 ?=CA=83 tbl) - (set-case-syntax-pair ?=C6=AE ?=CA=88 tbl) - (set-case-syntax-pair ?=C6=B1 ?=CA=8A tbl) - (set-case-syntax-pair ?=C6=B2 ?=CA=8B tbl) - (set-case-syntax-pair ?=C6=B7 ?=CA=92 tbl) - ;; We use set-downcase-syntax below, since we want upcase of =C7=86 - ;; return =C7=84, not =C7=85, and the same for the rest. - (set-case-syntax-pair ?=C7=84 ?=C7=86 tbl) - (set-downcase-syntax ?=C7=85 ?=C7=86 tbl) - (set-case-syntax-pair ?=C7=87 ?=C7=89 tbl) - (set-downcase-syntax ?=C7=88 ?=C7=89 tbl) - (set-case-syntax-pair ?=C7=8A ?=C7=8C tbl) - (set-downcase-syntax ?=C7=8B ?=C7=8C tbl) - - ;; 01F0; F; 006A 030C; # LATIN SMALL LETTER J WITH CARON - - (set-case-syntax-pair ?=C7=B1 ?=C7=B3 tbl) - (set-downcase-syntax ?=C7=B2 ?=C7=B3 tbl) - (set-case-syntax-pair ?=C7=B6 ?=C6=95 tbl) - (set-case-syntax-pair ?=C7=B7 ?=C6=BF tbl) - (set-case-syntax-pair ?=C8=BA ?=E2=B1=A5 tbl) - (set-case-syntax-pair ?=C8=BD ?=C6=9A tbl) - (set-case-syntax-pair ?=C8=BE ?=E2=B1=A6 tbl) - (set-case-syntax-pair ?=C9=83 ?=C6=80 tbl) - (set-case-syntax-pair ?=C9=84 ?=CA=89 tbl) - (set-case-syntax-pair ?=C9=85 ?=CA=8C tbl) - ;; Latin Extended Additional (modify-category-entry '(#x1e00 . 
#x1ef9) ?l) - (setq c #x1e00) - (while (<=3D c #x1ef9) - (and (zerop (% c 2)) - (or (<=3D c #x1e94) (>=3D c #x1ea0)) - (set-case-syntax-pair c (1+ c) tbl)) - (setq c (1+ c))) =20 ;; Latin Extended-C (setq c #x2C60) @@ -660,57 +563,12 @@ ?L (modify-category-entry c ?l) (setq c (1+ c))) =20 - (let ((pair-ranges '((#x2C60 . #x2C61) - (#x2C67 . #x2C6C) - (#x2C72 . #x2C73) - (#x2C75 . #x2C76)))) - (dolist (elt pair-ranges) - (let ((from (car elt)) (to (cdr elt))) - (while (< from to) - (set-case-syntax-pair from (1+ from) tbl) - (setq from (+ from 2)))))) - - (set-case-syntax-pair ?=E2=B1=A2 ?=C9=AB tbl) - (set-case-syntax-pair ?=E2=B1=A3 ?=E1=B5=BD tbl) - (set-case-syntax-pair ?=E2=B1=A4 ?=C9=BD tbl) - (set-case-syntax-pair ?=E2=B1=AD ?=C9=91 tbl) - (set-case-syntax-pair ?=E2=B1=AE ?=C9=B1 tbl) - (set-case-syntax-pair ?=E2=B1=AF ?=C9=90 tbl) - (set-case-syntax-pair ?=E2=B1=B0 ?=C9=92 tbl) - (set-case-syntax-pair ?=E2=B1=BE ?=C8=BF tbl) - (set-case-syntax-pair ?=E2=B1=BF ?=C9=80 tbl) - ;; Latin Extended-D (setq c #xA720) (while (<=3D c #xA7FF) (modify-category-entry c ?l) (setq c (1+ c))) =20 - (let ((pair-ranges '((#xA722 . #xA72F) - (#xA732 . #xA76F) - (#xA779 . #xA77C) - (#xA77E . #xA787) - (#xA78B . #xA78E) - (#xA790 . #xA793) - (#xA796 . #xA7A9) - (#xA7B4 . 
#xA7B7)))) - (dolist (elt pair-ranges) - (let ((from (car elt)) (to (cdr elt))) - (while (< from to) - (set-case-syntax-pair from (1+ from) tbl) - (setq from (+ from 2)))))) - - (set-case-syntax-pair ?=EA=9D=BD ?=E1=B5=B9 tbl) - (set-case-syntax-pair ?=EA=9E=AA ?=C9=A6 tbl) - (set-case-syntax-pair ?=EA=9E=AB ?=C9=9C tbl) - (set-case-syntax-pair ?=EA=9E=AC ?=C9=A1 tbl) - (set-case-syntax-pair ?=EA=9E=AD ?=C9=AC tbl) - (set-case-syntax-pair ?=EA=9E=AE ?=C9=AA tbl) - (set-case-syntax-pair ?=EA=9E=B0 ?=CA=9E tbl) - (set-case-syntax-pair ?=EA=9E=B1 ?=CA=87 tbl) - (set-case-syntax-pair ?=EA=9E=B2 ?=CA=9D tbl) - (set-case-syntax-pair ?=EA=9E=B3 ?=EA=AD=93 tbl) - ;; Latin Extended-E (setq c #xAB30) (while (<=3D c #xAB64) @@ -719,102 +577,19 @@ ?L =20 ;; Greek (modify-category-entry '(#x0370 . #x03ff) ?g) - (setq c #x0370) - (while (<=3D c #x03ff) - (if (or (and (>=3D c #x0391) (<=3D c #x03a1)) - (and (>=3D c #x03a3) (<=3D c #x03ab))) - (set-case-syntax-pair c (+ c 32) tbl)) - (and (>=3D c #x03da) - (<=3D c #x03ee) - (zerop (% c 2)) - (set-case-syntax-pair c (1+ c) tbl)) - (setq c (1+ c))) - (set-case-syntax-pair ?=CE=86 ?=CE=AC tbl) - (set-case-syntax-pair ?=CE=88 ?=CE=AD tbl) - (set-case-syntax-pair ?=CE=89 ?=CE=AE tbl) - (set-case-syntax-pair ?=CE=8A ?=CE=AF tbl) - (set-case-syntax-pair ?=CE=8C ?=CF=8C tbl) - (set-case-syntax-pair ?=CE=8E ?=CF=8D tbl) - (set-case-syntax-pair ?=CE=8F ?=CF=8E tbl) =20 ;; Armenian (setq c #x531) - (while (<=3D c #x556) - (set-case-syntax-pair c (+ c #x30) tbl) - (setq c (1+ c))) =20 ;; Greek Extended (modify-category-entry '(#x1f00 . 
#x1fff) ?g) - (setq c #x1f00) - (while (<=3D c #x1fff) - (and (<=3D (logand c #x000f) 7) - (<=3D c #x1fa7) - (not (memq c '(#x1f16 #x1f17 #x1f56 #x1f57 - #x1f50 #x1f52 #x1f54 #x1f56))) - (/=3D (logand c #x00f0) #x70) - (set-case-syntax-pair (+ c 8) c tbl)) - (setq c (1+ c))) - (set-case-syntax-pair ?=E1=BE=B8 ?=E1=BE=B0 tbl) - (set-case-syntax-pair ?=E1=BE=B9 ?=E1=BE=B1 tbl) - (set-case-syntax-pair ?=E1=BE=BA ?=E1=BD=B0 tbl) - (set-case-syntax-pair ?=E1=BE=BB ?=E1=BD=B1 tbl) - (set-case-syntax-pair ?=E1=BE=BC ?=E1=BE=B3 tbl) - (set-case-syntax-pair ?=E1=BF=88 ?=E1=BD=B2 tbl) - (set-case-syntax-pair ?=E1=BF=89 ?=E1=BD=B3 tbl) - (set-case-syntax-pair ?=E1=BF=8A ?=E1=BD=B4 tbl) - (set-case-syntax-pair ?=E1=BF=8B ?=E1=BD=B5 tbl) - (set-case-syntax-pair ?=E1=BF=8C ?=E1=BF=83 tbl) - (set-case-syntax-pair ?=E1=BF=98 ?=E1=BF=90 tbl) - (set-case-syntax-pair ?=E1=BF=99 ?=E1=BF=91 tbl) - (set-case-syntax-pair ?=E1=BF=9A ?=E1=BD=B6 tbl) - (set-case-syntax-pair ?=E1=BF=9B ?=E1=BD=B7 tbl) - (set-case-syntax-pair ?=E1=BF=A8 ?=E1=BF=A0 tbl) - (set-case-syntax-pair ?=E1=BF=A9 ?=E1=BF=A1 tbl) - (set-case-syntax-pair ?=E1=BF=AA ?=E1=BD=BA tbl) - (set-case-syntax-pair ?=E1=BF=AB ?=E1=BD=BB tbl) - (set-case-syntax-pair ?=E1=BF=AC ?=E1=BF=A5 tbl) - (set-case-syntax-pair ?=E1=BF=B8 ?=E1=BD=B8 tbl) - (set-case-syntax-pair ?=E1=BF=B9 ?=E1=BD=B9 tbl) - (set-case-syntax-pair ?=E1=BF=BA ?=E1=BD=BC tbl) - (set-case-syntax-pair ?=E1=BF=BB ?=E1=BD=BD tbl) - (set-case-syntax-pair ?=E1=BF=BC ?=E1=BF=B3 tbl) =20 ;; cyrillic (modify-category-entry '(#x0400 . 
#x04FF) ?y) - (setq c #x0400) - (while (<=3D c #x04ff) - (and (>=3D c #x0400) - (<=3D c #x040f) - (set-case-syntax-pair c (+ c 80) tbl)) - (and (>=3D c #x0410) - (<=3D c #x042f) - (set-case-syntax-pair c (+ c 32) tbl)) - (and (zerop (% c 2)) - (or (and (>=3D c #x0460) (<=3D c #x0480)) - (and (>=3D c #x048c) (<=3D c #x04be)) - (and (>=3D c #x04d0) (<=3D c #x052e))) - (set-case-syntax-pair c (1+ c) tbl)) - (setq c (1+ c))) - (set-case-syntax-pair ?=D3=81 ?=D3=82 tbl) - (set-case-syntax-pair ?=D3=83 ?=D3=84 tbl) - (set-case-syntax-pair ?=D3=87 ?=D3=88 tbl) - (set-case-syntax-pair ?=D3=8B ?=D3=8C tbl) - (modify-category-entry '(#xA640 . #xA69F) ?y) - (setq c #xA640) - (while (<=3D c #xA66C) - (set-case-syntax-pair c (+ c 1) tbl) - (setq c (+ c 2))) - (setq c #xA680) - (while (<=3D c #xA69A) - (set-case-syntax-pair c (+ c 1) tbl) - (setq c (+ c 2))) =20 ;; Georgian (setq c #x10A0) - (while (<=3D c #x10CD) - (set-case-syntax-pair c (+ c #x1C60) tbl) - (setq c (1+ c))) =20 ;; Cyrillic Extended-C (modify-category-entry '(#x1C80 . #x1C8F) ?y) @@ -844,12 +619,6 @@ ?L (set-case-syntax c "." tbl) (setq c (1+ c))) =20 - ;; Roman numerals - (setq c #x2160) - (while (<=3D c #x216f) - (set-case-syntax-pair c (+ c #x10) tbl) - (setq c (1+ c))) - ;; Fixme: The following blocks might be better as symbol rather than ;; punctuation. ;; Arrows @@ -873,25 +642,11 @@ ?L ;; Circled Latin (setq c #x24b6) (while (<=3D c #x24cf) - (set-case-syntax-pair c (+ c 26) tbl) (modify-category-entry c ?l) (modify-category-entry (+ c 26) ?l) (setq c (1+ c))) =20 - ;; Glagolitic - (setq c #x2C00) - (while (<=3D c #x2C2E) - (set-case-syntax-pair c (+ c 48) tbl) - (setq c (1+ c))) - ;; Coptic - (let ((pair-ranges '((#x2C80 . #x2CE2) - (#x2CEB . #x2CF2)))) - (dolist (elt pair-ranges) - (let ((from (car elt)) (to (cdr elt))) - (while (< from to) - (set-case-syntax-pair from (1+ from) tbl) - (setq from (+ from 2)))))) ;; There's no Coptic category. 
However, Coptic letters that are ;; part of the Greek block above get the Greek category, and those ;; in this block are derived from Greek letters, so let's be @@ -901,45 +656,85 @@ ?L ;; Fullwidth Latin (setq c #xff21) (while (<=3D c #xff3a) - (set-case-syntax-pair c (+ c #x20) tbl) (modify-category-entry c ?l) (modify-category-entry (+ c #x20) ?l) (setq c (1+ c))) =20 - ;; Deseret - (setq c #x10400) - (while (<=3D c #x10427) - (set-case-syntax-pair c (+ c 28) tbl) - (setq c (1+ c))) + ;; Combining diacritics + (modify-category-entry '(#x300 . #x362) ?^) + ;; Combining marks + (modify-category-entry '(#x20d0 . #x20ff) ?^) =20 - ;; Osage - (setq c #x104B0) - (while (<=3D c #x104D3) - (set-case-syntax-pair c (+ c 40) tbl) - (setq c (1+ c))) + ;; Set all Letter, uppercase; Letter, lowercase and Letter, titlecase syntax + ;; to word. + (let ((syn-tab (standard-syntax-table))) + (map-char-table + (lambda (ch cat) + (when (memq cat '(Lu Ll Lt)) + (modify-syntax-entry ch "w " syn-tab))) + (unicode-property-table-internal 'general-category)) =20 - ;; Old Hungarian - (setq c #x10c80) - (while (<=3D c #x10cb2) - (set-case-syntax-pair c (+ c #x40) tbl) - (setq c (1+ c))) + ;; =E2=85=A0 through =E2=85=AB had word syntax in the past so set it here as well. + ;; General category of those characters is Number, Letter. + (modify-syntax-entry '(#x2160 . #x216b) "w " syn-tab) =20 - ;; Warang Citi - (setq c #x118a0) - (while (<=3D c #x118bf) - (set-case-syntax-pair c (+ c #x20) tbl) - (setq c (1+ c))) + ;; =E2=93=90 through =E2=93=A9 are symbols, other according to Unicode but Emacs set + ;; their syntax to word in the past so keep backwards compatibility. + (modify-syntax-entry '(#x24D0 . #x24E9) "w " syn-tab)) =20 - ;; Adlam - (setq c #x1e900) - (while (<=3D c #x1e921) - (set-case-syntax-pair c (+ c #x22) tbl) - (setq c (1+ c))) + ;; Set downcase and upcase from Unicode properties =20 - ;; Combining diacritics - (modify-category-entry '(#x300 . 
#x362) ?^) - ;; Combining marks - (modify-category-entry '(#x20d0 . #x20ff) ?^) + ;; In some languages, such as Turkish, U+0049 LATIN CAPITAL LETTER I and + ;; U+0131 LATIN SMALL LETTER DOTLESS I make a case pair, and so do U+0130 + ;; LATIN CAPITAL LETTER I WITH DOT ABOVE and U+0069 LATIN SMALL LETTER I. + + ;; We used to set up half of those correspondences unconditionally, but that + ;; makes searches slow. So now we don't set up either half of these + ;; correspondences by default. + + ;; (set-downcase-syntax ?=C4=B0 ?i tbl) + ;; (set-upcase-syntax ?I ?=C4=B1 tbl) + + (let ((map-unicode-property + (lambda (property func) + (map-char-table + (lambda (ch cased) + ;; ASCII characters are skipped for the reasons outlined above. As of + ;; Unicode 9.0, this exception affects the following: + ;; lc(U+0130 =C4=B0) =3D i + ;; uc(U+0131 =C4=B1) =3D I + ;; uc(U+017F =C5=BF) =3D S + ;; lc(U+212A =E2=84=AA) =3D k + (when (> cased 127) + (let ((end (if (consp ch) (cdr ch) ch))) + (setq ch (max 128 (if (consp ch) (car ch) ch))) + (while (<=3D ch end) + (funcall func ch cased) + (setq ch (1+ ch)))))) + (unicode-property-table-internal property)))) + (down tbl) + (up (case-table-get-table tbl 'up))) + + ;; This relies on the assumption that if toUpper(x) !=3D x then toLower(x) =3D=3D + ;; x (and the opposite for toLower/toUpper). This doesn=E2=80=99t hold for title + ;; case characters but those incorrect mappings will be overwritten later. + (funcall map-unicode-property 'uppercase + (lambda (lc uc) (aset down lc lc) (aset up uc uc))) + (funcall map-unicode-property 'lowercase + (lambda (uc lc) (aset down lc lc) (aset up uc uc))) + + ;; Now deal with the actual mapping. This will correctly assign casing for + ;; title-case characters. 
+ (funcall map-unicode-property 'uppercase + (lambda (lc uc) (aset up lc uc) (aset up uc uc))) + (funcall map-unicode-property 'lowercase + (lambda (uc lc) (aset down uc lc) (aset down lc lc)))) + + ;; Clear out the extra slots so that they will be recomputed from the main + ;; (downcase) table and upcase table. Since we=E2=80=99re side-stepping the usual + ;; set-case-syntax-* functions, we need to do it explicitly. + (set-char-table-extra-slot tbl 1 nil) + (set-char-table-extra-slot tbl 2 nil) =20 ;; Fixme: syntax for symbols &c ) diff --git a/test/src/casefiddle-tests.el b/test/src/casefiddle-tests.el index 4b2eeaf..ca3657d 100644 --- a/test/src/casefiddle-tests.el +++ b/test/src/casefiddle-tests.el @@ -72,8 +72,7 @@ casefiddle-tests--characters =20 (?=CE=A3 ?=CE=A3 ?=CF=83 ?=CE=A3) (?=CF=83 ?=CE=A3 ?=CF=83 ?=CE=A3) - ;; FIXME: Another broken one: - ;;(?=CF=82 ?=CE=A3 ?=CF=82 ?=CE=A3) + (?=CF=82 ?=CE=A3 ?=CF=82 ?=CE=A3) =20 (?=E2=85=A7 ?=E2=85=A7 ?=E2=85=B7 ?=E2=85=A7) (?=E2=85=B7 ?=E2=85=A7 ?=E2=85=B7 ?=E2=85=A7))) @@ -151,7 +150,6 @@ casefiddle-tests--characters ;;("=EF=AC=81sh" "FIsh" "=EF=AC=81sh" "Fish" "Fish") ;;("Stra=C3=9Fe" "STRASSE" "stra=C3=9Fe" "Stra=C3=9Fe" "Stra=C3=9Fe") ;;("=CE=8C=CE=A3=CE=9F=CE=A3" "=CE=8C=CE=A3=CE=9F=CE=A3" "=CF=8C=CF=83=CE=BF=CF=82" "=CE=8C=CF=83=CE=BF=CF=82" "=CE=8C=CF=83=CE=BF=CF=82") - ;;("=CF=8C=CF=83=CE=BF=CF=82" "=CE=8C=CE=A3=CE=9F=CE=A3" "=CF=8C=CF=83=CE=BF=CF=82" "=CE=8C=CF=83=CE=BF=CF=82" "=CE=8C=CF=83=CE=BF=CF=82") ;; And here=E2=80=99s what is actually happening: ("=C7=84UNGLA" "=C7=84UNGLA" "=C7=86ungla" "=C7=84ungla" "=C7=84UNGLA") ("=C7=85ungla" "=C7=85UNGLA" "=C7=86ungla" "=C7=85ungla" "=C7=85ungla") @@ -160,7 +158,8 @@ casefiddle-tests--characters ("=EF=AC=81sh" "=EF=AC=81SH" "=EF=AC=81sh" "=EF=AC=81sh" "=EF=AC=81sh") ("Stra=C3=9Fe" "STRA=C3=9FE" "stra=C3=9Fe" "Stra=C3=9Fe" "Stra=C3=9Fe") ("=CE=8C=CE=A3=CE=9F=CE=A3" "=CE=8C=CE=A3=CE=9F=CE=A3" "=CF=8C=CF=83=CE=BF=CF=83" 
"=CE=8C=CF=83=CE=BF=CF=83" "=CE=8C=CE=A3=CE=9F=CE=A3") - ("=CF=8C=CF=83=CE=BF=CF=82" "=CE=8C=CE=A3=CE=9F=CF=82" "=CF=8C=CF=83=CE=BF=CF=82" "=CE=8C=CF=83=CE=BF=CF=82" "=CE=8C=CF=83=CE=BF=CF=82")) + + ("=CF=8C=CF=83=CE=BF=CF=82" "=CE=8C=CE=A3=CE=9F=CE=A3" "=CF=8C=CF=83=CE=BF=CF=82" "=CE=8C=CF=83=CE=BF=CF=82" "=CE=8C=CF=83=CE=BF=CF=82")) (nreverse errors)) (let* ((input (car test)) (expected (cdr test)) --=20 2.8.0.rc3.226.g39d4020 --=-=-=--