From: Eli Zaretskii
Newsgroups: gmane.emacs.bugs
Subject: bug#45660: 28.0.50; Changed word/whitespace syntax
Date: Mon, 04 Jan 2021 20:44:37 +0200
Message-ID: <838s98bkyy.fsf@gnu.org>
References: <87im8cr5t8.fsf@mail.linkov.net>
In-Reply-To: <87im8cr5t8.fsf@mail.linkov.net> (message from Juri Linkov on Mon, 04 Jan 2021 19:25:23 +0200)
To: Juri Linkov
Cc: 45660@debbugs.gnu.org

> From: Juri Linkov
> Date: Mon, 04 Jan 2021 19:25:23 +0200
>
> Some unidentified recent change during the last week broke the
> definition of word syntax and whitespace syntax.

It's this commit:

commit 70484f92a1807897dcd16189442a45385c6e7bbb
Author:     Eli Zaretskii
AuthorDate: Sat Jan 2 12:42:16 2021 +0200
Commit:     Eli Zaretskii
CommitDate: Sat Jan 2 12:42:16 2021 +0200

    Fix syntax of symbol and punctuation characters

    * lisp/international/characters.el: Adjust syntax of punctuation
    and symbol charcaters to follow that of Unicode properties.
    (Bug#44974)

diff --git a/lisp/international/characters.el b/lisp/international/characters.el
index 64460b4..88f2e20 100644
--- a/lisp/international/characters.el
+++ b/lisp/international/characters.el
@@ -317,6 +317,7 @@ ?L
 (modify-syntax-entry #x5be ".") ; MAQAF
 (modify-syntax-entry #x5c0 ".") ; PASEQ
 (modify-syntax-entry #x5c3 ".") ; SOF PASUQ
+(modify-syntax-entry #x5c6 ".") ; NUN HAFUKHA
 (modify-syntax-entry #x5f3 ".") ; GERESH
 (modify-syntax-entry #x5f4 ".") ; GERSHAYIM
 
@@ -521,6 +522,9 @@ ?L
   ;; syntax: ¢£¤¥¨ª¯²³´¶¸¹º.) There should be a well-defined way of
   ;; relating Unicode categories to Emacs syntax codes.
 
+  ;; FIXME: We should probably just use the Unicode properties to set
+  ;; up the syntax table.
+
   ;; NBSP isn't semantically interchangeable with other whitespace chars,
   ;; so it's more like punctuation.
   (set-case-syntax ?  "." tbl)
@@ -558,7 +562,7 @@ ?L
     (setq c (1+ c)))
 
   ;; Latin Extended Additional
-  (modify-category-entry '(#x1e00 . #x1ef9) ?l)
+  (modify-category-entry '(#x1E00 . #x1EF9) ?l)
 
   ;; Latin Extended-C
   (setq c #x2C60)
@@ -579,13 +583,13 @@ ?L
     (setq c (1+ c)))
 
   ;; Greek
-  (modify-category-entry '(#x0370 . #x03ff) ?g)
+  (modify-category-entry '(#x0370 . #x03FF) ?g)
 
   ;; Armenian
   (setq c #x531)
 
   ;; Greek Extended
-  (modify-category-entry '(#x1f00 . #x1fff) ?g)
+  (modify-category-entry '(#x1F00 . #x1FFF) ?g)
 
   ;; cyrillic
   (modify-category-entry '(#x0400 . #x04FF) ?y)
@@ -605,40 +609,43 @@ ?L
   (while (<= c #x200F)
     (set-case-syntax c "." tbl)
     (setq c (1+ c)))
-  ;; Fixme: These aren't all right:
   (setq c #x2010)
-  (while (<= c #x2016)
-    (set-case-syntax c "_" tbl)
+  ;; Fixme: What to do with characters that have Pi and Pf
+  ;; Unicode properties?
+  (while (<= c #x2017)
+    (set-case-syntax c "." tbl)
     (setq c (1+ c)))
   ;; Punctuation syntax for quotation marks (like `)
-  (while (<= c #x201f)
+  (while (<= c #x201F)
     (set-case-syntax c "." tbl)
     (setq c (1+ c)))
-  ;; Fixme: These aren't all right:
   (while (<= c #x2027)
-    (set-case-syntax c "_" tbl)
+    (set-case-syntax c "." tbl)
     (setq c (1+ c)))
-  (while (<= c #x206F)
+  (setq c #x2030)
+  (while (<= c #x205E)
     (set-case-syntax c "." tbl)
     (setq c (1+ c)))
+  (let ((chars '(?‹ ?› ?⁄ ?⁒)))
+    (while chars
+      (modify-syntax-entry (car chars) "_")
+      (setq chars (cdr chars))))
 
-  ;; Fixme: The following blocks might be better as symbol rather than
-  ;; punctuation.
   ;; Arrows
   (setq c #x2190)
   (while (<= c #x21FF)
-    (set-case-syntax c "." tbl)
+    (set-case-syntax c "_" tbl)
     (setq c (1+ c)))
   ;; Mathematical Operators
   (while (<= c #x22FF)
-    (set-case-syntax c "." tbl)
+    (set-case-syntax c "_" tbl)
     (setq c (1+ c)))
   ;; Miscellaneous Technical
   (while (<= c #x23FF)
-    (set-case-syntax c "." tbl)
+    (set-case-syntax c "_" tbl)
     (setq c (1+ c)))
   ;; Control Pictures
-  (while (<= c #x243F)
+  (while (<= c #x244F)
     (set-case-syntax c "_" tbl)
     (setq c (1+ c)))
 
@@ -652,13 +659,13 @@ ?L
   ;; Supplemental Mathematical Operators
   (setq c #x2A00)
   (while (<= c #x2AFF)
-    (set-case-syntax c "." tbl)
+    (set-case-syntax c "_" tbl)
     (setq c (1+ c)))
 
   ;; Miscellaneous Symbols and Arrows
   (setq c #x2B00)
   (while (<= c #x2BFF)
-    (set-case-syntax c "." tbl)
+    (set-case-syntax c "_" tbl)
     (setq c (1+ c)))
 
   ;; Coptic
@@ -676,17 +683,34 @@ ?L
 
   ;; Symbols for Legacy Computing
   (setq c #x1FB00)
+  (while (<= c #x1FBCA)
+    (set-case-syntax c "_" tbl)
+    (setq c (1+ c)))
+  ;; FIXME: Should these be digits?
   (while (<= c #x1FBFF)
     (set-case-syntax c "." tbl)
     (setq c (1+ c)))
 
   ;; Fullwidth Latin
-  (setq c #xff21)
-  (while (<= c #xff3a)
+  (setq c #xFF01)
+  (while (<= c #xFF0F)
+    (set-case-syntax c "." tbl)
+    (setq c (1+ c)))
+  (set-case-syntax #xFF04 "_" tbl)
+  (set-case-syntax #xFF0B "_" tbl)
+  (setq c #xFF21)
+  (while (<= c #xFF3A)
     (modify-category-entry c ?l)
     (modify-category-entry (+ c #x20) ?l)
     (setq c (1+ c)))
 
+  ;; Halfwidth Latin
+  (setq c #xFF64)
+  (while (<= c #xFF65)
+    (set-case-syntax c "." tbl)
+    (setq c (1+ c)))
+  (set-case-syntax #xFF61 "." tbl)
+
   ;; Combining diacritics
   (modify-category-entry '(#x300 . #x362) ?^)
   ;; Combining marks

> I noticed the change of behavior in markchars-mode that now
> disregards the character "NARROW NO-BREAK SPACE" as the word
> separator between thousands, i.e.:
>
> In Emacs 27:
> (and (string-match "\\<\\w+\\>" "4 096") (match-end 0))
> 1
>
> In Emacs 28:
> (and (string-match "\\<\\w+\\>" "4 096") (match-end 0))
> 5
>
> Note there is the character "NARROW NO-BREAK SPACE" between "4" and "096".

Previously, many characters, including U+202F, had the punctuation
('.') syntax.  I changed that to bring the syntax closer to the Unicode
Character Database (UCD), and U+202F is not a punctuation character
according to the UCD: it has the Zs general category, which means
"space separator", the same as SPC, NBSP, EN SPACE, and others.
Removing U+202F and other similar characters from the "punctuation"
group had the side effect of leaving them with the default 'w' (word)
syntax.

Should we make all Zs characters have the ' ' (whitespace) syntax?
That should be easy, but we should try to be consistent in this
regard.
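
To make the question concrete, here is a minimal sketch of what "all Zs characters get whitespace syntax" could look like.  It is only an illustration, not a proposed patch to characters.el: the helper name my/zs-whitespace-table is hypothetical, it works on a scratch copy of the standard syntax table, and it scans only the BMP on the assumption that every Zs character currently defined by Unicode lies there.

```elisp
;; Hypothetical helper, not part of Emacs.
(defun my/zs-whitespace-table ()
  "Return a copy of the standard syntax table with Zs chars as whitespace.
Only the BMP is scanned; this assumes all Zs characters live there."
  (let ((table (copy-syntax-table (standard-syntax-table)))
        (c 0))
    (while (<= c #xFFFF)
      ;; `get-char-code-property' returns the general category as a symbol.
      (when (eq (get-char-code-property c 'general-category) 'Zs)
        (modify-syntax-entry c " " table))
      (setq c (1+ c)))
    table))

;; The test from the bug report, evaluated against the scratch table.
;; "\u202f" is NARROW NO-BREAK SPACE; the word should end after "4",
;; so this is expected to return 1, as it did in Emacs 27.
(with-syntax-table (my/zs-whitespace-table)
  (and (string-match "\\<\\w+\\>" "4\u202f096") (match-end 0)))
```

Note that this also turns NBSP into whitespace, which contradicts the existing comment in characters.el about NBSP being "more like punctuation"; that is exactly the consistency question raised above.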