* bug#45660: 28.0.50; Changed word/whitespace syntax
From: Juri Linkov @ 2021-01-04 17:25 UTC
To: 45660

Some unidentified recent change during the last week broke the
definition of word syntax and whitespace syntax.

I noticed the change of behavior in markchars-mode that now
disregards the character "NARROW NO-BREAK SPACE" as the word
separator between thousands, i.e.:

In Emacs 27:
(and (string-match "\\<\\w+\\>" "4 096") (match-end 0))
1

In Emacs 28:
(and (string-match "\\<\\w+\\>" "4 096") (match-end 0))
5

Note there is the character "NARROW NO-BREAK SPACE" between "4" and "096".

Please close this bug report if this change was intentional, because if it
provides more correct definitions then other code could be adapted to such
a change.
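To see how a particular build classifies the character in question, something like the following could be evaluated. This is only a minimal sketch: `my/describe-char-class` is a hypothetical helper name (not part of Emacs or markchars-mode), and the syntax class it reports depends on the current buffer's syntax table and on the Emacs version, so no particular result is assumed for that field.

```elisp
;; Hypothetical helper for illustration; not part of Emacs.
(defun my/describe-char-class (char)
  "Return a list (NAME GENERAL-CATEGORY SYNTAX-CLASS) for CHAR.
NAME and GENERAL-CATEGORY come from the Unicode properties known to
Emacs; SYNTAX-CLASS is the syntax designator (as a one-character
string) from the current buffer's syntax table."
  (list (get-char-code-property char 'name)
        (get-char-code-property char 'general-category)
        (string (char-syntax char))))

;; NARROW NO-BREAK SPACE, the separator in the "4 096" example.
(my/describe-char-class #x202F)
```

Comparing the third element across Emacs versions shows whether the character is treated as punctuation ("."), word-constituent ("w"), or whitespace (" ").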
* bug#45660: 28.0.50; Changed word/whitespace syntax
From: Eli Zaretskii @ 2021-01-04 18:44 UTC
To: Juri Linkov; +Cc: 45660

> From: Juri Linkov <juri@linkov.net>
> Date: Mon, 04 Jan 2021 19:25:23 +0200
>
> Some unidentified recent change during the last week broke the
> definition of word syntax and whitespace syntax.

It's this commit:

  commit 70484f92a1807897dcd16189442a45385c6e7bbb
  Author:     Eli Zaretskii <eliz@gnu.org>
  AuthorDate: Sat Jan 2 12:42:16 2021 +0200
  Commit:     Eli Zaretskii <eliz@gnu.org>
  CommitDate: Sat Jan 2 12:42:16 2021 +0200

      Fix syntax of symbol and punctuation characters

      * lisp/international/characters.el: Adjust syntax of punctuation
      and symbol characters to follow that of Unicode properties.
      (Bug#44974)

diff --git a/lisp/international/characters.el b/lisp/international/characters.el
index 64460b4..88f2e20 100644
--- a/lisp/international/characters.el
+++ b/lisp/international/characters.el
@@ -317,6 +317,7 @@ ?L
 (modify-syntax-entry #x5be ".") ; MAQAF
 (modify-syntax-entry #x5c0 ".") ; PASEQ
 (modify-syntax-entry #x5c3 ".") ; SOF PASUQ
+(modify-syntax-entry #x5c6 ".") ; NUN HAFUKHA
 (modify-syntax-entry #x5f3 ".") ; GERESH
 (modify-syntax-entry #x5f4 ".") ; GERSHAYIM
 
@@ -521,6 +522,9 @@ ?L
   ;; syntax: ¢£¤¥¨ª¯²³´¶¸¹º.) There should be a well-defined way of
   ;; relating Unicode categories to Emacs syntax codes.
 
+  ;; FIXME: We should probably just use the Unicode properties to set
+  ;; up the syntax table.
+
   ;; NBSP isn't semantically interchangeable with other whitespace chars,
   ;; so it's more like punctuation.
   (set-case-syntax ?  "." tbl)
@@ -558,7 +562,7 @@ ?L
     (setq c (1+ c)))
 
   ;; Latin Extended Additional
-  (modify-category-entry '(#x1e00 . #x1ef9) ?l)
+  (modify-category-entry '(#x1E00 . #x1EF9) ?l)
 
   ;; Latin Extended-C
   (setq c #x2C60)
@@ -579,13 +583,13 @@ ?L
     (setq c (1+ c)))
 
   ;; Greek
-  (modify-category-entry '(#x0370 . #x03ff) ?g)
+  (modify-category-entry '(#x0370 . #x03FF) ?g)
 
   ;; Armenian
   (setq c #x531)
 
   ;; Greek Extended
-  (modify-category-entry '(#x1f00 . #x1fff) ?g)
+  (modify-category-entry '(#x1F00 . #x1FFF) ?g)
 
   ;; cyrillic
   (modify-category-entry '(#x0400 . #x04FF) ?y)
@@ -605,40 +609,43 @@ ?L
   (while (<= c #x200F)
     (set-case-syntax c "." tbl)
     (setq c (1+ c)))
-  ;; Fixme: These aren't all right:
   (setq c #x2010)
-  (while (<= c #x2016)
-    (set-case-syntax c "_" tbl)
+  ;; Fixme: What to do with characters that have Pi and Pf
+  ;; Unicode properties?
+  (while (<= c #x2017)
+    (set-case-syntax c "." tbl)
     (setq c (1+ c)))
   ;; Punctuation syntax for quotation marks (like `)
-  (while (<= c #x201f)
+  (while (<= c #x201F)
     (set-case-syntax c "." tbl)
     (setq c (1+ c)))
-  ;; Fixme: These aren't all right:
   (while (<= c #x2027)
-    (set-case-syntax c "_" tbl)
+    (set-case-syntax c "." tbl)
     (setq c (1+ c)))
-  (while (<= c #x206F)
+  (setq c #x2030)
+  (while (<= c #x205E)
     (set-case-syntax c "." tbl)
     (setq c (1+ c)))
+  (let ((chars '(?‹ ?› ?⁄ ?⁒)))
+    (while chars
+      (modify-syntax-entry (car chars) "_")
+      (setq chars (cdr chars))))
 
-  ;; Fixme: The following blocks might be better as symbol rather than
-  ;; punctuation.
   ;; Arrows
   (setq c #x2190)
   (while (<= c #x21FF)
-    (set-case-syntax c "." tbl)
+    (set-case-syntax c "_" tbl)
     (setq c (1+ c)))
   ;; Mathematical Operators
   (while (<= c #x22FF)
-    (set-case-syntax c "." tbl)
+    (set-case-syntax c "_" tbl)
     (setq c (1+ c)))
   ;; Miscellaneous Technical
   (while (<= c #x23FF)
-    (set-case-syntax c "." tbl)
+    (set-case-syntax c "_" tbl)
     (setq c (1+ c)))
   ;; Control Pictures
-  (while (<= c #x243F)
+  (while (<= c #x244F)
     (set-case-syntax c "_" tbl)
     (setq c (1+ c)))
 
@@ -652,13 +659,13 @@ ?L
   ;; Supplemental Mathematical Operators
   (setq c #x2A00)
   (while (<= c #x2AFF)
-    (set-case-syntax c "." tbl)
+    (set-case-syntax c "_" tbl)
     (setq c (1+ c)))
 
   ;; Miscellaneous Symbols and Arrows
   (setq c #x2B00)
   (while (<= c #x2BFF)
-    (set-case-syntax c "." tbl)
+    (set-case-syntax c "_" tbl)
     (setq c (1+ c)))
 
   ;; Coptic
@@ -676,17 +683,34 @@ ?L
 
   ;; Symbols for Legacy Computing
   (setq c #x1FB00)
+  (while (<= c #x1FBCA)
+    (set-case-syntax c "_" tbl)
+    (setq c (1+ c)))
+  ;; FIXME: Should these be digits?
   (while (<= c #x1FBFF)
     (set-case-syntax c "." tbl)
     (setq c (1+ c)))
 
   ;; Fullwidth Latin
-  (setq c #xff21)
-  (while (<= c #xff3a)
+  (setq c #xFF01)
+  (while (<= c #xFF0F)
+    (set-case-syntax c "." tbl)
+    (setq c (1+ c)))
+  (set-case-syntax #xFF04 "_" tbl)
+  (set-case-syntax #xFF0B "_" tbl)
+  (setq c #xFF21)
+  (while (<= c #xFF3A)
     (modify-category-entry c ?l)
     (modify-category-entry (+ c #x20) ?l)
     (setq c (1+ c)))
 
+  ;; Halfwidth Latin
+  (setq c #xFF64)
+  (while (<= c #xFF65)
+    (set-case-syntax c "." tbl)
+    (setq c (1+ c)))
+  (set-case-syntax #xFF61 "." tbl)
+
   ;; Combining diacritics
   (modify-category-entry '(#x300 . #x362) ?^)
   ;; Combining marks

> I noticed the change of behavior in markchars-mode that now
> disregards the character "NARROW NO-BREAK SPACE" as the word
> separator between thousands, i.e.:
>
> In Emacs 27:
> (and (string-match "\\<\\w+\\>" "4 096") (match-end 0))
> 1
>
> In Emacs 28:
> (and (string-match "\\<\\w+\\>" "4 096") (match-end 0))
> 5
>
> Note there is the character "NARROW NO-BREAK SPACE" between "4" and "096".

Previously, many characters, including U+202F, had the punctuation
('.') syntax.  I modified that to be closer to the Unicode
Character Database (UCD), and U+202F is not a punctuation character
according to the UCD.  It has the Zs general category, which means
"space separator", the same as SPC, NBSP, EN SPACE, and others.

Removing U+202F and other similar characters from the "punctuation"
group had the side effect of leaving it at the default 'w' syntax.

Should we make all Zs characters have the ' ' (whitespace) syntax?
That should be easy, but we should try being consistent in this
regard.
* bug#45660: 28.0.50; Changed word/whitespace syntax
From: martin rudalics @ 2021-01-04 18:54 UTC
To: Eli Zaretskii, Juri Linkov; +Cc: 45660

> Should we make all Zs characters have the ' ' (whitespace) syntax?
> That should be easy, but we should try being consistent in this
> regard.

What would be the downside of doing that?

martin
* bug#45660: 28.0.50; Changed word/whitespace syntax
From: Eli Zaretskii @ 2021-01-04 19:19 UTC
To: martin rudalics; +Cc: 45660, juri

> Cc: 45660@debbugs.gnu.org
> From: martin rudalics <rudalics@gmx.at>
> Date: Mon, 4 Jan 2021 19:54:33 +0100
>
> > Should we make all Zs characters have the ' ' (whitespace) syntax?
> > That should be easy, but we should try being consistent in this
> > regard.
>
> What would be the downside of doing that?

As always, changing the syntax of at least some of those characters.
What that would cause is anyone's guess.
* bug#45660: 28.0.50; Changed word/whitespace syntax
From: Juri Linkov @ 2021-01-05 18:20 UTC
To: Eli Zaretskii; +Cc: 45660

> Previously, many characters, including U+202F, had the punctuation
> ('.') syntax.  I modified that to be closer to the Unicode
> Character Database (UCD), and U+202F is not a punctuation character
> according to the UCD.  It has the Zs general category, which means
> "space separator", the same as SPC, NBSP, EN SPACE, and others.

So according to the Unicode standard it should have whitespace syntax?

And indeed, I see no reason for similar characters to have different syntax:

  name: NO-BREAK SPACE
  general-category: Zs (Separator, Space)
  syntax:   which means: whitespace

  name: NARROW NO-BREAK SPACE
  general-category: Zs (Separator, Space)
  syntax: w which means: word

> Removing U+202F and other similar characters from the "punctuation"
> group had the side effect of leaving it at the default 'w' syntax.
>
> Should we make all Zs characters have the ' ' (whitespace) syntax?
> That should be easy, but we should try being consistent in this
> regard.

Should the word characters separated by NO-BREAK SPACE be treated as one word?
If there is no reason to treat space characters as part of words, then all
characters with the Zs general category could have the same whitespace syntax.
* bug#45660: 28.0.50; Changed word/whitespace syntax
From: Eli Zaretskii @ 2021-01-05 18:45 UTC
To: Juri Linkov; +Cc: 45660

> From: Juri Linkov <juri@linkov.net>
> Cc: 45660@debbugs.gnu.org
> Date: Tue, 05 Jan 2021 20:20:44 +0200
>
> > Previously, many characters, including U+202F, had the punctuation
> > ('.') syntax.  I modified that to be closer to the Unicode
> > Character Database (UCD), and U+202F is not a punctuation character
> > according to the UCD.  It has the Zs general category, which means
> > "space separator", the same as SPC, NBSP, EN SPACE, and others.
>
> So according to the Unicode standard it should have whitespace syntax?

Unicode doesn't have the concept of "syntax", it's our invention.  For
some syntactic categories, it makes sense to follow the corresponding
Unicode general category.  Two examples are "punctuation" and
"symbols".

The question whether to treat Zs as whitespace syntax is on the table.
We previously treated many such characters as "punctuation", which
doesn't seem right to me.  That is why I removed them from the
"punctuation" syntax, and you got bitten by the result (because the
default syntax is "word-constituent").

> Should the word characters separated by NO-BREAK SPACE be treated as one word?

That's a good question.  Do we currently treat them as such?  I don't
think so, because NBSP has the '.' syntax, i.e. "punctuation".

> If there is no reason to treat space characters as part of words, then all
> characters with the Zs general category could have the same whitespace syntax.

I tend to agree.  If no objections or new issues arise, I will do that
in a couple of days.

Thanks.
* bug#45660: 28.0.50; Changed word/whitespace syntax
From: martin rudalics @ 2021-01-05 18:53 UTC
To: Juri Linkov, Eli Zaretskii; +Cc: 45660

> Should the word characters separated by NO-BREAK SPACE be treated as one word?

'forward-word' should stop but a line should not be broken there.  So
IIUC this is a question of what's cheaper in terms of implementation.

martin
* bug#45660: 28.0.50; Changed word/whitespace syntax
From: Eli Zaretskii @ 2021-01-05 19:26 UTC
To: martin rudalics; +Cc: 45660, juri

> Cc: 45660@debbugs.gnu.org
> From: martin rudalics <rudalics@gmx.at>
> Date: Tue, 5 Jan 2021 19:53:13 +0100
>
> > Should the word characters separated by NO-BREAK SPACE be treated as one word?
>
> 'forward-word' should stop but a line should not be broken there.  So
> IIUC this is a question of what's cheaper in terms of implementation.

We don't break lines according to syntax, we break them according to
the "line breakable" category and other rules.
* bug#45660: 28.0.50; Changed word/whitespace syntax
From: Eli Zaretskii @ 2021-01-08 12:06 UTC
To: Juri Linkov; +Cc: 45660-done

> From: Juri Linkov <juri@linkov.net>
> Cc: 45660@debbugs.gnu.org
> Date: Tue, 05 Jan 2021 20:20:44 +0200
>
> > Previously, many characters, including U+202F, had the punctuation
> > ('.') syntax.  I modified that to be closer to the Unicode
> > Character Database (UCD), and U+202F is not a punctuation character
> > according to the UCD.  It has the Zs general category, which means
> > "space separator", the same as SPC, NBSP, EN SPACE, and others.
>
> So according to the Unicode standard it should have whitespace syntax?
>
> And indeed, I see no reason for similar characters to have different syntax:
>
>   name: NO-BREAK SPACE
>   general-category: Zs (Separator, Space)
>   syntax:   which means: whitespace
>
>   name: NARROW NO-BREAK SPACE
>   general-category: Zs (Separator, Space)
>   syntax: w which means: word
>
> > Removing U+202F and other similar characters from the "punctuation"
> > group had the side effect of leaving it at the default 'w' syntax.
> >
> > Should we make all Zs characters have the ' ' (whitespace) syntax?
> > That should be easy, but we should try being consistent in this
> > regard.
>
> Should the word characters separated by NO-BREAK SPACE be treated as one word?
> If there is no reason to treat space characters as part of words, then all
> characters with the Zs general category could have the same whitespace syntax.

No further comments, so I've now made the change on master whereby all
characters with the Zs general category are given the whitespace syntax.

I'm therefore closing this bug; please reopen if there are any left-overs
or undesired effects.
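The idea of giving every Zs character whitespace syntax can be illustrated with a private syntax table. This is a sketch under stated assumptions, not the change that was committed to characters.el: `my/zs-whitespace-table` is a hypothetical name, the scan is limited to code points up to U+3000 (IDEOGRAPHIC SPACE) on the assumption that no Zs character lies beyond it, and the standard syntax table itself is left untouched.

```elisp
;; Hypothetical helper for illustration; not the code from the commit.
(defun my/zs-whitespace-table ()
  "Return a copy of the standard syntax table with Zs chars as whitespace.
Every character up to U+3000 whose Unicode general category is Zs
(space separator) gets the \" \" (whitespace) syntax class."
  (let ((table (copy-syntax-table (standard-syntax-table)))
        (c 0))
    (while (<= c #x3000)
      (when (eq (get-char-code-property c 'general-category) 'Zs)
        (modify-syntax-entry c " " table))
      (setq c (1+ c)))
    table))

;; The test from the original report, run under the private table.
;; "\u202F" is NARROW NO-BREAK SPACE, so the string is "4", the
;; separator, then "096".
(with-syntax-table (my/zs-whitespace-table)
  (and (string-match "\\<\\w+\\>" "4\u202F096") (match-end 0)))
```

Under such a table the separator is no longer word-constituent, so the match should end after "4", as it did in Emacs 27.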