From: Eli Zaretskii <eliz@gnu.org>
To: Juri Linkov <juri@linkov.net>
Cc: 45660@debbugs.gnu.org
Subject: bug#45660: 28.0.50; Changed word/whitespace syntax
Date: Mon, 04 Jan 2021 20:44:37 +0200
Message-ID: <838s98bkyy.fsf@gnu.org>
In-Reply-To: <87im8cr5t8.fsf@mail.linkov.net> (message from Juri Linkov on Mon, 04 Jan 2021 19:25:23 +0200)
> From: Juri Linkov <juri@linkov.net>
> Date: Mon, 04 Jan 2021 19:25:23 +0200
>
> Some unidentified recent change during the last week broke the
> definition of word syntax and whitespace syntax.
It's this commit:
commit 70484f92a1807897dcd16189442a45385c6e7bbb
Author: Eli Zaretskii <eliz@gnu.org>
AuthorDate: Sat Jan 2 12:42:16 2021 +0200
Commit: Eli Zaretskii <eliz@gnu.org>
CommitDate: Sat Jan 2 12:42:16 2021 +0200
Fix syntax of symbol and punctuation characters
* lisp/international/characters.el: Adjust syntax of punctuation
and symbol characters to follow that of Unicode properties.
(Bug#44974)
diff --git a/lisp/international/characters.el b/lisp/international/characters.el
index 64460b4..88f2e20 100644
--- a/lisp/international/characters.el
+++ b/lisp/international/characters.el
@@ -317,6 +317,7 @@ ?L
(modify-syntax-entry #x5be ".") ; MAQAF
(modify-syntax-entry #x5c0 ".") ; PASEQ
(modify-syntax-entry #x5c3 ".") ; SOF PASUQ
+(modify-syntax-entry #x5c6 ".") ; NUN HAFUKHA
(modify-syntax-entry #x5f3 ".") ; GERESH
(modify-syntax-entry #x5f4 ".") ; GERSHAYIM
@@ -521,6 +522,9 @@ ?L
;; syntax: ¢£¤¥¨ª¯²³´¶¸¹º.) There should be a well-defined way of
;; relating Unicode categories to Emacs syntax codes.
+ ;; FIXME: We should probably just use the Unicode properties to set
+ ;; up the syntax table.
+
;; NBSP isn't semantically interchangeable with other whitespace chars,
;; so it's more like punctuation.
(set-case-syntax ? "." tbl)
@@ -558,7 +562,7 @@ ?L
(setq c (1+ c)))
;; Latin Extended Additional
- (modify-category-entry '(#x1e00 . #x1ef9) ?l)
+ (modify-category-entry '(#x1E00 . #x1EF9) ?l)
;; Latin Extended-C
(setq c #x2C60)
@@ -579,13 +583,13 @@ ?L
(setq c (1+ c)))
;; Greek
- (modify-category-entry '(#x0370 . #x03ff) ?g)
+ (modify-category-entry '(#x0370 . #x03FF) ?g)
;; Armenian
(setq c #x531)
;; Greek Extended
- (modify-category-entry '(#x1f00 . #x1fff) ?g)
+ (modify-category-entry '(#x1F00 . #x1FFF) ?g)
;; cyrillic
(modify-category-entry '(#x0400 . #x04FF) ?y)
@@ -605,40 +609,43 @@ ?L
(while (<= c #x200F)
(set-case-syntax c "." tbl)
(setq c (1+ c)))
- ;; Fixme: These aren't all right:
(setq c #x2010)
- (while (<= c #x2016)
- (set-case-syntax c "_" tbl)
+ ;; Fixme: What to do with characters that have Pi and Pf
+ ;; Unicode properties?
+ (while (<= c #x2017)
+ (set-case-syntax c "." tbl)
(setq c (1+ c)))
;; Punctuation syntax for quotation marks (like `)
- (while (<= c #x201f)
+ (while (<= c #x201F)
(set-case-syntax c "." tbl)
(setq c (1+ c)))
- ;; Fixme: These aren't all right:
(while (<= c #x2027)
- (set-case-syntax c "_" tbl)
+ (set-case-syntax c "." tbl)
(setq c (1+ c)))
- (while (<= c #x206F)
+ (setq c #x2030)
+ (while (<= c #x205E)
(set-case-syntax c "." tbl)
(setq c (1+ c)))
+ (let ((chars '(?‹ ?› ?⁄ ?⁒)))
+ (while chars
+ (modify-syntax-entry (car chars) "_")
+ (setq chars (cdr chars))))
- ;; Fixme: The following blocks might be better as symbol rather than
- ;; punctuation.
;; Arrows
(setq c #x2190)
(while (<= c #x21FF)
- (set-case-syntax c "." tbl)
+ (set-case-syntax c "_" tbl)
(setq c (1+ c)))
;; Mathematical Operators
(while (<= c #x22FF)
- (set-case-syntax c "." tbl)
+ (set-case-syntax c "_" tbl)
(setq c (1+ c)))
;; Miscellaneous Technical
(while (<= c #x23FF)
- (set-case-syntax c "." tbl)
+ (set-case-syntax c "_" tbl)
(setq c (1+ c)))
;; Control Pictures
- (while (<= c #x243F)
+ (while (<= c #x244F)
(set-case-syntax c "_" tbl)
(setq c (1+ c)))
@@ -652,13 +659,13 @@ ?L
;; Supplemental Mathematical Operators
(setq c #x2A00)
(while (<= c #x2AFF)
- (set-case-syntax c "." tbl)
+ (set-case-syntax c "_" tbl)
(setq c (1+ c)))
;; Miscellaneous Symbols and Arrows
(setq c #x2B00)
(while (<= c #x2BFF)
- (set-case-syntax c "." tbl)
+ (set-case-syntax c "_" tbl)
(setq c (1+ c)))
;; Coptic
@@ -676,17 +683,34 @@ ?L
;; Symbols for Legacy Computing
(setq c #x1FB00)
+ (while (<= c #x1FBCA)
+ (set-case-syntax c "_" tbl)
+ (setq c (1+ c)))
+ ;; FIXME: Should these be digits?
(while (<= c #x1FBFF)
(set-case-syntax c "." tbl)
(setq c (1+ c)))
;; Fullwidth Latin
- (setq c #xff21)
- (while (<= c #xff3a)
+ (setq c #xFF01)
+ (while (<= c #xFF0F)
+ (set-case-syntax c "." tbl)
+ (setq c (1+ c)))
+ (set-case-syntax #xFF04 "_" tbl)
+ (set-case-syntax #xFF0B "_" tbl)
+ (setq c #xFF21)
+ (while (<= c #xFF3A)
(modify-category-entry c ?l)
(modify-category-entry (+ c #x20) ?l)
(setq c (1+ c)))
+ ;; Halfwidth Latin
+ (setq c #xFF64)
+ (while (<= c #xFF65)
+ (set-case-syntax c "." tbl)
+ (setq c (1+ c)))
+ (set-case-syntax #xFF61 "." tbl)
+
;; Combining diacritics
(modify-category-entry '(#x300 . #x362) ?^)
;; Combining marks
> I noticed the change of behavior in markchars-mode that now
> disregards the character "NARROW NO-BREAK SPACE" as the word
> separator between thousands, i.e.:
>
> In Emacs 27:
> (and (string-match "\\<\\w+\\>" "4 096") (match-end 0))
> 1
>
> In Emacs 28:
> (and (string-match "\\<\\w+\\>" "4 096") (match-end 0))
> 5
>
> Note there is the character "NARROW NO-BREAK SPACE" between "4" and "096".
Previously, many characters, including U+202F, had the punctuation
('.') syntax. I changed that to follow the Unicode Character Database
(UCD) more closely, and U+202F is not a punctuation character
according to the UCD. It has the Zs general category, which means
"space separator", the same as SPC, NBSP, EN SPACE, and others.
Removing U+202F and other similar characters from the "punctuation"
group had the side effect of leaving them at the default 'w' syntax.
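To check this in a running Emacs, something like the following might
help. It is only a minimal sketch: the function name
my/char-syntax-info is hypothetical and not part of Emacs, and the
syntax class it reports depends on the Emacs version in use.

```elisp
;; Hypothetical helper (not part of Emacs): report the Unicode general
;; category and the standard-syntax-table class of a character.
(defun my/char-syntax-info (ch)
  "Return a list (GENERAL-CATEGORY SYNTAX-CLASS) for character CH."
  (list
   ;; General category as recorded in the UCD tables shipped with
   ;; Emacs, returned as a symbol such as Zs, Po or Sm.
   (get-char-code-property ch 'general-category)
   ;; Syntax class designator in the standard syntax table, as a
   ;; one-character string: "w" word, "." punctuation, " " whitespace,
   ;; "_" symbol.
   (with-syntax-table (standard-syntax-table)
     (string (char-syntax ch)))))

;; U+202F NARROW NO-BREAK SPACE, the character from the report.
(my/char-syntax-info #x202F)
```

The first element should be Zs for U+202F; the second shows which
syntax class the running Emacs currently assigns to it.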
Should we make all Zs characters have the ' ' (whitespace) syntax?
That should be easy, but we should try to be consistent in this
regard.
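As a rough illustration of that option, here is a sketch that works
on a copy of the standard syntax table and leaves characters.el
untouched. The function name my/zs-whitespace-table is hypothetical,
and scanning only up to U+3000 is an assumption based on the Zs
characters currently defined by Unicode; it is not how the actual fix
would necessarily be written.

```elisp
;; Hypothetical helper (not part of Emacs): build a syntax table in
;; which every character with general category Zs is whitespace.
(defun my/zs-whitespace-table ()
  "Return a copy of the standard syntax table with Zs chars as whitespace."
  (let ((table (copy-syntax-table (standard-syntax-table)))
        (ch 0))
    ;; Assumption: all Zs characters lie at or below U+3000
    ;; (IDEOGRAPHIC SPACE), so a scan of this range is sufficient.
    (while (<= ch #x3000)
      (when (eq (get-char-code-property ch 'general-category) 'Zs)
        (modify-syntax-entry ch " " table))
      (setq ch (1+ ch)))
    table))

;; Re-run the test from the report with the modified table in effect.
;; The string contains U+202F between "4" and "096".
(with-syntax-table (my/zs-whitespace-table)
  (and (string-match "\\<\\w+\\>" "4\u202F096") (match-end 0)))
```

With U+202F treated as whitespace, the word match should stop after
the "4", as in the Emacs 27 result quoted above.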