unofficial mirror of bug-gnu-emacs@gnu.org 
* bug#45660: 28.0.50; Changed word/whitespace syntax
@ 2021-01-04 17:25 Juri Linkov
  2021-01-04 18:44 ` Eli Zaretskii
  0 siblings, 1 reply; 9+ messages in thread
From: Juri Linkov @ 2021-01-04 17:25 UTC
  To: 45660

Some unidentified recent change during the last week broke the
definition of word syntax and whitespace syntax.  I noticed the
change of behavior in markchars-mode that now disregards the character
"NARROW NO-BREAK SPACE" as the word separator between thousands, i.e.:

In Emacs 27:
(and (string-match "\\<\\w+\\>" "4 096") (match-end 0))
1

In Emacs 28:
(and (string-match "\\<\\w+\\>" "4 096") (match-end 0))
5

Note there is the character "NARROW NO-BREAK SPACE" between "4" and "096".
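
Because the NARROW NO-BREAK SPACE is invisible and easily lost when the
test is copied, the same check can be sketched with an explicit escape.
This is an illustrative variant, not the form used in the original
report; the extra list elements are only there to confirm that the
string really contains U+202F.

```elisp
;; U+202F NARROW NO-BREAK SPACE written as an escape so it survives copying.
(let ((s "4\u202f096"))
  (list (length s)                       ; the string has 5 characters
        ;; End of the first "word" according to the current syntax table.
        (and (string-match "\\<\\w+\\>" s) (match-end 0))
        ;; Syntax class of U+202F in the current syntax table, as a string.
        (string (char-syntax #x202F))))
```

The second element is the value that differs between the two Emacs
versions shown above.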

Please close this bug report if this change was intentional: if it
provides more correct definitions, then other code could be adapted
to the change.






* bug#45660: 28.0.50; Changed word/whitespace syntax
  2021-01-04 17:25 bug#45660: 28.0.50; Changed word/whitespace syntax Juri Linkov
@ 2021-01-04 18:44 ` Eli Zaretskii
  2021-01-04 18:54   ` martin rudalics
  2021-01-05 18:20   ` Juri Linkov
  0 siblings, 2 replies; 9+ messages in thread
From: Eli Zaretskii @ 2021-01-04 18:44 UTC
  To: Juri Linkov; +Cc: 45660

> From: Juri Linkov <juri@linkov.net>
> Date: Mon, 04 Jan 2021 19:25:23 +0200
> 
> Some unidentified recent change during the last week broke the
> definition of word syntax and whitespace syntax.

It's this commit:

  commit 70484f92a1807897dcd16189442a45385c6e7bbb
  Author:     Eli Zaretskii <eliz@gnu.org>
  AuthorDate: Sat Jan 2 12:42:16 2021 +0200
  Commit:     Eli Zaretskii <eliz@gnu.org>
  CommitDate: Sat Jan 2 12:42:16 2021 +0200

      Fix syntax of symbol and punctuation characters

      * lisp/international/characters.el: Adjust syntax of punctuation
      and symbol characters to follow that of Unicode properties.
      (Bug#44974)


diff --git a/lisp/international/characters.el b/lisp/international/characters.el
index 64460b4..88f2e20 100644
--- a/lisp/international/characters.el
+++ b/lisp/international/characters.el
@@ -317,6 +317,7 @@ ?L
 (modify-syntax-entry #x5be ".") ; MAQAF
 (modify-syntax-entry #x5c0 ".") ; PASEQ
 (modify-syntax-entry #x5c3 ".") ; SOF PASUQ
+(modify-syntax-entry #x5c6 ".") ; NUN HAFUKHA
 (modify-syntax-entry #x5f3 ".") ; GERESH
 (modify-syntax-entry #x5f4 ".") ; GERSHAYIM
 
@@ -521,6 +522,9 @@ ?L
   ;; syntax: ¢£¤¥¨ª¯²³´¶¸¹º.)  There should be a well-defined way of
   ;; relating Unicode categories to Emacs syntax codes.
 
+  ;; FIXME: We should probably just use the Unicode properties to set
+  ;; up the syntax table.
+
   ;; NBSP isn't semantically interchangeable with other whitespace chars,
   ;; so it's more like punctuation.
   (set-case-syntax ?  "." tbl)
@@ -558,7 +562,7 @@ ?L
     (setq c (1+ c)))
 
   ;; Latin Extended Additional
-  (modify-category-entry '(#x1e00 . #x1ef9) ?l)
+  (modify-category-entry '(#x1E00 . #x1EF9) ?l)
 
   ;; Latin Extended-C
   (setq c #x2C60)
@@ -579,13 +583,13 @@ ?L
     (setq c (1+ c)))
 
   ;; Greek
-  (modify-category-entry '(#x0370 . #x03ff) ?g)
+  (modify-category-entry '(#x0370 . #x03FF) ?g)
 
   ;; Armenian
   (setq c #x531)
 
   ;; Greek Extended
-  (modify-category-entry '(#x1f00 . #x1fff) ?g)
+  (modify-category-entry '(#x1F00 . #x1FFF) ?g)
 
   ;; cyrillic
   (modify-category-entry '(#x0400 . #x04FF) ?y)
@@ -605,40 +609,43 @@ ?L
   (while (<= c #x200F)
     (set-case-syntax c "." tbl)
     (setq c (1+ c)))
-  ;; Fixme: These aren't all right:
   (setq c #x2010)
-  (while (<= c #x2016)
-    (set-case-syntax c "_" tbl)
+  ;; Fixme: What to do with characters that have Pi and Pf
+  ;; Unicode properties?
+  (while (<= c #x2017)
+    (set-case-syntax c "." tbl)
     (setq c (1+ c)))
   ;; Punctuation syntax for quotation marks (like `)
-  (while (<= c #x201f)
+  (while (<= c #x201F)
     (set-case-syntax  c "." tbl)
     (setq c (1+ c)))
-  ;; Fixme: These aren't all right:
   (while (<= c #x2027)
-    (set-case-syntax c "_" tbl)
+    (set-case-syntax c "." tbl)
     (setq c (1+ c)))
-  (while (<= c #x206F)
+  (setq c #x2030)
+  (while (<= c #x205E)
     (set-case-syntax c "." tbl)
     (setq c (1+ c)))
+  (let ((chars '(?‹ ?› ?⁄ ?⁒)))
+    (while chars
+      (modify-syntax-entry (car chars) "_")
+      (setq chars (cdr chars))))
 
-  ;; Fixme: The following blocks might be better as symbol rather than
-  ;; punctuation.
   ;; Arrows
   (setq c #x2190)
   (while (<= c #x21FF)
-    (set-case-syntax c "." tbl)
+    (set-case-syntax c "_" tbl)
     (setq c (1+ c)))
   ;; Mathematical Operators
   (while (<= c #x22FF)
-    (set-case-syntax c "." tbl)
+    (set-case-syntax c "_" tbl)
     (setq c (1+ c)))
   ;; Miscellaneous Technical
   (while (<= c #x23FF)
-    (set-case-syntax c "." tbl)
+    (set-case-syntax c "_" tbl)
     (setq c (1+ c)))
   ;; Control Pictures
-  (while (<= c #x243F)
+  (while (<= c #x244F)
     (set-case-syntax c "_" tbl)
     (setq c (1+ c)))
 
@@ -652,13 +659,13 @@ ?L
   ;; Supplemental Mathematical Operators
   (setq c #x2A00)
   (while (<= c #x2AFF)
-    (set-case-syntax c "." tbl)
+    (set-case-syntax c "_" tbl)
     (setq c (1+ c)))
 
   ;; Miscellaneous Symbols and Arrows
   (setq c #x2B00)
   (while (<= c #x2BFF)
-    (set-case-syntax c "." tbl)
+    (set-case-syntax c "_" tbl)
     (setq c (1+ c)))
 
   ;; Coptic
@@ -676,17 +683,34 @@ ?L
 
   ;; Symbols for Legacy Computing
   (setq c #x1FB00)
+  (while (<= c #x1FBCA)
+    (set-case-syntax c "_" tbl)
+    (setq c (1+ c)))
+  ;; FIXME: Should these be digits?
   (while (<= c #x1FBFF)
     (set-case-syntax c "." tbl)
     (setq c (1+ c)))
 
   ;; Fullwidth Latin
-  (setq c #xff21)
-  (while (<= c #xff3a)
+  (setq c #xFF01)
+  (while (<= c #xFF0F)
+    (set-case-syntax c "." tbl)
+    (setq c (1+ c)))
+  (set-case-syntax #xFF04 "_" tbl)
+  (set-case-syntax #xFF0B "_" tbl)
+  (setq c #xFF21)
+  (while (<= c #xFF3A)
     (modify-category-entry c ?l)
     (modify-category-entry (+ c #x20) ?l)
     (setq c (1+ c)))
 
+  ;; Halfwidth Latin
+  (setq c #xFF64)
+  (while (<= c #xFF65)
+    (set-case-syntax c "." tbl)
+    (setq c (1+ c)))
+  (set-case-syntax #xFF61 "." tbl)
+
   ;; Combining diacritics
   (modify-category-entry '(#x300 . #x362) ?^)
   ;; Combining marks


> I noticed the change of behavior in markchars-mode that now
> disregards the character "NARROW NO-BREAK SPACE" as the word
> separator between thousands, i.e.:
> 
> In Emacs 27:
> (and (string-match "\\<\\w+\\>" "4 096") (match-end 0))
> 1
> 
> In Emacs 28:
> (and (string-match "\\<\\w+\\>" "4 096") (match-end 0))
> 5
> 
> Note there is the character "NARROW NO-BREAK SPACE" between "4" and "096".

Previously, many characters, including u+202F, had the punctuation
('.') syntax.  I modified that to be more close to the Unicode
Character Database (UCD), and u+202F is not a punctuation character
according to the UCD.  It has the Zs general category, which means
"space separator", the same as SPC, NBSP, EN SPACE, and others.

Removing u+202F and other similar characters from the "punctuation"
group had the side effect of leaving it at the default 'w' syntax.

Should we make all Zs characters have the ' ' (whitespace) syntax?
That should be easy, but we should try being consistent in this
regard.
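
The mismatch described above can be inspected directly. The following is
a minimal sketch (an illustration, not part of the commit) that prints
the Unicode general category next to the Emacs syntax class for a few
space characters, using the standard syntax table. The particular list
of code points is just a sample chosen for this example.

```elisp
;; Compare Unicode general category with Emacs syntax class.
;; Sample code points: SPC, NBSP, EN SPACE, NARROW NO-BREAK SPACE.
(with-syntax-table (standard-syntax-table)
  (dolist (ch '(#x20 #xA0 #x2002 #x202F))
    (message "U+%04X %s: general-category=%s syntax=%c"
             ch
             (get-char-code-property ch 'name)
             (get-char-code-property ch 'general-category)
             (char-syntax ch))))
```

All four characters report the general category Zs; the syntax class
printed for each depends on the Emacs version being run.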






* bug#45660: 28.0.50; Changed word/whitespace syntax
  2021-01-04 18:44 ` Eli Zaretskii
@ 2021-01-04 18:54   ` martin rudalics
  2021-01-04 19:19     ` Eli Zaretskii
  2021-01-05 18:20   ` Juri Linkov
  1 sibling, 1 reply; 9+ messages in thread
From: martin rudalics @ 2021-01-04 18:54 UTC
  To: Eli Zaretskii, Juri Linkov; +Cc: 45660

 > Should we make all Zs characters have the ' ' (whitespace) syntax?
 > That should be easy, but we should try being consistent in this
 > regard.

What would be the downside of doing that?

martin






* bug#45660: 28.0.50; Changed word/whitespace syntax
  2021-01-04 18:54   ` martin rudalics
@ 2021-01-04 19:19     ` Eli Zaretskii
  0 siblings, 0 replies; 9+ messages in thread
From: Eli Zaretskii @ 2021-01-04 19:19 UTC
  To: martin rudalics; +Cc: 45660, juri

> Cc: 45660@debbugs.gnu.org
> From: martin rudalics <rudalics@gmx.at>
> Date: Mon, 4 Jan 2021 19:54:33 +0100
> 
>  > Should we make all Zs characters have the ' ' (whitespace) syntax?
>  > That should be easy, but we should try being consistent in this
>  > regard.
> 
> What would be the downside of doing that?

As always: it changes the syntax of at least some of those characters.
What that would cause is anyone's guess.






* bug#45660: 28.0.50; Changed word/whitespace syntax
  2021-01-04 18:44 ` Eli Zaretskii
  2021-01-04 18:54   ` martin rudalics
@ 2021-01-05 18:20   ` Juri Linkov
  2021-01-05 18:45     ` Eli Zaretskii
                       ` (2 more replies)
  1 sibling, 3 replies; 9+ messages in thread
From: Juri Linkov @ 2021-01-05 18:20 UTC
  To: Eli Zaretskii; +Cc: 45660

> Previously, many characters, including u+202F, had the punctuation
> ('.') syntax.  I modified that to be more close to the Unicode
> Character Database (UCD), and u+202F is not a punctuation character
> according to the UCD.  It has the Zs general category, which means
> "space separator", the same as SPC, NBSP, EN SPACE, and others.

So according to the Unicode standard it should have whitespace syntax?

And indeed, I see no reason for similar characters to have different syntax:

  name: NO-BREAK SPACE
  general-category: Zs (Separator, Space)
  syntax:   	which means: whitespace

  name: NARROW NO-BREAK SPACE
  general-category: Zs (Separator, Space)
  syntax: w 	which means: word

> Removing u+202F and other similar characters from the "punctuation"
> group had the side effect of leaving it at the default 'w' syntax.
>
> Should we make all Zs characters have the ' ' (whitespace) syntax?
> That should be easy, but we should try being consistent in this
> regard.

Should the word characters separated by NO-BREAK SPACE be treated as one word?
If there is no reason to treat space characters as part of words, then all
characters with the Zs general category could have the same whitespace syntax.






* bug#45660: 28.0.50; Changed word/whitespace syntax
  2021-01-05 18:20   ` Juri Linkov
@ 2021-01-05 18:45     ` Eli Zaretskii
  2021-01-05 18:53     ` martin rudalics
  2021-01-08 12:06     ` Eli Zaretskii
  2 siblings, 0 replies; 9+ messages in thread
From: Eli Zaretskii @ 2021-01-05 18:45 UTC
  To: Juri Linkov; +Cc: 45660

> From: Juri Linkov <juri@linkov.net>
> Cc: 45660@debbugs.gnu.org
> Date: Tue, 05 Jan 2021 20:20:44 +0200
> 
> > Previously, many characters, including u+202F, had the punctuation
> > ('.') syntax.  I modified that to be more close to the Unicode
> > Character Database (UCD), and u+202F is not a punctuation character
> > according to the UCD.  It has the Zs general category, which means
> > "space separator", the same as SPC, NBSP, EN SPACE, and others.
> 
> So according to the Unicode standard it should have whitespace syntax?

Unicode doesn't have the concept of "syntax", it's our invention.  For
some syntactic categories, it makes sense to follow the corresponding
Unicode general category.  Two examples are "punctuation" and
"symbols".

The question whether to treat Zs as whitespace syntax is on the
table.  We previously treated many of such characters as
"punctuation", which doesn't seem right to me.  Which is why I removed
them from the "punctuation" syntax, and you got bitten by the result
(because the default syntax is "word-constituent").

> Should the word characters separated by NO-BREAK SPACE be treated as one word?

That's a good question.  Do we currently treat them as such?  I don't
think so, because NBSP has the '.' syntax, i.e. "punctuation".

> If there is no reason to treat space characters as part of words, then all
> characters with the Zs general category could have the same whitespace syntax.

I tend to agree.  If no objections or new issues arise, I will do that
in a couple of days.

Thanks.






* bug#45660: 28.0.50; Changed word/whitespace syntax
  2021-01-05 18:20   ` Juri Linkov
  2021-01-05 18:45     ` Eli Zaretskii
@ 2021-01-05 18:53     ` martin rudalics
  2021-01-05 19:26       ` Eli Zaretskii
  2021-01-08 12:06     ` Eli Zaretskii
  2 siblings, 1 reply; 9+ messages in thread
From: martin rudalics @ 2021-01-05 18:53 UTC
  To: Juri Linkov, Eli Zaretskii; +Cc: 45660

 > Should the word characters separated by NO-BREAK SPACE be treated as one word?

'forward-word' should stop but a line should not be broken there.  So
IIUC this is a question of what's cheaper in terms of implementation.

martin






* bug#45660: 28.0.50; Changed word/whitespace syntax
  2021-01-05 18:53     ` martin rudalics
@ 2021-01-05 19:26       ` Eli Zaretskii
  0 siblings, 0 replies; 9+ messages in thread
From: Eli Zaretskii @ 2021-01-05 19:26 UTC
  To: martin rudalics; +Cc: 45660, juri

> Cc: 45660@debbugs.gnu.org
> From: martin rudalics <rudalics@gmx.at>
> Date: Tue, 5 Jan 2021 19:53:13 +0100
> 
>  > Should the word characters separated by NO-BREAK SPACE be treated as one word?
> 
> 'forward-word' should stop but a line should not be broken there.  So
> IIUC this is a question of what's cheaper in terms of implementation.

We don't break lines according to syntax, we break them according to
"line breakable" category and other rules.






* bug#45660: 28.0.50; Changed word/whitespace syntax
  2021-01-05 18:20   ` Juri Linkov
  2021-01-05 18:45     ` Eli Zaretskii
  2021-01-05 18:53     ` martin rudalics
@ 2021-01-08 12:06     ` Eli Zaretskii
  2 siblings, 0 replies; 9+ messages in thread
From: Eli Zaretskii @ 2021-01-08 12:06 UTC
  To: Juri Linkov; +Cc: 45660-done

> From: Juri Linkov <juri@linkov.net>
> Cc: 45660@debbugs.gnu.org
> Date: Tue, 05 Jan 2021 20:20:44 +0200
> 
> > Previously, many characters, including u+202F, had the punctuation
> > ('.') syntax.  I modified that to be more close to the Unicode
> > Character Database (UCD), and u+202F is not a punctuation character
> > according to the UCD.  It has the Zs general category, which means
> > "space separator", the same as SPC, NBSP, EN SPACE, and others.
> 
> So according to the Unicode standard it should have whitespace syntax?
> 
> And indeed, I see no reason for similar characters to have different syntax:
> 
>   name: NO-BREAK SPACE
>   general-category: Zs (Separator, Space)
>   syntax:   	which means: whitespace
> 
>   name: NARROW NO-BREAK SPACE
>   general-category: Zs (Separator, Space)
>   syntax: w 	which means: word
> 
> > Removing u+202F and other similar characters from the "punctuation"
> > group had the side effect of leaving it at the default 'w' syntax.
> >
> > Should we make all Zs characters have the ' ' (whitespace) syntax?
> > That should be easy, but we should try being consistent in this
> > regard.
> 
> Should the word characters separated by NO-BREAK SPACE be treated as one word?
> If there is no reason to treat space characters as part of words, then all
> characters with the Zs general category could have the same whitespace syntax.

No further comments, so I've now made the change on master whereby all
characters with Zs general category are given the whitespace syntax.

I'm therefore closing this bug; please reopen if there are any
leftovers or undesired effects.
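
For readers on an Emacs build without this change, the effect can be
approximated locally. The sketch below is an assumption about how one
might do it, not the code that was committed to master:
`my-zs-whitespace-table' is a hypothetical helper that copies the
standard syntax table and gives whitespace syntax to every character
whose general category is Zs.

```elisp
(defun my-zs-whitespace-table ()
  "Return a copy of the standard syntax table with Zs chars as whitespace.
Hypothetical helper for illustration; it is not part of Emacs."
  (let ((table (copy-syntax-table (standard-syntax-table)))
        (c 0))
    ;; In current Unicode data every Zs character lies at or below
    ;; U+3000 (IDEOGRAPHIC SPACE), so scanning up to there is enough.
    (while (<= c #x3000)
      (when (eq (get-char-code-property c 'general-category) 'Zs)
        (modify-syntax-entry c " " table))
      (setq c (1+ c)))
    table))

;; Re-run the test from the original report with that table in effect.
;; "\u202f" is NARROW NO-BREAK SPACE.
(with-syntax-table (my-zs-whitespace-table)
  (and (string-match "\\<\\w+\\>" "4\u202f096") (match-end 0)))
```

With U+202F no longer a word constituent, the match ends after the
leading "4", as in the Emacs 27 result quoted in the report.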





