bug#37036: [PATCH] Inconsistent ASCII and Latin char categories

unofficial mirror of bug-gnu-emacs@gnu.org 

* bug#37036: [PATCH] Inconsistent ASCII and Latin char categories
@ 2019-08-15 12:17 Mattias Engdegård
  2019-08-15 15:27 ` Eli Zaretskii
  0 siblings, 1 reply; 12+ messages in thread
From: Mattias Engdegård @ 2019-08-15 12:17 UTC
  To: 37036


The ASCII (a) and Latin (l) character categories are inconsistent in what characters they contain.

It should be clear what the ASCII category means, but it omits 00-1f (contrary to a comment in the code).

The Latin category isn't exactly defined anywhere but should reasonably comprise letters from Latin-based scripts. Currently, it also includes many control characters and symbols from the ASCII and Latin-1 Supplement blocks, which seems hard to justify.

Other changes to Latin could be argued: what modifiers/combining chars should be included? What about IPA and non-IPA phonetics? Ligatures? What about Latin-derived forms such as circled letters? &c. The attached patch does not go there but only fixes the glaring errors in the 00-ff range.
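
One way to inspect the membership described above is to query a character's category set from Lisp. This is only a minimal sketch: `char-category-set` and `category-set-mnemonics` are existing Emacs functions, but the helper name `my-char-in-category-p` is invented for illustration, and the answers depend on the category table of the Emacs session in which the forms are evaluated.

```elisp
;; Hypothetical helper for illustration; not part of Emacs.
(defun my-char-in-category-p (char category)
  "Return non-nil if CHAR belongs to CATEGORY in the current category table.
CATEGORY is a category mnemonic character such as ?a or ?l."
  ;; `char-category-set' returns a bool-vector indexed by mnemonic.
  (aref (char-category-set char) category))

;; Forms to evaluate interactively (results depend on the Emacs version):
;; (my-char-in-category-p ?A ?a)     ; is `A' in the ASCII category?
;; (my-char-in-category-p ?\C-a ?a)  ; is the control character ^A?
;; (my-char-in-category-p ?1 ?l)     ; is the digit `1' in the Latin category?
;; (category-set-mnemonics (char-category-set ?A)) ; all categories of `A'
```

Comparing the answers for letters, digits and control characters shows which ranges a given category actually covers.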


[-- Attachment #2: 0001-Fix-ASCII-and-Latin-character-categories.patch --]
[-- Type: application/octet-stream, Size: 1598 bytes --]

From 9dbb98c7d2f7856a16efcfacdfae7890db3c45fe Mon Sep 17 00:00:00 2001
From: Mattias Engdegård <mattiase@acm.org>
Date: Thu, 15 Aug 2019 14:04:03 +0200
Subject: [PATCH] Fix ASCII and Latin character categories

* lisp/international/characters.el:
Make the ASCII (a) category include all ASCII characters.
Make the Latin (l) category include only letters from the range 00-ff.
---
 lisp/international/characters.el | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/lisp/international/characters.el b/lisp/international/characters.el
index 012827ba1c..379a6a170b 100644
--- a/lisp/international/characters.el
+++ b/lisp/international/characters.el
@@ -127,11 +127,8 @@ ?L
 \f
 ;;; Setting syntax and category.
 
-;; ASCII
-
-;; All ASCII characters have the category `a' (ASCII) and `l' (Latin).
-(modify-category-entry '(32 . 127) ?a)
-(modify-category-entry '(32 . 127) ?l)
+;; All ASCII characters have the category `a' (ASCII).
+(modify-category-entry '(0 . 127) ?a)
 
 ;; Deal with the CJK charsets first.  Since the syntax of blocks is
 ;; defined per charset, and the charsets may contain e.g. Latin
@@ -510,7 +507,13 @@ ?L
 
 ;; Latin
 
-(modify-category-entry '(#x80 . #x024F) ?l)
+;; ASCII
+(modify-category-entry '(?A . ?Z) ?l)
+(modify-category-entry '(?a . ?z) ?l)
+;; Latin-1 Supplement
+(modify-category-entry '(#xc0 . #xd6) ?l)
+(modify-category-entry '(#xd8 . #xf6) ?l)
+(modify-category-entry '(#xf8 . #xff) ?l)
 
 (let ((tbl (standard-case-table)) c)
 
-- 
2.20.1 (Apple Git-117)



* bug#37036: [PATCH] Inconsistent ASCII and Latin char categories
  2019-08-15 12:17 bug#37036: [PATCH] Inconsistent ASCII and Latin char categories Mattias Engdegård
@ 2019-08-15 15:27 ` Eli Zaretskii
  2019-08-15 15:46   ` Mattias Engdegård
  0 siblings, 1 reply; 12+ messages in thread
From: Eli Zaretskii @ 2019-08-15 15:27 UTC
  To: Mattias Engdegård; +Cc: 37036

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Thu, 15 Aug 2019 14:17:15 +0200
> 
> The ASCII (a) and Latin (l) character categories are inconsistent in what characters they contain.
> 
> It should be clear what the ASCII category means, but it omits 00-1f (contrary to a comment in the code).
> 
> The Latin category isn't exactly defined anywhere but should reasonably comprise letters from Latin-based scripts. Currently, it also includes many control characters and symbols from the ASCII and Latin-1 Supplement blocks, which seems hard to justify.
> 
> Other changes to Latin could be argued: what modifiers/combining chars should be included? What about IPA and non-IPA phonetics? Ligatures? What about Latin-derived forms such as circled letters? &c. The attached patch does not go there but only fixes the glaring errors in the 00-ff range.

Did you try moving by words after these changes?  What happens in
words that consist of ASCII and non-ASCII Latin characters, for
example?






* bug#37036: [PATCH] Inconsistent ASCII and Latin char categories
  2019-08-15 15:27 ` Eli Zaretskii
@ 2019-08-15 15:46   ` Mattias Engdegård
  2019-08-15 16:23     ` Eli Zaretskii
  0 siblings, 1 reply; 12+ messages in thread
From: Mattias Engdegård @ 2019-08-15 15:46 UTC
  To: Eli Zaretskii; +Cc: 37036

On 15 Aug 2019, at 17:27, Eli Zaretskii <eliz@gnu.org> wrote:
> 
> Did you try moving by words after these changes?  What happens in
> words that consist of ASCII and non-ASCII Latin characters, for
> example?

No change in behaviour observed in any such case.







* bug#37036: [PATCH] Inconsistent ASCII and Latin char categories
  2019-08-15 15:46   ` Mattias Engdegård
@ 2019-08-15 16:23     ` Eli Zaretskii
  2019-08-15 16:30       ` Mattias Engdegård
  0 siblings, 1 reply; 12+ messages in thread
From: Eli Zaretskii @ 2019-08-15 16:23 UTC
  To: Mattias Engdegård; +Cc: 37036

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Thu, 15 Aug 2019 17:46:35 +0200
> Cc: 37036@debbugs.gnu.org
> 
> On 15 Aug 2019, at 17:27, Eli Zaretskii <eliz@gnu.org> wrote:
> > 
> > Did you try moving by words after these changes?  What happens in
> > words that consist of ASCII and non-ASCII Latin characters, for
> > example?
> 
> No change in behaviour observed in any such case.

In any case, how to justify the fact that, say, "naïve", has
characters from different scripts?






* bug#37036: [PATCH] Inconsistent ASCII and Latin char categories
  2019-08-15 16:23     ` Eli Zaretskii
@ 2019-08-15 16:30       ` Mattias Engdegård
  2019-08-15 16:59         ` Eli Zaretskii
  0 siblings, 1 reply; 12+ messages in thread
From: Mattias Engdegård @ 2019-08-15 16:30 UTC
  To: Eli Zaretskii; +Cc: 37036

On 15 Aug 2019, at 18:23, Eli Zaretskii <eliz@gnu.org> wrote:
> 
> In any case, how to justify the fact that, say, "naïve", has
> characters from different scripts?

The proposed change does not change the categories of any character in that string.
Or did you mean something else?







* bug#37036: [PATCH] Inconsistent ASCII and Latin char categories
  2019-08-15 16:30       ` Mattias Engdegård
@ 2019-08-15 16:59         ` Eli Zaretskii
  2019-08-15 17:37           ` Mattias Engdegård
  0 siblings, 1 reply; 12+ messages in thread
From: Eli Zaretskii @ 2019-08-15 16:59 UTC
  To: Mattias Engdegård; +Cc: 37036

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Thu, 15 Aug 2019 18:30:47 +0200
> Cc: 37036@debbugs.gnu.org
> 
> On 15 Aug 2019, at 18:23, Eli Zaretskii <eliz@gnu.org> wrote:
> > 
> > In any case, how to justify the fact that, say, "naïve", has
> > characters from different scripts?
> 
> The proposed change does not change the categories of any character in that string.

What about "abcdef^A^B"?  Does M-f stop before the control characters?

I guess I don't understand the rationale for the change.  Categories
are Emacs's invention, and their purpose is mostly to allow us to use
regexps for searching certain characters, and other similar
subtleties.  Your rationale seems to be some attempt to be formally
"consistent".  But this is not a formal attribute, it is entirely
ad-hoc, as can be easily seen by just looking at the list of the
categories.

So I wonder why would we want to rock that particular boat.


* bug#37036: [PATCH] Inconsistent ASCII and Latin char categories
  2019-08-15 16:59         ` Eli Zaretskii
@ 2019-08-15 17:37           ` Mattias Engdegård
  2019-08-15 19:23             ` Eli Zaretskii
  0 siblings, 1 reply; 12+ messages in thread
From: Mattias Engdegård @ 2019-08-15 17:37 UTC
  To: Eli Zaretskii; +Cc: 37036

On 15 Aug 2019, at 18:59, Eli Zaretskii <eliz@gnu.org> wrote:
> 
> What about "abcdef^A^B"?  Does M-f stop before the control characters?

Yes. Does forward-word use categories?

> I guess I don't understand the rationale for the change.  Categories
> are Emacs's invention, and their purpose is mostly to allow us to use
> regexps for searching certain characters, and other similar
> subtleties.  Your rationale seems to be some attempt to be formally
> "consistent".  But this is not a formal attribute, it is entirely
> ad-hoc, as can be easily seen by just looking at the list of the
> categories.

The more categories are arbitrary, the less useful they are. Why would anyone use categories to discriminate characters if they do not have a sensible, useful and predictable structure? If 'Latin' means 'Latin letters, some symbols, some whitespace, some control chars, Indo-Arabic digits and the occasional Greek letter', which it does today, then who can use it correctly?

Consider the function fill-polish-nobreak-p. It is clearly written with the assumption of a reasonable definition of the Latin category, and it doesn't work as expected because of that. Those who reviewed that function thought it looked reasonable, as did I when I read it.

It is perfectly clear that categories have been introduced in an ad-hoc way to solve problems as they arose, but that doesn't mean that no mistakes were made even for those narrow purposes.


* bug#37036: [PATCH] Inconsistent ASCII and Latin char categories
  2019-08-15 17:37           ` Mattias Engdegård
@ 2019-08-15 19:23             ` Eli Zaretskii
  2019-08-15 19:46               ` Eli Zaretskii
  2019-08-15 22:19               ` Mattias Engdegård
  0 siblings, 2 replies; 12+ messages in thread
From: Eli Zaretskii @ 2019-08-15 19:23 UTC
  To: Mattias Engdegård; +Cc: 37036

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Thu, 15 Aug 2019 19:37:49 +0200
> Cc: 37036@debbugs.gnu.org
> 
> On 15 Aug 2019, at 18:59, Eli Zaretskii <eliz@gnu.org> wrote:
> > 
> > What about "abcdef^A^B"?  Does M-f stop before the control characters?
> 
> Yes. Does forward-word use categories?

No.  Sorry, it was my faulty memory.  It uses char-script-table
instead.

> The more categories are arbitrary, the less useful they are.

I think they should become entirely useless, i.e. we should stop using
them.  We have the entire Unicode database with all the character
properties for quite some time now, and should favor using that
instead.  Categories are an old kludgey hack, which goes back to
pre-Unicode Emacs; it can never be anything but arbitrary, and we will
never be able to fix that anywhere near completely.

> Why would anyone use categories to discriminate characters if they do not have a sensible, useful and predictable structure?

I don't know why anyone should.  My recommendation is to just say no.

> Consider the function fill-polish-nobreak-p. It is clearly written with the assumption of a reasonable definition of the Latin category, and it doesn't work as expected because of that.

Can you tell the details of where this function doesn't work?  I'd
like to understand why fixing it needs to change the categories.

> It is perfectly clear that categories have been introduced in an ad-hoc way to solve problems as they arose, but that doesn't mean that no mistakes were made even for those narrow purposes.

I don't think we should fix those mistakes, because that's an
impossible goal.  We should instead gradually stop using categories
for anything serious, certainly for any new code.  We should use the
UCD properties and the various char-tables built upon that instead.


* bug#37036: [PATCH] Inconsistent ASCII and Latin char categories
  2019-08-15 19:23             ` Eli Zaretskii
@ 2019-08-15 19:46               ` Eli Zaretskii
  2019-08-15 22:19               ` Mattias Engdegård
  1 sibling, 0 replies; 12+ messages in thread
From: Eli Zaretskii @ 2019-08-15 19:46 UTC
  To: mattiase; +Cc: 37036

> Date: Thu, 15 Aug 2019 22:23:00 +0300
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: 37036@debbugs.gnu.org
> 
> > From: Mattias Engdegård <mattiase@acm.org>
> > Date: Thu, 15 Aug 2019 19:37:49 +0200
> > Cc: 37036@debbugs.gnu.org
> > 
> > On 15 Aug 2019, at 18:59, Eli Zaretskii <eliz@gnu.org> wrote:
> > > 
> > > What about "abcdef^A^B"?  Does M-f stop before the control characters?
> > 
> > Yes. Does forward-word use categories?
> 
> No.  Sorry, it was my faulty memory.  It uses char-script-table
> instead.

Actually, it uses categories indirectly, via word-combining-categories
and word-separating-categories.






* bug#37036: [PATCH] Inconsistent ASCII and Latin char categories
  2019-08-15 19:23             ` Eli Zaretskii
  2019-08-15 19:46               ` Eli Zaretskii
@ 2019-08-15 22:19               ` Mattias Engdegård
  2019-08-16  9:33                 ` Eli Zaretskii
  1 sibling, 1 reply; 12+ messages in thread
From: Mattias Engdegård @ 2019-08-15 22:19 UTC
  To: Eli Zaretskii; +Cc: 37036

On 15 Aug 2019, at 21:23, Eli Zaretskii <eliz@gnu.org> wrote:

> I think they should become entirely useless, i.e. we should stop using
> them.  We have the entire Unicode database with all the character
> properties for quite some time now, and should favor using that
> instead.  Categories are an old kludgey hack, which goes back to
> pre-Unicode Emacs; it can never be anything but arbitrary, and we will
> never be able to fix that anywhere near completely.

Thank you, I see what you mean, and I agree that Unicode properties probably are better for most purposes.
In any case, I wasn't aiming for perfection; that is indeed a fool's errand. It was just a discovery of a rather obvious mistake, and evidence of code that doesn't work properly because of it. I thought the patch would be rather uncontroversial.

>> Consider the function fill-polish-nobreak-p. It is clearly written with the assumption of a reasonable definition of the Latin category, and it doesn't work as expected because of that.
> 
> Can you tell the details of where this function doesn't work?  I'd
> like to understand why fixing it needs to change the categories.

Right: it attempts to match a single-character word before point, with the assumption that \cl would match any Latin(-script) letter. However, since that expression matches most of ASCII as well, the function incorrectly says that line-breaking would be disallowed after "In my dreams..." or "(She smiles!)" or "He died in 1951." (well, the equivalents in Polish).
Some details are in https://debbugs.gnu.org/cgi/bugreport.cgi?bug=20871 .
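
The over-matching described above can be checked directly by testing each printable ASCII character against a regexp. The following is a minimal sketch, not code from Emacs or from the bug report; the function name `my-ascii-chars-matching` is hypothetical.

```elisp
;; Hypothetical helper for illustration; not part of Emacs.
(defun my-ascii-chars-matching (regexp)
  "Return a string of the printable ASCII characters that match REGEXP.
Each character from 32 (space) to 126 (~) is tested on its own."
  (let ((case-fold-search nil)
        (matches nil))
    (dotimes (i 95)
      (let ((ch (+ 32 i)))
        ;; Test a one-character string so only CH itself can match.
        (when (string-match-p regexp (string ch))
          (push ch matches))))
    (concat (nreverse matches))))

;; Forms to evaluate interactively:
;; (my-ascii-chars-matching "\\cl")        ; what the Latin category covers
;; (my-ascii-chars-matching "[[:alpha:]]") ; compare with a character class
```

If the first form returns digits and punctuation as well as letters, then a pattern built on `\cl` will treat "." or "1" as if it were a one-letter word.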

Of course it doesn't require the categories to be fixed. The point is that if there is some code that doesn't work because of the broken categories, there may very well be more.

> I don't think we should fix those mistakes, because that's an
> impossible goal.  We should instead gradually stop using categories
> for anything serious, certainly for any new code.  We should use the
> UCD properties and the various char-tables built upon that instead.

Perhaps, but categories still have one thing going for them: they have fairly good regexp support.


* bug#37036: [PATCH] Inconsistent ASCII and Latin char categories
  2019-08-15 22:19               ` Mattias Engdegård
@ 2019-08-16  9:33                 ` Eli Zaretskii
  2019-08-16 10:48                   ` Mattias Engdegård
  0 siblings, 1 reply; 12+ messages in thread
From: Eli Zaretskii @ 2019-08-16  9:33 UTC
  To: Mattias Engdegård; +Cc: 37036

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Fri, 16 Aug 2019 00:19:43 +0200
> Cc: 37036@debbugs.gnu.org
> 
> In any case, I wasn't aiming for perfection; that is indeed a fool's errand. It was just a discovery of a rather obvious mistake, and evidence of code that doesn't work properly because of it. I thought the patch would be rather uncontroversial.

AFAIU, the patch made all the non-letter characters excluded from the
Latin category, is that right?  If so, it's a pretty significant
change IMO; who knows what it could break, including outside of the
core Emacs.  The fact that the Latin category is not well defined
doesn't yet mean we are at liberty to change that (implied)
definition at will.  Categories are currently used for a small number
of core Emacs features, and AFAIR were created incrementally as the
ad-hoc need for each one of them arose, so we also risk breaking our
own code.  Do we really have a good reason to wake those sleeping
dogs?

> >> Consider the function fill-polish-nobreak-p. It is clearly written with the assumption of a reasonable definition of the Latin category, and it doesn't work as expected because of that.
> > 
> > Can you tell the details of where this function doesn't work?  I'd
> > like to understand why fixing it needs to change the categories.
> 
> Right: it attempts to match a single-character word before point, with the assumption that \cl would match any Latin(-script) letter. However, since that expression matches most of ASCII as well, the function incorrectly says that line-breaking would be disallowed after "In my dreams..." or "(She smiles!)" or "He died in 1951." (well, the equivalents in Polish).
> Some details are in https://debbugs.gnu.org/cgi/bugreport.cgi?bug=20871 .

So you are saying that function fails to consider punctuation and
symbols that are part of the Latin blocks?  That just means it
shouldn't use \cl in the first place (and yes, my suggestion to use
that in the bug discussion was wrong, sorry), it should use the
general-category Unicode property to filter out punctuation
characters.  Or it could use explicit ranges of codepoints.  Or we
could extend [:punct:] to support non-ASCII punctuation in a more
meaningful way.  Either way, that's not a reason good enough to make
significant changes in how the categories are defined.  If any
extensions are needed, I'd rather we made it in more modern and less
ad-hoc features.
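
For illustration, a test along the lines suggested above might combine the general-category Unicode property with `char-script-table`. This is a rough sketch under stated assumptions: `get-char-code-property` and `char-script-table` exist in Emacs, but the predicate name `my-latin-letter-p` is invented here, and treating exactly the categories Lu, Ll, Lt, Lm and Lo as "letters" is an assumption rather than anything decided in this thread.

```elisp
;; Hypothetical predicate for illustration; not part of Emacs.
(defun my-latin-letter-p (char)
  "Return non-nil if CHAR is a letter that Emacs assigns to the Latin script.
\"Letter\" here means a Unicode general category of Lu, Ll, Lt, Lm or Lo."
  (and (memq (get-char-code-property char 'general-category)
             '(Lu Ll Lt Lm Lo))
       ;; `char-script-table' maps characters to script symbols.
       (eq (aref char-script-table char) 'latin)))

;; Forms to evaluate interactively:
;; (my-latin-letter-p ?a)   ; a letter
;; (my-latin-letter-p ?.)   ; punctuation
;; (my-latin-letter-p ?1)   ; a digit
```

Note that `char-script-table` assigns scripts by codepoint ranges, so its answer for an individual symbol may not coincide with the Unicode Script property.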

> The point is that if there is some code that doesn't work because of the broken categories, there may very well be more.

This argument goes both ways: there could be code out there which
relies on the current "broken" definition of the Latin category.

> > I don't think we should fix those mistakes, because that's an
> > impossible goal.  We should instead gradually stop using categories
> > for anything serious, certainly for any new code.  We should use the
> > UCD properties and the various char-tables built upon that instead.
> 
> Perhaps, but categories still have one thing going for them: they have fairly good regexp support.

I think this is in many cases an illusory advantage: specifying \cFOO
in a regexp just makes the code access some char-table.  But the same
is true for get-char-code-property and for accessing char-script-table
from Lisp, to mention just two alternatives.  And we all know that
using regular expressions for solving a problem sometimes _adds_ a
problem instead of solving one.

If we have some functionality in regular expressions that's supported
by categories, but is unavailable or inconvenient with Unicode
properties, I'd rather we extended our regex engine to support the
likes of \p{Po} and \p{script=greek}, see
http://unicode.org/reports/tr18/, instead of wasting our resources on
"fixing" the categories.


* bug#37036: [PATCH] Inconsistent ASCII and Latin char categories
  2019-08-16  9:33                 ` Eli Zaretskii
@ 2019-08-16 10:48                   ` Mattias Engdegård
  0 siblings, 0 replies; 12+ messages in thread
From: Mattias Engdegård @ 2019-08-16 10:48 UTC
  To: Eli Zaretskii; +Cc: 37036

tags 37036 wontfix
close 37036
stop

On 16 Aug 2019, at 11:33, Eli Zaretskii <eliz@gnu.org> wrote:
> 
>> The point is that if there is some code that doesn't work because of the broken categories, there may very well be more.
> 
> This argument goes both ways: there could be code out there which
> relies on the current "broken" definition of the Latin category.

Well, that's an argument against fixing any bug. In general, code is more likely to depend on correctness than on errors.

That said, this is nothing I feel strongly about; let's not waste any more time. Maybe the manual section about categories should be amended to discourage would-be users.





