* bug#37036: [PATCH] Inconsistent ASCII and Latin char categories
From: Mattias Engdegård @ 2019-08-15 12:17 UTC
To: 37036

[-- Attachment #1: Type: text/plain, Size: 762 bytes --]

The ASCII (a) and Latin (l) character categories are inconsistent in what characters they contain.

It should be clear what the ASCII category means, but it omits 00-1f (contrary to a comment in the code).

The Latin category isn't exactly defined anywhere but should reasonably comprise letters from Latin-based scripts. Currently, it also includes many control characters and symbols from the ASCII and Latin-1 Supplement blocks, which seems hard to justify.

Other changes to Latin could be argued: what modifiers/combining chars should be included? What about IPA and non-IPA phonetics? Ligatures? What about Latin-derived forms such as circled letters? &c. The attached patch does not go there but only fixes the glaring errors in the 00-ff range.

[-- Attachment #2: 0001-Fix-ASCII-and-Latin-character-categories.patch --]
[-- Type: application/octet-stream, Size: 1598 bytes --]

From 9dbb98c7d2f7856a16efcfacdfae7890db3c45fe Mon Sep 17 00:00:00 2001
From: Mattias Engdegård <mattiase@acm.org>
Date: Thu, 15 Aug 2019 14:04:03 +0200
Subject: [PATCH] Fix ASCII and Latin character categories

* lisp/international/characters.el: Make the ASCII (a) category
include all ASCII characters. Make the Latin (l) category include
only letters from the range 00-ff.
---
 lisp/international/characters.el | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/lisp/international/characters.el b/lisp/international/characters.el
index 012827ba1c..379a6a170b 100644
--- a/lisp/international/characters.el
+++ b/lisp/international/characters.el
@@ -127,11 +127,8 @@ ?L
 \f
 ;;; Setting syntax and category.
 
-;; ASCII
-
-;; All ASCII characters have the category `a' (ASCII) and `l' (Latin).
-(modify-category-entry '(32 . 127) ?a)
-(modify-category-entry '(32 . 127) ?l)
+;; All ASCII characters have the category `a' (ASCII).
+(modify-category-entry '(0 . 127) ?a)
 
 ;; Deal with the CJK charsets first. Since the syntax of blocks is
 ;; defined per charset, and the charsets may contain e.g. Latin
@@ -510,7 +507,13 @@ ?L
 
 ;; Latin
 
-(modify-category-entry '(#x80 . #x024F) ?l)
+;; ASCII
+(modify-category-entry '(?A . ?Z) ?l)
+(modify-category-entry '(?a . ?z) ?l)
+;; Latin-1 Supplement
+(modify-category-entry '(#xc0 . #xd6) ?l)
+(modify-category-entry '(#xd8 . #xf6) ?l)
+(modify-category-entry '(#xf8 . #xff) ?l)
 
 (let ((tbl (standard-case-table)) c)
 
-- 
2.20.1 (Apple Git-117)
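For readers who want to check the report's claims in their own Emacs, the following is a minimal sketch (not part of the patch) that lists the category mnemonics assigned to a few characters. The helper name `my-char-categories` is hypothetical; `char-category-set` and `category-set-mnemonics` are standard functions. The result depends on the Emacs version and on the current buffer's category table, so no expected output is shown.

```elisp
;; Sketch: inspect which categories a character belongs to.
;; `my-char-categories' is a hypothetical helper, not an Emacs function.
(defun my-char-categories (ch)
  "Return the category mnemonics of character CH as a string."
  (category-set-mnemonics (char-category-set ch)))

;; Pair a letter, a digit, punctuation, a control character (code 1)
;; and a Latin-1 letter with their category mnemonics.
(mapcar (lambda (ch) (cons ch (my-char-categories ch)))
        '(?a ?1 ?. ?\C-a ?é))
```

Evaluating the `mapcar` form with `M-:` or in `*scratch*` shows whether, for example, a digit or a full stop carries the `l` mnemonic.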
* bug#37036: [PATCH] Inconsistent ASCII and Latin char categories
From: Eli Zaretskii @ 2019-08-15 15:27 UTC
To: Mattias Engdegård
Cc: 37036

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Thu, 15 Aug 2019 14:17:15 +0200
>
> The ASCII (a) and Latin (l) character categories are inconsistent in what characters they contain.
>
> It should be clear what the ASCII category means, but it omits 00-1f (contrary to a comment in the code).
>
> The Latin category isn't exactly defined anywhere but should reasonably comprise letters from Latin-based scripts. Currently, it also includes many control characters and symbols from the ASCII and Latin-1 Supplement blocks, which seems hard to justify.
>
> Other changes to Latin could be argued: what modifiers/combining chars should be included? What about IPA and non-IPA phonetics? Ligatures? What about Latin-derived forms such as circled letters? &c. The attached patch does not go there but only fixes the glaring errors in the 00-ff range.

Did you try moving by words after these changes? What happens in
words that consist of ASCII and non-ASCII Latin characters, for
example?
* bug#37036: [PATCH] Inconsistent ASCII and Latin char categories
From: Mattias Engdegård @ 2019-08-15 15:46 UTC
To: Eli Zaretskii
Cc: 37036

On 15 Aug 2019, at 17:27, Eli Zaretskii <eliz@gnu.org> wrote:
>
> Did you try moving by words after these changes? What happens in
> words that consist of ASCII and non-ASCII Latin characters, for
> example?

No change in behaviour observed in any such case.
* bug#37036: [PATCH] Inconsistent ASCII and Latin char categories
From: Eli Zaretskii @ 2019-08-15 16:23 UTC
To: Mattias Engdegård
Cc: 37036

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Thu, 15 Aug 2019 17:46:35 +0200
> Cc: 37036@debbugs.gnu.org
>
> On 15 Aug 2019, at 17:27, Eli Zaretskii <eliz@gnu.org> wrote:
> >
> > Did you try moving by words after these changes? What happens in
> > words that consist of ASCII and non-ASCII Latin characters, for
> > example?
>
> No change in behaviour observed in any such case.

In any case, how to justify the fact that, say, "naïve", has
characters from different scripts?
* bug#37036: [PATCH] Inconsistent ASCII and Latin char categories
From: Mattias Engdegård @ 2019-08-15 16:30 UTC
To: Eli Zaretskii
Cc: 37036

On 15 Aug 2019, at 18:23, Eli Zaretskii <eliz@gnu.org> wrote:
>
> In any case, how to justify the fact that, say, "naïve", has
> characters from different scripts?

The proposed change does not change the categories of any character in that string.
Or did you mean something else?
* bug#37036: [PATCH] Inconsistent ASCII and Latin char categories
From: Eli Zaretskii @ 2019-08-15 16:59 UTC
To: Mattias Engdegård
Cc: 37036

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Thu, 15 Aug 2019 18:30:47 +0200
> Cc: 37036@debbugs.gnu.org
>
> On 15 Aug 2019, at 18:23, Eli Zaretskii <eliz@gnu.org> wrote:
> >
> > In any case, how to justify the fact that, say, "naïve", has
> > characters from different scripts?
>
> The proposed change does not change the categories of any character in that string.

What about "abcdef^A^B"? Does M-f stop before the control characters?

I guess I don't understand the rationale for the change. Categories
are Emacs's invention, and their purpose is mostly to allow us to use
regexps for searching certain characters, and other similar
subtleties. Your rationale seems to be some attempt to be formally
"consistent". But this is not a formal attribute, it is entirely
ad-hoc, as can be easily seen by just looking at the list of the
categories. So I wonder why would we want to rock that particular
boat.
* bug#37036: [PATCH] Inconsistent ASCII and Latin char categories
From: Mattias Engdegård @ 2019-08-15 17:37 UTC
To: Eli Zaretskii
Cc: 37036

On 15 Aug 2019, at 18:59, Eli Zaretskii <eliz@gnu.org> wrote:
>
> What about "abcdef^A^B"? Does M-f stop before the control characters?

Yes. Does forward-word use categories?

> I guess I don't understand the rationale for the change. Categories
> are Emacs's invention, and their purpose is mostly to allow us to use
> regexps for searching certain characters, and other similar
> subtleties. Your rationale seems to be some attempt to be formally
> "consistent". But this is not a formal attribute, it is entirely
> ad-hoc, as can be easily seen by just looking at the list of the
> categories.

The more categories are arbitrary, the less useful they are. Why would anyone use categories to discriminate characters if they do not have a sensible, useful and predictable structure?

If 'Latin' means 'Latin letters, some symbols, some whitespace, some control chars, Indo-Arabic digits and the occasional Greek letter', which it does today, then who can use it correctly?

Consider the function fill-polish-nobreak-p. It is clearly written with the assumption of a reasonable definition of the Latin category, and it doesn't work as expected because of that. Those who reviewed that function thought it looked reasonable, as did I when I read it.

It is perfectly clear that categories have been introduced in an ad-hoc way to solve problems as they arose, but that doesn't mean that no mistakes were made even for those narrow purposes.
* bug#37036: [PATCH] Inconsistent ASCII and Latin char categories
From: Eli Zaretskii @ 2019-08-15 19:23 UTC
To: Mattias Engdegård
Cc: 37036

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Thu, 15 Aug 2019 19:37:49 +0200
> Cc: 37036@debbugs.gnu.org
>
> On 15 Aug 2019, at 18:59, Eli Zaretskii <eliz@gnu.org> wrote:
> >
> > What about "abcdef^A^B"? Does M-f stop before the control characters?
>
> Yes. Does forward-word use categories?

No. Sorry, it was my faulty memory. It uses char-script-table
instead.

> The more categories are arbitrary, the less useful they are.

I think they should become entirely useless, i.e. we should stop using
them. We have the entire Unicode database with all the character
properties for quite some time now, and should favor using that
instead. Categories are an old kludgey hack, which goes back to
pre-Unicode Emacs; it can never be anything but arbitrary, and we will
never be able to fix that anywhere near completely.

> Why would anyone use categories to discriminate characters if they do not have a sensible, useful and predictable structure?

I don't know why anyone should. My recommendation is to just say no.

> Consider the function fill-polish-nobreak-p. It is clearly written with the assumption of a reasonable definition of the Latin category, and it doesn't work as expected because of that.

Can you tell the details of where this function doesn't work? I'd
like to understand why fixing it needs to change the categories.

> It is perfectly clear that categories have been introduced in an ad-hoc way to solve problems as they arose, but that doesn't mean that no mistakes were made even for those narrow purposes.

I don't think we should fix those mistakes, because that's an
impossible goal. We should instead gradually stop using categories
for anything serious, certainly for any new code. We should use the
UCD properties and the various char-tables built upon that instead.
* bug#37036: [PATCH] Inconsistent ASCII and Latin char categories
From: Eli Zaretskii @ 2019-08-15 19:46 UTC
To: mattiase
Cc: 37036

> Date: Thu, 15 Aug 2019 22:23:00 +0300
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: 37036@debbugs.gnu.org
>
> > From: Mattias Engdegård <mattiase@acm.org>
> > Date: Thu, 15 Aug 2019 19:37:49 +0200
> > Cc: 37036@debbugs.gnu.org
> >
> > On 15 Aug 2019, at 18:59, Eli Zaretskii <eliz@gnu.org> wrote:
> > >
> > > What about "abcdef^A^B"? Does M-f stop before the control characters?
> >
> > Yes. Does forward-word use categories?
>
> No. Sorry, it was my faulty memory. It uses char-script-table
> instead.

Actually, it uses categories indirectly, via word-combining-categories
and word-separating-categories.
* bug#37036: [PATCH] Inconsistent ASCII and Latin char categories
From: Mattias Engdegård @ 2019-08-15 22:19 UTC
To: Eli Zaretskii
Cc: 37036

On 15 Aug 2019, at 21:23, Eli Zaretskii <eliz@gnu.org> wrote:

> I think they should become entirely useless, i.e. we should stop using
> them. We have the entire Unicode database with all the character
> properties for quite some time now, and should favor using that
> instead. Categories are an old kludgey hack, which goes back to
> pre-Unicode Emacs; it can never be anything but arbitrary, and we will
> never be able to fix that anywhere near completely.

Thank you, I see what you mean, and I agree that Unicode properties probably are better for most purposes.

In any case, I wasn't aiming for perfection; that is indeed a fool's errand. It was just a discovery of a rather obvious mistake, and evidence of code that doesn't work properly because of it. I thought the patch would be rather uncontroversial.

>> Consider the function fill-polish-nobreak-p. It is clearly written with the assumption of a reasonable definition of the Latin category, and it doesn't work as expected because of that.
>
> Can you tell the details of where this function doesn't work? I'd
> like to understand why fixing it needs to change the categories.

Right: it attempts to match a single-character word before point, with the assumption that \cl would match any Latin(-script) letter. However, since that expression matches most of ASCII as well, the function incorrectly says that line-breaking would be disallowed after "In my dreams..." or "(She smiles!)" or "He died in 1951." (well, the equivalents in Polish).
Some details are in https://debbugs.gnu.org/cgi/bugreport.cgi?bug=20871 .

Of course it doesn't require the categories to be fixed. The point is that if there is some code that doesn't work because of the broken categories, there may very well be more.

> I don't think we should fix those mistakes, because that's an
> impossible goal. We should instead gradually stop using categories
> for anything serious, certainly for any new code. We should use the
> UCD properties and the various char-tables built upon that instead.

Perhaps, but categories still have one thing going for them: they have fairly good regexp support.
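The `\cl` behaviour described in this message can be probed with a short sketch like the one below. It is an illustration only, not the code of fill-polish-nobreak-p; the helper name `my-matches-latin-category-p` is hypothetical. What it returns for punctuation and digits depends on how the running Emacs defines the Latin category, which is exactly the point under discussion, so no expected output is given.

```elisp
;; Sketch: test whether a single character matches the regexp \cl,
;; i.e. whether it belongs to the `l' (Latin) category.
;; `my-matches-latin-category-p' is a hypothetical helper.
(defun my-matches-latin-category-p (ch)
  "Return non-nil if character CH is in the Latin category."
  (string-match-p "\\cl" (string ch)))

;; Probe a letter and the kinds of characters named in the report:
;; a full stop, a closing parenthesis and a digit.
(mapcar (lambda (ch) (cons ch (my-matches-latin-category-p ch)))
        '(?a ?. ?\) ?1))
```

If the non-letter entries come back non-nil, a regexp that relies on `\cl` to mean "Latin letter" will also accept punctuation and digits.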
* bug#37036: [PATCH] Inconsistent ASCII and Latin char categories
From: Eli Zaretskii @ 2019-08-16 9:33 UTC
To: Mattias Engdegård
Cc: 37036

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Fri, 16 Aug 2019 00:19:43 +0200
> Cc: 37036@debbugs.gnu.org
>
> In any case, I wasn't aiming for perfection; that is indeed a fool's errand. It was just a discovery of a rather obvious mistake, and evidence of code that doesn't work properly because of it. I thought the patch would be rather uncontroversial.

AFAIU, the patch made all the non-letter characters excluded from the
Latin category, is that right? If so, it's a pretty significant
change IMO; who knows what it could break, including outside of the
core Emacs. The fact that the Latin category is not well defined
doesn't yet mean we are at liberty of changing that (implied)
definition at will. Categories are currently used for a small number
of core Emacs features, and AFAIR were created incrementally as the
ad-hoc need for each one of them arose, so we also risk breaking our
own code. Do we really have a good reason to wake those sleeping
dogs?

> >> Consider the function fill-polish-nobreak-p. It is clearly written with the assumption of a reasonable definition of the Latin category, and it doesn't work as expected because of that.
> >
> > Can you tell the details of where this function doesn't work? I'd
> > like to understand why fixing it needs to change the categories.
>
> Right: it attempts to match a single-character word before point, with the assumption that \cl would match any Latin(-script) letter. However, since that expression matches most of ASCII as well, the function incorrectly says that line-breaking would be disallowed after "In my dreams..." or "(She smiles!)" or "He died in 1951." (well, the equivalents in Polish).
> Some details are in https://debbugs.gnu.org/cgi/bugreport.cgi?bug=20871 .

So you are saying that function fails to consider punctuation and
symbols that are part of the Latin blocks? That just means it
shouldn't use \cl in the first place (and yes, my suggestion to use
that in the bug discussion was wrong, sorry), it should use the
general-category Unicode property to filter out punctuation
characters. Or it could use explicit ranges of codepoints. Or we
could extend [:punct:] to support non-ASCII punctuation in a more
meaningful way.

Either way, that's not a reason good enough to make significant
changes in how the categories are defined. If any extensions are
needed, I'd rather we made it in more modern and less ad-hoc features.

> The point is that if there is some code that doesn't work because of the broken categories, there may very well be more.

This argument goes both ways: there could be code out there which
relies on the current "broken" definition of the Latin category.

> > I don't think we should fix those mistakes, because that's an
> > impossible goal. We should instead gradually stop using categories
> > for anything serious, certainly for any new code. We should use the
> > UCD properties and the various char-tables built upon that instead.
>
> Perhaps, but categories still have one thing going for them: they have fairly good regexp support.

I think this is in many cases an illusory advantage: specifying \cFOO
in a regexp just makes the code access some char-table. But the same
is true for get-char-code-property and for accessing char-script-table
from Lisp, to mention just two alternatives. And we all know that
using regular expressions for solving a problem sometimes _adds_ a
problem instead of solving one.

If we have some functionality in regular expressions that's supported
by categories, but is unavailable or inconvenient with Unicode
properties, I'd rather we extended our regex engine to support the
likes of \p{Po} and \p{script=greek}, see
http://unicode.org/reports/tr18/, instead of wasting our resources on
"fixing" the categories.
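The alternative suggested in this message, filtering on the Unicode general-category property and on char-script-table rather than on `\cl`, might look roughly like the sketch below. This is an illustration under stated assumptions, not code from Emacs or from the patch: the predicate name `my-latin-letter-p` is hypothetical, and treating the general categories Lu, Ll, Lt, Lm and Lo as "letter" is a choice made here for the example.

```elisp
;; Sketch: a "Latin letter" predicate built on Unicode properties
;; instead of the `l' category.
;; `my-latin-letter-p' is a hypothetical name.
(defun my-latin-letter-p (ch)
  "Return non-nil if CH is a letter whose script is Latin."
  (and
   ;; General category must be one of the letter classes (assumed set).
   (memq (get-char-code-property ch 'general-category)
         '(Lu Ll Lt Lm Lo))
   ;; The script recorded in `char-script-table' must be `latin'.
   (eq (aref char-script-table ch) 'latin)))

;; Example probes: letters pass, punctuation and digits do not.
(list (my-latin-letter-p ?a)
      (my-latin-letter-p ?.)
      (my-latin-letter-p ?1))
```

A predicate of this kind does not depend on how the category table happens to be populated, at the cost of not being usable directly inside a regexp.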
* bug#37036: [PATCH] Inconsistent ASCII and Latin char categories
From: Mattias Engdegård @ 2019-08-16 10:48 UTC
To: Eli Zaretskii
Cc: 37036

tags 37036 wontfix
close 37036
stop

On 16 Aug 2019, at 11:33, Eli Zaretskii <eliz@gnu.org> wrote:
>
>> The point is that if there is some code that doesn't work because of the broken categories, there may very well be more.
>
> This argument goes both ways: there could be code out there which
> relies on the current "broken" definition of the Latin category.

Well, that's an argument against fixing any bug. In general, code is more likely to depend on correctness than on errors.

That said, this is nothing I feel strongly about; let's not waste any more time. Maybe the manual section about categories should be amended to discourage would-be users.