unofficial mirror of bug-gnu-emacs@gnu.org 
From: Eli Zaretskii <eliz@gnu.org>
To: "Mattias Engdegård" <mattiase@acm.org>
Cc: 37036@debbugs.gnu.org
Subject: bug#37036: [PATCH] Inconsistent ASCII and Latin char categories
Date: Fri, 16 Aug 2019 12:33:08 +0300
Message-ID: <835zmxpi97.fsf@gnu.org>
In-Reply-To: <AD72C315-9BD9-4DDE-949C-46FAC6443F09@acm.org> (message from Mattias Engdegård on Fri, 16 Aug 2019 00:19:43 +0200)

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Fri, 16 Aug 2019 00:19:43 +0200
> Cc: 37036@debbugs.gnu.org
> 
> In any case, I wasn't aiming for perfection; that is indeed a fool's errand. It was just a discovery of a rather obvious mistake, and evidence of code that doesn't work properly because of it. I thought the patch would be rather uncontroversial.

AFAIU, the patch excludes all non-letter characters from the Latin
category, is that right?  If so, it's a pretty significant change
IMO; who knows what it could break, including outside of the core
Emacs.  The fact that the Latin category is not well defined doesn't
mean we are at liberty to change that (implied) definition at will.
Categories are currently used for a small number of core Emacs
features, and AFAIR were created incrementally as the ad-hoc need for
each one of them arose, so we also risk breaking our own code.  Do we
really have a good reason to wake those sleeping dogs?

> >> Consider the function fill-polish-nobreak-p. It is clearly written with the assumption of a reasonable definition of the Latin category, and it doesn't work as expected because of that.
> > 
> > Can you tell the details of where this function doesn't work?  I'd
> > like to understand why fixing it needs to change the categories.
> 
> Right: it attempts to match a single-character word before point, with the assumption that \cl would match any Latin(-script) letter. However, since that expression matches most of ASCII as well, the function incorrectly says that line-breaking would be disallowed after "In my dreams..." or "(She smiles!)" or "He died in 1951." (well, the equivalents in Polish).
> Some details are in https://debbugs.gnu.org/cgi/bugreport.cgi?bug=20871 .

So you are saying that function fails to consider punctuation and
symbols that are part of the Latin blocks?  That just means it
shouldn't use \cl in the first place (and yes, my suggestion to use
that in the bug discussion was wrong, sorry), it should use the
general-category Unicode property to filter out punctuation
characters.  Or it could use explicit ranges of codepoints.  Or we
could extend [:punct:] to support non-ASCII punctuation in a more
meaningful way.  Either way, that's not a good enough reason to make
significant changes in how the categories are defined.  If any
extensions are needed, I'd rather we made them in more modern and
less ad-hoc features.
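
For illustration, a minimal sketch of the general-category approach
in Emacs Lisp.  The name my-latin-letter-p is hypothetical and not
part of Emacs; the sketch assumes that combining char-script-table
with the general-category property is an acceptable stand-in for
"Latin letter" in a function such as fill-polish-nobreak-p.

```elisp
;; Hypothetical helper, not part of Emacs: a "Latin letter" test that
;; does not rely on the `l' category.
(defun my-latin-letter-p (char)
  "Return t if CHAR is a letter belonging to the Latin script."
  (and
   ;; Script lookup: one char-table access.
   (eq (aref char-script-table char) 'latin)
   ;; Unicode general category: keep only the letter classes, which
   ;; filters out punctuation, digits and symbols.
   (memq (get-char-code-property char 'general-category)
         '(Lu Ll Lt Lm Lo))
   t))
```

With this predicate, "a" and "ą" qualify, while ".", ")" and "1" do
not, whatever the category table says about them.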

> The point is that if there is some code that doesn't work because of the broken categories, there may very well be more.

This argument goes both ways: there could be code out there which
relies on the current "broken" definition of the Latin category.

> > I don't think we should fix those mistakes, because that's an
> > impossible goal.  We should instead gradually stop using categories
> > for anything serious, certainly for any new code.  We should use the
> > UCD properties and the various char-tables built upon that instead.
> 
> Perhaps, but categories still have one thing going for them: they have fairly good regexp support.

I think this is in many cases an illusory advantage: specifying \cFOO
in a regexp just makes the code access some char-table.  But the same
is true for get-char-code-property and for accessing char-script-table
from Lisp, to mention just two alternatives.  And we all know that
using regular expressions for solving a problem sometimes _adds_ a
problem instead of solving one.
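
To make the comparison concrete, here is a small sketch of the three
lookups done directly from Lisp.  The my-* wrapper names are
hypothetical; char-category-set, char-script-table and
get-char-code-property are the existing primitives.

```elisp
;; Hypothetical wrappers; each one is a single table lookup.

(defun my-char-has-category-p (char category)
  "Return non-nil if CHAR is in CATEGORY, a mnemonic such as ?l.
This is roughly the test that \\cl performs inside a regexp."
  (aref (char-category-set char) category))

(defun my-char-script (char)
  "Return the script symbol of CHAR, e.g. `latin' or `greek'."
  (aref char-script-table char))

(defun my-char-general-category (char)
  "Return the Unicode general category of CHAR, e.g. `Ll' or `Po'."
  (get-char-code-property char 'general-category))
```

For example, (my-char-general-category ?.) yields Po, which is
information the category table cannot express by itself.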

If we have some functionality in regular expressions that's supported
by categories, but is unavailable or inconvenient with Unicode
properties, I'd rather we extended our regex engine to support the
likes of \p{Po} and \p{script=greek}, see
http://unicode.org/reports/tr18/, instead of wasting our resources on
"fixing" the categories.
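
Until something like \p{Po} exists in the regex engine, the effect
can be approximated from Lisp.  The following is only a sketch under
that assumption; my-search-forward-general-category is a hypothetical
name, and a plain character-by-character loop like this one is much
slower than a real regexp search would be.

```elisp
;; Hypothetical stand-in for a regexp such as \p{Po}: scan forward
;; from point for a character whose Unicode general category is one
;; of CATEGORIES.
(defun my-search-forward-general-category (categories &optional bound)
  "Move point just after the next char whose category is in CATEGORIES.
CATEGORIES is a list of symbols such as (Po Ps Pe).  BOUND, if
non-nil, limits the search.  Return the new position, or nil if no
such character was found (point is then left at the limit)."
  (let ((limit (or bound (point-max)))
        (found nil))
    (while (and (not found) (< (point) limit))
      (when (memq (get-char-code-property (char-after) 'general-category)
                  categories)
        (setq found t))
      (forward-char 1))
    (and found (point))))
```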




