From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!.POSTED.blaine.gmane.org!not-for-mail
From: Eli Zaretskii <eliz@gnu.org>
Newsgroups: gmane.emacs.bugs
Subject: bug#37036: [PATCH] Inconsistent ASCII and Latin char categories
Date: Fri, 16 Aug 2019 12:33:08 +0300
Message-ID: <835zmxpi97.fsf@gnu.org>
References: <ABE1C023-64DB-452B-984A-DC22A712E224@acm.org>
 <83zhkaphy7.fsf@gnu.org> <183B7811-9B30-4D6B-BFCA-36A13CE8B6DB@acm.org>
 <83v9uypfdm.fsf@gnu.org> <C0236B82-354A-49C6-B525-7BBF535FA786@acm.org>
 <83pnl6pdo6.fsf@gnu.org> <2B0EDC85-CAAE-4658-AA6D-85AF4842BFCF@acm.org>
 <83o90qp71n.fsf@gnu.org> <AD72C315-9BD9-4DDE-949C-46FAC6443F09@acm.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Injection-Info: blaine.gmane.org; posting-host="blaine.gmane.org:195.159.176.226";
	logging-data="211464"; mail-complaints-to="usenet@blaine.gmane.org"
Cc: 37036@debbugs.gnu.org
To: Mattias =?UTF-8?Q?Engdeg=C3=A5rd?= <mattiase@acm.org>
Original-X-From: bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane.org@gnu.org Fri Aug 16 11:34:14 2019
Return-path: <bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane.org@gnu.org>
Envelope-to: geb-bug-gnu-emacs@m.gmane.org
Original-Received: from lists.gnu.org ([209.51.188.17])
	by blaine.gmane.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256)
	(Exim 4.89)
	(envelope-from <bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane.org@gnu.org>)
	id 1hyYce-000ssP-1I
	for geb-bug-gnu-emacs@m.gmane.org; Fri, 16 Aug 2019 11:34:12 +0200
Original-Received: from localhost ([::1]:52226 helo=lists1p.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.90_1)
	(envelope-from <bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane.org@gnu.org>)
	id 1hyYcc-0007vO-30
	for geb-bug-gnu-emacs@m.gmane.org; Fri, 16 Aug 2019 05:34:10 -0400
Original-Received: from eggs.gnu.org ([2001:470:142:3::10]:59076)
 by lists.gnu.org with esmtp (Exim 4.90_1)
 (envelope-from <Debian-debbugs@debbugs.gnu.org>) id 1hyYcV-0007uz-Kz
 for bug-gnu-emacs@gnu.org; Fri, 16 Aug 2019 05:34:05 -0400
Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
 (envelope-from <Debian-debbugs@debbugs.gnu.org>) id 1hyYcU-0002Yr-6l
 for bug-gnu-emacs@gnu.org; Fri, 16 Aug 2019 05:34:03 -0400
Original-Received: from debbugs.gnu.org ([209.51.188.43]:44245)
 by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_128_CBC_SHA1:16)
 (Exim 4.71) (envelope-from <Debian-debbugs@debbugs.gnu.org>)
 id 1hyYcU-0002Yl-34
 for bug-gnu-emacs@gnu.org; Fri, 16 Aug 2019 05:34:02 -0400
Original-Received: from Debian-debbugs by debbugs.gnu.org with local (Exim 4.84_2)
 (envelope-from <Debian-debbugs@debbugs.gnu.org>) id 1hyYcT-0006C8-UG
 for bug-gnu-emacs@gnu.org; Fri, 16 Aug 2019 05:34:01 -0400
X-Loop: help-debbugs@gnu.org
Resent-From: Eli Zaretskii <eliz@gnu.org>
Original-Sender: "Debbugs-submit" <debbugs-submit-bounces@debbugs.gnu.org>
Resent-CC: bug-gnu-emacs@gnu.org
Resent-Date: Fri, 16 Aug 2019 09:34:01 +0000
Resent-Message-ID: <handler.37036.B37036.156594800523771@debbugs.gnu.org>
Resent-Sender: help-debbugs@gnu.org
X-GNU-PR-Message: followup 37036
X-GNU-PR-Package: emacs
X-GNU-PR-Keywords: patch
Original-Received: via spool by 37036-submit@debbugs.gnu.org id=B37036.156594800523771
 (code B ref 37036); Fri, 16 Aug 2019 09:34:01 +0000
Original-Received: (at 37036) by debbugs.gnu.org; 16 Aug 2019 09:33:25 +0000
Original-Received: from localhost ([127.0.0.1]:53066 helo=debbugs.gnu.org)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <debbugs-submit-bounces@debbugs.gnu.org>)
 id 1hyYbs-0006BL-M1
 for submit@debbugs.gnu.org; Fri, 16 Aug 2019 05:33:25 -0400
Original-Received: from eggs.gnu.org ([209.51.188.92]:46580)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <eliz@gnu.org>) id 1hyYbr-0006B9-Cs
 for 37036@debbugs.gnu.org; Fri, 16 Aug 2019 05:33:23 -0400
Original-Received: from fencepost.gnu.org ([2001:470:142:3::e]:37751)
 by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from <eliz@gnu.org>)
 id 1hyYbk-00022F-8q; Fri, 16 Aug 2019 05:33:18 -0400
Original-Received: from [176.228.60.248] (port=3671 helo=home-c4e4a596f7)
 by fencepost.gnu.org with esmtpsa (TLS1.2:RSA_AES_256_CBC_SHA1:256)
 (Exim 4.82) (envelope-from <eliz@gnu.org>)
 id 1hyYbi-0004wz-Mk; Fri, 16 Aug 2019 05:33:15 -0400
In-reply-to: <AD72C315-9BD9-4DDE-949C-46FAC6443F09@acm.org> (message from
 Mattias =?UTF-8?Q?Engdeg=C3=A5rd?= on Fri, 16 Aug 2019 00:19:43 +0200)
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic]
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
X-Received-From: 209.51.188.43
X-BeenThere: bug-gnu-emacs@gnu.org
List-Id: "Bug reports for GNU Emacs,
 the Swiss army knife of text editors" <bug-gnu-emacs.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/bug-gnu-emacs>,
 <mailto:bug-gnu-emacs-request@gnu.org?subject=unsubscribe>
List-Archive: <https://lists.gnu.org/archive/html/bug-gnu-emacs>
List-Post: <mailto:bug-gnu-emacs@gnu.org>
List-Help: <mailto:bug-gnu-emacs-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/bug-gnu-emacs>,
 <mailto:bug-gnu-emacs-request@gnu.org?subject=subscribe>
Errors-To: bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane.org@gnu.org
Original-Sender: "bug-gnu-emacs"
 <bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane.org@gnu.org>
Xref: news.gmane.org gmane.emacs.bugs:165197
Archived-At: <http://permalink.gmane.org/gmane.emacs.bugs/165197>

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Fri, 16 Aug 2019 00:19:43 +0200
> Cc: 37036@debbugs.gnu.org
> 
> In any case, I wasn't aiming for perfection; that is indeed a fool's errand. It was just a discovery of a rather obvious mistake, and evidence of code that doesn't work properly because of it. I thought the patch would be rather uncontroversial.

AFAIU, the patch excludes all the non-letter characters from the
Latin category, is that right?  If so, it's a pretty significant
change IMO; who knows what it could break, including outside of the
core Emacs.  The fact that the Latin category is not well defined
doesn't yet mean we are at liberty to change that (implied)
definition at will.  Categories are currently used for a small number
of core Emacs features, and AFAIR were created incrementally as the
ad-hoc need for each one of them arose, so we also risk breaking our
own code.  Do we really have a good reason to wake those sleeping
dogs?

> >> Consider the function fill-polish-nobreak-p. It is clearly written with the assumption of a reasonable definition of the Latin category, and it doesn't work as expected because of that.
> > 
> > Can you tell the details of where this function doesn't work?  I'd
> > like to understand why fixing it needs to change the categories.
> 
> Right: it attempts to match a single-character word before point, with the assumption that \cl would match any Latin(-script) letter. However, since that expression matches most of ASCII as well, the function incorrectly says that line-breaking would be disallowed after "In my dreams..." or "(She smiles!)" or "He died in 1951." (well, the equivalents in Polish).
> Some details are in https://debbugs.gnu.org/cgi/bugreport.cgi?bug=20871 .

So you are saying that function fails to consider punctuation and
symbols that are part of the Latin blocks?  That just means it
shouldn't use \cl in the first place (and yes, my suggestion to use
that in the bug discussion was wrong, sorry), it should use the
general-category Unicode property to filter out punctuation
characters.  Or it could use explicit ranges of codepoints.  Or we
could extend [:punct:] to support non-ASCII punctuation in a more
meaningful way.  Either way, that's not a good enough reason to make
significant changes in how the categories are defined.  If any
extensions are needed, I'd rather we made them in more modern and less
ad-hoc features.
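
A minimal sketch of the general-category approach, in Emacs Lisp.  The
name my-latin-letter-p is hypothetical, not anything in Emacs; the
sketch assumes that char-script-table reports `latin' for the
characters of interest, and it treats the general categories Lu, Ll,
Lt, Lm and Lo as "letters":

```elisp
;; Hypothetical helper, for illustration only; not part of Emacs.
(defun my-latin-letter-p (char)
  "Return t if CHAR is a letter whose script is Latin.
Punctuation, digits and symbols from the Latin blocks are rejected
by looking at the Unicode general category of CHAR."
  (and (eq (aref char-script-table char) 'latin)
       (memq (get-char-code-property char 'general-category)
             '(Lu Ll Lt Lm Lo))
       t))

(my-latin-letter-p ?w)   ; a letter (general category Ll)
(my-latin-letter-p ?.)   ; punctuation (general category Po)
(my-latin-letter-p ?1)   ; a digit (general category Nd)
```

A caller such as fill-polish-nobreak-p could apply a predicate like
this to (char-before) instead of matching \cl, though whether that
fits the rest of that function is something I haven't checked.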

> The point is that if there is some code that doesn't work because of the broken categories, there may very well be more.

This argument goes both ways: there could be code out there which
relies on the current "broken" definition of the Latin category.

> > I don't think we should fix those mistakes, because that's an
> > impossible goal.  We should instead gradually stop using categories
> > for anything serious, certainly for any new code.  We should use the
> > UCD properties and the various char-tables built upon that instead.
> 
> Perhaps, but categories still have one thing going for them: they have fairly good regexp support.

I think this is in many cases an illusory advantage: specifying \cFOO
in a regexp just makes the code access some char-table.  But the same
is true for get-char-code-property and for accessing char-script-table
from Lisp, to mention just two alternatives.  And we all know that
using regular expressions for solving a problem sometimes _adds_ a
problem instead of solving one.
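
To illustrate that these are all table lookups of the same kind, here
is a small sketch that collects the three pieces of information for
one character.  The name my-char-lookups and the plist keys are
hypothetical; the functions and the variable it calls are the existing
ones mentioned above, plus char-category-set and
category-set-mnemonics:

```elisp
;; Hypothetical helper, for illustration only; not part of Emacs.
(defun my-char-lookups (char)
  "Return a plist of table-derived facts about CHAR.
:categories is the string of category mnemonics from the current
category table (what \\cX in a regexp consults), :script is the
entry in `char-script-table', and :general-category is the Unicode
general category as a symbol."
  (list :categories (category-set-mnemonics (char-category-set char))
        :script (aref char-script-table char)
        :general-category (get-char-code-property char 'general-category)))

(my-char-lookups ?a)
```

The category string depends on how the categories are defined in the
running Emacs, which is exactly what is being discussed here, so no
particular value is shown for it.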

If we have some functionality in regular expressions that's supported
by categories, but is unavailable or inconvenient with Unicode
properties, I'd rather we extended our regex engine to support the
likes of \p{Po} and \p{script=greek}, see
http://unicode.org/reports/tr18/, instead of wasting our resources on
"fixing" the categories.