From mboxrd@z Thu Jan 1 00:00:00 1970
From: Eli Zaretskii
Newsgroups: gmane.emacs.bugs
Subject: bug#37036: [PATCH] Inconsistent ASCII and Latin char categories
Date: Fri, 16 Aug 2019 12:33:08 +0300
Message-ID: <835zmxpi97.fsf@gnu.org>
References: <83zhkaphy7.fsf@gnu.org> <183B7811-9B30-4D6B-BFCA-36A13CE8B6DB@acm.org> <83v9uypfdm.fsf@gnu.org> <83pnl6pdo6.fsf@gnu.org> <2B0EDC85-CAAE-4658-AA6D-85AF4842BFCF@acm.org> <83o90qp71n.fsf@gnu.org>
To: Mattias Engdegård
Cc: 37036@debbugs.gnu.org
In-reply-to: (message from Mattias Engdegård on Fri, 16 Aug 2019 00:19:43 +0200)
Xref: news.gmane.org gmane.emacs.bugs:165197

> From: Mattias Engdegård
> Date: Fri, 16 Aug 2019 00:19:43 +0200
> Cc: 37036@debbugs.gnu.org
>
> In any case, I wasn't aiming for perfection; that is indeed a fool's errand. It was just a discovery of a rather obvious mistake, and evidence of code that doesn't work properly because of it. I thought the patch would be rather uncontroversial.

AFAIU, the patch excludes all the non-letter characters from the Latin category, is that right? If so, it's a pretty significant change IMO; who knows what it could break, including outside of the core Emacs.

The fact that the Latin category is not well defined doesn't yet mean we are at liberty to change that (implied) definition at will. Categories are currently used for a small number of core Emacs features, and AFAIR were created incrementally as the ad-hoc need for each one of them arose, so we also risk breaking our own code. Do we really have a good reason to wake those sleeping dogs?

> >> Consider the function fill-polish-nobreak-p. It is clearly written with the assumption of a reasonable definition of the Latin category, and it doesn't work as expected because of that.
> >
> > Can you tell the details of where this function doesn't work? I'd like to understand why fixing it needs to change the categories.
>
> Right: it attempts to match a single-character word before point, with the assumption that \cl would match any Latin(-script) letter. However, since that expression matches most of ASCII as well, the function incorrectly says that line-breaking would be disallowed after "In my dreams..." or "(She smiles!)" or "He died in 1951." (well, the equivalents in Polish).
> Some details are in https://debbugs.gnu.org/cgi/bugreport.cgi?bug=20871 .

So you are saying that function fails to consider punctuation and symbols that are part of the Latin blocks? That just means it shouldn't use \cl in the first place (and yes, my suggestion to use that in the bug discussion was wrong, sorry); it should use the general-category Unicode property to filter out punctuation characters. Or it could use explicit ranges of codepoints. Or we could extend [:punct:] to support non-ASCII punctuation in a more meaningful way.

Either way, that's not a good enough reason to make significant changes in how the categories are defined. If any extensions are needed, I'd rather we made them in more modern and less ad-hoc features.

> The point is that if there is some code that doesn't work because of the broken categories, there may very well be more.

This argument goes both ways: there could be code out there which relies on the current "broken" definition of the Latin category.

> > I don't think we should fix those mistakes, because that's an impossible goal. We should instead gradually stop using categories for anything serious, certainly for any new code. We should use the UCD properties and the various char-tables built upon that instead.
>
> Perhaps, but categories still have one thing going for them: they have fairly good regexp support.

I think this is in many cases an illusory advantage: specifying \cFOO in a regexp just makes the code access some char-table. But the same is true for get-char-code-property and for accessing char-script-table from Lisp, to mention just two alternatives. And we all know that using regular expressions for solving a problem sometimes _adds_ a problem instead of solving one.

If we have some functionality in regular expressions that's supported by categories, but is unavailable or inconvenient with Unicode properties, I'd rather we extended our regex engine to support the likes of \p{Po} and \p{script=greek}, see http://unicode.org/reports/tr18/, instead of wasting our resources on "fixing" the categories.
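
To illustrate the property-based test suggested above (general category rather than \cl), here is a minimal sketch. It is written in Python rather than Emacs Lisp, using the standard unicodedata module, whose category() reports the UCD general category. The names is_latin_letter and single_letter_word_before_end are hypothetical, and since unicodedata has no script property, the "LATIN " name prefix is an assumed stand-in for script=latin. This is not the logic of fill-polish-nobreak-p, only a rough analogue of the single-character-word check.

```python
import unicodedata


def is_latin_letter(ch: str) -> bool:
    """Return True if ch is a letter whose Unicode name marks it as Latin.

    Hypothetical helper for illustration only; it is not Emacs code.
    """
    # General category: Lu, Ll, Lt, Lm, Lo are letters; Po, Pe, Nd are not.
    if not unicodedata.category(ch).startswith("L"):
        return False
    # Assumption: the standard library has no script property, so the
    # character-name prefix stands in for script=latin here.
    return unicodedata.name(ch, "").startswith("LATIN ")


def single_letter_word_before_end(text: str) -> bool:
    """Return True if text (ignoring trailing whitespace) ends in a
    one-character word made of a single Latin letter."""
    stripped = text.rstrip()
    if not stripped or not is_latin_letter(stripped[-1]):
        return False
    # The letter must stand alone: at the start, or preceded by whitespace.
    return len(stripped) == 1 or stripped[-2].isspace()


if __name__ == "__main__":
    for sample in ("Ala ma kota w ", "In my dreams...", "(She smiles!)",
                   "He died in 1951."):
        print(repr(sample), single_letter_word_before_end(sample))
```

With a letter test based on the general category, the trailing ".", ")" and digit cases from the examples quoted earlier are no longer treated as single-letter words, while a genuine one-letter word still is.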