From: Eli Zaretskii
Newsgroups: gmane.emacs.bugs
Subject: bug#1877: Request: Regular expressions that can match Unicode general categories
Date: Mon, 30 Sep 2019 11:45:14 +0300
Message-ID: <83r23ycgv9.fsf@gnu.org>
References: <1231792692.22467.115.camel@eep> <87zhimfcs4.fsf@gnus.org>
In-reply-to: <87zhimfcs4.fsf@gnus.org> (message from Lars Ingebrigtsen on Mon, 30 Sep 2019 09:45:15 +0200)
To: Lars Ingebrigtsen
Cc: derick.eddington@gmail.com, 1877@debbugs.gnu.org

> From: Lars Ingebrigtsen
> Date: Mon, 30 Sep 2019 09:45:15 +0200
> Cc: 1877@debbugs.gnu.org
>
> Derick Eddington writes:
>
> > A new Scheme major mode I've made [1] requires regular expressions that
> > can match characters by their Unicode general categories.  It seems
> > Emacs regular expressions do not provide a way to do that directly (I'm
> > using GNU Emacs 23.0.60.1)
>
> (I'm going through old bug reports that unfortunately didn't get any
> response at the time.)
>
> I'm not quite sure what Unicode general categories you're referring to,
> but the Emacs regexp matcher has gained a bunch of categories in the ten
> years since you made the request.
>
> Are the categories below what you were thinking of?
>
> ‘[:print:]’
>      This matches any printing character—either whitespace, or a graphic
>      character matched by ‘[:graph:]’.
> ‘[:punct:]’
>      This matches any punctuation character.  (At present, for multibyte
>      characters, it matches anything that has non-word syntax.)
> ‘[:space:]’
>      This matches any character that has whitespace syntax (*note Syntax
>      Class Table::).
> ‘[:upper:]’
>      This matches any upper-case letter, as determined by the current
>      case table (*note Case Tables::).  If ‘case-fold-search’ is
>      non-‘nil’, this also matches any lower-case letter.
> ‘[:word:]’
>      This matches any character that has word syntax (*note Syntax Class
>      Table::).

No, he means the categories described in the node "Character
Properties" of the ELisp manual.  We don't yet have full support for
the Unicode Regular Expressions, as specified in UTS#18.  In
particular, see

  http://unicode.org/reports/tr18/#General_Category_Property

for General Category regexp specs.

It is not clear to me which categories are of interest here.
Some of them are nowadays definitely available indirectly via the
classes mentioned above (they weren't available in Emacs 23 when the
bug was filed).  Maybe the OP could provide an explicit list of
categories needed for this Scheme mode, together with their required
usage in this mode.

Looking at R6RS sec 4.2.1, all I see is "whitespace" (which we provide
via [:space:]), "letter" (provided by [:alpha:]), "digit" (provided by
[:digit:]), and "intraline whitespace" (provided by [:blank:]).  If
this is all, then we have all the required support now.
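
For readers unfamiliar with the General Category property that UTS#18 refers to, here is a minimal sketch of what "matching by general category" means. It is written in Python rather than Emacs Lisp, purely as an illustration outside Emacs, and uses the standard `unicodedata` module. The helper names `general_category`, `matches_category`, and `find_by_category` are hypothetical and exist only for this example. They are not part of Emacs or of any regexp engine.

```python
import unicodedata


def general_category(ch):
    """Return the two-letter Unicode general category of one character."""
    return unicodedata.category(ch)


def matches_category(ch, spec):
    """Return True if ch is in category spec.

    spec is either a full category such as "Lu" or a major class
    such as "L", which covers Lu, Ll, Lt, Lm and Lo.
    (Hypothetical helper, roughly what a category-aware regexp
    construct would test for a single character.)
    """
    return general_category(ch).startswith(spec)


def find_by_category(text, spec):
    """Return the characters of text that fall in category spec."""
    return [ch for ch in text if matches_category(ch, spec)]


# Latin capital A, Greek small lambda, digit seven,
# an ordinary space and a no-break space.
sample = "A\u03bb7 \u00a0"

print([general_category(ch) for ch in sample])
print(find_by_category(sample, "L"))
```

The point of the sketch is that a general category is a property of each character taken from the Unicode Character Database, so a test such as "is this in Zs" does not depend on a mode's syntax table or case table, unlike several of the bracket classes quoted above.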