From: Derick Eddington
Reply-To: Derick Eddington, 1877@emacsbugs.donarmstrong.com
Newsgroups: gmane.emacs.bugs
To: bug-gnu-emacs@gnu.org
Subject: bug#1877: Request: Regular expressions that can match Unicode general categories
Date: Mon, 12 Jan 2009 12:38:12 -0800
Message-ID: <1231792692.22467.115.camel@eep>

A new Scheme major mode I've made [1] requires regular expressions that
can match characters by their Unicode general categories.
It seems Emacs regular expressions do not provide a way to do that
directly. I'm using GNU Emacs 23.0.60.1, and I couldn't find anything
about it in the Emacs documentation, on emacswiki.org, by asking on
help-gnu-emacs@gnu.org, or in that list's archives.

So currently I pre-compute character sets for the needed general
categories (using `get-char-code-property') and place them where they
belong in the larger regular expressions. However, including character
sets for every general category I need makes the regular expressions too
large for Emacs, which signals an error when it tries to use them (some
of the sets are pretty big), so currently I don't support all of the
required categories. Another issue is that these character sets are
duplicated across different regular expressions, and since they're so
large this bloats the code size. A third issue is that I suspect
matching against character sets this large is not the most
time-efficient approach.

If Emacs regular expressions had some construct, similar to the existing
`\cC' one, that matched a character by its general category, I think
that would solve all of the above issues nicely. PLT Scheme regular
expressions have this ability [2].

[1] https://code.launchpad.net/~derick-eddington/scheme-mode/derick-.emacs.d
[2] http://docs.plt-scheme.org/reference/regexp.html

Thank you for your work on Emacs and for your time,

-- 
: Derick
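
Editorial illustration (not part of the original report): the workaround described above, pre-computing a character set per general category and splicing it into a larger regular expression, can be sketched roughly as follows. This is a Python stand-in, not the Emacs Lisp code from the mode; it uses Python's `unicodedata.category` where the report uses `get-char-code-property`, and the helper names `category_ranges` and `char_class` are hypothetical. The range counts depend on the Unicode version bundled with the interpreter.

```python
import re
import unicodedata


def category_ranges(category, limit=0x110000):
    """Collect inclusive (start, end) code point ranges whose Unicode
    general category equals `category` (e.g. "Lu", "Nd")."""
    ranges = []
    start = None
    for cp in range(limit):
        if unicodedata.category(chr(cp)) == category:
            if start is None:
                start = cp          # open a new run
        elif start is not None:
            ranges.append((start, cp - 1))  # close the current run
            start = None
    if start is not None:
        ranges.append((start, limit - 1))
    return ranges


def char_class(ranges):
    """Render ranges as a bracketed character set for Python's `re`,
    using \\UXXXXXXXX escapes so no literal needs special quoting."""
    parts = []
    for lo, hi in ranges:
        if lo == hi:
            parts.append(f"\\U{lo:08x}")
        else:
            parts.append(f"\\U{lo:08x}-\\U{hi:08x}")
    return "[" + "".join(parts) + "]"


if __name__ == "__main__":
    upper = category_ranges("Lu")
    pattern = char_class(upper)
    # The set is made of many small ranges, so the pattern text is long.
    print(len(upper), "ranges;", len(pattern), "pattern characters")
    print(bool(re.fullmatch(pattern, "A")), bool(re.fullmatch(pattern, "a")))
```

Because many categories consist of numerous short, scattered ranges, each spliced-in set adds a long run of text to every regular expression that uses it, which is the size and duplication problem the report describes. A built-in category construct would replace each such set with a few characters.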