From: Derick Eddington
Newsgroups: gmane.emacs.help
Subject: Re: Regular expressions for Unicode general categories
Date: Sun, 07 Dec 2008 16:49:55 -0800
Message-ID: <1228697395.4393.103.camel@eep>
In-Reply-To: <9F1B5061-5B57-4370-8681-A10D9B7FAE9B@Web.DE>
To: Peter Dyballa
Cc: help-gnu-emacs@gnu.org

On Mon, 2008-12-08 at 00:35 +0100, Peter Dyballa wrote:
> On 07.12.2008 at 21:47, Derick Eddington wrote:
>
> > So, what can I do?  If Emacs regular expressions' backslash construct
> > `\cC' supported Unicode general categories, or if there was some
> > construct which did, I think that would do it nicely.  Is that
> > planned, or should I resort to doing more manual parsing, or something
> > else?
>
>
> Can't you use the Unicode characters themselves? In ranges like
> [À-Ëà-ë]?

I'm using `rx-to-string' on my computed character sets; that is, on my
one big SRE (s-expression regular expression) that has sub-SREs of the
form `(char . ,).  `rx-to-string' consolidates the characters into
ranges, and I'm assuming it does so as much as possible, so I think I'm
already using ranges as much as possible.
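To make "consolidates the characters into ranges" concrete, here is a
minimal sketch of that kind of run-collapsing.  `my-chars-to-ranges' is
a hypothetical helper written only for illustration; it is not part of
rx, and it is not necessarily how `rx-to-string' does it internally.

```elisp
;; Hypothetical helper (NOT part of rx): collapse a list of code points
;; into maximal runs of consecutive values.
(defun my-chars-to-ranges (chars)
  "Collapse CHARS, a list of integer code points, into (FIRST . LAST) runs."
  (let ((sorted (sort (delete-dups (copy-sequence chars)) #'<))
        (ranges '()))
    (dolist (c sorted)
      (if (and ranges (= c (1+ (cdar ranges))))
          ;; C directly follows the end of the most recent run: extend it.
          (setcdr (car ranges) c)
        ;; Otherwise C starts a new run of its own.
        (push (cons c c) ranges)))
    ;; Runs were pushed in reverse order; put them back in ascending order.
    (nreverse ranges)))

(my-chars-to-ranges '(?a ?b ?c ?x ?z ?y ?!))
;; => ((33 . 33) (97 . 99) (120 . 122))
```

Counting the runs this returns for one category's code points gives a
rough lower bound on how many "[x-y]" pieces any hand-written character
class for that category would need.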
Here's a modified, simplified version of what I'm doing; it shows that
`rx-to-string' is computing ranges:

(require 'rx)
(let* ((general-categories
        ;; Alist of (CATEGORY . CHARS), filled in by scanning every code point.
        (let ((al (list (list 'Po) (list 'Sc))) ;; removed a bunch of others
              (c 0))
          (while (< c #x110000)
            ;; Skip the surrogate range.
            (unless (and (<= #xD800 c) (<= c #xDFFF))
              (let* ((gc (get-char-code-property c 'general-category))
                     (a (assq gc al)))
                (when a (setcdr a (cons c (cdr a))))))
            (setq c (1+ c)))
          al))
       ;; Build a `(char . CHARS)' SRE for one general category.
       (char-set (lambda (gc) `(char . ,(cdr (assq gc general-categories)))))
       (Po (funcall char-set 'Po))
       (Sc (funcall char-set 'Sc))
       ;; removed a bunch of other stuff
       (thing `(seq "foo" (or ,Po ,Sc) "bar")))
  (rx-to-string thing))
=>
"\\(?:foo\\(?:[!-#%-'*,./:;?@\\¡·¿;·՚-՟։׀׃׆׳״؉؊،؍؛؞؟٪-٭۔܀-܍߷-߹।॥॰෴๏๚๛༄-༒྅࿐-࿔၊-၏჻፡-፨᙭᙮᛫-᛭᜵᜶។-៖៘-៚᠀-᠅᠇-᠊᥄᥅᧞᧟᨞᨟᭚-᭠᰻-᰿᱾᱿‖‗†-‧‰-‸※-‾⁁-⁃⁇-⁑⁓⁕-⁞⳹-⳼⳾⳿⸀⸁⸆-⸈⸋⸎-⸖⸘⸙⸛⸞⸟⸪-⸮⸰、-〃〽・꘍-꘏꙳꙾꡴-꡷꣎꣏꤮꤯꥟꩜-꩟︐-︖︙︰﹅﹆﹉-﹌﹐-﹒﹔-﹗﹟-﹡﹨﹪﹫!-#%-'*,./:;?@\。、・𐄀𐄁𐎟𐏐𐤟𐤿𐩐-𐩘𒑰-𒑳]\\|[$¢-¥؋৲৳૱௹฿៛₠-₵﷼﹩$¢£¥₩]\\)bar\\)"

If I did type the characters themselves in "[x-y]" ranges, I'd have to
figure out a lot of them, because the Unicode general categories are not
simple ranges; they're scattered across the code points.  I need these
general categories, which have these numbers of elements:

((Lu 1438) (Ll 1770) (Lt 31) (Lm 187) (Lo 90794) (Mn 1082)
 (Nl 214) (No 349) (Pd 20) (Pc 10) (Po 318) (Sc 41)
 (Sm 946) (Sk 99) (So 3695) (Co 137468) (Nd 408) (Mc 269)
 (Me 13) (Zs 18) (Zl 1) (Zp 1))

which is way more than I can manually manage.

-- 
: Derick
----------------------------------------------------------------