From: Derick Eddington <derick.eddington@gmail.com>
To: Peter Dyballa <Peter_Dyballa@Web.DE>
Cc: help-gnu-emacs@gnu.org
Subject: Re: Regular expressions for Unicode general categories
Date: Sun, 07 Dec 2008 16:49:55 -0800
Message-ID: <1228697395.4393.103.camel@eep>
In-Reply-To: <9F1B5061-5B57-4370-8681-A10D9B7FAE9B@Web.DE>

On Mon, 2008-12-08 at 00:35 +0100, Peter Dyballa wrote:
> Am 07.12.2008 um 21:47 schrieb Derick Eddington:
>
> > So, what can I do? If Emacs regular expressions' backslash construct
> > `\cC' supported Unicode general categories, or if there was some
> > construct which did, I think that would do it nicely. Is that
> > planned, or should I resort to doing more manual parsing, or something
> > else?
>
>
> Can't you use the Unicode characters themselves? In ranges like
> [À-Ëà-ë]?
I'm using `rx-to-string' on my computed character sets: I call it on one
big SRE (s-expression regular expression) that has sub-SREs of the form
`(char . ,<list-of-characters>), and `rx-to-string' consolidates the
characters into ranges. I assume it consolidates as much as possible, so
I think I'm already using ranges wherever they apply. Here's a modified,
simplified version of what I'm doing; it shows `rx-to-string' computing
the ranges:
(require 'rx)

(let* ((general-categories
        ;; Alist mapping each general category to the characters in it.
        (let ((al (list (list 'Po) (list 'Sc))) ;; removed a bunch of others
              (c 0))
          (while (< c #x110000)
            ;; Skip the surrogate code points.
            (unless (and (<= #xD800 c) (<= c #xDFFF))
              (let* ((gc (get-char-code-property c 'general-category))
                     (a (assq gc al)))
                (when a (setcdr a (cons c (cdr a))))))
            (setq c (1+ c)))
          al))
       ;; Build a `(char ...)' SRE from the characters of one category.
       (char-set (lambda (gc) `(char . ,(cdr (assq gc general-categories)))))
       (Po (funcall char-set 'Po))
       (Sc (funcall char-set 'Sc))
       ;; removed a bunch of other stuff
       (thing `(seq "foo" (or ,Po ,Sc) "bar")))
  (rx-to-string thing))
=>
"\\(?:foo\\(?:[!-#%-'*,./:;?@\\¡·¿;·՚-՟։׀׃׆׳״؉؊،؍؛؞؟٪-٭۔܀-܍߷-߹।॥॰෴๏๚๛༄-༒྅࿐-࿔၊-၏჻፡-፨᙭᙮᛫-᛭᜵᜶។-៖៘-៚᠀-᠅᠇-᠊᥄᥅᧞᧟᨞᨟᭚-᭠᰻-᰿᱾᱿‖‗†-‧‰-‸※-‾⁁-⁃⁇-⁑⁓⁕-⁞⳹-⳼⳾⳿⸀⸁⸆-⸈⸋⸎-⸖⸘⸙⸛⸞⸟⸪-⸮⸰、-〃〽・꘍-꘏꙳꙾꡴-꡷꣎꣏꤮꤯꥟꩜-꩟︐-︖︙︰﹅﹆﹉-﹌﹐-﹒﹔-﹗﹟-﹡﹨﹪﹫!-#%-'*,./:;?@\。、・𐄀𐄁𐎟𐏐𐤟𐤿𐩐-𐩘𒑰-𒑳]\\|[$¢-¥؋৲৳૱௹฿៛₠-₵﷼﹩$¢£¥₩]\\)bar\\)"
If I did type the characters themselves in "[x-y]" ranges, I'd have to
figure out a lot of them, because the Unicode general categories are not
simple ranges: they're scattered across the code points. I need these
general categories, which have these numbers of elements:
((Lu 1438) (Ll 1770) (Lt 31) (Lm 187) (Lo 90794) (Mn 1082) (Nl 214)
(No 349) (Pd 20) (Pc 10) (Po 318) (Sc 41) (Sm 946) (Sk 99) (So 3695)
(Co 137468) (Nd 408) (Mc 269) (Me 13) (Zs 18) (Zl 1) (Zp 1))
which is way more than I can manually manage.
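
As a minimal sketch, the per-category computation above could be
packaged as a reusable helper. The names `my-chars-in-general-category'
and `my-general-category-regexp' are hypothetical (they are not part of
Emacs or rx), and the sketch assumes, as the code above does, that
`get-char-code-property' returns the general category as a symbol:

```elisp
(require 'rx)

(defun my-chars-in-general-category (category &optional from to)
  "Return the characters in [FROM, TO) whose general category is CATEGORY.
CATEGORY is a symbol such as `Sc'.  FROM defaults to 0 and TO to
#x110000.  Hypothetical helper, not part of Emacs."
  (let ((c (or from 0))
        (end (or to #x110000))
        (chars '()))
    (while (< c end)
      ;; Skip the surrogate code points, as in the code above.
      (unless (and (<= #xD800 c) (<= c #xDFFF))
        (when (eq (get-char-code-property c 'general-category) category)
          (push c chars)))
      (setq c (1+ c)))
    ;; `push' collected the characters in descending order.
    (nreverse chars)))

(defun my-general-category-regexp (category &optional from to)
  "Return a regexp matching one character of general category CATEGORY.
The character list is handed to `rx-to-string', which merges adjacent
characters into ranges.  Hypothetical helper, not part of Emacs."
  (rx-to-string
   `(any ,@(my-chars-in-general-category category from to))
   t))
```

For example, restricted to the first 256 code points:

  (my-chars-in-general-category 'Sc 0 #x100)
  ;; => (36 162 163 164 165), i.e. $ ¢ £ ¤ ¥

Scanning the whole code space for each category repeats the work, so
for many categories a single pass that fills an alist, as in the code
above, is the cheaper design.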
--
: Derick