unofficial mirror of help-gnu-emacs@gnu.org
From: Derick Eddington <derick.eddington@gmail.com>
To: Peter Dyballa <Peter_Dyballa@Web.DE>
Cc: help-gnu-emacs@gnu.org
Subject: Re: Regular expressions for Unicode general categories
Date: Sun, 07 Dec 2008 16:49:55 -0800
Message-ID: <1228697395.4393.103.camel@eep>
In-Reply-To: <9F1B5061-5B57-4370-8681-A10D9B7FAE9B@Web.DE>

On Mon, 2008-12-08 at 00:35 +0100, Peter Dyballa wrote:
> On 07.12.2008 at 21:47, Derick Eddington wrote:
> 
> > So, what can I do?  If Emacs regular expressions' backslash construct
> > `\cC' supported Unicode general categories, or if there was some
> > construct which did, I think that would do it nicely.  Is that
> > planned, or should I resort to doing more manual parsing, or something
> > else?
> 
> 
> Can't you use the Unicode characters themselves? In ranges like
> [À-Ëà-ë]?

I'm building my character sets programmatically and passing them to
`rx-to-string': my one big SRE (s-expression regular expression) has
sub-SREs of the form `(char . ,<list-of-characters>).  `rx-to-string'
consolidates the characters into ranges, and I'm assuming it does so as
much as possible, so I think I'm already using ranges wherever they
apply.  Here's a modified, simplified version of what I'm doing; the
result shows `rx-to-string' computing ranges:

(require 'rx)

(let* ((general-categories
        (let ((al (list (list 'Po) (list 'Sc))) ;; removed a bunch of others
              (c 0))
          (while (< c #x110000)
            (unless (and (<= #xD800 c) (<= c #xDFFF))
              (let* ((gc (get-char-code-property c 'general-category))
                     (a (assq gc al)))
                (when a (setcdr a (cons c (cdr a))))))
            (setq c (1+ c)))
          al))
       (char-set (lambda (gc) `(char . ,(cdr (assq gc general-categories)))))
       (Po (funcall char-set 'Po))
       (Sc (funcall char-set 'Sc))
       ;; removed a bunch of other stuff
       (thing `(seq "foo" (or ,Po ,Sc) "bar")))
  (rx-to-string thing))
=> 
"\\(?:foo\\(?:[!-#%-'*,./:;?@\\¡·¿;·՚-՟։׀׃׆׳״؉؊،؍؛؞؟٪-٭۔܀-܍߷-߹।॥॰෴๏๚๛༄-༒྅࿐-࿔၊-၏჻፡-፨᙭᙮᛫-᛭᜵᜶។-៖៘-៚᠀-᠅᠇-᠊᥄᥅᧞᧟᨞᨟᭚-᭠᰻-᰿᱾᱿‖‗†-‧‰-‸※-‾⁁-⁃⁇-⁑⁓⁕-⁞⳹-⳼⳾⳿⸀⸁⸆-⸈⸋⸎-⸖⸘⸙⸛⸞⸟⸪-⸮⸰、-〃〽・꘍-꘏꙳꙾꡴-꡷꣎꣏꤮꤯꥟꩜-꩟︐-︖︙︰﹅﹆﹉-﹌﹐-﹒﹔-﹗﹟-﹡﹨﹪﹫!-#%-'*,./:;?@\。、・𐄀𐄁𐎟𐏐𐤟𐤿𐩐-𐩘𒑰-𒑳]\\|[$¢-¥؋৲৳૱௹฿៛₠-₵﷼﹩$¢£¥₩]\\)bar\\)"

If I typed the characters themselves in "[x-y]" ranges, I'd have to
work out a lot of them by hand, because the Unicode general categories
are not simple ranges; they're scattered across the code points.  I
need these general categories, which have these numbers of elements:

((Lu 1438) (Ll 1770) (Lt 31) (Lm 187) (Lo 90794) (Mn 1082) (Nl 214) 
(No 349) (Pd 20) (Pc 10) (Po 318) (Sc 41) (Sm 946) (Sk 99) (So 3695) 
(Co 137468) (Nd 408) (Mc 269) (Me 13) (Zs 18) (Zl 1) (Zp 1))

which is way more than I can manually manage.
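
A minimal sketch of how per-category counts like the ones above could be
tallied, assuming `get-char-code-property' returns the general category
as a symbol for every code point.  The function name
`my-general-category-counts' is hypothetical, and this is not
necessarily the code that produced the numbers above; the totals depend
on the Unicode data bundled with the Emacs in use.

```elisp
;; Hypothetical helper: tally code points per Unicode general category.
(defun my-general-category-counts ()
  "Return an alist of (CATEGORY . COUNT) over all non-surrogate code points."
  (let ((counts '())
        (c 0))
    (while (< c #x110000)
      ;; Skip the surrogate range, as in the loop earlier in this message.
      (unless (and (<= #xD800 c) (<= c #xDFFF))
        (let* ((gc (get-char-code-property c 'general-category))
               (cell (assq gc counts)))
          (if cell
              ;; Category already seen: bump its counter in place.
              (setcdr cell (1+ (cdr cell)))
            ;; First code point of this category: start a new entry.
            (push (cons gc 1) counts))))
      (setq c (1+ c)))
    counts))

;; Usage: look up the count for a single category.
(cdr (assq 'Po (my-general-category-counts)))
```

The scan visits every code point, so it can take a moment to run.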

-- 
: Derick