* Regular expressions for Unicode general categories
From: Derick Eddington @ 2008-12-07 20:47 UTC
To: help-gnu-emacs
Hello,
I am making an Emacs regular expression for matching R6RS Scheme
"identifiers" (part of the syntax highlighting of a major mode I'm
making), and it needs to match characters based on their Unicode
general categories. Emacs regular expressions do not seem to provide
a way to do that directly; I'm using Emacs 23.0.60.1 and couldn't find
anything about it in the Info docs, on emacswiki.org, or in this list's
archives. So I computed regular expression character sets for the
needed general categories (using `get-char-code-property') and placed
them at their positions in the larger regular expression.
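For illustration, a single lookup with `get-char-code-property' looks
roughly like this. The three characters are arbitrary examples picked
for this sketch; the results in the comments are the general categories
Unicode assigns to them.

```elisp
;; Each call returns the general category of one character as a symbol.
(get-char-code-property ?A 'general-category)   ; Lu (uppercase letter)
(get-char-code-property ?$ 'general-category)   ; Sc (currency symbol)
(get-char-code-property ?! 'general-category)   ; Po (other punctuation)
```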
My problem is I can't use it because I get this error:
Error during redisplay: (invalid-regexp Regular expression too big)
which is understandable because the general category character sets
are giant and several of them are used; I suspect they would have
been too inefficient anyway.
So, what can I do? If Emacs regular expressions' backslash construct
`\cC' supported Unicode general categories, or if there was some
construct which did, I think that would do it nicely. Is that
planned, or should I resort to doing more manual parsing, or something
else?
JTMI, identifiers need to be recognized using their complete lexical
specification because I'm also highlighting numbers, whose lexical
syntax overlaps with that of identifiers. Identifiers therefore have to
be fontified first so that they're not partially fontified as numbers.
Thank you for any help,
--
: Derick
----------------------------------------------------------------
* Re: Regular expressions for Unicode general categories
From: Peter Dyballa @ 2008-12-07 23:35 UTC
To: Derick Eddington; +Cc: help-gnu-emacs
On 07.12.2008 at 21:47, Derick Eddington wrote:
> So, what can I do? If Emacs regular expressions' backslash construct
> `\cC' supported Unicode general categories, or if there was some
> construct which did, I think that would do it nicely. Is that
> planned, or should I resort to doing more manual parsing, or something
> else?
Can't you use the Unicode characters themselves, in ranges like
[À-Ëà-ë]?
Remember: you're in GNU Emacs 23.0.60, the Unicode Emacs. (But
actually I know nothing about R6RS Scheme or its "identifiers," as you
call these entities.)
--
Greetings
Pete
A child of five could understand this! Fetch me a child of five.
* Re: Regular expressions for Unicode general categories
From: Derick Eddington @ 2008-12-08 0:49 UTC
To: Peter Dyballa; +Cc: help-gnu-emacs
On Mon, 2008-12-08 at 00:35 +0100, Peter Dyballa wrote:
> On 07.12.2008 at 21:47, Derick Eddington wrote:
>
> > So, what can I do? If Emacs regular expressions' backslash construct
> > `\cC' supported Unicode general categories, or if there was some
> > construct which did, I think that would do it nicely. Is that
> > planned, or should I resort to doing more manual parsing, or something
> > else?
>
>
> Can't you use the Unicode characters themselves? In ranges like
> [À-Ëà-ë]?
I'm using `rx-to-string' on my computed character sets. That is, I call
it on one big SRE (s-expression regular expression) whose sub-SREs are
built as `(char . ,<list-of-characters>). `rx-to-string' consolidates
the characters into ranges, and I assume it does so as much as possible,
so I think I'm already using ranges wherever I can. Here is a modified,
simplified version of what I'm doing; it shows that `rx-to-string' does
compute ranges:
(require 'rx)

(let* ((general-categories
        (let ((al (list (list 'Po) (list 'Sc))) ;; removed a bunch of others
              (c 0))
          (while (< c #x110000)
            (unless (and (<= #xD800 c) (<= c #xDFFF))
              (let* ((gc (get-char-code-property c 'general-category))
                     (a (assq gc al)))
                (when a (setcdr a (cons c (cdr a))))))
            (setq c (1+ c)))
          al))
       (char-set (lambda (gc) `(char . ,(cdr (assq gc general-categories)))))
       (Po (funcall char-set 'Po))
       (Sc (funcall char-set 'Sc))
       ;; removed a bunch of other stuff
       (thing `(seq "foo" (or ,Po ,Sc) "bar")))
  (rx-to-string thing))
=>
"\\(?:foo\\(?:[!-#%-'*,./:;?@\\¡·¿;·՚-՟։׀׃׆׳״؉؊،؍؛؞؟٪-٭۔܀-܍߷-߹।॥॰෴๏๚๛༄-༒྅࿐-࿔၊-၏჻፡-፨᙭᙮᛫-᛭᜵᜶។-៖៘-៚᠀-᠅᠇-᠊᥄᥅᧞᧟᨞᨟᭚-᭠᰻-᰿᱾᱿‖‗†-‧‰-‸※-‾⁁-⁃⁇-⁑⁓⁕-⁞⳹-⳼⳾⳿⸀⸁⸆-⸈⸋⸎-⸖⸘⸙⸛⸞⸟⸪-⸮⸰、-〃〽・꘍-꘏꙳꙾꡴-꡷꣎꣏꤮꤯꥟꩜-꩟︐-︖︙︰﹅﹆﹉-﹌﹐-﹒﹔-﹗﹟-﹡﹨﹪﹫!-#%-'*,./:;?@\。、・𐄀𐄁𐎟𐏐𐤟𐤿𐩐-𐩘𒑰-𒑳]\\|[$¢-¥؋৲৳૱௹฿៛₠-₵﷼﹩$¢£¥₩]\\)bar\\)"
If I typed the characters themselves in "[x-y]" ranges, I'd have to
work out a lot of them by hand, because the Unicode general categories
are not simple ranges; they're scattered across the code points. I need
these general categories, which have these numbers of elements:
((Lu 1438) (Ll 1770) (Lt 31) (Lm 187) (Lo 90794) (Mn 1082) (Nl 214)
(No 349) (Pd 20) (Pc 10) (Po 318) (Sc 41) (Sm 946) (Sk 99) (So 3695)
(Co 137468) (Nd 408) (Mc 269) (Me 13) (Zs 18) (Zl 1) (Zp 1))
which is way more than I can manually manage.
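As for the manual-parsing route raised in the first message, a minimal
sketch might look like the following. It checks each character's
general category at match time instead of compiling the categories into
the regexp. The names `my-constituent-categories',
`my-constituent-char-p' and `my-skip-constituents' are hypothetical, the
category list is only the subset assumed here to apply to non-ASCII
identifier constituents, and the sketch does not implement the full
R6RS identifier grammar.

```elisp
;; Assumption: these are the categories allowed for non-ASCII
;; identifier constituents; adjust the list to the real grammar.
(defconst my-constituent-categories
  '(Lu Ll Lt Lm Lo Mn Nl No Pd Pc Po Sc Sm Sk So Co))

(defun my-constituent-char-p (c)
  "Return non-nil if C is a non-ASCII character in a listed category."
  (and c
       (> c 127)
       (memq (get-char-code-property c 'general-category)
             my-constituent-categories)))

(defun my-skip-constituents (limit)
  "Move point forward over constituent characters, stopping at LIMIT.
Return the new value of point."
  (while (and (< (point) limit)
              (my-constituent-char-p (char-after)))
    (forward-char 1))
  (point))
```

A font-lock matcher function, which receives a search limit as its
argument, could probably call `my-skip-constituents' with that limit and
set the match data over the skipped text, so no giant character set ever
reaches the regexp compiler.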
--
: Derick
----------------------------------------------------------------