unofficial mirror of help-gnu-emacs@gnu.org
* Regular expressions for Unicode general categories
@ 2008-12-07 20:47 Derick Eddington
  2008-12-07 23:35 ` Peter Dyballa
  0 siblings, 1 reply; 3+ messages in thread
From: Derick Eddington @ 2008-12-07 20:47 UTC
  To: help-gnu-emacs

Hello,

I am making an Emacs regular expression for matching R6RS Scheme
"identifiers" (part of the syntax highlighting of a major mode I'm
making), and it needs to match characters based on their Unicode
general categories.  It seems Emacs regular expressions do not provide
a way to do that directly (I'm using Emacs 23.0.60.1, and I couldn't
find anything about this in the Info docs, on emacswiki.org, or in this
list's archives), so I computed regular expression character sets for
the needed general categories (using `get-char-code-property') and
placed these in their positions in the larger regular expression.

My problem is I can't use it because I get this error:
  Error during redisplay: (invalid-regexp Regular expression too big)
which is understandable because the general category character sets
are giant and a bunch of them are used, and I suspect they would have
been too inefficient anyway.

So, what can I do?  If Emacs regular expressions' backslash construct
`\cC' supported Unicode general categories, or if there was some
construct which did, I think that would do it nicely.  Is that
planned, or should I resort to doing more manual parsing, or something
else?
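
The "more manual parsing" option could take the form of a font-lock
function matcher that tests each character's general category instead
of compiling the categories into one regular expression.  The following
is only a minimal sketch under stated assumptions: the names
`my-ident-categories', `my-ident-char-p' and `my-match-category-run'
are hypothetical, the category list is an illustrative subset, and the
code does not implement the full R6RS identifier syntax.

```elisp
;; Sketch only: all `my-' names are hypothetical, and the category
;; list is an illustrative subset, not the full R6RS specification.
(defvar my-ident-categories '(Lu Ll Lt Lm Lo)
  "General categories accepted by `my-match-category-run'.")

(defun my-ident-char-p (c)
  "Return non-nil if character C has a category in `my-ident-categories'."
  (and c
       (memq (get-char-code-property c 'general-category)
             my-ident-categories)))

(defun my-match-category-run (limit)
  "Font-lock matcher: find the next run of accepted characters before LIMIT.
On success, set the match data to cover the run, leave point at its
end, and return non-nil.  Return nil if no run is found."
  (let (start)
    ;; Skip characters that cannot be part of a run.
    (while (and (< (point) limit)
                (not (my-ident-char-p (char-after))))
      (forward-char 1))
    (when (< (point) limit)
      (setq start (point))
      ;; Consume the run itself; it is at least one character long.
      (while (and (< (point) limit)
                  (my-ident-char-p (char-after)))
        (forward-char 1))
      (set-match-data (list start (point)))
      t)))
```

Such a function can be used in `font-lock-keywords' where a regexp
would otherwise go, for example as the entry
(my-match-category-run 0 'font-lock-variable-name-face).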

JTMI, identifiers need to be recognized using their complete lexical
specification because I'm also highlighting numbers, whose lexical
syntax overlaps with that of identifiers.  Identifiers therefore need
to be fontified first so that they're not partially fontified as
numbers.
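
The ordering requirement can be illustrated with a small keyword list.
This is a sketch with deliberately simplified, hypothetical regexps
(they are not the R6RS syntax) and a hypothetical variable name,
`my-demo-font-lock-keywords'.  It relies on font-lock's default
behaviour of not overriding text that an earlier keyword has already
fontified.

```elisp
;; Sketch only: the regexps are simplified stand-ins, and the variable
;; name is hypothetical.
(defvar my-demo-font-lock-keywords
  '(;; Identifiers come first, so digits inside them get this face.
    ("[[:alpha:]][[:alnum:]]*" 0 'font-lock-variable-name-face)
    ;; Numbers come second.  Without an OVERRIDE flag, font-lock
    ;; leaves already fontified text alone, so the "1" in "x1" is
    ;; not refontified as a number.
    ("[0-9]+" 0 'font-lock-constant-face))
  "Demo keywords: identifiers are fontified before numbers.")
```

A major mode would typically install the list with something like
(setq-local font-lock-defaults '(my-demo-font-lock-keywords t)).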

Thank you for any help,

-- 
: Derick
----------------------------------------------------------------







* Re: Regular expressions for Unicode general categories
  2008-12-07 20:47 Regular expressions for Unicode general categories Derick Eddington
@ 2008-12-07 23:35 ` Peter Dyballa
  2008-12-08  0:49   ` Derick Eddington
  0 siblings, 1 reply; 3+ messages in thread
From: Peter Dyballa @ 2008-12-07 23:35 UTC
  To: Derick Eddington; +Cc: help-gnu-emacs


On 07.12.2008 at 21:47, Derick Eddington wrote:

> So, what can I do?  If Emacs regular expressions' backslash construct
> `\cC' supported Unicode general categories, or if there was some
> construct which did, I think that would do it nicely.  Is that
> planned, or should I resort to doing more manual parsing, or something
> else?


Can't you use the Unicode characters themselves? In ranges like
[À-Ëà-ë]?

Remember: you're in GNU Emacs 23.0.60, the Unicode Emacs. (But
actually I have no idea about R6RS Scheme or its "identifiers," as you
call these entities.)

--
Greetings

   Pete

A child of five could understand this!  Fetch me a child of five.







* Re: Regular expressions for Unicode general categories
  2008-12-07 23:35 ` Peter Dyballa
@ 2008-12-08  0:49   ` Derick Eddington
  0 siblings, 0 replies; 3+ messages in thread
From: Derick Eddington @ 2008-12-08  0:49 UTC
  To: Peter Dyballa; +Cc: help-gnu-emacs

On Mon, 2008-12-08 at 00:35 +0100, Peter Dyballa wrote:
> On 07.12.2008 at 21:47, Derick Eddington wrote:
> 
> > So, what can I do?  If Emacs regular expressions' backslash construct
> > `\cC' supported Unicode general categories, or if there was some
> > construct which did, I think that would do it nicely.  Is that
> > planned, or should I resort to doing more manual parsing, or something
> > else?
> 
> 
> Can't you use the Unicode characters themselves? In ranges like
> [À-Ëà-ë]?

I'm using `rx-to-string' on my computed character sets (i.e., on my one
big SRE (s-expression regular expression), which has sub-SREs of the
form `(char . ,<list-of-characters>)).  `rx-to-string' consolidates the
characters into ranges, and I'm assuming it does so as much as
possible, so I think I'm already using ranges wherever I can.  Here's a
modified, simplified version of what I'm doing; it shows that
`rx-to-string' is computing ranges:

(require 'rx)

(let* ((general-categories
        ;; Alist mapping each general category to the characters in it.
        (let ((al (list (list 'Po) (list 'Sc))) ;; removed a bunch of others
              (c 0))
          ;; Walk every code point, skipping the surrogate range.
          (while (< c #x110000)
            (unless (and (<= #xD800 c) (<= c #xDFFF))
              (let* ((gc (get-char-code-property c 'general-category))
                     (a (assq gc al)))
                ;; Record C only if its category is one we collect.
                (when a (setcdr a (cons c (cdr a))))))
            (setq c (1+ c)))
          al))
       ;; Build an rx `char' form from one category's characters.
       (char-set (lambda (gc) `(char . ,(cdr (assq gc general-categories)))))
       (Po (funcall char-set 'Po))
       (Sc (funcall char-set 'Sc))
       ;; removed a bunch of other stuff
       (thing `(seq "foo" (or ,Po ,Sc) "bar")))
  (rx-to-string thing))
=> 
"\\(?:foo\\(?:[!-#%-'*,./:;?@\\¡·¿;·՚-՟։׀׃׆׳״؉؊،؍؛؞؟٪-٭۔܀-܍߷-߹।॥॰෴๏๚๛༄-༒྅࿐-࿔၊-၏჻፡-፨᙭᙮᛫-᛭᜵᜶។-៖៘-៚᠀-᠅᠇-᠊᥄᥅᧞᧟᨞᨟᭚-᭠᰻-᰿᱾᱿‖‗†-‧‰-‸※-‾⁁-⁃⁇-⁑⁓⁕-⁞⳹-⳼⳾⳿⸀⸁⸆-⸈⸋⸎-⸖⸘⸙⸛⸞⸟⸪-⸮⸰、-〃〽・꘍-꘏꙳꙾꡴-꡷꣎꣏꤮꤯꥟꩜-꩟︐-︖︙︰﹅﹆﹉-﹌﹐-﹒﹔-﹗﹟-﹡﹨﹪﹫!-#%-'*,./:;?@\。、・𐄀𐄁𐎟𐏐𐤟𐤿𐩐-𐩘𒑰-𒑳]\\|[$¢-¥؋৲৳૱௹฿៛₠-₵﷼﹩$¢£¥₩]\\)bar\\)"

If I did type the characters themselves in "[x-y]" ranges, I'd have to
figure out a lot of them because the Unicode general categories are not
simple ranges; they're scattered across the code points.  I need these
general categories, which have these numbers of elements:

((Lu 1438) (Ll 1770) (Lt 31) (Lm 187) (Lo 90794) (Mn 1082) (Nl 214) 
(No 349) (Pd 20) (Pc 10) (Po 318) (Sc 41) (Sm 946) (Sk 99) (So 3695) 
(Co 137468) (Nd 408) (Mc 269) (Me 13) (Zs 18) (Zl 1) (Zp 1))

which is way more than I can manually manage.
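
Counts like these can be gathered with a loop over all code points.
The following is a sketch, not the exact code used for the list above;
the function name `my-count-general-categories' is hypothetical.  The
results depend on the Unicode data shipped with the Emacs build, so
they may differ from the numbers quoted here.

```elisp
;; Sketch only: `my-count-general-categories' is a hypothetical name.
(defun my-count-general-categories (categories)
  "Return an alist of (CATEGORY COUNT) for each symbol in CATEGORIES.
COUNT is the number of code points whose general category is CATEGORY."
  (let ((counts (mapcar (lambda (gc) (list gc 0)) categories))
        (c 0))
    (while (< c #x110000)
      ;; Skip the surrogate range.
      (unless (and (<= #xD800 c) (<= c #xDFFF))
        (let ((cell (assq (get-char-code-property c 'general-category)
                          counts)))
          ;; Increment the counter stored as the second list element.
          (when cell
            (setcar (cdr cell) (1+ (cadr cell))))))
      (setq c (1+ c)))
    counts))
```

For example, (my-count-general-categories '(Po Sc)) returns an alist
with one (CATEGORY COUNT) entry per requested category.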

-- 
: Derick
----------------------------------------------------------------






