Regexps and strings once again

unofficial mirror of emacs-devel@gnu.org 

* Regexps and strings once again
@ 2014-09-14 23:27 Lars Magne Ingebrigtsen
  2014-09-15  0:50 ` Daniel Colascione
  2014-09-15  1:38 ` Yuri Khan
  0 siblings, 2 replies; 15+ messages in thread
From: Lars Magne Ingebrigtsen @ 2014-09-14 23:27 UTC (permalink / raw)
  To: emacs-devel

(Skip to 1) if you're not interested in why I started thinking about
this now.)

I was just fiddling around with a DOM traversal library (i.e., "document
object model", or something -- HTML traversal, like), and it has
functions for finding nodes by various criteria, like IDs.  So there are
functions like `dom-by-id' that take a DOM fragment and an ID and
return the matching nodes.

I wrote the function as taking a regexp.  And I find what I'm doing
wrong 90% of the time when using it is that I expect an exact match, but
instead I'm getting all matching nodes.

This reminded me of this pretty general problem once again.  We have
oodles of functions in Emacs that do matching either on exact(ish)
strings, or regexps, and then we have an optional parameter that says
whether we want to interpret the string as an exact string or a
regexp.

It's kinda annoying, especially when the function defaults to the
interpretation you don't want.  And you have to remember which optional
parameter you're supposed to set.

So:  Here's yet another suggestion for how to deal with regexps in a
more general way in Emacs.  Or rather two.

1) New Special Syntax

A while ago, there was some suggestion about introducing a special
syntax for string literals, and it didn't really go anywhere, because
introducing a new syntax to Emacs is kinda a big deal.  But let's just
suggest it anyway:

(dom-by-id dom #/I (can)?haz new syntax/)

And see!  Perl Regexp syntax as well!  No more backslashitis!

Anyway, I assume that everybody would want this, but that it's too much
work for anybody to actually commit to.

2) Cheat; i.e., introduce a convention

What if we just mark a string as a regexp?

(dom-by-id dom (regexp "I \\(couldn't\\)?haz new syntax"))

It would basically just put a text property on the string, and functions
like `dom-by-id' would just do

(if (regexp-p match)
    (string-match match id)
  (string= match id))

Of course, both `regexp' and the proposed new syntax could compile the
regexp and return a regexp object and stuff if we wanted to be more
efficient...  But the regexp cache is already quite efficient, isn't it?
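
A minimal sketch of the convention in 2), assuming the tag is a text
property on the first character of the string.  The names `my-regexp',
`my-regexp-p' and `my-id-match-p' are hypothetical; nothing like them
exists in Emacs.

```elisp
;; Hypothetical helpers illustrating the "mark a string as a regexp" idea.

(defun my-regexp (string)
  "Return a copy of STRING tagged as a regexp via a text property."
  (propertize string 'my-regexp t))

(defun my-regexp-p (object)
  "Return non-nil if OBJECT is a non-empty string tagged by `my-regexp'."
  (and (stringp object)
       (> (length object) 0)
       (get-text-property 0 'my-regexp object)))

(defun my-id-match-p (match id)
  "Return non-nil if ID matches MATCH.
MATCH is treated as a regexp if tagged by `my-regexp', and as an
exact string otherwise."
  (if (my-regexp-p match)
      (and (string-match-p match id) t)
    ;; `string=' ignores text properties, so plain strings compare as usual.
    (string= match id)))
```

One limitation of this approach: an empty string has no characters to
carry the property, and any function that strips text properties (such
as `substring-no-properties') silently turns the regexp back into a
literal.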

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no


* Re: Regexps and strings once again
  2014-09-14 23:27 Regexps and strings once again Lars Magne Ingebrigtsen
@ 2014-09-15  0:50 ` Daniel Colascione
  2014-09-15  2:14   ` Stefan Monnier
  2014-09-15  6:39   ` Lars Magne Ingebrigtsen
  2014-09-15  1:38 ` Yuri Khan
  1 sibling, 2 replies; 15+ messages in thread
From: Daniel Colascione @ 2014-09-15  0:50 UTC (permalink / raw)
  To: emacs-devel


On 09/14/2014 04:27 PM, Lars Magne Ingebrigtsen wrote:
> 
> (dom-by-id dom (regexp "I \\(couldn't\\)?haz new syntax"))
> 
> It would basically just put a text property on the string, and functions
> like `dom-by-id' would just do
> 
> (if (regexp-p match)
>     (string-match match id)
>   (string= match id))
> 
> Of course, both `regexp' and the proposed new syntax could compile the
> regexp and return a regexp object and stuff if we wanted to be more
> efficient...  But the regexp cache is already quite efficient, isn't it?
> 

I've been working on an NFA combinator facility lately. The basic idea
is that you don't work in terms of regular expressions per se, but in
terms of state-matching machines (like the ones Ragel has) that you can
combine using the standard union, repeat, negative, and intersection
operators. You'd then build a matcher from an NFA-representation object
when you needed to build a recognizer. Stefan has something similar in
ELPA --- lex.el --- except his code seems to do the conversion in one
shot instead of supporting the incremental building of matching machines.



* Re: Regexps and strings once again
  2014-09-15  0:50 ` Daniel Colascione
@ 2014-09-15  2:14   ` Stefan Monnier
  2014-09-15  3:41     ` Daniel Colascione
  2014-09-15 10:04     ` Lars Magne Ingebrigtsen
  2014-09-15  6:39   ` Lars Magne Ingebrigtsen
  1 sibling, 2 replies; 15+ messages in thread
From: Stefan Monnier @ 2014-09-15  2:14 UTC (permalink / raw)
  To: Daniel Colascione; +Cc: emacs-devel

> On 09/14/2014 04:27 PM, Lars Magne Ingebrigtsen wrote:
>> (dom-by-id dom (regexp "I \\(couldn't\\)?haz new syntax"))

`regexp' could just as well take a new syntax.  And it doesn't have to
return a string with a funny text-property but can really return a new
kind of object.

I think it could be fairly elegant.

>>>>> "Daniel" == Daniel Colascione <dancol@dancol.org> writes:
> I've been working on an NFA combinator facility lately.  The basic idea
> is that you don't work in terms of regular expressions per se, but in
> terms of state-matching machines (like the ones Ragel has) that you can
> combine using the standard union, repeat, negative, and intersection
> operators.

Do you really mean NFA or are you actually manipulating DFAs?
If NFAs, how do you implement intersection?

> Stefan has something similar in ELPA --- lex.el --- except his code
> seems to do the conversion in one shot instead of supporting the
> incremental building of matching machines.

Indeed.


        Stefan




* Re: Regexps and strings once again
  2014-09-15  2:14   ` Stefan Monnier
@ 2014-09-15  3:41     ` Daniel Colascione
  2014-09-15 12:52       ` Stefan Monnier
  2014-09-15 10:04     ` Lars Magne Ingebrigtsen
  1 sibling, 1 reply; 15+ messages in thread
From: Daniel Colascione @ 2014-09-15  3:41 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: emacs-devel


On 09/14/2014 07:14 PM, Stefan Monnier wrote:
>> On 09/14/2014 04:27 PM, Lars Magne Ingebrigtsen wrote:
>>> (dom-by-id dom (regexp "I \\(couldn't\\)?haz new syntax"))
> 
> `regexp' could just as well take a new syntax.  And it doesn't have to
> return a string with a funny text-property but can really return a new
> kind of object.
> 
> I think it could be fairly elegant.
> 
>>>>>> "Daniel" == Daniel Colascione <dancol@dancol.org> writes:
>> I've been working on an NFA combinator facility lately.  The basic idea
>> is that you don't work in terms of regular expressions per se, but in
>> terms of state-matching machines (like the ones Ragel has) that you can
>> combine using the standard union, repeat, negative, and intersection
>> operators.
> 
> Do you really mean NFA or are you actually manipulating DFAs?
> If NFAs, how do you implement intersection?

Actual NFAs --- the kinda-sorta working code is in the Jezebel repo;
I've been adding arbitrary predicate support. DFA construction just
produces an NFA object that happens not to contain any ambiguity, which
recognizer generators can treat specially. I haven't actually
implemented intersection yet, although I suspect the dumb union-negation
algorithm should be good enough after DFA construction and minimization.
If it isn't, I'll just see what Ragel does and do that.




* Re: Regexps and strings once again
  2014-09-15  3:41     ` Daniel Colascione
@ 2014-09-15 12:52       ` Stefan Monnier
  0 siblings, 0 replies; 15+ messages in thread
From: Stefan Monnier @ 2014-09-15 12:52 UTC (permalink / raw)
  To: Daniel Colascione; +Cc: emacs-devel

> recognizer generators can treat specially.  I haven't actually
> implemented intersection yet, although I suspect the dumb union-negation
> algorithm should be good enough after DFA construction and minimization.

Damn!


        Stefan "Still looking for intersection on NFAs"




* Re: Regexps and strings once again
  2014-09-15  2:14   ` Stefan Monnier
  2014-09-15  3:41     ` Daniel Colascione
@ 2014-09-15 10:04     ` Lars Magne Ingebrigtsen
  2014-09-15 10:26       ` Andreas Schwab
  2014-09-15 12:56       ` Stefan Monnier
  1 sibling, 2 replies; 15+ messages in thread
From: Lars Magne Ingebrigtsen @ 2014-09-15 10:04 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: emacs-devel

Stefan Monnier <monnier@iro.umontreal.ca> writes:

>> On 09/14/2014 04:27 PM, Lars Magne Ingebrigtsen wrote:
>>> (dom-by-id dom (regexp "I \\(couldn't\\)?haz new syntax"))
>
> `regexp' could just as well take a new syntax.

Sure, it could take Perl regexps, but having the argument not be a
string would be a stretch, wouldn't it?

If we want to do a more string-ey syntax, but not require wrapping it in
(regexp ...), then we could have something like:

#r"This is (not )?a Perl regexp"

for regexp literals.  Quoting " characters would still be necessary, but
I don't think that's all that important.  (And quoting quote chars is
less annoying than quoting slashes.)

If you're constructing the regexp from strings, you'd need the `regexp'
call to turn it into a regexp object.

I kinda envision all the functions that currently have a regexp option,
or a regexp version, to also take a regexp object, no matter how it's
defined.  Like `search-forward'/`re-search-forward'...

And this could be done gradually once we've introduced the regexp object
type, so it doesn't seem like an insurmountable change, if somebody
wanted to work on this...
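
A rough sketch of what that could look like, assuming a regexp object is
just a struct wrapping the regexp source.  The type `my-rx-object' and
the function `my-search-forward' are hypothetical and only illustrate
the dispatch; a real implementation would live in C and could hold a
compiled pattern.

```elisp
(require 'cl-lib)

;; Hypothetical regexp object type: a struct holding the regexp source.
(cl-defstruct (my-rx-object (:constructor my-rx-object-create (source)))
  source)

(defun my-search-forward (pattern &optional bound)
  "Search forward from point for PATTERN and return the new point, or nil.
PATTERN is either a plain string, searched for literally, or a
`my-rx-object', searched for as a regexp.  BOUND limits the search."
  (if (my-rx-object-p pattern)
      (re-search-forward (my-rx-object-source pattern) bound t)
    (search-forward pattern bound t)))
```

With something like this, a call such as (my-search-forward "a.c") looks
for the three literal characters, while
(my-search-forward (my-rx-object-create "a.c")) treats the dot as a
wildcard, and callers no longer need to remember which function or
optional argument selects which behaviour.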

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no


* Re: Regexps and strings once again
  2014-09-15 10:04     ` Lars Magne Ingebrigtsen
@ 2014-09-15 10:26       ` Andreas Schwab
  2014-09-15 10:33         ` Lars Magne Ingebrigtsen
  2014-09-15 12:56       ` Stefan Monnier
  1 sibling, 1 reply; 15+ messages in thread
From: Andreas Schwab @ 2014-09-15 10:26 UTC (permalink / raw)
  To: Lars Magne Ingebrigtsen; +Cc: Stefan Monnier, emacs-devel

Lars Magne Ingebrigtsen <larsi@gnus.org> writes:

> #r"This is (not )?a Perl regexp"
>
> for regexp literals.  Quoting " characters would still be necessary, but
> I don't think that's all that important.  (And quoting quote chars is
> less annoying than quoting slashes.)

You don't need a new literal syntax for establishing an additional
regexp syntax.

Andreas.

-- 
Andreas Schwab, SUSE Labs, schwab@suse.de
GPG Key fingerprint = 0196 BAD8 1CE9 1970 F4BE  1748 E4D4 88E3 0EEA B9D7
"And now for something completely different."




* Re: Regexps and strings once again
  2014-09-15 10:26       ` Andreas Schwab
@ 2014-09-15 10:33         ` Lars Magne Ingebrigtsen
  0 siblings, 0 replies; 15+ messages in thread
From: Lars Magne Ingebrigtsen @ 2014-09-15 10:33 UTC (permalink / raw)
  To: Andreas Schwab; +Cc: Stefan Monnier, emacs-devel

Andreas Schwab <schwab@suse.de> writes:

> You don't need a new literal syntax for establishing an additional
> regexp syntax.

No, they're kinda orthogonal.  But introducing both at the same time
might be nice.

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no




* Re: Regexps and strings once again
  2014-09-15 10:04     ` Lars Magne Ingebrigtsen
  2014-09-15 10:26       ` Andreas Schwab
@ 2014-09-15 12:56       ` Stefan Monnier
  1 sibling, 0 replies; 15+ messages in thread
From: Stefan Monnier @ 2014-09-15 12:56 UTC (permalink / raw)
  To: Lars Magne Ingebrigtsen; +Cc: emacs-devel

> If we want to do a more string-ey syntax, but not require wrapping it in
> (regexp ...), then we could have something like:
> #r"This is (not )?a Perl regexp"

The difference between #r"REGEXP" and (SYMBOL REGEXP) is enormous in
terms of tool support.  So if you don't like (regexp "foo"), I'd
recommend you try a shorter symbol first.


        Stefan




* Re: Regexps and strings once again
  2014-09-15  0:50 ` Daniel Colascione
  2014-09-15  2:14   ` Stefan Monnier
@ 2014-09-15  6:39   ` Lars Magne Ingebrigtsen
  2014-09-15  7:08     ` Daniel Colascione
  1 sibling, 1 reply; 15+ messages in thread
From: Lars Magne Ingebrigtsen @ 2014-09-15  6:39 UTC (permalink / raw)
  To: Daniel Colascione; +Cc: emacs-devel

Daniel Colascione <dancol@dancol.org> writes:

> I've been working on an NFA combinator facility lately.

Let's see...  National Firearms Association?  New Farmers of America?  I
give up.  >"?

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no




* Re: Regexps and strings once again
  2014-09-15  6:39   ` Lars Magne Ingebrigtsen
@ 2014-09-15  7:08     ` Daniel Colascione
  0 siblings, 0 replies; 15+ messages in thread
From: Daniel Colascione @ 2014-09-15  7:08 UTC (permalink / raw)
  To: Lars Magne Ingebrigtsen; +Cc: emacs-devel


On 09/14/2014 11:39 PM, Lars Magne Ingebrigtsen wrote:
> Daniel Colascione <dancol@dancol.org> writes:
> 
>> I've been working on an NFA combinator facility lately.
> 
> Let's see...  National Firearms Association?  New Farmers of America?  I
> give up.  >"?
> 

Non-deterministic Finite Automaton. If you want, I can describe how
these things work.




* Re: Regexps and strings once again
  2014-09-14 23:27 Regexps and strings once again Lars Magne Ingebrigtsen
  2014-09-15  0:50 ` Daniel Colascione
@ 2014-09-15  1:38 ` Yuri Khan
  2014-09-15  9:22   ` Andreas Schwab
  1 sibling, 1 reply; 15+ messages in thread
From: Yuri Khan @ 2014-09-15  1:38 UTC (permalink / raw)
  To: Emacs developers

On Mon, Sep 15, 2014 at 6:27 AM, Lars Magne Ingebrigtsen <larsi@gnus.org> wrote:

> I wrote the function as taking a regexp.  And I find what I'm doing
> wrong 90% of the time when using it is that I expect an exact match, but
> instead I'm getting all matching nodes.

> 1) New Special Syntax
> (dom-by-id dom #/I (can)?haz new syntax/)

> 2) Cheat; i.e., introduce a convention
> (dom-by-id dom (regexp "I \\(couldn't\\)?haz new syntax"))

3) Adopt a convention that matches are literal by default; for regexp
matching, start and end the pattern with a slash.

(dom-by-id dom "/Some *regex+/")
(dom-by-id dom "Some* literal|string")

4) Mark literal patterns: have a function that turns a string into a
regex, by quoting every metacharacter.

(dom-by-id dom "Some *regex+")
(dom-by-id dom (literal "Some* literal|string"))

5) Allow the pattern to be an array or list of literal strings. For a
single literal string, use a singleton array/list.

(dom-by-id dom "Some *regex+")
(dom-by-id dom ["Some* literal|string"])
(dom-by-id dom '("Some* literal|string"))
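
Options 3 and 5 could be sketched roughly as follows.  Both function
names are hypothetical, and the sketch assumes the matcher receives the
pattern and a single ID string to test.

```elisp
(defun my-slash-pattern-match-p (pattern id)
  "Option 3: treat PATTERN as a regexp only if it is wrapped in slashes.
Any other PATTERN must equal ID exactly."
  (if (and (> (length pattern) 1)
           (string-prefix-p "/" pattern)
           (string-suffix-p "/" pattern))
      ;; Strip the delimiting slashes before matching.
      (and (string-match-p (substring pattern 1 -1) id) t)
    (string= pattern id)))

(defun my-seq-pattern-match-p (pattern id)
  "Option 5: a string PATTERN is a regexp; a list or vector holds literals.
Return non-nil if ID matches the regexp or equals one of the literals."
  (if (stringp pattern)
      (and (string-match-p pattern id) t)
    ;; `append' with a trailing nil converts a vector to a list.
    (and (member id (append pattern nil)) t)))
```

A drawback of option 3 shows up immediately in the sketch: a literal ID
that itself begins and ends with a slash can no longer be expressed.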




* Re: Regexps and strings once again
  2014-09-15  1:38 ` Yuri Khan
@ 2014-09-15  9:22   ` Andreas Schwab
  2014-09-15 10:12     ` Eric Abrahamsen
  0 siblings, 1 reply; 15+ messages in thread
From: Andreas Schwab @ 2014-09-15  9:22 UTC (permalink / raw)
  To: Yuri Khan; +Cc: Emacs developers

Yuri Khan <yuri.v.khan@gmail.com> writes:

> 4) Mark literal patterns: have a function that turns a string into a
> regex, by quoting every metacharacter.

regexp-quote
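
For illustration, `regexp-quote' combined with the string anchors
"\\`" and "\\'" gives an exact match through the regexp machinery.  The
helper name `my-exact-regexp' is hypothetical; `regexp-quote' and
`string-match-p' are the standard functions.

```elisp
(defun my-exact-regexp (literal)
  "Return a regexp that matches the string LITERAL exactly and nothing else."
  ;; "\\`" and "\\'" anchor the match to the start and end of the string.
  (concat "\\`" (regexp-quote literal) "\\'"))

(string-match-p (my-exact-regexp "a.b*") "a.b*")   ; non-nil
(string-match-p (my-exact-regexp "a.b*") "aXbbb")  ; nil
```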

Andreas.

-- 
Andreas Schwab, SUSE Labs, schwab@suse.de
GPG Key fingerprint = 0196 BAD8 1CE9 1970 F4BE  1748 E4D4 88E3 0EEA B9D7
"And now for something completely different."




* Re: Regexps and strings once again
  2014-09-15  9:22   ` Andreas Schwab
@ 2014-09-15 10:12     ` Eric Abrahamsen
  2014-09-15 10:22       ` Eric Abrahamsen
  0 siblings, 1 reply; 15+ messages in thread
From: Eric Abrahamsen @ 2014-09-15 10:12 UTC (permalink / raw)
  To: emacs-devel

Andreas Schwab <schwab@suse.de> writes:

> Yuri Khan <yuri.v.khan@gmail.com> writes:
>
>> 4) Mark literal patterns: have a function that turns a string into a
>> regex, by quoting every metacharacter.
>
> regexp-quote

I think the idea was, instead of

(regexp-quote "my (camaro|thunderbird) goes fast") ->
"my (camaro|thunderbird) goes fast"

to have

(regexp "my (camaro|thunderbird) goes fast") ->
"my \\(camaro\\|thunderbird\\) goes fast"

or even better

(regexp "my (camaro|thunderbird) goes fast") ->
#mysterious-regexp-object

or best of all IMHO:

(rx "my (camaro|thunderbird) goes fast") ->
#mysterious-regexp-object
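
For comparison, the existing `rx' macro takes a structured form rather
than a Perl-style string and returns an ordinary regexp string, not a
distinct object.  A string inside an `rx' form is matched literally, so
the alternation has to be spelled with `or'.  The constant name below is
made up for the example.

```elisp
(require 'rx)

(defconst my-car-regexp
  (rx "my " (or "camaro" "thunderbird") " goes fast")
  "Regexp string built by `rx'; strings inside the form are literals.")

(string-match-p my-car-regexp "my camaro goes fast")               ; non-nil
(string-match-p my-car-regexp "my (camaro|thunderbird) goes fast") ; nil
```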





* Re: Regexps and strings once again
  2014-09-15 10:12     ` Eric Abrahamsen
@ 2014-09-15 10:22       ` Eric Abrahamsen
  0 siblings, 0 replies; 15+ messages in thread
From: Eric Abrahamsen @ 2014-09-15 10:22 UTC (permalink / raw)
  To: emacs-devel

Eric Abrahamsen <eric@ericabrahamsen.net> writes:

> Andreas Schwab <schwab@suse.de> writes:
>
>> Yuri Khan <yuri.v.khan@gmail.com> writes:
>>
>>> 4) Mark literal patterns: have a function that turns a string into a
>>> regex, by quoting every metacharacter.
>>
>> regexp-quote
>
> I think the idea was, instead of
>
> (regexp-quote "my (camaro|thunderbird) goes fast") ->
> "my (camaro|thunderbird) goes fast"
>
> to have
>
> (regexp "my (camaro|thunderbird) goes fast") ->
> "my \\(camaro\\|thunderbird\\) goes fast"

Hmm, I think I made a dumb mistake.

> or even better
>
> (regexp "my (camaro|thunderbird) goes fast") ->
> #mysterious-regexp-object
>
> or best of all IMHO:
>
> (rx "my (camaro|thunderbird) goes fast") ->
> #mysterious-regexp-object





