Make regexp handling more regular

unofficial mirror of emacs-devel@gnu.org 

* Make regexp handling more regular
@ 2020-12-02  9:05 Lars Ingebrigtsen
  2020-12-02 10:44 ` Lars Ingebrigtsen
                   ` (5 more replies)
  0 siblings, 6 replies; 20+ messages in thread
From: Lars Ingebrigtsen @ 2020-12-02  9:05 UTC (permalink / raw)
  To: emacs-devel

Today's idle shower thought:

A constant source of confusion and subtle bugs is the way Emacs does
regexp match handling: `string-match' (and the rest) set global state,
and having to grab the match data "early" is often a challenge for new
users.

Experienced Emacs Lisp programmers know to be safe and will say:

(when (string-match "[a-z]" string)
  (let ((match (match-string 0 string)))
    (foo)
    (bar match)))

while people new to Emacs Lisp will expect this to work:

(when (string-match "[a-z]" string)
  (foo)
  (bar (match-string 0 string)))

And sometimes it does, and sometimes it doesn't, depending on whether
`foo' also messes with the match data.
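
A minimal sketch of that failure mode, using only existing functions.
`my-demo-foo' is a hypothetical stand-in for `foo', and the other names
are likewise made up for illustration:

```elisp
;; Hypothetical stand-in for `foo': it runs its own regexp match and
;; thereby overwrites the global match data.
(defun my-demo-foo ()
  (string-match "[0-9]+" "abc 123"))

(defun my-demo-unsafe (string)
  "Read the match only after calling `my-demo-foo'."
  (when (string-match "[a-z]+" string)
    (my-demo-foo)
    ;; The positions used here now come from the match inside
    ;; `my-demo-foo', but they are applied to STRING.
    (match-string 0 string)))

(defun my-demo-safe (string)
  "Same as above, but protect the match data around the call."
  (when (string-match "[a-z]+" string)
    (save-match-data (my-demo-foo))
    (match-string 0 string)))

(my-demo-unsafe "hello world") ; => "o w"
(my-demo-safe "hello world")   ; => "hello"
```

The unsafe variant does not signal an error here; it quietly returns the
wrong substring, which is what makes this kind of bug hard to spot.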

So my idle shower thought for the day is: Is there any reasonable path
forward that the Emacs Lisp language could take here?

Well, we obviously can't alter functions like `string-match' and
`re-search-forward' -- they have well-defined semantics, and we can't
make them return a match object.  But we could make a new set of
functions that are more, er, functional.

Naming is, of course, the most difficult problem here.  I wondered
whether the namespace would allow us to just add -p to the functions,
but names like `string-match-p' are already taken for variations on the
non-p functions.

In any case, if we happen upon a naming convention that's good, the new
interface for these functions would then be to return a "match object",
that can then be used for looking at details of the match.  I.e.,

(when (setq match (rx-string-match "[a-z]" string))
  (foo)
  (bar (match match 0)))

The match object would know what it had matched, too.  The following
code is an error:

(when (re-search-forward "p[a-z]+" nil t)
  (with-temp-buffer
    (insert (match-string 0))
    (buffer-string)))

But the following would work:

(when (setq match (rx-search-forward "p[a-z]+" nil t))
  (with-temp-buffer
    (insert (match match 0))
    (buffer-string)))

And the same for functions working on strings, of course.  And
equivalent forms for match-beginning/-end.  And we could finally get rid
of the confusingly-named `match-string' function.

There's nothing but upsides, people!

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no


* Re: Make regexp handling more regular
  2020-12-02  9:05 Make regexp handling more regular Lars Ingebrigtsen
@ 2020-12-02 10:44 ` Lars Ingebrigtsen
  2020-12-02 11:12 ` Stefan Kangas
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 20+ messages in thread
From: Lars Ingebrigtsen @ 2020-12-02 10:44 UTC (permalink / raw)
  To: emacs-devel

Lars Ingebrigtsen <larsi@gnus.org> writes:

> (when (setq match (rx-string-match "[a-z]" string))
>   (foo)
>   (bar (match match 0)))

This would, of course, be more idiomatic as

(when-let ((match (rx-string-match "[a-z]" string)))
  (foo)
  (bar (match match 0)))

Another thing that occurred to me is that we could allow the match
object accessors to return nil on nil objects.  Then, what's currently
this:

(let ((string (foo)))
  (and (string-match "[a-z]" string)
       (match-string 0 string)))

would just be this:

(match (rx-string-match "[a-z]" (foo)) 0)
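
A rough sketch of how such a match object with nil-tolerant accessors
could be prototyped on top of the existing primitives.  All `my-re-'
names are hypothetical, and this is only one possible shape for the
idea, not a proposed implementation:

```elisp
(require 'cl-lib)

;; Hypothetical match object: remembers what was searched and where
;; the regexp matched.
(cl-defstruct (my-re-match (:constructor my-re-match--make))
  source  ; the string that was searched
  data)   ; integer positions, in the layout returned by `match-data'

(defun my-re-string-match (regexp string &optional start)
  "Match REGEXP against STRING and return a match object, or nil.
The global match data is restored before returning."
  (save-match-data
    (when (string-match regexp string start)
      (my-re-match--make :source string :data (match-data)))))

(defun my-re-match-group (match num)
  "Return the text of group NUM in MATCH.
Return nil if MATCH is nil or if group NUM did not match."
  (when match
    (let* ((data (my-re-match-data match))
           (beg (nth (* 2 num) data))
           (end (nth (1+ (* 2 num)) data)))
      (and beg end
           (substring (my-re-match-source match) beg end)))))

(my-re-match-group (my-re-string-match "[a-z]+" "42 abc") 0) ; => "abc"
(my-re-match-group (my-re-string-match "[a-z]+" "42") 0)     ; => nil
```

Because the match object carries its own copy of the positions, nothing
that runs between the match and the accessor call can invalidate it.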





* Re: Make regexp handling more regular
  2020-12-02  9:05 Make regexp handling more regular Lars Ingebrigtsen
  2020-12-02 10:44 ` Lars Ingebrigtsen
@ 2020-12-02 11:12 ` Stefan Kangas
  2020-12-02 11:21   ` Philipp Stephani
  2020-12-03  8:31   ` Lars Ingebrigtsen
  2020-12-02 17:17 ` Stefan Monnier
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 20+ messages in thread
From: Stefan Kangas @ 2020-12-02 11:12 UTC (permalink / raw)
  To: Lars Ingebrigtsen, emacs-devel

Lars Ingebrigtsen <larsi@gnus.org> writes:

> So my idle shower thought for the day is: Is there any reasonable path
> forward that the Emacs Lisp language could take here?
>
> Well, we obviously can't alter functions like `string-match' and
> `re-search-forward' -- they have well-defined semantics, and we can't
> make them return a match object.  But we could make a new set of
> functions that are more, er, functional.

I like the idea of adding an entirely new built-in API based on the
current state of the art.  I would begin such a project by looking into
what other Lisps are doing, such as CL, Clojure, Guile and Racket.  Why
shouldn't Emacs Lisp be best-in-class?

As for naming, how about just using a short prefix such as "re-"?
AFAICT, we currently have only five functions using that prefix.

Tangentially, I have always been wondering if it's feasible to add a new
regular expression type to `read' where you don't have to incessantly
double quote all special characters.  (One could take inspiration from
Python, for example, where an "r" prefix makes a raw string literal that
is commonly used for regexps: r"regexp".)


* Re: Make regexp handling more regular
  2020-12-02 11:12 ` Stefan Kangas
@ 2020-12-02 11:21   ` Philipp Stephani
  2020-12-03  8:31   ` Lars Ingebrigtsen
  1 sibling, 0 replies; 20+ messages in thread
From: Philipp Stephani @ 2020-12-02 11:21 UTC (permalink / raw)
  To: Stefan Kangas; +Cc: Lars Ingebrigtsen, Emacs developers

On Wed, 2 Dec 2020 at 12:14, Stefan Kangas
<stefankangas@gmail.com> wrote:
>
> Lars Ingebrigtsen <larsi@gnus.org> writes:
>
> > So my idle shower thought for the day is: Is there any reasonable path
> > forward that the Emacs Lisp language could take here?
> >
> > Well, we obviously can't alter functions like `string-match' and
> > `re-search-forward' -- they have well-defined semantics, and we can't
> > make them return a match object.  But we could make a new set of
> > functions that are more, er, functional.
>
> I like the idea of adding an entirely new built-in API based on the
> current state of the art.  I would begin such a project by looking into
> what other Lisps are doing, such as CL, Clojure, Guile and Racket.  Why
> shouldn't Emacs Lisp be best-in-class?
>
> As for naming, how about just using a short prefix such as "re-"?
> AFAICT, we currently have only five functions using that prefix.
>
> Tangentially, I have always been wondering if it's feasible to add a new
> regular expression type to `read' where you don't have to incessantly
> double quote all special characters.  (One could take inspiration from
> Python, for example, where an "r" prefix makes a raw string literal that
> is commonly used for regexps: r"regexp".)
>

Yes, I think all of these make sense:
1. Support for stateless matching, with functions returning match
objects (like s-match, but also for searching)
2. Support for PCRE/"extended" regexp. Add customization options for
the interactive commands to read this dialect.
3. Support for raw strings, maybe using a syntax like #"...".

If we want to take more cues from other programming languages, we
should create a "compiled regex pattern" type. Multiple dialects
(traditional Emacs regexp, rx, PCRE) would then compile down to a
single such type.




* Re: Make regexp handling more regular
  2020-12-02 11:12 ` Stefan Kangas
  2020-12-02 11:21   ` Philipp Stephani
@ 2020-12-03  8:31   ` Lars Ingebrigtsen
  1 sibling, 0 replies; 20+ messages in thread
From: Lars Ingebrigtsen @ 2020-12-03  8:31 UTC (permalink / raw)
  To: Stefan Kangas; +Cc: emacs-devel

Stefan Kangas <stefankangas@gmail.com> writes:

> I like the idea of adding an entirely new built-in API based on the
> current state of the art.  I would begin such a project by looking into
> what other Lisps are doing, such as CL, Clojure, Guile and Racket.  Why
> shouldn't Emacs Lisp be best-in-class?

Sure.

Common Lisp doesn't have regexps, but (some) implementations do, and
there's a bunch of libraries, like http://edicl.github.io/cl-ppcre/.
I'm not much in favour:

* (scan "(a)*b" "xaaabd")
1
5
#(3)
#(4)

* (let ((s (create-scanner "(([a-c])+)x")))
    (scan s "abcxy"))
0
4
#(0 2)
#(3 3)

And since it's Common Lisp, of course you have special forms for
destructuring:

* (register-groups-bind (first second third fourth)
      ("((a)|(b)|(c))+" "abababc" :sharedp t)
    (list first second third fourth))
("c" "a" "b" "c")

Guile: https://www.gnu.org/software/guile/manual/html_node/Regexp-Functions.html

(string-match "[0-9][0-9][0-9][0-9]" "blah2002")
⇒ #("blah2002" (4 . 8))

(map match:substring (list-matches "[a-z]+" "abc 42 def 78"))
⇒ ("abc" "def")

Clojure: https://purelyfunctional.tv/mini-guide/regexes-in-clojure/

(re-matches #"abc(.*)" "abcxyz")
   ["abcxyz" "xyz"]

I.e., if there's one match, we return the match substring, otherwise an
array.  It's nice in one way, but the cleverness leads to errors when
(re-)writing code.

(subs (re-find #"[a-z]+" "fooo baar") 3)

but then you add some more and you have to rewrite to something like:

(let [[_ s1 s2] (re-matches #"([a-z]+) ([a-z]+)" full-name)]
  (subs s1 3))

I hate that.

The thing that makes looking at other languages here slightly less
useful is that Emacs has buffers.  We're often not interested in the
(sub-)matches themselves at all, but instead their buffer positions
(i.e., match-beginning/end).

> As for naming, how about just using a short prefix such as "re-"?
> AFAICT, we currently have only five functions using that prefix.

Sure.

> Tangentially, I have always been wondering if it's feasible to add a new
> regular expression type to `read' where you don't have to incessantly
> double quote all special characters.  (One could take inspiration from
> Python, for example, where an "r" prefix makes a raw string literal that
> is commonly used for regexps: r"regexp".)

I'm all for adding a regexp object type (and a new read syntax), but I
think it's somewhat orthogonal?  Not totally, though: I've long wished
for match/searching functions to be generic, and work differently on
strings and regexps.  That is, if fed a string, then do comparison with
`string-equal' and when fed a regexp, do the comparison with
`string-match'.

So you could say

(search-forward "foo")

and

(search-forward #r"fo+")

or

(search-forward (re-make "fo+"))

-- no reason for there to be separate functions if we have regexp objects.
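
A small sketch of that dispatch using `cl-defgeneric'.  The regexp
object type `my-regexp', its constructor `my-re-make' and the wrapper
`my-search-forward' are all hypothetical; a real regexp type would
presumably live in C and have its own read syntax:

```elisp
(require 'cl-lib)

;; Hypothetical regexp object: just a tagged wrapper around the
;; pattern source string.
(cl-defstruct (my-regexp (:constructor my-re-make (source)))
  source)

(cl-defgeneric my-search-forward (pattern &optional bound noerror)
  "Search forward from point for PATTERN.")

;; A plain string is searched for literally.
(cl-defmethod my-search-forward ((pattern string) &optional bound noerror)
  (search-forward pattern bound noerror))

;; A regexp object is searched for as a regexp.
(cl-defmethod my-search-forward ((pattern my-regexp) &optional bound noerror)
  (re-search-forward (my-regexp-source pattern) bound noerror))

(with-temp-buffer
  (insert "fo+ foo")
  (goto-char (point-min))
  (list (my-search-forward "fo+" nil t)                ; literal "fo+"
        (progn (goto-char (point-min))
               (my-search-forward (my-re-make "fo+") nil t)))) ; regexp
;; => (4 3)
```

The caller writes the same function name in both cases; only the type
of the first argument decides which kind of search is performed.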





* Re: Make regexp handling more regular
  2020-12-02  9:05 Make regexp handling more regular Lars Ingebrigtsen
  2020-12-02 10:44 ` Lars Ingebrigtsen
  2020-12-02 11:12 ` Stefan Kangas
@ 2020-12-02 17:17 ` Stefan Monnier
  2020-12-02 17:45   ` Yuan Fu
  2020-12-03  8:38   ` Lars Ingebrigtsen
  2020-12-02 21:19 ` Juri Linkov
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 20+ messages in thread
From: Stefan Monnier @ 2020-12-02 17:17 UTC (permalink / raw)
  To: Lars Ingebrigtsen; +Cc: emacs-devel

> Naming is, of course, the most difficult problem here.

I agree that it might be worth looking at what other languages do.
But we could also just follow "traditional regexp" libraries'
suggestions for naming and go with something like:

    (re-match  REGEXP &optional OBJECT START END)
    (re-search REGEXP &optional OBJECT START END)

[ the first being like `looking-at` (i.e. an "anchored" match).  ]

I'd also suggest to make those functions accept other arguments than
strings for REGEXP, i.e. to make them into generic functions.


        Stefan





* Re: Make regexp handling more regular
  2020-12-02 17:17 ` Stefan Monnier
@ 2020-12-02 17:45   ` Yuan Fu
  2020-12-02 19:24     ` Stefan Monnier
  2020-12-03  8:38   ` Lars Ingebrigtsen
  1 sibling, 1 reply; 20+ messages in thread
From: Yuan Fu @ 2020-12-02 17:45 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: Lars Ingebrigtsen, emacs-devel



> On Dec 2, 2020, at 12:17 PM, Stefan Monnier <monnier@iro.umontreal.ca> wrote:
> 
>> Naming is, of course, the most difficult problem here.
> 
> I agree that it might be worth looking at what other languages do.
> But we could also just follow "traditional regexp" libraries'
> suggestions for naming and go with something like:
> 
>    (re-match  REGEXP &optional OBJECT START END)
>    (re-search REGEXP &optional OBJECT START END)
> 
> [ the first being like `looking-at` (i.e. an "anchored" match).  ]

Whatever the name is, we should make sure they don’t introduce even more confusion on top of the already confusing names.

re-search-forward
re-search-backward
re-search
re-match

It’s hard to see what each function does at a glance, IMO. That’s not counting string regexp functions.


> 
> I'd also suggest to make those functions accept other arguments than
> strings for REGEXP, i.e. to make them into generic functions.

It would be cool if these functions accept rx forms.

Yuan



* Re: Make regexp handling more regular
  2020-12-02 17:45   ` Yuan Fu
@ 2020-12-02 19:24     ` Stefan Monnier
  2020-12-03  8:40       ` Lars Ingebrigtsen
  0 siblings, 1 reply; 20+ messages in thread
From: Stefan Monnier @ 2020-12-02 19:24 UTC (permalink / raw)
  To: Yuan Fu; +Cc: Lars Ingebrigtsen, emacs-devel

> Whatever the name is, we should make sure they don’t introduce even more
> confusion on top of the already confusing names.
>
> re-search-forward
> re-search-backward
> re-search
> re-match
>
> It’s hard to see what each function does at a glance, IMO. That’s not
> counting string regexp functions.

The `re-search` I propose would work both for strings and buffers
(depending on OBJECT), and both forward and backward (depending on the
relative position of START and END, probably with some convention to
simplify the common "from point to point-min").

>> I'd also suggest to make those functions accept other arguments than
>> strings for REGEXP, i.e. to make them into generic functions.
> It would be cool if these functions accept rx forms.

If they're generic functions, rx.el could do that indeed.


        Stefan





* Re: Make regexp handling more regular
  2020-12-02 19:24     ` Stefan Monnier
@ 2020-12-03  8:40       ` Lars Ingebrigtsen
  0 siblings, 0 replies; 20+ messages in thread
From: Lars Ingebrigtsen @ 2020-12-03  8:40 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: Yuan Fu, emacs-devel

Stefan Monnier <monnier@iro.umontreal.ca> writes:

> The `re-search` I propose would work both for strings and buffers
> (depending on OBJECT), and both forward and backward (depending on the
> relative position of START and END, probably with some convention to
> simplify the common "from point to point-min").

Good idea, I think.





* Re: Make regexp handling more regular
  2020-12-02 17:17 ` Stefan Monnier
  2020-12-02 17:45   ` Yuan Fu
@ 2020-12-03  8:38   ` Lars Ingebrigtsen
  2020-12-03 15:10     ` Stefan Monnier
  1 sibling, 1 reply; 20+ messages in thread
From: Lars Ingebrigtsen @ 2020-12-03  8:38 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: emacs-devel

Stefan Monnier <monnier@iro.umontreal.ca> writes:

>> Naming is, of course, the most difficult problem here.
>
> I agree that it might be worth looking at what other languages do.
> But we could also just follow "traditional regexp" libraries'
> suggestions for naming and go with something like:
>
>     (re-match  REGEXP &optional OBJECT START END)
>     (re-search REGEXP &optional OBJECT START END)
>
> [ the first being like `looking-at` (i.e. an "anchored" match).  ]

I like it.

Off-list, it's been pointed out that the current implementation of
functions like re-search-forward would be faster than these interfaces
because they produce less garbage -- since there's just one global match
object, it's static, while

(while (setq match (re-search "[a-z]+"))
  (bar (re-data match 0)))

would create a whole lot of garbage to be collected.  Now, that's not
really that much of an issue if you're just saying (when (re-match
"foo") ...) here and there, but searching through a buffer for matches
is a very common use case, and we wouldn't want that to be slower than
now.

So the suggestion is to be able to pass in an (optional) match object.
However, we don't really need to create that explicitly everywhere.  The
idiom could be just:

(while (setq match (re-search "[a-z]+" match))
  (bar (re-data match 0)))

In the first iteration, it's nil, which means that `re-search' will
allocate one, but in subsequent iterations, it'll be reused.
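
For comparison, the existing `match-data' function already has an
optional REUSE argument that works along similar lines: when given a
list, it stores the positions into that list instead of consing a new
one.  A small sketch (the function name `my-collect-words' is made up
for illustration):

```elisp
(defun my-collect-words ()
  "Return all lowercase words in the current buffer.
One match-data list is reused across all iterations."
  (let (md words)
    (save-excursion
      (goto-char (point-min))
      (while (re-search-forward "[a-z]+" nil t)
        ;; First argument t: integer positions instead of markers.
        ;; Second argument MD: overwrite this list rather than
        ;; allocating a fresh one (nil on the first iteration, so
        ;; a list is allocated exactly once).
        (setq md (match-data t md))
        (push (buffer-substring-no-properties (nth 0 md) (nth 1 md))
              words)))
    (nreverse words)))

(with-temp-buffer
  (insert "abc 42 def")
  (my-collect-words))
;; => ("abc" "def")
```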



* Re: Make regexp handling more regular
  2020-12-03  8:38   ` Lars Ingebrigtsen
@ 2020-12-03 15:10     ` Stefan Monnier
  2020-12-03 16:58       ` Lars Ingebrigtsen
  0 siblings, 1 reply; 20+ messages in thread
From: Stefan Monnier @ 2020-12-03 15:10 UTC (permalink / raw)
  To: Lars Ingebrigtsen; +Cc: emacs-devel

> Off-list, it's been pointed out that the current implementation of
> functions like re-search-forward would be faster than these interfaces
> because they produce less garbage -- since there's just one global match
> object, it's static, while

Yes, it's indeed my main worry.
Reusing the match-data sounds like a good practical approach.

Maybe another approach would be to use an API where the match doesn't
return a "match data" but instead let-binds some variables with the
relevant data.  IOW, specify right away in which part of the data you're
interested, so only the relevant data is returned.  That would also
remove the need for the `string-match-p` alternatives which don't return
any match data.

I'm not completely sure what it would look like, tho.  Maybe

    (let-re-match (overall (beg end)) (re-match "regexp")
      ...)

which would be equivalent to

    (progn
      (re-match "regexp")
      (let ((overall (match-string 0))
            (beg (match-beginning 1))
            (end (match-end 1)))
        ...))

??

This has problems dealing with match-failure tho: it works, but with
a lot of spurious match-data extraction.

        Stefan


* Re: Make regexp handling more regular
  2020-12-03 15:10     ` Stefan Monnier
@ 2020-12-03 16:58       ` Lars Ingebrigtsen
  2020-12-03 17:40         ` Stefan Monnier
  0 siblings, 1 reply; 20+ messages in thread
From: Lars Ingebrigtsen @ 2020-12-03 16:58 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: emacs-devel

Stefan Monnier <monnier@iro.umontreal.ca> writes:

> I'm not completely sure what it would look like, tho.  Maybe
>
>     (let-re-match (overall (beg end)) (re-match "regexp")
>       ...)
>
> which would be equivalent to
>
>     (progn
>       (re-match "regexp")
>       (let ((overall (match-string 0))
>             (beg (match-beginning 1))
>             (end (match-end 1)))
>         ...))
>
> ??

Then we don't really need the re-match function at all...

     (let-re-match (overall (beg end)) ("regexp")
       ...)

or

     (let-re-match "regexp" (overall (beg end))
       ...)

(or something) would be sufficient...

But I think this would be somewhat too cumbersome.  Like, if you want to
write

(while (setq m1 (re-search "foo"))
  (setq m2 (re-match "[0-9]"))
  (zot (re-string m1 0) (re-end m2 0)))

you quickly find yourself deeply nested, and sometimes with awkward ways
of using the `let-re-match' variables if you want to use them later,
outside the form, etc.

Hm...





* Re: Make regexp handling more regular
  2020-12-03 16:58       ` Lars Ingebrigtsen
@ 2020-12-03 17:40         ` Stefan Monnier
  0 siblings, 0 replies; 20+ messages in thread
From: Stefan Monnier @ 2020-12-03 17:40 UTC (permalink / raw)
  To: Lars Ingebrigtsen; +Cc: emacs-devel

>>     (let-re-match (overall (beg end)) (re-match "regexp")
>>       ...)
> Then we don't really need the re-match function at all...

Yes and no: we do need some place to put the information about the
OBJECT where we search, whether the search is anchored, where we START
and where we END.

> But I think this would be somewhat too cumbersome.  Like, if you want to
> write
>
> (while (setq m1 (re-search "foo"))
>   (setq m2 (re-match "[0-9]"))
>   (zot (re-string m1 0) (re-end m2 0)))

Yes, that's related to the problem with dealing with match-failure.


        Stefan





* Re: Make regexp handling more regular
  2020-12-02  9:05 Make regexp handling more regular Lars Ingebrigtsen
                   ` (2 preceding siblings ...)
  2020-12-02 17:17 ` Stefan Monnier
@ 2020-12-02 21:19 ` Juri Linkov
  2020-12-03  8:41   ` Lars Ingebrigtsen
  2020-12-02 21:28 ` Daniel Martín
  2020-12-03  4:16 ` Adam Porter
  5 siblings, 1 reply; 20+ messages in thread
From: Juri Linkov @ 2020-12-02 21:19 UTC (permalink / raw)
  To: Lars Ingebrigtsen; +Cc: emacs-devel

> So my idle shower thought for the day is: Is there any reasonable path
> forward that the Emacs Lisp language could take here?

Currently the match data is like a dynamically bound variable accessible
to the callee.  But maybe the match data should be only lexically-bound?
(This is just a vague idea, I don't know how to implement this.)




* Re: Make regexp handling more regular
  2020-12-02 21:19 ` Juri Linkov
@ 2020-12-03  8:41   ` Lars Ingebrigtsen
  2020-12-03 15:00     ` Stefan Monnier
  0 siblings, 1 reply; 20+ messages in thread
From: Lars Ingebrigtsen @ 2020-12-03  8:41 UTC (permalink / raw)
  To: Juri Linkov; +Cc: emacs-devel

Juri Linkov <juri@linkov.net> writes:

>> So my idle shower thought for the day is: Is there any reasonable path
>> forward that the Emacs Lisp language could take here?
>
> Currently the match data is like a dynamically bound variable accessible
> to the callee.  But maybe the match data should be only lexically-bound?
> (This is just a vague idea, I don't know how to implement this.)

Yes, I wondered whether one could use some lexical magic here, but I
didn't quite see what that would look like.





* Re: Make regexp handling more regular
  2020-12-03  8:41   ` Lars Ingebrigtsen
@ 2020-12-03 15:00     ` Stefan Monnier
  2020-12-03 21:02       ` Juri Linkov
  0 siblings, 1 reply; 20+ messages in thread
From: Stefan Monnier @ 2020-12-03 15:00 UTC (permalink / raw)
  To: Lars Ingebrigtsen; +Cc: emacs-devel, Juri Linkov

>>> So my idle shower thought for the day is: Is there any reasonable path
>>> forward that the Emacs Lisp language could take here?
>> Currently the match data is like a dynamically bound variable accessible
>> to the callee.  But maybe the match data should be only lexically-bound?
>> (This is just a vague idea, I don't know how to implement this.)
> Yes, I wondered whether one could use some lexical magic here, but I
> didn't quite see what that would look like.

Actually, currently the match-data is *not* like a dynamically-scoped
var, but like a global var.  And we don't really need it to be lexically
scoped, we would be already well-served with a dynamically-scoped var.

E.g. we could have

    (with-re-match "regexp"
      ...
      (match-beginning 0)
      ...)

where `with-re-match` could look like

    `(if re-match-data-in-use
         (save-match-data
           ,@body)
       (let ((re-match-data-in-use t))
         ,@body))

so we'd save the match-data lazily.  [ Tho, it would still save the
match data more often than we currently do, of course.  ]
  

        Stefan





* Re: Make regexp handling more regular
  2020-12-03 15:00     ` Stefan Monnier
@ 2020-12-03 21:02       ` Juri Linkov
  2020-12-03 22:20         ` Vasilij Schneidermann
  0 siblings, 1 reply; 20+ messages in thread
From: Juri Linkov @ 2020-12-03 21:02 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: Lars Ingebrigtsen, emacs-devel

>>> Currently the match data is like a dynamically bound variable accessible
>>> to the callee.  But maybe the match data should be only lexically-bound?
>>> (This is just a vague idea, I don't know how to implement this.)
>> Yes, I wondered whether one could use some lexical magic here, but I
>> didn't quite see what that would look like.
>
> Actually, currently the match-data is *not* like a dynamically-scoped
> var, but like a global var.  And we don't really need it to be lexically
> scoped, we would be already well-served with a dynamically-scoped var.

Notably in Ruby e.g. /(.)(.)(.)/.match("foo") returns a MatchData object:

  #<MatchData "foo" 1:"f" 2:"o" 3:"o">

Shouldn't a function like string-match (or rather some new function)
return a #<MatchData> object too?  Or the current list returned
by the function 'match-data' is sufficient?

Binding it to a variable will avoid the need to have global data
(unless global data is a requirement for performance).  Then:

  (let ((match-data (string-match regexp string)))
    (list (match-beginning subexp match-data)
          (match-end subexp match-data)))

with an additional arg MATCH-DATA added to match-processing functions:

  (match-beginning SUBEXP &optional MATCH-DATA)
  (match-end SUBEXP &optional MATCH-DATA)
  (match-string NUM &optional STRING MATCH-DATA)




* Re: Make regexp handling more regular
  2020-12-03 21:02       ` Juri Linkov
@ 2020-12-03 22:20         ` Vasilij Schneidermann
  0 siblings, 0 replies; 20+ messages in thread
From: Vasilij Schneidermann @ 2020-12-03 22:20 UTC (permalink / raw)
  To: Juri Linkov; +Cc: Lars Ingebrigtsen, Stefan Monnier, emacs-devel


> Notably in Ruby e.g. /(.)(.)(.)/.match("foo") returns a MatchData object:
> 
>   #<MatchData "foo" 1:"f" 2:"o" 3:"o">
> 
> Shouldn't a function like string-match (or rather some new function)
> return a #<MatchData> object too?  Or the current list returned
> by the function 'match-data' is sufficient?

Personally I find the pattern of mixing a check for a match object, then
access to the global match variables a lot more convenient in Ruby than
extracting the data from the match object:

    'foo123bar'[/[a-z]+([0-9]+)[a-z]+/] && $1 #=> "123"

The alternative:

    m = /[a-z]+([0-9]+)[a-z]+/.match('foo123bar')
    m && m[1] #=> "123"

The above is more attractive if there was an if-let/when-let
equivalent. So that's what I'd design against.


* Re: Make regexp handling more regular
  2020-12-02  9:05 Make regexp handling more regular Lars Ingebrigtsen
                   ` (3 preceding siblings ...)
  2020-12-02 21:19 ` Juri Linkov
@ 2020-12-02 21:28 ` Daniel Martín
  2020-12-03  4:16 ` Adam Porter
  5 siblings, 0 replies; 20+ messages in thread
From: Daniel Martín @ 2020-12-02 21:28 UTC (permalink / raw)
  To: Lars Ingebrigtsen; +Cc: emacs-devel

Lars Ingebrigtsen <larsi@gnus.org> writes:
>
> (when (setq match (rx-search-forward "p[a-z]+" nil t))
>   (with-temp-buffer
>     (insert (match match 0))
>     (buffer-string)))
>

The way other Lisp-like languages like Clojure work is by returning
either nil (if no match), a string (if the regular expression matched
the string and didn't have any capture group), or a vector where the
first element is the entire match and the rest of elements are the group
matches.

I think it's a nice API, especially because you can use destructuring on
the return value to discern the cases.


* Re: Make regexp handling more regular
  2020-12-02  9:05 Make regexp handling more regular Lars Ingebrigtsen
                   ` (4 preceding siblings ...)
  2020-12-02 21:28 ` Daniel Martín
@ 2020-12-03  4:16 ` Adam Porter
  5 siblings, 0 replies; 20+ messages in thread
From: Adam Porter @ 2020-12-03  4:16 UTC (permalink / raw)
  To: emacs-devel

Lars Ingebrigtsen <larsi@gnus.org> writes:

> A constant source of confusion and subtle bugs is the way Emacs does
> regexp match handling: The way `string-match' (and the rest) sets a
> global state, and you sort of have to catch them "early" is often a
> challenge for new users.
>
> Experienced Emacs Lisp programmers know to be safe and will say:
>
> (when (string-match "[a-z]" string)
>   (let ((match (match-string 0 string)))
>     (foo)
>     (bar match)))
>
> while people new to Emacs Lisp will expect this to work:
>
> (when (string-match "[a-z]" string)
>   (foo)
>   (bar (match-string 0 string)))
>
> And sometimes it does, and sometimes it doesn't, depending on whether
> `foo' also messes with the match data.
>
> So my idle shower thought for the day is: Is there any reasonable path
> forward that the Emacs Lisp language could take here?

It's funny that you should post this today, Lars, because I just
encountered this very problem while using code from your format-spec
function in combination with code from your shr-insert-document function
(the latter of which changed the match data, making the former fail
inexplicably...until I figured it out).  Not that I'm blaming you, of
course--it's me who's using your code in unintended ways.  :)

Anyway, I'd be very happy if Emacs had "safer" matching functions like
this.  And I like the idea of prefixing them with "rx-", as was
suggested.

Thanks for your work on Emacs!
