Pattern matching on match-string groups #elisp #question

unofficial mirror of emacs-devel@gnu.org 
 help / color / mirror / code / Atom feed

* Pattern matching on match-string groups #elisp #question
@ 2021-02-25  5:11 Ag Ibragimov
  2021-02-25 14:55 ` Basil L. Contovounesios
  0 siblings, 1 reply; 13+ messages in thread
From: Ag Ibragimov @ 2021-02-25  5:11 UTC (permalink / raw)
  To: emacs-devel


Can someone please tell if it is possible to use pcase-let to dissect groups (parenthesized expressions) of match-string. Here's a dumb example:

(let* ((s "lorem-foo-1-ipsum-bar-21_baz-42")
       (_ (string-match "\\(foo-[0-9]+\\).*\\(bar-[0-9]+\\).*\\(baz-[0-9]+\\)" s))
       (group-1 (match-string 1 s))
       (group-2 (match-string 2 s))
       (group-3 (match-string 3 s)))
  (format "%s-%s-%s" group-1 group-2 group-3)) ;; => "foo-1-bar-21-baz-42"

Is there a more elegant way of retrieving these groups?




^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Pattern matching on match-string groups #elisp #question
  2021-02-25  5:11 Pattern matching on match-string groups #elisp #question Ag Ibragimov
@ 2021-02-25 14:55 ` Basil L. Contovounesios
  2021-02-25 15:32   ` Stefan Monnier
  0 siblings, 1 reply; 13+ messages in thread
From: Basil L. Contovounesios @ 2021-02-25 14:55 UTC (permalink / raw)
  To: Ag Ibragimov; +Cc: emacs-devel

Ag Ibragimov <agzam.ibragimov@gmail.com> writes:

> Can someone please tell if it is possible to use pcase-let to dissect
> groups (parenthesized expressions) of match-string. Here's a dumb
> example:
>
> (let* ((s "lorem-foo-1-ipsum-bar-21_baz-42")
>       (_ (string-match "\\(foo-[0-9]+\\).*\\(bar-[0-9]+\\).*\\(baz-[0-9]+\\)" s))
>       (group-1 (match-string 1 s))
>       (group-2 (match-string 2 s))
>       (group-3 (match-string 3 s)))
>  (format "%s-%s-%s" group-1 group-2 group-3)) ;; => "foo-1-bar-21-baz-42"

The closest equivalent pcase magic I can think of is:

  (pcase "lorem-foo-1-ipsum-bar-21_baz-42"
    ((rx (let group-1 "foo-" (+ num)) (* nonl)
         (let group-2 "bar-" (+ num)) (* nonl)
         (let group-3 "baz-" (+ num)))
     (format "%s-%s-%s" group-1 group-2 group-3)))
  ;; => "foo-1-bar-21-baz-42"

The same won't work with pcase-let, so it's either unsupported or a bug.

> Is there a more elegant way of retrieving these groups?

Personally I'm fine with either of the above, although I wouldn't
generally feel comfortable with using the pcase magic in code that
others work on, since there are usually more vanilla/common ways of
structuring the code.  Either way, I think it's a matter of taste.

See also the following recentish thread:
https://lists.gnu.org/r/emacs-devel/2020-12/msg00069.html

-- 
Basil



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Pattern matching on match-string groups #elisp #question
  2021-02-25 14:55 ` Basil L. Contovounesios
@ 2021-02-25 15:32   ` Stefan Monnier
  2021-02-25 18:28     ` Mattias Engdegård
  0 siblings, 1 reply; 13+ messages in thread
From: Stefan Monnier @ 2021-02-25 15:32 UTC (permalink / raw)
  To: Basil L. Contovounesios; +Cc: Mattias Engdegård, Ag Ibragimov, emacs-devel

> The closest equivalent pcase magic I can think of is:
>
>   (pcase "lorem-foo-1-ipsum-bar-21_baz-42"
>     ((rx (let group-1 "foo-" (+ num)) (* nonl)
>          (let group-2 "bar-" (+ num)) (* nonl)
>          (let group-3 "baz-" (+ num)))
>      (format "%s-%s-%s" group-1 group-2 group-3)))
>   ;; => "foo-1-bar-21-baz-42"
>
> The same won't work with pcase-let, so it's either unsupported or a bug.

I'd say it's a bug.  The patch below would fix it.  Mattias, WDYT?

FWIW, I have a similar pcase pattern using "old style regexps" instead of
rx, but I haven't bothered to install it:

    (pcase-defmacro re-match (re)
      "Matches a string if that string matches RE.
    RE should be a regular expression (a string).
    It can use the special syntax \\(?VAR: to bind a sub-match
    to variable VAR.  All other subgroups are treated as shy.
    
    Multiple uses of this macro in a single `pcase' are not optimized
    together, so don't expect lex-like performance.  But in order for
    such optimization to be possible in some distant future, back-references
    are not supported."
      (let ((start 0)
            (last 0)
            (new-re '())
            (vars '())
            (gn 0))
        (while (string-match "\\\\(\\(?:\\?\\([-[:alnum:]]*\\):\\)?" re start)
          (setq start (match-end 0))
          (let ((beg (match-beginning 0))
                (name (match-string 1 re)))
            ;; Skip false positives, either backslash-escaped or within [...].
            (when (subregexp-context-p re start last)
              (cond
               ((null name)
                (push (concat (substring re last beg) "\\(?:") new-re))
               ((string-match "\\`[0-9]" name)
                (error "Variable can't start with a digit: %S" name))
               (t
                (let* ((var (intern name))
                       (id (cdr (assq var vars))))
                  (unless id
                    (setq gn (1+ gn))
                    (setq id gn)
                    (push (cons var gn) vars))
                  (push (concat (substring re last beg) (format "\\(?%d:" id))
                        new-re))))
              (setq last start))))
        (push (substring re last) new-re)
        (setq new-re (mapconcat #'identity (nreverse new-re) ""))
        `(and (pred stringp)
              (app (lambda (s)
                     (when (string-match ,new-re s)
                       (vector ,@(mapcar (lambda (x) `(match-string ,(cdr x) s))
                                         vars))))
                   (,'\` [,@(mapcar (lambda (x) (list '\, (car x))) vars)])))))

[ Not sure why I decided to use a vector internally.  ]


        Stefan


diff --git a/lisp/emacs-lisp/rx.el b/lisp/emacs-lisp/rx.el
index 58584f300c..619bc32752 100644
--- a/lisp/emacs-lisp/rx.el
+++ b/lisp/emacs-lisp/rx.el
@@ -1437,7 +1437,8 @@ rx
                    construct."
   (let* ((rx--pcase-vars nil)
          (regexp (rx--to-expr (rx--pcase-transform (cons 'seq regexps)))))
-    `(and (pred (string-match ,regexp))
+    `(and (pred stringp)
+          (app (lambda (s) (string-match ,regexp s)) (pred identity))
           ,@(let ((i 0))
               (mapcar (lambda (name)
                         (setq i (1+ i))




^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: Pattern matching on match-string groups #elisp #question
  2021-02-25 15:32   ` Stefan Monnier
@ 2021-02-25 18:28     ` Mattias Engdegård
  2021-02-26  4:31       ` Stefan Monnier
  0 siblings, 1 reply; 13+ messages in thread
From: Mattias Engdegård @ 2021-02-25 18:28 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: Basil L. Contovounesios, Ag Ibragimov, emacs-devel

25 feb. 2021 kl. 16.32 skrev Stefan Monnier <monnier@iro.umontreal.ca>:

>> The same won't work with pcase-let, so it's either unsupported or a bug.
> 
> I'd say it's a bug.  The patch below would fix it.  Mattias, WDYT?

Thank you, it looks like multiple bugs. The lack of (pred stringp) was of course an oversight but unrelated to pcase-let, right?

>   (let* ((rx--pcase-vars nil)
>          (regexp (rx--to-expr (rx--pcase-transform (cons 'seq regexps)))))
> -    `(and (pred (string-match ,regexp))
> +    `(and (pred stringp)
> +          (app (lambda (s) (string-match ,regexp s)) (pred identity))

It does seem to work, but why exactly do we need this monstrosity instead of (pred (string-match ,regexp))? Is it because pcase-let throws away all `pred` clauses somewhere? It makes sense to do so but I haven't found exactly where this takes place in pcase.el yet...

Perhaps the assumption that non-binding clauses like `pred` (and what else, `guard`?) are all side-effect free and can be thrown away in pcase-let[*] should be documented? Not that I would have read it, of course...

I'll push a fix as soon as I understand the machinery a bit better, but right now I'm wary of getting my fingers snapped off by the gears and knives.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Pattern matching on match-string groups #elisp #question
  2021-02-25 18:28     ` Mattias Engdegård
@ 2021-02-26  4:31       ` Stefan Monnier
  2021-02-26 10:24         ` Mattias Engdegård
  0 siblings, 1 reply; 13+ messages in thread
From: Stefan Monnier @ 2021-02-26  4:31 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: Basil L. Contovounesios, Ag Ibragimov, emacs-devel

>> I'd say it's a bug.  The patch below would fix it.  Mattias, WDYT?
> Thank you, it looks like multiple bugs.  The lack of (pred stringp) was of
> course an oversight but unrelated to pcase-let, right?

Yes, it's unrelated.  I just noticed it along the way.

>>   (let* ((rx--pcase-vars nil)
>>          (regexp (rx--to-expr (rx--pcase-transform (cons 'seq regexps)))))
>> -    `(and (pred (string-match ,regexp))
>> +    `(and (pred stringp)
>> +          (app (lambda (s) (string-match ,regexp s)) (pred identity))
> It does seem to work, but why exactly do we need this monstrosity instead of
> (pred (string-match ,regexp))?

Good question.  I'm not sure how best to explain or document it, sadly.
One of the reasons is that predicates are presumed to be (mostly) pure
functions, so `pcase` feels free to call them fewer times or more times
as it pleases.

But that also largely applies to `app`, so that's not a very
good explanation.

Maybe a better explanation is that `pcase-let` optimizes the pattern
match code under the assumption that the pattern will match, so it skips
the tests that determine whether the pattern matches or not.

[ That doesn't mean it skips all the tests: if the pattern is
  (or `(a ,v) `(b ,_ ,v)) it *will* test to see if the first element is `a` in
  order to decide what to bind `v` to, but it won't bother to check if
  the first element is `b` since it presumes that the pattern does match
  and it knows that there's no further alternative.  ]

Note that this explanation is not very convincing either because it's
not clear if the test that it skipped is `(identity VAR)`
or `(identity (string-match ...))` so it's unclear whether the
`string-match` is eliminated.

> Is it because pcase-let throws away all `pred` clauses somewhere?
> It makes sense to do so but I haven't found exactly where this takes
> place in pcase.el yet...

The magic is in `pcase--if`.  I.e. a lower-level than `pred`.
It's linked to the special undocumented pcase pattern `pcase--dontcare`
(whose name is not well chosen, suggestions for better names are
welcome) which is a pattern that not only can never match but also
prevents pcase from attempting to match any further patterns (IOW it
forces pcase to just go with the last branch that it failed to match).

You might also want to look at `byte-optimize--pcase` for another
example where I use this pattern.

> Perhaps the assumption that non-binding clauses like `pred` (and what else,
> `guard`?) are all side-effect free and can be thrown away in pcase-let[*]
> should be documented?

Agreed.

> Not that I would have read it, of course...

At least I'd get to point you to the doc and shame you publicly.

> I'll push a fix as soon as I understand the machinery a bit better, but
> right now I'm wary of getting my fingers snapped off by the gears
> and knives.

Come on, that's why we start with ten of those damn fingers.  You surely
can still afford to risk one or two.

        Stefan

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Pattern matching on match-string groups #elisp #question
  2021-02-26  4:31       ` Stefan Monnier
@ 2021-02-26 10:24         ` Mattias Engdegård
  2021-02-26 19:38           ` Stefan Monnier
  0 siblings, 1 reply; 13+ messages in thread
From: Mattias Engdegård @ 2021-02-26 10:24 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: Basil L. Contovounesios, Ag Ibragimov, emacs-devel

26 feb. 2021 kl. 05.31 skrev Stefan Monnier <monnier@iro.umontreal.ca>:

> Good question.  I'm not sure how best to explain or document it, sadly.
> One of the reasons is that predicates are presumed to be (mostly) pure
> functions, so `pcase` feels free to call them fewer times or more times
> as it pleases.
> 
> But that also largely applies to `app`, so that's not a very
> good explanation.
> 
> Maybe a better explanation is that `pcase-let` optimizes the pattern
> match code under the assumption that the pattern will match, so it skips
> the tests that determine whether the pattern matches or not.
> 
> [ That doesn't mean it skips all the tests: if the pattern is
>  (or `(a ,v) `(b ,_ ,v)) it *will* test to see if the first element is `a` in
>  order to decide what to bind `v` to, but it won't bother to check if
>  the first element is `b` since it presumes that the pattern does match
>  and it knows that there's no further alternative.  ]
> 
> Note that this explanation is not very convincing either because it's
> not clear if the test that it skipped is `(identity VAR)`
> or `(identity (string-match ...))` so it's unclear whether the
> `string-match` is eliminated.

Thank you, I think this is good enough -- I've pushed the fix (with tests, so it matters less whether I've understood it) to master. (If pcase one day gets uppity enough to optimise based on the target expression as well, then a lot of tests will become meaningless.)

A clearer but less efficient pattern would be something like

(app (lambda (s) (and (string-match REGEXP s)
                      (list (match-string 1 s)
                            (match-string 2 s)
                            ...)))
     `(,VAR1 ,VAR2 ...))

which would, unless I'm mistaken, be the only way if string-match returned a match object instead of setting global match data.
Of course a sufficiently optimising compiler would eliminate the consing!

> It's linked to the special undocumented pcase pattern `pcase--dontcare`
> (whose name is not well chosen, suggestions for better names are
> welcome)

pcase--give-up
pcase--!,fail




^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Pattern matching on match-string groups #elisp #question
  2021-02-26 10:24         ` Mattias Engdegård
@ 2021-02-26 19:38           ` Stefan Monnier
  2021-02-27 10:17             ` Mattias Engdegård
  0 siblings, 1 reply; 13+ messages in thread
From: Stefan Monnier @ 2021-02-26 19:38 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: Basil L. Contovounesios, Ag Ibragimov, emacs-devel

> Thank you, I think this is good enough -- I've pushed the fix (with
> tests, so it matters less whether I've understood it) to master. (If
> pcase one day gets uppity enough to optimise based on the target
> expression as well, then a lot of tests will become meaningless.)

BTW, I was thinking about making the optimization more conservative, so
it only throws away the actual `if` but keeps the computation of the test:

    diff --git a/lisp/emacs-lisp/pcase.el b/lisp/emacs-lisp/pcase.el
    index c7288b7fa2..0036c1882d 100644
    --- a/lisp/emacs-lisp/pcase.el
    +++ b/lisp/emacs-lisp/pcase.el
    @@ -469,8 +469,8 @@ pcase--small-branch-p
     ;; the depth of the generated tree.
     (defun pcase--if (test then else)
       (cond
    -   ((eq else :pcase--dontcare) then)
    -   ((eq then :pcase--dontcare) (debug) else) ;Can/should this ever happen?
    +   ((eq else :pcase--dontcare) `(progn ,test ,then))
    +   ((eq then :pcase--dontcare) `(progn ,test ,else))
        (t (macroexp-if test then else))))

     ;; Note about MATCH:

and it does fix the `pcase-let` problem with your original code.
The problem, OTOH is that we now end up with warnings like

    In toplevel form:
    progmodes/elisp-mode.el:896:38: Warning: value returned from
        (memq (type-of l) cl-struct-xref-elisp-location-tags) is unused

and these are rather hard/inconvenient to fix (short of not using
`pcase-let`).  [ Of course, another problem is that we generate
marginally less efficient code in some cases, but that should be very
minor and that's more a failure of the byte-compiler than a problem in
`pcase`.  ]

This said, there is a fundamental problem with both the previous and the
new code:

        (pcase STR
          ((and (rx (group-n 1 (* "a"))) (guard (not (looking-at "^")))) nil)
          ((rx (let FOO (* "a"))) FOO))

It should macroexpand to something morally equivalent to:

    (cond ((not (stringp STR)) nil)
          ((not (string-match "\\(?1:a*\\)" STR)) nil)
          ((looking-at "^"")
           (let* ((x1464 (match-string 1 STR)))
             (let ((FOO x1464)) FOO))))

at which point you'll presumably notice that the match data is likely
garbled when we try to use it.
[ Cue old message about how side effects are bad for your health.  ]

> A clearer but less efficient pattern would be something like
>
> (app (lambda (s) (and (string-match REGEXP s)
>                       (list (match-string 1 s)
>                             (match-string 2 s)
>                             ...)))
>      `(,VAR1 ,VAR2 ...))

That's indeed what I did in my code (tho using a vector instead of
a list), probably because the above risk did occur to me back then.

> Of course a sufficiently optimising compiler would eliminate the consing!

Indeed, and it's not a difficult optimization (at least if you can
presume that this data is immutable).

>> It's linked to the special undocumented pcase pattern `pcase--dontcare`
>> (whose name is not well chosen, suggestions for better names are
>> welcome)
>
> pcase--give-up

Hmm... probably not much more explanatory than "dontcare".

> pcase--!,fail

That begs the question of what does it mean for the overall pcase to "fail".

I was thinking of `pcase--impossible` as well.

        Stefan

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Pattern matching on match-string groups #elisp #question
  2021-02-26 19:38           ` Stefan Monnier
@ 2021-02-27 10:17             ` Mattias Engdegård
  2021-02-27 14:39               ` Stefan Monnier
  0 siblings, 1 reply; 13+ messages in thread
From: Mattias Engdegård @ 2021-02-27 10:17 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: Basil L. Contovounesios, Ag Ibragimov, emacs-devel

26 feb. 2021 kl. 20.38 skrev Stefan Monnier <monnier@iro.umontreal.ca>:

> BTW, I was thinking about making the optimization more conservative, so
> it only throws away the actual `if` but keeps the computation of the test:
[...]
> and it does fix the `pcase-let` problem with your original code.

Given the trouble I think we can defend not respecting side-effects in something as functional as pcase!

> It should macroexpand to something morally equivalent to:
> 
>    (cond ((not (stringp STR)) nil)
>          ((not (string-match "\\(?1:a*\\)" STR)) nil)
>          ((looking-at "^"")
>           (let* ((x1464 (match-string 1 STR)))
>             (let ((FOO x1464)) FOO))))

Oh dear... perhaps we should just go with the intermediate list (or vector) and suffer the small allocation penalty? (At least we should treat the case of a single variable specially, since no consing would then be necessary.)

My guess is that a vector may be faster than a list if there are more than N elements, for some N.
Should we use string-match-p when there are no variables bound in the rx clause?

>> Of course a sufficiently optimising compiler would eliminate the consing!
> 
> Indeed, and it's not a difficult optimization (at least if you can
> presume that this data is immutable).

Right, although we would need some more serious data-flow infrastructure first. It would be useful for pattern-matching two or more values at the same time.

>>> It's linked to the special undocumented pcase pattern `pcase--dontcare`
>>> (whose name is not well chosen, suggestions for better names are
>>> welcome)
>> 
>> pcase--give-up
> 
> Hmm... probably not much more explanatory than "dontcare".

Well, 'dontcare' suggests that anything would do and the value not being used, like '_', but that's quite misleading.

> I was thinking of `pcase--impossible` as well.

Yes, that looks acceptable. In any case, it isn't really a user-facing symbol, is it? Otherwise we'd need crystal-clear semantics (and lose the double dashes).

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Pattern matching on match-string groups #elisp #question
  2021-02-27 10:17             ` Mattias Engdegård
@ 2021-02-27 14:39               ` Stefan Monnier
  2021-02-27 18:10                 ` Mattias Engdegård
  0 siblings, 1 reply; 13+ messages in thread
From: Stefan Monnier @ 2021-02-27 14:39 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: Basil L. Contovounesios, Ag Ibragimov, emacs-devel

>> BTW, I was thinking about making the optimization more conservative, so
>> it only throws away the actual `if` but keeps the computation of the test:
> [...]
>> and it does fix the `pcase-let` problem with your original code.
> Given the trouble I think we can defend not respecting side-effects in
> something as functional as pcase!

Nevertheless, I went ahead with this change (after remembering that
wrapping the code in `ignore` should eliminate the extra warnings).

>> It should macroexpand to something morally equivalent to:
>> 
>>    (cond ((not (stringp STR)) nil)
>>          ((not (string-match "\\(?1:a*\\)" STR)) nil)
>>          ((looking-at "^"")
>>           (let* ((x1464 (match-string 1 STR)))
>>             (let ((FOO x1464)) FOO))))
>
> Oh dear... perhaps we should just go with the intermediate list (or vector)
> and suffer the small allocation penalty? (At least we should treat the case
> of a single variable specially, since no consing would then be necessary.)

It's clearly The Right Thing™.

> My guess is that a vector may be faster than a list if there are more than N elements, for some N.

I'll let you benchmark it to determine the N.

> Should we use string-match-p when there are no variables bound in the rx clause?

We could.  Tho, IIRC currently `string-match-p` is ever so slightly
slower than `string-match` and since we clobber the match data in other
cases, we might as well clobber the match data in this case as well: any
code which presumes the match data isn't affected by some other code
which uses regular expressions is quite confused.

>>> Of course a sufficiently optimising compiler would eliminate the consing!
>> Indeed, and it's not a difficult optimization (at least if you can
>> presume that this data is immutable).
> Right, although we would need some more serious data-flow infrastructure
> first. It would be useful for pattern-matching two or more values at the
> same time.

I don't think it's much more complicated than your current constant
folding: when you see a let-binding of a variable to a *constructor*,
stash that expression in your context as a "partially known constant"
and then do the constant folding when you see a matching *destructor*.
The problems I see are:

- you need to detect side effects between the constructor and the
  destructor.  You could just consider any *read* of a variable holding
  such partial-constant as a (potential) side-effect (except for those
  reads recognized as part of destructors, of course).  It should be
  good enough for the case under discussion.

- more importantly in order not to duplicate code and its side-effects
  (and suffer risks linked to moving code into a different scope), you
  need to convert your constructor so all its arguments are trivial and
  "scope safe" (e.g. gensym'd variables, integer constants, symbols,
  ...).

>>>> It's linked to the special undocumented pcase pattern `pcase--dontcare`
>>>> (whose name is not well chosen, suggestions for better names are
>>>> welcome)
>>> 
>>> pcase--give-up
>> 
>> Hmm... probably not much more explanatory than "dontcare".
>
> Well, 'dontcare' suggests that anything would do and the value not being
> used, like '_', but that's quite misleading.

Right, that's part of the problem with this naming: it doesn't want to
mean that anything matches, like `_`, but that any behavior is
acceptable when pcase has to try and match against this pattern.
The other part is that it's not true: we wouldn't settle for "any"
behavior, it still has to be sane-ish.

>> I was thinking of `pcase--impossible` as well.
> Yes, that looks acceptable. In any case, it isn't really a user-facing
> symbol, is it? Otherwise we'd need crystal-clear semantics (and lose the
> double dashes).

I would settle for something a bit less than "crystal-clear", but yes to
drop the "--" we'd need a clear enough semantics.

Here's another idea for its name: `pcase--go-back` since it makes pcase
go back to the last option it tried and accept it even though it failed
to match.  It still sucks, but maybe it'll give someone else a better idea?

        Stefan

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Pattern matching on match-string groups #elisp #question
  2021-02-27 14:39               ` Stefan Monnier
@ 2021-02-27 18:10                 ` Mattias Engdegård
  2021-02-27 20:32                   ` Stefan Monnier
  0 siblings, 1 reply; 13+ messages in thread
From: Mattias Engdegård @ 2021-02-27 18:10 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: Basil L. Contovounesios, Ag Ibragimov, Emacs developers

[-- Attachment #1: Type: text/plain, Size: 2612 bytes --]

27 feb. 2021 kl. 15.39 skrev Stefan Monnier <monnier@iro.umontreal.ca>:

> Nevertheless, I went ahead with this change (after remembering that
> wrapping the code in `ignore` should eliminate the extra warnings).

So where does that leave us with the rx pattern? There's still the interleaved match data problem, which I've tried to address below.

> It's clearly The Right Thing™.

Perhaps it is; a proposed diff is attached below which treats zero and one variable specially and uses a list followed by immediate destructuring for >1 variables. (By the way, using a backquote form to generate backquote forms is annoying.)

My guess is that a vector may be faster than a list if there are more than N elements, for some N.

>> My guess is that a vector may be faster than a list if there are more than N elements, for some N.
> 
> I'll let you benchmark it to determine the N.

I now have, and am sad to say that a list is always faster for any practical number of N (I didn't bother trying more than 30) although the difference narrows as N grows. This is despite the destructuring code becoming considerably bigger for lists (as we get a long chain of tests and branches) than for vectors. It all boils down to vector construction being more expensive than lists.

Maybe we should pack N>1 variables into N-1 cons cells by using the last cdr (improper list), but list* is no primitive so it may be a loss for N>M, for some M>2.

> currently `string-match-p` is ever so slightly
> slower than `string-match` and since we clobber the match data in other
> cases, we might as well clobber the match data in this case as well: any
> code which presumes the match data isn't affected by some other code
> which uses regular expressions is quite confused.

Right; I'm sticking to string-match for the time being.

> I don't think it's much more complicated than your current constant
> folding: when you see a let-binding of a variable to a *constructor*,
> stash that expression in your context as a "partially known constant"
> and then do the constant folding when you see a matching *destructor*.

Doable, but definitely not low-hanging fruit. Since pcase has made a dog's breakfast of the destructuring code it's not straightforward to recognise it as such in the optimiser. Efforts needed elsewhere first!

> go back to the last option it tried and accept it even though it failed
> to match.  It still sucks, but maybe it'll give someone else a better idea?

Sounds like pcase--dead-end would fit then, at least as an internal name.

Or pcase--Sherlock-Holmes.


[-- Attachment #2: rx-pcase-list-destructure.diff --]
[-- Type: application/octet-stream, Size: 2401 bytes --]

diff --git a/lisp/emacs-lisp/rx.el b/lisp/emacs-lisp/rx.el
index ffc21951b6..736758d01f 100644
--- a/lisp/emacs-lisp/rx.el
+++ b/lisp/emacs-lisp/rx.el
@@ -1436,17 +1436,31 @@ rx
                    introduced by a previous (let REF ...)
                    construct."
   (let* ((rx--pcase-vars nil)
-         (regexp (rx--to-expr (rx--pcase-transform (cons 'seq regexps)))))
+         (regexp (rx--to-expr (rx--pcase-transform (cons 'seq regexps))))
+         (nvars (length rx--pcase-vars)))
     `(and (pred stringp)
-          ;; `pcase-let' takes a match for granted and discards all unnecessary
-          ;; conditions, which means that a `pred' clause cannot be used for
-          ;; the match condition.  The following construct seems to survive.
-          (app (lambda (s) (string-match ,regexp s)) (pred identity))
-          ,@(let ((i 0))
-              (mapcar (lambda (name)
-                        (setq i (1+ i))
-                        `(app (match-string ,i) ,name))
-                      (reverse rx--pcase-vars))))))
+          ,(cond ((= nvars 0)
+                  ;; No variables bound: a single predicate suffices.
+                  `(pred (string-match ,regexp)))
+                 ((= nvars 1)
+                  ;; Single variable: bind it to the result of the
+                  ;; lambda function below.
+                  `(app (lambda (s)
+                          (and (string-match ,regexp s)
+                               (match-string 1 s)))
+                        ,(car rx--pcase-vars)))
+                 (t
+                  ;; Multiple variables: pack the submatches into a list
+                  ;; which is then immediately destructured into individual
+                  ;; variables again.  This is of course slightly inefficient.
+                  `(app (lambda (s)
+                          (and (string-match ,regexp s)
+                               (list
+                                ,@(mapcar (lambda (i) `(match-string ,i s))
+                                          (number-sequence 1 nvars)))))
+                        ,(list '\`
+                               (mapcar (lambda (name) (list '\, name))
+                                       (reverse rx--pcase-vars)))))))))
 
 ;; Obsolete internal symbol, used in old versions of the `flycheck' package.
 (define-obsolete-function-alias 'rx-submatch-n 'rx-to-string "27.1")

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: Pattern matching on match-string groups #elisp #question
  2021-02-27 18:10                 ` Mattias Engdegård
@ 2021-02-27 20:32                   ` Stefan Monnier
  2021-02-28 13:46                     ` Mattias Engdegård
  0 siblings, 1 reply; 13+ messages in thread
From: Stefan Monnier @ 2021-02-27 20:32 UTC (permalink / raw)
  To: Mattias Engdegård
  Cc: Basil L. Contovounesios, Ag Ibragimov, Emacs developers

>> Nevertheless, I went ahead with this change (after remembering that
>> wrapping the code in `ignore` should eliminate the extra warnings).
> So where does that leave us with the rx pattern?

It doesn't affect it directly, except in the fact that the old
definition would now also work for `pcase-let`.

> Perhaps it is; a proposed diff is attached below which treats zero and one
> variable specially and uses a list followed by immediate destructuring for
> >1 variables. (By the way, using a backquote form to generate backquote
> forms is annoying.)

Indeed, nested backquotes are painful.

> I now have, and am sad to say that a list is always faster for any practical
> number of N (I didn't bother trying more than 30) although the difference
> narrows as N grows. This is despite the destructuring code becoming
> considerably bigger for lists (as we get a long chain of tests and branches)
> than for vectors. It all boils down to vector construction being more
> expensive than lists.

Indeed, I think there's some optimization opportunities in our
implementation of vector construction.  We've already improved the
performance significantly compared to Emacs<24, but there's still room
for improvement.

>> I don't think it's much more complicated than your current constant
>> folding: when you see a let-binding of a variable to a *constructor*,
>> stash that expression in your context as a "partially known constant"
>> and then do the constant folding when you see a matching *destructor*.
>
> Doable, but definitely not low-hanging fruit. Since pcase has made a dog's
> breakfast of the destructuring code it's not straightforward to recognise it
> as such in the optimiser. Efforts needed elsewhere first!

I wasn't talking about pcase-generated code in particular.  But yes,
it'd probably need to be combined with more general rewritings like "let
associativity" in order to handle your particular case.  I think we'd
quickly get tired of scoping issues when doing that, so we'd want to
first change the representation to one where that's less problematic
(e.g. gensym all the lexical vars so we know there can't be any name
capture when we massage the code, also also clearly separate lexical
bindings from the dynamic bindings).

>> go back to the last option it tried and accept it even though it failed
>> to match.  It still sucks, but maybe it'll give someone else a better idea?
> Sounds like pcase--dead-end would fit then, at least as an internal name.
> Or pcase--Sherlock-Holmes.

Inspiring suggestions, thanks,

        Stefan

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Pattern matching on match-string groups #elisp #question
  2021-02-27 20:32                   ` Stefan Monnier
@ 2021-02-28 13:46                     ` Mattias Engdegård
  2021-02-28 15:37                       ` Stefan Monnier
  0 siblings, 1 reply; 13+ messages in thread
From: Mattias Engdegård @ 2021-02-28 13:46 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: Basil L. Contovounesios, Ag Ibragimov, Emacs developers

27 feb. 2021 kl. 21.32 skrev Stefan Monnier <monnier@iro.umontreal.ca>:

>> So where does that leave us with the rx pattern?
> 
> It doesn't affect it directly, except in the fact that the old
> definition would now also work for `pcase-let`.

Well, the new rx definition is now in place since the old one, and the even older one in Emacs 26 and earlier, were buggy as you showed. This means that the 'rx' pattern no longer relies on side-effects being retained in pcase.

I went with dotted lists (a b c . d) because benchmarking showed it to be faster than either proper lists or vectors, the generated code is smaller than for lists, and the case of a single variable reduces naturally to no consing at all.

>> I now have, and am sad to say that a list is always faster for any practical
>> number of N (I didn't bother trying more than 30) although the difference
>> narrows as N grows. This is despite the destructuring code becoming
>> considerably bigger for lists (as we get a long chain of tests and branches)
>> than for vectors. It all boils down to vector construction being more
>> expensive than lists.
> 
> Indeed, I think there's some optimization opportunities in our
> implementation of vector construction.  We've already improved the
> performance significantly compared to Emacs<24, but there's still room
> for improvement.

Looking a bit closer it gets more nuanced: one reason why

  (pcase (list 1 2 3) (`(,a ,b ,c) (+ a b c)))

is faster than

  (pcase (vector 1 2 3) (`[,a ,b ,c] (+ a b c)))

is that the latter contains three general function calls: to `vector`, `vectorp`, and `eql` (for checking the length), whereas the list version has byte-ops for everything.




^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Pattern matching on match-string groups #elisp #question
  2021-02-28 13:46                     ` Mattias Engdegård
@ 2021-02-28 15:37                       ` Stefan Monnier
  0 siblings, 0 replies; 13+ messages in thread
From: Stefan Monnier @ 2021-02-28 15:37 UTC (permalink / raw)
  To: Mattias Engdegård
  Cc: Basil L. Contovounesios, Ag Ibragimov, Emacs developers

> I went with dotted lists (a b c . d) because benchmarking showed it to be
> faster than either proper lists or vectors, the generated code is smaller
> than for lists, and the case of a single variable reduces naturally to no
> consing at all.

Nice.

> Looking a bit closer it gets more nuanced: one reason why
>
>   (pcase (list 1 2 3) (`(,a ,b ,c) (+ a b c)))
>
> is faster than
>
>   (pcase (vector 1 2 3) (`[,a ,b ,c] (+ a b c)))
>
> is that the latter contains three general function calls: to `vector`,
> `vectorp`, and `eql` (for checking the length), whereas the list version has
> byte-ops for everything.

Very good point.  The choice of which primitives deserve their own
bytecode was made many many years ago and it likely deserves
a serious reconsideration [ we still have dedicated byte codes for
Btemp_output_buffer_show and Btemp_output_buffer_setup ;-(  ]

Personally, I'd vote to start by making the `eq` byte code return the
value of `Feql` ;-)


        Stefan




^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2021-02-28 15:37 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-02-25  5:11 Pattern matching on match-string groups #elisp #question Ag Ibragimov
2021-02-25 14:55 ` Basil L. Contovounesios
2021-02-25 15:32   ` Stefan Monnier
2021-02-25 18:28     ` Mattias Engdegård
2021-02-26  4:31       ` Stefan Monnier
2021-02-26 10:24         ` Mattias Engdegård
2021-02-26 19:38           ` Stefan Monnier
2021-02-27 10:17             ` Mattias Engdegård
2021-02-27 14:39               ` Stefan Monnier
2021-02-27 18:10                 ` Mattias Engdegård
2021-02-27 20:32                   ` Stefan Monnier
2021-02-28 13:46                     ` Mattias Engdegård
2021-02-28 15:37                       ` Stefan Monnier

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).