unofficial mirror of emacs-devel@gnu.org 
 help / color / mirror / code / Atom feed
From: Stefan Monnier <monnier@iro.umontreal.ca>
To: emacs-devel@gnu.org
Subject: Re: Improving regexp-opt
Date: Thu, 07 Feb 2019 22:48:01 -0500	[thread overview]
Message-ID: <jwvimxvt0h5.fsf-monnier+emacs@gnu.org> (raw)
In-Reply-To: CAARdmY1m2+ROr+orDMoUK7zXkFbf94o-89z6bATjzPuoVt=nBA@mail.gmail.com

> (regexp-opt '("car" "cdr" "caar" "cadr" "cdar" "cddr"))
> -> "\\(?:c\\(?:\\(?:a[ad]\\|d[ad]\\|[ad]\\)r\\)\\)"
> First of all, there is an (apparently) unnecessary group around the result.

FWIW, I think this is not an error: we want (concat (regexp-opt STRS) "*")
to have a well-defined behavior (i.e. allow any number of repetitions of
STRS).

> (regexp-opt '("master" "monster" "mister"))
> -> "\\(?:m\\(?:\\(?:on\\|[ai]\\)ster\\)\\)"
> It would be better (imo) eliminate unnecessary groups (those without "\\|")
> that the result was
>   "m\\(?:on\\|[ai]\\)ster"

Here, OTOH, the second (shy) subgroup is indeed unnecessary.

Regarding improving regexp-opt, in the general case you're looking at
minimizing finite state automatons.  When regexp-opt was written, the
main purpose was to try and reduce backtracking and for that it's
perfectly sufficient to turn ("ack" "attack") into
"a\\(?:ck\\|ttack\\)".  I later added "tail sharing" so that ("ack"
"attack") turns into "a\\(?:tta\\)?ck" but that's not really much use in
practice.  We could try and get fancier, but it will tend to slow down
regexp-opt even more for rather small benefits (except in corner cases).
A much better approach is to go for a real "regexp to NFA/DFA
conversion".  The `lex.el` package is one such example, but it's very
inefficient (in terms of building the FA and in the size of the FA, not
in terms of running the FA).


        Stefan




  reply	other threads:[~2019-02-08  3:48 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-02-07 16:41 Improving regexp-opt Miguel V. S. Frasson
2019-02-08  3:48 ` Stefan Monnier [this message]
2019-04-12 15:06   ` Miguel V. S. Frasson
2019-04-12 15:40     ` Miguel V. S. Frasson
2019-04-12 16:53     ` Stefan Monnier
2019-04-18  0:59       ` Miguel V. S. Frasson
2019-04-18  3:49         ` Stefan Monnier

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://www.gnu.org/software/emacs/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=jwvimxvt0h5.fsf-monnier+emacs@gnu.org \
    --to=monnier@iro.umontreal.ca \
    --cc=emacs-devel@gnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).