bug#192: regexp does not work as documented

unofficial mirror of bug-gnu-emacs@gnu.org 
 help / color / mirror / code / Atom feed

From: Thomas Lord <lord@emf.net>
To: Stefan Monnier <monnier@iro.umontreal.ca>
Cc: Chong Yidong <cyd@stupidchicken.com>,
	192@emacsbugs.donarmstrong.com, emacs-devel@gnu.org,
	David Koppelman <koppel@ece.lsu.edu>,
	Bruno Haible <bruno@clisp.org>
Subject: bug#192: regexp does not work as documented
Date: Sun, 11 May 2008 20:30:00 -0700	[thread overview]
Message-ID: <4827B9B8.30406__26274.8840534643$1210588121$gmane$org@emf.net> (raw)
In-Reply-To: <jwvlk2gpas3.fsf-monnier+emacsbugreports@gnu.org>

Stefan Monnier wrote:
> I have most of the DFA construction code written, but I may take you up
> on that anyway.  BTW, regarding the "very expensive in the worst case",
> how common is this worst case in real life?
>   

Talking about general purpose use of an engine, as opposed to something
more narrow like "likely font-lock expressions":

It's common enough in "real life" (in my experience) to be exactly the
minimal amount required to make it annoying.

Let me give you some rules of thumb to describe what I mean but I'll
refrain from a long treatise explaining the theory that supports these.
If you have questions I can "unpack" this on list or off.

First, be careful to make a clear distinction between (let's dub the
distinction) "regexps vs. regular expressions".   "Regular expressions"
correspond to formally regular languages.  Regular expressions can
always be converted to a DFA.   "regexps" are what most people use
in most situations.   regexps include things like features to extract the
positions that match a sub-pattern.   regexps do *not* correspond to
regular languages and *can not* always be converted to DFA.   DFA
conversion can help with regexp matching, but it can't solve it completely.
Moreover, you can make pathological regexps (very slow to match) for
every regexp I know of.   Henry Spencer (I've heard) asserts that regexps
are NP complete or something around there, though I haven't seen the
proof.  (Rx is a hybrid regular expression and regexp engine based on an
on-line (caching, incremental) DFA conversion suggested in the "Dragon"
compiler book and originating from (as I recall) an experiment by
Thompson and one of the other Bell Labs guys.  One of whoever it was
privately mentioned that they abandoned it themselves because it was taking
too much code to implement, or something like that.)

That aside, regular expressions are probably plenty for font-lock 
functionality,
so let's just talk about those.   On-line DFA conversion for those is 
likely
practical by multiple means.

The size of a DFA is, worst-case, exponential in the size of the regular 
expression.
This is the source of pathological cases that can thwart any  DFA 
engine.  (In exponential
cases, Rx keeps running but more like an NFA, taking a corresponding 
performance hit.)

For a time, I was pushing hard on Rx, trying to use it for absolutely 
everything.
To make up new problems to solve as in "Ok, let's suppose I can run 
X-Kbyte long
regular expressions with lots of | operators, stars, etc.   What can I 
apply that to?"
One example is lexing for real-world computer languages.

[Aside: you can also use the same engine as the core of a shift-reduce 
parser, of course -- and I
did that a long time ago with Guile, to useful effect.]

Almost always, the expressions were such that Rx would have no problem 
and give
very pleasing results.   However, there were definitely times (for huge 
regular expressions
and smaller ones) when things would just absolutely crawl.   They happen 
"just often
enough" to be an annoyance.

Now, every case of that annoyance that I found (in real-life 
applications) had a solution.
I could think for a while on *why* the regular expression I wrote was 
blowing up and
then think of a different approach that eliminated the problem.   Most 
often that
meant not a different but equivalent regular expression -- it meant 
going one level up and
changing the way the caller used regular expressions.   The caller would 
retain the same
functionality but different demands would be made on the regular 
expression (or regexp) engine.

And that makes it tricky to drop *blindly* into Emacs.   To solve the 
unavoidable annoying cases
you really had to know how regular expressions worked at a deep level.   
To diagnose
and work around the pathological cases took some expertise.

As a general rule, for something "general purpose" like a libc 
implementation or the
default regexp functions in Emacs lisp, there is something to be said 
for using NFA matchers.
It's perverse why:

Vast swaths of things that a DFA matcher can do very quickly an NFA 
matcher can not.
It's reliably slow.  Therefore, people not prepared to think hard about 
regexps tend to use
an NFA matcher in pretty limited ways.   A powerful regular expression 
engine could
simplify their code, at the cost of risking finding a "hard case" -- but 
there's no issue since
people quickly give up and don't try to push the match engine that hard.

More concisely:

If you use a DFA matcher You Better Know What You're Doing but also
Most of the Time You Won't Need To So It's Very Convenient.

If you use an NFA matcher You Can Get Away With Not Really Knowing What
You're Doing but also You Won't Be Trying Anything Fancy, Perhaps Losing 
Out.

One approach to toss into the mix, this one suggested by Per Bothner 
years ago,
is to consider *offline* DFA conversion (a la 'lex(1)').    The 
advantage of offline
(batch) conversion is that you can burn a lot of cycles on DFA 
minimization and,
if your offline converter terminates, you've got a reliably linear 
matcher.   The
disadvantages for *many* uses of regular expressions in Emacs should be 
pretty obvious.
For something like font-lock, where the regular expressions don't change 
that often,
that might be a good approach -- precompile a minimal DFA and then add 
support
for "regular expression continuations" when using those tables.

-t

next prev parent reply	other threads:[~2008-05-12  3:30 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <87k5i8ukq8.fsf@stupidchicken.com>
     [not found] ` <200805061335.11379.bruno@clisp.org>
     [not found]   ` <48204B3D.6000500@gmx.at>
2008-05-10 19:18     ` bug#192: regexp does not work as documented David Koppelman
     [not found]     ` <yg5skwqc6ho.fsf@nested.ece.lsu.edu>
2008-05-10 20:13       ` David Koppelman
     [not found]       ` <yg5bq3ddij2.fsf@nested.ece.lsu.edu>
2008-05-11  7:40         ` martin rudalics
     [not found]         ` <4826A303.3030002@gmx.at>
2008-05-11 14:27           ` Chong Yidong
     [not found]           ` <87abiwoqzd.fsf@stupidchicken.com>
2008-05-11 15:36             ` David Koppelman
     [not found]             ` <yg57ie0df8u.fsf@nested.ece.lsu.edu>
2008-05-11 18:44               ` Stefan Monnier
     [not found]               ` <jwv4p94r8vp.fsf-monnier+emacsbugreports@gnu.org>
2008-05-11 19:09                 ` David Koppelman
     [not found]                 ` <yg5tzh4bqtw.fsf@nested.ece.lsu.edu>
2008-05-12  1:28                   ` Stefan Monnier
     [not found]                   ` <jwvr6c8pbd6.fsf-monnier+emacsbugreports@gnu.org>
2008-05-12 15:03                     ` David Koppelman
     [not found]                     ` <yg5d4nra7jb.fsf@nested.ece.lsu.edu>
2008-05-12 16:29                       ` Stefan Monnier
     [not found]                       ` <jwvzlqvmr6g.fsf-monnier+emacsbugreports@gnu.org>
2008-05-12 17:04                         ` David Koppelman
2008-05-11 18:44             ` Stefan Monnier
     [not found]             ` <jwv8wygrbss.fsf-monnier+emacsbugreports@gnu.org>
2008-05-11 20:03               ` Thomas Lord
     [not found]               ` <482750F4.2050102@emf.net>
2008-05-12  1:43                 ` Stefan Monnier
     [not found]                 ` <jwvlk2gpas3.fsf-monnier+emacsbugreports@gnu.org>
2008-05-12  3:30                   ` Thomas Lord [this message]
     [not found]                   ` <4827B9B8.30406@emf.net>
2008-05-12 13:43                     ` Stefan Monnier
     [not found]                     ` <jwvfxsnpryp.fsf-monnier+emacsbugreports@gnu.org>
2008-05-12 15:55                       ` Thomas Lord
     [not found]                       ` <48286862.6040105@emf.net>
2008-05-12 16:18                         ` tomas
     [not found]   ` <jwvfxsvbgg5.fsf-monnier+emacs@gnu.org>
2008-05-10 20:04     ` Bruno Haible
2008-05-06  1:30 Bruno Haible
2015-12-29 17:48 ` bug#192: " Bruno Haible

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://www.gnu.org/software/emacs/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='4827B9B8.30406__26274.8840534643$1210588121$gmane$org@emf.net' \
    --to=lord@emf.net \
    --cc=192@emacsbugs.donarmstrong.com \
    --cc=bruno@clisp.org \
    --cc=cyd@stupidchicken.com \
    --cc=emacs-devel@gnu.org \
    --cc=koppel@ece.lsu.edu \
    --cc=monnier@iro.umontreal.ca \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).