From mboxrd@z Thu Jan  1 00:00:00 1970
Path: main.gmane.org!not-for-mail
From: Ken Anderson <kanderson@bbn.com>
Newsgroups: gmane.lisp.guile.user
Subject: Re: Stupid module and pregexp questions
Date: Tue, 29 Apr 2003 20:04:41 -0400
Sender: guile-user-bounces+guile-user=m.gmane.org@gnu.org
Message-ID: <5.2.0.9.2.20030429194519.027d1ac0@zima.bbn.com>
References: <slrnbattr6.7tv.markj+0111@bouncing.localnet>
 <877k9eobcv.fsf@raven.i.defaultvalue.org>
 <slrnbad5pj.57k.markj+0111@bouncing.localnet>
 <slrnbattr6.7tv.markj+0111@bouncing.localnet>
NNTP-Posting-Host: main.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
X-Trace: main.gmane.org 1051661727 25275 80.91.224.249 (30 Apr 2003 00:15:27 GMT)
X-Complaints-To: usenet@main.gmane.org
NNTP-Posting-Date: Wed, 30 Apr 2003 00:15:27 +0000 (UTC)
Cc: markj@cloaked.freeserve.co.uk
Original-X-From: guile-user-bounces+guile-user=m.gmane.org@gnu.org Wed Apr 30 02:15:24 2003
Return-path: <guile-user-bounces+guile-user=m.gmane.org@gnu.org>
Original-Received: from monty-python.gnu.org ([199.232.76.173])
	by main.gmane.org with esmtp (Exim 3.35 #1 (Debian))
	id 19AfFs-0006ZB-00
	for <guile-user@m.gmane.org>; Wed, 30 Apr 2003 02:15:24 +0200
Original-Received: from localhost ([127.0.0.1] helo=monty-python.gnu.org)
	by monty-python.gnu.org with esmtp (Exim 4.10.13)
	id 19AfCI-0001gg-00
	for guile-user@m.gmane.org; Tue, 29 Apr 2003 20:11:42 -0400
Original-Received: from list by monty-python.gnu.org with tmda-scanned (Exim 4.10.13)
	id 19AfB8-0001PF-00
	for guile-user@gnu.org; Tue, 29 Apr 2003 20:10:30 -0400
Original-Received: from mail by monty-python.gnu.org with spam-scanned (Exim 4.10.13)
	id 19Af9M-0000jL-00
	for guile-user@gnu.org; Tue, 29 Apr 2003 20:08:46 -0400
Original-Received: from zima.bbn.com ([128.89.72.16])
	by monty-python.gnu.org with esmtp (Exim 4.10.13)
	id 19Af5l-00074I-00
	for guile-user@gnu.org; Tue, 29 Apr 2003 20:04:57 -0400
Original-Received: from ale.bbn.com (ale [128.89.72.125])
	by zima.bbn.com (8.11.4/8.11.4) with ESMTP id h3U04g913776;
	Tue, 29 Apr 2003 20:04:42 -0400 (EDT)
X-Sender: kanderso@zima.bbn.com
X-Mailer: QUALCOMM Windows Eudora Version 5.2.0.9
Original-To: Tom Lord <lord@emf.net>
In-Reply-To: <200304292321.QAA04172@morrowfield.regexps.com>
Original-cc: guile-user@gnu.org
X-BeenThere: guile-user@gnu.org
X-Mailman-Version: 2.1b5
Precedence: list
List-Id: General Guile related discussions <guile-user.gnu.org>
List-Help: <mailto:guile-user-request@gnu.org?subject=help>
List-Post: <mailto:guile-user@gnu.org>
List-Subscribe: <http://mail.gnu.org/mailman/listinfo/guile-user>,
	<mailto:guile-user-request@gnu.org?subject=subscribe>
List-Archive: <http://mail.gnu.org/pipermail/guile-user>
List-Unsubscribe: <http://mail.gnu.org/mailman/listinfo/guile-user>,
	<mailto:guile-user-request@gnu.org?subject=unsubscribe>
Errors-To: guile-user-bounces+guile-user=m.gmane.org@gnu.org
Xref: main.gmane.org gmane.lisp.guile.user:1875
X-Report-Spam: http://spam.gmane.org/gmane.lisp.guile.user:1875

At 04:21 PM 4/29/2003 -0700, Tom Lord wrote:

>       > And then, of course, there's the issue of speed.  Regexps are used for
>       > enough processing that IMHO they must be matched by compiled, not
>       > interpreted, code or they risk being unacceptably slow.  [...]
>
>       Compiled code is just interpreted code at a different level,
>       surely?  A good optimisation will often beat dropping down
>       levels, and scheme allows easier optimisation while avoiding
>       some typical errors.  Do the minimum directly in C, IMHO.
>
>I have some experience in regexp implementation, so may I offer my
>$0.02?
>
>a) In popular use on unix, fairly low performance matchers dominate
>   with the exception of apps like awk and grep.  But when I say
>   "fairly low performance" -- please don't misunderstand -- I'm 
>   oversimplifying.   What really dominates are matchers that are 
>   pretty slow on patterns that are at all tricky, but that optimize
>   some common cases like a constant-string pattern and patterns close
>   to that.   It mostly just doesn't occur to people to use regexps
>   for problems that fall outside of those common cases.
>
>
>b) A really fast and general matcher, like Rx (as in hackerlab C
>   library, not as in the ancient fork on the GNU FTP site), opens a
>   lot of doors.  You can apply dynamically generated regexps to
>   applications that were previously out of reach.   A nice example
>   might be to write parsers for a really rich wiki language.
>
>   To my mind, opening the door to applications like that through the
>   provision of an extra fancy regexp engine is a neat thing to do --
>   and is a way Guile could differentiate itself from other
>   languages.   At the same time, it takes a lot of code and it's
>   touchy to tune -- so it risks violating the KISS principle.
>
>   And, oh yeah -- you'll want shared substrings to make things really
>   hum along nicely (ahem :-).
>
>
>Matchers of type (a) are butt simple to implement.  (Hard to get right
>if you want to nail all the details to be Posix conformant, but the
>algorithms are pretty straightforward.)  People have gotten a lot of
>milage by writing some in Java.   I don't see any reason why a Scheme
>version couldn't be competitive -- but really, you'd need it to be
>compiled and to have at least a few relevant compiler optimizations.
>
>Matchers of type (b) are hard to implement.  Really hard.  And the
>competition among them boils down to O(1-3) instructions in critical
>loops, and to fast paths through exceptional but, alas, important
>non-common cases.  There isn't a practical chance in hell of any
>Scheme compiler competing in this space in the next 5 years (at
>least).   The compiler theory is arguably there -- but its practical
>application is quite a ways off.
>
>So whether you prefer (a) or (b), if you want competitive performance,
>you're stuck doing some compiler hacking if you choose to use an
>engine in Scheme.  A serious but tractable amount of compiler hacking
>for (a) (catch up to Java, at least), a researchy amount for (b).
>
>So, "do the minimum in C": that means write the regexp engine in C,
>for all practical purposes.

I think there are alternatives more people should know about.
Henry Baker has a "Pragmatic parsing" paper:
http://citeseer.nj.nec.com/henry91pragmatic.html
that suggests writing tokenizers in a direct style.  In terms of regular expressions, its like writing the expression in a fully deterministic way.

This may sound bad at first, but i've written a tokenizer and its common lisp compiler for CORBA IDL in a variation on this style and it was smaller than Sun's input to LEX for the same task.  It compiled into goto's not objects or tables, so i suspect it was relatively fast.

The domain of regex matching requires a fairly narrow subset of Scheme so it might be possible to do fairly well, but it ultimately depends on how good the compiler is.  At least in CL, with some work, you could get C like code produced.

Another issue is when you should convert your regex from nondetermanistic form to deterministic form.  I've seen Java regex that require 25 seconds to compile, so lazy compiling may pay off.

>Say:  suppose I implement an Emacs buffer-like string type either as a
>gap buffer or, better, as a tree of some sort (perhaps a splay tree)
>-- a good question for your regexp engine is "can it handle a string
>that isn't contiguous in memory like that".
>
>-t
>
>(here's a snippet of what I do with mine -- the SRE-like expressions
>get compiled down to an extended version of Posix extended regexp
>syntax:)
>
>(begin
>  (define wiki-paragraph-rules
>    ;; (type test separator)
>    ;; 
>    ;; Test is a structured regexp to be compiled in a larg `(| ...)'
>    ;; of all of the test expressions.  The leftmost-longest matching
>    ;; test expression determines the type of the first paragraph in a
>    ;; given string.
>    ;; 
>    ;; `(separator string)' returns a list: `(paragraph remaining-string)',
>    ;; separating the first paragraph from the rest of the string.
>    ;; 
>    ;; The `test' expression and `separator' procedure can safely assume 
>    ;; that the string is not empty, and does not begin with any blank lines.
>    ;; 
>
>    `((:form-feed               (& (* ([] blank)) "\f" (* ([] blank)))
>                                ,(lambda (s) (one-line-separator s)))
>      (:comment-line            (&  "%%%"(* ([^] "\n")))
>                                ,(lambda (s) (one-line-separator s)))
>      (:rfc822ish               (& (* ([] blank)) "+++" (* ([] blank)))
>                                ,(lambda (s) (ordinary-paragraph-separator s)))
>
>      (:title                   (& "!" (+ ([^] "\n")))
>                                ,(lambda (s) (ordinary-paragraph-separator s)))
>
>      (:card-boundary           "\f---"
>                                ,(lambda (s) (one-line-separator s)))
>
>      (:heading                 (& (* ([] blank))
>                                   (+ "*")
>                                   ([] blank)
>                                   ([^] ")#*\n")
>                                   (* ([^] "\n")))
>                                ,(lambda (s) (ordinary-paragraph-separator s)))
>
>      (:menu                    (& (* ([] blank))
>                                   "-*-*-"
>                                   (* ([] blank)))
>                                ,(lambda (s) (one-line-separator s)))
>
>      (:verbatim                (& (* ([] blank)) "<<<" (* ([] blank)))
>                                ,(lambda (s) (verbatim-paragraph-separator s)))
>
>      (:small-paragraph         (& (* ([] blank)) "(((" (* ([] blank)))
>                                ,(lambda (s) (small-paragraph-separator s)))
>
>      (:text-area               (& (* ([] blank)) "?<<<" (* ([^] #\nl)))
>                                ,(lambda (s) (verbatim-paragraph-separator s)))
>
>      (:one-line-verbatim       (& "#" (* ([^] "\n")))
>                                ,(lambda (s) (one-line-separator s)))
>
>      (:separator-line          (& (* ([] blank)) "---" (* "-") (* ([] blank)))
>                                ,(lambda (s) (one-line-separator s)))
>
>
>
>[....]
>
>
>
>_______________________________________________
>Guile-user mailing list
>Guile-user@gnu.org
>http://mail.gnu.org/mailman/listinfo/guile-user


_______________________________________________
Guile-user mailing list
Guile-user@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-user