unofficial mirror of guile-user@gnu.org 
 help / color / mirror / Atom feed
From: Tom Lord <lord@emf.net>
Cc: guile-user@gnu.org
Subject: Re: Stupid module and pregexp questions
Date: Tue, 29 Apr 2003 16:21:10 -0700 (PDT)	[thread overview]
Message-ID: <200304292321.QAA04172@morrowfield.regexps.com> (raw)
In-Reply-To: <slrnbattr6.7tv.markj+0111@bouncing.localnet> (message from MJ Ray on Tue, 29 Apr 2003 22:06:30 -0000)


       > And then, of course, there's the issue of speed.  Regexps are used for
       > enough processing that IMHO they must be matched by compiled, not
       > interpreted, code or they risk being unacceptably slow.  [...]

       Compiled code is just interpreted code at a different level,
       surely?  A good optimisation will often beat dropping down
       levels, and scheme allows easier optimisation while avoiding
       some typical errors.  Do the minimum directly in C, IMHO.

I have some experience in regexp implementation, so may I offer my
$0.02?

a) In popular use on unix, fairly low performance matchers dominate
   with the exception of apps like awk and grep.  But when I say
   "fairly low performance" -- please don't misunderstand -- I'm 
   oversimplifying.   What really dominates are matchers that are 
   pretty slow on patterns that are at all tricky, but that optimize
   some common cases like a constant-string pattern and patterns close
   to that.   It mostly just doesn't occur to people to use regexps
   for problems that fall outside of those common cases.


b) A really fast and general matcher, like Rx (as in hackerlab C
   library, not as in the ancient fork on the GNU FTP site), opens a
   lot of doors.  You can apply dynamically generated regexps to
   applications that were previously out of reach.   A nice example
   might be to write parsers for a really rich wiki language.

   To my mind, opening the door to applications like that through the
   provision of an extra fancy regexp engine is a neat thing to do --
   and is a way Guile could differentiate itself from other
   languages.   At the same time, it takes a lot of code and it's
   touchy to tune -- so it risks violating the KISS principle.

   And, oh yeah -- you'll want shared substrings to make things really
   hum along nicely (ahem :-).


Matchers of type (a) are butt simple to implement.  (Hard to get right
if you want to nail all the details to be Posix conformant, but the
algorithms are pretty straightforward.)  People have gotten a lot of
milage by writing some in Java.   I don't see any reason why a Scheme
version couldn't be competitive -- but really, you'd need it to be
compiled and to have at least a few relevant compiler optimizations.

Matchers of type (b) are hard to implement.  Really hard.  And the
competition among them boils down to O(1-3) instructions in critical
loops, and to fast paths through exceptional but, alas, important
non-common cases.  There isn't a practical chance in hell of any
Scheme compiler competing in this space in the next 5 years (at
least).   The compiler theory is arguably there -- but its practical
application is quite a ways off.

So whether you prefer (a) or (b), if you want competitive performance,
you're stuck doing some compiler hacking if you choose to use an
engine in Scheme.  A serious but tractable amount of compiler hacking
for (a) (catch up to Java, at least), a researchy amount for (b).

So, "do the minimum in C": that means write the regexp engine in C,
for all practical purposes.

Say:  suppose I implement an Emacs buffer-like string type either as a
gap buffer or, better, as a tree of some sort (perhaps a splay tree)
-- a good question for your regexp engine is "can it handle a string
that isn't contiguous in memory like that".

-t

(here's a snippet of what I do with mine -- the SRE-like expressions
get compiled down to an extended version of Posix extended regexp
syntax:)

(begin
  (define wiki-paragraph-rules
    ;; (type test separator)
    ;; 
    ;; Test is a structured regexp to be compiled in a larg `(| ...)'
    ;; of all of the test expressions.  The leftmost-longest matching
    ;; test expression determines the type of the first paragraph in a
    ;; given string.
    ;; 
    ;; `(separator string)' returns a list: `(paragraph remaining-string)',
    ;; separating the first paragraph from the rest of the string.
    ;; 
    ;; The `test' expression and `separator' procedure can safely assume 
    ;; that the string is not empty, and does not begin with any blank lines.
    ;; 

    `((:form-feed		(& (* ([] blank)) "\f" (* ([] blank)))
				,(lambda (s) (one-line-separator s)))
      (:comment-line		(&  "%%%"(* ([^] "\n")))
				,(lambda (s) (one-line-separator s)))
      (:rfc822ish		(& (* ([] blank)) "+++" (* ([] blank)))
				,(lambda (s) (ordinary-paragraph-separator s)))

      (:title			(& "!" (+ ([^] "\n")))
				,(lambda (s) (ordinary-paragraph-separator s)))

      (:card-boundary		"\f---"
				,(lambda (s) (one-line-separator s)))

      (:heading			(& (* ([] blank))
				   (+ "*")
				   ([] blank)
				   ([^] ")#*\n")
				   (* ([^] "\n")))
				,(lambda (s) (ordinary-paragraph-separator s)))

      (:menu			(& (* ([] blank))
				   "-*-*-"
				   (* ([] blank)))
				,(lambda (s) (one-line-separator s)))

      (:verbatim		(& (* ([] blank)) "<<<" (* ([] blank)))
				,(lambda (s) (verbatim-paragraph-separator s)))

      (:small-paragraph		(& (* ([] blank)) "(((" (* ([] blank)))
				,(lambda (s) (small-paragraph-separator s)))

      (:text-area		(& (* ([] blank)) "?<<<" (* ([^] #\nl)))
				,(lambda (s) (verbatim-paragraph-separator s)))

      (:one-line-verbatim	(& "#" (* ([^] "\n")))
				,(lambda (s) (one-line-separator s)))

      (:separator-line		(& (* ([] blank)) "---" (* "-") (* ([] blank)))
				,(lambda (s) (one-line-separator s)))



[....]



_______________________________________________
Guile-user mailing list
Guile-user@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-user


  reply	other threads:[~2003-04-29 23:21 UTC|newest]

Thread overview: 64+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2003-04-23 13:37 Stupid module and pregexp questions MJ Ray
2003-04-23 14:56 ` Paul Jarc
2003-04-24 10:01   ` MJ Ray
2003-04-24 12:52 ` Andreas Rottmann
2003-04-24 13:15   ` MJ Ray
2003-04-24 13:36     ` Andreas Rottmann
2003-04-24 16:58       ` Marius Vollmer
2003-04-24 22:55         ` Andreas Rottmann
2003-04-24 17:58       ` MJ Ray
2003-04-28 16:06 ` Rob Browning
2003-04-28 16:44   ` MJ Ray
2003-04-28 17:03     ` Rob Browning
2003-04-28 17:51       ` MJ Ray
2003-04-28 18:18         ` Rob Browning
2003-04-28 18:07       ` Dr. Peter Ivanyi
2003-04-29 18:38         ` MJ Ray
2003-04-28 17:53   ` tomas
2003-04-28 17:12     ` Rob Browning
2003-04-28 17:55     ` MJ Ray
2003-04-29  8:12       ` Low level things in C or Scheme [was Stupid module and pregexp questions] tomas
2003-04-29 17:35         ` Thamer Al-Harbash
2003-04-29 19:34           ` Low level things in C or Scheme Mikael Djurfeldt
2003-04-29 20:24             ` Ken Anderson
2003-04-30  4:27           ` Low level things in C or Scheme [was Stupid module and pregexp questions] Robert Uhl
2003-04-30 13:27             ` Thamer Al-Harbash
2003-04-30  6:39           ` tomas
2003-04-29  0:45   ` Stupid module and pregexp questions Robert Uhl
2003-04-29 22:06     ` MJ Ray
2003-04-29 23:21       ` Tom Lord [this message]
2003-04-30  0:04         ` Ken Anderson
2003-04-30  6:48         ` tomas
2003-04-30  6:31           ` Tom Lord
2003-04-30  6:35             ` Tom Lord
2003-10-24 21:29             ` Thien-Thi Nguyen
2003-10-24 22:30               ` Tom Lord
2003-10-26 18:38                 ` Thien-Thi Nguyen
2003-04-30  6:58           ` Thien-Thi Nguyen
2003-04-30 10:34             ` tomas
2003-04-30 17:11               ` Tom Lord
2003-05-06  9:50                 ` tomas
2003-05-06  9:28                   ` Tom Lord
2003-05-08 11:47                     ` tomas
2003-10-24 21:45               ` Thien-Thi Nguyen
2003-10-24 22:37                 ` Tom Lord
2003-10-26 18:47                   ` Thien-Thi Nguyen
2003-10-27 10:48                 ` tomas
2003-05-05  5:11         ` Rob Browning
2003-05-05  6:18           ` Tom Lord
2003-05-05  7:47             ` Rob Browning
2003-05-05 17:33               ` Tom Lord
2003-05-05 19:37                 ` Rob Browning
2003-05-05 20:19                   ` Tom Lord
2003-10-24 22:26             ` Thien-Thi Nguyen
2003-10-24 22:58               ` Tom Lord
2003-10-26 19:02                 ` Thien-Thi Nguyen
2003-10-27 10:26                 ` tomas
2003-10-27 14:19                 ` Dale P. Smith
2003-10-27 14:54                   ` rm
2003-10-28  0:57                     ` Robert Marlow
2003-10-28  1:59                       ` Tom Lord
2003-10-29  9:36                         ` Harri Haataja
2003-10-28  2:05                       ` lord
     [not found]                         ` <lord@morrowfield.regexps.com>
2003-10-28  2:23                           ` Thien-Thi Nguyen
2003-04-30  4:38       ` Robert Uhl

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://www.gnu.org/software/guile/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=200304292321.QAA04172@morrowfield.regexps.com \
    --to=lord@emf.net \
    --cc=guile-user@gnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).