From mboxrd@z Thu Jan 1 00:00:00 1970 Path: main.gmane.org!not-for-mail From: Tom Lord Newsgroups: gmane.lisp.guile.user Subject: Re: Stupid module and pregexp questions Date: Mon, 5 May 2003 10:33:09 -0700 (PDT) Sender: guile-user-bounces+guile-user=m.gmane.org@gnu.org Message-ID: <200305051733.KAA11843@morrowfield.regexps.com> References: <877k9eobcv.fsf@raven.i.defaultvalue.org> <200304292321.QAA04172@morrowfield.regexps.com> <877k96htat.fsf@raven.i.defaultvalue.org> <87of2h25tj.fsf@raven.i.defaultvalue.org> NNTP-Posting-Host: main.gmane.org X-Trace: main.gmane.org 1052155943 21480 80.91.224.249 (5 May 2003 17:32:23 GMT) X-Complaints-To: usenet@main.gmane.org NNTP-Posting-Date: Mon, 5 May 2003 17:32:23 +0000 (UTC) Cc: guile-user@gnu.org Original-X-From: guile-user-bounces+guile-user=m.gmane.org@gnu.org Mon May 05 19:32:20 2003 Return-path: Original-Received: from monty-python.gnu.org ([199.232.76.173]) by main.gmane.org with esmtp (Exim 3.35 #1 (Debian)) id 19Cjcz-00040e-00 for ; Mon, 05 May 2003 19:19:49 +0200 Original-Received: from localhost ([127.0.0.1] helo=monty-python.gnu.org) by monty-python.gnu.org with esmtp (Exim 4.10.13) id 19Cjdp-0001T3-01 for guile-user@m.gmane.org; Mon, 05 May 2003 13:20:41 -0400 Original-Received: from list by monty-python.gnu.org with tmda-scanned (Exim 4.10.13) id 19CjcX-0000PN-00 for guile-user@gnu.org; Mon, 05 May 2003 13:19:21 -0400 Original-Received: from mail by monty-python.gnu.org with spam-scanned (Exim 4.10.13) id 19CjcC-0008Tl-00 for guile-user@gnu.org; Mon, 05 May 2003 13:19:01 -0400 Original-Received: from 1cust86.tnt13.sfo8.da.uu.net ([65.234.195.86] helo=morrowfield.regexps.com) by monty-python.gnu.org with esmtp (Exim 4.10.13) id 19Cjba-0007vQ-00 for guile-user@gnu.org; Mon, 05 May 2003 13:18:23 -0400 Original-Received: (from lord@localhost) by morrowfield.regexps.com (8.9.1/8.9.1) id KAA11843; Mon, 5 May 2003 10:33:09 -0700 (PDT) (envelope-from lord@morrowfield.regexps.com) Original-To: rlb@defaultvalue.org In-reply-to: <87of2h25tj.fsf@raven.i.defaultvalue.org> (message from Rob Browning on Mon, 05 May 2003 02:47:20 -0500) X-BeenThere: guile-user@gnu.org X-Mailman-Version: 2.1b5 Precedence: list List-Id: General Guile related discussions List-Help: List-Post: List-Subscribe: , List-Archive: List-Unsubscribe: , Errors-To: guile-user-bounces+guile-user=m.gmane.org@gnu.org Xref: main.gmane.org gmane.lisp.guile.user:1909 X-Report-Spam: http://spam.gmane.org/gmane.lisp.guile.user:1909 > If all (or nearly all) of the libs that guile might choose to > link against for (ice-9 regex) on various platforms are > consistent with each other, They are not. > then I should perhaps withdraw my > suggestions. I was just under the impression that they vary > substantially, and wanted to have at least one familiar regex > subsystem in the core that eliminated the variance. You might want to ask RMS if he has a version of GNU regex that is reasonably correct these days. I recall that he did some work on that last year -- but that might have been just for the version in Emacs. GNU regex has the virtues of being a very small implementation with standard Posix entry points, and the drawback (for _some_ apps) of using a backtracking-based implementation. That would be a net win for (ice-9 regex): (a correct) GNU regex (if you can scare one up) is small enough to distribute with Guile, and then those people who need a faster version need only find one that offers posix entry points. Be careful because I think that most versions of GNU regex in circulation (many slightly different versions, all labeled "0.12") have subtle bugs. In src/hackerlab/tests/rx-posix-tests (e.g., in arch-1.0pre18) there are some conformance tests for Posix matchers and a driver program "=generic-test-regex.c" that you can use to run them. We were slightly talking at cross purposes: when I concluded that you should really consider starting with a minimalist truly-regular pattern language, I was thinking beyond just (ice-9 regex) -- for example, what would we want in a regex srfi? Then, when people start talking about using Perl regexps to be "compatible" to Perl and Python: well, what's the point of that? Some of the biggest wins for using Scheme as the basis of an app-framework/extension/scripting language are the opportunities to provide a far superior programming environment based on a language that a fair number of implementors work on making quite fast. > Also, if one of the main things you're arguing is that perl > and emacs-style regexes have extensions that we need to do > without if we want good performance, then I'm not trying to > argue against that assertion. That's not quite what I'm saying. It isn't about "extensions": it's about the basic semantics of regexp operators like | and *. Some regexp languages (Emacs is one example, Javascript's another) are defined in terms of backtracking implementations. They aren't "regular expression" engines in the theoretical CS sense at all. They don't provide equivalent DFA's even for very simple patterns like "else|elseif". You can't build a program like "lex" to statically compile their patterns for a linear scanner. You can't build a dynamic DFA engine like Rx. Those pattern languages simply don't scale to large problems (font lock mode, anyone?) and don't have nice simple algebraic properties for automatic manipulation of patterns (SREs, anyone?). The extensions in those languages are a little bit relevant: they tend to be extensions that are easy to implement given a backtracking matcher as a starting point, and basically impossible to implement (efficiently) in DFA-based implementations. So the extensions compound the problem -- but the root of the problem is the semantics of very basic operators. (Subexpression position reporting, anchors, and backreferences in Posix appear to have started out as similar backtracking-oriented extensions. When Posix was standardized, those extensions were given semantics wrongly presumed to be DFA-friendly. As a result, Posix regexps are in some sense the worst of both worlds: they force backtracking matchers to do nearly exhaustive searches and they force DFA-based implementations to add a complicated layer of weird-ass recursive backtracking.) > I'm really just ruminating on the advisability of a > well-defined, invariant, and reasonably familiar regex > syntax for guile's core. To my ears, that's different from just thinking about what backs up (ice-9 regex). That should be more at the level of "What would a good regex SRFI do?" And I stand by my answer there: none of the standard regexp languages are any good, though the XML Schema version comes closest. A pattern language that managed to be truly a regular expression language in the CS-theory sense is the only sane basis for something like an SRE approach. > I'd probably be perfectly happy with a good POSIX > implementation in the core, perhaps even with a subset of > POSIX if dropping certain bits were somehow important... As I said, ask about for a working GNU regex for (ice-9 regex). I think that's the only practical choice you (might) have. But I encourage you also to have higher ambitions and just point to my experience using Rx in systas. Having a dynamic DFA engine in there enabled a bunch of neat applications that would be simply out of reach in many environments. -t _______________________________________________ Guile-user mailing list Guile-user@gnu.org http://mail.gnu.org/mailman/listinfo/guile-user