From mboxrd@z Thu Jan 1 00:00:00 1970 Path: main.gmane.org!not-for-mail From: Tom Lord Newsgroups: gmane.lisp.guile.user Subject: Re: Stupid module and pregexp questions Date: Wed, 30 Apr 2003 10:11:53 -0700 (PDT) Sender: guile-user-bounces+guile-user=m.gmane.org@gnu.org Message-ID: <200304301711.KAA07475@morrowfield.regexps.com> References: <877k9eobcv.fsf@raven.i.defaultvalue.org> <20030430103445.GB23809@www> NNTP-Posting-Host: main.gmane.org X-Trace: main.gmane.org 1051722884 3167 80.91.224.249 (30 Apr 2003 17:14:44 GMT) X-Complaints-To: usenet@main.gmane.org NNTP-Posting-Date: Wed, 30 Apr 2003 17:14:44 +0000 (UTC) Cc: tomas@fabula.de Original-X-From: guile-user-bounces+guile-user=m.gmane.org@gnu.org Wed Apr 30 19:14:42 2003 Return-path: Original-Received: from monty-python.gnu.org ([199.232.76.173]) by main.gmane.org with esmtp (Exim 3.35 #1 (Debian)) id 19AvAH-0000ou-00 for ; Wed, 30 Apr 2003 19:14:41 +0200 Original-Received: from localhost ([127.0.0.1] helo=monty-python.gnu.org) by monty-python.gnu.org with esmtp (Exim 4.10.13) id 19Av0W-00079D-01 for guile-user@m.gmane.org; Wed, 30 Apr 2003 13:04:36 -0400 Original-Received: from list by monty-python.gnu.org with tmda-scanned (Exim 4.10.13) id 19AuuC-0004gQ-00 for guile-user@gnu.org; Wed, 30 Apr 2003 12:58:04 -0400 Original-Received: from mail by monty-python.gnu.org with spam-scanned (Exim 4.10.13) id 19Auu7-0004aO-00 for guile-user@gnu.org; Wed, 30 Apr 2003 12:58:01 -0400 Original-Received: from 1cust8.tnt13.sfo8.da.uu.net ([65.234.195.8] helo=morrowfield.regexps.com) by monty-python.gnu.org with esmtp (Exim 4.10.13) id 19Auu5-0004Wy-00 for guile-user@gnu.org; Wed, 30 Apr 2003 12:57:58 -0400 Original-Received: (from lord@localhost) by morrowfield.regexps.com (8.9.1/8.9.1) id KAA07475; Wed, 30 Apr 2003 10:11:53 -0700 (PDT) (envelope-from lord@morrowfield.regexps.com) Original-To: tomas@fabula.de In-reply-to: <20030430103445.GB23809@www> (tomas@fabula.de) Original-cc: ttn@glug.org Original-cc: guile-user@gnu.org X-BeenThere: guile-user@gnu.org X-Mailman-Version: 2.1b5 Precedence: list List-Id: General Guile related discussions List-Help: List-Post: List-Subscribe: , List-Archive: List-Unsubscribe: , Errors-To: guile-user-bounces+guile-user=m.gmane.org@gnu.org Xref: main.gmane.org gmane.lisp.guile.user:1886 X-Report-Spam: http://spam.gmane.org/gmane.lisp.guile.user:1886 > Uh, oh. What I'd like to see would be a `standard' regexp > interface for Scheme (and may be a `standard' regexp > implementation for Guile). > This would boost all those portable programs which nowadays > try to parse strange things the hard way. I realize it's a bit cliche but: the nice thing about regexp standards is that there are so many to choose from. Just to throw out some observations: Posix regexps are pretty well specified, big subsets of them have the nice mathematical properties we expect of regular expressions, and they are _fairly_ friendly to DFA based matchers. There are a bunch of decent test suites for them and, as of last year, one huge test suite that combines all of those and adds some new tests. Alas, there are a lot of buggy implmentations of them, and one apparently irreconcilable disagreement about interpreting the spec (is regexp concatenation left associative or right associative?). I've always thought of Perl regexps as a kind of moving target and the first-found semantics are not so mathematically nice (so are they really "Schemey"? and does that matter). They are easier to implement but first-found doesn't mix too well with DFA techniques. There are lots of implementations of variations on Perl regexps. Lots of programmers are familiar with them. Some of their non-regular extensions (e.g., look-ahead) are apparently pretty convenient. The Unicode consortium publishes a tech report that almost specifies a really minimalist, very clean regular expression language (particularly addressing how character sets should work given the size and complexity of Unicode). The regular expression language in the XML Schema standard is based on this. It's a bit low on expressive power compared to Posix and Perl -- but the limitations make it a DFA-friendly regexp language (you can always see if a given string matches with a single pass over the string). GNU Emacs has its own little language: roughly a Posix-like syntax and feature set, with a Perl-like semantics. It's chief virtue is a simple and long-lived implementation with syntax options and the option to do matching with strict Posix semantics. It's chief drawback is speed -- it's frustratingly too slow for some applications that would otherwise be natural in an extensible editor. -------- I think using Perl or Emacs expressions as the primary type of regexp in Guile would be a mistake. Since they're mathematically icky, there's less you can do to process them automatically. Since they're mostly DFA-unfriendly, they don't scale too well to performance-intensive applications of non-trivial expressions. If someone needs Perl expressions (because, say, their application has to read in expressions from some data file that Perl apps might also read), can't they just add a trivial Guile binding for them? Posix regexps would have some advantages. They can scale _pretty_ well (if you avoid submatch reporting, backrefs, and context operators). They're reasonably clean. You can do tricks like calling `grep' or `sed' as a subprocess with a Posix expression, then use that same expression internally. But also disadvantages: the backreference operator, context operators, and sub-match position reporting features are hard to implement correctly and are DFA-unfriendly (meaning that they can force backtracking to take place either in time (lots of string passes) or space (lots of ancillary data during a sing string pass)). As a historic note: my impression is that context, sub-match reporting and backrefs happened to be easy to add to an early NFA-based implementation. I doubt that the early forms had the strict recursively-leftmost-longest semantics of the current standard. In other words, I think it's just an unfortunate accident that Posix regexps have these features, and strongly suspect that when those features were first standardized, there were exactly 0 correct implementations. The standard authors, I'm guessing, simply underestimated the difficulty and cost of getting them right as specified. If one had the freedom to start fresh and design regexp tools today, I don't think the result would look much like Posix regexps. ----- So what does that leave? Something really minimalist like the XML Schema regular expression language -- a language that's less expressive, but mathematically clean and very DFA-friendly. In a non-lisp language, that minimalism would be a problem. In something like Perl, it seems to me, the regexp language grows essentially to embed control structures that would be hard or at least tedious to express directly in regular Perl code. It's as if there's an embedded language in Perl just for writing pattern-matching "loops" that use more traditional regular expressions as primitive statements. But in lisp languages: well, that's part of what macros and higher-order functions are for -- building little languages. So my vague idea is: pick a truly minimalist, DFA-friendly pattern language. Leave out anchors ("^" and "$"), context operators (e.g. "match the empty string if followed by whitespace"), sub-match position reporting, and backreferences. Provide primitives like "return the length of the shortest/longest match starting here", "return #t if any match starts here". Add features like state-labeling (let users attach data to NFA states) and interfaces like "run the DFA for the next 10 characters of this string" and "return a list of all the NFA state labels of the current state". Next, invent little languages on top of that to take over the role of backrefs, sub-match reporting, context operators and so forth. Things like sub-match reporting should probably _not_ have the same semantics as in Posix -- but should be designed so that scans can avoid backtracking. Finally, design a surface syntax that can compete with the Posix or Perl syntax, but that "compiles" fairly trivially into Scheme expressions over the minimalist expression language. In other words, even though the minimalist language would lack anchors or submatches, a user typing a pattern into a dialog box for a regexp-search or regexp-query-replace command should have _some_ access to anchoring and submatches without having to resort to typeing s-exps. Maybe that suggestion, to choose a minimalist, truly regular regular expression language -- then do the rest in scheme -- satisfies the spirit of "do as little as possible in C". Another design dimension to consider: what are Guile's plans re: Unicode? -t _______________________________________________ Guile-user mailing list Guile-user@gnu.org http://mail.gnu.org/mailman/listinfo/guile-user