Alan Mackenzie writes:

>> rx.el is one of the best concepts I've discovered in a long time.
>> It's another instance of "Don't come up with a new (mini)language when
>> Lisp can do better": it's easier to learn, more flexible, easier to
>> write, much easier to read and as a consequence much more maintainable.

> Much easier than what?  Than the putative mini-language that doesn't get
> written?

I meant that in my opinion rx is easier to write than regexps.  Why it
is nonetheless not popular is the root of the question here.

>> I think it's high time we moved away from traditional regexps and
>> embraced the concept of rx.el.  I'm thinking of implementing it for
>> Guile.

> There's nothing stopping anybody from using rx.el.  However, people have
> mostly _not_ used it.  The "I think it's high time ...." suggests in
> some way forcing people to use it.  Before mandating something like
> this, I think we should find out why it's not already in common use.

Sorry if you felt I was forcing anything; that wasn't my intention.  I
was referring to the long period regexps have been around.  I thought
the reason rx is not already in common use had already been discussed:
it's barely referenced anywhere, so it needs more advertising.  Correct
me if this is wrong.

>> At the moment the rx.el implementation is built on top of Emacs regexps
>> which are implemented in C.  I believe this does not use the power of
>> Lisp as much as it could.

> But would any alternative use the power of regexps?

Yes: rx.el is a drop-in replacement for regexps.  What do you mean?

> Emacs has a (moderately large) cache of regexps, so that building the
> automatons is done very rarely.  Possibly just once each for each
> session of Emacs.

That's the whole point: if possible (see below), remove the need for
regexp cache management.

>> In high-level languages, automatons are automatically cached to save the
>> cost of building them.

> Emacs Lisp does this too.
I did not exclude it :)

>> The rx.el library/concept could alleviate this issue altogether:
>> because we express the automaton directly in Lisp, the parsing step is
>> not needed and thus the building cost could be tremendously reduced.
>>
>> So the rx.el building steps
>>
>>   rx expression -> regexp string -> C regexp automaton
>>
>> could boil down to simply
>>
>>   rx automaton

> I don't see what you're trying to save, here.  At some stage, the regexp
> source, in whatever form, needs to be converted to an automaton.

Yes, that's what I meant by "rx automaton".  My suggestion (not
necessarily for Emacs Lisp) is to remove the step that converts the rx
symbolic automaton to a string, along with the conversion from that
string to the actual automaton.

> Are you suggesting here building an interpreter in Lisp directly to
> execute rx expressions?

Yes, but maybe in Guile or some other Lisp.  I don't know whether it's
feasible in Emacs Lisp.

>> It would be interesting to compare the performance.  This also means
>> that there would be no need for caching on behalf of the supporting
>> language.

> I will predict that an rx interpreter built in Lisp will be two orders
> of magnitude slower than the current regexp machine, where both the
> construction of an automaton, and the byte-code interpreter which runs
> it are written in C (and probably quite optimised C at that).

Obviously, and this is the prime reason why the author of rx.el
implemented it on top of the C regexp engine.  My point was that with a
fast Lisp (or specifically designed C support), a Lisp automaton could
be just as fast: the Lisp code would map directly onto the equivalent C
automaton.  Again, I have no clue whether that's doable in Emacs Lisp.

> I can't get excited about rx syntax, which I'm sure would be just as
> tedious, and possibly more difficult to read than a standard regexp.

Have you used rx?  The whole point of the library is to increase
readability, and it does a great job at it in my opinion.
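To make this concrete, here is a small sketch of the same pattern
written both ways (the names `my-include-regexp` and `my-include-rx` are
mine, purely for illustration; the exact string rx generates may differ
slightly between Emacs versions):

```elisp
(require 'rx)  ; standard library, ships with Emacs

;; A raw regexp for a C #include line.  Inside a Lisp string every
;; backslash must be doubled, which compounds the escaping problem.
(defconst my-include-regexp
  "^#include[ \t]+[<\"]\\([^>\"]+\\)[>\"]")

;; The same pattern as an rx form.  It can be split over several lines,
;; indented and commented.  The macro expands at compile time to an
;; equivalent regexp string, which is why it is a drop-in replacement.
(defconst my-include-rx
  (rx line-start
      "#include"
      (one-or-more (any " \t"))
      (any "<\"")
      (group (one-or-more (not (any ">\""))))
      (any ">\"")))

;; Both values are plain strings, usable with the usual primitives:
(string-match my-include-rx "#include <stdio.h>")  ; => 0
(match-string 1 "#include <stdio.h>")              ; => "stdio.h"
```

Running `macroexpand` on the (rx ...) form prints the generated string,
which is a handy way to learn the correspondence in both directions.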
> Analagously, as a musician, I read standard musical notation (with
> sets of five lines and dots) far more easily and fluently than I could
> any "simplified" system designed for beginners, which would be bloated
> by comparison.

Is rx.el really the "simplified system for beginners" here?  You could
just as well reverse the analogy and say that regexps are the
"simplified version for beginners"...  The analogy does not map very
well.  A better one would be the mapping between assembly and the
hexadecimal codes of CPU instructions: I don't think many people find
hexadecimal codes more explicit than assembly mnemonics and symbols
(although most assembly languages abuse abbreviations, the intention is
there).

> Regular expressions can be difficult.  I don't believe this difficulty
> lies, in the main, in the compact notation used to express them.  Rather
> it lies in the concepts and the semantics of the regexp elements, and
> being able to express a "mental automaton" in regexp semantics.

The semantics of rx and regexps do not differ; the difference is purely
syntactic.  Let's consider some points:

- rx can be written over multiple lines and indented.  This is a great
  readability booster for groups, which can be _grouped_ together with
  line breaks and indentation.

- rx does not require escaping characters with backslashes.  Escaping is
  a perennial source of confusion when switching between BRE and ERE,
  between different interpreters, and when storing regexps in Lisp
  strings, where the backslashes must themselves be escaped.

- Symbols with non-trivial meanings in regexps (e.g. \<, :, ^, etc.)
  have plain _English_ counterparts in rx (respectively "word-start",
  nothing, and "line-start" _and_ "not").

- There are no special-case symbols like "-" for ranges, or "^" for
  negation when it is the first character inside square brackets.  Thus
  less cognitive burden.

- The "^" has a double meaning in regexps: "line-start" and "not".

The list goes on.

--
Pierre Neidhardt