Re: rx.el sexp regexp syntax (WAS: Off Topic)


From: Alan Mackenzie <acm@muc.de>
To: Pierre Neidhardt <pe.neidhardt@googlemail.com>
Cc: van@scratch.space, eliz@gnu.org, emacs-devel@gnu.org,
	rms@gnu.org, Noam Postavsky <npostavs@gmail.com>
Subject: Re: rx.el sexp regexp syntax (WAS: Off Topic)
Date: Fri, 25 May 2018 18:17:10 +0000
Message-ID: <20180525181710.GC4096@ACM>
In-Reply-To: <87lgc7hebk.fsf@gmail.com>

Hello again, Pierre.

On Fri, May 25, 2018 at 18:47:59 +0200, Pierre Neidhardt wrote:

> Alan Mackenzie <acm@muc.de> writes:

> >> rx.el is one of the best concepts I've discovered in a long time.
> >> It's another instance of "Don't come up with a new (mini)language when
> >> Lisp can do better": it's easier to learn, more flexible, easier to
> >> write, much easier to read and as a consequence much more maintainable.

> > Much easier than what?  Than the putative mini-language that doesn't get
> > written?

> I meant that in my opinion rx is easier to write than regexps.  That it
> is not popular is the root of the question here.

I think it will be easier only for beginners.

> >> I think it's high time we moved away from traditional regexps and
> >> embraced the concept of rx.el.  I'm thinking of implementing it for
> >> Guile.

> > There's nothing stopping anybody from using rx.el.  However, people have
> > mostly _not_ used it.  The "I think it's high time ...." suggests in
> > some way forcing people to use it.  Before mandating something like
> > this, I think we should find out why it's not already in common use.

> Sorry if you felt I was forcing, that wasn't my intention.  I was
> referring to the long period regexps have been around.

> I thought the reason it's not already in common use had already been
> discussed: it's barely referenced anywhere, it needs more advertising.

> Correct me if this is wrong.

It may be part of the explanation.  But more salient, I think, is that
hackers prefer powerful means of expression.  A single character in a
string regexp has the power of a sexp in the corresponding rx regexp.
Paul Graham (at http://www.paulgraham.com) has had quite a bit to say
about this in the (distant) past.  Conciseness of expression is where
it's at.
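
To make the comparison concrete, here is a minimal sketch of one pattern
in both notations.  The pattern and the names my-string-re and my-rx-re
are hypothetical, chosen only for illustration; neither appears in rx.el
or elsewhere in this thread.

```elisp
(require 'rx)

;; Hypothetical pattern: a lowercase word, optional spaces, then "=".
;; String form: backslashes are doubled because this is a Lisp string.
(defconst my-string-re "\\<[a-z]+ *="
  "Illustrative pattern written as a string regexp.")

;; rx form: each operator that was one or two characters above
;; becomes a symbol or a sexp here.
(defconst my-rx-re
  (rx word-start
      (one-or-more (any "a-z"))
      (zero-or-more " ")
      "=")
  "The same illustrative pattern written with `rx'.")

;; Both forms are ordinary regexp strings and are used the same way.
(list (string-match-p my-string-re "foo = 1")
      (string-match-p my-rx-re "foo = 1"))
```

The string form fits on one short line; the rx form spells out each
operator by name.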

> >> At the moment the rx.el implementation is built on top of Emacs regexps
> >> which are implemented in C.  I believe this does not use the power of
> >> Lisp as much as it could.

> > But would any alternative use the power of regexps?

> Yes, rx.el is a drop-in replacement of regexps.  What do you mean?

I'm not sure, any more.  Sorry.

> > Emacs has a (moderately large) cache of regexps, so that building the
> > automatons is done very rarely.  Possibly just once each for each
> > session of Emacs.

> That's the whole point: if possible (see below), remove the requirements
> for regexp cache management.

I don't think that would be wise.  Manipulating the cache is far faster
than generating the automatons at each use.

[ .... ]

> >> The rx.el library/concept could alleviate this issue altogether: because
> >> we express the automaton directly in Lisp, the parsing step is not
> >> needed and thus the building cost could be tremendously reduced.

> >> So the rx.el building steps

> >>   rx expression -> regexp string -> C regexp automaton

> >> could boil down to simply

> >>   rx automaton

> > I don't see what you're trying to save, here.  At some stage, the regexp
> > source, in whatever form, needs to be converted to an automaton.

> Yes, that's what I meant with "rx automaton".  My suggestion (not
> necessarily for Emacs Lisp) is to remove the step that converts the rx
> symbolic automaton to a string, and the conversion from a string to the
> actual automaton.

OK.  That would save only a little, at automaton building time, which
likely would happen just once in any Emacs session.
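
As a rough sketch of the two steps under discussion, assuming nothing
beyond the documented rx-to-string and string-match functions (the name
my-line-re is hypothetical):

```elisp
(require 'rx)

;; Step 1: rx expression -> regexp string.
;; `rx-to-string' performs this translation at run time; the second
;; argument t asks it not to wrap the result in a shy group.
(defconst my-line-re
  (rx-to-string '(seq line-start "foo" (one-or-more digit)) t)
  "Illustrative regexp string produced from an rx form.")

;; The result of step 1 is an ordinary Lisp string.
(stringp my-line-re)

;; Step 2: regexp string -> compiled pattern.
;; This happens inside the C regexp engine when the string is first
;; searched for; Lisp code never sees the compiled form.
(let ((text "foo42"))
  (when (string-match my-line-re text)
    (match-string 0 text)))
```

The proposal quoted above would skip the intermediate string; the sketch
shows only what exists today, not how a direct translation would look.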

> > Are you suggesting here building an interpreter in Lisp directly to
> > execute rx expressions?

> Yes, but maybe in Guile or some other Lisp.  Don't know if it's feasible
> in Emacs Lisp.

> >> It would be interesting to compare the performance.  This also means
> >> that there would be no need for caching on behalf of the supporting
> >> language.

> > I will predict that an rx interpreter built in Lisp will be two orders
> > of magnitude slower than the current regexp machine, where both the
> > construction of an automaton, and the byte-code interpreter which runs
> > it are written in C (and probably quite optimised C at that).

> Obviously, and this is the prime reason why the author of rx.el
> implemented it on top of C regexp.  My point was that with a fast Lisp
> (or a specifically designed C support), a Lisp automaton would be just
> as fast: the Lisp code would directly map the equivalent C automaton.

> Again, I have no clue if that's doable in Emacs Lisp.

It might be.  But it might be a lot of work for little benefit.

> > I can't get excited about rx syntax, which I'm sure would be just as
> > tedious, and possibly more difficult to read than a standard regexp.

> Have you used rx?

No.  Neither have I used Cobol (much).

> The whole point of the library is to increase readability, and it does
> a great job at it in my opinion.

You seem to want to increase the readability for beginners, for people
who have to slog laboriously through an expression, trying to make sense
of each bit of it.  I don't think experienced regexp users have
difficulty with the syntax.  I don't, for one.

There was a time when people thought that

    ADD 1 TO A GIVING B

was more readable than

    b = a + 1;

, and generations of programmers suffered as a result.

> > Analogously, as a musician, I read standard musical notation (with
> > sets of five lines and dots) far more easily and fluently than I could
> > any "simplified" system designed for beginners, which would be bloated
> > by comparison.

> rx.el is meant to be "simplified for beginners".  You could also reverse
> the analogy in saying that regexps are the "simplified version for
> beginners"... The analogy does not map very well.

> A better analogy would be the mapping between assembly and the
> hexadecimal codes of CPU instructions: I don't think many people find
> hexadecimal codes more explicit than assembly verbs and symbols
> (although most assembly languages abuse abbreviations, but the
> intention is there).

Hexadecimal CPU codes aren't, and aren't intended to be, human-readable.
String regular expressions are.

> > Regular expressions can be difficult.  I don't believe this difficulty
> > lies, in the main, in the compact notation used to express them.  Rather
> > it lies in the concepts and the semantics of the regexp elements, and
> > being able to express a "mental automaton" in regexp semantics.

> The semantics of rx and regexps do not differ.  The difference is
> purely syntactic.

Yes.

> Let's consider some points:

> - rx can be written over multiple lines and indented.  This is a great
>   readability booster for groups, which can be _grouped_ together with
>   linebreaks and indentation.

rx MUST be written over several lines and indented.  A string regexp, by
contrast, usually fits onto a single line.

> - rx does not require escaping any character with backslashes.  This
>   is always a great source of confusion when switching from BRE to ERE,
>   between different interpreters and when storing regexp in Lisp strings
>   where backslashes must be escaped themselves for instance.

It is an inconvenience, yes, but I think you're exaggerating its
importance somewhat.  In rx, literal characters have to be "escaped" by
string quotes.  This might be an irritation.
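
A small sketch of the two kinds of "escaping", using a hypothetical
pattern that matches the literal text "a.b" (the names my-dot-string-re
and my-dot-rx-re are made up for this example):

```elisp
(require 'rx)

;; String form: the dot is escaped with a backslash, and the backslash
;; is doubled because it sits inside a Lisp string.
(defconst my-dot-string-re "a\\.b")

;; rx form: no backslashes, but the literal text is wrapped in string
;; quotes; rx quotes the special characters itself.
(defconst my-dot-rx-re (rx "a.b"))

;; Both match "a.b" at position 0 and reject "axb".
(list (string-match-p my-dot-string-re "a.b")
      (string-match-p my-dot-string-re "axb")
      (string-match-p my-dot-rx-re "a.b")
      (string-match-p my-dot-rx-re "axb"))
;; => (0 nil 0 nil)
```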

> - Symbols with non-trivial meanings in regexp (e.g. \<, :, ^, etc.) have
>   a trivial _English_ counterpart in rx: (respectively "word-start",
>   nothing, "line-start" _and_ "not").

The "English" counterpart used in rx is bulky and difficult to learn.
Somehow, you've got to learn that it's "word-start" and not
"word-beginning", that it's "not" and not "non", and so on.  This is more
difficult than just learning \< and ^.  If your native language isn't
English, it might be much more difficult.

> - No more special-case symbols like "-" for ranges or "^" (negation when
>   first character in square brackets).  Thus less cognitive burden.

That remains in dispute.

> - The "^" has a double-meaning in regexp: "line-start" and "not".

Yes, it is context dependent.  I don't think this causes confusion in
practice.
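
For reference, a minimal sketch of the two meanings of "^" side by
side, with a hypothetical pattern for "a line that does not start with a
digit" (the names my-caret-string-re and my-caret-rx-re are
illustrative only):

```elisp
(require 'rx)

;; String form: the first "^" anchors at the start of a line; the
;; second "^", being first inside the brackets, negates the set.
(defconst my-caret-string-re "^[^0-9]")

;; rx form: the two meanings get two different names.
(defconst my-caret-rx-re (rx line-start (not (any "0-9"))))

;; Both accept "x1" and reject "1x".
(list (string-match-p my-caret-string-re "x1")
      (string-match-p my-caret-string-re "1x")
      (string-match-p my-caret-rx-re "x1")
      (string-match-p my-caret-rx-re "1x"))
;; => (0 nil 0 nil)
```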

> The list goes on.

Well, so far, on this list, two or three people have said they "like"
rx.el.  Nobody has said "I'm going to be using rx.el in my programs from
now on".  I don't think they will.

We'll see.

> --
> Pierre Neidhardt

-- 
Alan Mackenzie (Nuremberg, Germany).
