From mboxrd@z Thu Jan 1 00:00:00 1970
From: Pierre Neidhardt
Newsgroups: gmane.emacs.devel
Subject: Re: rx.el sexp regexp syntax (WAS: Off Topic)
Date: Fri, 25 May 2018 18:47:59 +0200
Message-ID: <87lgc7hebk.fsf@gmail.com>
References: <87h8mw3yoc.fsf@gmail.com> <20180525155126.GA4096@ACM>
In-reply-to: <20180525155126.GA4096@ACM>
Mime-Version: 1.0
Content-Type: multipart/signed; boundary="=-=-=";
	micalg=pgp-sha256; protocol="application/pgp-signature"
User-Agent: mu4e 1.0; emacs 26.1
To: Alan Mackenzie
Cc: van@scratch.space, eliz@gnu.org, emacs-devel@gnu.org, rms@gnu.org,
	Noam Postavsky
Xref: news.gmane.org gmane.emacs.devel:225714

--=-=-=
Content-Type: text/plain

Alan Mackenzie writes:

>> rx.el is one of the best concepts I've discovered in a long time.
>> It's another instance of "Don't come up with a new (mini)language when
>> Lisp can do better": it's easier to learn, more flexible, easier to
>> write, much easier to read and as a consequence much more maintainable.
>
> Much easier than what?  Than the putative mini-language that doesn't get
> written?

I meant that in my opinion rx is easier to write than regexps.
That it is not popular is the root of the question here.

>> I think it's high time we moved away from traditional regexps and
>> embraced the concept of rx.el.  I'm thinking of implementing it for
>> Guile.
>
> There's nothing stopping anybody from using rx.el.  However, people have
> mostly _not_ used it.  The "I think it's high time ...." suggests in
> some way forcing people to use it.  Before mandating something like
> this, I think we should find out why it's not already in common use.

Sorry if you felt I was forcing, that wasn't my intention.  I was
referring to the long period regexps have been around.
I thought the reason it's not already in common use had already been
discussed: it's barely referenced anywhere, it needs more advertising.
Correct me if this is wrong.

>> At the moment the rx.el implementation is built on top of Emacs regexps
>> which are implemented in C.  I believe this does not use the power of
>> Lisp as much as it could.
>
> But would any alternative use the power of regexps?

Yes, rx.el is a drop-in replacement for regexps.
What do you mean?

> Emacs has a (moderately large) cache of regexps, so that building the
> automatons is done very rarely.  Possibly just once each for each
> session of Emacs.

That's the whole point: if possible (see below), remove the need for
regexp cache management.

>> In high-level languages, automatons are automatically cached to save the
>> cost of building them.
>
> Emacs Lisp does this too.

I did not exclude it :)

>> The rx.el library/concept could alleviate this issue altogether: because
>> we express the automaton directly in Lisp, the parsing step is not
>> needed and thus the building cost could be tremendously reduced.
>
>> So the rx.el building steps
>
>>   rx expression -> regexp string -> C regexp automaton
>
>> could boil down to simply
>
>>   rx automaton
>
> I don't see what you're trying to save, here.  At some stage, the regexp
> source, in whatever form, needs to be converted to an automaton.

Yes, that's what I meant by "rx automaton".
My suggestion (not necessarily for Emacs Lisp) is to remove two steps:
the one that converts the rx symbolic automaton to a string, and the one
that converts that string to the actual automaton.

> Are you suggesting here building an interpreter in Lisp directly to
> execute rx expressions?

Yes, but maybe in Guile or some other Lisp.
I don't know if it's feasible in Emacs Lisp.

>> It would be interesting to compare the performance.  This also means
>> that there would be no need for caching on behalf of the supporting
>> language.
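
To make the "no string step" idea concrete, here is a minimal sketch.
It is written in Python only for brevity; the same structure should
carry over to Guile or another Lisp.  Everything in it is an assumption
made for illustration: the names compile_form and full_match are
hypothetical, the tiny vocabulary only loosely follows rx.el names, and
the result is a backtracking matcher built from closures, not a real
automaton and not how rx.el works today.

```python
def compile_form(form):
    """Turn a nested symbolic form into a matcher, with no regexp string.

    A matcher takes (text, pos) and yields every position at which a
    match starting at pos can end, longest repetition first.
    """
    if isinstance(form, str):
        def literal(text, pos):
            if text.startswith(form, pos):
                yield pos + len(form)
        return literal

    head, *args = form
    subs = [compile_form(arg) for arg in args]

    if head == "digit":
        def digit(text, pos):
            if pos < len(text) and text[pos].isdigit():
                yield pos + 1
        return digit

    if head == "seq":
        def seq(text, pos, i=0):
            if i == len(subs):
                yield pos
            else:
                for mid in subs[i](text, pos):
                    yield from seq(text, mid, i + 1)
        return seq

    if head == "or":
        def alt(text, pos):
            for sub in subs:
                yield from sub(text, pos)
        return alt

    if head == "one-or-more":
        (inner,) = subs  # this sketch accepts exactly one sub-form
        def plus(text, pos):
            for mid in inner(text, pos):
                if mid > pos:  # guard against looping on empty matches
                    yield from plus(text, mid)
                yield mid
        return plus

    raise ValueError(f"unknown form: {head!r}")


def full_match(form, text):
    """Return True if the whole of text matches form."""
    return any(end == len(text) for end in compile_form(form)(text, 0))


number = ("seq", ("one-or-more", ("digit",)), ".", ("one-or-more", ("digit",)))
print(full_match(number, "26.1"))
print(full_match(number, "26."))
```

The symbolic form is consumed directly, so nothing is parsed from a
string and nothing needs escaping.  Whether such a matcher can compete
with the C engine is exactly the performance question raised above.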
> I will predict that an rx interpreter built in Lisp will be two orders
> of magnitude slower than the current regexp machine, where both the
> construction of an automaton, and the byte-code interpreter which runs
> it are written in C (and probably quite optimised C at that).

Obviously, and this is the prime reason why the author of rx.el
implemented it on top of C regexps.  My point was that with a fast Lisp
(or specifically designed C support), a Lisp automaton would be just
as fast: the Lisp code would map directly onto the equivalent C
automaton.  Again, I have no clue if that's doable in Emacs Lisp.

> I can't get excited about rx syntax, which I'm sure would be just as
> tedious, and possibly more difficult to read than a standard regexp.

Have you used rx?  The whole point of the library is to increase
readability, and it does a great job at it in my opinion.

> Analagously, as a musician, I read standard musical notation (with
> sets of five lines and dots) far more easily and fluently than I could
> any "simplified" system designed for beginners, which would be bloated
> by comparison.

rx.el is not meant to be "simplified for beginners".  You could also
reverse the analogy by saying that regexps are the "simplified version
for beginners"...  The analogy does not map very well.

A better analogy would be the mapping between assembly and the
hexadecimal codes of CPU instructions: I don't think many people find
hexadecimal codes more explicit than assembly verbs and symbols
(although most assembly languages abuse abbreviations, the intention is
there).

> Regular expressions can be difficult.  I don't believe this difficulty
> lies, in the main, in the compact notation used to express them.  Rather
> it lies in the concepts and the semantics of the regexp elements, and
> being able to express a "mental automaton" in regexp semantics.

The semantics of rx and regexps do not differ.  The difference is purely
syntactical.
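
To illustrate that the difference is one of notation only, here is a
minimal sketch that translates a nested symbolic form into an ordinary
regexp string.  It is written in Python for brevity and rests on several
assumptions: to_regexp and SYMBOLS are hypothetical names, the handful
of symbols only loosely follows rx.el, and the output uses Python's `re`
syntax rather than Emacs regexp syntax.

```python
import re

# Hypothetical mini-vocabulary; the names loosely follow rx.el,
# the translations are Python `re` syntax.
SYMBOLS = {"line-start": "^", "line-end": "$", "digit": r"\d"}


def to_regexp(form):
    """Translate a nested symbolic form into a regexp string."""
    if isinstance(form, str):
        return re.escape(form)  # literal text: escaping is done for us
    head, *args = form
    if head in SYMBOLS and not args:
        return SYMBOLS[head]
    parts = [to_regexp(arg) for arg in args]
    if head == "seq":
        return "".join(parts)
    if head == "or":
        return "(?:" + "|".join(parts) + ")"
    if head == "one-or-more":
        return "(?:" + "".join(parts) + ")+"
    if head == "group":
        return "(" + "".join(parts) + ")"
    raise ValueError(f"unknown form: {head!r}")


# The form can be split over lines and indented like any other code.
version = ("seq",
           ("line-start",),
           ("group", ("one-or-more", ("digit",))),
           ".",
           ("group", ("one-or-more", ("digit",))),
           ("line-end",))

pattern = to_regexp(version)
print(pattern)
print(re.match(pattern, "26.1").groups())
```

Both notations describe the same pattern; the symbolic one just spells
out "line-start" and "one-or-more" and leaves the escaping of "." to the
translator.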
Let's consider some points:

- rx can be written over multiple lines and indented.  This is a great
  readability booster for groups, which can be _grouped_ together with
  linebreaks and indentation.

- rx does not require escaping any character with backslashes.  This is
  always a great source of confusion when switching from BRE to ERE,
  between different interpreters, and when storing regexps in Lisp
  strings, where the backslashes must themselves be escaped.

- Symbols with non-trivial meanings in regexp (e.g. \<, :, ^, etc.) have
  a trivial _English_ counterpart in rx: (respectively "word-start",
  nothing, "line-start" _and_ "not").

- No more special-case symbols like "-" for ranges or "^" (negation when
  first character in square brackets).  Thus less cognitive burden.

- The "^" has a double meaning in regexp: "line-start" and "not".

The list goes on.

-- 
Pierre Neidhardt

--=-=-=
Content-Type: application/pgp-signature; name="signature.asc"

-----BEGIN PGP SIGNATURE-----

iQEzBAEBCAAdFiEEUPM+LlsMPZAEJKvom9z0l6S7zH8FAlsIPj8ACgkQm9z0l6S7
zH8pMAgAhaQcnEZS7q+wO+Kr6ZcepT7eyu1z2N0h9QPwDkRaN9o1IrJFHg33CmLF
Nai6N96+hc9ShOc17ASLPzPwedRgTX2P76XkdSUL4Zy10LxAviA305ICxvTW7LDV
DMrblf5ulwQrbuzXaQSoBJ2OokQnNX0wPTtY0iMs/tu5F3DzN09Kt7XtWIaPKCK4
uW8/UQzIzKL1d3YeTES6AYHd9jZ0MMRjgZkoZDj7BRHWaqmXAuO2ODQl5MKGNCcK
hsIsK+YeLozSFrPjwT5vRVm2cucsMsfWsAZvn2Bw1vG9fqZSNkMh7bj7DS/ie9A/
IwIZeF3bIwmqS7/kFyfwe8iRItxMkg==
=qxQB
-----END PGP SIGNATURE-----
--=-=-=--