How to grok a complicated regex?

* How to grok a complicated regex?
@ 2015-03-13 21:35 Marcin Borkowski
  2015-03-13 21:45 ` Marcin Borkowski
                   ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Marcin Borkowski @ 2015-03-13 21:35 UTC
  To: Help Gnu Emacs mailing list

Hi all,

so I have this monstrosity [note: I know, there are much worse ones,
too!]:

"\\`\\(?:\\\\[([]\\|\\$+\\)?\\(.*?\\)\\(?:\\\\[])]\\|\\$+\\)?\\'"

(it's in the org-latex--script-size function in ox-latex.el, if you're
curious).

I'm not asking “what does this match” – I can read it myself.  But it
comes with a considerable effort.  Are you aware of any tools that might
help to understand such regexen?

I know about re-builder, but it’s well suited for constructing a regex
matching a given string, not the other way round.

For instance, show-paren-mode does not really help here, since it seems
to pair “\\(” with unescaped “)”.

Any ideas?

(Note: if there are no such tools, I might be tempted to craft one.  Two
things that come to my mind are proper highlighting of matching parens
of various kinds and eldoc-like hints for all the regex constructs –
I never seem to remember what “\\`” does, for instance.  Also,
displaying the string with single backslashes and not in the way it is
actually typed in in Elisp, with all the backslash escaping, might be
helpful.  Would there be a demand for such a tool larger than one
person?)

Best,

-- 
Marcin Borkowski
http://octd.wmi.amu.edu.pl/en/Marcin_Borkowski
Faculty of Mathematics and Computer Science
Adam Mickiewicz University

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: How to grok a complicated regex?
  2015-03-13 21:35 Marcin Borkowski
@ 2015-03-13 21:45 ` Marcin Borkowski
  2015-03-13 21:47 ` Alexis
  2015-03-23 12:18 ` Vaidheeswaran C
  2 siblings, 0 replies; 24+ messages in thread
From: Marcin Borkowski @ 2015-03-13 21:45 UTC
  To: Help Gnu Emacs mailing list


On 2015-03-13, at 22:35, Marcin Borkowski <mbork@wmi.amu.edu.pl> wrote:

> Hi all,
>
> so I have this monstrosity [note: I know, there are much worse ones,
> too!]:
>
> "\\`\\(?:\\\\[([]\\|\\$+\\)?\\(.*?\\)\\(?:\\\\[])]\\|\\$+\\)?\\'"
>
> (it's in the org-latex--script-size function in ox-latex.el, if you're
> curious).
>
> I'm not asking “what does this match” – I can read it myself.  But it
> comes with a considerable effort.  Are you aware of any tools that might
> help to understand such regexen?

BTW, it turned out to be fairly simple after all, but I could see this
only after passing it through (insert ...) in a temporary buffer, so
that all the double backslashes stopped looking like a drunkard's
nightmare.  So even such a rudimentary "tool" (basically, temp buffer
and `insert') did help a lot.
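The temp-buffer trick described above can be sketched roughly as follows.
This is only an illustration: the names `my/script-size-re' and
`my/show-regexp' and the buffer name "*regexp*" are made up for this
example and are not part of Org or Emacs.

```elisp
;; Hypothetical names; the regexp itself is the one quoted above.
(defconst my/script-size-re
  "\\`\\(?:\\\\[([]\\|\\$+\\)?\\(.*?\\)\\(?:\\\\[])]\\|\\$+\\)?\\'"
  "The regexp from the start of this thread, as a Lisp string literal.")

(defun my/show-regexp (regexp)
  "Insert REGEXP into the buffer *regexp* and return that buffer.
The Lisp reader has already halved the backslashes of the string
literal, so the buffer shows the regexp as the regexp engine sees it."
  (let ((buf (get-buffer-create "*regexp*")))
    (with-current-buffer buf
      (erase-buffer)
      (insert regexp))
    buf))

;; (pop-to-buffer (my/show-regexp my/script-size-re)) shows:
;;   \`\(?:\\[([]\|\$+\)?\(.*?\)\(?:\\[])]\|\$+\)?\'
```

Nothing is transformed here; `insert' just makes visible what the string
already contains, which is usually enough to make the structure readable.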

Best,

-- 
Marcin Borkowski
http://octd.wmi.amu.edu.pl/en/Marcin_Borkowski
Faculty of Mathematics and Computer Science
Adam Mickiewicz University




* Re: How to grok a complicated regex?
  2015-03-13 21:35 Marcin Borkowski
  2015-03-13 21:45 ` Marcin Borkowski
@ 2015-03-13 21:47 ` Alexis
  2015-03-13 21:57   ` Marcin Borkowski
  2015-03-23 12:18 ` Vaidheeswaran C
  2 siblings, 1 reply; 24+ messages in thread
From: Alexis @ 2015-03-13 21:47 UTC
  To: help-gnu-emacs

On 2015-03-14T08:35:36+1100, Marcin Borkowski 
<mbork@wmi.amu.edu.pl> said:

 MB> I'm not asking “what does this match” – I can read it myself.
 MB> But it comes with a considerable effort.  Are you aware of any
 MB> tools that might help to understand such regexen?

`rxt-explain-elisp`?

Alexis.


* Re: How to grok a complicated regex?
  2015-03-13 21:47 ` Alexis
@ 2015-03-13 21:57   ` Marcin Borkowski
  0 siblings, 0 replies; 24+ messages in thread
From: Marcin Borkowski @ 2015-03-13 21:57 UTC
  To: help-gnu-emacs


On 2015-03-13, at 22:47, Alexis <flexibeast@gmail.com> wrote:

> On 2015-03-14T08:35:36+1100, Marcin Borkowski 
> <mbork@wmi.amu.edu.pl> said:
>
>  MB> I'm not asking “what does this match” – I can read it myself.
>  MB> But it comes with a considerable effort.  Are you aware of any
>  MB> tools that might help to understand such regexen?
>
> `rxt-explain-elisp`?

Interesting, I didn't know about this one.  Thanks a lot, I'll take
a look!

> Alexis.

Best,

-- 
Marcin Borkowski
http://octd.wmi.amu.edu.pl/en/Marcin_Borkowski
Faculty of Mathematics and Computer Science
Adam Mickiewicz University




* Re: How to grok a complicated regex?
       [not found] <mailman.1979.1426282552.31049.help-gnu-emacs@gnu.org>
@ 2015-03-13 22:46 ` Emanuel Berg
  2015-03-13 23:16   ` Marcin Borkowski
       [not found]   ` <mailman.1984.1426288628.31049.help-gnu-emacs@gnu.org>
  2015-03-18 16:40 ` Alan Mackenzie
  2015-04-25  4:23 ` Rusi
  2 siblings, 2 replies; 24+ messages in thread
From: Emanuel Berg @ 2015-03-13 22:46 UTC
  To: help-gnu-emacs

Marcin Borkowski <mbork@wmi.amu.edu.pl> writes:

> so I have this monstrosity [note: I know, there are
> much worse ones, too!]:
>
> "\\`\\(?:\\\\[([]\\|\\$+\\)?\\(.*?\\)\\(?:\\\\[])]\\|\\$+\\)?\\'"
>
> (it's in the org-latex--script-size function in
> ox-latex.el, if you're curious).
>
> I'm not asking “what does this match” – I can read
> it myself. But it comes with a considerable effort.

I dare say most people (even programmers) cannot read
that so if you can that's great. As a math
professional you are of course aware of the discipline
called automata theory that deals with such things.
Perhaps relational algebra might help too, if the data
in the sets are strings. But automata theory should
help even more.

Also, remember you don't have to understand those
expressions. Often they are set up incrementally. They
only need to be correct. The computer understands them
- the programmer only understands the purpose, and the
latest edition. Kind of risky, perhaps not what a math
person would find appealing, but I've constructed many
that way so I know that method works.

> Are you aware of any tools that might help to
> understand such regexen?

I have seen tools with which you can construct such
expressions and they output figures, states,
transitions, and so on. I wonder how advanced an
expression they can deal with? But if you get the
basics right, it should be just basic building blocks
that stick together and from there on the sky is the
limit.

Instead the problem is, as I see it: will those
figures, balls and arrows, tagged with preconditions,
postconditions, everything you can think of, will that
actually be *clearer*?

If I were to do it (which I am not, thank God) my
answer would be *no*. The only way I could do it would
instead be the opposite. Train the brain with such
expressions - exactly as they are - day in, day out,
until they are second nature.

Example: a C++ OO project with classes and everything.
Silly inheritance and interfaces. Some people would
consider those pretty darn difficult to understand.
But to the seasoned C++ programmer (no exaggerating
here, a few years of focused training is enough) those
programs are clear. For those guys, giving up writing
C++ code and instead using some other representation
(be it graphical or not) would be to in one stroke
cripple their skills.

So no, I think that representation is the best there
is. Translating it back and forth would be very
difficult, and even if it is possible (which of course
it is, because a representation is just one of I don't
know how many possible ones), I don't see the end
result being any clearer: on the contrary, most
likely.

What I would do - try to get it more readable by using
classes, string classes (do they exist?), and even
more advanced constructs if necessary - as in this
simple example:

    (defconst stop-char-default "\\([[:punct:]]\\|[[:space:]][[:alnum:]]\\)")

How do you define those? Can you identify any which
aren't there, but could/should be?

Example: say there is a class called "delimiters"
which contains [, (, {, <, >, }, ), and ]. Can you
split that up, in "opening-delimiters" and closing
ditto?
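
A minimal sketch of such a split, assuming nothing beyond plain bracket
expressions. The constant names are hypothetical and do not exist in
Emacs:

```elisp
;; Hypothetical names, sketching the "named classes" idea.
(defconst my/opening-delimiters "[[({<]"
  "Bracket expression matching one opening delimiter.")

(defconst my/closing-delimiters "[])}>]"
  "Bracket expression matching one closing delimiter.")

(defconst my/delimiters
  (concat "\\(?:" my/opening-delimiters "\\|" my/closing-delimiters "\\)")
  "Regexp matching any single delimiter, built from the two classes.")

;; `string-match' returns the index of the first match:
(string-match my/opening-delimiters "a(b)")   ; => 1
(string-match my/closing-delimiters "a(b)")   ; => 3
```

Once the pieces have names, a larger regexp can be assembled with
`concat' and reads more like prose than like line noise.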

Second, exactly what you mentioned - the font lock issue -
work on that.

You do know, of course, of

    font-lock-regexp-grouping-construct
    font-lock-regexp-grouping-backslash

Are there more of those, that you can identify, and
add?

-- 
underground experts united


* Re: How to grok a complicated regex?
  2015-03-13 22:46 ` How to grok a complicated regex? Emanuel Berg
@ 2015-03-13 23:16   ` Marcin Borkowski
  2015-03-14  0:12     ` Rasmus
                       ` (2 more replies)
       [not found]   ` <mailman.1984.1426288628.31049.help-gnu-emacs@gnu.org>
  1 sibling, 3 replies; 24+ messages in thread
From: Marcin Borkowski @ 2015-03-13 23:16 UTC
  To: help-gnu-emacs

On 2015-03-13, at 23:46, Emanuel Berg <embe8573@student.uu.se> wrote:

> Marcin Borkowski <mbork@wmi.amu.edu.pl> writes:
>
>> so I have this monstrosity [note: I know, there are
>> much worse ones, too!]:
>>
>> "\\`\\(?:\\\\[([]\\|\\$+\\)?\\(.*?\\)\\(?:\\\\[])]\\|\\$+\\)?\\'"
>>
>> (it's in the org-latex--script-size function in
>> ox-latex.el, if you're curious).
>>
>> I'm not asking “what does this match” – I can read
>> it myself. But it comes with a considerable effort.
>
> I dare say most people (even programmers) cannot read
> that so if you can that's great. As a math

Really?  It's not /that/ difficult.  You only need enough coffee (or
tea, in my case), time and motivation.  You don’t need a genius, or even
IQ higher than, say, 90 or so.  It's not really /difficult/.
Intimidating, yes.  Boring, possibly.  Laborious (and mechanical), yes.
But not /difficult/.

> professional you are of course aware of the discipline
> called automata theory that deals with such things.

Well, as an analyst working in metric fixed point theory, that's just
it.  I'm /aware/ of automata theory – (almost) nothing more. ;-)

> Perhaps relational algebra might help to, if the data
> in the sets are strings. But automata theory should be
> it even more.
>
> Also, remember you don't have to understand those
> expressions. Often they are setup incrementally. They
> only need to be correct. The computer understands them
> - the programmer only understands the purpose, and the
> latest edition. Kind of risky, perhaps not what I math
> person would be appealed by, but I've constructed many
> that way so I know that method works.

That reminds me of the von Neumann quote: “In mathematics, you don’t
/understand/ things – you just /get used/ to them.”

>> Are you aware of any tools that might help to
>> understand such regexen?
>
> I have seen tools with which you can construct such
> expressions and they output figures, states,
> transitions, and so on. I wonder how advanced
> expression they can deal with? But if you get the
> basics right, it should be just basic building blocks
> that stick together and from there on the sky is the
> limit.
>
> Instead the problem is, as I see it: will those
> figures, balls and arrows, tagged with preconditions,
> postconditions, everything you can think of, will that
> actually be *clearer*?

As we both point out, I’m not talking about changing the representation,
but about making the existing one (which I agree is not /that/ bad) more
comprehensible.  Font lock, grouping and unescaping backslashes would be
definitely helpful.

OTOH, I can imagine that some kind of diagrams might be helpful for
someone.  The point is, in the end you have to read/write these regexen
in their normal form anyway, so why not train yourself to understand
their “default” representation instead of adding the burden of
translating between representations?

> If I were to do it (which I am not thanks god) my
> answer would be *no*. The only way I could do it would
> instead be the opposite. Train the brain with such
> expressions - exactly as they are - day in, day out,
> until they are second nature.
>
> Example: a C++ OO project with classes and everything.
> Silly inheritance and interfaces. Some people would
> consider those pretty darn difficult to understand.
> But to the seasoned C++ programmer (no exaggerating
> here, a few years of focused training is enough) those
> programs are clear. For those guys, giving up writing
> C++ code and instead using some other representation
> (be it graphical or not) would be to in one stroke
> cripple their skills.
>
> So no, I think that representation is the best there
> is. To translate it back and forth would not only be

I’m not sure whether it’s the best – but it’s a standard (more or less,
Emacs’ regexen are not really “standard” by today’s, well, standards –
but hardly anything about Emacs is “standard” or “typical”, so who
cares;-)).

> very difficult to do - and even if possible, which of

I disagree.  I don’t think that such a translator would be a difficult
one to write.

If only I was a student again, with plenty of spare time, I might have
taken the challenge and tried to write one in TeX, so that some TeX
macro, given an (Emacs) regex would produce a nicely typeset diagram.

Wow, what a nice project for a bachelor’s thesis.  Wait a minute.
Ohboyohboyohboy.  I have to put this in my faculty’s database of
potential topics.  Poor students... ;-)

(BTW, I did once write a poor man’s parser in pure TeX; since there were
no regex engine written in TeX back then (now there is one!), I had to
craft a simple automaton myself.  Not an extremely pleasant work...)

> course it is, because a representation is just a
> representation of I don't know how many possible - I
> don't see the end result being any more clear: on the
> contrary, most likely.
>
> What I would do - try to get it more readable by using
> classes, string classes (do they exist?), and even
> more advanced constructs if necessary - as in this
> simple example:
>
>     (defconst stop-char-default "\\([[:punct:]]\\|[[:space:]][[:alnum:]]\\)")
>
> How do you define those? Can you identify any which
> aren't there, but could/should be?
>
> Example: say there is a class called "delimiters"
> which contain [, (, {, <, >, }, ), and ]. Can you
> split that up, in "opening-delimiters" and closing
> ditto?
>
> Second, exactly you mentioned - the font lock issue -
> work on that.
>
> You do know, of course, of
>
>     font-lock-regexp-grouping-construct
>     font-lock-regexp-grouping-backslash
>
> Are there more of those, that you can identify, and
> add?

There could be quite a few.  (As Alexis pointed out, a tool I was
writing about seems to exist – if it’s not satisfactory, I could think
about extending it somehow.  Not very probable, though – I’m too busy
now.  If only someone could be paying me for goofing around and playing
with Emacs hacks...)

Thanks for your input, and best regards!

-- 
Marcin Borkowski
http://octd.wmi.amu.edu.pl/en/Marcin_Borkowski
Faculty of Mathematics and Computer Science
Adam Mickiewicz University


* Re: How to grok a complicated regex?
  2015-03-13 23:16   ` Marcin Borkowski
@ 2015-03-14  0:12     ` Rasmus
  2015-03-14 13:18       ` Stefan Monnier
                         ` (2 more replies)
  2015-03-14  5:14     ` Yuri Khan
  2015-03-14  7:03     ` Drew Adams
  2 siblings, 3 replies; 24+ messages in thread
From: Rasmus @ 2015-03-14  0:12 UTC
  To: help-gnu-emacs

Marcin Borkowski <mbork@wmi.amu.edu.pl> writes:

> On 2015-03-13, at 23:46, Emanuel Berg <embe8573@student.uu.se> wrote:
>
>> Marcin Borkowski <mbork@wmi.amu.edu.pl> writes:
>>
>>> so I have this monstrosity [note: I know, there are
>>> much worse ones, too!]:
>>>
>>> "\\`\\(?:\\\\[([]\\|\\$+\\)?\\(.*?\\)\\(?:\\\\[])]\\|\\$+\\)?\\'"
>>>
>>> (it's in the org-latex--script-size function in
>>> ox-latex.el, if you're curious).
>>>
>>> I'm not asking “what does this match” – I can read
>>> it myself. But it comes with a considerable effort.
>>
>> I dare say most people (even programmers) cannot read
>> that so if you can that's great.
>
> Really?  It's not /that/ difficult.  You only need enough coffee (or
> tea, in my case), time and motivation.
> You don’t need a genius, or even IQ higher than, say, 90 or so.

Damn.  At least I know why I don't understand it now...

To grok REs I sometimes prefer visual-regexp¹ over re-builder.  Though
re-builder has the advantage that it understands \\ out of the box.
You may also find highlight-regexp useful, since it would color the
different parenthesis matches.

Here's another project (for your students): adding lookaround to Emacs
regexp /and/ have it merged.  It would be *insanely(!)* useful at times.

—Rasmus

Footnotes: 
¹   https://github.com/benma/visual-regexp.el

-- 
A clever person solves a problem. A wise person avoids it





* Re: How to grok a complicated regex?
       [not found]   ` <mailman.1984.1426288628.31049.help-gnu-emacs@gnu.org>
@ 2015-03-14  3:58     ` Emanuel Berg
  2015-03-14  4:44       ` Emanuel Berg
  0 siblings, 1 reply; 24+ messages in thread
From: Emanuel Berg @ 2015-03-14  3:58 UTC
  To: help-gnu-emacs

Marcin Borkowski <mbork@wmi.amu.edu.pl> writes:

> Really? It's not /that/ difficult. You only need
> enough coffee (or tea, in my case), time and
> motivation. You don’t need a genius, or even IQ
> higher than, say, 90 or so. It's not really
> /difficult/. Intimidating, yes. Boring, possibly.
> Laborious (and mechanical), yes. But not
> /difficult/.

I mean to be able to read it like you read the code of
a programming language. What that takes is training
like everything else. Instead of deconstructing and
reconstructing complicated expressions like your
example I would recommend starting small - the most
basic building blocks over and over, then make them
gradually more complicated by combinations, then
combinations of combinations, ... It is the way a
machine would process it (only the other way around),
and it is the way a foreign natural language is
acquired (almost always). "IQ" is a joke and has
nothing to do with it unless IQ is defined by the
ability to understand regular expressions, which by the
way I think isn't far away from how they test "IQ"
(which says a lot).

> I disagree. I don’t think that such a translator
> would be a difficult one to write.

The compiler itself is perhaps not extremely difficult
tho certainly not trivial. But that's only the first
step. Then comes presenting it graphically, and making
an editor. To get that to actually work, polished, and
work better than just mastering and typing that form
of code - I'm not convinced.

> Wow, what a nice project for a bachelor’s thesis.
> Wait a minute. Ohboyohboyohboy. I have to put this
> in my faculty’s database of potential topics. Poor
> students... ;-)

That kind of autistic-genius, single-sided crazy stuff
doesn't appeal to me (in fact I think it is
destructive). I'm into execution and combinations -
i.e. not focusing on the technique per se. As an
example, when I did my Master in CS I had Lisp, C++,
zsh, and LaTeX (and more), everything working together
like glued to each other. I don't like one scientist
to do all the thinking, I like one engineer who does
everything and thinks at the same time.

-- 
underground experts united


* Re: How to grok a complicated regex?
  2015-03-14  3:58     ` Emanuel Berg
@ 2015-03-14  4:44       ` Emanuel Berg
  2015-03-14  4:58         ` Emanuel Berg
                           ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Emanuel Berg @ 2015-03-14  4:44 UTC
  To: help-gnu-emacs

I don't understand this discussion anymore or what
anyone is saying.

The representation is difficult to read, but not that
difficult, so there shouldn't be another
representation tool, a tool which isn't that difficult
to do, so it should be a Bachelor degree project.

The show must go on!

-- 
underground experts united


* Re: How to grok a complicated regex?
  2015-03-14  4:44       ` Emanuel Berg
@ 2015-03-14  4:58         ` Emanuel Berg
  2015-03-14  8:43         ` Thien-Thi Nguyen
       [not found]         ` <mailman.1997.1426324089.31049.help-gnu-emacs@gnu.org>
  2 siblings, 0 replies; 24+ messages in thread
From: Emanuel Berg @ 2015-03-14  4:58 UTC
  To: help-gnu-emacs

Emanuel Berg <embe8573@student.uu.se> writes:

> I don't understand this discussion anymore or what
> anyone are saying.
>
> The representation is difficult to read, but not that
> difficult, so there shouldn't be another
> representation tool, a tool which isn't that difficult
> to do, so it should be a Bachelor degree project.
>
> The show must go on!

OK, sorry about that. This discussion was interesting.
The whole session was good. J'ai confiance. Long live
techno-techno-totalitarianism! Now I'm too
light-headed, so I'm hitting the paleo-sack: when I
wake up in a week or so I'll read the latest messages
and offer all assimilated insights that have struck me
like a lightning bolt on Terra Prima.

-- 
underground experts united


* Re: How to grok a complicated regex?
  2015-03-13 23:16   ` Marcin Borkowski
  2015-03-14  0:12     ` Rasmus
@ 2015-03-14  5:14     ` Yuri Khan
  2015-03-14  7:03     ` Drew Adams
  2 siblings, 0 replies; 24+ messages in thread
From: Yuri Khan @ 2015-03-14  5:14 UTC
  To: Marcin Borkowski; +Cc: help-gnu-emacs@gnu.org

On Sat, Mar 14, 2015 at 5:16 AM, Marcin Borkowski <mbork@wmi.amu.edu.pl> wrote:

>>> "\\`\\(?:\\\\[([]\\|\\$+\\)?\\(.*?\\)\\(?:\\\\[])]\\|\\$+\\)?\\'"
>>>
> It's not really /difficult/.
> Intimidating, yes.  Boring, possibly.  Laborious (and mechanical), yes.
> But not /difficult/.

I tried it and it’s not very intimidating or boring or laborious or
difficult. Here’s my thought process:

First I unescape all backslashes, by global-replacing “\\” with “\”.

Then I insert spaces at key points to separate the syntactic
constructs. (Any literal spaces in the regexp need to be made
explicit, e.g. by replacing them with <space>.)

    \` \(?: \\ [([] \| \$+ \)? \(.*?\) \(?: \\ [])] \| \$+ \)? \'

Imagining the parentheses and alternatives as nested boxes might help, too:

       ┌─────────┬─────┐  ╔═══╗ ┌─────────┬─────┐
    \` │ \\ [([] │ \$+ │? ║.*?║ │ \\ [])] │ \$+ │? \'
       └─────────┴─────┘  ╚═══╝ └─────────┴─────┘

(Here the nesting level is just 1, so I didn’t actually need to draw
it, just match.)

Now I can read it:

1. start-of-string
2. optionally followed by either
    * a backslash and either an opening parenthesis or bracket
    * or one or more dollar signs
3. followed by any string, which is extracted as group 1
4. optionally followed by either
    * a backslash and either a closing bracket or parenthesis
    * or one or more dollar signs
5. followed by end-of-string

I can further grok it as matching a valid (La)TeX math formula: $…$,
$$…$$, \(…\), \[…\]; as well as some invalid markup such as $$$$…$$$,
$…\], \(…\], $$…, etc.
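
The reading above can be checked with a few calls to `string-match'.
The following is a sketch only; the names `my/script-size-re' and
`my/strip-math-delimiters' are hypothetical and not taken from
ox-latex.el:

```elisp
;; Hypothetical names; the regexp is the one under discussion.
(defconst my/script-size-re
  "\\`\\(?:\\\\[([]\\|\\$+\\)?\\(.*?\\)\\(?:\\\\[])]\\|\\$+\\)?\\'"
  "The regexp from the start of this thread.")

(defun my/strip-math-delimiters (string)
  "Return what group 1 captures when the regexp matches STRING, else nil."
  (when (string-match my/script-size-re string)
    (match-string 1 string)))

(my/strip-math-delimiters "$x^2$")      ; => "x^2"
(my/strip-math-delimiters "\\[a+b\\]")  ; => "a+b"
(my/strip-math-delimiters "$$$$x$$$")   ; => "x"  (invalid markup, still matches)
(my/strip-math-delimiters "$x\\]")      ; => "x"  (mismatched delimiters)
```

The last two calls illustrate the point about invalid markup: the opening
and closing alternatives are independent, so nothing forces them to agree.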

As for the bigger picture, I think, if a regular expression ends up
difficult to read, it needs to be decomposed into small, easily digestible
chunks, each with a descriptive name. Elisp has the let* form and the
rx macro for this purpose.
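
As a rough sketch of that decomposition, here is the same regexp built
from named pieces with `let*' and `rx-to-string'. The names `opener',
`closer', `body' and `my/script-size-rx' are invented for this example,
and the generated string may differ textually from the hand-written one
while matching the same strings:

```elisp
(require 'rx)

;; Hypothetical name; each piece gets a descriptive local name first.
(defconst my/script-size-rx
  (let* ((opener '(or (seq "\\" (any "([")) (one-or-more "$")))
         (closer '(or (seq "\\" (any ")]")) (one-or-more "$")))
         ;; Non-greedy "anything but newline", captured as group 1.
         (body   '(group (minimal-match (zero-or-more not-newline)))))
    (rx-to-string
     `(seq string-start
           (optional ,opener)
           ,body
           (optional ,closer)
           string-end)
     t))
  "An rx-built equivalent of the regexp discussed in this thread.")

(when (string-match my/script-size-rx "$x^2$")
  (match-string 1 "$x^2$"))             ; => "x^2"
```

The five-step reading given earlier maps one-to-one onto the five forms
inside `seq', which is most of the appeal of this notation.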


* RE: How to grok a complicated regex?
  2015-03-13 23:16   ` Marcin Borkowski
  2015-03-14  0:12     ` Rasmus
  2015-03-14  5:14     ` Yuri Khan
@ 2015-03-14  7:03     ` Drew Adams
  2 siblings, 0 replies; 24+ messages in thread
From: Drew Adams @ 2015-03-14  7:03 UTC
  To: Marcin Borkowski, help-gnu-emacs

> I’m not talking about changing the representation, but about making the
> existing one (which I agree is not /that/ bad) more comprehensible.
> Font lock, grouping and unescaping backslashes would be definitely helpful.
> 
> OTOH, I can imagine that some kind of diagrams might be helpful for
> someone.  The point is, in the end you have to read/write these regexen
> in their normal form anyway, so why not train yourself to understand
> their “default” representation instead of adding the burden of
> translating between representations?

I agree that a visual aid can help with learning - about regexps
in general and about Emacs regexp syntax in particular.

The Emacs Wiki page about regexps provides suggestions about learning
regexp syntax: http://www.emacswiki.org/emacs/RegularExpression.

Incremental regexp searching (`C-M-s') is one good tool for learning.
What it does not help so much with is subgroup matching - keeping
the different groups straight when there are several possibilities.

Rasmus mentioned that `visual-regexp.el' can help with that. Likewise,
Icicles search: it highlights different subgroup matches differently.

Here is a screenshot that shows a complex regexp (5 groups) and a
diagram that maps each group to its highlighting:
http://www.emacswiki.org/emacs/RegularExpression#RegexpsInIcicles

The regexp: "(\([-a-z*]+\) *\((\(([-a-z]+ *\([^)]*\))\))\).*".
A left paren, a name, possibly some whitespace, two left parens,
a name, possibly some whitespace, possibly non right-paren chars,
two right parens, and any chars other than newline.  But grouped
in a particular way.
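
Without any visual tool, the grouping can also be inspected from Lisp.
Here is a small sketch; `my/nested-paren-re' and `my/explain-groups' are
hypothetical names, and the sample string is made up. Note that in a Lisp
string literal each backslash of the regexp above has to be doubled:

```elisp
;; The regexp quoted above, written as a Lisp string literal.
(defconst my/nested-paren-re
  "(\\([-a-z*]+\\) *\\((\\(([-a-z]+ *\\([^)]*\\))\\))\\).*"
  "Regexp with nested groups, used to illustrate subgroup matching.")

(defun my/explain-groups (regexp string)
  "Return an alist (GROUP . TEXT) for REGEXP matched against STRING.
Group 0 is the whole match.  Return nil if REGEXP does not match."
  (when (string-match regexp string)
    (let ((count (/ (length (match-data)) 2))
          (n 0)
          result)
      (while (< n count)
        (push (cons n (match-string n string)) result)
        (setq n (1+ n)))
      (nreverse result))))

(my/explain-groups my/nested-paren-re "(foo ((bar baz)) qux")
;; => ((0 . "(foo ((bar baz)) qux")
;;     (1 . "foo")
;;     (2 . "((bar baz))")
;;     (3 . "(bar baz)")
;;     (4 . "baz"))
```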

I find that it is more often the case, for a complicated regexp,
that you encounter it readymade (in some existing code), and you
want to see what it is all about and perhaps make a modification
to it.  That use case is more typical than is creating a complex
regexp from scratch.  As Emanuel said, such regexps are often
arrived at incrementally - they start simpler and evolve.

I recommend playing with existing regexps this way, seeing what
they match by using them with a visual tool such as Icicles search,
`visual-regexp.el', or even `C-M-s'.  A tour through the Emacs
source code will show you plenty of interesting regexps you can
play with - font-lock keywords and patterns defining Emacs pages,
sentences, etc.


* How to grok a complicated regex?
@ 2015-03-14  8:16 martin rudalics
  0 siblings, 0 replies; 24+ messages in thread
From: martin rudalics @ 2015-03-14  8:16 UTC
  To: mbork; +Cc: help-gnu-emacs

 > so I have this monstrosity [note: I know, there are much worse ones,
 > too!]:
 >
 > "\\`\\(?:\\\\[([]\\|\\$+\\)?\\(.*?\\)\\(?:\\\\[])]\\|\\$+\\)?\\'"
 >
 > (it's in the org-latex--script-size function in ox-latex.el, if you're
 > curious).
 >
 > I'm not asking “what does this match” – I can read it myself.  But it
 > comes with a considerable effort.  Are you aware of any tools that might
 > help to understand such regexen?

You might want to try regexp-lock.el which you can find here:

https://lists.gnu.org/archive/html/emacs-devel/2014-10/msg00688.html

Eventually it should also appear on ELPA but I have to polish up some
things first.

Sincerely, martin





* Re: How to grok a complicated regex?
  2015-03-14  4:44       ` Emanuel Berg
  2015-03-14  4:58         ` Emanuel Berg
@ 2015-03-14  8:43         ` Thien-Thi Nguyen
       [not found]         ` <mailman.1997.1426324089.31049.help-gnu-emacs@gnu.org>
  2 siblings, 0 replies; 24+ messages in thread
From: Thien-Thi Nguyen @ 2015-03-14  8:43 UTC
  To: help-gnu-emacs

() Emanuel Berg <embe8573@student.uu.se>
() Sat, 14 Mar 2015 05:44:02 +0100

   I don't understand this discussion anymore or what
   anyone are saying.

I'm sorry we don't support backtracking in REader Generated
EXPressions.  Ha ha, just kidding.  :-D  M-x M-explore RET:

I notice many times what people say, you respond with your
personal preferences, without acknowledging in some way the
validity of other people's pov.  Maybe that method somehow
interferes w/ your understanding of other people and their
concerns.

-- 
Thien-Thi Nguyen ------------------------------------------
  (if you're human and you know it) read my lisp:
    (defun responsep (type via)
      (case type
        (technical (eq 'mailing-list via))
        ...))
----------------------------------------- GPG key: 4C807502


* Re: How to grok a complicated regex?
  2015-03-14  0:12     ` Rasmus
@ 2015-03-14 13:18       ` Stefan Monnier
       [not found]       ` <mailman.2003.1426339118.31049.help-gnu-emacs@gnu.org>
  2015-03-22  2:29       ` Tom Tromey
  2 siblings, 0 replies; 24+ messages in thread
From: Stefan Monnier @ 2015-03-14 13:18 UTC
  To: help-gnu-emacs

> Here's another project (for your students): adding lookaround to Emacs
> regexp /and/ have it merged.  It would be *insanely(!)* at times.

A better project: replace the regexp engine with one that does not
backtrack all the time.


        Stefan





* Re: How to grok a complicated regex?
       [not found]       ` <mailman.2003.1426339118.31049.help-gnu-emacs@gnu.org>
@ 2015-03-15  4:31         ` Rusi
  0 siblings, 0 replies; 24+ messages in thread
From: Rusi @ 2015-03-15  4:31 UTC
  To: help-gnu-emacs

On Saturday, March 14, 2015 at 6:48:41 PM UTC+5:30, Stefan Monnier wrote:
> > Here's another project (for your students): adding lookaround to Emacs
> > regexp /and/ have it merged.  It would be *insanely(!)* at times.
> 
> A better project: replace the regexp engine with one that does not
> backtrack all the time.
> 
> 
>         Stefan

http://www.colm.net/open-source/ragel/
already exists

It would be neat if it were part of emacs' core



* Re: How to grok a complicated regex?
       [not found] <mailman.1979.1426282552.31049.help-gnu-emacs@gnu.org>
  2015-03-13 22:46 ` How to grok a complicated regex? Emanuel Berg
@ 2015-03-18 16:40 ` Alan Mackenzie
  2015-03-19  8:15   ` Tassilo Horn
  2015-04-25  4:23 ` Rusi
  2 siblings, 1 reply; 24+ messages in thread
From: Alan Mackenzie @ 2015-03-18 16:40 UTC
  To: help-gnu-emacs

Hi, Marcin.

Sorry if I'm a bit late to this discussion.

Marcin Borkowski <mbork@wmi.amu.edu.pl> wrote:
> Hi all,

> so I have this monstrosity [note: I know, there are much worse ones,
> too!]:

> "\\`\\(?:\\\\[([]\\|\\$+\\)?\\(.*?\\)\\(?:\\\\[])]\\|\\$+\\)?\\'"

> (it's in the org-latex--script-size function in ox-latex.el, if you're
> curious).

> I'm not asking “what does this match” – I can read it myself.  But it
> comes with a considerable effort.  Are you aware of any tools that might
> help to understand such regexen?

> I know about re-builder, but it’s well suited for constructing a regex
> matching a given string, not the other way round.

> For instance, show-paren-mode does not really help here, since it seems
> to pair “\\(” with unescaped “)”.

> Any ideas?

I wrote myself the following tool.  It's not production quality, but you
might find it useful nonetheless.  To use it, type

     M-: (pp-regexp re-horror).

It displays the regexp at the end of the *scratch* buffer, dropping the
contents of any \(..\) construct by one line.  I find it useful.  So might
you.  Feel free to adapt it, or pass it on to other people.

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
(defun pp-regexp (regexp)
  "Pretty print a regexp.  This means, contents of \\\\\(s are lowered a line."
  (or (stringp regexp) (error "parameter is not a string."))
  (let ((depth 0)
        (re (replace-regexp-in-string
             "[\t\n\r\f]"
             (lambda (s)
               (or (cdr (assoc s '(("\t" . "??")
                                   ("\n" . "??")
                                   ("\r" . "??"))))
                   "??"))
             regexp))
        (start 0)     ; earliest position still without an acm-depth property.
        (pos 0)       ; current analysis position.
        (max-depth 0) ; How many lines do we need to print?
        (min-depth 0) ; Pick up "negative depth" errors.
        pr-line       ; output line being constructed
        line-no ; line number of pr-line, varies between min-depth and max-depth.
        ch
        )
    ;(translate-rnt re)
    ;; apply acm-depth properties to the whole string.
    (while (< start (length re))
      (setq pos (string-match ;; "\\\\\\((\\(\\?:\\)?\\||\\|)\\)"
                 "\\\\\\(\\\\\\|(\\(\\?:\\)?\\||\\|)\\)"
                                  re start))
      (put-text-property start (or pos (length re)) 'acm-depth depth re)
      (when pos
        (setq ch (aref (match-string 1 re) 0))
        (cond
         ((eq ch ?\\)
          (put-text-property pos (match-end 1) 'acm-depth depth re))
         ((eq ch ?\()
          (put-text-property pos (match-end 1) 'acm-depth depth re)
          (setq depth (1+ depth))
          (if (> depth max-depth) (setq max-depth depth)))

         ((eq ch ?\|)
          (put-text-property pos (match-end 1) 'acm-depth (1- depth) re)
          (if (< (1- depth) min-depth) (setq min-depth (1- depth))))

         (t                             ; (eq ch ?\))
          (setq depth (1- depth))
          (if (< depth min-depth) (setq min-depth depth))
          (put-text-property pos (match-end 1) 'acm-depth depth re))))
      (setq start (if pos (match-end 1) (length re))))

    ;; print out the strings
    (setq line-no min-depth)
    (while (<= line-no max-depth)
      (with-current-buffer (get-buffer-create "*scratch*")
        (goto-char (point-max)) (insert ?\n)
        (setq pr-line "")
        (setq start 0)
        (while (< start (length re))
          (setq pos (next-single-property-change start 'acm-depth re (length re)))
          (setq depth (get-text-property start 'acm-depth re))
          (setq pr-line
                (concat pr-line
                        (if (= depth line-no)
                            (substring re start pos)
                          (make-string (- pos start) ?\ ))))
          (setq start pos))
        (insert pr-line)
        (setq line-no (1+ line-no))))))
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

> (Note: if there are no such tools, I might be tempted to craft one.  Two
> things that come to my mind are proper highlighting of matching parens
> of various kinds and eldoc-like hints for all the regex constructs –
> I never seem to remember what does “\\`” do, for instance.  Also,
> displaying the string with single backslashes and not in the way it is
> actually typed in in Elisp, with all the backslash escaping, might be
> helpful.  Would there be a demand for such a tool larger than one
> person?)
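
On the single-backslash point: the doubled backslashes belong only to Lisp read syntax, so printing the string without escaping already gives that view.  A minimal sketch; the variable name `my-horror-regexp' is just an illustrative name, not anything from ox-latex.el:

```elisp
;; Illustrative variable name; the regexp is the one quoted in this thread.
(defvar my-horror-regexp
  "\\`\\(?:\\\\[([]\\|\\$+\\)?\\(.*?\\)\\(?:\\\\[])]\\|\\$+\\)?\\'")

;; `prin1' shows Lisp read syntax (quotes, doubled backslashes);
;; `princ' shows the characters the regexp engine actually receives.
(princ my-horror-regexp)
;; prints: \`\(?:\\[([]\|\$+\)?\(.*?\)\(?:\\[])]\|\$+\)?\'
```

Evaluated with M-:, the output goes to the echo area; (insert my-horror-regexp) in a scratch buffer gives the same text.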

> Best,

> -- 
> Marcin Borkowski
> http://octd.wmi.amu.edu.pl/en/Marcin_Borkowski
> Faculty of Mathematics and Computer Science
> Adam Mickiewicz University

-- 
Alan Mackenzie (Nuremberg, Germany).




* Re: How to grok a complicated regex?
  2015-03-18 16:40 ` Alan Mackenzie
@ 2015-03-19  8:15   ` Tassilo Horn
  0 siblings, 0 replies; 24+ messages in thread
From: Tassilo Horn @ 2015-03-19  8:15 UTC
  To: Alan Mackenzie; +Cc: help-gnu-emacs

Alan Mackenzie <acm@muc.de> writes:

> I wrote myself the following tool.  It's not production quality, but
> you might find it useful nonetheless.  To use it, Type
>
>      M-: (pp-regexp re-horror).
>
> It displays the regexp at the end of the *scratch* buffer, dropping
> the contents of any \(..\) construct by one line.

Interesting idea, and it helps a bit.  What would be really cool is a
transformation from regexp to rx form.  Oh, and that seems to exist
already (available from Marmalade and MELPA)!

  https://github.com/joddie/pcre2el

Example:

--8<---------------cut here---------------start------------->8---
(rxt-elisp-to-rx "\\`\\(?:\\\\[([]\\|\\$+\\)?\\(.*?\\)\\(?:\\\\[])]\\|\\$+\\)?\\'")
;; Evals to...
(seq bos
     (\? (or (seq "\\" (any "[" "("))
	     (+ "$")))
     (submatch (*\? nonl))
     (\? (or (seq "\\" (any ")" "]"))
	     (+ "$")))
     eos)
--8<---------------cut here---------------end--------------->8---
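
Read that way, the regexp strips an optional opening delimiter and an optional closing one, and captures whatever sits in between as group 1.  A small sketch to check that reading against a few inputs; `my-script-body' is a hypothetical helper name and not part of Org:

```elisp
;; Hypothetical helper (not part of Org): return what group 1 captures.
(defun my-script-body (string)
  "Return the part of STRING matched by group 1 of the regexp, or nil."
  (let ((re "\\`\\(?:\\\\[([]\\|\\$+\\)?\\(.*?\\)\\(?:\\\\[])]\\|\\$+\\)?\\'"))
    (when (string-match re string)
      (match-string 1 string))))

(my-script-body "$x^2$")      ; dollar delimiters
(my-script-body "\\(x^2\\)")  ; \( ... \) delimiters
(my-script-body "x^2")        ; no delimiters at all
```

Each of the three calls should return "x^2".  Since `.' does not match a newline, a multi-line string gives nil instead.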

Bye,
Tassilo




* Re: How to grok a complicated regex?
       [not found]         ` <mailman.1997.1426324089.31049.help-gnu-emacs@gnu.org>
@ 2015-03-20  1:05           ` Emanuel Berg
  0 siblings, 0 replies; 24+ messages in thread
From: Emanuel Berg @ 2015-03-20  1:05 UTC
  To: help-gnu-emacs

Thien-Thi Nguyen <ttn@gnu.org> writes:

> I notice many times what people say, you respond
> with your personal preferences, without
> acknowledging in some way the validity of other
> people's pov. Maybe that method somehow interferes
> w/ your understanding of other people and
> their concerns.

What do you mean?

Aaanyway...

For a person to be able to read those regexps that
look like comic book insults is not to be expected.
If someone is still able to do that, congratulations to
him/her, unless such an unusual talent comes with
drawbacks in other areas of life...

For a person who writes and reads such regexps every
day, if such a person exists, he or she should acquire
the skill to do so seamlessly, like I write, and you
read, this English paragraph and ditto Elisp form:

    (setq fill-nobreak-predicate '(fill-single-char-nobreak-p
                                   fill-single-word-nobreak-p))

There should be no need at all of a thought process
but instead instant recognition. How will such
a person arrive at that skill level? Simple, he/she
does it every day! There will be no need for a second
representation or even illustrative tools. Such will
be at best fun toys (very soon) as the actual
representation will be the only one ever considered.

For everyone else who perhaps does it now and then the
(de/re)construction method like picking apart a math
formula or a French MAB Model B pistol is nothing to
be ashamed of. Or, for that matter the incremental
method of understanding the general purpose and
inserting the missing char whenever a problem appears.

If anyone is very fond of the regexps and wishes to do
them all the time and for this reason thinks of tools
and toys as to be able to do that, that's fine, as
long as one is aware why it is done (well, maybe
that's not necessary, come to think of it).

But if so, then I have an even better idea, namely an
Emacs wiki page to which you can e-mail desired
regexps, and then the group of regexp lovers can
provide those after getting instruction either exactly
what it should be, or the general problem to be
solved, and then they can deliver it, stainless steel,
and everyone is happy.

-- 
underground experts united


* Re: How to grok a complicated regex?
  2015-03-14  0:12     ` Rasmus
  2015-03-14 13:18       ` Stefan Monnier
       [not found]       ` <mailman.2003.1426339118.31049.help-gnu-emacs@gnu.org>
@ 2015-03-22  2:29       ` Tom Tromey
  2015-03-22  2:44         ` Rasmus
  2 siblings, 1 reply; 24+ messages in thread
From: Tom Tromey @ 2015-03-22  2:29 UTC
  To: Rasmus; +Cc: help-gnu-emacs

Rasmus> Here's another project (for your students): adding lookaround to Emacs
Rasmus> regexp /and/ have it merged.  It would be *insanely(!)* useful at times.

It was done once already and either rejected or never merged in :-(

Tom




* Re: How to grok a complicated regex?
  2015-03-22  2:29       ` Tom Tromey
@ 2015-03-22  2:44         ` Rasmus
  0 siblings, 0 replies; 24+ messages in thread
From: Rasmus @ 2015-03-22  2:44 UTC
  To: tom; +Cc: help-gnu-emacs

Tom Tromey <tom@tromey.com> writes:

> It was done once already and either rejected or never merged in :-(

That's a real shame.  Any particular reason?

I need it for some Gnus settings that only take regexps (I manage "public"
mailing lists and my own catch-all email on one domain).

I'm sure it could be useful in e.g. Org as well, though speed is of course
an issue.
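
Where Lisp code can run (which is not the case for settings that accept only a regexp), a negative lookahead can be approximated by searching and then rejecting candidates with `looking-at'.  A rough sketch only; the function name is made up for illustration and this is not an existing Emacs facility:

```elisp
;; Hypothetical helper: emulate "REGEXP not followed by FORBIDDEN".
(defun my-search-forward-not-followed-by (regexp forbidden &optional bound)
  "Search forward for REGEXP where the match is not followed by FORBIDDEN.
Leave point after the match and return it; return nil and leave point
unchanged if there is no such match before BOUND."
  (let ((start (point))
        (found nil)
        (done nil))
    (while (not done)
      (cond
       ;; No further candidate: give up.
       ((not (re-search-forward regexp bound t))
        (setq done t))
       ;; Candidate accepted: text after it does not match FORBIDDEN.
       ((not (save-match-data (looking-at forbidden)))
        (setq found (point)
              done t))
       ;; Candidate rejected.  After an empty match, step forward by hand
       ;; so the loop cannot stall.
       ((= (match-beginning 0) (match-end 0))
        (if (or (eobp) (and bound (>= (point) bound)))
            (setq done t)
          (forward-char 1)))))
    (unless found (goto-char start))
    found))

(with-temp-buffer
  (insert "foobar foobaz")
  (goto-char (point-min))
  (when (my-search-forward-not-followed-by "foo" "bar")
    (match-beginning 0)))   ; => 8, the start of the second "foo"
```

The match data left behind is that of REGEXP, so `match-string' and friends work as after a plain `re-search-forward'.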

—Rasmus

-- 
When the facts change, I change my mind. What do you do, sir?


* Re: How to grok a complicated regex?
  2015-03-13 21:35 Marcin Borkowski
  2015-03-13 21:45 ` Marcin Borkowski
  2015-03-13 21:47 ` Alexis
@ 2015-03-23 12:18 ` Vaidheeswaran C
  2 siblings, 0 replies; 24+ messages in thread
From: Vaidheeswaran C @ 2015-03-23 12:18 UTC
  To: help-gnu-emacs


On Saturday 14 March 2015 03:05 AM, Marcin Borkowski wrote:
> 
> "\\`\\(?:\\\\[([]\\|\\$+\\)?\\(.*?\\)\\(?:\\\\[])]\\|\\$+\\)?\\'"
> 
> (it's in the org-latex--script-size function in ox-latex.el, if you're
> curious).
> 
> I'm not asking “what does this match” – I can read it myself.  But it
> comes with a considerable effort.  Are you aware of any tools that might
> help to understand such regexen?

Get xr.el from
http://debbugs.gnu.org/cgi/bugreport.cgi?msg=40;filename=xr.el;att=1;bug=13369

M-x load-library xr.el

M-x pp-eval-expression RET
(xr "\\`\\(?:\\\\[([]\\|\\$+\\)?\\(.*?\\)\\(?:\\\\[])]\\|\\$+\\)?\\'") RET

    (seq bos
	 (opt
	  (or
	   (seq "\\"
		(any "[" "("))
	   (one-or-more "$")))
	 (group
	  (minimal-match
	   (zero-or-more nonl)))
	 (opt
	  (or
	   (seq "\\"
		(any ")" "]"))
	   (one-or-more "$")))
	 eos)
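
The long-form names in that output are also accepted by the built-in rx library, so the form can be turned back into a regexp and compared with the original.  A sketch under that assumption; the names `my-rebuilt-regexp' and `my-group-1' are illustrative only:

```elisp
(require 'rx)

;; Regexp string generated from the long-form rx shown above.
(defvar my-rebuilt-regexp
  (rx-to-string
   '(seq bos
         (opt (or (seq "\\" (any "[" "("))
                  (one-or-more "$")))
         (group (minimal-match (zero-or-more nonl)))
         (opt (or (seq "\\" (any ")" "]"))
                  (one-or-more "$")))
         eos)
   t)
  "Regexp rebuilt from the rx form; the variable name is illustrative.")

(defun my-group-1 (regexp string)
  "Return what group 1 of REGEXP captures in STRING, or nil if no match."
  (when (string-match regexp string)
    (match-string 1 string)))

;; Compare the rebuilt regexp with the original on a few sample inputs.
(let ((original "\\`\\(?:\\\\[([]\\|\\$+\\)?\\(.*?\\)\\(?:\\\\[])]\\|\\$+\\)?\\'"))
  (mapcar (lambda (s)
            (equal (my-group-1 original s)
                   (my-group-1 my-rebuilt-regexp s)))
          '("$x^2$" "\\[a+b\\]" "plain")))
```

Every element of the resulting list should be t if the two regexps agree on these inputs.  The generated string itself may differ from the hand-written one in details such as the order of characters inside a bracket expression, so comparing behaviour is safer than comparing text.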


There is also lex (see http://elpa.gnu.org/packages/lex.html) which
provides similar functionality.  FWIW, my edit window "disappears" if
I do

(lex-parse-re
"\\`\\(?:\\\\[([]\\|\\$+\\)?\\(.*?\\)\\(?:\\\\[])]\\|\\$+\\)?\\'")





* Re: How to grok a complicated regex?
       [not found] <mailman.1979.1426282552.31049.help-gnu-emacs@gnu.org>
  2015-03-13 22:46 ` How to grok a complicated regex? Emanuel Berg
  2015-03-18 16:40 ` Alan Mackenzie
@ 2015-04-25  4:23 ` Rusi
  2015-04-27 13:26   ` Julien Cubizolles
  2 siblings, 1 reply; 24+ messages in thread
From: Rusi @ 2015-04-25  4:23 UTC
  To: help-gnu-emacs

On Saturday, March 14, 2015 at 3:05:55 AM UTC+5:30, Marcin Borkowski wrote:
> Hi all,
> 
> so I have this monstrosity [note: I know, there are much worse ones,
> too!]:
> 
> "\\`\\(?:\\\\[([]\\|\\$+\\)?\\(.*?\\)\\(?:\\\\[])]\\|\\$+\\)?\\'"

<details snipped>

> 
> Any ideas?


Just saw this
http://crowding.github.io/blog/2014/09/09/editing-regexes-interactively-in-emacs/



* Re: How to grok a complicated regex?
  2015-04-25  4:23 ` Rusi
@ 2015-04-27 13:26   ` Julien Cubizolles
  0 siblings, 0 replies; 24+ messages in thread
From: Julien Cubizolles @ 2015-04-27 13:26 UTC
  To: help-gnu-emacs

Rusi <rustompmody@gmail.com> writes:

> Just saw this
> http://crowding.github.io/blog/2014/09/09/editing-regexes-interactively-in-emacs/

For helm users, helm-regexp can be useful too: it allows one to save the
regexp as a sexp and run a query-replace-regexp from the builder.





end of thread, other threads:[~2015-04-27 13:26 UTC | newest]

Thread overview: 24+ messages
     [not found] <mailman.1979.1426282552.31049.help-gnu-emacs@gnu.org>
2015-03-13 22:46 ` How to grok a complicated regex? Emanuel Berg
2015-03-13 23:16   ` Marcin Borkowski
2015-03-14  0:12     ` Rasmus
2015-03-14 13:18       ` Stefan Monnier
     [not found]       ` <mailman.2003.1426339118.31049.help-gnu-emacs@gnu.org>
2015-03-15  4:31         ` Rusi
2015-03-22  2:29       ` Tom Tromey
2015-03-22  2:44         ` Rasmus
2015-03-14  5:14     ` Yuri Khan
2015-03-14  7:03     ` Drew Adams
     [not found]   ` <mailman.1984.1426288628.31049.help-gnu-emacs@gnu.org>
2015-03-14  3:58     ` Emanuel Berg
2015-03-14  4:44       ` Emanuel Berg
2015-03-14  4:58         ` Emanuel Berg
2015-03-14  8:43         ` Thien-Thi Nguyen
     [not found]         ` <mailman.1997.1426324089.31049.help-gnu-emacs@gnu.org>
2015-03-20  1:05           ` Emanuel Berg
2015-03-18 16:40 ` Alan Mackenzie
2015-03-19  8:15   ` Tassilo Horn
2015-04-25  4:23 ` Rusi
2015-04-27 13:26   ` Julien Cubizolles
2015-03-14  8:16 martin rudalics
  -- strict thread matches above, loose matches on Subject: below --
2015-03-13 21:35 Marcin Borkowski
2015-03-13 21:45 ` Marcin Borkowski
2015-03-13 21:47 ` Alexis
2015-03-13 21:57   ` Marcin Borkowski
2015-03-23 12:18 ` Vaidheeswaran C
