From: Marcin Borkowski
To: help-gnu-emacs@gnu.org
Newsgroups: gmane.emacs.help
Subject: Re: How to grok a complicated regex?
Date: Sat, 14 Mar 2015 00:16:50 +0100
Message-ID: <87egosa3od.fsf@wmi.amu.edu.pl>
In-reply-to: <87twxo1pnr.fsf@debian.uxu>

On 2015-03-13, at 23:46, Emanuel Berg wrote:

> Marcin Borkowski writes:
>
>> so I have this monstrosity [note: I know, there are
>> much worse ones, too!]:
>>
>> "\\`\\(?:\\\\[([]\\|\\$+\\)?\\(.*?\\)\\(?:\\\\[])]\\|\\$+\\)?\\'"
>>
>> (it's in the org-latex--script-size function in
>> ox-latex.el, if you're curious).
>>
>> I'm not asking “what does this match” – I can read
>> it myself. But it comes with a considerable effort.
>
> I dare say most people (even programmers) cannot read
> that so if you can that's great. As a math

Really? It's not /that/ difficult. You only need enough coffee (or
tea, in my case), time and motivation. You don’t need a genius, or even
IQ higher than, say, 90 or so. It's not really /difficult/.
Intimidating, yes. Boring, possibly. Laborious (and mechanical), yes.
But not /difficult/.

> professional you are of course aware of the discipline
> called automata theory that deals with such things.

Well, as an analyst working in metric fixed point theory, that's just it.
I'm /aware/ of automata theory – (almost) nothing more. ;-)

> Perhaps relational algebra might help to, if the data
> in the sets are strings. But automata theory should be
> it even more.
>
> Also, remember you don't have to understand those
> expressions. Often they are setup incrementally. They
> only need to be correct. The computer understands them
> - the programmer only understands the purpose, and the
> latest edition. Kind of risky, perhaps not what I math
> person would be appealed by, but I've constructed many
> that way so I know that method works.

That reminds me of the von Neumann quote: “In mathematics, you don’t
/understand/ things – you just /get used/ to them.”

>> Are you aware of any tools that might help to
>> understand such regexen?
>
> I have seen tools with which you can construct such
> expressions and they output figures, states,
> transitions, and so on. I wonder how advanced
> expression they can deal with? But if you get the
> basics right, it should be just basic building blocks
> that stick together and from there on the sky is the
> limit.
>
> Instead the problem is, as I see it: will those
> figures, balls and arrows, tagged with preconditions,
> postconditions, everything you can think of, will that
> actually be *clearer*?

As we both point out, I’m not talking about changing the representation,
but about making the existing one (which I agree is not /that/ bad) more
comprehensible. Font lock, grouping and unescaping backslashes would
definitely be helpful. OTOH, I can imagine that some kind of diagrams
might be helpful for someone. The point is, in the end you have to
read/write these regexen in their normal form anyway, so why not train
yourself to understand their “default” representation instead of adding
the burden of translating between representations?

> If I were to do it (which I am not thanks god) my
> answer would be *no*.
> The only way I could do it would
> instead be the opposite. Train the brain with such
> expressions - exactly as they are - day in, day out,
> until they are second nature.
>
> Example: a C++ OO project with classes and everything.
> Silly inheritance and interfaces. Some people would
> consider those pretty darn difficult to understand.
> But to the seasoned C++ programmer (no exaggerating
> here, a few years of focused training is enough) those
> programs are clear. For those guys, giving up writing
> C++ code and instead using some other representation
> (be it graphical or not) would be to in one stroke
> cripple their skills.
>
> So no, I think that representation is the best there
> is. To translate it back and forth would not only be

I’m not sure whether it’s the best – but it’s a standard (more or less,
Emacs’ regexen are not really “standard” by today’s, well, standards –
but hardly anything about Emacs is “standard” or “typical”, so who
cares;-)).

> very difficult to do - and even if possible, which of

I disagree. I don’t think that such a translator would be a difficult
one to write. If only I was a student again, with plenty of spare time,
I might have taken the challenge and tried to write one in TeX, so that
some TeX macro, given an (Emacs) regex, would produce a nicely typeset
diagram. Wow, what a nice project for a bachelor’s thesis. Wait
a minute. Ohboyohboyohboy. I have to put this in my faculty’s database
of potential topics. Poor students... ;-)

(BTW, I did once write a poor man’s parser in pure TeX; since there was
no regex engine written in TeX back then (now there is one!), I had to
craft a simple automaton myself. Not extremely pleasant work...)
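
A minimal sketch of what an alternative representation of the regex quoted at the top could look like, assuming Emacs's built-in `rx` macro (this is not the TeX translator mentioned above). The names `my/script-size-re`, `my/script-size-rx` and `my/strip-math-delimiters` are hypothetical; only the regex string itself comes from this thread, and the `rx` form is a hand translation that should match the same strings, not a verified equivalent:

```elisp
;; Hypothetical names throughout; only the regex string is from the thread.
(require 'rx)

(defconst my/script-size-re
  "\\`\\(?:\\\\[([]\\|\\$+\\)?\\(.*?\\)\\(?:\\\\[])]\\|\\$+\\)?\\'"
  "The regex quoted in the thread, as a Lisp string.")

(defconst my/script-size-rx
  (rx string-start
      ;; Optional opening delimiter: backslash + ( or [, or a run of $.
      (? (or (seq "\\" (any "([")) (+ "$")))
      ;; Group 1: the content between the delimiters, matched non-greedily.
      (group (*? nonl))
      ;; Optional closing delimiter: backslash + ] or ), or a run of $.
      (? (or (seq "\\" (any "])")) (+ "$")))
      string-end)
  "Hand-written `rx' spelling intended to match like `my/script-size-re'.")

(defun my/strip-math-delimiters (string)
  "Return STRING without surrounding math delimiters, or nil if no match."
  (when (string-match my/script-size-rx string)
    (match-string 1 string)))

;; Printing the string shows the regex with the Lisp-level escaping removed.
(message "%s" my/script-size-re)
```

Evaluating `(my/strip-math-delimiters "$$a+b$$")` should give the text between the dollar signs. The `rx` form trades brevity for named parts, so whether it is actually clearer remains the question under discussion here.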
> course it is, because a representation is just a
> representation of I don't know how many possible - I
> don't see the end result being any more clear: on the
> contrary, most likely.
>
> What I would do - try to get it more readable by using
> classes, string classes (do they exist?), and even
> more advanced constructs if necessary - as in this
> simple example:
>
> (defconst stop-char-default "\\([[:punct:]]\\|[[:space:]][[:alnum:]]\\)")
>
> How do you define those? Can you identify any which
> aren't there, but could/should be?
>
> Example: say there is a class called "delimiters"
> which contain [, (, {, <, >, }, ), and ]. Can you
> split that up, in "opening-delimiters" and closing
> ditto?
>
> Second, exactly you mentioned - the font lock issue -
> work on that.
>
> You do know, of course, of
>
> font-lock-regexp-grouping-construct
> font-lock-regexp-grouping-backslash
>
> Are there more of those, that you can identify, and
> add?

There could be quite a few. (As Alexis pointed out, a tool I was
writing about seems to exist – if it’s not satisfactory, I could think
about extending it somehow. Not very probable, though – I’m too busy
now. If only someone could be paying me for goofing around and playing
with Emacs hacks...)

Thanks for your input, and best regards!

-- 
Marcin Borkowski
http://octd.wmi.amu.edu.pl/en/Marcin_Borkowski
Faculty of Mathematics and Computer Science
Adam Mickiewicz University