From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: martin rudalics <rudalics@gmx.at>
Newsgroups: gmane.emacs.devel
Subject: Re: Unquoted special characters in regexps
Date: Sun, 05 Mar 2006 12:54:14 +0100
Message-ID: <440AD166.5040108@gmx.at>
References: <wkek1rz72f.fsf@gmx.at>
	<jeek1rz3e8.fsf@sykes.suse.de>	<4400AD8E.5050001@gmx.at>
	<jeaccfz142.fsf@sykes.suse.de>	<4400BBB1.2050800@gmx.at>	<200602252213.k1PMDBP24413@raven.dms.auburn.edu>	<4401A98D.3070809@gmx.at>
	<jefym6ut3l.fsf@sykes.suse.de>	<4401E0F2.7030800@gmx.at>
	<je7j7ic8cf.fsf@sykes.suse.de>	<4401FCBA.1070206@gmx.at>
	<E1FDneo-00050N-Nn@fencepost.gnu.org>	<200602280044.k1S0iHG07279@raven.dms.auburn.edu>	<jkmzg5rkai.fsf@glug.org>
	<200603050337.k253brP03395@raven.dms.auburn.edu>
NNTP-Posting-Host: main.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-15; format=flowed
Content-Transfer-Encoding: 7bit
X-Trace: sea.gmane.org 1141564737 19865 80.91.229.2 (5 Mar 2006 13:18:57 GMT)
X-Complaints-To: usenet@sea.gmane.org
NNTP-Posting-Date: Sun, 5 Mar 2006 13:18:57 +0000 (UTC)
Cc: ttn@gnu.org, emacs-devel@gnu.org
Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Sun Mar 05 14:18:55 2006
Return-path: <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>
Envelope-to: ged-emacs-devel@m.gmane.org
Original-Received: from lists.gnu.org ([199.232.76.165])
	by ciao.gmane.org with esmtp (Exim 4.43)
	id 1FFt8M-00060T-Au
	for ged-emacs-devel@m.gmane.org; Sun, 05 Mar 2006 14:18:50 +0100
Original-Received: from localhost ([127.0.0.1] helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43)
	id 1FFt8S-0004P5-Pp
	for ged-emacs-devel@m.gmane.org; Sun, 05 Mar 2006 08:18:56 -0500
Original-Received: from mailman by lists.gnu.org with tmda-scanned (Exim 4.43)
	id 1FFt8D-0003Mv-0r
	for emacs-devel@gnu.org; Sun, 05 Mar 2006 08:18:41 -0500
Original-Received: from exim by lists.gnu.org with spam-scanned (Exim 4.43)
	id 1FFt89-00030E-4x
	for emacs-devel@gnu.org; Sun, 05 Mar 2006 08:18:40 -0500
Original-Received: from [199.232.76.173] (helo=monty-python.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43) id 1FFt89-0002zf-1D
	for emacs-devel@gnu.org; Sun, 05 Mar 2006 08:18:37 -0500
Original-Received: from [213.165.64.20] (helo=mail.gmx.net)
	by monty-python.gnu.org with smtp (Exim 4.52) id 1FFtAB-0001vr-Ql
	for emacs-devel@gnu.org; Sun, 05 Mar 2006 08:20:44 -0500
Original-Received: (qmail invoked by alias); 05 Mar 2006 13:18:26 -0000
Original-Received: from N921P004.adsl.highway.telekom.at (EHLO [62.47.59.4])
	[62.47.59.4]
	by mail.gmx.net (mp033) with SMTP; 05 Mar 2006 14:18:26 +0100
X-Authenticated: #14592706
User-Agent: Mozilla Thunderbird 1.0 (Windows/20041206)
X-Accept-Language: de-DE, de, en-us, en
Original-To: Luc Teirlinck <teirllm@dms.auburn.edu>
In-Reply-To: <200603050337.k253brP03395@raven.dms.auburn.edu>
X-Y-GMX-Trusted: 0
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: "Emacs development discussions." <emacs-devel.gnu.org>
List-Unsubscribe: <http://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/pipermail/emacs-devel>
List-Post: <mailto:emacs-devel@gnu.org>
List-Help: <mailto:emacs-devel-request@gnu.org?subject=help>
List-Subscribe: <http://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=subscribe>
Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Xref: news.gmane.org gmane.emacs.devel:51229
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/51229>

Luc Teirlinck wrote:
 > Also, forward and backward views of a regexp are not
 > algorithmically equivalent.  If you read a regexp forward, you know
 > immediately when you encounter a character whether it has to be taken
 > literally or not (or at worst after a _very_ limited number of
 > characters, as the second `[' in "[[:...").  If you read the regexp
 > backward, you may have to read all the way back to the beginning
 > before you can be sure that a `]' is to be taken literally.

How do you read the following regexp from `cc-langs.el'?

(concat
  "\\("
  "[\)\[\(]"
  (if (c-lang-const c-type-modifier-kwds)
      (concat
       "\\|"
       ;; "throw" in `c-type-modifier-kwds' is followed
       ;; by a parenthesis list, but no extra measures
       ;; are necessary to handle that.
       (regexp-opt (c-lang-const c-type-modifier-kwds) t)
       "\\>")
    "")
  "\\)")

Do you really evaluate the (c-lang-const ...)s _before_ looking at the
closing `\\)'?  What would you do if the value of `c-type-modifier-kwds'
were available at run-time only?
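
To make the question concrete, here is a minimal sketch in which the
keyword list is an ordinary function argument instead of a
`c-lang-const' lookup.  The name `my-build-regexp' and the sample
keyword lists are hypothetical; they are not part of `cc-langs.el'.

```elisp
;; Hypothetical helper: same shape as the cc-langs.el expression above,
;; but the keyword list KWDS is supplied by the caller at run time.
(defun my-build-regexp (kwds)
  (concat
   "\\("
   "[)[(]"                       ; what the Lisp reader makes of "[\)\[\(]"
   (if kwds
       (concat "\\|" (regexp-opt kwds t) "\\>")
     "")
   "\\)"))

;; The outer \\( ... \\) group frames the result whatever KWDS holds.
(my-build-regexp nil)                                   ; the regexp \([)[(]\)
(string-match (my-build-regexp '("throw")) "throw (x)") ; matches at 0
(string-match (my-build-regexp nil) "throw (x)")        ; matches the "(" at 6
```

The enclosing group and the character alternative can be read off the
source without knowing what KWDS will contain.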

When trying to understand such regexps I break them up into parts first.
Such parts are, in my understanding, groups like `\\(...\\)',
subexpressions delimited by `\\|', and character alternatives.  Next I
try to understand the parts that interest me without paying attention to
parts that do not relate to my specific problem.  And I would have
trouble isolating a character alternative when the author matches a
literal right bracket with `]'.
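
As a small illustration of that difficulty, the sketch below shows
three regexps that all match a literal right bracket under Emacs
regexp syntax.  The sample strings are made up for the example.

```elisp
;; `]' is literal when it is the first character of a character
;; alternative.
(string-match "[]a]" "]")        ; matches at 0

;; Outside a character alternative an unquoted `]' is literal as well.
(string-match "x]" "x]")         ; matches at 0

;; Quoted form: the backslash marks the bracket as literal on sight, so
;; a reader scanning from the right need not hunt for an opening `['.
(string-match "x\\]" "x]")       ; matches at 0
```

In the second form nothing but the absence of an earlier `[' tells the
reader that the bracket does not close a character alternative.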

People can make reading a regexp truly awkward by writing kludgy
expressions like

(let ((keywords (concat "\\([;(){}`|&]\\|^\\)[ \t]*\\(\\("
			(regexp-opt (sh-feature sh-leading-keywords) t)
			"[ \t]+\\)?"
			(regexp-opt (append (sh-feature sh-leading-keywords)
					    (sh-feature sh-other-keywords))
				    t))))

in `sh-font-lock-keywords-1' which I understand correctly iff I read the
definition of the entire function first.  Such expressions are, however,
rare in present Emacs code.

 > Hence, reading a regexp forward _is_ algorithmically _very_ superior
 > over reading it backward if your purpose is to understand the regexp.

If my purpose is to understand how a regexp engine interprets a regexp,
reading it forward is superior.  If, however, my purpose is to
understand a complex regexp, I want to guess the author's intentions
first.  In that case I do want to break up the expression into its
constituents.  In general, languages hiding implementation details are
easier to use than languages that require users to know how specific
features are implemented.

 > I must admit, however, that if what you want is to uncover the subliminal
 > satanic messages in the regexp, then you _have_ to read it backward.

It's better to avoid "subliminal satanic messages" when _writing_ a
regexp.  It's bad if you have to uncover them when reading a regexp.