Unquoted special characters in regexps

unofficial mirror of emacs-devel@gnu.org 

* Unquoted special characters in regexps
@ 2006-02-25 17:23 martin rudalics
  2006-02-25 18:42 ` Andreas Schwab
  0 siblings, 1 reply; 81+ messages in thread
From: martin rudalics @ 2006-02-25 17:23 UTC (permalink / raw)


Section 34.3.1.1 (Special Characters in Regular Expressions) of the
Elisp manual says:

*Please note:* For historical compatibility, special characters are
treated as ordinary ones if they are in contexts where their special
meanings make no sense. ...  It is poor practice to depend on this
behavior; quote the special character anyway, regardless of where it
appears.

The three patches below eliminate spurious occurrences of such practice:

	* font-lock.el (lisp-font-lock-keywords-2)
	* emacs-lisp/rx.el (rx-check-any, rx-check-not)
	* generic-x.el (reg-generic-mode): Quote "]"s in regexps when
	they have no special meaning.

================================================================================
*** font-lock.el	Wed Feb  1 10:17:44 2006
--- font-lock.el	Thu Feb 16 20:24:48 2006
***************
*** 2120,2126 ****
         ;; Erroneous structures.
         ("(\\(abort\\|assert\\|warn\\|check-type\\|cerror\\|error\\|signal\\)\\>" 1 font-lock-warning-face)
         ;; Words inside \\[] tend to be for `substitute-command-keys'.
!        ("\\\\\\\\\\[\\(\\sw+\\)]" 1 font-lock-constant-face prepend)
         ;; Words inside `' tend to be symbol names.
         ("`\\(\\sw\\sw+\\)'" 1 font-lock-constant-face prepend)
         ;; Constant values.
--- 2120,2126 ----
         ;; Erroneous structures.
         ("(\\(abort\\|assert\\|warn\\|check-type\\|cerror\\|error\\|signal\\)\\>" 1 font-lock-warning-face)
         ;; Words inside \\[] tend to be for `substitute-command-keys'.
!        ("\\\\\\\\\\[\\(\\sw+\\)\\]" 1 font-lock-constant-face prepend)
         ;; Words inside `' tend to be symbol names.
         ("`\\(\\sw\\sw+\\)'" 1 font-lock-constant-face prepend)
         ;; Constant values.
================================================================================
*** rx.el	Sat Nov  5 20:44:46 2005
--- rx.el	Thu Feb 16 20:28:18 2006
***************
*** 371,378 ****
       (if (eq ?^ (aref arg 0))
  	 (setq arg (concat "\\" arg)))
       ;; Remove ] and set flag for adding it to start of overall result.
!      (when (string-match "]" arg)
!        (setq arg (replace-regexp-in-string "]" "" arg)
  	     rx-bracket "]")))
     (when (symbolp arg)
       (let ((translation (condition-case nil
--- 371,378 ----
       (if (eq ?^ (aref arg 0))
  	 (setq arg (concat "\\" arg)))
       ;; Remove ] and set flag for adding it to start of overall result.
!      (when (string-match "\\]" arg)
!        (setq arg (replace-regexp-in-string "\\]" "" arg)
  	     rx-bracket "]")))
     (when (symbolp arg)
       (let ((translation (condition-case nil
***************
*** 404,410 ****
  (defun rx-check-not (arg)
    "Check arg ARG for Rx `not'."
    (unless (or (and (symbolp arg)
! 		   (string-match "\\`\\[\\[:[-a-z]:]]\\'"
  				 (condition-case nil
  				     (rx-to-string arg 'no-group)
  				   (error ""))))
--- 404,410 ----
  (defun rx-check-not (arg)
    "Check arg ARG for Rx `not'."
    (unless (or (and (symbolp arg)
! 		   (string-match "\\`\\[\\[:[-a-z]:\\]\\]\\'"
  				 (condition-case nil
  				     (rx-to-string arg 'no-group)
  				   (error ""))))
================================================================================
*** generic-x.el	Sat Nov  5 20:44:28 2005
--- generic-x.el	Sat Feb 25 17:22:58 2006
***************
*** 433,439 ****
  (define-generic-mode reg-generic-mode
    '(?\;)
    '("key" "classes_root" "REGEDIT" "REGEDIT4")
!   '(("\\(\\[.*]\\)"          1 font-lock-constant-face)
      ("^\\([^\n\r]*\\)\\s-*=" 1 font-lock-variable-name-face))
    '("\\.[rR][eE][gG]\\'")
    (list
--- 433,439 ----
  (define-generic-mode reg-generic-mode
    '(?\;)
    '("key" "classes_root" "REGEDIT" "REGEDIT4")
!   '(("\\(\\[.*\\]\\)"        1 font-lock-constant-face)
      ("^\\([^\n\r]*\\)\\s-*=" 1 font-lock-variable-name-face))
    '("\\.[rR][eE][gG]\\'")
    (list
================================================================================
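
A minimal sketch, not part of the patches above, for checking that the
old and new forms of one of the patched regexps match the same text.
The helper name `my-same-match-p' and the sample registry line are
assumptions made for illustration only.

```elisp
(defun my-same-match-p (regexp-a regexp-b text)
  "Return non-nil if REGEXP-A and REGEXP-B agree on their first match in TEXT.
Hypothetical helper: compares the start and end of match 0."
  (let ((a (and (string-match regexp-a text)
                (list (match-beginning 0) (match-end 0))))
        (b (and (string-match regexp-b text)
                (list (match-beginning 0) (match-end 0)))))
    (equal a b)))

;; Old and new forms of the generic-x.el pattern, on a made-up sample line.
(my-same-match-p "\\(\\[.*]\\)" "\\(\\[.*\\]\\)"
                 "[HKEY_CLASSES_ROOT\\.txt]")   ; => t
```

Outside a character alternative, a backslash followed by `]' matches a
literal `]', so the two spellings should not differ in behavior.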

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unquoted special characters in regexps
  2006-02-25 17:23 Unquoted special characters in regexps martin rudalics
@ 2006-02-25 18:42 ` Andreas Schwab
  2006-02-25 19:18   ` martin rudalics
  0 siblings, 1 reply; 81+ messages in thread
From: Andreas Schwab @ 2006-02-25 18:42 UTC (permalink / raw)
  Cc: emacs-devel

martin rudalics <rudalics@gmx.at> writes:

> 	* font-lock.el (lisp-font-lock-keywords-2)
> 	* emacs-lisp/rx.el (rx-check-any, rx-check-not)
> 	* generic-x.el (reg-generic-mode): Quote "]"s in regexps when
> 	they have no special meaning.

']' is not special in regexps.

Andreas.

-- 
Andreas Schwab, SuSE Labs, schwab@suse.de
SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
PGP key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."


* Re: Unquoted special characters in regexps
  2006-02-25 18:42 ` Andreas Schwab
@ 2006-02-25 19:18   ` martin rudalics
  2006-02-25 19:31     ` Andreas Schwab
  0 siblings, 1 reply; 81+ messages in thread
From: martin rudalics @ 2006-02-25 19:18 UTC (permalink / raw)
  Cc: emacs-devel

Andreas Schwab wrote:
> martin rudalics <rudalics@gmx.at> writes:
> 
> 
>>	* font-lock.el (lisp-font-lock-keywords-2)
>>	* emacs-lisp/rx.el (rx-check-any, rx-check-not)
>>	* generic-x.el (reg-generic-mode): Quote "]"s in regexps when
>>	they have no special meaning.
> 
> 
> ']' is not special in regexps.
> 
> Andreas.
> 
From the Elisp manual:

34.3.1 Syntax of Regular Expressions
------------------------------------

Regular expressions have a syntax in which a few characters are special
constructs and the rest are "ordinary".  An ordinary character is a
simple regular expression that matches that character and nothing else.
The special characters are `.', `*', `+', `?', `[', ------> `]' <------, `^', `$', and
`\'; no new special characters will be defined in the future.  Any
other character appearing in a regular expression is ordinary, unless a
`\' precedes it.


* Re: Unquoted special characters in regexps
  2006-02-25 19:18   ` martin rudalics
@ 2006-02-25 19:31     ` Andreas Schwab
  2006-02-25 20:18       ` martin rudalics
  0 siblings, 1 reply; 81+ messages in thread
From: Andreas Schwab @ 2006-02-25 19:31 UTC (permalink / raw)
  Cc: emacs-devel

martin rudalics <rudalics@gmx.at> writes:

> From the Elisp manual:
>
> 34.3.1 Syntax of Regular Expressions
> ------------------------------------
>
> Regular expressions have a syntax in which a few characters are special
> constructs and the rest are "ordinary".  An ordinary character is a
> simple regular expression that matches that character and nothing else.
> The special characters are `.', `*', `+', `?', `[', ------> `]' <------, `^', `$', and

This is incorrect.  ']' is only special in a bracket expression (like
'-').

Andreas.



* Re: Unquoted special characters in regexps
  2006-02-25 19:31     ` Andreas Schwab
@ 2006-02-25 20:18       ` martin rudalics
  2006-02-25 22:09         ` Andreas Schwab
                           ` (2 more replies)
  0 siblings, 3 replies; 81+ messages in thread
From: martin rudalics @ 2006-02-25 20:18 UTC (permalink / raw)
  Cc: emacs-devel

Andreas Schwab wrote:
 > This is incorrect.  ']' is only special in a bracket expression (like
 > '-').

`]' is _also_ special in a character alternative, like `^'.  `-' is
special _only_ in a character alternative.


* Re: Unquoted special characters in regexps
  2006-02-25 20:18       ` martin rudalics
@ 2006-02-25 22:09         ` Andreas Schwab
  2006-02-26 11:32           ` martin rudalics
  2006-02-25 22:13         ` Luc Teirlinck
  2006-02-25 22:34         ` Luc Teirlinck
  2 siblings, 1 reply; 81+ messages in thread
From: Andreas Schwab @ 2006-02-25 22:09 UTC (permalink / raw)
  Cc: emacs-devel

martin rudalics <rudalics@gmx.at> writes:

> Andreas Schwab schrieb:
>> This is incorrect.  ']' is only special in a bracket expression (like
>> '-').
>
> `]' is _also_ special in a character alternative,

A bracket expression has a completely different set of special characters.
For example, '\' and '$' are not special there.
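
A small sketch of that point, assuming a stock Emacs; the sample
strings are made up for illustration.

```elisp
;; Inside a character alternative, `\' and `$' are ordinary characters.
(string-match "[\\$]" "$")    ; => 0, the set contains `\' and `$'
(string-match "[\\$]" "\\")   ; => 0, a lone backslash matches too

;; `-' denotes a range between two characters; placed last it is literal.
(string-match "[a-c]" "b")    ; => 0
(string-match "[ac-]" "-")    ; => 0
(string-match "[ac-]" "b")    ; => nil
```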

> `-' is special _only_ in a character alternative.

Just like ']'.

Andreas.



* Re: Unquoted special characters in regexps
  2006-02-25 20:18       ` martin rudalics
  2006-02-25 22:09         ` Andreas Schwab
@ 2006-02-25 22:13         ` Luc Teirlinck
  2006-02-26 13:13           ` martin rudalics
  2006-02-25 22:34         ` Luc Teirlinck
  2 siblings, 1 reply; 81+ messages in thread
From: Luc Teirlinck @ 2006-02-25 22:13 UTC (permalink / raw)
  Cc: schwab, emacs-devel

Martin Rudalics wrote:

   `]' is _also_ special in a character alternative, like `^'.  `-' is
   special _only_ in a character alternative.

I may be overlooking something, but _which_ special meaning does `]'
have outside of character alternatives?

You are using \\] to quote `]'.  Could that possibly clear up any
confusion or does it just add confusion?  I personally believe the
latter.  Is there a situation where \\ can be used to prevent `]' from
having a special meaning?  In "[a\\]b]", the first `]' still has a
special meaning, even though there might be some optical illusion
making it look "quoted", the second `]' has no special meaning.

The only way I know of to put a "quoted" `]' in a character
alternative is to write it immediately after the `[' or "[^".

But "quoting" `]' by writing "[]]" instead of just `]', seems
contorted, even though it would be less confusing than "\\]".

I believe that a `]' should not be quoted at all if it is outside a
character alternative, where it has no special meaning, unless I am
overlooking something.  (Just tell what, in that case.)

ELISP> (string-match "[a\\]b]" "]")
nil
ELISP> (string-match "[a\\]b]" "\\b")
nil
ELISP> (string-match "[a\\]b]" "\\b]")
0
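
For comparison, a short sketch of the placement rule mentioned above
(a `]' written immediately after `[' or "[^" is a member of the set).
The sample strings are assumptions for illustration.

```elisp
;; `]' directly after `[' is a set member, not the closing bracket.
(string-match "[]a]" "]")      ; => 0
(string-match "[]a]" "a")      ; => 0

;; The same holds after "[^": this set excludes `]' and `a'.
(string-match "[^]a]" "]a b")  ; => 2, the first character that is neither
```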


* Re: Unquoted special characters in regexps
  2006-02-25 20:18       ` martin rudalics
  2006-02-25 22:09         ` Andreas Schwab
  2006-02-25 22:13         ` Luc Teirlinck
@ 2006-02-25 22:34         ` Luc Teirlinck
  2006-02-25 22:59           ` Andreas Schwab
  2006-02-26 13:20           ` martin rudalics
  2 siblings, 2 replies; 81+ messages in thread
From: Luc Teirlinck @ 2006-02-25 22:34 UTC (permalink / raw)
  Cc: schwab, emacs-devel

From my previous message:

    But "quoting" `]' by writing "[]]" instead of just `]', seems
    contorted, even though it would be less confusing than "\\]".

But if you *really* want to emphasize that the `]' does not close a
character alternative, "[]]" is the only way I know to do that.
"\\]" does not do the job.

Sincerely,

Luc.


* Re: Unquoted special characters in regexps
  2006-02-25 22:34         ` Luc Teirlinck
@ 2006-02-25 22:59           ` Andreas Schwab
  2006-02-26 13:20           ` martin rudalics
  1 sibling, 0 replies; 81+ messages in thread
From: Andreas Schwab @ 2006-02-25 22:59 UTC (permalink / raw)
  Cc: rudalics, emacs-devel

Luc Teirlinck <teirllm@dms.auburn.edu> writes:

> From my previous message:
>
>     But "quoting" `]' by writing "[]]" instead of just `]', seems
>     contorted, even though it would be less confusing than "\\]".
>
> But if you *really* want to emphasize that the `]' does not close a
> character alternative, "[]]" is the only way I know to do that.
> "\\]" does not do the job.

JFTR, in AWK regexps the backslash retains its special meaning inside
character lists.

Andreas.



* Re: Unquoted special characters in regexps
  2006-02-25 22:09         ` Andreas Schwab
@ 2006-02-26 11:32           ` martin rudalics
  2006-02-26 11:50             ` Andreas Schwab
  0 siblings, 1 reply; 81+ messages in thread
From: martin rudalics @ 2006-02-26 11:32 UTC (permalink / raw)
  Cc: emacs-devel

 >>>This is incorrect.  ']' is only special in a bracket expression (like
 >>>'-').
 >>
 >>`]' is _also_ special in a character alternative,
 >
 >
 > A bracket expression has a completely different set of special characters.
 > For example, '\' and '$' are not special there.
 >
 >
 >>`-' is special _only_ in a character alternative.
 >
 >
 > Just like ']'.
 >
 > Andreas.
 >

We would have to agree on the semantics of the term "special" first. In
Elisp descriptions this term is overloaded.  Take, for example, the
following excerpt from the Elisp tutorial:

    Indeed more than one such mark or brace may precede the space.  These
    require a expression that looks like this:

    	     []\"')}]*

       In this expression, the first `]' is the first character in the
    expression; the second character is `"', which is preceded by a `\'
    to tell Emacs the `"' is _not_ special.  The last three characters
    are `'', `)', and `}'.

It's confusing because we know from the Elisp manual that

    The special characters are `.', `*', `+', `?', `[', `]', `^', `$',
    and `\'; no new special characters will be defined in the future.

hence a double-quote is never "special" in terms of regexp semantics.
Why should we have to tell Emacs that it is "_not_ special" then?  The
answer is, obviously, that the Elisp read syntax for regexps is the
string data type and the tutorial's "special" indeed refers to string
semantics.  Hence when you say that "'\' and '$' are not special there"
you probably don't mean the special semantics of the backslash within
strings.
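
The two layers can be seen separately in a short sketch, assuming a
stock Emacs; the sample strings being searched are made up.

```elisp
;; String layer: the Lisp reader turns "\\\\" into two backslash characters.
(length "\\\\")                         ; => 2
;; Regexp layer: those two characters form a regexp matching one backslash.
(string-match "\\\\" "a\\b")            ; => 1

;; The tutorial's regexp as a Lisp string: `\"' is one character, `"'.
(length "[]\"')}]*")                    ; => 8
(string-match "[]\"')}]*" "\")} end")   ; => 0
(match-end 0)                           ; => 3, it consumed `"', `)' and `}'
```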

Now let's agree on the term "there".  Reasonably, "there" is the
sequence of characters obtained after stripping both the opening bracket
_and_ the closing bracket of a character alternative.  Otherwise, the
sentence from the Elisp manual "To include a `]' in a character
alternative, you must make it the first character." wouldn't make sense.
`]' is special inside a character alternative because it may appear in
one and only one position - namely the first.  And the semantics of the
`]' in the first position is "match one `]'".  The semantics of an `]'
closing a character alternative is completely different from that.

From an operational point of view - that of the Elisp interpreter - you
_can_ say that `]' is not special in regexps.  If that's the preferred
point of view it's sufficient to remove `]' from the list of special
characters in the respective manuals and treat it like `-' as you
propose.  And, you wouldn't have to mention "poor practice" in the Elisp
manual at all - anything the Elisp interpreter interprets as intended by
the programmer would be valid.

There exists, however, a functional subset of Elisp (Elisp without
setqs, iterators, ...) amenable to mathematical reasoning like proving
correctness or validity of your code.  And mathematicians don't like to
reason about malformed constructs like "(a + 3" or "a + 3)".  They
prefer something like "a + 3" or "(a + 3)" instead.  They do rely on the
special semantics of "(" _and_ ")" within an expression.  Hence, saying
that `[' is special and `]' not would be tantamount to removing regexps
from the subset of Elisp amenable to mathematical reasoning.


* Re: Unquoted special characters in regexps
  2006-02-26 11:32           ` martin rudalics
@ 2006-02-26 11:50             ` Andreas Schwab
  2006-02-26 13:28               ` martin rudalics
  0 siblings, 1 reply; 81+ messages in thread
From: Andreas Schwab @ 2006-02-26 11:50 UTC (permalink / raw)
  Cc: emacs-devel

martin rudalics <rudalics@gmx.at> writes:

> The answer is, obviously, that the Elisp read syntax for regexps is the

There is no such thing as a read syntax for regexps.  This is not a Lisp
data type.

Andreas.



* Re: Unquoted special characters in regexps
  2006-02-25 22:13         ` Luc Teirlinck
@ 2006-02-26 13:13           ` martin rudalics
  2006-02-26 13:50             ` Andreas Schwab
  0 siblings, 1 reply; 81+ messages in thread
From: martin rudalics @ 2006-02-26 13:13 UTC (permalink / raw)
  Cc: schwab, emacs-devel

 > Martin Rudalics wrote:
 >
 >    `]' is _also_ special in a character alternative, like `^'.  `-' is
 >    special _only_ in a character alternative.
 >
 > I may be overlooking something, but _which_ special meaning does `]'
 > have outside of character alternatives?

That of closing a character alternative.  When you write

(defvar foo "]")
(defvar bar "\\]")

you can't interchangeably use `foo' and `bar' in an arbitrary regular
expression.  Some people call this "referential transparency".
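
A sketch of how the two values behave when spliced into larger
regexps, assuming a stock Emacs. The surrounding fragments ("x" and
"[a") are assumptions chosen for illustration.

```elisp
(defvar foo "]")
(defvar bar "\\]")

;; Spliced in outside a character alternative, both match a literal `]'.
(string-match (concat "x" foo) "x]")    ; => 0
(string-match (concat "x" bar) "x]")    ; => 0

;; Spliced in after an open "[a", the results differ:
;; "[a]" is the set {a}, while "[a\]" is the set {a, \}.
(string-match (concat "[a" foo) "\\")   ; => nil
(string-match (concat "[a" bar) "\\")   ; => 0
```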

 > You are using \\] to quote `]'.  Could that possibly clear up any
 > confusion or does it just add confusion?  I personally believe the
 > latter.  Is there a situation where \\ can be used to prevent `]' from
 > having a special meaning?  In "[a\\]b]", the first `]' still has a
 > special meaning, even though there might be some optical illusion
 > making it look "quoted", the second `]' has no special meaning.

According to the Elisp manual "[a\\]b]" is poor practice.  You probably
mean "[a\\]b\\]" here.  Anyway, the first `]' has a special meaning but
it's not "inside" the character alternative.  It does have a special
meaning because `]' is special _outside_ character alternatives.

 > I believe that a `]' should not be quoted at all if it is outside a
 > character alternative, where it has no special meaning, unless I am
 > overlooking something.  (Just tell what, in that case.)

That of terminating a character alternative.

 >
 > ELISP> (string-match "[a\\]b]" "]")
 > nil
 > ELISP> (string-match "[a\\]b]" "\\b")
 > nil
 > ELISP> (string-match "[a\\]b]" "\\b]")
 > 0

According to the Elisp manual all these exhibit "poor practice" since
you didn't quote the second `]'s.  You should have complained about that
when you read the manual.


* Re: Unquoted special characters in regexps
  2006-02-25 22:34         ` Luc Teirlinck
  2006-02-25 22:59           ` Andreas Schwab
@ 2006-02-26 13:20           ` martin rudalics
  2006-02-26 16:53             ` Luc Teirlinck
  2006-02-26 17:19             ` Luc Teirlinck
  1 sibling, 2 replies; 81+ messages in thread
From: martin rudalics @ 2006-02-26 13:20 UTC (permalink / raw)
  Cc: schwab, emacs-devel

 > From my previous message:
 >
 >     But "quoting" `]' by writing "[]]" instead of just `]', seems
 >     contorted, even though it would be less confusing than "\\]".
 >
 > But if you *really* want to emphasize that the `]' does not close a
 > character alternative, "[]]" is the only way I know to do that.
 > "\\]" does not do the job.

I didn't care about `]'s closing a character alternative.  I did care
about `]'s appearing in regular expressions _outside_ a character
alternative and meant to match the character `]'.  Such `]'s should be
quoted just like `.', `*', `+', `?', `[', `^', `$', and `\'.


* Re: Unquoted special characters in regexps
  2006-02-26 11:50             ` Andreas Schwab
@ 2006-02-26 13:28               ` martin rudalics
  0 siblings, 0 replies; 81+ messages in thread
From: martin rudalics @ 2006-02-26 13:28 UTC (permalink / raw)
  Cc: emacs-devel

>>The answer is, obviously, that the Elisp read syntax for regexps is the
> 
> 
> There is no such thing as a read syntax for regexps.  This is not a Lisp
> data type.
> 
> Andreas.
> 

From the Elisp manual:

      Therefore, the read syntax for a regular expression matching `\'
      is `"\\\\"'.

Please send your complaints to the writer of that sentence.


* Re: Unquoted special characters in regexps
  2006-02-26 13:13           ` martin rudalics
@ 2006-02-26 13:50             ` Andreas Schwab
  2006-02-26 16:41               ` Luc Teirlinck
  2006-02-26 17:10               ` martin rudalics
  0 siblings, 2 replies; 81+ messages in thread
From: Andreas Schwab @ 2006-02-26 13:50 UTC (permalink / raw)
  Cc: Luc Teirlinck, emacs-devel

martin rudalics <rudalics@gmx.at> writes:

> That of closing a character alternative.  When you write
>
> (defvar foo "]")
> (defvar bar "\\]")
>
> you can't interchangeably use `foo' and `bar' in an arbitrary regular
> expression.  Some people call this "referential transparency".

Of course you can't, since the meaning of '\' is context dependent.

> Anyway, the first `]' has a special meaning but it's not "inside" the
> character alternative.

It is part of it, just like the leading '['.

> It does have a special meaning because `]' is special _outside_
> character alternatives.

This is wrong.  Outside of a character set ']' has no special meaning
whatsoever, independent of the context.

> According to the Elisp manual all these exhibit "poor practice" since
> you didn't quote the second `]'s.

It's a bug in the manual.

Andreas.



* Re: Unquoted special characters in regexps
  2006-02-26 13:50             ` Andreas Schwab
@ 2006-02-26 16:41               ` Luc Teirlinck
  2006-02-26 17:53                 ` martin rudalics
  2006-02-26 17:10               ` martin rudalics
  1 sibling, 1 reply; 81+ messages in thread
From: Luc Teirlinck @ 2006-02-26 16:41 UTC (permalink / raw)
  Cc: rudalics, emacs-devel

Andreas Schwab wrote:

   > According to the Elisp manual all these exhibit "poor practice" since
   > you didn't quote the second `]'s.

   It's a bug in the manual.

I propose the following patch to lispref/searching.texi, which I can
install if desired.  I will wait till more people, in particular
Richard, have had an opportunity to see it.

Note that the current version already clearly states elsewhere that
`]' is special _inside_ character alternatives:

     Note that the usual regexp special characters are not special
     inside a character alternative.  A completely different set of
     characters is special inside character alternatives: `]', `-' and
     `^'.

Apart from correcting the bug we are discussing, it also corrects
another misstatement:

    For example, a string with unbalanced square brackets is invalid
    (with a few exceptions, such as `[]]'),

That is incorrect as the examples below show.

ELISP> (string-match "]]]]" "]]]]")
0
ELISP> (string-match "[[]" "[")
0

One correct way to restate it would be that a string whose square
brackets _with special meaning in the context in which they are used_
do not balance is invalid.  This would be (unless I overlook
something) without exceptions: in `[]]' the square brackets with
special meaning do balance.  In the patch below I formulated it
differently.

None of my previous mails to emacs-{devel,pretest-bug} in the last few
days have appeared on the list, so I wonder whether this one will.

===File ~/searching.texi-diff===============================
*** searching.texi	06 Feb 2006 16:02:08 -0600	1.68
--- searching.texi	26 Feb 2006 10:25:06 -0600	
***************
*** 237,243 ****
  special constructs and the rest are @dfn{ordinary}.  An ordinary
  character is a simple regular expression that matches that character and
  nothing else.  The special characters are @samp{.}, @samp{*}, @samp{+},
! @samp{?}, @samp{[}, @samp{]}, @samp{^}, @samp{$}, and @samp{\}; no new
  special characters will be defined in the future.  Any other character
  appearing in a regular expression is ordinary, unless a @samp{\}
  precedes it.
--- 237,243 ----
  special constructs and the rest are @dfn{ordinary}.  An ordinary
  character is a simple regular expression that matches that character and
  nothing else.  The special characters are @samp{.}, @samp{*}, @samp{+},
! @samp{?}, @samp{[}, @samp{^}, @samp{$}, and @samp{\}; no new
  special characters will be defined in the future.  Any other character
  appearing in a regular expression is ordinary, unless a @samp{\}
  precedes it.
***************
*** 740,747 ****
  
  @kindex invalid-regexp
    Not every string is a valid regular expression.  For example, a string
! with unbalanced square brackets is invalid (with a few exceptions, such
! as @samp{[]]}), and so is a string that ends with a single @samp{\}.  If
  an invalid regular expression is passed to any of the search functions,
  an @code{invalid-regexp} error is signaled.
  
--- 740,747 ----
  
  @kindex invalid-regexp
    Not every string is a valid regular expression.  For example, a string
! that ends inside a character alternative without terminating @samp{]}
! is invalid, and so is a string that ends with a single @samp{\}.  If
  an invalid regular expression is passed to any of the search functions,
  an @code{invalid-regexp} error is signaled.
  
============================================================
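
A minimal sketch for trying the two invalid cases named in the
proposed wording, assuming a stock Emacs. The helper name
`my-regexp-error' is hypothetical and not part of the patch.

```elisp
(defun my-regexp-error (regexp)
  "Return the `invalid-regexp' error data for REGEXP, or nil if none is signaled.
Hypothetical helper: it matches REGEXP against the empty string so
that the regexp is compiled, and catches only `invalid-regexp'."
  (condition-case err
      (progn (string-match regexp "") nil)
    (invalid-regexp err)))

(my-regexp-error "[a")      ; non-nil: ends inside a character alternative
(my-regexp-error "foo\\")   ; non-nil: ends with a single `\'
(my-regexp-error "[]]")     ; => nil
(my-regexp-error "]]]]")    ; => nil
```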


* Re: Unquoted special characters in regexps
  2006-02-26 13:20           ` martin rudalics
@ 2006-02-26 16:53             ` Luc Teirlinck
  2006-02-26 18:01               ` martin rudalics
  2006-02-26 17:19             ` Luc Teirlinck
  1 sibling, 1 reply; 81+ messages in thread
From: Luc Teirlinck @ 2006-02-26 16:53 UTC (permalink / raw)
  Cc: schwab, emacs-devel

Martin Rudalics wrote:

   Such `]'s should be quoted just like `.', `*', `+', `?', `[', `^',
   `$', and `\'.

Even if that _were_ true, `]' should _not_ be quoted as "\\]" but as
"[]]", as I pointed out earlier.  We are talking about Elisp, not AWK.

   According to the Elisp manual all these exhibit "poor practice" since
   you didn't quote the second `]'s.  You should have complained about that
   when you read the manual.

Reading through a large body of text, it is easy to miss a small and
non-obvious detail like this.  I am complaining about it now and, in
a separate message, proposed a patch.

Sincerely,

Luc.


* Re: Unquoted special characters in regexps
  2006-02-26 13:50             ` Andreas Schwab
  2006-02-26 16:41               ` Luc Teirlinck
@ 2006-02-26 17:10               ` martin rudalics
  2006-02-26 17:42                 ` Luc Teirlinck
  2006-02-26 17:56                 ` Andreas Schwab
  1 sibling, 2 replies; 81+ messages in thread
From: martin rudalics @ 2006-02-26 17:10 UTC (permalink / raw)
  Cc: Luc Teirlinck, emacs-devel

 >>That of closing a character alternative.  When you write
 >>
 >>(defvar foo "]")
 >>(defvar bar "\\]")
 >>
 >>you can't interchangeably use `foo' and `bar' in an arbitrary regular
 >>expression.  Some people call this "referential transparency".
 >
 >
 > Of course you can't, since the meaning of '\' is context dependent.

When you say that outside a character alternative `]' and `\\]' have the
same meaning you abandon the principle of referential transparency.

 >>Anyway, the first `]' has a special meaning but it's not "inside" the
 >>character alternative.
 >
 >
 > It is part of it, just like the leading '['.

As a consequence, in your model "[" is a valid regexp too.

 >>It does have a special meaning because `]' is special _outside_
 >>character alternatives.
 >
 >
 > This is wrong.  Outside of a character set ']' has no special meaning
 > whatsoever, independent of the context.

On a similar footing you can say that `*' has no special meaning unless
it's preceded by a character.  Hence "*foo" is valid too in your model.
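
For reference, a sketch of how a stock Emacs treats the two regexps
mentioned here; the sample strings are assumptions for illustration.

```elisp
;; A leading `*' has nothing to repeat and is treated as an ordinary character.
(string-match "*foo" "a*foo")           ; => 1

;; A lone `[' opens a character alternative that is never closed.
(condition-case nil
    (string-match "[" "[")
  (invalid-regexp 'invalid))            ; => invalid
```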

 >>According to the Elisp manual all these exhibit "poor practice" since
 >>you didn't quote the second `]'s.
 >
 >
 > It's a bug in the manual.

Please fix it.


* Re: Unquoted special characters in regexps
  2006-02-26 13:20           ` martin rudalics
  2006-02-26 16:53             ` Luc Teirlinck
@ 2006-02-26 17:19             ` Luc Teirlinck
  2006-02-26 18:13               ` martin rudalics
  1 sibling, 1 reply; 81+ messages in thread
From: Luc Teirlinck @ 2006-02-26 17:19 UTC (permalink / raw)
  Cc: schwab, emacs-devel

Martin Rudalics wrote:

   I didn't care about `]'s closing a character alternative.  I did care
   about `]'s appearing in regular expressions _outside_ a character
   alternative and meant to match the character `]'.  Such `]'s should be
   quoted just like `.', `*', `+', `?', `[', `^', `$', and `\'.

To expand on my previous reply, you are arguing from a purely
legalistic viewpoint here, based on an error in the manual.  The
_purpose_ of quoting special characters even when they have no special
meaning is to make clear that they have no special meaning.  "\\]" does
not clear up any confusion in situations where confusion could be
possible, so it is completely meaningless.

There is a mistake in the manual.  We should correct that mistake, not
make meaningless, and even confusing, changes to code.

Sincerely,

Luc.


* Re: Unquoted special characters in regexps
  2006-02-26 17:10               ` martin rudalics
@ 2006-02-26 17:42                 ` Luc Teirlinck
  2006-02-26 19:06                   ` martin rudalics
  2006-02-26 17:56                 ` Andreas Schwab
  1 sibling, 1 reply; 81+ messages in thread
From: Luc Teirlinck @ 2006-02-26 17:42 UTC (permalink / raw)
  Cc: schwab, emacs-devel

Martin Rudalics wrote:

    >>Anyway, the first `]' has a special meaning but it's not "inside" the
    >>character alternative.
    >
    >
    > It is part of it, just like the leading '['.

   As a consequence, in your model "[" is a valid regexp too.

What matters is the following.  If you type `[' it has a special
meaning if you type it _outside_ the _context_ of a character
alternative.  Its special meaning there is that it starts that
context.  Inside that special context, `[' has no special meaning.

On the other hand, if you type `]' it has no special meaning _unless_
you are in the _context_ of a character alternative.  Its special
meaning in that context is that it ends that context.

So `[' is special outside the context of a character alternative, but
not inside it, `]' is special inside that context, but not outside.

    > It's a bug in the manual.

   Please fix it.

I plan to do that.  I just want to wait to make sure that Richard agrees.

Sincerely,

Luc.


* Re: Unquoted special characters in regexps
  2006-02-26 16:41               ` Luc Teirlinck
@ 2006-02-26 17:53                 ` martin rudalics
  2006-02-26 18:22                   ` Luc Teirlinck
  0 siblings, 1 reply; 81+ messages in thread
From: martin rudalics @ 2006-02-26 17:53 UTC (permalink / raw)
  Cc: schwab, emacs-devel

 > Apart from correcting the bug we are discussing, it also corrects
 > another misstatement:
 >
 >     For example, a string with unbalanced square brackets is invalid
 >     (with a few exceptions, such as `[]]'),
 >
 > That is incorrect as the examples below show.
 >
 > ELISP> (string-match "]]]]" "]]]]")
 > 0
 > ELISP> (string-match "[[]" "[")
 > 0

Your example doesn't show that.  You reason about validity in presence
of a sentence like

       *Please note:* For historical compatibility, special characters
    are treated as ordinary ones if they are in contexts where their
    special meanings make no sense.


* Re: Unquoted special characters in regexps
  2006-02-26 17:10               ` martin rudalics
  2006-02-26 17:42                 ` Luc Teirlinck
@ 2006-02-26 17:56                 ` Andreas Schwab
  2006-02-26 19:08                   ` martin rudalics
  1 sibling, 1 reply; 81+ messages in thread
From: Andreas Schwab @ 2006-02-26 17:56 UTC (permalink / raw)
  Cc: Luc Teirlinck, emacs-devel

martin rudalics <rudalics@gmx.at> writes:

>> It is part of it, just like the leading '['.
>
> As a consequence, in your model "[" is a valid regexp too.

Where did I write that?  Please expand.

>> This is wrong.  Outside of a character set ']' has no special meaning
>> whatsoever, independent of the context.
>
> On a similar footing you can say that `*' has no special meaning unless
> it's preceded by a character.

Why do you think I would say that?  Please expand.

> Please fix it.

You are free to contribute patches.

Andreas.

-- 
Andreas Schwab, SuSE Labs, schwab@suse.de
SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
PGP key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."


* Re: Unquoted special characters in regexps
  2006-02-26 16:53             ` Luc Teirlinck
@ 2006-02-26 18:01               ` martin rudalics
  0 siblings, 0 replies; 81+ messages in thread
From: martin rudalics @ 2006-02-26 18:01 UTC (permalink / raw)
  Cc: schwab, emacs-devel

 > Martin Rudalics wrote:
 >
 >    Such `]'s should be quoted just like `.', `*', `+', `?', `[', `^',
 >    `$', and `\'.
 >
 > Even if that _would_ be true, `]' should _not_ be quoted as "\\]" but as
 > "[]]", as I pointed out earlier.  We are talking about Elisp, not AWK.

The Elisp manual clearly states that "Any other character appearing in a
regular expression is ordinary, unless a `\' precedes it."  Hence,
quoting with a backslash is the canonical method in Elisp.

martin, who doesn't talk AWK.


* Re: Unquoted special characters in regexps
  2006-02-26 17:19             ` Luc Teirlinck
@ 2006-02-26 18:13               ` martin rudalics
  0 siblings, 0 replies; 81+ messages in thread
From: martin rudalics @ 2006-02-26 18:13 UTC (permalink / raw)
  Cc: schwab, emacs-devel

 > To expand on my previous reply, you are arguing from a purely
 > legalistic viewpoint here, based on an error in the manual.  The
 > _purpose_ of quoting special characters even when they have no special
 > meaning is to make clear that they have no special meaning.  "\\]" does
 > not clear up any confusion in situations where confusion could be
 > possible, so it is completely meaningless.
 >
 > There is a mistake in the manual.  We should correct that mistake, not
 > make meaningless, and even confusing, changes to code.

Do you really think that considering

(string-match "]" "]")

a valid Elisp expression and

(string-match "[" "[")

an invalid one is less confusing?


* Re: Unquoted special characters in regexps
  2006-02-26 17:53                 ` martin rudalics
@ 2006-02-26 18:22                   ` Luc Teirlinck
  2006-02-26 19:26                     ` martin rudalics
  0 siblings, 1 reply; 81+ messages in thread
From: Luc Teirlinck @ 2006-02-26 18:22 UTC (permalink / raw)
  Cc: schwab, emacs-devel

Martin Rudalics wrote:

    > Apart from correcting the bug we are discussing, it also corrects
    > another misstatement:
    >
    >     For example, a string with unbalanced square brackets is invalid
    >     (with a few exceptions, such as `[]]'),
    >
    > That is incorrect as the examples below show.
    >
    > ELISP> (string-match "]]]]" "]]]]")
    > 0
    > ELISP> (string-match "[[]" "[")
    > 0

   Your example doesn't show that.  You reason about validity in presence
   of a sentence like

	  *Please note:* For historical compatibility, special characters
       are treated as ordinary ones if they are in contexts where their
       special meanings make no sense.

You are confusing validity with meeting stylistic guidelines.  The
fact that string-match returns 0 instead of throwing an error shows
that the regexps are valid.  "*", "." and so on are valid regexps too,
even though they violate stylistic guidelines. "[" is _not_ a valid
regexp.

You can change stylistic guidelines by making doc changes.  You can
only change what is valid by making code changes.
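
One way to probe validity in this sense, as a sketch only: the helper
name `my-regexp-valid-p' is hypothetical and not part of Emacs.  It
relies on the `invalid-regexp' error that the manual says the search
functions signal.

```elisp
(defun my-regexp-valid-p (regexp)
  "Return t if REGEXP compiles, nil if it signals `invalid-regexp'.
Hypothetical helper, named here for illustration only."
  (condition-case nil
      (progn (string-match regexp "") t)
    (invalid-regexp nil)))

(my-regexp-valid-p "]")    ; => t
(my-regexp-valid-p "*")    ; => t, a leading `*' is taken literally
(my-regexp-valid-p "[")    ; => nil
(my-regexp-valid-p "\\")   ; => nil, a trailing backslash is rejected too
```

The helper only reports whether the regexp compiler accepts the string;
it says nothing about whether the regexp follows the stylistic advice
in the manual.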

Sincerely,

Luc.


* Re: Unquoted special characters in regexps
  2006-02-26 17:42                 ` Luc Teirlinck
@ 2006-02-26 19:06                   ` martin rudalics
  0 siblings, 0 replies; 81+ messages in thread
From: martin rudalics @ 2006-02-26 19:06 UTC (permalink / raw)
  Cc: schwab, emacs-devel

 > What matters is the following.  If you type `[' it has a special
 > meaning if you type it _outside_ the _context_ of a character
 > alternative.  Its special meaning there is that it starts that
 > context.  Inside that special context, `[' has no special meaning.
 >
 > On the other hand, if you type `]' it has no special meaning _unless_
 > you are in the _context_ of a character alternative.  Its special
 > meaning in that context is that it ends that context.
 >
 > So `[' is special outside the context of a character alternative, but
 > not inside it, `]' is special inside that context, but not outside.

In mathematics `(3 + 4' is a silly expression just like `3 + 4)'.  In
Lisp `(+ 3 4' is invalid just like `+ 3 4)'.  A regular expression
interpretation machine shouldn't handle expressions differently.

By the way try to evaluate

(regexp-opt (list "foo]" "bar]"))

You want to patch `regexp-opt.el' too?


* Re: Unquoted special characters in regexps
  2006-02-26 17:56                 ` Andreas Schwab
@ 2006-02-26 19:08                   ` martin rudalics
  2006-02-27 19:03                     ` Richard Stallman
  0 siblings, 1 reply; 81+ messages in thread
From: martin rudalics @ 2006-02-26 19:08 UTC (permalink / raw)
  Cc: Luc Teirlinck, emacs-devel

 >>>It is part of it, just like the leading '['.
 >>
 >>As a consequence, in your model "[" is a valid regexp too.
 >
 >
 > Where did I write that?  Please expand.

You argue that `]' is part of a character alternative just like the
leading `['.  You further argue that "]" is a valid regular expression
outside a character alternative.  Hence, "[" must be a valid regular
expression outside a character alternative too.  Qed.

 > You are free to contribute patches.

I certainly don't want to.  In my opinion the manual is correct here.


* Re: Unquoted special characters in regexps
  2006-02-26 18:22                   ` Luc Teirlinck
@ 2006-02-26 19:26                     ` martin rudalics
  0 siblings, 0 replies; 81+ messages in thread
From: martin rudalics @ 2006-02-26 19:26 UTC (permalink / raw)
  Cc: schwab, emacs-devel

 > You are confusing validity with meeting stylistic guidelines.  The
 > fact that string-match returns 0 instead of throwing an error shows
 > that the regexps are valid.  "*", "." and so on are valid regexps too,
 > even though they violate stylistic guidelines. "[" is _not_ a valid
 > regexp.
 >
 > You can change stylistic guidelines by making doc changes.  You can
 > only change what is valid by making code changes.

A stylistic guideline can tell me to write comments or documentation
strings in a particular way in order to improve their readability.  Even
if I don't follow the guidelines I'm confident that a future Elisp
interpreter will still execute my program correctly.

Validity of a regular expression is something more serious though.  If
someone decides one day that "*" is no longer a regular expression the
interpreter should accept, I have to rewrite my program.  And I couldn't
possibly complain - I've been warned.

You are confusing "validity" based on the implementation of a particular
interpreter with validity based on mathematical reasoning.


* Re: Unquoted special characters in regexps
  2006-02-26 19:08                   ` martin rudalics
@ 2006-02-27 19:03                     ` Richard Stallman
  2006-02-27 19:36                       ` Andreas Schwab
                                         ` (3 more replies)
  0 siblings, 4 replies; 81+ messages in thread
From: Richard Stallman @ 2006-02-27 19:03 UTC (permalink / raw)
  Cc: schwab, teirllm, emacs-devel

      You further argue that "]" is a valid regular expression
    outside a character alternative.

Strictly speaking, that is true, it is a valid regular expression.

      Hence, "[" must be a valid regular
    expression outside a character alternative too.

That doesn't follow.  Strictly speaking, "[" is not a valid regular
expression.

However, that doesn't necessarily mean the manual is wrong.
There is more than one way to understand the word "special".
At the most literal level, ] is not special; if you write it
without \\, the regexp compiler won't misunderstand it.
However, it does play a special role in the syntax of regexps,
and it is not necessarily a bad thing for users to think of it
as a special character.


* Re: Unquoted special characters in regexps
  2006-02-27 19:03                     ` Richard Stallman
@ 2006-02-27 19:36                       ` Andreas Schwab
  2006-02-27 20:03                         ` martin rudalics
  2006-02-28  0:30                       ` Luc Teirlinck
                                         ` (2 subsequent siblings)
  3 siblings, 1 reply; 81+ messages in thread
From: Andreas Schwab @ 2006-02-27 19:36 UTC (permalink / raw)
  Cc: martin rudalics, teirllm, emacs-devel

Richard Stallman <rms@gnu.org> writes:

> However, it does play a special role in the syntax of regexps,
> and it is not necessarily a bad thing for users to think of it
> as a special character.

There are more characters like this, ie. ':' and '-', which both play a
special role in character sets, but not elsewhere.

Andreas.

-- 
Andreas Schwab, SuSE Labs, schwab@suse.de
SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
PGP key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."


* Re: Unquoted special characters in regexps
  2006-02-27 19:36                       ` Andreas Schwab
@ 2006-02-27 20:03                         ` martin rudalics
  2006-02-27 20:32                           ` Andreas Schwab
  0 siblings, 1 reply; 81+ messages in thread
From: martin rudalics @ 2006-02-27 20:03 UTC (permalink / raw)
  Cc: teirllm, rms, emacs-devel

>>However, it does play a special role in the syntax of regexps,
>>and it is not necessarily a bad thing for users to think of it
>>as a special character.
> 
> 
> There are more characters like this, ie. ':' and '-', which both play a
> special role in character sets, but not elsewhere.

My Emacs has

(regexp-quote "]") => "\\]"
(regexp-quote "-") => "-"
(regexp-quote ":") => ":"

I do like my Emacs.


* Re: Unquoted special characters in regexps
  2006-02-27 20:03                         ` martin rudalics
@ 2006-02-27 20:32                           ` Andreas Schwab
  2006-02-27 21:43                             ` martin rudalics
  0 siblings, 1 reply; 81+ messages in thread
From: Andreas Schwab @ 2006-02-27 20:32 UTC (permalink / raw)
  Cc: teirllm, rms, emacs-devel

martin rudalics <rudalics@gmx.at> writes:

>>>However, it does play a special role in the syntax of regexps,
>>>and it is not necessarily a bad thing for users to think of it
>>>as a special character.
>> There are more characters like this, ie. ':' and '-', which both play a
>> special role in character sets, but not elsewhere.
>
> My Emacs has
>
> (regexp-quote "]") => "\\]"
> (regexp-quote "-") => "-"
> (regexp-quote ":") => ":"
>
> I do like my Emacs.

Bugs can be fixed.

Andreas.

-- 
Andreas Schwab, SuSE Labs, schwab@suse.de
SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
PGP key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."


* Re: Unquoted special characters in regexps
  2006-02-27 20:32                           ` Andreas Schwab
@ 2006-02-27 21:43                             ` martin rudalics
  2006-02-27 22:11                               ` Andreas Schwab
  0 siblings, 1 reply; 81+ messages in thread
From: martin rudalics @ 2006-02-27 21:43 UTC (permalink / raw)
  Cc: teirllm, rms, emacs-devel

> Bugs can be fixed.

That was my intention.  I'd fix some ten bugs (counting cc-awk.el).
You have to fix a bit more.  Good luck then.


* Re: Unquoted special characters in regexps
  2006-02-27 21:43                             ` martin rudalics
@ 2006-02-27 22:11                               ` Andreas Schwab
  2006-02-28  6:19                                 ` Richard Stallman
  2006-02-28 10:28                                 ` martin rudalics
  0 siblings, 2 replies; 81+ messages in thread
From: Andreas Schwab @ 2006-02-27 22:11 UTC (permalink / raw)
  Cc: teirllm, rms, emacs-devel

martin rudalics <rudalics@gmx.at> writes:

> That was my intention.  I'd fix some ten bugs (counting cc-awk.el).
> You have to fix a bit more.  Good luck then.

You are not to decide who is fixing bugs.

Andreas.

-- 
Andreas Schwab, SuSE Labs, schwab@suse.de
SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
PGP key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."


* Re: Unquoted special characters in regexps
  2006-02-27 19:03                     ` Richard Stallman
  2006-02-27 19:36                       ` Andreas Schwab
@ 2006-02-28  0:30                       ` Luc Teirlinck
  2006-02-28 10:27                         ` martin rudalics
  2006-03-01 17:54                         ` Richard Stallman
  2006-02-28  0:44                       ` Luc Teirlinck
  2006-02-28  0:59                       ` Luc Teirlinck
  3 siblings, 2 replies; 81+ messages in thread
From: Luc Teirlinck @ 2006-02-28  0:30 UTC (permalink / raw)
  Cc: rudalics, emacs-devel, schwab

None of the messages I sent on this (or on anything else) in the last
few days made it to emacs-devel, although all other people's
responses did, be it after some delay.  I just got messages saying
that local delivery failed.  So I will have to repeat some things that
I already said before.

Richard Stallman wrote:

   However, that doesn't necessarily mean the manual is wrong.
   There is more than one way to understand the word "special".
   At the most literal level, ] is not special; if you write it
   without \\, the regexp compiler won't misunderstand it.

`]', like `-', is only special in the context of a character
alternative, that is, if you are already inside a character alternative
when you type it.  By contrast, `[' and all other special characters
(except `^') are only special outside that context.

All characters that are special outside character alternatives are
never special if you precede them with a backslash.  This is true even
for `^'.  This is why it is good to precede them with a backslash even
if they are not special.  That way, the reader can see that they are
not special, without studying the regexp.

On the other hand, a backslash _never_ eliminates the special meaning
of a `]' or `-' that has one.

There are two questions here.  Whether a `]' outside a character
alternative should be quoted or not and whether any changes to the
Elisp manual are required.  In this posting, I will only discuss the
first.

First of all, there are (surprisingly) many occurrences of "\\]" in
the Emacs source, where the `]' _is_ special and closes a character
alternative that contains a backslash.  Reportedly, quoting a `]' with a
backslash _inside_ a character alternative works in some other regexp
implementations such as AWK.  So if I see "\\]" I have to worry about
three possibilities:  it might deliberately close a character
alternative which includes a backslash, it might do so by accident because
the author tried to quote a `]' inside a character alternative (and
hence the regexp is buggy), or it might be a deliberately quoted `]'
outside a character alternative.

If I see `]' without preceding "\\", I only have to worry about
whether or not it closes a character alternative, and not about the
third possibility of a bug.

In summary I believe that quoting a `]' outside a character
alternative only adds clutter and a third possibility to worry about.
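
To make the possible readings of "\\]" concrete, here is a small
sketch.  It assumes Emacs regexp syntax as described in the Elisp
manual, where a backslash is ordinary inside a character alternative;
the test strings are invented for illustration.

```elisp
;; Outside a character alternative, "\\]" is a quoted, literal `]'.
(string-match "x\\]" "x]")     ; => 0

;; "[\\]" is a complete alternative whose only member is a backslash.
(string-match "[\\]" "\\")     ; => 0

;; So in "[\\]]" the first `]' closes the alternative and the second is
;; literal: the regexp matches a backslash followed by `]', not a lone `]'.
(string-match "[\\]]" "\\]")   ; => 0
(string-match "[\\]]" "]")     ; => nil

;; To put `]' itself into an alternative, list it first.
(string-match "[]]" "]")       ; => 0
```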

There are places in the Emacs code that quote a `]' outside a
character alternative.  Even if we decide that this is undesirable, I
do not fancy finding and changing them all.  But we could change the
behavior of `regexp-quote' and `regexp-opt' which currently quote
such `]'.  That could be done with the following trivial patch, which
I could install if that is what we decide to do:

===File ~/search.c-diff=====================================
*** search.c	06 Feb 2006 16:02:24 -0600	1.206
--- search.c	27 Feb 2006 00:16:42 -0600	
***************
*** 3066,3072 ****
  
    for (; in != end; in++)
      {
!       if (*in == '[' || *in == ']'
  	  || *in == '*' || *in == '.' || *in == '\\'
  	  || *in == '?' || *in == '+'
  	  || *in == '^' || *in == '$')
--- 3066,3072 ----
  
    for (; in != end; in++)
      {
!       if (*in == '['
  	  || *in == '*' || *in == '.' || *in == '\\'
  	  || *in == '?' || *in == '+'
  	  || *in == '^' || *in == '$')
============================================================
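
For readers who prefer Lisp to C, the patched test can be paraphrased
roughly as follows.  This is only a sketch: `my-regexp-quote-sketch' is
a hypothetical name, and the function merely mirrors the character list
in the second hunk above, not the actual implementation in search.c.

```elisp
(defun my-regexp-quote-sketch (str)
  "Return STR with a backslash before each character the patched loop tests.
Hypothetical helper for illustration; it is not the C implementation."
  (mapconcat (lambda (ch)
               ;; Same character list as the patched condition: no `]'.
               (if (memq ch '(?\[ ?* ?. ?\\ ?? ?+ ?^ ?$))
                   (string ?\\ ch)
                 (string ch)))
             str ""))

(my-regexp-quote-sketch "foo]")    ; => "foo]"
(my-regexp-quote-sketch "a[b]*")   ; => "a\\[b]\\*"

;; The result still matches the original text literally.
(string-match (my-regexp-quote-sketch "a[b]*") "a[b]*")   ; => 0
```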


* Re: Unquoted special characters in regexps
  2006-02-27 19:03                     ` Richard Stallman
  2006-02-27 19:36                       ` Andreas Schwab
  2006-02-28  0:30                       ` Luc Teirlinck
@ 2006-02-28  0:44                       ` Luc Teirlinck
  2006-03-04 21:07                         ` Thien-Thi Nguyen
  2006-02-28  0:59                       ` Luc Teirlinck
  3 siblings, 1 reply; 81+ messages in thread
From: Luc Teirlinck @ 2006-02-28  0:44 UTC (permalink / raw)
  Cc: rudalics, emacs-devel, schwab

Richard Stallman wrote:

   However, that doesn't necessarily mean the manual is wrong.
   There is more than one way to understand the word "special".
   At the most literal level, ] is not special; if you write it
   without \\, the regexp compiler won't misunderstand it.
   However, it does play a special role in the syntax of regexps,
   and it is not necessarily a bad thing for users to think of it
   as a special character.

It is good for users to think of `]' and `-' as characters that are
special _inside_ the context of a character alternative.  That way
they will not be confused by quotes from the Elisp manual like:

     Note that the usual regexp special characters are not special
     inside a character alternative.  A completely different set of
     characters is special inside character alternatives: `]', `-' and
     `^'.

The special meaning of `]' inside a character alternative is obviously
to close that alternative.  But the Elisp manual currently lists `]',
like `^', but unlike `-' among the characters that also have another
special meaning outside that context.  That is true for `^', but which
other special meaning does `]' have?

Sincerely,

Luc.


* Re: Unquoted special characters in regexps
  2006-02-27 19:03                     ` Richard Stallman
                                         ` (2 preceding siblings ...)
  2006-02-28  0:44                       ` Luc Teirlinck
@ 2006-02-28  0:59                       ` Luc Teirlinck
  2006-03-06 12:52                         ` Richard Stallman
  3 siblings, 1 reply; 81+ messages in thread
From: Luc Teirlinck @ 2006-02-28  0:59 UTC (permalink / raw)
  Cc: rudalics, emacs-devel, schwab

I can install either or both of the two chunks of the patch to
lispref/searching.texi included below, if desired.  (I sent this
before, but it never arrived at emacs-devel).

The first chunk just would eliminate `]' from the list of characters
that are described as special outside a character alternative.

The second chunk rephrases the following:

    For example, a string with unbalanced square brackets is invalid
    (with a few exceptions, such as `[]]'),

That is incorrect or at least ambiguous (how exactly do you define
balanced?) as the examples below show.

ELISP> (string-match "]]]]" "]]]]")
0
ELISP> (string-match "[[]" "[")
0

One accurate way to restate it would be that a string whose square
brackets _with special meaning_ do not balance is invalid.  This
would be (unless I overlook something) without exceptions: in `[]]'
the square brackets with special meaning do balance.  In the patch
below I formulated it differently.

===File ~/searching.texi-diff===============================
*** searching.texi	06 Feb 2006 16:02:08 -0600	1.68
--- searching.texi	26 Feb 2006 10:25:06 -0600	
***************
*** 237,243 ****
  special constructs and the rest are @dfn{ordinary}.  An ordinary
  character is a simple regular expression that matches that character and
  nothing else.  The special characters are @samp{.}, @samp{*}, @samp{+},
! @samp{?}, @samp{[}, @samp{]}, @samp{^}, @samp{$}, and @samp{\}; no new
  special characters will be defined in the future.  Any other character
  appearing in a regular expression is ordinary, unless a @samp{\}
  precedes it.
--- 237,243 ----
  special constructs and the rest are @dfn{ordinary}.  An ordinary
  character is a simple regular expression that matches that character and
  nothing else.  The special characters are @samp{.}, @samp{*}, @samp{+},
! @samp{?}, @samp{[}, @samp{^}, @samp{$}, and @samp{\}; no new
  special characters will be defined in the future.  Any other character
  appearing in a regular expression is ordinary, unless a @samp{\}
  precedes it.
***************
*** 740,747 ****
  
  @kindex invalid-regexp
    Not every string is a valid regular expression.  For example, a string
! with unbalanced square brackets is invalid (with a few exceptions, such
! as @samp{[]]}), and so is a string that ends with a single @samp{\}.  If
  an invalid regular expression is passed to any of the search functions,
  an @code{invalid-regexp} error is signaled.
  
--- 740,747 ----
  
  @kindex invalid-regexp
    Not every string is a valid regular expression.  For example, a string
! that ends inside a character alternative without terminating @samp{]}
! is invalid, and so is a string that ends with a single @samp{\}.  If
  an invalid regular expression is passed to any of the search functions,
  an @code{invalid-regexp} error is signaled.
  
============================================================


* Re: Unquoted special characters in regexps
  2006-02-27 22:11                               ` Andreas Schwab
@ 2006-02-28  6:19                                 ` Richard Stallman
  2006-02-28 10:28                                 ` martin rudalics
  1 sibling, 0 replies; 81+ messages in thread
From: Richard Stallman @ 2006-02-28  6:19 UTC (permalink / raw)
  Cc: rudalics, teirllm, emacs-devel

    > That was my intention.  I'd fix some ten bugs (counting cc-awk.el).
    > You have to fix a bit more.  Good luck then.

    You are not to decide who is fixing bugs.

Martin was not really trying to give orders.
He was urging people to do more bug-fixing.

Everyone's welcome to do that.


* Re: Unquoted special characters in regexps
  2006-02-28  0:30                       ` Luc Teirlinck
@ 2006-02-28 10:27                         ` martin rudalics
  2006-02-28 22:57                           ` Luc Teirlinck
  2006-03-01 17:54                         ` Richard Stallman
  1 sibling, 1 reply; 81+ messages in thread
From: martin rudalics @ 2006-02-28 10:27 UTC (permalink / raw)
  Cc: schwab, rms, emacs-devel

 > `]', like `-', is only special in the context of a character
 > alternative, that is, if you are already inside a character alternative
 > when you type it.  By contrast, `[' and all other special characters
 > (except `^') are only special outside that context.

You can talk about a context iff you are able to grammatically specify
it.  In order to talk about the contents of a string you must be able to
determine the character sequences opening and closing strings.  It would
be strange to say, for example, that the double-quote opening an Elisp
string is outside the context of the string and the double-quote that
closes it inside.  It would be strange to say that the bracket opening a
character alternative is outside the context of the alternative and the
closing bracket inside.

 > All characters that are special outside character alternatives are
 > never special if you precede them with a backslash.  This is true even
 > for `^'.  This is why it is good to precede them with a backslash even
 > if they are not special.  That way, the reader can see that they are
 > not special, without studying the regexp.

I agree.  Let's try to read the following definition from `cc-fonts.el':

(defconst autodoc-font-lock-doc-comments
   `(("@\\(\\w+{\\|\\[\\([^\]@\n\r]\\|@@\\)*\\]\\|[@}]\\|$\\)"
  ...

It tells me that there are two character alternatives started by an
unquoted `[' and terminated by an unquoted `]'.  It also tells me that
it's meant to match a bracketed expression as represented by `\\[' and
`\\]' - I quickly exclude the possibility that the backslashes preceding
any of these brackets are quoted backslashes in a character alternative.
And, finally, the expression tells me that the author was probably
uncertain about how to put a `]' inside a complemented character
alternative, hence (s)he quoted it with a single backslash.  In any case
I have no difficulty reading the expression, although I know nothing
about its purpose.  You propose to write

(defconst autodoc-font-lock-doc-comments
   `(("@\\(\\w+{\\|\\[\\([^\]@\n\r]\\|@@\\)*]\\|[@}]\\|$\\)"
  ...

instead.  In that case, when I look at the character sequence `*]' I
would have to consider the case that the `]' closes some character
alternative.  Only after I resolved that I would be able to say that the
`]' should indeed match a right bracket.  And I would still have to
check whether the backslashes preceding the `\\[' are quoted backslashes
in a character set.
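
A note on the single backslash in `[^\]@\n\r]': as far as I can tell,
the Lisp reader drops a backslash that precedes a character with no
escape meaning, so the regexp compiler never sees it.  The sketch below
assumes that reader behavior; the test strings are made up.

```elisp
;; The reader turns "\]" into "]", so both literals denote the same string.
(string= "[^\]@\n\r]" "[^]@\n\r]")     ; => t

;; What the regexp compiler sees is a complemented alternative with `]'
;; listed first, which excludes `]', `@', newline and carriage return.
(string-match "[^]@\n\r]" "]")          ; => nil
(string-match "[^]@\n\r]" "x")          ; => 0
```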

 > First of all, there are (surprisingly) many occurrences of "\\]" in
 > the Emacs source, where the `]' _is_ special and closes a character
 > alternative that contains a backslash.  Reportedly, quoting a `]' with a
 > backslash _inside_ a character alternative works in some other regexp
 > implementations such as AWK.  So if I see "\\]" I have to worry about
 > three possibilities:  it might deliberately close a character
 > alternative which includes a backslash, it might do so by accident because
 > the author tried to quote a `]' inside a character alternative (and
 > hence the regexp is buggy), or it might be a deliberately quoted `]'
 > outside a character alternative.

The Emacs manual clearly states that the backslash is not special in a
character set.  But I admit that users of other languages do have
problems when writing Elisp regexps.  That's why a clear and unambiguous
definition of these concepts is important.

 > If I see `]' without preceding "\\", I only have to worry about
 > whether or not it closes a character alternative, and not about the
 > third possibility of a bug.

When I try to read a regular expression I do not worry about the
possibility of a bug in the first place.  I try to understand what the
author wanted to match.

 > There are places in the Emacs code that quote a `]' outside a
 > character alternative.  Even if we decide that this is undesirable, I
 > do not fancy finding and changing them all.  But we could change the
 > behavior of `regexp-quote' and `regexp-opt' which currently quote
 > such `]'.  That could be done with the following trivial patch, which
 > I could install if that is what we decide to do:

Given the number of regular expressions that users have created with
these functions and manually inserted in their code, that would be
confusing indeed.


* Re: Unquoted special characters in regexps
  2006-02-27 22:11                               ` Andreas Schwab
  2006-02-28  6:19                                 ` Richard Stallman
@ 2006-02-28 10:28                                 ` martin rudalics
  1 sibling, 0 replies; 81+ messages in thread
From: martin rudalics @ 2006-02-28 10:28 UTC (permalink / raw)
  Cc: teirllm, rms, emacs-devel

>>That was my intention.  I'd fix some ten bugs (counting cc-awk.el).
>>You have to fix a bit more.  Good luck then.
>
>
> You are not to decide who is fixing bugs.

If you read my mails in this thread you will see that I did not intend
to fix any "bugs" in the first place.  I just proposed to remove a few
occurrences of "poor practice".  The term "bug" was introduced by
you in this thread and I have tried to use it in your sense.  Apparently I
failed to do so.

If the text you cite above gives the impression that I wanted to decide
on "what constitutes a bug" or "who should fix it", I apologize.


* Re: Unquoted special characters in regexps
  2006-02-28 10:27                         ` martin rudalics
@ 2006-02-28 22:57                           ` Luc Teirlinck
  2006-03-01 13:00                             ` martin rudalics
  0 siblings, 1 reply; 81+ messages in thread
From: Luc Teirlinck @ 2006-02-28 22:57 UTC (permalink / raw)
  Cc: schwab, rms, emacs-devel

Martin Rudalics wrote:

   It would be strange to say, for example, that the double-quote
   opening an Elisp string is outside the context of the string and
   the double-quote that closes it inside.

I do not see why you consider this strange.  Quite to the contrary,
this is exactly what allows one to determine whether a `"' opens or
closes a string.  `"' is special both inside and outside the context
of a string.  But its special meaning depends on that context.
Outside the context of a string `"' starts a string, inside the
context of a string, `"' ends a string.  So an opening `"' is opening
_because_ it occurs outside of a string context and the closing `"' is
the closing one _because_ it occurs inside a string context.

Note that the GNU regexp manual, node `(regex)List Operators' agrees
with Andreas and me that `[' is special _outside_ a character alternative
(by stating that it is ordinary inside one) and explicitly states that
`]' has the special meaning of closing a character alternative
_inside_ a character alternative.  (Note that it refers to character
alternatives as "lists".)

Sincerely,

Luc.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unquoted special characters in regexps
  2006-02-28 22:57                           ` Luc Teirlinck
@ 2006-03-01 13:00                             ` martin rudalics
  0 siblings, 0 replies; 81+ messages in thread
From: martin rudalics @ 2006-03-01 13:00 UTC (permalink / raw)
  Cc: schwab, rms, emacs-devel

 > Martin Rudalics wrote:
 >
 >    It would be strange to say, for example, that the double-quote
 >    opening an Elisp string is outside the context of the string and
 >    the double-quote that closes it inside.
 >
 > I do not see why you consider this strange.  Quite to the contrary,
 > this is exactly what allows one to determine whether a `"' opens or
 > closes a string.  `"' is special both inside and outside the context
 > of a string.  But its special meaning depends on that context.
 > Outside the context of a string `"' starts a string, inside the
 > context of a string, `"' ends a string.  So an opening `"' is opening
 > _because_ it occurs outside of a string context and the closing `"' is
 > the closing one _because_ it occurs inside a string context.
 >
 > Note that the GNU regexp manual, node `(regex)List Operators' agrees
 > with Andreas and me that `[' is special _outside_ a character alternative
 > (by stating that it is ordinary inside one) and explicitly states that
 > `]' has the special meaning of closing a character alternative
 > _inside_ a character alternative.  (Note that it refers to character
 > alternatives as "lists".)

If you refer to section "3.6 List Operators ([ ... ] and [^ ... ])" of
the GNU regex manual I can extract three relevant sentences:

"A matching list matches a single character represented by one of the
list items. You form a matching list by enclosing one or more items
within an open-matching-list operator (represented by `[') and a
close-list operator (represented by `]')."

If you deduce here that the "close-list operator" is part of the "items
within" you can deduce that the "open-matching-list" operator is part of
the "items within" as well.

"`]' ends the list if it's not the first list item. So, if you want to
make the `]' character a list item, you must put it first."

`]' is special inside a character list - the "items within" mentioned
above - because it has to appear as the first element of that list.

"`-' represents the range operator (see section 3.6.2 The Range Operator
(-)) if it's not first or last in a list or the ending point of a range."

If `-' can be "last in a list" the close-list operator `]' cannot be
"last in that list".  Ex falso sequitur quodlibet.

If anyone's interested in how other languages handle regexp brackets
see the list below:

Perl's metacharacters are:
     { } [ ] ( ) ^ $ . | * + ? \

Python metacharacters are:
     . ^ $ * + ? { [ ] \ | ( )

PHP:
     Outside square brackets, the meta-characters are as follows:
     ...
     [ start character class definition
     ] end character class definition
     ...

XML:
     A metacharacter is either ., \, ?, *, +, {, } (, ), [ or ].

Tcl:
     A regular expression uses metacharacters (characters that assume special
     meaning for matching other characters) such as *, [], $ and ..
     ...
     A backslash (\) disables the special meaning of the following character,
     so you could match the string [Hello] with the RE \[Hello\].

Java (http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html):
     Perl is forgiving about malformed matching constructs, as in the
     expression *a, as well as dangling brackets, as in the expression
     abc], and treats them as literals.

     Java also accepts dangling brackets but is strict about dangling
     metacharacters like +, ? and *, and will throw a
     PatternSyntaxException if it encounters them.

Hence all classic regexp languages do consider `]' special and do not
consider `-' special.  The Java doc calls the `]' in `abc]' a dangling
bracket.  The fact that languages "forgive" or "accept" such constructs
shouldn't cause anyone to promote such style.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unquoted special characters in regexps
  2006-02-28  0:30                       ` Luc Teirlinck
  2006-02-28 10:27                         ` martin rudalics
@ 2006-03-01 17:54                         ` Richard Stallman
  2006-03-02  4:06                           ` Luc Teirlinck
                                             ` (2 more replies)
  1 sibling, 3 replies; 81+ messages in thread
From: Richard Stallman @ 2006-03-01 17:54 UTC (permalink / raw)
  Cc: rudalics, emacs-devel, schwab

    `]', like `-' are only special in the context of a character
    alternative, that is if, before you type them, you are in a character
    alternative.   By contrast, `['  and all other special characters
    (except `^') are  only special outside that context.

You're interpreting the term "context" the same way the regexp
compiler does: meaning the preceding characters of the regexp.  The
regexp compiler works from left to right.  However, to a person, the
context of a character set, or any sub-regexp, is found on both sides
of it.  Understood in this way, the role of a character set's closing
] is dual to that of the opening [; both of them delimit the character
set.  Both characters play special roles in the syntax of regexps, and
these roles are not internal to a character set.

    First of all, there are (surprisingly) many occurrences of "\\]" in
    the Emacs source, where the `]' _is_ special and closes a character
    alternative that contains a slash.

That is a good point.  We don't want people to get confused about that.

So I think we should not encourage the quoting of ], but we need to be
careful about how to explain this.  I will write it.

Meanwhile, please do install your search.c patch.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unquoted special characters in regexps
  2006-03-01 17:54                         ` Richard Stallman
@ 2006-03-02  4:06                           ` Luc Teirlinck
  2006-03-02 19:43                             ` Richard Stallman
  2006-03-02  4:54                           ` Luc Teirlinck
  2006-03-02 18:40                           ` martin rudalics
  2 siblings, 1 reply; 81+ messages in thread
From: Luc Teirlinck @ 2006-03-02  4:06 UTC (permalink / raw)
  Cc: rudalics, emacs-devel, schwab

Richard Stallman wrote:

   You're interpreting the term "context" the same way the regexp
   compiler does: meaning the preceding characters of the regexp.

Of course I do.  That is the only interpretation my computer cares about.
If I interpret a regexp differently from the regexp compiler, the
regexp compiler wins, and I lose.  So I do not want to do that.

   The regexp compiler works from left to right.

I usually read regexps left to right too, keeping track of context the
same way the regexp compiler does.  I want to make sure that I
interpret regexps the same way the regexp compiler and my computer do.

Sincerely,

Luc.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unquoted special characters in regexps
  2006-03-01 17:54                         ` Richard Stallman
  2006-03-02  4:06                           ` Luc Teirlinck
@ 2006-03-02  4:54                           ` Luc Teirlinck
  2006-03-02 18:40                           ` martin rudalics
  2 siblings, 0 replies; 81+ messages in thread
From: Luc Teirlinck @ 2006-03-02  4:54 UTC (permalink / raw)
  Cc: rudalics, emacs-devel, schwab

Richard Stallman wrote:

   Meanwhile, please do install your search.c patch.

Done.

Sincerely,

Luc.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unquoted special characters in regexps
  2006-03-01 17:54                         ` Richard Stallman
  2006-03-02  4:06                           ` Luc Teirlinck
  2006-03-02  4:54                           ` Luc Teirlinck
@ 2006-03-02 18:40                           ` martin rudalics
  2006-03-02 23:26                             ` Luc Teirlinck
                                               ` (2 more replies)
  2 siblings, 3 replies; 81+ messages in thread
From: martin rudalics @ 2006-03-02 18:40 UTC (permalink / raw)
  Cc: schwab, Luc Teirlinck, emacs-devel

 >     First of all, there are (surprisingly) many occurrences of "\\]" in
 >     the Emacs source, where the `]' _is_ special and closes a character
 >     alternative that contains a slash.
 >
 > That is a good point.  We don't want people to get confused about that.

There are very few expressions where `\\' does have to precede a right
bracket, `[^\\]', `[]\\]', and `[^]\\]' come to mind.  In any other case
people may avoid confusion by moving the backslash in front of another
character.
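
A minimal sketch of the distinction, in Emacs Lisp (the test strings are
made up for illustration): inside a character alternative a backslash is
an ordinary member, so `\\]' there ends the alternative, whereas outside
one both `\\]' and `]' match a literal right bracket.

```elisp
;; Regexp [a\] is a set containing `a' and backslash; the `]' closes it.
(string-match-p "[a\\]" "\\")   ; => 0, the backslash is a member
(string-match-p "[a\\]" "]")    ; => nil, `]' is not a member

;; Regexp []\] puts `]' first, so `]' is a member along with backslash.
(string-match-p "[]\\]" "]")    ; => 0

;; Outside a character alternative both spellings match a literal `]'.
(string-match-p "x\\]" "x]")    ; => 0
(string-match-p "x]" "x]")      ; => 0
```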

In current Emacs code there are some 100 occurrences where programmers
were able to convey the intention that they indeed wanted to match a
right bracket by writing `\\]'.  Simultaneously, programmers were able
to express that they did _not_ want a character alternative to end here.
Your change will make it difficult if not impossible to express such
intentions.

And, your change is motivated by the pessimistic assumption that people
frequently submit code with buggy regexps.  Even if that were the case
your change would hardly help.  Consider the following expression from
`gud-jdb-marker-filter':

	 "\\(\[[0-9]+\] \\)*\\([a-zA-Z0-9.$_]+\\)\\.[a-zA-Z0-9$_<>(),]+ \
\\(([a-zA-Z0-9.$_]+:\\|line=\\)\\([0-9.,]+\\)"

Experience tells me that this should probably be written as

	 "\\(\\[[0-9]+\\] \\)*\\([a-zA-Z0-9.$_]+\\)\\.[a-zA-Z0-9$_<>(),]+ \
\\(([a-zA-Z0-9.$_]+:\\|line=\\)\\([0-9.,]+\\)"

but I'm not quite sure since `gud.el' is one of the few Emacs files that
do not consistently use `\\]' to match a right bracket.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unquoted special characters in regexps
  2006-03-02  4:06                           ` Luc Teirlinck
@ 2006-03-02 19:43                             ` Richard Stallman
  0 siblings, 0 replies; 81+ messages in thread
From: Richard Stallman @ 2006-03-02 19:43 UTC (permalink / raw)
  Cc: rudalics, emacs-devel, schwab

       You're interpreting the term "context" the same way the regexp
       compiler does: meaning the preceding characters of the regexp.

    Of course I do.  That is the only interpretation my computer cares about.

The manual is meant for human beings to read, not for computers.
And the strict left-to-right parsing concept is not the way
human beings usually understand regexps.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unquoted special characters in regexps
  2006-03-02 18:40                           ` martin rudalics
@ 2006-03-02 23:26                             ` Luc Teirlinck
  2006-03-03  7:42                               ` martin rudalics
  2006-03-03 10:25                             ` Richard Stallman
  2006-03-03 10:25                             ` Richard Stallman
  2 siblings, 1 reply; 81+ messages in thread
From: Luc Teirlinck @ 2006-03-02 23:26 UTC (permalink / raw)
  Cc: schwab, rms, emacs-devel

Martin Rudalics wrote:

	    "\\(\[[0-9]+\] \\)*\\([a-zA-Z0-9.$_]+\\)\\.[a-zA-Z0-9$_<>(),]+ \
   \\(([a-zA-Z0-9.$_]+:\\|line=\\)\\([0-9.,]+\\)"

   Experience tells me that this should be probably written as

	    "\\(\\[[0-9]+\\] \\)*\\([a-zA-Z0-9.$_]+\\)\\.[a-zA-Z0-9$_<>(),]+ \
   \\(([a-zA-Z0-9.$_]+:\\|line=\\)\\([0-9.,]+\\)"

   but I'm not quite sure since `gud.el' is one of the few Emacs files that
   do not consistently use `\\]' to match a right bracket.

I do not see what this problem has to do with "\\]" vs ']'.

This seems to be just a case of forgetting to double up `\' for Lisp
syntax.  The actually intended regexp would seem to obviously be:

"\\(\\[[0-9]+] \\)* and so on.

The present regexp is valid, but the syntax it is looking for seems
bizarre.  On the other hand looking for things like:

"[123] [5] [2034] "

seems to make sense.

Sincerely,

Luc.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unquoted special characters in regexps
  2006-03-02 23:26                             ` Luc Teirlinck
@ 2006-03-03  7:42                               ` martin rudalics
  2006-03-03 13:51                                 ` Luc Teirlinck
  2006-03-03 14:09                                 ` Luc Teirlinck
  0 siblings, 2 replies; 81+ messages in thread
From: martin rudalics @ 2006-03-03  7:42 UTC (permalink / raw)
  Cc: schwab, rms, emacs-devel

 > I do not see what this problem has to do with "\\]" vs ']'.
 >
 > This seems to be just a case of forgetting to double up `\' for Lisp
 > syntax.

That's precisely what I meant.  If programmers consistently double up
backslashes for _all_ escaped brackets it's usually simple to guess when
one of them has been omitted.  Otherwise you always have to consider the
possibility that the author wanted to close a character alternative here
and messed up some preceding part.

You have a long-standing experience (or maybe some sixth sense) for
discovering wrong regexps faster than most of us.  But you should
occasionally think of less experienced programmers who try to guess the
motivations for writing an expression like

(string-match "[^\\]\\(\\([\\][\\]\\)*\\)\"[ \t,]*"
					 definition start)

in `mailalias.el'.  It's got no less than three backslashes preceding
non-escaped right brackets.  Can you tell me what the author wants to
match?  If, by default, I have to consider the possibility that a `]'
may either close a character alternative _or_ stand for itself, the
number of interpretations of such expressions explodes combinatorially.
Programmers should avoid confusion by not putting `\\' at the end of a
character alternative unless it's needed as in `[^\\]'.

 > The present regexp is valid, but the syntax it is looking for seems
 > bizarre.  On the other hand looking for things like:
 >
 > "[123] [5] [2034] "
 >
 > seems to make sense.

Because people are used to considering objects like "[123] [5] [2034]"
well-formed and objects like "123]", "]5]", "[2034 " bizarre.  Most
humans _do_ expect to find some sort of symmetry in the things they
observe.  Symmetry is a driving principle of mathematics and computer
sciences.  Often, it's a lack of symmetry that makes people aware of
faults or other anomalies.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unquoted special characters in regexps
  2006-03-02 18:40                           ` martin rudalics
  2006-03-02 23:26                             ` Luc Teirlinck
@ 2006-03-03 10:25                             ` Richard Stallman
  2006-03-03 15:20                               ` martin rudalics
  2006-03-03 10:25                             ` Richard Stallman
  2 siblings, 1 reply; 81+ messages in thread
From: Richard Stallman @ 2006-03-03 10:25 UTC (permalink / raw)
  Cc: schwab, teirllm, emacs-devel

    Your change will make it difficult if not impossible to express such
    intentions.

I don't understand.  I suspect there is a miscommunication.
When you say "my change", what change is that?

I approved a proposed change in regexp-quote,
and I said I would change the manual.  I did not talk about
any change in parsing regexps.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unquoted special characters in regexps
  2006-03-02 18:40                           ` martin rudalics
  2006-03-02 23:26                             ` Luc Teirlinck
  2006-03-03 10:25                             ` Richard Stallman
@ 2006-03-03 10:25                             ` Richard Stallman
  2006-03-03 15:51                               ` martin rudalics
  2006-03-05  2:54                               ` Luc Teirlinck
  2 siblings, 2 replies; 81+ messages in thread
From: Richard Stallman @ 2006-03-03 10:25 UTC (permalink / raw)
  Cc: schwab, teirllm, emacs-devel

    your change would hardly help.  Consider the following expression from
    `gud-jdb-marker-filter':

	     "\\(\[[0-9]+\] \\)*\\([a-zA-Z0-9.$_]+\\)\\.[a-zA-Z0-9$_<>(),]+ \
    \\(([a-zA-Z0-9.$_]+:\\|line=\\)\\([0-9.,]+\\)"

    Experience tells me that this should be probably written as

	     "\\(\\[[0-9]+\\] \\)*\\([a-zA-Z0-9.$_]+\\)\\.[a-zA-Z0-9$_<>(),]+ \
    \\(([a-zA-Z0-9.$_]+:\\|line=\\)\\([0-9.,]+\\)"

\[ and \] in Lisp strings are equivalent to just [ and just ].  So I
think the current value is incorrect, and the [ needs to have \\ before it.

Meanwhile, the question we're discussing here is whether to write \\
before the ].  That is harmless, and the question is whether it makes
things clearer or more confusing.  The problem is that usually it
makes things clearer, but occasionally people could get confused when
\\ is last in a character alternative.
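
As a rough illustration of the first point (a sketch only; the names
`my-current-prefix' and `my-intended-prefix' are hypothetical, and only
the leading group of the gud.el regexp is used, anchored to the whole
string for testing): the Lisp reader drops the backslash in "\[" and
"\]", so the current string starts with the character alternative
`[[0-9]' instead of a literal `['.

```elisp
;; The Lisp reader turns "\[" and "\]" into plain "[" and "]".
(string= "\\(\[[0-9]+\] \\)*" "\\([[0-9]+] \\)*")    ; => t

;; Hypothetical names; the leading group only, anchored for testing.
(defconst my-current-prefix  "\\`\\(\[[0-9]+\] \\)*\\'")
(defconst my-intended-prefix "\\`\\(\\[[0-9]+] \\)*\\'")

;; Both accept the well-formed input ...
(string-match-p my-current-prefix  "[123] [5] ")     ; => 0
(string-match-p my-intended-prefix "[123] [5] ")     ; => 0

;; ... but the current form also accepts input with no opening bracket,
;; because `[[0-9]' is a set containing `[' and the digits.
(string-match-p my-current-prefix  "123] ")          ; => 0
(string-match-p my-intended-prefix "123] ")          ; => nil
```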

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unquoted special characters in regexps
  2006-03-03  7:42                               ` martin rudalics
@ 2006-03-03 13:51                                 ` Luc Teirlinck
  2006-03-03 14:09                                 ` Luc Teirlinck
  1 sibling, 0 replies; 81+ messages in thread
From: Luc Teirlinck @ 2006-03-03 13:51 UTC (permalink / raw)
  Cc: schwab, rms, emacs-devel

Martin Rudalics wrote:

   But you should
   occasionally think of less experienced programmers who try to guess the
   motivations for writing an expression like

   (string-match "[^\\]\\(\\([\\][\\]\\)*\\)\"[ \t,]*"
					    definition start)

   in `mailalias.el'.  It's got no less than three backslashes preceding
   non-escaped right brackets.  Can you tell me what the author wants to
   match?

Unless it really is too early in the morning for me, something that
starts with something that is not a backslash, then an even number of
backslashes, then a ", then a sequence of non-newline whitespace or
commas.  The one pair of \\(...\\) that is not needed for this meaning
is probably meant for use with match-data.

What is the point you are trying to make?  That

"[^\\]\\(\\(\\\\\\\\\\)*\\)\"[ \t,]*"

would be easier to read?  Not for me.
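
For what it is worth, the two spellings should be interchangeable.  A
small sketch, assuming a few made-up sample strings (`set-style' and
`slash-style' are just local names for the two regexps above):

```elisp
(let ((set-style   "[^\\]\\(\\([\\][\\]\\)*\\)\"[ \t,]*")
      (slash-style "[^\\]\\(\\(\\\\\\\\\\)*\\)\"[ \t,]*")
      (samples '("012\" ,"      ; unquoted quote right after `2'
                 "x\\\\\" ,"    ; two backslashes, then a quote
                 "x\\\" ,"      ; one backslash: the quote is escaped
                 "\" ,")))      ; no character before the quote
  ;; Pair up the match position of each regexp for every sample.
  (mapcar (lambda (s)
            (list (string-match-p set-style s)
                  (string-match-p slash-style s)))
          samples))
;; => ((2 2) (0 0) (nil nil) (nil nil))
```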

Sincerely,

Luc.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unquoted special characters in regexps
  2006-03-03  7:42                               ` martin rudalics
  2006-03-03 13:51                                 ` Luc Teirlinck
@ 2006-03-03 14:09                                 ` Luc Teirlinck
  2006-03-03 18:52                                   ` martin rudalics
  1 sibling, 1 reply; 81+ messages in thread
From: Luc Teirlinck @ 2006-03-03 14:09 UTC (permalink / raw)
  Cc: schwab, rms, emacs-devel

Martin Rudalics wrote:

   Can you tell me what the author wants to match?

To give a less technical answer than in my previous response, an
_unquoted_ ", followed by a bunch of non-newline whitespace or commas.

   Most humans _do_ expect to find some sort of symmetry in the things
   they observe.

Not necessarily.  Because you might start your regexp search in the
middle of something, breaking all symmetry.  In the example above, the
search probably started inside a string and the regexp is looking for
the end of it.

Sincerely,

Luc.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unquoted special characters in regexps
  2006-03-03 10:25                             ` Richard Stallman
@ 2006-03-03 15:20                               ` martin rudalics
  2006-03-04 13:37                                 ` Richard Stallman
  0 siblings, 1 reply; 81+ messages in thread
From: martin rudalics @ 2006-03-03 15:20 UTC (permalink / raw)
  Cc: schwab, teirllm, emacs-devel

 > I don't understand.  I suspect there is a miscommunication.
 > When you say "my change", what change is that?

Sorry for the misunderstanding.  For some reason, I'm currently
receiving mails in quite erratic order.

I referred to Luc's change of `regexp-quote' which, in my opinion, will
make it in some cases impossible to generate regular expressions the
traditional way.  More precisely `(regexp-quote "[foo]")' so far
evaluates to "\\[foo\\]" and has evaluated that way ever since.
Changing this to "\\[foo]" will require that when in future I want to
study a regexp I must also keep in mind whether that expression was
generated by `regexp-opt' before or after that change.
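
To keep the two spellings apart, here is a small sketch with a made-up
sample text.  It deliberately does not assume which spelling
`regexp-quote' returns, since that depends on the Emacs version; it only
checks that the generated regexp and both hand-written spellings find
the same literal text.

```elisp
(let ((text "see [foo] here")
      (generated (regexp-quote "[foo]")))
  (list (string-match-p generated text)      ; whatever regexp-quote returns
        (string-match-p "\\[foo\\]" text)    ; `]' quoted
        (string-match-p "\\[foo]" text)))    ; `]' unquoted
;; => (4 4 4)
```

So the difference is one of appearance in the generated regexp, not of
what it matches.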

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unquoted special characters in regexps
  2006-03-03 10:25                             ` Richard Stallman
@ 2006-03-03 15:51                               ` martin rudalics
  2006-03-03 23:48                                 ` Luc Teirlinck
  2006-03-04 23:16                                 ` Luc Teirlinck
  2006-03-05  2:54                               ` Luc Teirlinck
  1 sibling, 2 replies; 81+ messages in thread
From: martin rudalics @ 2006-03-03 15:51 UTC (permalink / raw)
  Cc: schwab, teirllm, emacs-devel

 > 	     "\\(\[[0-9]+\] \\)*\\([a-zA-Z0-9.$_]+\\)\\.[a-zA-Z0-9$_<>(),]+ \
 >     \\(([a-zA-Z0-9.$_]+:\\|line=\\)\\([0-9.,]+\\)"
 >
 >     Experience tells me that this should be probably written as
 >
 > 	     "\\(\\[[0-9]+\\] \\)*\\([a-zA-Z0-9.$_]+\\)\\.[a-zA-Z0-9$_<>(),]+ \
 >     \\(([a-zA-Z0-9.$_]+:\\|line=\\)\\([0-9.,]+\\)"
 >
 > \[ and \] in Lisp strings are equivalent to just [ and just ].  So I
 > think the current value is incorrect, and the [ needs to have \\ before it.
 >
 > Meanwhile, the question we're discussing here is whether to write \\
 > before the ].  That is harmless, and the question is whether it makes
 > things clearer or more confusing.  The problem is that usually it
 > makes things clearer, but occasionally people could get confused when
 > \\ is last in a character alternative.

The question whether to write `\\' before the `]' is relevant for the
example cited above.  Usually, when I see a `\\]' outside a character
alternative I expect it to match a right bracket in some text.  And,
usually, in that text a left bracket will precede the right bracket.
Hence, if in the text above the author had used `\\]' instead of `\]' it
would have been easy to conclude - from the absence of a preceding `\\['
- that something went wrong.  Vice versa, when seeing a `\\[' I usually
expect it to have a corresponding `\\]' somewhere on the right.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unquoted special characters in regexps
  2006-03-03 14:09                                 ` Luc Teirlinck
@ 2006-03-03 18:52                                   ` martin rudalics
  2006-03-03 22:41                                     ` Luc Teirlinck
  2006-03-03 23:00                                     ` Luc Teirlinck
  0 siblings, 2 replies; 81+ messages in thread
From: martin rudalics @ 2006-03-03 18:52 UTC (permalink / raw)
  Cc: schwab, rms, emacs-devel

 > To give a less technical answer than in my previous response, an
 > _unquoted_ ", followed by a bunch of non-newline whitespace or commas.

(string-match "[^\\]\\(\\([\\][\\]\\)*\\)\"[ \t,]*" "\" ,") => nil

 > What is the point you are trying to make?  That
 >
 > "[^\\]\\(\\(\\\\\\\\\\)*\\)\"[ \t,]*"
 >
 > would be easier to read?  Not for me.

I agree that writing and reading 10 backslashes in a row is dreadful.
However, writing `[\\]' to match a single backslash is dreadful as well.
A character alternative without alternative does not deserve its name.
Nowadays I'd probably write something like

"[^\\]\\(\\\\\\{2\\}*\\)\"[ \t,]*"

but maybe at the time the original expression was written repetition
operators were not yet available.

Anyway, the "point I was trying to make" was a different one.  I believe
we should give suggestions how to avoid writing confusing regexps rather
than change `regexp-quote'.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unquoted special characters in regexps
  2006-03-03 18:52                                   ` martin rudalics
@ 2006-03-03 22:41                                     ` Luc Teirlinck
  2006-03-03 23:00                                     ` Luc Teirlinck
  1 sibling, 0 replies; 81+ messages in thread
From: Luc Teirlinck @ 2006-03-03 22:41 UTC (permalink / raw)
  Cc: schwab, rms, emacs-devel

Martin Rudalics wrote:

    > To give a less technical answer than in my previous response, an
    > _unquoted_ ", followed by a bunch of non-newline whitespace or commas.

   (string-match "[^\\]\\(\\([\\][\\]\\)*\\)\"[ \t,]*" "\" ,") => nil

Less technical means more potential for ambiguity.  Of course the
regexp does _not_ match "\" ,", because that would not guarantee that
the " it found is unquoted.  Apparently, point is at the beginning of
a string and the regexp searches for the ending unquoted " by
searching, as I said in my previous message, for something that is
_not_ a backslash, then an even number of backslashes, then a ", then
a bunch of non-newline whitespace or commas.  Apparently, the code
relies on the assumption that the string does not consist of _only_
backslashes.

ELISP> (string-match "[^\\]\\(\\([\\][\\]\\)*\\)\"[ \t,]*" "012\" ,")
2

Sincerely,

Luc.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unquoted special characters in regexps
  2006-03-03 18:52                                   ` martin rudalics
  2006-03-03 22:41                                     ` Luc Teirlinck
@ 2006-03-03 23:00                                     ` Luc Teirlinck
  1 sibling, 0 replies; 81+ messages in thread
From: Luc Teirlinck @ 2006-03-03 23:00 UTC (permalink / raw)
  Cc: schwab, rms, emacs-devel

Martin Rudalics wrote:

   Nowadays I'd probably write something like

   "[^\\]\\(\\\\\\{2\\}*\\)\"[ \t,]*"

   but maybe at the time the original expression was written repetition
   operators were not yet available.

To me, the above regexp is _really_ awkward, whereas

"[^\\]\\(\\([\\][\\]\\)*\\)\"[ \t,]*"

is really easy to understand and very self-documenting.

   However, writing `[\\]' to match a single backslash is dreadful as well.

Quite to the contrary.  It documents very clearly which \\ together
represent one single literal backslash and it separates them clearly
from the surrounding non-literal backslashes.  It is what makes this
regexp so very easy to read, unlike your suggested replacement with
its six consecutive ungrouped backslashes, with various different
meanings.

Sincerely,

Luc.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unquoted special characters in regexps
  2006-03-03 15:51                               ` martin rudalics
@ 2006-03-03 23:48                                 ` Luc Teirlinck
  2006-03-04  9:58                                   ` martin rudalics
  2006-03-04 23:16                                 ` Luc Teirlinck
  1 sibling, 1 reply; 81+ messages in thread
From: Luc Teirlinck @ 2006-03-03 23:48 UTC (permalink / raw)
  Cc: schwab, rms, emacs-devel

Martin Rudalics wrote:

   The question whether to write `\\' before the `]' is relevant for the
   example cited above.  Usually, when I see a `\\]' outside a character
   alternative I expect it to match a right bracket in some text.  And,
   usually, in that text a left bracket will precede the right bracket.
   Hence, if in the text above the author had used `\\]' instead of `\]' it
   would have been easy to conclude - from the absence of a preceding `\\['
   - that something went wrong.  Vice versa, when seeing a `\\[' I usually
   expect it to have a corresponding `\\]' somewhere on the right.

I believe that you make understanding regexps hard on yourself by
making all kinds of assumptions that often are not satisfied.

There is no reason why a literal `]' should be matched by a literal
`[' to the right or vice versa.  Even _if_ the `[' and the `]' balance
in the text you are parsing through _considered in its entirety_
(which is not at all guaranteed), you might be inside, say, a nested
Lisp vector and your regexp may be searching for its end.  No balance
of literal `[' and `]' at all.  This is _not_ an exceptional
situation.  It occurs all over the place in the Emacs source code.

Sincerely,

Luc.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unquoted special characters in regexps
  2006-03-03 23:48                                 ` Luc Teirlinck
@ 2006-03-04  9:58                                   ` martin rudalics
  0 siblings, 0 replies; 81+ messages in thread
From: martin rudalics @ 2006-03-04  9:58 UTC (permalink / raw)
  Cc: schwab, rms, emacs-devel

> I believe that you make understanding regexps hard on yourself by
> making all kinds of assumptions that often are not satisfied.
>
> There is no reason why a literal `]' should be matched by a literal
> `[' to the right or vice versa.

What I meant was that

(i) when I see a literal `[' I expect it to be matched by a literal `]'
in the text that follows and,

(ii) when I see a literal `]' I expect it to be matched by a literal `['
in the preceding text.

In mathematics half-open intervals like `]3,5]' are an obvious exception to
these rules but in general I've been quite happy with them.  In the
particular case, I've been talking about a regexp in Emacs source

	 "\\(\[[0-9]+\] \\)*\\([a-zA-Z0-9.$_]+\\)\\.[a-zA-Z0-9$_<>(),]+ \
\\(([a-zA-Z0-9.$_]+:\\|line=\\)\\([0-9.,]+\\)"

which I consider wrong.  Apparently that part of the code is never taken
thus no one has complained so far about mismatches.  However, similar
expressions to match line numbers occur frequently.  And I use the rules
above to reason about them and am confident that in this particular case
you use one of these rules as well.

If I followed your reasoning to its logical end I couldn't possibly rule
out malformed regexps like `[a-z'.  After all the `[' states that a
character alternative starts here, why should a user bother to close it?

> Even _if_ the `[' and the `]' balance
> in the text you are parsing through _considered in its entirety_
> (which is not at all guaranteed), you might be inside, say, a nested
> Lisp vector and your regexp may be searching for its end.  No balance
> of literal `[' and `]' at all.  This is _not_ an exceptional
> situation.  It occurs all over the place in the Emacs source code.

I fully agree.  However, in such cases there is practically always some
pdl (variable) to record the current state of "unclosed" literal `['s.
In practice, I will complain about unmatching brackets when either the
pdl is empty (the variable is zero) and I find a literal `]' or the pdl
is non-empty (the variable is non-zero) when I encounter the end of the
text.  Hence, the pdl (variable) compensates missing symmetry in the
part of the text I want to parse.
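
A minimal sketch of that bookkeeping, assuming a plain depth counter
stands in for the pdl (the function name `my-brackets-balanced-p' is
made up for this example and is not taken from the Emacs sources):

```elisp
(defun my-brackets-balanced-p (text)
  "Return non-nil if the literal `[' and `]' in TEXT are balanced."
  (let ((depth 0)   ; number of currently unclosed `['
        (ok t))
    (dolist (ch (string-to-list text))
      (cond ((eq ch ?\[) (setq depth (1+ depth)))
            ((eq ch ?\])
             ;; A `]' with nothing open is a mismatch.
             (if (zerop depth)
                 (setq ok nil)
               (setq depth (1- depth))))))
    ;; Anything still open at the end of TEXT is a mismatch as well.
    (and ok (zerop depth))))

(my-brackets-balanced-p "[123] [5] [2034]")   ; => t
(my-brackets-balanced-p "123]")               ; => nil
(my-brackets-balanced-p "[2034 ")             ; => nil
```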

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Unquoted special characters in regexps
  2006-03-03 15:20                               ` martin rudalics
@ 2006-03-04 13:37                                 ` Richard Stallman
  2006-03-04 14:40                                   ` martin rudalics
  0 siblings, 1 reply; 81+ messages in thread
From: Richard Stallman @ 2006-03-04 13:37 UTC (permalink / raw)
  Cc: schwab, teirllm, emacs-devel

    I referred to Luc's change of `regexp-quote' which, in my opinion, will
    make it in some cases impossible to generate regular expressions the
    traditional way.

I don't understand what that means.  What exactly is the task you
believe will be impossible?  As far as I know, it will still generate
correct regexps, just somewhat different ones.


* Re: Unquoted special characters in regexps
  2006-03-04 13:37                                 ` Richard Stallman
@ 2006-03-04 14:40                                   ` martin rudalics
  2006-03-06  0:48                                     ` Richard Stallman
  0 siblings, 1 reply; 81+ messages in thread
From: martin rudalics @ 2006-03-04 14:40 UTC (permalink / raw)
  Cc: schwab, teirllm, emacs-devel

 > I don't understand what that means.  What exactly is the task you
 > believe will be impossible?  As far as I know, it will still generate
 > correct regexps, just somewhat different ones.

Suppose someone wanted to match `[foo]' in an earlier version of a
program and now wants to match `[foo][bar]' where `[foo]' and `[bar]'
are complicated expressions to be generated with the help of `regexp-opt'.
The earlier version was obtained with `regexp-opt' producing
`\\[foo\\]'.  For the new version `regexp-opt' would generate `\\[bar]'.
The resulting expression would read as `\\[foo\\]\\[bar]' which is
confusing since two different styles are involved.  The user would have
to manually change `\\]' to `]' (or `]' to `\\]') to get a uniform
appearance.


* Re: Unquoted special characters in regexps
  2006-02-28  0:44                       ` Luc Teirlinck
@ 2006-03-04 21:07                         ` Thien-Thi Nguyen
  2006-03-05  3:37                           ` Luc Teirlinck
  0 siblings, 1 reply; 81+ messages in thread
From: Thien-Thi Nguyen @ 2006-03-04 21:07 UTC (permalink / raw)

Luc Teirlinck <teirllm@dms.auburn.edu> writes:

> It is good for users to think of `]' and `-' as characters that
> are special _inside_ the context of a character alternative.

when describing a complicated process (such as the regexp
compiler's operation) it's a common technique to break things down
into independent parts.  that seems to be the approach used in
the docs thus far.  but maybe explaining the role of the square
brace delimiters is not so independent.

my 2c: probably "inside" and "outside" are not as precise as
possible when talking about delimiters (of context or anything),
such as the square braces.  such delimiters "change" the context
(unless somehow inhibited), so that things before or after may be
"inside" or "outside" of the old/new context.

whether or not the delimiter itself is considered inside or outside
depends on whether your pov tends to be forward- or backward-looking
(which is a personal choice, and thus, algorithmically irrelevant).
ascribing context membership to a delimiter is like arguing for
child custody; the delimiter will always be between and no happier
for the label.

thi


* Re: Unquoted special characters in regexps
  2006-03-03 15:51                               ` martin rudalics
  2006-03-03 23:48                                 ` Luc Teirlinck
@ 2006-03-04 23:16                                 ` Luc Teirlinck
  1 sibling, 0 replies; 81+ messages in thread
From: Luc Teirlinck @ 2006-03-04 23:16 UTC (permalink / raw)
  Cc: schwab, rms, emacs-devel

I believe that we should just decide whether there is a bug in the
regexp in question (which seems _nearly_ certain) and correct it if
so.  For people who use the Java debugger jdb (I do not know Java),
I summarize the problem, so there is no need to read through any of
the prior postings in this thread.

The regexp in question occurs in `gud-jdb-marker-filter' on line 2155
of progmodes/gud.el and is:

     "\\(\[[0-9]+\] \\)*\\([a-zA-Z0-9.$_]+\\)\\.[a-zA-Z0-9$_<>(),]+ \
\\(([a-zA-Z0-9.$_]+:\\|line=\\)\\([0-9.,]+\\)"

The problem is limited to the \\(\[[0-9]+\] \\)* part at the beginning.
According to the Change Logs, this part _seems_ to be used to
search/detect classpath information in jdb's output.

The regexp as given is valid.  But it looks like \\(\[[0-9]+\] \\)*
was actually meant to be \\(\\[[0-9]+] \\)*, since the author
seemingly forgot to double up `\' for Lisp syntax.

I do not know Java, so I have no way of knowing what the correct
syntax is.

According to the current regexp, it consists of something that looks
like a sequence of integers written in base 11, where `[' bizarrely
stands for ten, separated and terminated by "] ".

The "obvious" correction \\(\\[[0-9]+] \\)* looks for a bunch of
decimal digits enclosed in square brackets separated by a space,
like "[1276] [0] ".
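
As a rough illustration of the difference, the two candidate regexps
can be hand-translated into Python's `re' syntax.  The translation is
an assumption of this sketch, not something mechanical: Emacs writes
groups as \( \) and treats `\' inside a character alternative as
ordinary, while Python does neither.  The sample strings are made up.

```python
import re

# Hand-translated Python equivalents of the prefix part only.
# As Emacs currently reads it:  \([[0-9]+] \)*   ("\[" collapsed to "[")
AS_WRITTEN = re.compile(r"(?:[\[0-9]+\] )*")
# The "obvious" correction:     \(\[[0-9]+] \)*
CORRECTED = re.compile(r"(?:\[[0-9]+\] )*")


def accepts(pattern, text):
    """Return True if PATTERN matches the whole of TEXT."""
    return pattern.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in ["[1276] [0] ", "12[7] "]:
        print(repr(sample), accepts(AS_WRITTEN, sample), accepts(CORRECTED, sample))
```

Both forms accept the intended "[n] " prefixes; the form as written
additionally accepts runs such as "12[7] ", which is presumably why
the slip went unnoticed.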

It seems that we should make the "obvious" correction, but it would
nevertheless be good if somebody who knows the syntax could confirm
this.

Sincerely,

Luc.


* Re: Unquoted special characters in regexps
  2006-03-03 10:25                             ` Richard Stallman
  2006-03-03 15:51                               ` martin rudalics
@ 2006-03-05  2:54                               ` Luc Teirlinck
  2006-03-06  0:49                                 ` Richard Stallman
  1 sibling, 1 reply; 81+ messages in thread
From: Luc Teirlinck @ 2006-03-05  2:54 UTC (permalink / raw)
  Cc: rudalics, emacs-devel, schwab

Richard Stallman wrote:

   \[ and \] in Lisp strings are equivalent to just [ and just ].  So I
   think the current value is incorrect, and the [ needs to have \\ before it.

Since I sent my previous message I noticed from the comment and the
code following the regexp, that "\\(\\[[0-9]+] \\)*" is the only
possible interpretation.  The comment and code are _really_ looking
for a sequence of the type "[123] [3] ":

      ;; A good marker is one that:
      ;; 1) does not have a "[n] " prefix (not part of a stack backtrace)
      ;; 2) does have an "[n] " prefix and n is the lowest prefix seen
      ;;    since the last prompt

So I believe that we just should go ahead and change "\\(\[[0-9]+\] \\)*"
to "\\(\\[[0-9]+] \\)*".  I can do this, if desired.
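
The effect of the missing backslash can be sketched with a toy model
of how a Lisp string literal is read.  This is a deliberately
simplified assumption, written in Python: it only covers `\\' and a
backslash before ordinary punctuation, and ignores escapes such as
`\n' or `\t' that the real Lisp reader handles.  The function name is
hypothetical.

```python
def lisp_string_value(source):
    """Toy model: drop one level of backslash quoting from SOURCE.

    SOURCE is the text between the double quotes of a Lisp string
    literal.  Only "\\\\" and backslash-before-punctuation are modelled.
    """
    out = []
    i = 0
    while i < len(source):
        if source[i] == "\\" and i + 1 < len(source):
            out.append(source[i + 1])  # "\[" -> "[", "\\" -> "\"
            i += 2
        else:
            out.append(source[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    # What the regexp compiler sees for the current and the proposed source.
    print(lisp_string_value(r"\\(\[[0-9]+\] \\)*"))
    print(lisp_string_value(r"\\(\\[[0-9]+] \\)*"))
```

Under this model the current source text reaches the regexp compiler
with the first `[' unquoted, so it opens a character alternative,
while the proposed text keeps the `\' in front of it.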

Sincerely,

Luc.


* Re: Unquoted special characters in regexps
  2006-03-04 21:07                         ` Thien-Thi Nguyen
@ 2006-03-05  3:37                           ` Luc Teirlinck
  2006-03-05 11:10                             ` martin rudalics
  2006-03-05 11:54                             ` martin rudalics
  0 siblings, 2 replies; 81+ messages in thread
From: Luc Teirlinck @ 2006-03-05  3:37 UTC (permalink / raw)
  Cc: emacs-devel

Thien-Thi Nguyen wrote:

   whether or not the delimiter itself is considered inside or outside
   depends on whether your pov tends to be forward- or backward-looking
   (which is a personal choice, and thus, algorithmically irrelevant).

No, both the notion of context and the forward-looking view are
algorithmically _very_ relevant.

If you consider in "[a]b]" the first and the second `]' to be _both_
inside or _both_ outside the context of a character alternative, then
it would be impossible to determine solely from that notion of context
which of the two `]' has to be taken literally.  If you consider the
opening and ending " of a string to be _both_ inside or _both_ outside
the context of a string, then it would be impossible from that notion
of context to determine which " open and which " close strings.

Thus any such notions of context are useless.

On the other hand the regexp compiler uses the notion
of context I mentioned to determine which `[' or `]' are to be
interpreted literally.  It is also how other parsers determine which "
open strings and which close them.  Hence, that notion of context is
useful, in fact, necessary.

Also, forward and backward views of a regexp are not
algorithmically equivalent.  If you read a regexp forward, you know
immediately when you encounter a character whether it has to be taken
literally or not (or at worst after a _very_ limited number of
characters, as the second `[' in "[[:...").  If you read the regexp
backward, you may have to read all the way back to the beginning
before you can be sure that a `]' is to be taken literally.

Hence, reading a regexp forward _is_ algorithmically _very_ superior
over reading it backward if your purpose is to understand the regexp.
I must admit, however, that if what you want is to uncover the subliminal
satanic messages in the regexp, then you _have_ to read it backward.

Sincerely,

Luc.


* Re: Unquoted special characters in regexps
  2006-03-05  3:37                           ` Luc Teirlinck
@ 2006-03-05 11:10                             ` martin rudalics
  2006-03-05 15:32                               ` Luc Teirlinck
  2006-03-05 17:04                               ` Luc Teirlinck
  2006-03-05 11:54                             ` martin rudalics
  1 sibling, 2 replies; 81+ messages in thread
From: martin rudalics @ 2006-03-05 11:10 UTC (permalink / raw)
  Cc: ttn, emacs-devel

Luc Teirlinck wrote:
 > If you consider in "[a]b]" the first and the second `]' to be _both_
 > inside or _both_ outside the context of a character alternative, then
 > it would be impossible to determine solely from that notion of context
 > which of the two `]' has to be taken literally.

That's what I don't get tired of saying for one week already.  You
always denied it by saying things like

    The special meaning of `]' inside a character alternative is
    obviously to close that alternative.

and

    `]' has the special meaning of closing a character alternative
    _inside_ a character alternative

If the closing `]' is inside the alternative where does the first `]' in
"[a]b]" go?

 > If you consider the
 > opening and ending " of a string to be _both_ inside or _both_ outside
 > the context of a string, then it would be impossible from that notion
 > of context to determine which " open and which " close strings.

You're cheating here: The double-quote opening a string compares to the
_left_ bracket opening a character alternative.  The double-quote
closing a string compares to the _right_ bracket closing a character
alternative.


* Re: Unquoted special characters in regexps
  2006-03-05  3:37                           ` Luc Teirlinck
  2006-03-05 11:10                             ` martin rudalics
@ 2006-03-05 11:54                             ` martin rudalics
  2006-03-05 15:35                               ` Andreas Schwab
  2006-03-05 18:36                               ` Luc Teirlinck
  1 sibling, 2 replies; 81+ messages in thread
From: martin rudalics @ 2006-03-05 11:54 UTC (permalink / raw)
  Cc: ttn, emacs-devel

Luc Teirlinck wrote:
 > Also, forward and backward views of a regexp are not
 > algorithmically equivalent.  If you read a regexp forward, you know
 > immediately when you encounter a character whether it has to be taken
 > literally or not (or at worst after a _very_ limited number of
 > characters, as the second `[' in "[[:...").  If you read the regexp
 > backward, you may have to read all the way back to the beginning
 > before you can be sure that a `]' is to be taken literally.

How do you read the following regexp from `cc-langs.el'?

(concat
  "\\("
  "[\)\[\(]"
  (if (c-lang-const c-type-modifier-kwds)
      (concat
       "\\|"
       ;; "throw" in `c-type-modifier-kwds' is followed
       ;; by a parenthesis list, but no extra measures
       ;; are necessary to handle that.
       (regexp-opt (c-lang-const c-type-modifier-kwds) t)
       "\\>")
    "")
  "\\)")

Do you really evaluate the (c-lang-const ...)s _before_ looking at the
closing `\\)'?  What would you do if the value of `c-type-modifier-kwds'
were available at run-time only?

When trying to understand such regexps I break them up into parts first.
Such parts are, in my understanding, groups like `\\(...\\)',
subexpressions delimited by `\\|', and character alternatives.  Next I
try to understand the parts that interest me without paying notice to
parts that do not relate to my specific problem.  And I would have
troubles to isolate a character alternative when the author matches a
literal right bracket with `]'.

People can make reading a regexp truly awkward by writing kludgy
expressions like

(let ((keywords (concat "\\([;(){}`|&]\\|^\\)[ \t]*\\(\\("
			(regexp-opt (sh-feature sh-leading-keywords) t)
			"[ \t]+\\)?"
			(regexp-opt (append (sh-feature sh-leading-keywords)
					    (sh-feature sh-other-keywords))
				    t))))

in `sh-font-lock-keywords-1' which I understand correctly iff I read the
definition of the entire function first.  Such expressions are, however,
rare in present Emacs code.

 > Hence, reading a regexp forward _is_ algorithmically _very_ superior
 > over reading it backward if your purpose is to understand the regexp.

If my purpose is to understand how a regexp engine interprets a regexp,
reading a regexp forwardly is superior.  If, however, my purpose is to
understand a complex regexp I want to guess the author's intentions
first.  In that case I do want to break up the expression into its
constituents.  In general, languages hiding implementation details are
easier to use than languages that require users to know how specific
features are implemented.

 > I must admit, however, that if what you want is to uncover the subliminal
 > satanic messages in the regexp, then you _have_ to read it backward.

It's better to avoid "subliminal satanic messages" when _writing_ a
regexp.  It's bad if you have to uncover them when reading a regexp.


* Re: Unquoted special characters in regexps
  2006-03-05 11:10                             ` martin rudalics
@ 2006-03-05 15:32                               ` Luc Teirlinck
  2006-03-06  7:41                                 ` martin rudalics
  2006-03-05 17:04                               ` Luc Teirlinck
  1 sibling, 1 reply; 81+ messages in thread
From: Luc Teirlinck @ 2006-03-05 15:32 UTC (permalink / raw)
  Cc: ttn, emacs-devel

Martin Rudalics wrote:

   Luc Teirlinck wrote:
    > If you consider in "[a]b]" the first and the second `]' to be _both_
    > inside or _both_ outside the context of a character alternative, then
    > it would be impossible to determine solely from that notion of context
    > which of the two `]' has to be taken literally.

   That's what I don't get tired of saying for one week already.  You
   always denied it by saying things like

       The special meaning of `]' inside a character alternative is
       obviously to close that alternative.

   and

       `]' has the special meaning of closing a character alternative
       _inside_ a character alternative

Look, I am getting tired of this endless yes-no discussion.  But you
have completely misunderstood everything I have been saying.  Let me
try once more to explain.

Figuring out whether a `]' has to be taken literally or not is a
completely trivial problem, but you are making it difficult on
yourself for counterproductive philosophical reasons.

Start at the beginning of the regexp.  `[' is special, `]' not,
because we are outside a character alternative.  After the first
unquoted `[' is read, which is special because it was typed outside a
character alternative, we are inside a character alternative. `[' is
no longer special, but `]' is (except immediately after the `[' or
"[^"), because we now are inside a character alternative.  After the
next `]' is read, which is special because it was typed inside a
character alternative, we are back outside a character alternative,
`[' is special, `]' not. To summarize, `]' is only special in a
character alternative, `[' is only special outside one.

Note how easy this is.  Unlike for, say \\( you do not even have to
keep track of which `[' matches which `]', because there is no
nesting.  All you need to keep track of is whether you are inside or
outside a character alternative.
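
The inside/outside rule just described can be sketched as a small
scanner.  It is written in Python purely as an illustration and is not
Emacs's actual regexp compiler; the function name and role labels are
made up, and character classes are assumed to be well-formed
`[:name:]' with a lowercase name.

```python
import re

# Assumption: a character class looks like "[:alpha:]" and only counts
# when it appears inside a character alternative.
CHAR_CLASS = re.compile(r"\[:[a-z]+:\]")


def classify_brackets(regexp):
    """Return (index, role) for each '[' and ']' in an Emacs-style REGEXP.

    Role is "opens", "closes" or "literal".  Brackets that belong to a
    character class such as [:alpha:] are skipped.
    """
    roles = []
    inside = False
    i = 0
    while i < len(regexp):
        c = regexp[i]
        if not inside:
            if c == "\\":
                # Outside an alternative, backslash quotes the next character.
                if i + 1 < len(regexp) and regexp[i + 1] in "[]":
                    roles.append((i + 1, "literal"))
                i += 2
                continue
            if c == "[":
                roles.append((i, "opens"))
                inside = True
                i += 1
                if i < len(regexp) and regexp[i] == "^":
                    i += 1
                # A ']' right after '[' or '[^' is taken literally.
                if i < len(regexp) and regexp[i] == "]":
                    roles.append((i, "literal"))
                    i += 1
                continue
            if c == "]":
                roles.append((i, "literal"))
            i += 1
        else:
            # Inside an alternative, backslash is not special.
            if c == "[":
                m = CHAR_CLASS.match(regexp, i)
                if m:
                    i = m.end()
                    continue
                roles.append((i, "literal"))
            elif c == "]":
                roles.append((i, "closes"))
                inside = False
            i += 1
    return roles


if __name__ == "__main__":
    for example in ["[a]b]", "[1[2]3]", "[^][]]"]:
        print(example, classify_brackets(example))
```

Only a single flag is carried along, with no stack and no matching of
pairs, which is the point being made about the absence of nesting.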

You are making things difficult by treating `[' and `]' in regexps as
if they had the usual open-close parentheses syntax, like \\( and \\).
They do *not* and that is the cause of all your misunderstandings.  In
"[1[2]3]" the first `]' closes the first `[' and "balance" makes no
sense for the other `[' and `]'.  If `[' and `]' had the usual open-close
parentheses syntax, the 2 would be inside a nested character
alternative, two levels deep.  But there is no such thing as nested
character alternatives, because, in regexps, `[' and `]' do not have
the usual open-close parentheses syntax (unlike, say, in Lisp vectors).

Sincerely,

Luc.


* Re: Unquoted special characters in regexps
  2006-03-05 11:54                             ` martin rudalics
@ 2006-03-05 15:35                               ` Andreas Schwab
  2006-03-06  8:19                                 ` martin rudalics
  2006-03-05 18:36                               ` Luc Teirlinck
  1 sibling, 1 reply; 81+ messages in thread
From: Andreas Schwab @ 2006-03-05 15:35 UTC (permalink / raw)
  Cc: ttn, Luc Teirlinck, emacs-devel

martin rudalics <rudalics@gmx.at> writes:

> And I would have troubles to isolate a character alternative when the
> author matches a literal right bracket with `]'.

A bracket expression always starts with an unquoted `['.  When looking at
a `]' you will never know whether it is part of a bracket expression
(independent of whether it is preceded by `\') without first determining
which syntax is currently active (inside or outside of a bracket
expression).

Andreas.

-- 
Andreas Schwab, SuSE Labs, schwab@suse.de
SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
PGP key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."


* Re: Unquoted special characters in regexps
  2006-03-05 11:10                             ` martin rudalics
  2006-03-05 15:32                               ` Luc Teirlinck
@ 2006-03-05 17:04                               ` Luc Teirlinck
  1 sibling, 0 replies; 81+ messages in thread
From: Luc Teirlinck @ 2006-03-05 17:04 UTC (permalink / raw)
  Cc: ttn, emacs-devel

Martin Rudalics wrote:

   Luc Teirlinck wrote:
    > If you consider in "[a]b]" the first and the second `]' to be _both_
    > inside or _both_ outside the context of a character alternative, then
    > it would be impossible to determine solely from that notion of context
    > which of the two `]' has to be taken literally.

   That's what I don't get tired of saying for one week already.  You
   always denied it by saying things like

I believe that you forgot to read the "If" that starts the passage you
quoted.  Hence your impression that I was contradicting myself.  I
consider the first `]' in "[a]b]" to be inside a character
alternative, the second one outside.  From that context I can
determine that the first `]' is special and the second one not.

Sincerely,

Luc.


* Re: Unquoted special characters in regexps
  2006-03-05 11:54                             ` martin rudalics
  2006-03-05 15:35                               ` Andreas Schwab
@ 2006-03-05 18:36                               ` Luc Teirlinck
  2006-03-05 19:14                                 ` Luc Teirlinck
  1 sibling, 1 reply; 81+ messages in thread
From: Luc Teirlinck @ 2006-03-05 18:36 UTC (permalink / raw)
  Cc: ttn, emacs-devel

Martin Rudalics wrote:

   If my purpose is to understand how a regexp engine interprets a regexp,
   reading a regexp forwardly is superior.

As Andreas already pointed out, there is _no_ way to determine whether
either a `]' _or_ a `\\]' has to be taken literally or closes a
character alternative without parsing the regexp forward from the start.

   In general, languages hiding implementation details are
   easier to use than languages that require users to know how specific
   features are implemented.

But if you _are_ using a language that requires parsing forward from the
beginning for correct understanding, like regexps, then _pretending_
that you are using some other type of language is not going to help.

Sincerely,

Luc.


* Re: Unquoted special characters in regexps
  2006-03-05 18:36                               ` Luc Teirlinck
@ 2006-03-05 19:14                                 ` Luc Teirlinck
  2006-03-06  8:17                                   ` martin rudalics
  0 siblings, 1 reply; 81+ messages in thread
From: Luc Teirlinck @ 2006-03-05 19:14 UTC (permalink / raw)
  Cc: rudalics, emacs-devel, ttn

From my previous reply:

   As Andreas already pointed out, there is _no_ way to determine whether
   either a `]' _or_ a `\\]' has to be taken literally or closes a
   character alternative without parsing the regexp forward from the start.

Well, _in certain cases_, you might be able to determine it sooner by
parsing backward from the `]'.  If you see a `]' or "\\]" you know it
has to be taken literally.  (Note: the "\\" are irrelevant to the
question again.)  But then you do not know whether that earlier `]' or
"\\]" has to be taken literally, without keeping going.  If you see an
unquoted [, you know that the `]' closes a character alternative, but
you still do not know whether that `[' is opening that character
alternative or has to be taken literally, as in "[asd[fgh]".  If you
encounter another `[' next you know that either that one opens your
character alternative _or_ that the regexp was very poorly written.

But there definitely are many cases where you would have to parse back
all the way to the beginning.

Sincerely,

Luc.


* Re: Unquoted special characters in regexps
  2006-03-04 14:40                                   ` martin rudalics
@ 2006-03-06  0:48                                     ` Richard Stallman
  0 siblings, 0 replies; 81+ messages in thread
From: Richard Stallman @ 2006-03-06  0:48 UTC (permalink / raw)
  Cc: schwab, teirllm, emacs-devel

      For the new version `regexp-opt' would generate `\\[bar]'.
    The resulting expression would read as `\\[foo\\]\\[bar]' which is
    confusing since two different styles are involved.  The user would have
    to manually change `\\]' to `]' (or `]' to `\\]') to get a uniform
    appearance.

That seems like no big deal.


* Re: Unquoted special characters in regexps
  2006-03-05  2:54                               ` Luc Teirlinck
@ 2006-03-06  0:49                                 ` Richard Stallman
  0 siblings, 0 replies; 81+ messages in thread
From: Richard Stallman @ 2006-03-06  0:49 UTC (permalink / raw)
  Cc: rudalics, schwab, emacs-devel

    So I believe that we just should go ahead and change "\\(\[[0-9]+\] \\)*"
    to "\\(\\[[0-9]+] \\)*".  I can do this, if desired.

Please do.


* Re: Unquoted special characters in regexps
  2006-03-05 15:32                               ` Luc Teirlinck
@ 2006-03-06  7:41                                 ` martin rudalics
  0 siblings, 0 replies; 81+ messages in thread
From: martin rudalics @ 2006-03-06  7:41 UTC (permalink / raw)
  Cc: ttn, emacs-devel

 > Note how easy this is.  Unlike for, say \\( you do not even have to
 > keep track of which `[' matches which `]', because there is no
 > nesting.  All you need to keep track of is whether you are inside or
 > outside a character alternative.

I do not have any problems matching `[' with `]' when regexps are
written cleanly.  I do have problems when `]', `\]', or `\\]' get mixed
up as in the `gud-jdb-marker-filter' bug.

 > You are making things difficult by treating `[' and `]' in regexps as
 > if they had the usual open-close parentheses syntax, like \\( and \\).
 > They do *not* and that is the cause of all your misunderstandings.  In
 > "[1[2]3]" the first `]' closes the first `[' and "balance" makes no
 > sense for the other `[' and `]'.  If `[' and `]' had the usual open-close
 > parentheses syntax, the 2 would be inside a nested character
 > alternative, two levels deep.  But there is no such thing as nested
 > character alternatives, because, in regexps, `[' and `]' do not have
 > the usual open-close parentheses syntax (unlike, say, in Lisp vectors).

We have been comparing character alternatives with strings.  Elisp
strings don't nest.


* Re: Unquoted special characters in regexps
  2006-03-05 19:14                                 ` Luc Teirlinck
@ 2006-03-06  8:17                                   ` martin rudalics
  0 siblings, 0 replies; 81+ messages in thread
From: martin rudalics @ 2006-03-06  8:17 UTC (permalink / raw)
  Cc: ttn, emacs-devel

 > But there definitely are many cases where you would have to parse back
 > all the way to the beginning.

I don't want to parse regexps, neither forward nor backward.  I want to
understand what the author of the expression intended to match.  For
that purpose I try to extract familiar patterns from the expression.
Parsing a regexp in order to show that it's wrong or doesn't match what
it should doesn't make sense for most human beings.  The regexp engine
can do that much better.

Experienced programmers like you mentally parse complicated regexps from
beginning to end.  Experienced programmers occasionally forget that less
experienced programmers are not able to do that.


* Re: Unquoted special characters in regexps
  2006-03-05 15:35                               ` Andreas Schwab
@ 2006-03-06  8:19                                 ` martin rudalics
  0 siblings, 0 replies; 81+ messages in thread
From: martin rudalics @ 2006-03-06  8:19 UTC (permalink / raw)
  Cc: ttn, Luc Teirlinck, emacs-devel

> A bracket expression always starts with an unquoted `['.  When looking at
> a `]' you will never know whether it is part of a bracket expression
> (independent of whether it is preceded by `\') without first determining
> which syntax is currently active (inside or outside of a bracket
> expression).

Agreed.  When looking at a single isolated character I can never tell
whether it's inside a bracket expression or not.  However, I'd like to
determine whether it is before having to read an entire regexp from
beginning to end.  For that I want all the syntactic help I can get.


* Re: Unquoted special characters in regexps
  2006-02-28  0:59                       ` Luc Teirlinck
@ 2006-03-06 12:52                         ` Richard Stallman
  2006-03-07  5:52                           ` Luc Teirlinck
  0 siblings, 1 reply; 81+ messages in thread
From: Richard Stallman @ 2006-03-06 12:52 UTC (permalink / raw)
  Cc: rudalics, schwab, emacs-devel

The basic concept of a character class is an entity surrounded by
matching parentheses.  However, quirks such as quoting make it
necessary to understand the construct in terms of left-to-right
parsing for complete understanding of the details.

I think the manual needs to explain both levels--the first level so
beginners can begin to understand, and the second level for precise
thinking about counterintuitive regexps.

I could certainly do that, but I am terribly overloaded.  Would
someone else like to try it?

Meanwhile, I sure wish the quoting conventions for regexps
were more rational.  But that would be an incompatible change
and I think the minuses will always outweigh the pluses.


* Re: Unquoted special characters in regexps
  2006-03-06 12:52                         ` Richard Stallman
@ 2006-03-07  5:52                           ` Luc Teirlinck
  2006-03-07  8:53                             ` martin rudalics
  0 siblings, 1 reply; 81+ messages in thread
From: Luc Teirlinck @ 2006-03-07  5:52 UTC (permalink / raw)
  Cc: rudalics, schwab, emacs-devel

Richard Stallman wrote:

   I think the manual needs to explain both levels--the first level so
   beginners can begin to understand, and the second level for precise
   thinking about counterintuitive regexps.

   I could certainly do that, but I am terribly overloaded.  Would
   someone else like to try it?

What about the following patch, which I can install if desired?

It includes one unrelated change dealing with a problem I noticed in
the process.  It moves a paragraph occurring currently in the
description of `*' to the description of `+'.  (Although, from diff's
perspective, it instead moves the definition of `+' up till before
that paragraph.  Everything is relative, I guess.)  The reason is that
the paragraph discusses the regexp "\(x+y*\)*a" before the meaning of
`+' is explained.  This makes `x+y' look like the sum of x and y.
Also the remarks in the paragraph apply to both `*' and `+'.

===File ~/searching.texi-diff===============================
*** searching.texi	06 Feb 2006 16:02:08 -0600	1.68
--- searching.texi	06 Mar 2006 23:47:42 -0600	
***************
*** 235,246 ****
  
    Regular expressions have a syntax in which a few characters are
  special constructs and the rest are @dfn{ordinary}.  An ordinary
! character is a simple regular expression that matches that character and
! nothing else.  The special characters are @samp{.}, @samp{*}, @samp{+},
! @samp{?}, @samp{[}, @samp{]}, @samp{^}, @samp{$}, and @samp{\}; no new
! special characters will be defined in the future.  Any other character
! appearing in a regular expression is ordinary, unless a @samp{\}
! precedes it.
  
    For example, @samp{f} is not a special character, so it is ordinary, and
  therefore @samp{f} is a regular expression that matches the string
--- 235,249 ----
  
    Regular expressions have a syntax in which a few characters are
  special constructs and the rest are @dfn{ordinary}.  An ordinary
! character is a simple regular expression that matches that character
! and nothing else.  The special characters are @samp{.}, @samp{*},
! @samp{+}, @samp{?}, @samp{[}, @samp{^}, @samp{$}, and @samp{\}; no new
! special characters will be defined in the future.  The character
! @samp{]} is special if it ends a character alternative (see later).
! The character @samp{-} is special inside a character alternative.  A
! @samp{[:} and balancing @samp{:]} enclose a character class inside a
! character alternative.  Any other character appearing in a regular
! expression is ordinary, unless a @samp{\} precedes it.
  
    For example, @samp{f} is not a special character, so it is ordinary, and
  therefore @samp{f} is a regular expression that matches the string
***************
*** 301,306 ****
--- 304,316 ----
  The next alternative is for @samp{a*} to match only two @samp{a}s.  With
  this choice, the rest of the regexp matches successfully.@refill
  
+ @item @samp{+}
+ @cindex @samp{+} in regexp
+ is a postfix operator, similar to @samp{*} except that it must match
+ the preceding expression at least once.  So, for example, @samp{ca+r}
+ matches the strings @samp{car} and @samp{caaaar} but not the string
+ @samp{cr}, whereas @samp{ca*r} matches all three strings.
+ 
  Nested repetition operators take a long time, or even forever, if they
  lead to ambiguous matching.  For example, trying to match the regular
  expression @samp{\(x+y*\)*a} against the string
***************
*** 311,323 ****
  it causes an infinite loop.  To avoid these problems, check nested
  repetitions carefully.
  
- @item @samp{+}
- @cindex @samp{+} in regexp
- is a postfix operator, similar to @samp{*} except that it must match
- the preceding expression at least once.  So, for example, @samp{ca+r}
- matches the strings @samp{car} and @samp{caaaar} but not the string
- @samp{cr}, whereas @samp{ca*r} matches all three strings.
- 
  @item @samp{?}
  @cindex @samp{?} in regexp
  is a postfix operator, similar to @samp{*} except that it must match the
--- 321,326 ----
***************
*** 468,473 ****
--- 471,504 ----
  can act.  It is poor practice to depend on this behavior; quote the
  special character anyway, regardless of where it appears.@refill
  
+ As a @samp{\} is not special inside a character alternative, it can
+ never remove the special meaning of @samp{-} or @samp{]}.  So you
+ should not quote these characters when they have no special meaning
+ either.  This would not clarify anything, since backslashes can
+ legitimately precede these characters where they @emph{have} special
+ meaning, as in @code{[^\]} (@code{"[^\\]"} for Lisp string syntax),
+ which matches any single character except a backslash.
+ 
+ In practice, most @samp{]} that occur in regular expressions close a
+ character alternative and hence are special.  However, occasionally a
+ regular expression may try to match a complex pattern of literal
+ @samp{[} and @samp{]}.  In such situations, it sometimes may be
+ necessary to carefully parse the regexp from the start to determine
+ which square brackets enclose a character alternative.  For example,
+ @code{[^][]]} consists of the complemented character alternative
+ @code{[^][]}, which matches any single character that is not a square
+ bracket, followed by a literal @samp{]}.
+ 
+ The exact rules are that at the beginning of a regexp, @samp{[} is
+ special and @samp{]} is not.  This lasts until the first unquoted
+ @samp{[}, after which we are in a character alternative; @samp{[} is
+ no longer special (except if it starts a character class) but @samp{]}
+ is special, unless it immediately follows the special @samp{[} or that
+ @samp{[} followed by a @samp{^}.  This lasts until the next special
+ @samp{]} that does not end a character class.  This ends the character
+ alternative and restores the ordinary syntax of regular expressions;
+ an unquoted @samp{[} is special again and a @samp{]} is not.
+ 
  @node Char Classes
  @subsubsection Character Classes
  @cindex character classes in regexp
***************
*** 740,747 ****
  
  @kindex invalid-regexp
    Not every string is a valid regular expression.  For example, a string
! with unbalanced square brackets is invalid (with a few exceptions, such
! as @samp{[]]}), and so is a string that ends with a single @samp{\}.  If
  an invalid regular expression is passed to any of the search functions,
  an @code{invalid-regexp} error is signaled.
  
--- 771,778 ----
  
  @kindex invalid-regexp
    Not every string is a valid regular expression.  For example, a string
! that ends inside a character alternative without terminating @samp{]}
! is invalid, and so is a string that ends with a single @samp{\}.  If
  an invalid regular expression is passed to any of the search functions,
  an @code{invalid-regexp} error is signaled.
  
============================================================
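
The "exact rules" paragraph in the patch above describes what is in effect
a small state machine.  As a rough illustration only, here is a sketch in
Python (not Emacs Lisp, and not the code Emacs actually uses) that walks a
regexp and reports which square brackets are special under those rules.
The function name special_brackets is hypothetical, and the handling of
character classes is a simplifying assumption: any "[:" inside a character
alternative that has a later ":]" is treated as a class, without checking
the class name.  A trailing backslash is not diagnosed.

```python
def special_brackets(regexp):
    """Return the indices of the square brackets that are special in REGEXP.

    Follows the rules stated in the patch text: outside a character
    alternative, an unquoted '[' is special and ']' is not; inside one,
    backslash is not special, and ']' is special unless it immediately
    follows the opening '[' or '[^', or closes a character class.
    """
    special = []
    i, n = 0, len(regexp)
    while i < n:
        ch = regexp[i]
        if ch == "\\":
            i += 2          # outside an alternative, '\' quotes the next char
            continue
        if ch != "[":
            i += 1          # this includes ']', which is ordinary here
            continue
        special.append(i)   # a special '[' opens a character alternative
        i += 1
        if i < n and regexp[i] == "^":
            i += 1          # complemented alternative
        if i < n and regexp[i] == "]":
            i += 1          # ']' right after '[' or '[^' is literal
        while True:
            if i >= n:
                raise ValueError("regexp ends inside a character alternative")
            if regexp.startswith("[:", i):
                end = regexp.find(":]", i + 2)
                if end != -1:   # assumption: any '[:' ... ':]' is a class
                    i = end + 2
                    continue
            if regexp[i] == "]":
                special.append(i)   # this ']' ends the alternative
                i += 1
                break
            i += 1
    return special
```

For the example from the patch, special_brackets("[^][]]") reports the
brackets at positions 0 and 4 as special, so the final "]" is a literal,
in line with the description above.  A string such as "[a" raises
ValueError, corresponding to the revised invalid-regexp wording.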


* Re: Unquoted special characters in regexps
  2006-03-07  5:52                           ` Luc Teirlinck
@ 2006-03-07  8:53                             ` martin rudalics
  0 siblings, 0 replies; 81+ messages in thread
From: martin rudalics @ 2006-03-07  8:53 UTC (permalink / raw)
  Cc: schwab, rms, emacs-devel

Luc Teirlinck wrote:
 > What about the following patch, which I can install if desired?

The patch is logically consistent and it's probably reasonable to close
this issue now.  If the patch is installed, a similar one will have to
be written for the Emacs manual.


Thread overview: 81+ messages
2006-02-25 17:23 Unquoted special characters in regexps martin rudalics
2006-02-25 18:42 ` Andreas Schwab
2006-02-25 19:18   ` martin rudalics
2006-02-25 19:31     ` Andreas Schwab
2006-02-25 20:18       ` martin rudalics
2006-02-25 22:09         ` Andreas Schwab
2006-02-26 11:32           ` martin rudalics
2006-02-26 11:50             ` Andreas Schwab
2006-02-26 13:28               ` martin rudalics
2006-02-25 22:13         ` Luc Teirlinck
2006-02-26 13:13           ` martin rudalics
2006-02-26 13:50             ` Andreas Schwab
2006-02-26 16:41               ` Luc Teirlinck
2006-02-26 17:53                 ` martin rudalics
2006-02-26 18:22                   ` Luc Teirlinck
2006-02-26 19:26                     ` martin rudalics
2006-02-26 17:10               ` martin rudalics
2006-02-26 17:42                 ` Luc Teirlinck
2006-02-26 19:06                   ` martin rudalics
2006-02-26 17:56                 ` Andreas Schwab
2006-02-26 19:08                   ` martin rudalics
2006-02-27 19:03                     ` Richard Stallman
2006-02-27 19:36                       ` Andreas Schwab
2006-02-27 20:03                         ` martin rudalics
2006-02-27 20:32                           ` Andreas Schwab
2006-02-27 21:43                             ` martin rudalics
2006-02-27 22:11                               ` Andreas Schwab
2006-02-28  6:19                                 ` Richard Stallman
2006-02-28 10:28                                 ` martin rudalics
2006-02-28  0:30                       ` Luc Teirlinck
2006-02-28 10:27                         ` martin rudalics
2006-02-28 22:57                           ` Luc Teirlinck
2006-03-01 13:00                             ` martin rudalics
2006-03-01 17:54                         ` Richard Stallman
2006-03-02  4:06                           ` Luc Teirlinck
2006-03-02 19:43                             ` Richard Stallman
2006-03-02  4:54                           ` Luc Teirlinck
2006-03-02 18:40                           ` martin rudalics
2006-03-02 23:26                             ` Luc Teirlinck
2006-03-03  7:42                               ` martin rudalics
2006-03-03 13:51                                 ` Luc Teirlinck
2006-03-03 14:09                                 ` Luc Teirlinck
2006-03-03 18:52                                   ` martin rudalics
2006-03-03 22:41                                     ` Luc Teirlinck
2006-03-03 23:00                                     ` Luc Teirlinck
2006-03-03 10:25                             ` Richard Stallman
2006-03-03 15:20                               ` martin rudalics
2006-03-04 13:37                                 ` Richard Stallman
2006-03-04 14:40                                   ` martin rudalics
2006-03-06  0:48                                     ` Richard Stallman
2006-03-03 10:25                             ` Richard Stallman
2006-03-03 15:51                               ` martin rudalics
2006-03-03 23:48                                 ` Luc Teirlinck
2006-03-04  9:58                                   ` martin rudalics
2006-03-04 23:16                                 ` Luc Teirlinck
2006-03-05  2:54                               ` Luc Teirlinck
2006-03-06  0:49                                 ` Richard Stallman
2006-02-28  0:44                       ` Luc Teirlinck
2006-03-04 21:07                         ` Thien-Thi Nguyen
2006-03-05  3:37                           ` Luc Teirlinck
2006-03-05 11:10                             ` martin rudalics
2006-03-05 15:32                               ` Luc Teirlinck
2006-03-06  7:41                                 ` martin rudalics
2006-03-05 17:04                               ` Luc Teirlinck
2006-03-05 11:54                             ` martin rudalics
2006-03-05 15:35                               ` Andreas Schwab
2006-03-06  8:19                                 ` martin rudalics
2006-03-05 18:36                               ` Luc Teirlinck
2006-03-05 19:14                                 ` Luc Teirlinck
2006-03-06  8:17                                   ` martin rudalics
2006-02-28  0:59                       ` Luc Teirlinck
2006-03-06 12:52                         ` Richard Stallman
2006-03-07  5:52                           ` Luc Teirlinck
2006-03-07  8:53                             ` martin rudalics
2006-02-25 22:34         ` Luc Teirlinck
2006-02-25 22:59           ` Andreas Schwab
2006-02-26 13:20           ` martin rudalics
2006-02-26 16:53             ` Luc Teirlinck
2006-02-26 18:01               ` martin rudalics
2006-02-26 17:19             ` Luc Teirlinck
2006-02-26 18:13               ` martin rudalics

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git
