* Unquoted special characters in regexps
From: martin rudalics @ 2006-02-25 17:23 UTC

Section 34.3.1.1 (Special Characters in Regular Expressions) of the
Elisp manual says:

     *Please note:* For historical compatibility, special characters
     are treated as ordinary ones if they are in contexts where their
     special meanings make no sense.  ...  It is poor practice to
     depend on this behavior; quote the special character anyway,
     regardless of where it appears.

The three patches below eliminate spurious occurrences of such practice:

    * font-lock.el (lisp-font-lock-keywords-2)
    * emacs-lisp/rx.el (rx-check-any, rx-check-not)
    * generic-x.el (reg-generic-mode): Quote "]"s in regexps when
    they have no special meaning.

================================================================================
*** font-lock.el Wed Feb 1 10:17:44 2006
--- font-lock.el Thu Feb 16 20:24:48 2006
***************
*** 2120,2126 ****
         ;; Erroneous structures.
         ("(\\(abort\\|assert\\|warn\\|check-type\\|cerror\\|error\\|signal\\)\\>" 1 font-lock-warning-face)
         ;; Words inside \\[] tend to be for `substitute-command-keys'.
!        ("\\\\\\\\\\[\\(\\sw+\\)]" 1 font-lock-constant-face prepend)
         ;; Words inside `' tend to be symbol names.
         ("`\\(\\sw\\sw+\\)'" 1 font-lock-constant-face prepend)
         ;; Constant values.
--- 2120,2126 ----
         ;; Erroneous structures.
         ("(\\(abort\\|assert\\|warn\\|check-type\\|cerror\\|error\\|signal\\)\\>" 1 font-lock-warning-face)
         ;; Words inside \\[] tend to be for `substitute-command-keys'.
!        ("\\\\\\\\\\[\\(\\sw+\\)\\]" 1 font-lock-constant-face prepend)
         ;; Words inside `' tend to be symbol names.
         ("`\\(\\sw\\sw+\\)'" 1 font-lock-constant-face prepend)
         ;; Constant values.
================================================================================
*** rx.el Sat Nov 5 20:44:46 2005
--- rx.el Thu Feb 16 20:28:18 2006
***************
*** 371,378 ****
      (if (eq ?^ (aref arg 0))
          (setq arg (concat "\\" arg)))
      ;; Remove ] and set flag for adding it to start of overall result.
!     (when (string-match "]" arg)
!       (setq arg (replace-regexp-in-string "]" "" arg)
              rx-bracket "]")))
    (when (symbolp arg)
      (let ((translation (condition-case nil
--- 371,378 ----
      (if (eq ?^ (aref arg 0))
          (setq arg (concat "\\" arg)))
      ;; Remove ] and set flag for adding it to start of overall result.
!     (when (string-match "\\]" arg)
!       (setq arg (replace-regexp-in-string "\\]" "" arg)
              rx-bracket "]")))
    (when (symbolp arg)
      (let ((translation (condition-case nil
***************
*** 404,410 ****
  (defun rx-check-not (arg)
    "Check arg ARG for Rx `not'."
    (unless (or (and (symbolp arg)
!                    (string-match "\\`\\[\\[:[-a-z]:]]\\'"
                                   (condition-case nil
                                       (rx-to-string arg 'no-group)
                                     (error ""))))
--- 404,410 ----
  (defun rx-check-not (arg)
    "Check arg ARG for Rx `not'."
    (unless (or (and (symbolp arg)
!                    (string-match "\\`\\[\\[:[-a-z]:\\]\\]\\'"
                                   (condition-case nil
                                       (rx-to-string arg 'no-group)
                                     (error ""))))
================================================================================
*** generic-x.el Sat Nov 5 20:44:28 2005
--- generic-x.el Sat Feb 25 17:22:58 2006
***************
*** 433,439 ****
  (define-generic-mode reg-generic-mode
    '(?\;)
    '("key" "classes_root" "REGEDIT" "REGEDIT4")
!   '(("\\(\\[.*]\\)" 1 font-lock-constant-face)
      ("^\\([^\n\r]*\\)\\s-*=" 1 font-lock-variable-name-face))
    '("\\.[rR][eE][gG]\\'")
    (list
--- 433,439 ----
  (define-generic-mode reg-generic-mode
    '(?\;)
    '("key" "classes_root" "REGEDIT" "REGEDIT4")
!   '(("\\(\\[.*\\]\\)" 1 font-lock-constant-face)
      ("^\\([^\n\r]*\\)\\s-*=" 1 font-lock-variable-name-face))
    '("\\.[rR][eE][gG]\\'")
    (list
================================================================================

* Re: Unquoted special characters in regexps
From: Andreas Schwab @ 2006-02-25 18:42 UTC
Cc: emacs-devel

martin rudalics <rudalics@gmx.at> writes:

>     * font-lock.el (lisp-font-lock-keywords-2)
>     * emacs-lisp/rx.el (rx-check-any, rx-check-not)
>     * generic-x.el (reg-generic-mode): Quote "]"s in regexps when
>     they have no special meaning.

']' is not special in regexps.

Andreas.

-- 
Andreas Schwab, SuSE Labs, schwab@suse.de
SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
PGP key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5
"And now for something completely different."

* Re: Unquoted special characters in regexps
From: martin rudalics @ 2006-02-25 19:18 UTC
Cc: emacs-devel

Andreas Schwab wrote:
> martin rudalics <rudalics@gmx.at> writes:
>
>>     * font-lock.el (lisp-font-lock-keywords-2)
>>     * emacs-lisp/rx.el (rx-check-any, rx-check-not)
>>     * generic-x.el (reg-generic-mode): Quote "]"s in regexps when
>>     they have no special meaning.
>
> ']' is not special in regexps.
>
> Andreas.

From the Elisp manual:

34.3.1 Syntax of Regular Expressions
------------------------------------

Regular expressions have a syntax in which a few characters are special
constructs and the rest are "ordinary".  An ordinary character is a
simple regular expression that matches that character and nothing else.
The special characters are `.', `*', `+', `?', `[',
------> `]' <------, `^', `$', and `\'; no new special characters will
be defined in the future.  Any other character appearing in a regular
expression is ordinary, unless a `\' precedes it.

* Re: Unquoted special characters in regexps
From: Andreas Schwab @ 2006-02-25 19:31 UTC
Cc: emacs-devel

martin rudalics <rudalics@gmx.at> writes:

> From the Elisp manual:
>
> 34.3.1 Syntax of Regular Expressions
> ------------------------------------
>
> Regular expressions have a syntax in which a few characters are special
> constructs and the rest are "ordinary".  An ordinary character is a
> simple regular expression that matches that character and nothing else.
> The special characters are `.', `*', `+', `?', `[',
> ------> `]' <------, `^', `$', and

This is incorrect.  ']' is only special in a bracket expression (like
'-').

Andreas.

* Re: Unquoted special characters in regexps
From: martin rudalics @ 2006-02-25 20:18 UTC
Cc: emacs-devel

Andreas Schwab wrote:
> This is incorrect.  ']' is only special in a bracket expression (like
> '-').

`]' is _also_ special in a character alternative, like `^'.  `-' is
special _only_ in a character alternative.

* Re: Unquoted special characters in regexps
From: Andreas Schwab @ 2006-02-25 22:09 UTC
Cc: emacs-devel

martin rudalics <rudalics@gmx.at> writes:

> Andreas Schwab wrote:
>> This is incorrect.  ']' is only special in a bracket expression (like
>> '-').
>
> `]' is _also_ special in a character alternative,

A bracket expression has a completely different set of special characters.
For example, '\' and '$' are not special there.

> `-' is special _only_ in a character alternative.

Just like ']'.

Andreas.

* Re: Unquoted special characters in regexps
From: martin rudalics @ 2006-02-26 11:32 UTC
Cc: emacs-devel

>>>This is incorrect.  ']' is only special in a bracket expression (like
>>>'-').
>>
>>`]' is _also_ special in a character alternative,
>
> A bracket expression has a completely different set of special characters.
> For example, '\' and '$' are not special there.
>
>>`-' is special _only_ in a character alternative.
>
> Just like ']'.
>
> Andreas.

We would have to agree on the semantics of the term "special" first.  In
Elisp descriptions this term is overloaded.  Take, for example, the
following excerpt from the Elisp tutorial:

     Indeed more than one such mark or brace may precede the space.
     These require a expression that looks like this:

          []\"')}]*

     In this expression, the first `]' is the first character in the
     expression; the second character is `"', which is preceded by a
     `\' to tell Emacs the `"' is _not_ special.  The last three
     characters are `'', `)', and `}'.

It's confusing because we know from the Elisp manual that

     The special characters are `.', `*', `+', `?', `[', `]', `^', `$',
     and `\'; no new special characters will be defined in the future.

hence a double-quote is never "special" in terms of regexp semantics.
Why should we have to tell Emacs that it is "_not_ special" then?  The
answer is, obviously, that the Elisp read syntax for regexps is the
string data type and the tutorial's "special" indeed refers to string
semantics.  Hence when you say that "'\' and '$' are not special there"
you probably don't mean the special semantics of the backslash within
strings.

Now let's agree on the term "there".  Reasonably, "there" is the
sequence of characters obtained after stripping both the opening bracket
_and_ the closing bracket of a character alternative.  Otherwise, the
sentence from the Elisp manual

     "To include a `]' in a character alternative, you must make it the
     first character."

wouldn't make sense.  `]' is special inside a character alternative
because it may appear in one and only one position - namely the first.
And the semantics of the `]' in the first position is "match one `]'".
The semantics of an `]' closing a character alternative is completely
different from that.

From an operational point of view - that of the Elisp interpreter - you
_can_ say that `]' is not special in regexps.  If that's the preferred
point of view it's sufficient to remove `]' from the list of special
characters in the respective manuals and treat it like `-' as you
propose.  And, you wouldn't have to mention "poor practice" in the Elisp
manual at all - anything the Elisp interpreter interprets as intended by
the programmer would be valid.

There exists, however, a functional subset of Elisp (Elisp without
setqs, iterators, ...) amenable to mathematical reasoning like proving
correctness or validity of your code.  And mathematicians don't like to
reason about malformed constructs like "(a + 3" or "a + 3)".  They
prefer something like "a + 3" or "(a + 3)" instead.  They do rely on the
special semantics of "(" _and_ ")" within an expression.  Hence, saying
that `[' is special and `]' not would be tantamount to removing regexps
from the subset of Elisp amenable to mathematical reasoning.

* Re: Unquoted special characters in regexps
From: Andreas Schwab @ 2006-02-26 11:50 UTC
Cc: emacs-devel

martin rudalics <rudalics@gmx.at> writes:

> The answer is, obviously, that the Elisp read syntax for regexps is the

There is no such thing as a read syntax for regexps.  This is not a Lisp
data type.

Andreas.

* Re: Unquoted special characters in regexps
From: martin rudalics @ 2006-02-26 13:28 UTC
Cc: emacs-devel

>>The answer is, obviously, that the Elisp read syntax for regexps is the
>
> There is no such thing as a read syntax for regexps.  This is not a Lisp
> data type.
>
> Andreas.

From the Elisp manual:

     Therefore, the read syntax for a regular expression matching `\' is
     `"\\\\"'.

Please send your complaints to the writer of that sentence.

* Re: Unquoted special characters in regexps
From: Luc Teirlinck @ 2006-02-25 22:13 UTC
Cc: schwab, emacs-devel

Martin Rudalics wrote:

   `]' is _also_ special in a character alternative, like `^'.  `-' is
   special _only_ in a character alternative.

I may be overlooking something, but _which_ special meaning does `]'
have outside of character alternatives?

You are using \\] to quote `]'.  Could that possibly clear up any
confusion or does it just add confusion?  I personally believe the
latter.  Is there a situation where \\ can be used to prevent `]' from
having a special meaning?  In "[a\\]b]", the first `]' still has a
special meaning, even though there might be some optical illusion
making it look "quoted", the second `]' has no special meaning.

The only way I know of to put a "quoted" `]' in a character alternative
is to write it immediately after the `[' or "[^".  But "quoting" `]' by
writing "[]]" instead of just `]', seems contorted, even though it
would be less confusing than "\\]".

I believe that a `]' should not be quoted at all if it is outside a
character alternative, where it has no special meaning, unless I am
overlooking something.  (Just tell what, in that case.)

ELISP> (string-match "[a\\]b]" "]")
nil
ELISP> (string-match "[a\\]b]" "\\b")
nil
ELISP> (string-match "[a\\]b]" "\\b]")
0

* Re: Unquoted special characters in regexps
From: martin rudalics @ 2006-02-26 13:13 UTC
Cc: schwab, emacs-devel

> Martin Rudalics wrote:
>
>    `]' is _also_ special in a character alternative, like `^'.  `-' is
>    special _only_ in a character alternative.
>
> I may be overlooking something, but _which_ special meaning does `]'
> have outside of character alternatives?

That of closing a character alternative.  When you write

(defvar foo "]")
(defvar bar "\\]")

you can't interchangeably use `foo' and `bar' in an arbitrary regular
expression.  Some people call this "referential transparency".

> You are using \\] to quote `]'.  Could that possibly clear up any
> confusion or does it just add confusion?  I personally believe the
> latter.  Is there a situation where \\ can be used to prevent `]' from
> having a special meaning?  In "[a\\]b]", the first `]' still has a
> special meaning, even though there might be some optical illusion
> making it look "quoted", the second `]' has no special meaning.

According to the Elisp manual "[a\\]b]" is poor practice.  You probably
mean "[a\\]b\\]" here.  Anyway, the first `]' has a special meaning but
it's not "inside" the character alternative.  It does have a special
meaning because `]' is special _outside_ character alternatives.

> I believe that a `]' should not be quoted at all if it is outside a
> character alternative, where it has no special meaning, unless I am
> overlooking something.  (Just tell what, in that case.)

That of terminating a character alternative.

>
> ELISP> (string-match "[a\\]b]" "]")
> nil
> ELISP> (string-match "[a\\]b]" "\\b")
> nil
> ELISP> (string-match "[a\\]b]" "\\b]")
> 0

According to the Elisp manual all these exhibit "poor practice" since
you didn't quote the second `]'s.  You should have complained about that
when you read the manual.

* Re: Unquoted special characters in regexps
From: Andreas Schwab @ 2006-02-26 13:50 UTC
Cc: Luc Teirlinck, emacs-devel

martin rudalics <rudalics@gmx.at> writes:

> That of closing a character alternative.  When you write
>
> (defvar foo "]")
> (defvar bar "\\]")
>
> you can't interchangeably use `foo' and `bar' in an arbitrary regular
> expression.  Some people call this "referential transparency".

Of course you can't, since the meaning of '\' is context dependent.

> Anyway, the first `]' has a special meaning but it's not "inside" the
> character alternative.

It is part of it, just like the leading '['.

> It does have a special meaning because `]' is special _outside_
> character alternatives.

This is wrong.  Outside of a character set ']' has no special meaning
whatsoever, independent of the context.

> According to the Elisp manual all these exhibit "poor practice" since
> you didn't quote the second `]'s.

It's a bug in the manual.

Andreas.

* Re: Unquoted special characters in regexps
From: Luc Teirlinck @ 2006-02-26 16:41 UTC
Cc: rudalics, emacs-devel

Andreas Schwab wrote:

   > According to the Elisp manual all these exhibit "poor practice" since
   > you didn't quote the second `]'s.

   It's a bug in the manual.

I propose the following patch to lispref/searching.texi, which I can
install if desired.  I will wait till more people, in particular
Richard, have had an opportunity to see it.

Note that the current version already clearly states elsewhere that
`]' is special _inside_ character alternatives:

     Note that the usual regexp special characters are not special
     inside a character alternative.  A completely different set of
     characters is special inside character alternatives: `]', `-' and
     `^'.

Apart from correcting the bug we are discussing, it also corrects
another misstatement:

     For example, a string with unbalanced square brackets is invalid
     (with a few exceptions, such as `[]]'),

That is incorrect as the examples below show.

ELISP> (string-match "]]]]" "]]]]")
0
ELISP> (string-match "[[]" "[")
0

One correct way to restate it would be that a string whose square
brackets _with special meaning in the context in which they are used_
do not balance is invalid.  This would be (unless I overlook
something) without exceptions: in `[]]' the square brackets with
special meaning do balance.  In the patch below I formulated it
differently.

None of my previous mails to emacs-{devel,pretest-bug} in the last few
days have appeared on the list, so I wonder whether this one will.

===File ~/searching.texi-diff===============================
*** searching.texi 06 Feb 2006 16:02:08 -0600 1.68
--- searching.texi 26 Feb 2006 10:25:06 -0600
***************
*** 237,243 ****
  special constructs and the rest are @dfn{ordinary}.  An ordinary
  character is a simple regular expression that matches that character and
  nothing else.  The special characters are @samp{.}, @samp{*}, @samp{+},
! @samp{?}, @samp{[}, @samp{]}, @samp{^}, @samp{$}, and @samp{\}; no new
  special characters will be defined in the future.  Any other character
  appearing in a regular expression is ordinary, unless a @samp{\}
  precedes it.
--- 237,243 ----
  special constructs and the rest are @dfn{ordinary}.  An ordinary
  character is a simple regular expression that matches that character and
  nothing else.  The special characters are @samp{.}, @samp{*}, @samp{+},
! @samp{?}, @samp{[}, @samp{^}, @samp{$}, and @samp{\}; no new
  special characters will be defined in the future.  Any other character
  appearing in a regular expression is ordinary, unless a @samp{\}
  precedes it.
***************
*** 740,747 ****
  @kindex invalid-regexp
  Not every string is a valid regular expression.  For example, a string
! with unbalanced square brackets is invalid (with a few exceptions, such
! as @samp{[]]}), and so is a string that ends with a single @samp{\}.  If
  an invalid regular expression is passed to any of the search functions,
  an @code{invalid-regexp} error is signaled.
--- 740,747 ----
  @kindex invalid-regexp
  Not every string is a valid regular expression.  For example, a string
! that ends inside a character alternative without terminating @samp{]}
! is invalid, and so is a string that ends with a single @samp{\}.  If
  an invalid regular expression is passed to any of the search functions,
  an @code{invalid-regexp} error is signaled.
============================================================

* Re: Unquoted special characters in regexps
From: martin rudalics @ 2006-02-26 17:53 UTC
Cc: schwab, emacs-devel

> Apart from correcting the bug we are discussing, it also corrects
> another misstatement:
>
>      For example, a string with unbalanced square brackets is invalid
>      (with a few exceptions, such as `[]]'),
>
> That is incorrect as the examples below show.
>
> ELISP> (string-match "]]]]" "]]]]")
> 0
> ELISP> (string-match "[[]" "[")
> 0

Your example doesn't show that.  You reason about validity in presence
of a sentence like

     *Please note:* For historical compatibility, special characters
     are treated as ordinary ones if they are in contexts where their
     special meanings make no sense.

* Re: Unquoted special characters in regexps
From: Luc Teirlinck @ 2006-02-26 18:22 UTC
Cc: schwab, emacs-devel

Martin Rudalics wrote:

   > Apart from correcting the bug we are discussing, it also corrects
   > another misstatement:
   >
   >      For example, a string with unbalanced square brackets is invalid
   >      (with a few exceptions, such as `[]]'),
   >
   > That is incorrect as the examples below show.
   >
   > ELISP> (string-match "]]]]" "]]]]")
   > 0
   > ELISP> (string-match "[[]" "[")
   > 0

   Your example doesn't show that.  You reason about validity in presence
   of a sentence like

   *Please note:* For historical compatibility, special characters
   are treated as ordinary ones if they are in contexts where their
   special meanings make no sense.

You are confusing validity with meeting stylistic guidelines.  The
fact that string-match returns 0 instead of throwing an error shows
that the regexps are valid.  "*", "." and so on are valid regexps too,
even though they violate stylistic guidelines.  "[" is _not_ a valid
regexp.

You can change stylistic guidelines by making doc changes.  You can
only change what is valid by making code changes.

Sincerely,

Luc.
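
The operational notion of validity used in the message above (a regexp is valid if the search functions accept it without signaling `invalid-regexp') can be checked with a few lines of Elisp.  What follows is only a minimal sketch and not part of any patch in this thread; `my-regexp-valid-p' is a hypothetical helper name, and the sketch assumes nothing beyond `string-match' and `condition-case'.

```elisp
;; Hypothetical helper, not part of Emacs or of the patches in this thread.
(defun my-regexp-valid-p (regexp)
  "Return t if the regexp compiler accepts REGEXP, nil otherwise.
The match result itself is ignored; only an `invalid-regexp'
error counts as rejection."
  (condition-case nil
      (progn
        ;; Matching against the empty string is enough to force
        ;; compilation of REGEXP.
        (string-match regexp "")
        t)
    (invalid-regexp nil)))

;; The three regexps discussed in the message above:
(my-regexp-valid-p "]]]]")   ; accepted
(my-regexp-valid-p "[[]")    ; accepted
(my-regexp-valid-p "[")      ; rejected: the alternative is never closed
```

Such a check says nothing about the manual's style advice; it only reports what the regexp compiler of the running Emacs accepts.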

* Re: Unquoted special characters in regexps
From: martin rudalics @ 2006-02-26 19:26 UTC
Cc: schwab, emacs-devel

> You are confusing validity with meeting stylistic guidelines.  The
> fact that string-match returns 0 instead of throwing an error shows
> that the regexps are valid.  "*", "." and so on are valid regexps too,
> even though they violate stylistic guidelines.  "[" is _not_ a valid
> regexp.
>
> You can change stylistic guidelines by making doc changes.  You can
> only change what is valid by making code changes.

A stylistic guideline can tell me to write comments or documentation
strings in a particular way in order to improve their readability.  Even
if I don't follow the guidelines I'm confident that a future Elisp
interpreter will still execute my program correctly.

Validity of a regular expression is something more serious though.  If
someone decides one day that "*" is no more a regular expression the
interpreter should accept I have to rewrite my program.  And I couldn't
possibly complain - I've been warned.

You are confusing "validity" based on the implementation of a particular
interpreter with validity based on mathematical reasoning.

* Re: Unquoted special characters in regexps
From: martin rudalics @ 2006-02-26 17:10 UTC
Cc: Luc Teirlinck, emacs-devel

>>That of closing a character alternative.  When you write
>>
>>(defvar foo "]")
>>(defvar bar "\\]")
>>
>>you can't interchangeably use `foo' and `bar' in an arbitrary regular
>>expression.  Some people call this "referential transparency".
>
> Of course you can't, since the meaning of '\' is context dependent.

When you say that outside a character alternative `]' and `\\]' have the
same meaning you abandon the principle of referential transparency.

>>Anyway, the first `]' has a special meaning but it's not "inside" the
>>character alternative.
>
> It is part of it, just like the leading '['.

As a consequence, in your model "[" is a valid regexp too.

>>It does have a special meaning because `]' is special _outside_
>>character alternatives.
>
> This is wrong.  Outside of a character set ']' has no special meaning
> whatsoever, independent of the context.

On a similar footing you can say that `*' has no special meaning unless
it's preceded by a character.  Hence "*foo" is valid too in your model.

>>According to the Elisp manual all these exhibit "poor practice" since
>>you didn't quote the second `]'s.
>
> It's a bug in the manual.

Please fix it.

* Re: Unquoted special characters in regexps
From: Luc Teirlinck @ 2006-02-26 17:42 UTC
Cc: schwab, emacs-devel

Martin Rudalics wrote:

   >>Anyway, the first `]' has a special meaning but it's not "inside" the
   >>character alternative.
   >
   > It is part of it, just like the leading '['.

   As a consequence, in your model "[" is a valid regexp too.

What matters is the following.  If you type `[' it has a special
meaning if you type it _outside_ the _context_ of a character
alternative.  Its special meaning there is that it starts that
context.  Inside that special context, `[' has no special meaning.

On the other hand, if you type `]' it has no special meaning _unless_
you are in the _context_ of a character alternative.  Its special
meaning in that context is that it ends that context.

So `[' is special outside the context of a character alternative, but
not inside it, `]' is special inside that context, but not outside.

   > It's a bug in the manual.

   Please fix it.

I plan to do that.  I just want to wait to make sure that Richard agrees.

Sincerely,

Luc.
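
The context rule described in the message above can be tried out directly.  The following is a hedged sketch rather than anything posted in the thread: `my-bracket-demo' is a hypothetical name, and the behaviour it exercises is the one the participants describe, namely that `]' outside a character alternative matches itself with or without a backslash, that `]' written first inside an alternative is a member of it, and that `[' inside an alternative is ordinary.

```elisp
;; Hypothetical demo function; the name is not from the thread or from Emacs.
(defun my-bracket-demo ()
  "Return the match positions of four regexps involving square brackets."
  (list
   ;; `]' outside a character alternative, written unquoted.
   (string-match "]" "a]b")
   ;; The same `]' written with a quoting backslash.
   (string-match "\\]" "a]b")
   ;; `]' placed first inside a character alternative stands for itself.
   (string-match "[]x]" "a]b")
   ;; `[' inside a character alternative is an ordinary member.
   (string-match "[[]" "a[b")))
```

Evaluating `(my-bracket-demo)' should return `(1 1 1 1)': each regexp finds its bracket at index 1 of the test string, so the quoted and unquoted spellings of `]' behave identically outside an alternative.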

* Re: Unquoted special characters in regexps
From: martin rudalics @ 2006-02-26 19:06 UTC
Cc: schwab, emacs-devel

> What matters is the following.  If you type `[' it has a special
> meaning if you type it _outside_ the _context_ of a character
> alternative.  Its special meaning there is that it starts that
> context.  Inside that special context, `[' has no special meaning.
>
> On the other hand, if you type `]' it has no special meaning _unless_
> you are in the _context_ of a character alternative.  Its special
> meaning in that context is that it ends that context.
>
> So `[' is special outside the context of a character alternative, but
> not inside it, `]' is special inside that context, but not outside.

In mathematics `(3 + 4' is a silly expression just like `3 + 4)'.  In
Lisp `(+ 3 4' is invalid just like `+ 3 4)'.  A regular expression
interpretation machine shouldn't handle expressions differently.

By the way try to evaluate

(regexp-opt (list "foo]" "bar]"))

You want to patch `regexp-opt.el' too?

* Re: Unquoted special characters in regexps
From: Andreas Schwab @ 2006-02-26 17:56 UTC
Cc: Luc Teirlinck, emacs-devel

martin rudalics <rudalics@gmx.at> writes:

>> It is part of it, just like the leading '['.
>
> As a consequence, in your model "[" is a valid regexp too.

Where did I write that?  Please expand.

>> This is wrong.  Outside of a character set ']' has no special meaning
>> whatsoever, independent of the context.
>
> On a similar footing you can say that `*' has no special meaning unless
> it's preceded by a character.

Why do you think I would say that?  Please expand.

> Please fix it.

You are free to contribute patches.

Andreas.

* Re: Unquoted special characters in regexps
From: martin rudalics @ 2006-02-26 19:08 UTC
Cc: Luc Teirlinck, emacs-devel

>>>It is part of it, just like the leading '['.
>>
>>As a consequence, in your model "[" is a valid regexp too.
>
> Where did I write that?  Please expand.

You argue that `]' is part of a character alternative just like the
leading `['.  You further argue that "]" is a valid regular expression
outside a character alternative.  Hence, "[" must be a valid regular
expression outside a character alternative too.  Qed.

> You are free to contribute patches.

I certainly don't want to.  In my opinion the manual is correct here.

* Re: Unquoted special characters in regexps
From: Richard Stallman @ 2006-02-27 19:03 UTC
Cc: schwab, teirllm, emacs-devel

    You further argue that "]" is a valid regular expression outside a
    character alternative.

Strictly speaking, that is true, it is a valid regular expression.

    Hence, "[" must be a valid regular expression outside a character
    alternative too.

That doesn't follow.  Strictly speaking, "[" is not a valid regular
expression.

However, that doesn't necessarily mean the manual is wrong.  There is
more than one way to understand the word "special".  At the most
literal level, ] is not special; if you write it without \\, the
regexp compiler won't misunderstand it.

However, it does play a special role in the syntax of regexps,
and it is not necessarily a bad thing for users to think of it
as a special character.
* Re: Unquoted special characters in regexps 2006-02-27 19:03 ` Richard Stallman @ 2006-02-27 19:36 ` Andreas Schwab 2006-02-27 20:03 ` martin rudalics 2006-02-28 0:30 ` Luc Teirlinck ` (2 subsequent siblings) 3 siblings, 1 reply; 81+ messages in thread From: Andreas Schwab @ 2006-02-27 19:36 UTC (permalink / raw) Cc: martin rudalics, teirllm, emacs-devel Richard Stallman <rms@gnu.org> writes: > However, it does play a special role in the syntax of regexps, > and it is not necessarily a bad thing for users to think of it > as a special character. There are more characters like this, ie. ':' and '-', which both play a special role in character sets, but not elsewhere. Andreas. -- Andreas Schwab, SuSE Labs, schwab@suse.de SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany PGP key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5 "And now for something completely different." ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Unquoted special characters in regexps 2006-02-27 19:36 ` Andreas Schwab @ 2006-02-27 20:03 ` martin rudalics 2006-02-27 20:32 ` Andreas Schwab 0 siblings, 1 reply; 81+ messages in thread From: martin rudalics @ 2006-02-27 20:03 UTC (permalink / raw) Cc: teirllm, rms, emacs-devel >>However, it does play a special role in the syntax of regexps, >>and it is not necessarily a bad thing for users to think of it >>as a special character. > > > There are more characters like this, ie. ':' and '-', which both play a > special role in character sets, but not elsewhere. My Emacs has (regexp-quote "]") => "\\]" (regexp-quote "-") => "-" (regexp-quote ":") => ":" I do like my Emacs. ^ permalink raw reply [flat|nested] 81+ messages in thread
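The `regexp-quote' results for `-' and `:' quoted above are easy to reproduce. The result for `]' depends on whether the Emacs in use includes the search.c change discussed later in this thread, so the sketch below deliberately does not assert it and only checks the property that holds either way: the quoted regexp matches the original string literally.

```elisp
;; `-' and `:' only play a role inside a character alternative,
;; so `regexp-quote' returns them unchanged.
(regexp-quote "-")   ; "-"
(regexp-quote ":")   ; ":"

;; Whether or not `]' gets a backslash, the quoted regexp must match
;; the original string exactly, from start to end.
(let ((s "a]b-c:d"))
  (string-match-p (concat "\\`" (regexp-quote s) "\\'") s))   ; 0
```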
* Re: Unquoted special characters in regexps 2006-02-27 20:03 ` martin rudalics @ 2006-02-27 20:32 ` Andreas Schwab 2006-02-27 21:43 ` martin rudalics 0 siblings, 1 reply; 81+ messages in thread From: Andreas Schwab @ 2006-02-27 20:32 UTC (permalink / raw) Cc: teirllm, rms, emacs-devel martin rudalics <rudalics@gmx.at> writes: >>>However, it does play a special role in the syntax of regexps, >>>and it is not necessarily a bad thing for users to think of it >>>as a special character. >> There are more characters like this, ie. ':' and '-', which both play a >> special role in character sets, but not elsewhere. > > My Emacs has > > (regexp-quote "]") => "\\]" > (regexp-quote "-") => "-" > (regexp-quote ":") => ":" > > I do like my Emacs. Bugs can be fixed. Andreas. -- Andreas Schwab, SuSE Labs, schwab@suse.de SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany PGP key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5 "And now for something completely different." ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Unquoted special characters in regexps 2006-02-27 20:32 ` Andreas Schwab @ 2006-02-27 21:43 ` martin rudalics 2006-02-27 22:11 ` Andreas Schwab 0 siblings, 1 reply; 81+ messages in thread From: martin rudalics @ 2006-02-27 21:43 UTC (permalink / raw) Cc: teirllm, rms, emacs-devel > Bugs can be fixed. That was my intention. I'd fix some ten bugs (counting cc-awk.el). You have to fix a bit more. Good luck then. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Unquoted special characters in regexps 2006-02-27 21:43 ` martin rudalics @ 2006-02-27 22:11 ` Andreas Schwab 2006-02-28 6:19 ` Richard Stallman 2006-02-28 10:28 ` martin rudalics 0 siblings, 2 replies; 81+ messages in thread From: Andreas Schwab @ 2006-02-27 22:11 UTC (permalink / raw) Cc: teirllm, rms, emacs-devel martin rudalics <rudalics@gmx.at> writes: > That was my intention. I'd fix some ten bugs (counting cc-awk.el). > You have to fix a bit more. Good luck then. You are not to decide who is fixing bugs. Andreas. -- Andreas Schwab, SuSE Labs, schwab@suse.de SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany PGP key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5 "And now for something completely different." ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Unquoted special characters in regexps 2006-02-27 22:11 ` Andreas Schwab @ 2006-02-28 6:19 ` Richard Stallman 2006-02-28 10:28 ` martin rudalics 1 sibling, 0 replies; 81+ messages in thread From: Richard Stallman @ 2006-02-28 6:19 UTC (permalink / raw) Cc: rudalics, teirllm, emacs-devel > That was my intention. I'd fix some ten bugs (counting cc-awk.el). > You have to fix a bit more. Good luck then. You are not to decide who is fixing bugs. Martin was not really trying to give orders. He was urging people to do more bug-fixing. Everyone's welcome to do that. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Unquoted special characters in regexps 2006-02-27 22:11 ` Andreas Schwab 2006-02-28 6:19 ` Richard Stallman @ 2006-02-28 10:28 ` martin rudalics 1 sibling, 0 replies; 81+ messages in thread From: martin rudalics @ 2006-02-28 10:28 UTC (permalink / raw) Cc: teirllm, rms, emacs-devel >>That was my intention. I'd fix some ten bugs (counting cc-awk.el). >>You have to fix a bit more. Good luck then. > > > You are not to decide who is fixing bugs. If you read my mails in this thread you will see that I did not intend to fix any "bugs" in the first place. I just proposed to remove a few occurrences of "poor practice". The term "bug" was introduced by you in this thread and I have tried to use it in your sense. Apparently I failed to do so. If the text you cite above gives the impression that I wanted to decide on "what constitutes a bug" or "who should fix it", I apologize. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Unquoted special characters in regexps 2006-02-27 19:03 ` Richard Stallman 2006-02-27 19:36 ` Andreas Schwab @ 2006-02-28 0:30 ` Luc Teirlinck 2006-02-28 10:27 ` martin rudalics 2006-03-01 17:54 ` Richard Stallman 2006-02-28 0:44 ` Luc Teirlinck 2006-02-28 0:59 ` Luc Teirlinck 3 siblings, 2 replies; 81+ messages in thread From: Luc Teirlinck @ 2006-02-28 0:30 UTC (permalink / raw) Cc: rudalics, emacs-devel, schwab None of the messages I sent on this (or on anything else) in the last few days made it to emacs-devel, although all other people's responses did, be it after some delay. I just got messages saying that local delivery failed. So I will have to repeat some things that I already said before. Richard Stallman wrote: However, that doesn't necessarily mean the manual is wrong. There is more than one way to understand the word "special". At the most literal level, ] is not special; if you write it without \\, the regexp compiler won't misunderstand it. `]', like `-' are only special in the context of a character alternative, that is if, before you type them, you are in a character alternative. By contrast, `[' and all other special characters (except `^') are only special outside that context. All characters that are special outside character alternatives are never special if you precede them with a backslash. This is true even for `^'. This is why it is good to precede them with a backslash even if they are not special. That way, the reader can see that they are not special, without studying the regexp. On the other hand, a backslash, _never_ eliminates the special meaning of a `]' or `-' with a special meaning. There are two questions here. Whether a `]' outside a character alternative should be quoted or not and whether any changes to the Elisp manual are required. In this posting, I will only discuss the first. 
First of all, there are (surprisingly) many occurrences of "\\]" in the Emacs source, where the `]' _is_ special and closes a character alternative that contains a backslash. Reportedly quoting a `]' with a backslash _inside_ a character alternative works in some other regexp implementations such as AWK. So if I see "\\]" I have to worry about three possibilities: it might deliberately close a character alternative which includes a backslash, it might do so by accident because the author tried to quote a `]' inside a character alternative (and hence the regexp is buggy), or it might be a deliberately quoted `]' outside a character alternative. If I see `]' without preceding "\\", I only have to worry about whether or not it closes a character alternative, and not about the third possibility of a bug. In summary I believe that quoting a `]' outside a character alternative only adds clutter and a third possibility to worry about. There are places in the Emacs code that quote a `]' outside a character alternative. Even if we decide that this is undesirable, I do not fancy finding and changing them all. But we could change the behavior of `regexp-quote' and `regexp-opt' which currently quote such `]'. That could be done with the following trivial patch, which I could install if that is what we decide to do: ===File ~/search.c-diff===================================== *** search.c 06 Feb 2006 16:02:24 -0600 1.206 --- search.c 27 Feb 2006 00:16:42 -0600 *************** *** 3066,3072 **** for (; in != end; in++) { ! if (*in == '[' || *in == ']' || *in == '*' || *in == '.' || *in == '\\' || *in == '?' || *in == '+' || *in == '^' || *in == '$') --- 3066,3072 ---- for (; in != end; in++) { ! if (*in == '[' || *in == '*' || *in == '.' || *in == '\\' || *in == '?' || *in == '+' || *in == '^' || *in == '$') ============================================================ ^ permalink raw reply [flat|nested] 81+ messages in thread
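The three readings of "\\]" listed in the message above can be told apart by evaluation. The regexps below are made-up illustrations, not taken from the Emacs sources; they assume only that a backslash is an ordinary member inside a character alternative, as the Elisp manual states.

```elisp
;; 1. Deliberate: `[^\]' is "any character except a backslash".
(string-match-p "[^\\]" "\\a")      ; 1: skips the backslash, matches `a'

;; 2. Accidental: the author wanted the set {a, ], b}, but the
;;    alternative really is {a, \} followed by the literal text `b]'.
(string-match-p "[a\\]b]" "]")      ; nil
(string-match-p "[a\\]b]" "ab]")    ; 0

;; 3. Outside an alternative, `\]' and a bare `]' match the same thing.
(string-match-p "x\\]" "x]")        ; 0
(string-match-p "x]" "x]")          ; 0
```

Case 2 is the one that silently does something other than what was meant, which is why both sides of the thread care about how a reader is supposed to recognize it.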
* Re: Unquoted special characters in regexps 2006-02-28 0:30 ` Luc Teirlinck @ 2006-02-28 10:27 ` martin rudalics 2006-02-28 22:57 ` Luc Teirlinck 2006-03-01 17:54 ` Richard Stallman 1 sibling, 1 reply; 81+ messages in thread From: martin rudalics @ 2006-02-28 10:27 UTC (permalink / raw) Cc: schwab, rms, emacs-devel > `]', like `-' are only special in the context of a character > alternative, that is if, before you type them, you are in a character > alternative. By contrast, `[' and all other special characters > (except `^') are only special outside that context. You can talk about a context iff you are able to grammatically specify it. In order to talk about the contents of a string you must be able to determine the character sequences opening and closing strings. It would be strange to say, for example, that the double-quote opening an Elisp string is outside the context of the string and the double-quote that closes it inside. It would be strange to say that the bracket opening a character alternative is outside the context of the alternative and the closing bracket inside. > All characters that are special outside character alternatives are > never special if you precede them with a backslash. This is true even > for `^'. This is why it is good to precede them with a backslash even > if they are not special. That way, the reader can see that they are > not special, without studying the regexp. I agree. Let's try to read the following definition from `cc-fonts.el': (defconst autodoc-font-lock-doc-comments `(("@\\(\\w+{\\|\\[\\([^\]@\n\r]\\|@@\\)*\\]\\|[@}]\\|$\\)" ... It tells me that there are two character alternatives started by an unquoted `[' and terminated by an unquoted `]'. It also tells me that it's meant to match a bracketed expression as represented by `\\[' and `\\]' - I quickly exclude the possibility that the backslashes preceding any of these brackets are quoted backslashes in a character alternative. 
And, finally, the expression tells me that the author was probably uncertain about how to put a `]' inside a complemented character alternative, hence (s)he quoted it with a single backslash. In any case I have no difficulties reading the expression although I completely ignore its meaning. You propose to write (defconst autodoc-font-lock-doc-comments `(("@\\(\\w+{\\|\\[\\([^\]@\n\r]\\|@@\\)*]\\|[@}]\\|$\\)" ... instead. In that case, when I look at the character sequence `*]' I would have to consider the case that the `]' closes some character alternative. Only after I resolved that I would be able to say that the `]' should indeed match a right bracket. And I would still have to check whether the backslashes preceding the `\\[' are quoted backslashes in a character set. > First of all, there are (surprisingly) many occurrences of "\\]" in > the Emacs source, where the `]' _is_ special and closes a character > alternative that contains a slash. Reportedly quoting a `]' with a > backslash _inside_ a character alternative works in some other regexp > implementations such as AWK. So if I see "\\]" I have to worry about > three possibilities: it might deliberately close a character > alternative which includes a slash, it might do so by accident because > the author tried to quote a `]' inside a character alternative (and > hence the regexp is buggy), or it might be a deliberately quoted `]' > outside a character alternative. The Emacs manual clearly states that the backslash is not special in a character set. But I admit that users of other languages do have problems when writing Elisp regexps. That's why a clear and unambiguous definition of these concepts is important. > If I see `]' without preceding "\\", I only have to worry about > whether or not it closes a character alternative, and not about the > third possibility of a bug. When I try to read a regular expression I do not worry about the possibility of a bug in the first place. 
I try to understand what the author wanted to match. > There are places in the Emacs code that quote a `]' outside a > character alternative. Even if we decide that this is undesirable, I > do not fancy finding and changing them all. But we could change the > behavior of `regexp-quote' and `regexp-opt' which currently quote > such `]'. That could be done with the following trivial patch, which > I could install if that is what we decide to do: Given the amount of regular expressions users created with these functions and manually inserted in their code that would be confusing indeed. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Unquoted special characters in regexps 2006-02-28 10:27 ` martin rudalics @ 2006-02-28 22:57 ` Luc Teirlinck 2006-03-01 13:00 ` martin rudalics 0 siblings, 1 reply; 81+ messages in thread From: Luc Teirlinck @ 2006-02-28 22:57 UTC (permalink / raw) Cc: schwab, rms, emacs-devel Martin Rudalics wrote: It would be strange to say, for example, that the double-quote opening an Elisp string is outside the context of the string and the double-quote that closes it inside. I do not see why you consider this strange. Quite to the contrary, this is exactly what allows one to determine whether a `"' opens or closes a string. `"' is special both inside and outside the context of a string. But its special meaning depends on that context. Outside the context of a string, `"' starts a string; inside the context of a string, `"' ends a string. So an opening `"' is opening _because_ it occurs outside of a string context and the closing `"' is the closing one _because_ it occurs inside a string context. Note that the GNU regexp manual, node `(regex)List Operators' agrees with Andreas and me that `[' is special _outside_ a character alternative (by stating that it is ordinary inside one) and explicitly states that `]' has the special meaning of closing a character alternative _inside_ a character alternative. (Note that it refers to character alternatives as "lists".) Sincerely, Luc. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Unquoted special characters in regexps 2006-02-28 22:57 ` Luc Teirlinck @ 2006-03-01 13:00 ` martin rudalics 0 siblings, 0 replies; 81+ messages in thread From: martin rudalics @ 2006-03-01 13:00 UTC (permalink / raw) Cc: schwab, rms, emacs-devel > Martin Rudalics wrote: > > It would be strange to say, for example, that the double-quote > opening an Elisp string is outside the context of the string and > the double-quote that closes it inside. > > I do not see why you consider this strange. Quite to the contrary, > this is exactly what allows one to determine whether a `"' opens or > closes a string. `"' is special both inside and outside the context > of a string. But its special meaning depends on that context. > Outside the context of a string `"' starts a string, inside the > context of a string, `"' ends a string. So an opening `"' is opening > _because_ it occurs outside of a string context and the closing `"' is > the closing one _because_ it occurs inside a string context. > > Note that the GNU regexp manual, node `(regex)List Operators' agrees > with Andreas and me that `[' is special _outside_ a character alternative > (by stating that it is ordinary inside one) and explicitly states that > `]' has the special meaning of closing a character alternative > _inside_ a character alternative. (Note that it refers to character > alternatives as "lists".) If you refer to section "3.6 List Operators ([ ... ] and [^ ... ])" of the GNU regex manual I can extract three relevant sentences: "A matching list matches a single character represented by one of the list items. You form a matching list by enclosing one or more items within an open-matching-list operator (represented by `[') and a close-list operator (represented by `]')." If you deduce here that the "close-list operator" is part of the "items within" you can deduce that the "open-matching-list" operator is part of the "items within" as well. "`]' ends the list if it's not the first list item.
So, if you want to make the `]' character a list item, you must put it first." `]' is special inside a character list - the "items within" mentioned above - because it has to appear as the first element of that list. "`-' represents the range operator (see section 3.6.2 The Range Operator (-)) if it's not first or last in a list or the ending point of a range." If `-' can be "last in a list" the close-list operator `]' cannot be "last in that list". Ex falso sequitur quodlibet. If anyone's interested in how other languages handle regexp brackets see the list below: Perl's metacharacters are: { } [ ] ( ) ^ $ . | * + ? \ Python metacharacters are: . ^ $ * + ? { [ ] \ | ( ) PHP: Outside square brackets, the meta-characters are as follows: ... [ start character class definition ] end character class definition ... XML: A metacharacter is either ., \, ?, *, +, {, } (, ), [ or ]. Tcl: A regular expression uses metacharacters (characters that assume special meaning for matching other characters) such as *, [], $ and .. ... A backslash (\) disables the special meaning of the following character, so you could match the string [Hello] with the RE \[Hello\]. Java (http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html): Perl is forgiving about malformed matching constructs, as in the expression *a, as well as dangling brackets, as in the expression abc], and treats them as literals. Java also accepts dangling brackets but is strict about dangling metacharacters like +, ? and *, and will throw a PatternSyntaxException if it encounters them. Hence all classic regexp languages do consider `]' special and do not consider `-' special. The Java doc calls the `]' in `abc]' a dangling bracket. The fact that languages "forgive" or "accept" such constructs shouldn't cause anyone to promote such style. ^ permalink raw reply [flat|nested] 81+ messages in thread
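The placement rules quoted from the regex manual above (`]' must come first to be a list item, `-' is literal when first or last) behave the same way in Emacs regexps. A short sketch with made-up patterns:

```elisp
;; `]' placed first is a member of the alternative.
(string-match-p "[]a]" "]")      ; 0
;; In a complemented alternative it goes right after the `^'.
(string-match-p "[^]a]" "]")     ; nil: `]' is excluded
(string-match-p "[^]a]" "b")     ; 0

;; `-' is literal when first or last, a range operator otherwise.
(string-match-p "[-a]" "-")      ; 0
(string-match-p "[a-]" "-")      ; 0
(string-match-p "[a-c]" "b")     ; 0: `b' lies in the range a..c
(string-match-p "[a-c]" "-")     ; nil: here `-' only denotes the range
```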
* Re: Unquoted special characters in regexps 2006-02-28 0:30 ` Luc Teirlinck 2006-02-28 10:27 ` martin rudalics @ 2006-03-01 17:54 ` Richard Stallman 2006-03-02 4:06 ` Luc Teirlinck ` (2 more replies) 1 sibling, 3 replies; 81+ messages in thread From: Richard Stallman @ 2006-03-01 17:54 UTC (permalink / raw) Cc: rudalics, emacs-devel, schwab `]', like `-' are only special in the context of a character alternative, that is if, before you type them, you are in a character alternative. By contrast, `[' and all other special characters (except `^') are only special outside that context. You're interpreting the term "context" the same way the regexp compiler does: meaning the preceding characters of the regexp. The regexp compiler works from left to right. However, to a person, the context of a character set, or any sub-regexp, is found on both sides of it. Understood in this way, the role of a character set's closing ] is dual to that of the opening [; both of them delimit the character set. Both characters play special roles in the syntax of regexps, and these roles are not internal to a character set. First of all, there are (surprisingly) many occurrences of "\\]" in the Emacs source, where the `]' _is_ special and closes a character alternative that contains a slash. That is a good point. We don't want people to get confused about that. So I think we should not encourage the quoting of ], but we need to be careful about how to explain this. I will write it. Meanwhile, please do install your search.c patch. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Unquoted special characters in regexps 2006-03-01 17:54 ` Richard Stallman @ 2006-03-02 4:06 ` Luc Teirlinck 2006-03-02 19:43 ` Richard Stallman 2006-03-02 4:54 ` Luc Teirlinck 2006-03-02 18:40 ` martin rudalics 2 siblings, 1 reply; 81+ messages in thread From: Luc Teirlinck @ 2006-03-02 4:06 UTC (permalink / raw) Cc: rudalics, emacs-devel, schwab Richard Stallman wrote: You're interpreting the term "context" the same way the regexp compiler does: meaning the preceding characters of the regexp. Of course I do. That is the only interpretation my computer cares about. If I interpret a regexp differently from the regexp compiler, the regexp compiler wins, and I lose. So I do not want to do that. The regexp compiler works from left to right. I usually read regexps left to right too, keeping track of context the same way the regexp compiler does. I want to make sure that I interpret regexps the same way the regexp compiler and my computer do. Sincerely, Luc. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Unquoted special characters in regexps 2006-03-02 4:06 ` Luc Teirlinck @ 2006-03-02 19:43 ` Richard Stallman 0 siblings, 0 replies; 81+ messages in thread From: Richard Stallman @ 2006-03-02 19:43 UTC (permalink / raw) Cc: rudalics, emacs-devel, schwab You're interpreting the term "context" the same way the regexp compiler does: meaning the preceding characters of the regexp. Of course I do. That is the only interpretation my computer cares about. The manual is meant for human beings to read, not for computers. And the strict left-to-right parsing concept is not the way human beings usually understand regexps. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Unquoted special characters in regexps 2006-03-01 17:54 ` Richard Stallman 2006-03-02 4:06 ` Luc Teirlinck @ 2006-03-02 4:54 ` Luc Teirlinck 2006-03-02 18:40 ` martin rudalics 2 siblings, 0 replies; 81+ messages in thread From: Luc Teirlinck @ 2006-03-02 4:54 UTC (permalink / raw) Cc: rudalics, emacs-devel, schwab Richard Stallman wrote: Meanwhile, please do install your search.c patch. Done. Sincerely, Luc. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Unquoted special characters in regexps 2006-03-01 17:54 ` Richard Stallman 2006-03-02 4:06 ` Luc Teirlinck 2006-03-02 4:54 ` Luc Teirlinck @ 2006-03-02 18:40 ` martin rudalics 2006-03-02 23:26 ` Luc Teirlinck ` (2 more replies) 2 siblings, 3 replies; 81+ messages in thread From: martin rudalics @ 2006-03-02 18:40 UTC (permalink / raw) Cc: schwab, Luc Teirlinck, emacs-devel > First of all, there are (surprisingly) many occurrences of "\\]" in > the Emacs source, where the `]' _is_ special and closes a character > alternative that contains a slash. > > That is a good point. We don't want people to get confused about that. There are very few expressions where `\\' does have to precede a right bracket, `[^\\]', `[]\\]', and `[^]\\]' come to mind. In any other case people may avoid confusion by moving the backslash in front of another character. In current Emacs code there are some 100 occurrences where programmers were able to convey the intention that they indeed wanted to match a right bracket by writing `\\]'. Simultaneously, programmers were able to express that they did _not_ want a character alternative to end here. Your change will make it difficult if not impossible to express such intentions. And, your change is motivated by the pessimistic assumption that people frequently submit code with buggy regexps. Even if that were the case your change would hardly help. Consider the following expression from `gud-jdb-marker-filter': "\\(\[[0-9]+\] \\)*\\([a-zA-Z0-9.$_]+\\)\\.[a-zA-Z0-9$_<>(),]+ \ \\(([a-zA-Z0-9.$_]+:\\|line=\\)\\([0-9.,]+\\)" Experience tells me that this should probably be written as "\\(\\[[0-9]+\\] \\)*\\([a-zA-Z0-9.$_]+\\)\\.[a-zA-Z0-9$_<>(),]+ \ \\(([a-zA-Z0-9.$_]+:\\|line=\\)\\([0-9.,]+\\)" but I'm not quite sure since `gud.el' is one of the few Emacs files that do not consistently use `\\]' to match a right bracket. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Unquoted special characters in regexps 2006-03-02 18:40 ` martin rudalics @ 2006-03-02 23:26 ` Luc Teirlinck 2006-03-03 7:42 ` martin rudalics 2006-03-03 10:25 ` Richard Stallman 2006-03-03 10:25 ` Richard Stallman 2 siblings, 1 reply; 81+ messages in thread From: Luc Teirlinck @ 2006-03-02 23:26 UTC (permalink / raw) Cc: schwab, rms, emacs-devel Martin Rudalics wrote: "\\(\[[0-9]+\] \\)*\\([a-zA-Z0-9.$_]+\\)\\.[a-zA-Z0-9$_<>(),]+ \ \\(([a-zA-Z0-9.$_]+:\\|line=\\)\\([0-9.,]+\\)" Experience tells me that this should be probably written as "\\(\\[[0-9]+\\] \\)*\\([a-zA-Z0-9.$_]+\\)\\.[a-zA-Z0-9$_<>(),]+ \ \\(([a-zA-Z0-9.$_]+:\\|line=\\)\\([0-9.,]+\\)" but I'm not quite sure since `gud.el' is one of the few Emacs files that do not consistently use `\\]' to match a right bracket. I do not see what this problem has to do with "\\]" vs ']'. This seems to be just a case of forgetting to double up `\' for Lisp syntax. The actually intended regexp would seem to obviously be: "\\(\\[[0-9]+] \\)* and so on. The present regexp is valid, but the syntax it is looking for seems bizarre. On the other hand looking for things like: "[123] [5] [2034] " seems to make sense. Sincerely, Luc. ^ permalink raw reply [flat|nested] 81+ messages in thread
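The effect of the single backslashes in the `gud-jdb-marker-filter' fragment can be sketched as follows. Only the leading group of the regexp is used, and the `\\`' and `\\'' anchors are added here for illustration so that partial matches do not hide the difference; neither anchor is in the original expression.

```elisp
;; The Lisp reader drops a lone backslash before `[' or `]', so the
;; regexp compiler never sees it.
(string= "\[[0-9]+\] " "[[0-9]+] ")               ; t

(let ((as-written "\\`\\(\[[0-9]+\] \\)*\\'")     ; single backslashes
      (doubled    "\\`\\(\\[[0-9]+\\] \\)*\\'"))  ; doubled for Lisp
  (list (string-match-p as-written "[123] [5] ")  ; 0
        (string-match-p as-written "1[2] ")       ; 0: `[' is just a set member
        (string-match-p doubled    "[123] [5] ")  ; 0
        (string-match-p doubled    "1[2] ")))     ; nil
```

As written, the group is the alternative `[[0-9]' repeated, followed by a literal `]' and a space, which is why it accepts "1[2] ". With the backslashes doubled, the group must start with a literal `['.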
* Re: Unquoted special characters in regexps 2006-03-02 23:26 ` Luc Teirlinck @ 2006-03-03 7:42 ` martin rudalics 2006-03-03 13:51 ` Luc Teirlinck 2006-03-03 14:09 ` Luc Teirlinck 0 siblings, 2 replies; 81+ messages in thread From: martin rudalics @ 2006-03-03 7:42 UTC (permalink / raw) Cc: schwab, rms, emacs-devel > I do not see what this problem has to do with "\\]" vs ']'. > > This seems to be just a case of forgetting to double up `\' for Lisp > syntax. That's precisely what I meant. If programmers consistently double up backslashes for _all_ escaped brackets it's usually simple to guess when one of them has been omitted. Otherwise you always have to consider the possibility that the author wanted to close a character alternative here and messed up some preceding part. You have long-standing experience (or maybe some sixth sense) for discovering wrong regexps faster than most of us. But you should occasionally think of less experienced programmers who try to guess the motivations for writing an expression like (string-match "[^\\]\\(\\([\\][\\]\\)*\\)\"[ \t,]*" definition start) in `mailalias.el'. It's got no less than three backslashes preceding non-escaped right brackets. Can you tell me what the author wants to match? If, by default, I have to consider the possibility that a `]' may either close a character alternative _or_ stand for itself, the number of interpretations of such expressions explodes combinatorially. Programmers should avoid confusion by not putting `\\' at the end of a character alternative unless it's needed as in `[^\\]'. > The present regexp is valid, but the syntax it is looking for seems > bizarre. On the other hand looking for things like: > > "[123] [5] [2034] " > > seems to make sense. Because people are used to considering objects like "[123] [5] [2034]" well-formed and objects like "123]", "]5]", "[2034 " bizarre. Most humans _do_ expect to find some sort of symmetry in the things they observe.
Symmetry is a driving principle of mathematics and computer science. Often, it's a lack of symmetry that makes people aware of faults or other anomalies. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Unquoted special characters in regexps 2006-03-03 7:42 ` martin rudalics @ 2006-03-03 13:51 ` Luc Teirlinck 2006-03-03 14:09 ` Luc Teirlinck 1 sibling, 0 replies; 81+ messages in thread From: Luc Teirlinck @ 2006-03-03 13:51 UTC (permalink / raw) Cc: schwab, rms, emacs-devel Martin Rudalics wrote: But you should occasionally think of less experienced programmers who try to guess the motivations for writing an expression like (string-match "[^\\]\\(\\([\\][\\]\\)*\\)\"[ \t,]*" definition start) in `mailalias.el'. It's got no less than three backslashes preceding non-escaped right brackets. Can you tell me what the author wants to match? Unless it really is too early in the morning for me, something that starts with something that is not a backslash, then an even number of backslashes, then a ", then a sequence of non-newline whitespace or commas. The one pair of \\(...\\) that is not needed for this meaning is probably meant for use with match-data. What is the point you are trying to make? That "[^\\]\\(\\(\\\\\\\\\\)*\\)\"[ \t,]*" would be easier to read? Not for me. Sincerely, Luc. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Unquoted special characters in regexps 2006-03-03 7:42 ` martin rudalics 2006-03-03 13:51 ` Luc Teirlinck @ 2006-03-03 14:09 ` Luc Teirlinck 2006-03-03 18:52 ` martin rudalics 1 sibling, 1 reply; 81+ messages in thread From: Luc Teirlinck @ 2006-03-03 14:09 UTC (permalink / raw) Cc: schwab, rms, emacs-devel Martin Rudalics wrote: Can you tell me what the author wants to match? To give a less technical answer than in my previous response, an _unquoted_ ", followed by a bunch non-newline whitespace or commas. Most humans _do_ expect to find some sort of symmetry in the things they observe. Not necessarily. Because you might start your regexp search in the middle of something, breaking all symmetry. In the example above, the search probably started inside a string and the regexp is looking for the end of it. Sincerely, Luc. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Unquoted special characters in regexps 2006-03-03 14:09 ` Luc Teirlinck @ 2006-03-03 18:52 ` martin rudalics 2006-03-03 22:41 ` Luc Teirlinck 2006-03-03 23:00 ` Luc Teirlinck 0 siblings, 2 replies; 81+ messages in thread From: martin rudalics @ 2006-03-03 18:52 UTC (permalink / raw) Cc: schwab, rms, emacs-devel > To give a less technical answer than in my previous response, an > _unquoted_ ", followed by a bunch non-newline whitespace or commas. (string-match "[^\\]\\(\\([\\][\\]\\)*\\)\"[ \t,]*" "\" ,") => nil > What is the point you are trying to make? That > > "[^\\]\\(\\(\\\\\\\\\\)*\\)\"[ \t,]*" > > would be easier to read? Not for me. I agree that writing and reading 10 backslashes in a row is dreadful. However, writing `[\\]' to match a single backslash is dreadful as well. A character alternative without alternative does not deserve its name. Nowadays I'd probably write something like "[^\\]\\(\\\\\\{2\\}*\\)\"[ \t,]*" but maybe at the time the original expression was written repetition operators were not yet available. Anyway, the "point I was trying to make" was a different one. I believe we should give suggestions how to avoid writing confusing regexps rather than change `regexp-quote'. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Unquoted special characters in regexps 2006-03-03 18:52 ` martin rudalics @ 2006-03-03 22:41 ` Luc Teirlinck 2006-03-03 23:00 ` Luc Teirlinck 1 sibling, 0 replies; 81+ messages in thread From: Luc Teirlinck @ 2006-03-03 22:41 UTC (permalink / raw) Cc: schwab, rms, emacs-devel Martin Rudalics wrote: > To give a less technical answer than in my previous response, an > _unquoted_ ", followed by a bunch non-newline whitespace or commas. (string-match "[^\\]\\(\\([\\][\\]\\)*\\)\"[ \t,]*" "\" ,") => nil Less technical means more potential for ambiguity. Of course the regexp does _not_ match "\" ,", because that would not guarantee that the " it found is unquoted. Apparently, point is at the beginning of a string and the regexp searches for the ending unquoted " by searching, as I said in my previous message, for something that is _not_ a backslash, then an even number of backslashes, then a ", then a bunch of non-newline whitespace or commas. Apparently, the code relies on the assumption that the string does not consist of _only_ backslashes. ELISP> (string-match "[^\\]\\(\\([\\][\\]\\)*\\)\"[ \t,]*" "012\" ,") 2 Sincerely, Luc. ^ permalink raw reply [flat|nested] 81+ messages in thread
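The reading of the `mailalias.el' regexp given in the message above (a non-backslash character, an even number of backslashes, then a double quote) can be exercised on a few inputs. The first two inputs are the ones already evaluated in this thread; the last two are additional illustrations chosen here.

```elisp
(let ((re "[^\\]\\(\\([\\][\\]\\)*\\)\"[ \t,]*"))
  (list (string-match-p re "\" ,")          ; nil: nothing precedes the quote
        (string-match-p re "012\" ,")       ; 2
        (string-match-p re "a\\\" ,")       ; nil: one backslash quotes the "
        (string-match-p re "a\\\\\" ,")))   ; 0: two backslashes, " is unquoted
```

In the third input the quote is preceded by an odd number of backslashes, so no position satisfies both `[^\\]' and the even-count group; in the fourth the pair of backslashes is consumed by the group and the match starts at the `a'.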
* Re: Unquoted special characters in regexps 2006-03-03 18:52 ` martin rudalics 2006-03-03 22:41 ` Luc Teirlinck @ 2006-03-03 23:00 ` Luc Teirlinck 1 sibling, 0 replies; 81+ messages in thread From: Luc Teirlinck @ 2006-03-03 23:00 UTC (permalink / raw) Cc: schwab, rms, emacs-devel Martin Rudalics wrote: Nowadays I'd probably write something like "[^\\]\\(\\\\\\{2\\}*\\)\"[ \t,]*" but maybe at the time the original expression was written repetition operators were not yet available. To me, the above regexp is _really_ awkward, whereas "[^\\]\\(\\([\\][\\]\\)*\\)\"[ \t,]*" is really easy to understand and very self-documenting. However, writing `[\\]' to match a single backslash is dreadful as well. Quite to the contrary. It documents very clearly which \\ together represent one single literal backslash and it separates them clearly from the surrounding non-literal backslashes. It is what makes this regexp so very easy to read, unlike your suggested replacement with its six consecutive ungrouped backslashes, with various different meanings. Sincerely, Luc. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Unquoted special characters in regexps 2006-03-02 18:40 ` martin rudalics 2006-03-02 23:26 ` Luc Teirlinck @ 2006-03-03 10:25 ` Richard Stallman 2006-03-03 15:20 ` martin rudalics 2006-03-03 10:25 ` Richard Stallman 2 siblings, 1 reply; 81+ messages in thread From: Richard Stallman @ 2006-03-03 10:25 UTC (permalink / raw) Cc: schwab, teirllm, emacs-devel Your change will make it difficult if not impossible to express such intentions. I don't understand. I suspect there is a miscommunication. When you say "my change", what change is that? I approved a proposed change in regexp-quote, and I said I would change the manual. I did not talk about any change in parsing regexps. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Unquoted special characters in regexps 2006-03-03 10:25 ` Richard Stallman @ 2006-03-03 15:20 ` martin rudalics 2006-03-04 13:37 ` Richard Stallman 0 siblings, 1 reply; 81+ messages in thread From: martin rudalics @ 2006-03-03 15:20 UTC (permalink / raw) Cc: schwab, teirllm, emacs-devel > I don't understand. I suspect there is a miscommunication. > When you say "my change", what change is that? Sorry for the misunderstanding. For some reason, I'm currently receiving mails in quite erratic order. I referred to Luc's change of `regexp-quote' which, in my opinion, will make it in some cases impossible to generate regular expressions the traditional way. More precisely, `(regexp-quote "[foo]")' has so far evaluated to "\\[foo\\]" and has always done so. Changing this to "\\[foo]" will mean that, whenever I want to study a regexp in the future, I must also keep in mind whether that expression was generated by `regexp-quote' before or after that change. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Unquoted special characters in regexps 2006-03-03 15:20 ` martin rudalics @ 2006-03-04 13:37 ` Richard Stallman 2006-03-04 14:40 ` martin rudalics 0 siblings, 1 reply; 81+ messages in thread From: Richard Stallman @ 2006-03-04 13:37 UTC (permalink / raw) Cc: schwab, teirllm, emacs-devel I referred to Luc's change of `regexp-quote' which, in my opinion, will make it in some cases impossible to generate regular expressions the traditional way. I don't understand what that means. What exactly is the task you believe will be impossible? As far as I know, it will still generate correct regexps, just somewhat different ones. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Unquoted special characters in regexps 2006-03-04 13:37 ` Richard Stallman @ 2006-03-04 14:40 ` martin rudalics 2006-03-06 0:48 ` Richard Stallman 0 siblings, 1 reply; 81+ messages in thread From: martin rudalics @ 2006-03-04 14:40 UTC (permalink / raw) Cc: schwab, teirllm, emacs-devel > I don't understand what that means. What exactly is the task you > believe will be impossible? As far as I know, it will still generate > correct regexps, just somewhat different ones. Suppose someone wanted to match `[foo]' in an earlier version of a program and now wants to match `[foo][bar]' where `[foo]' and `[bar]' are complicated expressions to be generated with help of `regexp-opt'. The earlier version was obtained with `regexp-opt' producing `\\[foo\\]'. For the new version `regexp-opt' would generate `\\[bar]'. The resulting expression would read as `\\[foo\\]\\[bar]' which is confusing since two different styles are involved. The user would have to manually change `\\]' to `]' (or `]' to `\\]') to get a uniform appearance. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Unquoted special characters in regexps 2006-03-04 14:40 ` martin rudalics @ 2006-03-06 0:48 ` Richard Stallman 0 siblings, 0 replies; 81+ messages in thread From: Richard Stallman @ 2006-03-06 0:48 UTC (permalink / raw) Cc: schwab, teirllm, emacs-devel For the new version `regexp-opt' would generate `\\[bar]'. The resulting expression would read as `\\[foo\\]\\[bar]' which is confusing since two different styles are involved. The user would have to manually change `\\]' to `]' (or `]' to `\\]') to get a uniform appearance. That seems like no big deal. ^ permalink raw reply [flat|nested] 81+ messages in thread
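The concern in this sub-thread is about appearance, not about matching behaviour. A short sketch, assuming nothing beyond `string-match', `concat' and `regexp-quote', shows that either spelling of the closing bracket matches the same text and that pieces in different styles can be combined. The literal strings "[foo]" and "[bar]" stand in for the "complicated expressions" mentioned above.

```elisp
;; Both spellings of the closing bracket match the literal text "[foo]".
(string-match "\\[foo\\]" "[foo]")   ; => 0
(string-match "\\[foo]"   "[foo]")   ; => 0

;; Whatever style a given Emacs version produces from `regexp-quote',
;; the result matches the original text, and quoted pieces concatenate.
(string-match (concat "\\`" (regexp-quote "[foo]") (regexp-quote "[bar]") "\\'")
              "[foo][bar]")          ; => 0

;; A hand-written old-style piece followed by a freshly quoted piece also
;; works; only the look of the resulting regexp may be non-uniform.
(string-match (concat "\\[foo\\]" (regexp-quote "[bar]"))
              "[foo][bar]")          ; => 0
```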
* Re: Unquoted special characters in regexps 2006-03-02 18:40 ` martin rudalics 2006-03-02 23:26 ` Luc Teirlinck 2006-03-03 10:25 ` Richard Stallman @ 2006-03-03 10:25 ` Richard Stallman 2006-03-03 15:51 ` martin rudalics 2006-03-05 2:54 ` Luc Teirlinck 2 siblings, 2 replies; 81+ messages in thread From: Richard Stallman @ 2006-03-03 10:25 UTC (permalink / raw) Cc: schwab, teirllm, emacs-devel your change would hardly help. Consider the following expression from `gud-jdb-marker-filter': "\\(\[[0-9]+\] \\)*\\([a-zA-Z0-9.$_]+\\)\\.[a-zA-Z0-9$_<>(),]+ \ \\(([a-zA-Z0-9.$_]+:\\|line=\\)\\([0-9.,]+\\)" Experience tells me that this should be probably written as "\\(\\[[0-9]+\\] \\)*\\([a-zA-Z0-9.$_]+\\)\\.[a-zA-Z0-9$_<>(),]+ \ \\(([a-zA-Z0-9.$_]+:\\|line=\\)\\([0-9.,]+\\)" \[ and \] in Lisp strings are equivalent to just [ and just ]. So I think the current value is incorrect, and the [ needs to have \\ before it. Meanwhile, the question we're discussing here is whether to write \\ before the ]. That is harmless, and the question is whether it makes things clearer or more confusing. The problem is that usually it makes things clearer, but occasionally people could get confused when \\ is last in a character alternative. ^ permalink raw reply [flat|nested] 81+ messages in thread
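The two points made in this message can be checked separately. The sketch below first shows what the Lisp reader does with a single backslash before a bracket, and then the case where a backslash is the last member of a character alternative. The sample strings are chosen here for illustration.

```elisp
;; The Lisp reader drops the backslash in "\[" and "\]", so the regexp
;; compiler never sees it.
(string= "\[" "[")             ; => t
(string= "\]" "]")             ; => t
(length "\\[")                 ; => 2, a backslash and a bracket

;; Backslash is not special inside a character alternative, so "[ab\\]"
;; matches `a', `b' or a backslash, and the final `]' closes the alternative.
(string-match "[ab\\]" "\\")   ; => 0
(string-match "[ab\\]" "]")    ; => nil
```

The last two calls are the confusing case: the `\\]' at the end of "[ab\\]" looks like a quoted right bracket but is a literal backslash followed by the closing `]'.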
* Re: Unquoted special characters in regexps 2006-03-03 10:25 ` Richard Stallman @ 2006-03-03 15:51 ` martin rudalics 2006-03-03 23:48 ` Luc Teirlinck 2006-03-04 23:16 ` Luc Teirlinck 2006-03-05 2:54 ` Luc Teirlinck 1 sibling, 2 replies; 81+ messages in thread From: martin rudalics @ 2006-03-03 15:51 UTC (permalink / raw) Cc: schwab, teirllm, emacs-devel > "\\(\[[0-9]+\] \\)*\\([a-zA-Z0-9.$_]+\\)\\.[a-zA-Z0-9$_<>(),]+ \ > \\(([a-zA-Z0-9.$_]+:\\|line=\\)\\([0-9.,]+\\)" > > Experience tells me that this should be probably written as > > "\\(\\[[0-9]+\\] \\)*\\([a-zA-Z0-9.$_]+\\)\\.[a-zA-Z0-9$_<>(),]+ \ > \\(([a-zA-Z0-9.$_]+:\\|line=\\)\\([0-9.,]+\\)" > > \[ and \] in Lisp strings are equivalent to just [ and just ]. So I > think the current value is incorrect, and the [ needs to have \\ before it. > > Meanwhile, the question we're discussing here is whether to write \\ > before the ]. That is harmless, and the question is whether it makes > things clearer or more confusing. The problem is that usually it > makes things clearer, but occasionally people could get confused when > \\ is last in a character alternative. The question of whether to write `\\' before the `]' is relevant to the example cited above. Usually, when I see a `\\]' outside a character alternative I expect it to match a right bracket in some text. And, usually, in that text a left bracket will precede the right bracket. Hence, if in the text above the author had used `\\]' instead of `\]' it would have been easy to conclude - from the absence of a preceding `\\[' - that something went wrong. Vice versa, when seeing a `\\[' I usually expect it to have a corresponding `\\]' somewhere on the right. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Unquoted special characters in regexps 2006-03-03 15:51 ` martin rudalics @ 2006-03-03 23:48 ` Luc Teirlinck 2006-03-04 9:58 ` martin rudalics 2006-03-04 23:16 ` Luc Teirlinck 1 sibling, 1 reply; 81+ messages in thread From: Luc Teirlinck @ 2006-03-03 23:48 UTC (permalink / raw) Cc: schwab, rms, emacs-devel Martin Rudalics wrote: The question of whether to write `\\' before the `]' is relevant to the example cited above. Usually, when I see a `\\]' outside a character alternative I expect it to match a right bracket in some text. And, usually, in that text a left bracket will precede the right bracket. Hence, if in the text above the author had used `\\]' instead of `\]' it would have been easy to conclude - from the absence of a preceding `\\[' - that something went wrong. Vice versa, when seeing a `\\[' I usually expect it to have a corresponding `\\]' somewhere on the right. I believe that you make understanding regexps hard on yourself by making all kind of assumptions that often are not satisfied. There is no reason why a literal `]' should be matched by a literal `[' to the right or vice versa. Even _if_ the `[' and the `]' balance in the text you are parsing through _considered in its entirety_ (which is not at all guaranteed), you might be inside, say, a nested Lisp vector and your regexp may be searching for its end. No balance of literal `[' and `]' at all. This is _not_ an exceptional situation. It occurs all over the place in the Emacs source code. Sincerely, Luc. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Unquoted special characters in regexps 2006-03-03 23:48 ` Luc Teirlinck @ 2006-03-04 9:58 ` martin rudalics 0 siblings, 0 replies; 81+ messages in thread From: martin rudalics @ 2006-03-04 9:58 UTC (permalink / raw) Cc: schwab, rms, emacs-devel > I believe that you make understanding regexps hard on yourself by > making all kind of assumptions that often are not satisfied. > > There is no reason why a literal `]' should be matched by a literal > `[' to the right or vice versa. What I meant was that (i) when I see a literal `[' I expect it to be matched by a literal `]' in the text that follows and, (ii) when I see a literal `]' I expect it to be matched by a literal `[' in the preceding text. In mathematics open intervals like `]3,5]' are an obvious exception to these rules but in general I've been quite happy with them. In the particular case, I've been talking about a regexp in Emacs source "\\(\[[0-9]+\] \\)*\\([a-zA-Z0-9.$_]+\\)\\.[a-zA-Z0-9$_<>(),]+ \ \\(([a-zA-Z0-9.$_]+:\\|line=\\)\\([0-9.,]+\\)" which I consider wrong. Apparently that part of the code is never taken thus no one has complained so far about mismatches. However, similar expressions to match line numbers occur frequently. And I use the rules above to reason about them and am confident that in this particular case you use one of these rules as well. If I followed your reasoning to its logical end I couldn't possibly rule out malformed regexps like `[a-z'. After all the `[' states that a character alternative starts here, why should a user bother to close it? > Even _if_ the `[' and the `]' balance > in the text you are parsing through _considered in its entirety_ > (which is not at all guaranteed), you might be inside, say, a nested > Lisp vector and your regexp may be searching for its end. No balance > of literal `[' and `]' at all. This is _not_ an exceptional > situation. It occurs all over the place in the Emacs source code. I fully agree. 
However, in such cases there is practically always some pdl (variable) to record the current state of "unclosed" literal `['s. In practice, I will complain about unmatching brackets when either the pdl is empty (the variable is zero) and I find a literal `]' or the pdl is non-empty (the variable is non-zero) when I encounter the end of the text. Hence, the pdl (variable) compensates missing symmetry in the part of the text I want to parse. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Unquoted special characters in regexps 2006-03-03 15:51 ` martin rudalics 2006-03-03 23:48 ` Luc Teirlinck @ 2006-03-04 23:16 ` Luc Teirlinck 1 sibling, 0 replies; 81+ messages in thread From: Luc Teirlinck @ 2006-03-04 23:16 UTC (permalink / raw) Cc: schwab, rms, emacs-devel I believe that we should just decide whether there is a bug in the regexp in question (which seems _nearly_ certain) and correct it if so. For people who use the Java debugger jdb (I do not know Java), I summarize the problem, so there is no need to read through any of the prior postings in this thread. The regexp in question occurs in `gud-jdb-marker-filter' on line 2155 of progmodes/gud.el and is: "\\(\[[0-9]+\] \\)*\\([a-zA-Z0-9.$_]+\\)\\.[a-zA-Z0-9$_<>(),]+ \ \\(([a-zA-Z0-9.$_]+:\\|line=\\)\\([0-9.,]+\\)" The problem is limited to the \\(\[[0-9]+\] \\)* part at the beginning. According to the Change Logs, this part _seems_ to be used to search/detect classpath information in jdb's output. The regexp as given is valid. But it looks like \\(\[[0-9]+\] \\)* was actually meant to be \\(\\[[0-9]+] \\)*, since the author seemingly forgot to double up `\' for Lisp syntax. I do not know Java, so I have no way of knowing what the correct syntax is. According to the current regexp, it consists of something that looks like a sequence of integers written in base 11, where `[' bizarrely stands for ten, separated and terminated by "] ". The "obvious" correction \\(\\[[0-9]+] \\)* looks for a bunch of decimal digits enclosed in square brackets separated by a space, like "[1276] [0] ". It seems that we should make the "obvious" correction, but it would nevertheless be good if somebody who knows the syntax could confirm this. Sincerely, Luc. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Unquoted special characters in regexps 2006-03-03 10:25 ` Richard Stallman 2006-03-03 15:51 ` martin rudalics @ 2006-03-05 2:54 ` Luc Teirlinck 2006-03-06 0:49 ` Richard Stallman 1 sibling, 1 reply; 81+ messages in thread From: Luc Teirlinck @ 2006-03-05 2:54 UTC (permalink / raw) Cc: rudalics, emacs-devel, schwab Richard Stallman wrote: \[ and \] in Lisp strings are equivalent to just [ and just ]. So I think the current value is incorrect, and the [ needs to have \\ before it. Since I sent my previous message I noticed from the comment and the code following the regexp, that "\\(\\[[0-9]+] \\)* is the only possible interpretation. The comment and code are _really_ looking for a sequence of the type "[123] [3] ": ;; A good marker is one that: ;; 1) does not have a "[n] " prefix (not part of a stack backtrace) ;; 2) does have an "[n] " prefix and n is the lowest prefix seen ;; since the last prompt So I believe that we just should go ahead and change "\\(\[[0-9]+\] \\)*" to "\\(\\[[0-9]+] \\)*". I can do this, if desired. Sincerely, Luc. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Unquoted special characters in regexps 2006-03-05 2:54 ` Luc Teirlinck @ 2006-03-06 0:49 ` Richard Stallman 0 siblings, 0 replies; 81+ messages in thread From: Richard Stallman @ 2006-03-06 0:49 UTC (permalink / raw) Cc: rudalics, schwab, emacs-devel So I believe that we just should go ahead and change "\\(\[[0-9]+\] \\)*" to "\\(\\[[0-9]+] \\)*". I can do this, if desired. Please do. ^ permalink raw reply [flat|nested] 81+ messages in thread
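The difference between the old and the corrected prefix can be seen by testing just that part of the regexp. This is a sketch only: the variable names are invented, the `\\`' and `\\'' anchors are added here so that the whole sample string must match, and the sample strings other than "[1276] [0] " are made up.

```elisp
;; Prefix as it stood in gud.el: the Lisp reader turns \[ and \] into
;; plain [ and ], so the regexp is \([[0-9]+] \)* .
(defvar demo-old-prefix "\\`\\(\[[0-9]+\] \\)*\\'")

;; Prefix after the agreed correction: \(\[[0-9]+] \)* .
(defvar demo-new-prefix "\\`\\(\\[[0-9]+] \\)*\\'")

;; Both accept the intended "[n] " sequences.
(string-match demo-old-prefix "[1276] [0] ")   ; => 0
(string-match demo-new-prefix "[1276] [0] ")   ; => 0

;; Only the old one accepts input without an opening bracket, or with
;; stray `[' characters mixed in with the digits.
(string-match demo-old-prefix "12] ")          ; => 0
(string-match demo-new-prefix "12] ")          ; => nil
(string-match demo-old-prefix "[[7[] ")        ; => 0
(string-match demo-new-prefix "[[7[] ")        ; => nil
```

Since the old form also matches every correct "[n] " prefix, the bug would only show on unusual input, which fits the observation that nobody had complained.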
* Re: Unquoted special characters in regexps 2006-02-27 19:03 ` Richard Stallman 2006-02-27 19:36 ` Andreas Schwab 2006-02-28 0:30 ` Luc Teirlinck @ 2006-02-28 0:44 ` Luc Teirlinck 2006-03-04 21:07 ` Thien-Thi Nguyen 2006-02-28 0:59 ` Luc Teirlinck 3 siblings, 1 reply; 81+ messages in thread From: Luc Teirlinck @ 2006-02-28 0:44 UTC (permalink / raw) Cc: rudalics, emacs-devel, schwab Richard Stallman wrote: However, that doesn't necessarily mean the manual is wrong. There is more than one way to understand the word "special". At the most literal level, ] is not special; if you write it without \\, the regexp compiler won't misunderstand it. However, it does play a special role in the syntax of regexps, and it is not necessarily a bad thing for users to think of it as a special character. It is good for users to think of `]' and `-' as characters that are special _inside_ the context of a character alternative. That way they will not be confused by quotes from the Elisp manual like: Note that the usual regexp special characters are not special inside a character alternative. A completely different set of characters is special inside character alternatives: `]', `-' and `^'. The special meaning of `]' inside a character alternative is obviously to close that alternative. But the Elisp manual currently lists `]', like `^', but unlike `-' among the characters that also have another special meaning outside that context. That is true for `^', but which other special meaning does `]' have? Sincerely, Luc. ^ permalink raw reply [flat|nested] 81+ messages in thread
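The placement rules for the three characters that are special inside a character alternative can be tried out directly. A minimal sketch; the sample strings are invented for illustration.

```elisp
;; `]' placed first, `^' not first, `-' last: all three are then ordinary
;; members of the alternative.
(string-match "[]^-]" "x]")    ; => 1
(string-match "[]^-]" "^")     ; => 0
(string-match "[]^-]" "a-b")   ; => 1
(string-match "[]^-]" "abc")   ; => nil

;; Outside a character alternative, `]' and `-' match themselves unquoted.
(string-match "a]-b" "a]-b")   ; => 0
```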
* Re: Unquoted special characters in regexps 2006-02-28 0:44 ` Luc Teirlinck @ 2006-03-04 21:07 ` Thien-Thi Nguyen 2006-03-05 3:37 ` Luc Teirlinck 0 siblings, 1 reply; 81+ messages in thread From: Thien-Thi Nguyen @ 2006-03-04 21:07 UTC (permalink / raw) Luc Teirlinck <teirllm@dms.auburn.edu> writes: > It is good for users to think of `]' and `-' as characters that > are special _inside_ the context of a character alternative. when describing a complicated process (such as the regexp compiler's operation) it's a common technique to break things down into independent parts. that seems to be the approach used in the docs thus far. but maybe explaining the role of the square brace delimiters is not so independent. my 2c: probably "inside" and "outside" are not as precise as possible when talking about delimiters (of context or anything), such as the square braces. such delimiters "change" the context (unless somehow inhibited), so that things before or after may be "inside" or "outside" of the old/new context. whether or not the delimiter itself is considered inside or outside depends on whether your pov tends to be forward- or backward-looking (which is a personal choice, and thus, algorithmically irrelevant). ascribing context membership to a delimiter is like arguing for child custody; the delimiter will always be between and no happier for the label. thi ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Unquoted special characters in regexps 2006-03-04 21:07 ` Thien-Thi Nguyen @ 2006-03-05 3:37 ` Luc Teirlinck 2006-03-05 11:10 ` martin rudalics 2006-03-05 11:54 ` martin rudalics 0 siblings, 2 replies; 81+ messages in thread From: Luc Teirlinck @ 2006-03-05 3:37 UTC (permalink / raw) Cc: emacs-devel Thien-Thi Nguyen wrote: whether or not the delimiter itself is considered inside or outside depends on whether your pov tends to be forward- or backward-looking (which is a personal choice, and thus, algorithmically irrelevant). No, both the notion of context and the forward-looking view are algorithmically _very_ relevant. If you consider in "[a]b]" the first and the second `]' to be _both_ inside or _both_ outside the context of a character alternative, then it would be impossible to determine solely from that notion of context which of the two `]' has to be taken literally. If you consider the opening and ending " of a string to be _both_ inside or _both_ outside the context of a string, then it would be impossible from that notion of context to determine which " open and which " close strings. Thus any such notions of context are useless. On the other hand the regexp compiler uses the notion of context I mentioned to determine which `[' or `]' are to be interpreted literally. It is also how other parsers determine which " open strings and which close them. Hence, that notion of context is useful, in fact, necessary. Also, forward and backward views of a regexp are not algorithmically equivalent. If you read a regexp forward, you know immediately when you encounter a character whether it has to be taken literally or not (or at worst after a _very_ limited number of characters, as the second `[' in "[[:..."). If you read the regexp backward, you may have to read all the way back to the beginning before you can be sure that a `]' is to be taken literally. 
Hence, reading a regexp forward _is_ algorithmically _very_ superior to reading it backward if your purpose is to understand the regexp. I must admit however, that if what you want is to uncover the subliminal satanic messages in the regexp, then you _have_ to read it backward. Sincerely, Luc. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Unquoted special characters in regexps 2006-03-05 3:37 ` Luc Teirlinck @ 2006-03-05 11:10 ` martin rudalics 2006-03-05 15:32 ` Luc Teirlinck 2006-03-05 17:04 ` Luc Teirlinck 2006-03-05 11:54 ` martin rudalics 1 sibling, 2 replies; 81+ messages in thread From: martin rudalics @ 2006-03-05 11:10 UTC (permalink / raw) Cc: ttn, emacs-devel Luc Teirlinck wrote: > If you consider in "[a]b]" the first and the second `]' to be _both_ > inside or _both_ outside the context of a character alternative, then > it would be impossible to determine solely from that notion of context > which of the two `]' has to be taken literally. That's what I don't get tired of saying for one week already. You always denied it by saying things like The special meaning of `]' inside a character alternative is obviously to close that alternative. and `]' has the special meaning of closing a character alternative _inside_ a character alternative If the closing `]' is inside the alternative where does the first `]' in "[a]b]" go? > If you consider the > opening and ending " of a string to be _both_ inside or _both_ outside > the context of a string, then it would be impossible from that notion > of context to determine which " open and which " close strings. You're cheating here: The double-quote opening a string compares to the _left_ bracket opening a character alternative. The double-quote closing a string compares to the _right_ bracket closing a character alternative. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Unquoted special characters in regexps 2006-03-05 11:10 ` martin rudalics @ 2006-03-05 15:32 ` Luc Teirlinck 2006-03-06 7:41 ` martin rudalics 2006-03-05 17:04 ` Luc Teirlinck 1 sibling, 1 reply; 81+ messages in thread From: Luc Teirlinck @ 2006-03-05 15:32 UTC (permalink / raw) Cc: ttn, emacs-devel Martin Rudalics wrote: Luc Teirlinck wrote: > If you consider in "[a]b]" the first and the second `]' to be _both_ > inside or _both_ outside the context of a character alternative, then > it would be impossible to determine solely from that notion of context > which of the two `]' has to be taken literally. That's what I don't get tired of saying for one week already. You always denied it by saying things like The special meaning of `]' inside a character alternative is obviously to close that alternative. and `]' has the special meaning of closing a character alternative _inside_ a character alternative Look, I am getting tired of this endless yes-no discussion. But you have completely misunderstood everything I have been saying. Let me try once more to explain. Figuring out whether a `]' has to be taken literally or not is a completely trivial problem, but you are making it difficult on yourself for counterproductive philosophical reasons. Start at the beginning of the regexp. `[' is special, `]' not, because we are outside a character alternative. After the first unquoted `[' is read, which is special because it was typed outside a character alternative, we are inside a character alternative. `[' is no longer special, but `]' is (except immediately after the `[' or "[^"), because we now are inside a character alternative. After the next `]' is read, which is special because it was typed inside a character alternative, we are back outside a character alternative, `[' is special, `]' not. To summarize, `]' is only special in a character alternative, `[' is only special outside one. Note how easy this is. 
Unlike for, say \\( you do not even have to keep track of which `[' matches which `]', because there is no nesting. All you need to keep track of is whether you are inside or outside a character alternative. You are making things difficult by treating `[' and `]' in regexps as if they had the usual open-close parentheses syntax, like \\( and \\). They do *not* and that is the cause of all your misunderstandings. In "[1[2]3]" the first `]' closes the first `[' and "balance" makes no sense for the other `[' and `]'. If `[' and `]' had the usual open-close parentheses syntax, the 2 would be inside a nested character alternative, two levels deep. But there is no such thing as nested character alternatives, because, in regexps, `[' and `]' do not have the usual open-close parentheses syntax (unlike, say, in Lisp vectors). Sincerely, Luc. ^ permalink raw reply [flat|nested] 81+ messages in thread
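The forward scan described in this message needs only one bit of state. The sketch below is an illustration of that description, not the code of the Emacs regexp compiler: the function name `demo-classify-right-brackets' is invented, and the handling of `[:alpha:]'-style classes is simplified to lowercase class names.

```elisp
(defun demo-classify-right-brackets (regexp)
  "Return a list of (INDEX . KIND) for each `]' in REGEXP, scanning forward.
KIND is `close' if that `]' ends a character alternative, else `literal'.
REGEXP is the string the regexp compiler sees, after Lisp reading."
  (let ((i 0)
        (n (length regexp))
        (inside nil)                    ; the only state that is needed
        (result '()))
    (while (< i n)
      (let ((c (aref regexp i)))
        (cond
         ;; Inside: a class such as [:alpha:] is one unit; jump to its `]'.
         ((and inside (eq c ?\[)
               (eq (string-match "\\[:[a-z]+:]" regexp i) i))
          (setq i (1- (match-end 0))))
         ;; Inside: any other `]' closes the alternative.
         ((and inside (eq c ?\]))
          (push (cons i 'close) result)
          (setq inside nil))
         ;; Inside: everything else is ordinary, backslash included.
         (inside nil)
         ;; Outside: a backslash quotes the next character; step onto it.
         ((eq c ?\\)
          (setq i (1+ i))
          (when (and (< i n) (eq (aref regexp i) ?\]))
            (push (cons i 'literal) result)))
         ;; Outside: an unquoted `[' opens an alternative.  An optional `^'
         ;; and then a `]' directly after it belong to the alternative.
         ((eq c ?\[)
          (setq inside t)
          (when (and (< (1+ i) n) (eq (aref regexp (1+ i)) ?^))
            (setq i (1+ i)))
          (when (and (< (1+ i) n) (eq (aref regexp (1+ i)) ?\]))
            (setq i (1+ i))
            (push (cons i 'literal) result)))
         ;; Outside: a bare `]' is an ordinary character.
         ((eq c ?\])
          (push (cons i 'literal) result))))
      (setq i (1+ i)))
    (nreverse result)))

(demo-classify-right-brackets "[a]b]")     ; => ((2 . close) (4 . literal))
(demo-classify-right-brackets "[1[2]3]")   ; => ((4 . close) (6 . literal))
(demo-classify-right-brackets "[]a]")      ; => ((1 . literal) (3 . close))
(demo-classify-right-brackets "\\[foo\\]") ; => ((6 . literal))
```

No stack or counter appears anywhere in the function, which is the point made above: since character alternatives do not nest, a single inside/outside flag is enough.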
* Re: Unquoted special characters in regexps 2006-03-05 15:32 ` Luc Teirlinck @ 2006-03-06 7:41 ` martin rudalics 0 siblings, 0 replies; 81+ messages in thread From: martin rudalics @ 2006-03-06 7:41 UTC (permalink / raw) Cc: ttn, emacs-devel > Note how easy this is. Unlike for, say \\( you do not even have to > keep track of which `[' matches which `]', because there is no > nesting. All you need to keep track of is whether you are inside or > outside a character alternative. I do not have any problems matching `[' with `]' when regexps are written cleanly. I do have problems when `]', `\]', or `\\]' get mixed up as in the `gud-jdb-marker-filter' bug. > You are making things difficult by treating `[' and `]' in regexps as > if they had the usual open-close parentheses syntax, like \\( and \\). > They do *not* and that is the cause of all your misunderstandings. In > "[1[2]3]" the first `]' closes the first `[' and "balance" makes no > sense for the other `[' and `]'. If `[' and `]' had the usual open-close > parentheses syntax, the 2 would be inside a nested character > alternative, two levels deep. But there is no such thing as nested > character alternatives, because, in regexps, `[' and `]' do not have > the usual open-close parentheses syntax (unlike, say, in Lisp vectors). We have been comparing character alternatives with strings. Elisp strings don't nest. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Unquoted special characters in regexps 2006-03-05 11:10 ` martin rudalics 2006-03-05 15:32 ` Luc Teirlinck @ 2006-03-05 17:04 ` Luc Teirlinck 1 sibling, 0 replies; 81+ messages in thread From: Luc Teirlinck @ 2006-03-05 17:04 UTC (permalink / raw) Cc: ttn, emacs-devel Martin Rudalics wrote: Luc Teirlinck wrote: > If you consider in "[a]b]" the first and the second `]' to be _both_ > inside or _both_ outside the context of a character alternative, then > it would be impossible to determine solely from that notion of context > which of the two `]' has to be taken literally. That's what I don't get tired of saying for one week already. You always denied it by saying things like I believe that you forgot to read the "If" that starts the passage you quoted. Hence your impression that I was contradicting myself. I consider the first `]' in "[a]b]" to be inside a character alternative, the second one outside. From that context I can determine that the first `]' is special and the second one not. Sincerely, Luc. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Unquoted special characters in regexps 2006-03-05 3:37 ` Luc Teirlinck 2006-03-05 11:10 ` martin rudalics @ 2006-03-05 11:54 ` martin rudalics 2006-03-05 15:35 ` Andreas Schwab 2006-03-05 18:36 ` Luc Teirlinck 1 sibling, 2 replies; 81+ messages in thread From: martin rudalics @ 2006-03-05 11:54 UTC (permalink / raw) Cc: ttn, emacs-devel Luc Teirlinck wrote: > Also, forward and backward views of a regexp are not > algorithmically equivalent. If you read a regexp forward, you know > immediately when you encounter a character whether it has to be taken > literally or not (or at worst after a _very_ limited number of > characters, as the second `[' in in "[[:..."). If you read the regexp > backward, you may have to read all the way back to the beginning > before you can be sure that a `]' is to be taken literally. How do you read the following regexp from `cc-langs.el'? (concat "\\(" "[\)\[\(]" (if (c-lang-const c-type-modifier-kwds) (concat "\\|" ;; "throw" in `c-type-modifier-kwds' is followed ;; by a parenthesis list, but no extra measures ;; are necessary to handle that. (regexp-opt (c-lang-const c-type-modifier-kwds) t) "\\>") "") "\\)") Do you really evaluate the (c-lang-const ...)s _before_ looking at the closing `\\)'? What would you do if the value of `c-type-modifier-kwds' were available at run-time only? When trying to understand such regexps I break them up into parts first. Such parts are, in my understanding, groups like `\\(...\\)', subexpressions delimited by `\\|', and character alternatives. Next I try to understand the parts that interest me without paying notice to parts that do not relate to my specific problem. And I would have troubles to isolate a character alternative when the author matches a literal right bracket with `]'. People can make reading a regexp truly awkward by writing kludgy expressions like (let ((keywords (concat "\\([;(){}`|&]\\|^\\)[ \t]*\\(\\(" (regexp-opt (sh-feature sh-leading-keywords) t) "[ \t]+\\)?" 
(regexp-opt (append (sh-feature sh-leading-keywords) (sh-feature sh-other-keywords)) t)))) in `sh-font-lock-keywords-1' which I understand correctly iff I read the definition of the entire function first. Such expressions are, however, rare in present Emacs code. > Hence, reading a regexp forward _is_ algorithmically _very_ superior > to reading it backward if your purpose is to understand the regexp. If my purpose is to understand how a regexp engine interprets a regexp, reading a regexp forwardly is superior. If, however, my purpose is to understand a complex regexp I want to guess the author's intentions first. In that case I do want to break up the expression into its constituents. In general, languages hiding implementation details are easier to use than languages that require users to know how specific features are implemented. > I must admit however, that if what you want is to uncover the subliminal > satanic messages in the regexp, then you _have_ to read it backward. It's better to avoid "subliminal satanic messages" when _writing_ a regexp. It's bad if you have to uncover them when reading a regexp. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Unquoted special characters in regexps 2006-03-05 11:54 ` martin rudalics @ 2006-03-05 15:35 ` Andreas Schwab 2006-03-06 8:19 ` martin rudalics 2006-03-05 18:36 ` Luc Teirlinck 1 sibling, 1 reply; 81+ messages in thread From: Andreas Schwab @ 2006-03-05 15:35 UTC (permalink / raw) Cc: ttn, Luc Teirlinck, emacs-devel martin rudalics <rudalics@gmx.at> writes: > And I would have troubles to isolate a character alternative when the > author matches a literal right bracket with `]'. A bracket expression always starts with an unquoted `['. When looking at a `]' you will never know whether it is part of a bracket expression (independent of whether it is preceded by `\') without first determining which syntax is currently active (inside or outside of a bracket expression). Andreas. -- Andreas Schwab, SuSE Labs, schwab@suse.de SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany PGP key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5 "And now for something completely different." ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Unquoted special characters in regexps 2006-03-05 15:35 ` Andreas Schwab @ 2006-03-06 8:19 ` martin rudalics 0 siblings, 0 replies; 81+ messages in thread From: martin rudalics @ 2006-03-06 8:19 UTC (permalink / raw) Cc: ttn, Luc Teirlinck, emacs-devel > A bracket expression always starts with an unquoted `['. When looking at > a `]' you will never know whether it is part of a bracket expression > (independent of whether it is preceded by `\') without first determining > which syntax is currently active (inside or outside of a bracket > expression). Agreed. When looking at a single isolated character I can never tell whether it's inside a bracket expression or not. However, I'd like to determine whether it is before having to read an entire regexp from beginning to end. For that I want all the syntactic help I can get. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Unquoted special characters in regexps 2006-03-05 11:54 ` martin rudalics 2006-03-05 15:35 ` Andreas Schwab @ 2006-03-05 18:36 ` Luc Teirlinck 2006-03-05 19:14 ` Luc Teirlinck 1 sibling, 1 reply; 81+ messages in thread From: Luc Teirlinck @ 2006-03-05 18:36 UTC (permalink / raw) Cc: ttn, emacs-devel Martin Rudalics wrote: If my purpose is to understand how a regexp engine interprets a regexp, reading a regexp forwardly is superior. As Andreas already pointed out, there is _no_ way to determine whether either a `]' _or_ a `\\]' has to be taken literally or closes a character alternative without parsing the regexp forward from the start. In general, languages hiding implementation details are easier to use than languages that require users to know how specific features are implemented. But if you _are_ using a language that requires parsing forward from the beginning for correct understanding, like regexps, then _pretending_ that you are using some other type of language is not going to help. Sincerely, Luc. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Unquoted special characters in regexps 2006-03-05 18:36 ` Luc Teirlinck @ 2006-03-05 19:14 ` Luc Teirlinck 2006-03-06 8:17 ` martin rudalics 0 siblings, 1 reply; 81+ messages in thread From: Luc Teirlinck @ 2006-03-05 19:14 UTC (permalink / raw) Cc: rudalics, emacs-devel, ttn From my previous reply: As Andreas already pointed out, there is _no_ way to determine whether either a `]' _or_ a `\\]' has to be taken literally or closes a character alternative without parsing the regexp forward from the start. Well, _in certain cases_, you might be able to determine it sooner by parsing backward from the `]'. If you see a `]' or "\\]" you know it has to be taken literally. (Note: the "\\" are irrelevant to the question again.) But then you do not know whether that earlier `]' or "\\]" has to be taken literally, without keeping going. If you see an unquoted `[', you know that the `]' closes a character alternative, but you still do not know whether that `[' is opening that character alternative or has to be taken literally, as in "[asd[fgh]". If you encounter another `[' next you know that either that one opens your character alternative _or_ that the regexp was very poorly written. But there definitely are many cases where you would have to parse back all the way to the beginning. Sincerely, Luc. ^ permalink raw reply [flat|nested] 81+ messages in thread
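The "[asd[fgh]" case mentioned above can be checked directly. This is only a small sketch, assuming the usual Emacs `string-match' behavior; the test strings are made up for illustration and are not from the thread.

```elisp
;; In "[asd[fgh]" the first `[' opens the character alternative and the
;; second `[' is an ordinary member of it, so the regexp matches any one
;; of the characters a, s, d, [, f, g, h.
(string-match "\\`[asd[fgh]\\'" "[")    ; matches at position 0
(string-match "\\`[asd[fgh]\\'" "g")    ; matches at position 0
;; `]' is not a member; it only closes the alternative.
(string-match "\\`[asd[fgh]\\'" "]")    ; no match, returns nil
```

Reading backward from the `]' alone, the inner `[' looks like the opener, which is the ambiguity described above.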
* Re: Unquoted special characters in regexps 2006-03-05 19:14 ` Luc Teirlinck @ 2006-03-06 8:17 ` martin rudalics 0 siblings, 0 replies; 81+ messages in thread From: martin rudalics @ 2006-03-06 8:17 UTC (permalink / raw) Cc: ttn, emacs-devel > But there definitely are many cases where you would have to parse back > all the way to the beginning. I don't want to parse regexps, neither forward nor backward. I want to understand what the author of the expression intended to match. For that purpose I try to extract familiar patterns from the expression. Parsing a regexp in order to show that it's wrong or doesn't match what it should doesn't make sense for most human beings. The regexp engine can do that much better. Experienced programmers like you mentally parse complicated regexps from beginning to end. Experienced programmers occasionally forget that less experienced programmers are not able to do that. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Unquoted special characters in regexps 2006-02-27 19:03 ` Richard Stallman ` (2 preceding siblings ...) 2006-02-28 0:44 ` Luc Teirlinck @ 2006-02-28 0:59 ` Luc Teirlinck 2006-03-06 12:52 ` Richard Stallman 3 siblings, 1 reply; 81+ messages in thread From: Luc Teirlinck @ 2006-02-28 0:59 UTC (permalink / raw) Cc: rudalics, emacs-devel, schwab I can install either or both of the two chunks of the patch to lispref/searching.texi included below, if desired. (I sent this before, but it never arrived at emacs-devel.) The first chunk just would eliminate `]' from the list of characters that are described as special outside a character alternative. The second chunk rephrases the following: For example, a string with unbalanced square brackets is invalid (with a few exceptions, such as `[]]'), That is incorrect or at least ambiguous (how exactly do you define balanced?) as the examples below show. ELISP> (string-match "]]]]" "]]]]") 0 ELISP> (string-match "[[]" "[") 0 One accurate way to restate it would be that a string whose square brackets _with special meaning_ do not balance is invalid. This would be (unless I overlook something) without exceptions: in `[]]' the square brackets with special meaning do balance. In the patch below I formulated it differently. ===File ~/searching.texi-diff=============================== *** searching.texi 06 Feb 2006 16:02:08 -0600 1.68 --- searching.texi 26 Feb 2006 10:25:06 -0600 *************** *** 237,243 **** special constructs and the rest are @dfn{ordinary}. An ordinary character is a simple regular expression that matches that character and nothing else. The special characters are @samp{.}, @samp{*}, @samp{+}, ! @samp{?}, @samp{[}, @samp{]}, @samp{^}, @samp{$}, and @samp{\}; no new special characters will be defined in the future. Any other character appearing in a regular expression is ordinary, unless a @samp{\} precedes it. --- 237,243 ---- special constructs and the rest are @dfn{ordinary}.
An ordinary character is a simple regular expression that matches that character and nothing else. The special characters are @samp{.}, @samp{*}, @samp{+}, ! @samp{?}, @samp{[}, @samp{^}, @samp{$}, and @samp{\}; no new special characters will be defined in the future. Any other character appearing in a regular expression is ordinary, unless a @samp{\} precedes it. *************** *** 740,747 **** @kindex invalid-regexp Not every string is a valid regular expression. For example, a string ! with unbalanced square brackets is invalid (with a few exceptions, such ! as @samp{[]]}), and so is a string that ends with a single @samp{\}. If an invalid regular expression is passed to any of the search functions, an @code{invalid-regexp} error is signaled. --- 740,747 ---- @kindex invalid-regexp Not every string is a valid regular expression. For example, a string ! that ends inside a character alternative without terminating @samp{]} ! is invalid, and so is a string that ends with a single @samp{\}. If an invalid regular expression is passed to any of the search functions, an @code{invalid-regexp} error is signaled. ============================================================ ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Unquoted special characters in regexps 2006-02-28 0:59 ` Luc Teirlinck @ 2006-03-06 12:52 ` Richard Stallman 2006-03-07 5:52 ` Luc Teirlinck 0 siblings, 1 reply; 81+ messages in thread From: Richard Stallman @ 2006-03-06 12:52 UTC (permalink / raw) Cc: rudalics, schwab, emacs-devel The basic concept of a character class is an entity surrounded by matching parentheses. However, quirks such as quoting make it necessary to understand the construct in terms of left-to-right parsing for complete understanding of the details. I think the manual needs to explain both levels--the first level so beginners can begin to understand, and the second level for precise thinking about counterintuitive regexps. I could certainly do that, but I am terribly overloaded. Would someone else like to try it? Meanwhile, I sure wish the quoting conventions for regexps were more rational. But that would be an incompatible change and I think the minuses will always outweigh the pluses. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Unquoted special characters in regexps 2006-03-06 12:52 ` Richard Stallman @ 2006-03-07 5:52 ` Luc Teirlinck 2006-03-07 8:53 ` martin rudalics 0 siblings, 1 reply; 81+ messages in thread From: Luc Teirlinck @ 2006-03-07 5:52 UTC (permalink / raw) Cc: rudalics, schwab, emacs-devel Richard Stallman wrote: I think the manual needs to explain both levels--the first level so beginners can begin to understand, and the second level for precise thinking about counterintuitive regexps. I could certainly do that, but I am terribly overloaded. Would someone else like to try it? What about the following patch, which I can install if desired? It includes one unrelated change dealing with a problem I noticed in the process. It moves a paragraph occurring currently in the description of `*' to the description of `+'. (Although, from diff's perspective, it instead moves the definition of `+' up to before that paragraph. Everything is relative, I guess.) The reason is that the paragraph discusses the regexp "\(x+y*\)*a" before the meaning of `+' is explained. This makes `x+y' look like it is the sum of x and y. Also the remarks in the paragraph apply to both `*' and `+'. ===File ~/searching.texi-diff=============================== *** searching.texi 06 Feb 2006 16:02:08 -0600 1.68 --- searching.texi 06 Mar 2006 23:47:42 -0600 *************** *** 235,246 **** Regular expressions have a syntax in which a few characters are special constructs and the rest are @dfn{ordinary}. An ordinary ! character is a simple regular expression that matches that character and ! nothing else. The special characters are @samp{.}, @samp{*}, @samp{+}, ! @samp{?}, @samp{[}, @samp{]}, @samp{^}, @samp{$}, and @samp{\}; no new ! special characters will be defined in the future. Any other character ! appearing in a regular expression is ordinary, unless a @samp{\} ! precedes it.
For example, @samp{f} is not a special character, so it is ordinary, and therefore @samp{f} is a regular expression that matches the string --- 235,249 ---- Regular expressions have a syntax in which a few characters are special constructs and the rest are @dfn{ordinary}. An ordinary ! character is a simple regular expression that matches that character ! and nothing else. The special characters are @samp{.}, @samp{*}, ! @samp{+}, @samp{?}, @samp{[}, @samp{^}, @samp{$}, and @samp{\}; no new ! special characters will be defined in the future. The character ! @samp{]} is special if it ends a character alternative (see later). ! The character @samp{-} is special inside a character alternative. A ! @samp{[:} and balancing @samp{:]} enclose a character class inside a ! character alternative. Any other character appearing in a regular ! expression is ordinary, unless a @samp{\} precedes it. For example, @samp{f} is not a special character, so it is ordinary, and therefore @samp{f} is a regular expression that matches the string *************** *** 301,306 **** --- 304,316 ---- The next alternative is for @samp{a*} to match only two @samp{a}s. With this choice, the rest of the regexp matches successfully.@refill + @item @samp{+} + @cindex @samp{+} in regexp + is a postfix operator, similar to @samp{*} except that it must match + the preceding expression at least once. So, for example, @samp{ca+r} + matches the strings @samp{car} and @samp{caaaar} but not the string + @samp{cr}, whereas @samp{ca*r} matches all three strings. + Nested repetition operators take a long time, or even forever, if they lead to ambiguous matching. For example, trying to match the regular expression @samp{\(x+y*\)*a} against the string *************** *** 311,323 **** it causes an infinite loop. To avoid these problems, check nested repetitions carefully. 
- @item @samp{+} - @cindex @samp{+} in regexp - is a postfix operator, similar to @samp{*} except that it must match - the preceding expression at least once. So, for example, @samp{ca+r} - matches the strings @samp{car} and @samp{caaaar} but not the string - @samp{cr}, whereas @samp{ca*r} matches all three strings. - @item @samp{?} @cindex @samp{?} in regexp is a postfix operator, similar to @samp{*} except that it must match the --- 321,326 ---- *************** *** 468,473 **** --- 471,504 ---- can act. It is poor practice to depend on this behavior; quote the special character anyway, regardless of where it appears.@refill + As a @samp{\} is not special inside a character alternative, it can + never remove the special meaning of @samp{-} or @samp{]}. So you + should not quote these characters when they have no special meaning + either. This would not clarify anything, since backslashes can + legitimately precede these characters where they @emph{have} special + meaning, as in @code{[^\]} (@code{"[^\\]"} for Lisp string syntax), + which matches any single character except a backslash. + + In practice, most @samp{]} that occur in regular expressions close a + character alternative and hence are special. However, occasionally a + regular expression may try to match a complex pattern of literal + @samp{[} and @samp{]}. In such situations, it sometimes may be + necessary to carefully parse the regexp from the start to determine + which square brackets enclose a character alternative. For example, + @code{[^][]]}, consists of the complemented character alternative + @code{[^][]}, which matches any single character that is not a square + bracket, followed by a literal @samp{]}. + + The exact rules are that at the beginning of a regexp, @samp{[} is + special and @samp{]} not. 
This lasts until the first unquoted + @samp{[}, after which we are in a character alternative; @samp{[} is + no longer special (except if it starts a character class) but @samp{]} + is special, unless it immediately follows the special @samp{[} or that + @samp{[} followed by a @samp{^}. This lasts until the next special + @samp{]} that does not end a character class. This ends the character + alternative and restores the ordinary syntax of regular expressions; + an unquoted @samp{[} is special again and a @samp{]} not. + @node Char Classes @subsubsection Character Classes @cindex character classes in regexp *************** *** 740,747 **** @kindex invalid-regexp Not every string is a valid regular expression. For example, a string ! with unbalanced square brackets is invalid (with a few exceptions, such ! as @samp{[]]}), and so is a string that ends with a single @samp{\}. If an invalid regular expression is passed to any of the search functions, an @code{invalid-regexp} error is signaled. --- 771,778 ---- @kindex invalid-regexp Not every string is a valid regular expression. For example, a string ! that ends inside a character alternative without terminating @samp{]} ! is invalid, and so is a string that ends with a single @samp{\}. If an invalid regular expression is passed to any of the search functions, an @code{invalid-regexp} error is signaled. ============================================================ ^ permalink raw reply [flat|nested] 81+ messages in thread
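The "exact rules" paragraph in the patch above describes a left-to-right scan, which can be sketched as a small function. This is an illustration only, not code from the thread or from Emacs: the name `my-regexp-bracket-spans' is hypothetical, and the sketch assumes that anything of the form `[:lowercase-letters:]' inside a character alternative is a character class. It does not check any other aspect of regexp validity.

```elisp
(defun my-regexp-bracket-spans (regexp)
  "Return (START . END) index pairs of the character alternatives in REGEXP.
START indexes the special `[' and END the special `]' that closes it.
Signal an error if REGEXP ends inside a character alternative."
  (let ((i 0)
        (len (length regexp))
        (case-fold-search nil)
        (spans '()))
    (while (< i len)
      (let ((ch (aref regexp i)))
        (cond
         ;; Outside an alternative, `\' quotes the next character.
         ((eq ch ?\\)
          (setq i (+ i 2)))
         ;; An unquoted `[' opens a character alternative.
         ((eq ch ?\[)
          (let ((start i))
            (setq i (1+ i))
            ;; Skip an optional complement marker `^'.
            (when (and (< i len) (eq (aref regexp i) ?^))
              (setq i (1+ i)))
            ;; A `]' right after `[' or `[^' is literal.
            (when (and (< i len) (eq (aref regexp i) ?\]))
              (setq i (1+ i)))
            ;; Scan to the closing `]'; `\' is not special in here.
            (while (and (< i len) (not (eq (aref regexp i) ?\])))
              (if (and (eq (aref regexp i) ?\[)
                       (eql (string-match "\\[:[a-z]+:]" regexp i) i))
                  ;; Skip a whole character class such as [:alpha:].
                  (setq i (match-end 0))
                (setq i (1+ i))))
            (when (>= i len)
              (error "Regexp ends inside a character alternative"))
            (push (cons start i) spans)
            (setq i (1+ i))))
         ;; Everything else, including a `]' outside an alternative.
         (t
          (setq i (1+ i))))))
    (nreverse spans)))

(my-regexp-bracket-spans "[^][]]")       ; => ((0 . 4)), final `]' is literal
(my-regexp-bracket-spans "[asd[fgh]")    ; => ((0 . 8))
(my-regexp-bracket-spans "]]]]")         ; => nil, no special brackets at all
(my-regexp-bracket-spans "[[:alpha:]]")  ; => ((0 . 10))
```

The function keeps only one bit of state (inside or outside an alternative), which is why the classification of any given `]' depends on everything to its left.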
* Re: Unquoted special characters in regexps 2006-03-07 5:52 ` Luc Teirlinck @ 2006-03-07 8:53 ` martin rudalics 0 siblings, 0 replies; 81+ messages in thread From: martin rudalics @ 2006-03-07 8:53 UTC (permalink / raw) Cc: schwab, rms, emacs-devel Luc Teirlinck wrote: > What about the following patch, which I can install if desired? The patch is logically consistent and it's probably reasonable to close this issue now. If the patch is installed, a similar one will have to be written for the Emacs manual. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Unquoted special characters in regexps 2006-02-25 20:18 ` martin rudalics 2006-02-25 22:09 ` Andreas Schwab 2006-02-25 22:13 ` Luc Teirlinck @ 2006-02-25 22:34 ` Luc Teirlinck 2006-02-25 22:59 ` Andreas Schwab 2006-02-26 13:20 ` martin rudalics 2 siblings, 2 replies; 81+ messages in thread From: Luc Teirlinck @ 2006-02-25 22:34 UTC (permalink / raw) Cc: schwab, emacs-devel From my previous message: But "quoting" `]' by writing "[]]" instead of just `]', seems contorted, even though it would be less confusing than "\\]". But if you *really* want to emphasize that the `]' does not close a character alternative, "[]]" is the only way I know to do that. "\\]" does not do the job. Sincerely, Luc. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Unquoted special characters in regexps 2006-02-25 22:34 ` Luc Teirlinck @ 2006-02-25 22:59 ` Andreas Schwab 2006-02-26 13:20 ` martin rudalics 1 sibling, 0 replies; 81+ messages in thread From: Andreas Schwab @ 2006-02-25 22:59 UTC (permalink / raw) Cc: rudalics, emacs-devel Luc Teirlinck <teirllm@dms.auburn.edu> writes: > From my previous message: > > But "quoting" `]' by writing "[]]" instead of just `]', seems > contorted, even though it would be less confusing than "\\]". > > But if you *really* want to emphasize that the `]' does not close a > character alternative, "[]]" is the only way I know to do that. > "\\]" does not do the job. JFTR, in AWK regexps the backslash retains its special meaning inside character lists. Andreas. -- Andreas Schwab, SuSE Labs, schwab@suse.de SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany PGP key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5 "And now for something completely different." ^ permalink raw reply [flat|nested] 81+ messages in thread
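For contrast with the AWK remark, the Emacs behavior discussed in this subthread can be sketched with a few `string-match' calls. This assumes the usual Emacs regexp semantics, in which a backslash is not special inside a character alternative; the sample strings are made up for illustration.

```elisp
;; Outside a character alternative, all three spellings match a literal `]'.
(string-match "\\`]\\'" "]")        ; 0
(string-match "\\`\\]\\'" "]")      ; 0
(string-match "\\`[]]\\'" "]")      ; 0

;; Inside a character alternative the backslash does not quote anything:
;; the regexp [\]] is the alternative [\] followed by a literal `]'.
(string-match "\\`[\\]]\\'" "]")    ; nil
(string-match "\\`[\\]]\\'" "\\]")  ; 0, matches a backslash followed by `]'
```

So "\\]" and "]" behave identically outside an alternative, while "\\]" inside one means something else entirely.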
* Re: Unquoted special characters in regexps 2006-02-25 22:34 ` Luc Teirlinck 2006-02-25 22:59 ` Andreas Schwab @ 2006-02-26 13:20 ` martin rudalics 2006-02-26 16:53 ` Luc Teirlinck 2006-02-26 17:19 ` Luc Teirlinck 1 sibling, 2 replies; 81+ messages in thread From: martin rudalics @ 2006-02-26 13:20 UTC (permalink / raw) Cc: schwab, emacs-devel > From my previous message: > > But "quoting" `]' by writing "[]]" instead of just `]', seems > contorted, even though it would be less confusing than "\\]". > > But if you *really* want to emphasize that the `]' does not close a > character alternative, "[]]" is the only way I know to do that. > "\\]" does not do the job. I didn't care about `]'s closing a character alternative. I did care about `]'s appearing in regular expressions _outside_ a character alternative and meant to match the character `]'. Such `]'s should be quoted just like `.', `*', `+', `?', `[', `^', `$', and `\'. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Unquoted special characters in regexps 2006-02-26 13:20 ` martin rudalics @ 2006-02-26 16:53 ` Luc Teirlinck 2006-02-26 18:01 ` martin rudalics 2006-02-26 17:19 ` Luc Teirlinck 1 sibling, 1 reply; 81+ messages in thread From: Luc Teirlinck @ 2006-02-26 16:53 UTC (permalink / raw) Cc: schwab, emacs-devel Martin Rudalics wrote: Such `]'s should be quoted just like `.', `*', `+', `?', `[', `^', `$', and `\'. Even if that _would_ be true, `]' should _not_ be quoted as "\\]" but as "[]]", as I pointed out earlier. We are talking about Elisp, not AWK. According to the Elisp manual all these exhibit "poor practice" since you didn't quote the second `]'s. You should have complained about that when you read the manual. Reading through a large body of text, it is easy to miss a small and non-obvious detail like this. I am complaining about it now and, in a separate message, proposed a patch. Sincerely, Luc. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Unquoted special characters in regexps 2006-02-26 16:53 ` Luc Teirlinck @ 2006-02-26 18:01 ` martin rudalics 0 siblings, 0 replies; 81+ messages in thread From: martin rudalics @ 2006-02-26 18:01 UTC (permalink / raw) Cc: schwab, emacs-devel > Martin Rudalics wrote: > > Such `]'s should be quoted just like `.', `*', `+', `?', `[', `^', > `$', and `\'. > > Even if that _would_ be true, `]' should _not_ be quoted as "\\]" but as > "[]]", as I pointed out earlier. We are talking about Elisp, not AWK. The Elisp manual clearly states that "Any other character appearing in a regular expression is ordinary, unless a `\' precedes it." Hence, quoting with a backslash is the canonical method in Elisp. martin, who doesn't talk AWK. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Unquoted special characters in regexps 2006-02-26 13:20 ` martin rudalics 2006-02-26 16:53 ` Luc Teirlinck @ 2006-02-26 17:19 ` Luc Teirlinck 2006-02-26 18:13 ` martin rudalics 1 sibling, 1 reply; 81+ messages in thread From: Luc Teirlinck @ 2006-02-26 17:19 UTC (permalink / raw) Cc: schwab, emacs-devel Martin Rudalics wrote: I didn't care about `]'s closing a character alternative. I did care about `]'s appearing in regular expressions _outside_ a character alternative and meant to match the character `]'. Such `]'s should be quoted just like `.', `*', `+', `?', `[', `^', `$', and `\'. To expand on my previous reply, you are arguing from a purely legalistic viewpoint here, based on an error in the manual. The _purpose_ of quoting special characters even when they have no special meaning is to make clear that they have no special meaning. "\\]" does not clear up any confusion in situations where confusion could be possible, so it is completely meaningless. There is a mistake in the manual. We should correct that mistake, not make meaningless, and even confusing, changes to code. Sincerely, Luc. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Unquoted special characters in regexps 2006-02-26 17:19 ` Luc Teirlinck @ 2006-02-26 18:13 ` martin rudalics 0 siblings, 0 replies; 81+ messages in thread From: martin rudalics @ 2006-02-26 18:13 UTC (permalink / raw) Cc: schwab, emacs-devel > To expand on my previous reply, you are arguing from a purely > legalistic viewpoint here, based on an error in the manual. The > _purpose_ of quoting special characters even when they have no special > meaning is to make clear that they have no special meaning. "\\]" does > not clear up any confusion in situations where confusion could be > possible, so it is completely meaningless. > > There is a mistake in the manual. We should correct that mistake, not > make meaningless, and even confusing, changes to code. Do you really think that considering (string-match "]" "]") a valid Elisp expression and (string-match "[" "[") an invalid one is less confusing? ^ permalink raw reply [flat|nested] 81+ messages in thread
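The asymmetry raised in the question above can be observed by trapping the `invalid-regexp' error that the manual text quoted earlier in the thread mentions. The helper below is a hypothetical sketch (the name `my-regexp-valid-p' is not an Emacs function); it assumes only that `string-match' signals `invalid-regexp' for a malformed pattern.

```elisp
(defun my-regexp-valid-p (regexp)
  "Return t if REGEXP can be used for matching, nil if it is invalid."
  (condition-case nil
      (progn
        ;; Matching against the empty string forces the regexp to be compiled.
        (string-match regexp "")
        t)
    (invalid-regexp nil)))

(my-regexp-valid-p "]")     ; t,   a lone `]' is an ordinary character
(my-regexp-valid-p "[")     ; nil, an unclosed character alternative
(my-regexp-valid-p "\\[")   ; t,   the quoted `[' is literal
(my-regexp-valid-p "[]]")   ; t,   `]' in first position is literal
(my-regexp-valid-p "\\")    ; nil, the regexp ends with a single backslash
```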