all messages for Emacs-related lists mirrored at yhetil.org
 help / color / mirror / code / Atom feed
From: Paul Eggert <eggert@cs.ucla.edu>
To: "Mattias Engdegård" <mattiase@acm.org>
Cc: emacs-devel <emacs-devel@gnu.org>
Subject: Scan of regexp mistakes
Date: Fri, 8 Mar 2019 09:13:04 -0800	[thread overview]
Message-ID: <e2c83a03-a6ba-5b44-b49e-d2c4da447213@cs.ucla.edu> (raw)
In-Reply-To: <D3E6A383-1368-46F0-8D37-7C45FD322DDC@acm.org>

[-- Attachment #1: Type: text/plain, Size: 5271 bytes --]

On 3/5/19 7:06 AM, Mattias Engdegård wrote:
>
> I can run it periodically but would surely forget. Should I put the trawler in the Emacs source tree (if so, where?), in ELPA, or elsewhere?

Stefan mentioned one possibility. Though even then I daresay it'd be
helpful if you ran it periodically, just as I periodically run
admin/merge-gnulib. (If you don't run it, it's likely nobody else will....)


> -      (re-search-forward "^\\(\s+=+\s?+\\)+\n")
> +      (re-search-forward "^\\(\s+=+\s+\\)+\n")
>                                     ^^^
> Are you sure this shouldn't be `\s*'

No, good point. I'll change it to that. See attached patch.


> -  "[a-z0-9$%(*-=?[_][^<>\")!;:,{}\n\t @]*"
> +  "[a-z$%(*-=?[_][^<>\")!;:,{}\n\t @]*"
>
> You kept the rather odd range `*-=' which comprises `*+,-./0123456789:;<='. Is it supposed to be that way?

Goodness knows what is intended here, as this is some ad hoc variant of
RFC 5322/2822/822 and I don't know which variant. Might as well spell
out the range in a more-conventional way, though. The attached patch
replaces *-= with *+./= (as ,:;< aren't allowed unquoted in RFC 5322
atoms) and puts - and 0-9 elsewhere; this should be closer to what was
wanted and should be clearer anyway. I was unable to track down whatever
suggestion was made by Felix Wiemann long ago, and so removed that
comment (since the regexp no longer matches his suggestion anyway).


> -    (while (re-search-forward "\\ce[»\\.\\?]\\|«\\ce" nil t)
> +    (while (re-search-forward "\\ce[»\\.?]\\|«\\ce" nil t)
>
> Should `\' really be kept in the set of characters? It looks like it was only included as an attempt to escape `.' and `?'.

Yes, probably. Fixed in the attached.


> searching for A-z uncovers more suspect regexps, some of which aren't found by the trawler.

I wonder where those all came from? I attempted to fix them in the attached.


> Here is another one in the same file (line 33), but that wasn't found by the trawler:
>
> 	 (replace-regexp-in-string "[\000-\032\177<>#%\"{}|\\^[]`%?;]"
>
> That \032 doesn't look right (number base confusion?), and it looks like it's meant as a single character alternative but it isn't, given the misplaced `]'.

The regexp has other troubles. It doesn't include !$'()*+,/:@&= (all of
which are reserved characters according to RFC 3986), and it has
duplicate %. The attached patch fixes the % and puts in a FIXME about
the other chars.


> diff --git a/lisp/org/org-mobile.el b/lisp/org/org-mobile.el
> index 1ff6358403..83dcc7b0d1 100644
> --- a/lisp/org/org-mobile.el
> +++ b/lisp/org/org-mobile.el
> @@ -845,11 +845,11 @@ If BEG and END are given, only do this in that region."
>  	    (cl-incf cnt-error)
>  	    (throw 'next t))
>  	  (move-marker bos-marker (point))
> -	  (if (re-search-forward "^** Old value[ \t]*$" eos t)
> +	  (if (re-search-forward "^\\*\\* Old value[ \t]*$" eos t)
>
> Shouldn't this start with "^\\**", or does it have to be exactly two asterisks?
>
>  	      (setq old (buffer-substring
>  			 (1+ (match-end 0))
>  			 (progn (outline-next-heading) (point)))))
> -	  (if (re-search-forward "^** New value[ \t]*$" eos t)
> +	  (if (re-search-forward "^\\*\\* New value[ \t]*$" eos t)
>
> Idem.

"\\**" would be safer, yes. Fixed in the attached.


> -   ((string-match "^//\\(.?*\\)/\\(<.*>\\)$" path)
> +   ((string-match "^//\\(.*\\)/\\(<.*>\\)$" path)
>
> Another repetition-of-repetition. Sure it shouldn't be `*?' instead? It looks likely, since there is a `/' following that would be eaten by the `.*' given half a chance.

The comment on the next line says "Planner has the id after the final
slash", which implies that the first .* should indeed be greedy.


>
> diff --git a/lisp/progmodes/fortran.el b/lisp/progmodes/fortran.el
> index be272c0922..c1a267f4c5 100644
> --- a/lisp/progmodes/fortran.el
> +++ b/lisp/progmodes/fortran.el
> @@ -2052,7 +2052,7 @@ If ALL is nil, only match comments that start in column > 0."
>                  (when (<= (point) bos)
>                    (move-to-column (1+ fill-column))
>                    ;; What is this doing???
> -                  (or (re-search-forward "[\t\n,'+-/*)=]" eol t)
> +                  (or (re-search-forward "[-\t\n,'+./*)=]" eol t)
>
> Where did the . come from? Don't you think that `+-/*' were meant to include those four symbols only?

I couldn't figure out what the code was doing (note the comment...) so
decided to preserve the semantics of the old regexp. But you're right,
"." is likely not intended there. I removed it in the attached.


>  		;; Regexp bug in XEmacs disallows ][ inside [], and wants + last
> -		"\\s-*\\.\\(\\([a-zA-Z0-9`_$+@^.*?|---]\\|[][]\\|\\\\[()|]\\)+\\)\\s-*(\\(.*\\))\\s-*\\(,\\|)\\s-*;\\)")
> +		"\\s-*\\.\\(\\([-a-zA-Z0-9`_$+@^.*?]\\|[][]\\|\\\\[()|]\\)+\\)\\s-*(\\(.*\\))\\s-*\\(,\\|)\\s-*;\\)")
>  	       (setq rep (match-string-no-properties 3))
>  	       (goto-char (match-end 0))
>  	       (setq tpl-wild-list
>
> Are you sure that | shouldn't be there too? Or is this some kind of XEmacs idiom?
>
You're right. Wilson Snyder later stepped in and fixed that string. Nice
to have a real expert in the house.


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-More-regexp-corrections-and-tweaks.patch --]
[-- Type: text/x-patch; name="0001-More-regexp-corrections-and-tweaks.patch", Size: 9246 bytes --]

From 03d2da31d311def9773ede09b6ad89b094c61805 Mon Sep 17 00:00:00 2001
From: Paul Eggert <eggert@cs.ucla.edu>
Date: Fri, 8 Mar 2019 09:08:46 -0800
Subject: [PATCH] More regexp corrections and tweaks
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

From suggestions by Mattias Engdegård in:
https://lists.gnu.org/r/emacs-devel/2019-03/msg00131.html
* lisp/arc-mode.el (archive-rar-summarize):
* lisp/gnus/gnus-art.el (gnus-button-valid-localpart-regexp):
* lisp/language/ethio-util.el (ethio-fidel-to-tex-buffer):
* lisp/nxml/rng-uri.el (rng-file-name-uri):
* lisp/org/org-mobile.el (org-mobile-apply):
* lisp/progmodes/cperl-mode.el (cperl-init-faces):
* lisp/progmodes/fortran.el (fortran-fill):
* lisp/progmodes/mantemp.el (mantemp-remove-comments)
(mantemp-remove-memfuncs, mantemp-insert-cxx-syntax):
* lisp/speedbar.el (speedbar-directory-buttons-follow):
* lisp/vc/add-log.el (change-log-font-lock-keywords):
Fix more regular expressions that seem to be typos or infelicities.
---
 lisp/arc-mode.el             | 2 +-
 lisp/gnus/gnus-art.el        | 3 +--
 lisp/language/ethio-util.el  | 2 +-
 lisp/nxml/rng-uri.el         | 7 ++++---
 lisp/org/org-mobile.el       | 4 ++--
 lisp/progmodes/cperl-mode.el | 4 ++--
 lisp/progmodes/fortran.el    | 2 +-
 lisp/progmodes/mantemp.el    | 8 ++++----
 lisp/speedbar.el             | 2 +-
 lisp/vc/add-log.el           | 2 +-
 10 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/lisp/arc-mode.el b/lisp/arc-mode.el
index 2afde7ee75..6a58d61a54 100644
--- a/lisp/arc-mode.el
+++ b/lisp/arc-mode.el
@@ -2016,7 +2016,7 @@ archive-rar-summarize
       (call-process "lsar" nil t nil "-l" (or file copy))
       (if copy (delete-file copy))
       (goto-char (point-min))
-      (re-search-forward "^\\(\s+=+\s+\\)+\n")
+      (re-search-forward "^\\(\s+=+\s*\\)+\n")
       (while (looking-at (concat "^\s+[0-9.]+\s+D?-+\s+"   ; Flags
                                  "\\([0-9-]+\\)\s+"        ; Size
                                  "\\([-0-9.%]+\\|-+\\)\s+" ; Ratio
diff --git a/lisp/gnus/gnus-art.el b/lisp/gnus/gnus-art.el
index fa3abfac58..baf44cb483 100644
--- a/lisp/gnus/gnus-art.el
+++ b/lisp/gnus/gnus-art.el
@@ -7376,9 +7376,8 @@ gnus-button-valid-fqdn-regexp
   :group 'gnus-article-buttons
   :type 'regexp)
 
-;; Regexp suggested by Felix Wiemann in <87oeuomcz9.fsf@news2.ososo.de>
 (defcustom gnus-button-valid-localpart-regexp
-  "[a-z$%(*-=?[_][^<>\")!;:,{}\n\t @]*"
+  "[-a-z0-9$%(*+./=?[_][^<>\")!;:,{}\n\t @]*"
   "Regular expression that matches a localpart of mail addresses or MIDs."
   :version "22.1"
   :group 'gnus-article-buttons
diff --git a/lisp/language/ethio-util.el b/lisp/language/ethio-util.el
index 512d49b9c5..04b15ddd9a 100644
--- a/lisp/language/ethio-util.el
+++ b/lisp/language/ethio-util.el
@@ -804,7 +804,7 @@ ethio-fidel-to-tex-buffer
 
     ;; Special Ethiopic punctuation.
     (goto-char (point-min))
-    (while (re-search-forward "\\ce[»\\.?]\\|«\\ce" nil t)
+    (while (re-search-forward "\\ce[».?]\\|«\\ce" nil t)
       (cond
        ((= (setq ch (preceding-char)) ?\»)
 	(delete-char -1)
diff --git a/lisp/nxml/rng-uri.el b/lisp/nxml/rng-uri.el
index d8f2884f5e..798475bbc3 100644
--- a/lisp/nxml/rng-uri.el
+++ b/lisp/nxml/rng-uri.el
@@ -30,9 +30,10 @@ rng-file-name-uri
 escape them using %HH."
   (setq f (expand-file-name f))
   (let ((url
-	 (replace-regexp-in-string "[\000-\032\177<>#%\"{}|\\^[]`%?;]"
-				   'rng-percent-encode
-				   f)))
+	 ;; FIXME. Explain why the pattern doesn't also have "!$&'()*+,/:@=".
+	 ;; See Internet RFC 3986 section 2.2.
+	 (replace-regexp-in-string "[]\0-\s\"#%;<>?[\\^`{|}\177]"
+				   'rng-percent-encode f)))
     (concat "file:"
 	    (if (and (> (length url) 0)
 		     (= (aref url 0) ?/))
diff --git a/lisp/org/org-mobile.el b/lisp/org/org-mobile.el
index 83dcc7b0d1..8b4e895388 100644
--- a/lisp/org/org-mobile.el
+++ b/lisp/org/org-mobile.el
@@ -845,11 +845,11 @@ org-mobile-apply
 	    (cl-incf cnt-error)
 	    (throw 'next t))
 	  (move-marker bos-marker (point))
-	  (if (re-search-forward "^\\*\\* Old value[ \t]*$" eos t)
+	  (if (re-search-forward "^\\** Old value[ \t]*$" eos t)
 	      (setq old (buffer-substring
 			 (1+ (match-end 0))
 			 (progn (outline-next-heading) (point)))))
-	  (if (re-search-forward "^\\*\\* New value[ \t]*$" eos t)
+	  (if (re-search-forward "^\\** New value[ \t]*$" eos t)
 	      (setq new (buffer-substring
 			 (1+ (match-end 0))
 			 (progn (outline-next-heading)
diff --git a/lisp/progmodes/cperl-mode.el b/lisp/progmodes/cperl-mode.el
index 0fe4b106c5..a9402e17a9 100644
--- a/lisp/progmodes/cperl-mode.el
+++ b/lisp/progmodes/cperl-mode.el
@@ -5736,9 +5736,9 @@ cperl-init-faces
 			 (if (eq (char-after (cperl-1- (match-end 0))) ?\{ )
 			     'font-lock-function-name-face
 			   'font-lock-variable-name-face))))
-	    '("\\<\\(package\\|require\\|use\\|import\\|no\\|bootstrap\\)[ \t]+\\([a-zA-z_][a-zA-z_0-9:]*\\)[ \t;]" ; require A if B;
+	    '("\\<\\(package\\|require\\|use\\|import\\|no\\|bootstrap\\)[ \t]+\\([a-zA-Z_][a-zA-Z_0-9:]*\\)[ \t;]" ; require A if B;
 	      2 font-lock-function-name-face)
-	    '("^[ \t]*format[ \t]+\\([a-zA-z_][a-zA-z_0-9:]*\\)[ \t]*=[ \t]*$"
+	    '("^[ \t]*format[ \t]+\\([a-zA-Z_][a-zA-Z_0-9:]*\\)[ \t]*=[ \t]*$"
 	      1 font-lock-function-name-face)
 	    (cond ((featurep 'font-lock-extra)
 		   '("\\([]}\\\\%@>*&]\\|\\$[a-zA-Z0-9_:]*\\)[ \t]*{[ \t]*\\(-?[a-zA-Z0-9_:]+\\)[ \t]*}"
diff --git a/lisp/progmodes/fortran.el b/lisp/progmodes/fortran.el
index c1a267f4c5..b8aa521cf6 100644
--- a/lisp/progmodes/fortran.el
+++ b/lisp/progmodes/fortran.el
@@ -2052,7 +2052,7 @@ fortran-fill
                 (when (<= (point) bos)
                   (move-to-column (1+ fill-column))
                   ;; What is this doing???
-                  (or (re-search-forward "[-\t\n,'+./*)=]" eol t)
+                  (or (re-search-forward "[-\t\n,'+/*)=]" eol t)
                       (goto-char bol)))
                 (if (bolp)
                     (re-search-forward "[ \t]" opoint t))
diff --git a/lisp/progmodes/mantemp.el b/lisp/progmodes/mantemp.el
index 9beeb4aae6..4190a84727 100644
--- a/lisp/progmodes/mantemp.el
+++ b/lisp/progmodes/mantemp.el
@@ -89,7 +89,7 @@ mantemp-remove-comments
   (save-excursion
     (goto-char (point-min))
     (message "Removing comments")
-    (while (re-search-forward "^[A-z.()+0-9: ]*`\\|'.*$" nil t)
+    (while (re-search-forward "^[a-zA-Z.()+0-9: ]*`\\|'.*$" nil t)
       (replace-match ""))))
 
 (defun mantemp-remove-memfuncs ()
@@ -99,14 +99,14 @@ mantemp-remove-memfuncs
     (goto-char (point-min))
     (message "Removing member function extensions")
     (while (re-search-forward
-	    "^[A-z :&*<>~=,0-9+]*>::operator " nil t nil)
+	    "^[a-zA-Z :&*<>~=,0-9+]*>::operator " nil t nil)
       (progn
 	(backward-char 11)
 	(delete-region (point) (line-end-position))))
     ;; Remove other member function extensions.
     (goto-char (point-min))
     (message "Removing member function extensions")
-    (while (re-search-forward "^[A-z :&*<>~=,0-9+]*>::" nil t nil)
+    (while (re-search-forward "^[a-zA-Z :&*<>~=,0-9+]*>::" nil t nil)
       (progn
 	(backward-char 2)
 	(delete-region (point) (line-end-position))))))
@@ -154,7 +154,7 @@ mantemp-insert-cxx-syntax
     (goto-char (point-min))
     (message "Inserting 'template' for functions")
     (while (re-search-forward
-	    "^template class [A-z :&*<>~=,0-9+!]*(" nil t nil)
+	    "^template class [a-zA-Z :&*<>~=,0-9+!]*(" nil t nil)
       (progn
 	(beginning-of-line)
 	(forward-word-strictly 1)
diff --git a/lisp/speedbar.el b/lisp/speedbar.el
index 46b3f2ea90..a7fd564e94 100644
--- a/lisp/speedbar.el
+++ b/lisp/speedbar.el
@@ -3388,7 +3388,7 @@ speedbar-directory-buttons-follow
   "Speedbar click handler for default directory buttons.
 TEXT is the button clicked on.  TOKEN is the directory to follow.
 INDENT is the current indentation level and is unused."
-  (if (string-match "^[A-z]:$" token)
+  (if (string-match "^[A-Za-z]:$" token)
       (setq default-directory (concat token "/"))
     (setq default-directory token))
   ;; Because we leave speedbar as the current buffer,
diff --git a/lisp/vc/add-log.el b/lisp/vc/add-log.el
index 9fe06bbf52..f9efd44c5c 100644
--- a/lisp/vc/add-log.el
+++ b/lisp/vc/add-log.el
@@ -239,7 +239,7 @@ change-log-font-lock-keywords
     ;; wrongly with a non-date line existing as a random note.  In
     ;; addition, using any kind of fixed setting like this doesn't
     ;; work if a user customizes add-log-time-format.
-    ("^[0-9-]+ +\\|^ \\{11,\\}\\|^\t \\{3,\\}\\|^\\(Sun\\|Mon\\|Tue\\|Wed\\|Thu\\|Fri\\|Sat\\) [A-z][a-z][a-z] [0-9:+ ]+"
+    ("^[0-9-]+ +\\|^ \\{11,\\}\\|^\t \\{3,\\}\\|^\\(Sun\\|Mon\\|Tue\\|Wed\\|Thu\\|Fri\\|Sat\\) [A-Z][a-z][a-z] [0-9:+ ]+"
      (0 'change-log-date)
      ;; Name and e-mail; some people put e-mail in parens, not angles.
      ("\\([^<(]+?\\)[ \t]*[(<]\\([A-Za-z0-9_.+-]+@[A-Za-z0-9_.-]+\\)[>)]" nil nil
-- 
2.20.1


  parent reply	other threads:[~2019-03-08 17:13 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-03-03 18:01 Scan of regexp mistakes Mattias Engdegård
2019-03-05  2:04 ` Paul Eggert
2019-03-05  4:30   ` Clément Pit-Claudel
2019-03-05 15:06   ` Mattias Engdegård
2019-03-08  6:02     ` Stefan Monnier
2019-03-08 17:13     ` Paul Eggert [this message]
     [not found]       ` <1c44342f-01ab-1bc6-4b5c-3485a6f01fda@lanl.gov>
     [not found]         ` <61035fed-564a-8766-8982-0d05acea341c@cs.ucla.edu>
2019-03-08 20:46           ` bug#34789: " Davis Herring
2019-03-11 14:20             ` Eli Zaretskii
2019-03-09 12:36       ` Mattias Engdegård
2019-03-09 17:14         ` Paul Eggert
2019-03-09 13:09       ` Mattias Engdegård
2019-03-09 17:12         ` Paul Eggert
2019-03-09 20:16           ` Mattias Engdegård
     [not found] <20190305112504.D97DD1EC@emma.svaha.wsnyder.org>
2019-03-05 15:35 ` Wilson Snyder
2019-03-05 16:12   ` Mattias Engdegård
2019-03-07 17:39   ` Paul Eggert

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=e2c83a03-a6ba-5b44-b49e-d2c4da447213@cs.ucla.edu \
    --to=eggert@cs.ucla.edu \
    --cc=emacs-devel@gnu.org \
    --cc=mattiase@acm.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this external index

	https://git.savannah.gnu.org/cgit/emacs.git
	https://git.savannah.gnu.org/cgit/emacs/org-mode.git

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.