Re: Scan of regexps in Emacs (March 17)

unofficial mirror of emacs-devel@gnu.org 

From: "Mattias Engdegård" <mattiase@acm.org>
To: Paul Eggert <eggert@cs.ucla.edu>
Cc: Stefan Monnier <monnier@iro.umontreal.ca>, emacs-devel@gnu.org
Subject: Re: Scan of regexps in Emacs (March 17)
Date: Tue, 2 Apr 2019 16:15:13 +0200
Message-ID: <09AE372B-3A30-4596-8C4E-B9F4CBF6E348@acm.org>
In-Reply-To: <f0edb8ac-9a9a-6cd6-3594-ea12cdbcd03b@cs.ucla.edu>

[-- Attachment #1: Type: text/plain, Size: 3831 bytes --]

On 2 Apr 2019, at 09:33, Paul Eggert <eggert@cs.ucla.edu> wrote:
> 
>> don't we also need a precise description of exactly how they are interpreted by the engine?
> 
> In other parts of Emacs, we are typically OK with specs that don't completely specify behavior. This gives us more freedom to make changes in the undocumented behavior later. I think it makes sense to do that here too, for regular expressions like "[z-a-m]" that most readers would find confusing.

Then where does a user go to understand extant regexps? (Do we have any latitude at all for changing even obscure corners of regexp syntax and semantics today?) That's why I favour expounding on the details in a separate section.

>> The terminology is a bit confusing. Is 'raw 8-bit byte' included in 'unibyte'? Is \x7f ever a raw 8-bit byte?
>> I agree that [å-\xff], say, should be invalid but I've never seen such constructs.
> 
> After looking into it I realized that I don't really know the semantics here (the text I recently added there seems to be wrong, in some cases), and I have my doubts that anyone else knows the semantics either. The attached patch simply gets rid of that section, leaving the area undocumented. User beware!

Apparently I don't really know it either -- I just discovered that:

(string-match "\xff"     "\xff")  => 0
(string-match "[\xff]"   "\xff")  => 0
(string-match "\xffé?"   "\xff")  => nil
(string-match "[\xff]é?" "\xff")  => 0
(string-match "\xff"     "\xffé") => 0
(string-match "[\xff]"   "\xffé") => nil
(string-match "\xffé?"   "\xffé") => 0
(string-match "[\xff]é?" "\xffé") => nil

> OK, then we should document z-a as the preferred syntax (best go with the flow...). Done in the attached patch.

Actually, the only place where I saw z-a was in auctex (in negated form, [^z-a]).
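
For readers unfamiliar with the idiom, here is a minimal sketch of what a reversed range does. It assumes, as the proposed manual text does, that the engine treats z-a as an empty range; the function name is made up for illustration.

```elisp
(defun my-reversed-range-demo ()
  "Return match results for a reversed range and its negation.
Hypothetical demo; assumes a reversed range such as z-a is empty."
  (let ((case-fold-search nil))
    (list
     ;; Empty set: expected to match nothing at all.
     (string-match-p "[z-a]" "abcxyz")
     ;; Negated empty set: expected to match any character,
     ;; newline included.
     (string-match-p "[^z-a]" "\n"))))
```

Under that assumption the call should yield a list whose first element is nil and whose second is 0, which is why [^z-a] can serve as an "any character" pattern.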

>> As an experiment, I added detection of 'chained' ranges like [a-m-z] to xr and found a handful in both Emacs and GNU ELPA, but none of them carried a freeload of bugs. Keeping that check didn't seem worthwhile; the regexps may be a bit odd-looking, but aren't wrong.
> 
> It depends on what one means by "wrong". If one wants to use the ranges in both Emacs and grep they are "wrong", so it's reasonable for the manual to recommend against them.

Definitely agree that it should be discouraged. I've attached the ones found by a modified relint/xr, in case you are interested.
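
To make the notion of a chained range concrete, here is a rough sketch of such a check. It is not the code used in xr or relint; the helper name is hypothetical, and it looks only at the text between the brackets, ignoring character classes like [:alpha:] and the special leading `]'.

```elisp
(defun my-chained-range-p (contents)
  "Return the index in CONTENTS where a chained range starts, or nil.
CONTENTS is the text between the brackets of a character
alternative, without a leading `^'.  Hypothetical helper for
illustration only."
  (let ((i 0)
        (n (length contents))
        found)
    (while (and (not found) (< i n))
      (if (and (< (+ i 2) n)
               (eq (aref contents (1+ i)) ?-))
          ;; A range X-Y occupies positions i .. i+2.
          (if (and (< (+ i 4) n)
                   (eq (aref contents (+ i 3)) ?-))
              ;; Y is followed by `-' and one more character: X-Y-Z.
              (setq found i)
            (setq i (+ i 3)))
        (setq i (1+ i))))
    found))
```

With this sketch, (my-chained-range-p "a-zA-Z-\\.") points at the A of A-Z-\, while "a-z_-" and "0-9-" are left alone because a trailing `-' is an ordinary literal.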

> It might also help for the trawler to warn about [X-Z] where Z = X+2. [XYZ] is clearer and less error-prone than [X-Z]. I shoehorned that into the attached patch too.

These seem to be rare; I found exactly one occurrence (lisp/gnus/message.el:1291):

 "[ \t]\\|[][!\"#$%&'()*+,-./0-9;<=>?@A-Z\\^_`a-z{|}~]+:"

which uses the punny range ,-. (possibly by benign accident).
Similarly, singleton ranges, X-X, are non-existent save for --- which I presume is an XEmacs workaround.

The latest xr version warns about 2-character ranges, except within digits, because [0-1] etc. was found to be common and harmless.
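
The rule just described can be sketched as a small predicate. This is only an illustration of the stated rule, not xr's actual implementation, and the function name is invented.

```elisp
(defun my-short-range-warning-p (from to)
  "Return non-nil if the range FROM-TO deserves a warning.
FROM and TO are characters.  The range is flagged when it spans
exactly two characters, unless both ends are digits.
Hypothetical helper for illustration only."
  (and (= to (1+ from))
       ;; Digit ranges such as 0-1 are exempt.
       (not (and (<= ?0 from) (<= to ?9)))))
```

So (my-short-range-warning-p ?a ?b) is non-nil, whereas (my-short-range-warning-p ?0 ?1) is nil, and so is (my-short-range-warning-p ?, ?.) because that range covers three characters.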

diff --git a/doc/lispref/searching.texi b/doc/lispref/searching.texi
index 748ab586af..72ee9233a3 100644
--- a/doc/lispref/searching.texi
+++ b/doc/lispref/searching.texi
...
+A character alternative can include duplicates.  For example,
+@samp{[XYa-yYb-zX]} is less clear than @samp{[XYa-z]}.

Certainly, but does this need to be mentioned? Overlapping ranges are rarely written on purpose. Besides, duplication isn't confined to ranges.

More useful, I think, would be to recommend ranges to stay within natural sequences (letters, digits, etc) so that a reader needn't consult a table to see what is included. Thus [0-9.:/] good, [.-:] bad, even though they denote the same set.
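
The claim that the two forms denote the same set can be checked with a short sketch that enumerates ASCII characters. The helper name is hypothetical, and only ASCII is covered, which is enough for these two expressions.

```elisp
(require 'cl-lib)

(defun my-ascii-chars-matching (regexp)
  "Return a string of the ASCII characters that REGEXP matches.
Each character is tried on its own as a one-character string.
Hypothetical helper for illustration only."
  (let ((case-fold-search nil))
    (cl-loop for c from 0 to 127
             when (string-match-p regexp (string c))
             concat (string c))))

;; Both forms yield "./0123456789:", so this comparison is t.
(equal (my-ascii-chars-matching "[0-9.:/]")
       (my-ascii-chars-matching "[.-:]"))
```

A reader has to know that `/' and the digits sit between `.' and `:' in ASCII to see this from [.-:] alone, which is the argument for the more explicit spelling.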

+@item
+A @samp{-} also appear at the beginning of a character alternative, or

'appears'


[-- Attachment #2: chained-ranges.log --]
[-- Type: application/octet-stream, Size: 2310 bytes --]

;; -*- compilation -*-
lisp/gnus/nndoc.el:704:30: In call to re-search-forward: Attempt at chained ranges `A-Z-\' (pos 47)
  "^\\\\\\\\\n\\(Paper\\( (\\*cross-listing\\*)\\)?: [a-zA-Z-\\.]+/[0-9]+\\|arXiv:\\)"
   .........................................................^
lisp/gnus/nndoc.el:735:27: In call to looking-at: Attempt at chained ranges `A-Z-\' (pos 34)
  "^\\(Paper.*: \\|arXiv:\\)\\([0-9a-zA-Z-\\./]+\\)"
   ......................................^
lisp/org/org-eshell.el:40:29: In call to string-match: Attempt at chained ranges `0-9-+' (pos 12)
  "\\([A-Za-z0-9-+*]+\\):\\(.*\\)"
   .............^
lisp/org/org.el:432:3: In org-deadline-time-hour-regexp: Attempt at chained ranges `0-9-+' (pos 48)
  "\\<DEADLINE: *<\\([^>]+[0-9]\\{1,2\\}:[0-9]\\{2\\}[0-9-+:hdwmy \t.]*\\)>"
   ......................................................^
lisp/org/org.el:448:3: In org-scheduled-time-hour-regexp: Attempt at chained ranges `0-9-+' (pos 49)
  "\\<SCHEDULED: *<\\([^>]+[0-9]\\{1,2\\}:[0-9]\\{2\\}[0-9-+:hdwmy \t.]*\\)>"
   .......................................................^
lisp/progmodes/bat-mode.el:68:3: In bat-font-lock-keywords: Attempt at chained ranges `0-9-_' (pos 39)
  "\\_<\\(call\\|goto\\)\\_>[ \t]+%?\\([A-Za-z0-9-_\\:.]+\\)%?"
   ..............................................^
lisp/progmodes/bug-reference.el:72:3: In bug-reference-bug-regexp: Attempt at chained ranges `a-z-+' (pos 42)
  "\\([Bb]ug ?#?\\|[Pp]atch ?#\\|RFE ?#\\|PR [a-z-+]+/\\)\\([0-9]+\\(?:#[0-9]+\\)?\\)"
   ..............................................^
lisp/textmodes/less-css-mode.el:196:3: In less-css-font-lock-keywords: Attempt at chained ranges `a-z-_' (pos 12)
  "@[a-z_-][a-z-_0-9]*"
   ............^
lisp/textmodes/less-css-mode.el:196:3: In less-css-font-lock-keywords: Attempt at chained ranges `a-z-_' (pos 30)
  "\\(?:[ \t{;]\\|^\\)\\(\\.[a-z_-][a-z-_0-9]*\\)[ \t]*;"
   ....................................^
lisp/vc/vc-cvs.el:1090:27: In call to string-match: Attempt at chained ranges `A-Z-_' (pos 11)
  "[^a-z0-9A-Z-_]"
   ...........^
lisp/vc/vc-svn.el:762:27: In call to string-match: Attempt at chained ranges `A-Z-_' (pos 11)
  "[^a-z0-9A-Z-_]"
   ...........^
lisp/files.el:6319:28: In call to string-match: Attempt at chained ranges `0-9-_' (pos 11)
  "[^A-Za-z0-9-_.~#+]"
   ...........^

