unofficial mirror of emacs-devel@gnu.org 
From: "Mattias Engdegård" <mattiase@acm.org>
To: Paul Eggert <eggert@cs.ucla.edu>
Cc: Stefan Monnier <monnier@iro.umontreal.ca>, emacs-devel@gnu.org
Subject: Re: Scan of regexps in Emacs (March 17)
Date: Thu, 21 Mar 2019 12:15:57 +0100
Message-ID: <21CCFA3D-B391-44E1-9ED5-1D37009F1988@acm.org>
In-Reply-To: <4b1164c4-e302-ce41-07c3-145d31a97b4c@cs.ucla.edu>

[-- Attachment #1: Type: text/plain, Size: 4250 bytes --]

On 20 March 2019 at 23:01, Paul Eggert <eggert@cs.ucla.edu> wrote:
> 
> On 3/19/19 7:20 PM, Stefan Monnier wrote:
>> I wonder why the doc doesn't just say that `-` should be the last
>> character and not mention the other possibilities which just make the
>> rule unnecessarily complex.

Agreed, that is what the 'how to write regexps' part of the docs should say. But don't we also need a precise description of exactly how they are interpreted by the engine? Otherwise, a user cannot read and understand existing code. (Unless he or she uses xr!) Perhaps there needs to be a separate 'gritty details' section.

> * The doc already says that regular expressions like "*foo" and "+foo"
> are problematic (they're confusing, and POSIX says the behavior is
> undefined) and should be avoided. REs like "[a-m-z]" and "[!-[:alpha:]]"
> and "[[:alpha:]-~]" are problematic in the same way and also should be
> avoided.

I'm with Stefan here; `-' should go last. Anything else is a gritty detail.

> * The doc doesn't clearly say when the Emacs range behavior is an
> extension to POSIX; saying this will help people know better when they
> can export Emacs regular expressions to other programs.

Documenting differences from POSIX regexps is useful. Do you prefer having those differences spread throughout the text, or concentrated in one section?

These days, a user may be more familiar with the various PCRE dialects than traditional or extended POSIX. Should that be taken into account?

> * The doc is confused (and there's a comment about this) about what
> happens when one end of a range is unibyte and the other is multibyte. I
> added something saying that if one bound is a raw 8-bit byte then the
> other should be a unibyte character (either ASCII, or a raw 8-bit byte).
> I don't see any good way to specify the behavior when one bound is a raw
> 8-bit byte and the other bound is a multibyte character, in such a way
> that it's a natural extension of the documented behavior, so the
> documentation now recommends against that.

The terminology is a bit confusing. Is 'raw 8-bit byte' included in 'unibyte'? Is \x7f ever a raw 8-bit byte?
I agree that [å-\xff], say, should be invalid but I've never seen such constructs.

> * We might as well go ahead and say that [b-a] matches nothing, as
> enough code (ab)uses regexps in that way, and there is value in having a
> simple regular expression that always fails to match. However, I expect
> that we should say that users should avoid wilder examples like [~-!] so
> that the trawler can catch them as typos.

It already does, and some bugs were found that way. As a special case, it no longer complains about z-a because that is unlikely to be an accident and occurs in some code on purpose.
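
The reversed-range check described above can be pictured with a small sketch. It is written in Python purely to compare code points; it is not xr's actual implementation, and the function name `is_suspect_reversed_range` and the single z-a exemption are illustrative assumptions based only on the description in this message.

```python
def is_suspect_reversed_range(lo: str, hi: str) -> bool:
    """Return True if the range lo-hi is reversed and worth a warning.

    Hypothetical helper: a range counts as reversed when its first endpoint
    has a higher code point than its last. The z-a exemption mirrors the
    special case mentioned in the text.
    """
    if (lo, hi) == ("z", "a"):
        return False  # assumed to be deliberate, so not reported
    return ord(lo) > ord(hi)


print(is_suspect_reversed_range("~", "!"))  # True
print(is_suspect_reversed_range("z", "a"))  # False
print(is_suspect_reversed_range("a", "z"))  # False
```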

I'm not sure it's a good idea to document reversed ranges as a recommended way to match any or no character (although the description of the semantics would belong in a 'gritty details' section), and only to use [Y-X] where Y=X+1. More about that in a separate post.

> These new recommendations ("should"s in the attached patch) will give
> the trawler license to diagnose questionable REs like "[a-m-z]",
> "[!-[:alpha:]]", "[~-!]", and (my favorite) "[\u00FF-\xFF]". There is no
> change to actual Emacs behavior.

As an experiment, I added detection of 'chained' ranges like [a-m-z] to xr and found a handful in both Emacs and GNU ELPA, but none of them carried a freeload of bugs. Keeping that check didn't seem worthwhile; the regexps may be a bit odd-looking, but aren't wrong.

[!-[:alpha:]] is already detected since xr parses it correctly and will complain about the duplication of ':'. The reverse, [[:digit:]-z], is seen occasionally but again does not seem to be a serious bug proxy.

Much as I would like to outlaw ranges where a typical programmer has to consult an ASCII table to understand what's included, they just seem too common, with too many false positives, to merit inclusion in xr.
Nevertheless I had a quick look and extracted a few that might merit attention; see attachment.
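
To see why such ranges need an ASCII table, here is a minimal sketch that expands the three 'unintuitive' ranges flagged in the attached log. It uses Python only for code point arithmetic and says nothing about any regexp engine; `expand_range` is a hypothetical helper name.

```python
def expand_range(lo: str, hi: str) -> str:
    """Return every character from lo to hi inclusive, ordered by code point."""
    return "".join(chr(c) for c in range(ord(lo), ord(hi) + 1))


# The ranges reported as unintuitive in the attached log.
for lo, hi in [("+", "<"), ("+", "?"), ("+", "/")]:
    print(f"{lo}-{hi}: {expand_range(lo, hi)}")

# +-<: +,-./0123456789:;<
# +-?: +,-./0123456789:;<=>?
# +-/: +,-./
```

Each of these ranges takes in the comma and the period, and the first two also cover all the digits, which is easy to miss when reading the bracket expression.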

Similarly, a rule finding [X-Y] where Y=X+1 found one or two questionable cases in a sea of false positives (also in the attachment).
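
The two-character rule amounts to a single comparison of code points. A minimal sketch, again in Python for the arithmetic only and not taken from xr (`is_two_char_range` is a hypothetical name):

```python
def is_two_char_range(lo: str, hi: str) -> bool:
    """Return True if the range lo-hi covers exactly two adjacent characters."""
    return ord(hi) == ord(lo) + 1


print(is_two_char_range(".", "/"))  # True: the range flagged in webjump.el
print(is_two_char_range("*", "+"))  # True: the range flagged in align.el
print(is_two_char_range("a", "z"))  # False
```

A range like this matches the same characters as listing both endpoints, so the hyphen may have been meant as a literal character.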

[-- Attachment #2: possibly-broken-regexps.log --]
[-- Type: application/octet-stream, Size: 992 bytes --]

/Users/mattias/emacs/lisp/vc/diff-mode.el:2215:19: In call to re-search-forward: Unintuitive range `+-<' (pos 3)
  "\n[!+-<>]\\(-- [0-9]+\\(,[0-9]+\\)? ----\n\\( .*\n\\)*[+]\\)?"
   ....^
/Users/mattias/emacs/lisp/speedbar.el:2852:42: In call to re-search-forward: Unintuitive range `+-?' (pos 21)
  "^\\([0-9]+\\):\\s-*[[<][+-?][]>] "
   ........................^
/Users/mattias/emacs/lisp/speedbar.el:2903:42: In call to re-search-forward: Unintuitive range `+-?' (pos 19)
  "^\\([0-9]+\\):\\s-*\\[[+-?]\\] "
   .......................^
/Users/mattias/emacs/lisp/woman.el:3514:26: In call to looking-at: Unintuitive range `+-/' (pos 1)
  "[+-/*%]"
   .^
/Users/mattias/emacs/lisp/net/webjump.el:345:39: In call to string-match: Two-character range `.-/' (pos 8)
  "[a-zA-Z_.-/]"
   ........^
/Users/mattias/emacs/lisp/align.el:386:3: In align-rules-list (perl-assignment): Two-character range `*-+' (pos 6)
  "[^=!^&*-+<>/| \t\n]\\(\\s-*\\)=[~>]?\\(\\s-*\\)\\([^>= \t\n]\\|$\\)"
   ......^
