On 2 Apr 2019, at 09:33, Paul Eggert wrote:

>> don't we also need a precise description of exactly how they are
>> interpreted by the engine?
>
> In other parts of Emacs, we are typically OK with specs that don't
> completely specify behavior. This gives us more freedom to make
> changes in the undocumented behavior later. I think it makes sense to
> do that here too, for regular expressions like "[z-a-m]" that most
> readers would find confusing.

Then where does a user go to understand extant regexps? (Do we have any
latitude at all for changing even obscure corners of regexp syntax and
semantics today?) That's why I favour expounding on the details in a
separate section.

>> The terminology is a bit confusing. Is 'raw 8-bit byte' included in
>> 'unibyte'? Is \x7f ever a raw 8-bit byte?
>> I agree that [å-\xff], say, should be invalid but I've never seen
>> such constructs.
>
> After looking into it I realized that I don't really know the
> semantics here (the text I recently added there seems to be wrong, in
> some cases), and I have my doubts that anyone else knows the semantics
> either. The attached patch simply gets rid of that section, leaving
> the area undocumented. User beware!

Apparently I don't really know it either -- I just discovered that:

  (string-match "\xff" "\xff") => 0
  (string-match "[\xff]" "\xff") => 0
  (string-match "\xffé?" "\xff") => nil
  (string-match "[\xff]é?" "\xff") => 0
  (string-match "\xff" "\xffé") => 0
  (string-match "[\xff]" "\xffé") => nil
  (string-match "\xffé?" "\xffé") => 0
  (string-match "[\xff]é?" "\xffé") => nil

> OK, then we should document z-a as the preferred syntax (best go with
> the flow...). Done in the attached patch.

Actually, the only place where I saw z-a was in auctex (in negated form,
[^z-a]).

>> As an experiment, I added detection of 'chained' ranges like [a-m-z]
>> to xr and found a handful in both Emacs and GNU ELPA, but none of
>> them carried a freeload of bugs.
>> Keeping that check didn't seem worthwhile; the regexps may be a bit
>> odd-looking, but aren't wrong.
>
> It depends on what one means by "wrong". If one wants to use the
> ranges in both Emacs and grep they are "wrong", so it's reasonable for
> the manual to recommend against them.

Definitely agree that it should be discouraged. I've attached the ones
found by a modified relint/xr, in case you are interested.

> It might also help for the trawler to warn about [X-Z] where Z = X+2.
> [XYZ] is clearer and less error-prone than [X-Z].

I shoehorned that into the attached patch too. These seem to be rare; I
found exactly one occurrence (lisp/gnus/message.el:1291):

  "[ \t]\\|[][!\"#$%&'()*+,-./0-9;<=>?@A-Z\\^_`a-z{|}~]+:"

which uses the punny range ,-. (possibly by benign accident).
Similarly, singleton ranges, X-X, are non-existent save for --- which I
presume is an XEmacs workaround.

The latest xr version warns about 2-character ranges, except within
digits because [0-1] etc was found to be common and harmless.

diff --git a/doc/lispref/searching.texi b/doc/lispref/searching.texi
index 748ab586af..72ee9233a3 100644
--- a/doc/lispref/searching.texi
+++ b/doc/lispref/searching.texi
...
+A character alternative can include duplicates. For example,
+@samp{[XYa-yYb-zX]} is less clear than @samp{[XYa-z]}.

Certainly, but does this need to be mentioned? Overlapping ranges are
rarely written on purpose. Besides, duplication isn't confined to
ranges.

More useful, I think, would be to recommend ranges to stay within
natural sequences (letters, digits, etc) so that a reader needn't
consult a table to see what is included. Thus [0-9.:/] good, [.-:] bad,
even though they denote the same set.

+@item
+A @samp{-} also appear at the beginning of a character alternative, or

'appears'