On 3/19/19 7:20 PM, Stefan Monnier wrote:
> I wonder why the doc doesn't just say that `-` should be the last
> character and not mention the other possibilities which just make the
> rule unnecessarily complex.

'-' can also be the first character in a bracket expression; this is
pretty common and is standard. POSIX also says '-' can be the upper
bound of a range, which is a bit weird (but hey! it's standard).

I went through the documentation and attempted to describe this mess
better by installing the attached patch into the emacs-26 branch. The
basic ideas are:

* The doc already says that regular expressions like "*foo" and "+foo"
  are problematic (they're confusing, and POSIX says the behavior is
  undefined) and should be avoided. REs like "[a-m-z]" and
  "[!-[:alpha:]]" and "[[:alpha:]-~]" are problematic in the same way
  and also should be avoided.

* The doc doesn't clearly say when the Emacs range behavior is an
  extension to POSIX; saying this will help people know better when
  they can export Emacs regular expressions to other programs.

* The doc is confused (and there's a comment about this) about what
  happens when one end of a range is unibyte and the other is
  multibyte. I added something saying that if one bound is a raw 8-bit
  byte then the other should be a unibyte character (either ASCII, or
  a raw 8-bit byte). I don't see any good way to specify the behavior
  when one bound is a raw 8-bit byte and the other bound is a multibyte
  character, in such a way that it's a natural extension of the
  documented behavior, so the documentation now recommends against
  that.

* We might as well go ahead and say that [b-a] matches nothing, as
  enough code (ab)uses regexps in that way, and there is value in
  having a simple regular expression that always fails to match.
  However, I expect that we should say that users should avoid wilder
  examples like [~-!] so that the trawler can catch them as typos.

These new recommendations ("should"s in the attached patch) will give
the trawler license to diagnose questionable REs like "[a-m-z]",
"[!-[:alpha:]]", "[~-!]", and (my favorite) "[\u00FF-\xFF]". There is
no change to actual Emacs behavior.
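
To make the recommendations concrete, here is a rough sketch of the kind of check they would permit. It is not the trawler's code and not part of the patch; it is a small standalone Python illustration, and the names check_bracket, tokenize and ALLOWED_REVERSED are made up for this example. It looks only at the text between '[' and ']' and covers the stray '-', class-as-bound and reversed-range cases. The unibyte/multibyte case is left out, and treating b-a as the only acceptable reversed range is an assumption of the sketch.

```python
import re

# POSIX-style character class such as [:alpha:] inside a bracket expression.
CLASS_RE = re.compile(r"\[:[a-z]+:\]")

# Assumption: b-a is the only reversed range accepted as a deliberate
# "never matches" idiom; the real rule is whatever the documentation says.
ALLOWED_REVERSED = {("b", "a")}


def tokenize(body):
    """Split a bracket body into ('class', text) and ('char', c) tokens."""
    tokens, i = [], 0
    while i < len(body):
        m = CLASS_RE.match(body, i)
        if m:
            tokens.append(("class", m.group()))
            i = m.end()
        else:
            tokens.append(("char", body[i]))
            i += 1
    return tokens


def check_bracket(body):
    """Return warnings for the text between '[' and ']' (no leading '^')."""
    toks = tokenize(body)
    warnings = []
    i = 0
    while i < len(toks):
        # A range is three tokens: bound, '-', bound.
        is_range = i + 2 < len(toks) and toks[i + 1] == ("char", "-")
        if is_range:
            lo, hi = toks[i], toks[i + 2]
            if "class" in (lo[0], hi[0]):
                warnings.append("character class used as a range bound")
            elif lo[1] > hi[1] and (lo[1], hi[1]) not in ALLOWED_REVERSED:
                warnings.append("reversed range %s-%s" % (lo[1], hi[1]))
            i += 3
        else:
            # A literal '-' is unambiguous only when it is first or last.
            if toks[i] == ("char", "-") and 0 < i < len(toks) - 1:
                warnings.append("'-' is neither first, last, nor a range operator")
            i += 1
    return warnings
```

With this sketch, check_bracket("a-m-z"), check_bracket("!-[:alpha:]"), check_bracket("[:alpha:]-~") and check_bracket("~-!") each return one warning, while "a-z", "-a", "a-" and "b-a" return none.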