* bug#36496: [PATCH] Describe the rx notation in the lisp manual
@ 2019-07-04 12:13 Mattias Engdegård
2019-07-04 14:59 ` Drew Adams
2019-07-04 16:28 ` Eli Zaretskii
0 siblings, 2 replies; 26+ messages in thread
From: Mattias Engdegård @ 2019-07-04 12:13 UTC (permalink / raw)
To: 36496
[-- Attachment #1: Type: text/plain, Size: 763 bytes --]
The rx notation is useful and complex enough to merit inclusion in the manual.
Right now, it's mainly described in the `rx' doc string, which is fairly well-written but quite long and a bit unstructured. Describing it in the manual permits a different pace and style of exposition, the inclusion of examples and related information, structured into separate sections with cross-references.
Proposed patch attached. It covers all rx features, functions, macros, including the pcase pattern, and a mention of the corresponding string regexp constructs.
The existing `rx' doc string can be left unchanged, or reduced to something more concise, perhaps without a description of the entire rx language but with a manual reference. Suggestions are welcome.
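For a concrete sense of the translation, here is a minimal sketch built from the C block comment example in the attached patch. It assumes an Emacs where `rx' is available via (require 'rx); the name `my-c-comment-regexp' is made up for illustration and is not part of the patch.

```elisp
(require 'rx)

;; Hypothetical name, for illustration only.
(defconst my-c-comment-regexp
  (rx "/*"                              ; Initial /*
      (zero-or-more
       (or (not (any "*"))              ; Either non-*,
           (seq "*" (not (any "/")))))  ; or * followed by non-/
      (one-or-more "*")                 ; At least one star,
      "/")                              ; and the final /
  "Regexp matching a C block comment, built with `rx'.")

;; The value is an ordinary regexp string, so it can be passed to any
;; function that expects one.  This returns the start index of the comment.
(string-match-p my-c-comment-regexp "int x; /* counter */")
```

Since the `rx' form here contains only constants, the macro expands to a string constant, so the conversion costs nothing at run time.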
[-- Attachment #2: 0001-Describe-the-rx-notation-in-the-elisp-manual.patch --]
[-- Type: application/octet-stream, Size: 22004 bytes --]
From 770ce5fad60ea6449881cc2578c365c2724eda56 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mattias=20Engdeg=C3=A5rd?= <mattiase@acm.org>
Date: Thu, 4 Jul 2019 13:01:52 +0200
Subject: [PATCH] Describe the rx notation in the elisp manual
* doc/lispref/searching.texi (Regular Expressions): New menu entry.
(Regexp Example): Add rx form of the example.
(Rx Notation, Rx Constructs, Rx Functions): New nodes.
* doc/lispref/control.texi (pcase Macro): Describe the rx pattern.
---
doc/lispref/control.texi | 21 ++
doc/lispref/searching.texi | 525 +++++++++++++++++++++++++++++++++++++
2 files changed, 546 insertions(+)
diff --git a/doc/lispref/control.texi b/doc/lispref/control.texi
index e308d68b75..f7361fed11 100644
--- a/doc/lispref/control.texi
+++ b/doc/lispref/control.texi
@@ -618,6 +618,27 @@ pcase Macro
to @var{body-forms} (thus avoiding an evaluation error on match),
if any of the sub-patterns let-binds a set of symbols,
they @emph{must} all bind the same set of symbols.
+
+@item (rx @var{rx-expr}@dots{})
+Matches strings against the regexp @var{rx-expr}@dots{}, using the
+@code{rx} regexp notation (@pxref{Rx Notation}), as if by
+@code{string-match}.
+
+In addition to the usual @code{rx} syntax, @var{rx-expr}@dots{} can
+contain the following constructs:
+
+@table @code
+@item (let @var{ref} @var{rx-expr}@dots{})
+Bind the name @var{ref} to a submatch that matches @var{rx-expr}@enddots{}.
+@var{ref} is bound in @var{body-forms} to the string of the submatch
+or nil, but can also be used in @code{backref}.
+
+@item (backref @var{ref})
+Like the standard @code{backref} construct, but @var{ref} can here
+also be a name introduced by a previous @code{(let @var{ref} @dots{})}
+construct.
+@end table
+
@end table
@anchor{pcase-example-0}
diff --git a/doc/lispref/searching.texi b/doc/lispref/searching.texi
index ef1cffc446..b3b4ed3638 100644
--- a/doc/lispref/searching.texi
+++ b/doc/lispref/searching.texi
@@ -254,6 +254,7 @@ Regular Expressions
@menu
* Syntax of Regexps:: Rules for writing regular expressions.
* Regexp Example:: Illustrates regular expression syntax.
+* Rx Notation:: An alternative, structured regexp notation.
* Regexp Functions:: Functions for operating on regular expressions.
@end menu
@@ -951,6 +952,530 @@ Regexp Example
beyond the minimum needed to end a sentence.
@end table
+In the @code{rx} notation (@pxref{Rx Notation}), the regexp could be written
+
+@example
+@group
+(rx (any ".?!") ; Punctuation ending sentence.
+ (zero-or-more (any "\"')]@}")) ; Closing quotes or brackets.
+ (or line-end
+ (seq " " line-end)
+ "\t"
+ " ") ; Two spaces.
+ (zero-or-more (any "\t\n "))) ; Optional extra whitespace.
+@end group
+@end example
+
+Since @code{rx} regexps are just S-expressions, they can be formatted
+and commented as such.
+
+@node Rx Notation
+@subsection The @code{rx} Structured Regexp Notation
+@cindex rx
+@cindex regexp syntax
+
+ As an alternative to the string-based syntax, Emacs provides the
+structured @code{rx} notation based on Lisp forms. This notation is
+usually easier to read, write and maintain than regexp strings, and
+can be indented and commented freely. It requires a conversion into
+string form since that is what regexp functions expect, but that
+conversion typically takes place during byte-compilation rather than
+when the Lisp code using the regexp is run.
+
+ Here is an @code{rx} regexp@footnote{It could be written much
+simpler with non-greedy operators (how?), but that would make the
+example less interesting.} that matches a block comment in the C
+programming language:
+
+@example
+@group
+(rx "/*" ; Initial /*
+ (zero-or-more
+ (or (not (any "*")) ; Either non-*,
+ (seq "*" ; or * followed by
+ (not (any "/"))))) ; non-/
+ (one-or-more "*") ; At least one star,
+ "/") ; and the final /
+@end group
+@end example
+
+or, using shorter synonyms and written more compactly,
+
+@example
+@group
+(rx "/*"
+ (* (| (not (any "*"))
+ (: "*" (not (any "/")))))
+ (+ "*") "/")
+@end group
+@end example
+
+In conventional string syntax, it would be written
+
+@example
+"/\\*\\(?:[^*]\\|\\*[^/]\\)*\\*+/"
+@end example
+
+The @code{rx} notation is mainly useful in Lisp code; it cannot be
+used in most interactive situations where a regexp is requested, such
+as when running @code{query-replace-regexp} or in variable
+customisation.
+
+@menu
+* Rx Constructs:: Constructs valid in rx forms.
+* Rx Functions:: Functions and macros that use rx forms.
+@end menu
+
+@node Rx Constructs
+@subsubsection Constructs in @code{rx} regexps
+
+The various forms in @code{rx} regexps are described below. The
+shorthand @var{rx} represents any @code{rx} form, and @var{rx}@dots{}
+means one or more @code{rx} forms. Where the corresponding string
+regexp syntax is given, @samp{A}, @samp{B}, @dots{} are string regexp
+subexpressions.
+@c With the new implementation of rx, this can be changed from
+@c 'one or more' to 'zero or more'.
+
+@subsubheading Literals
+
+@table @asis
+@item @code{"some-string"}
+Matches the string @samp{some-string} literally. There are no
+characters with special meaning, unlike in string regexps.
+
+@item @code{?C}
+Matches the character @samp{C} literally.
+@end table
+
+@subsubheading Fundamental structure
+
+@table @asis
+@item @code{(seq @var{rx}@dots{})}
+@cindex @samp{seq} in rx
+@itemx @code{(sequence @var{rx}@dots{})}
+@cindex @samp{sequence} in rx
+@itemx @code{(: @var{rx}@dots{})}
+@cindex @samp{:} in rx
+@itemx @code{(and @var{rx}@dots{})}
+@cindex @samp{and} in rx
+Match the @var{rx}s in sequence. Without arguments, the expression
+matches the empty string.@*
+Corresponding string regexp: @samp{AB@dots{}} (subexpressions in sequence).
+
+@item @code{(or @var{rx}@dots{})}
+@cindex @samp{or} in rx
+@itemx @code{(| @var{rx}@dots{})}
+@cindex @samp{|} in rx
+Match exactly one of the @var{rx}s, trying from left to right.
+Without arguments, the expression will not match anything at all.@*
+Corresponding string regexp: @samp{A\|B\|@dots{}}.
+@end table
+
+@subsubheading Repetition
+
+@table @code
+@item (zero-or-more @var{rx}@dots{})
+@cindex @samp{zero-or-more} in rx
+@itemx (0+ @var{rx}@dots{})
+@cindex @samp{0+} in rx
+@itemx (* @var{rx}@dots{})
+@cindex @samp{*} in rx
+Match the @var{rx}s zero or more times.@*
+Corresponding string regexp: @samp{A*}
+
+@item (one-or-more @var{rx}@dots{})
+@cindex @samp{one-or-more} in rx
+@itemx (1+ @var{rx}@dots{})
+@cindex @samp{1+} in rx
+@itemx (+ @var{rx}@dots{})
+@cindex @samp{+} in rx
+Match the @var{rx}s one or more times.@*
+Corresponding string regexp: @samp{A+}
+
+@item (zero-or-one @var{rx}@dots{})
+@cindex @samp{zero-or-one} in rx
+@itemx (optional @var{rx}@dots{})
+@cindex @samp{optional} in rx
+@itemx (opt @var{rx}@dots{})
+@cindex @samp{opt} in rx
+@itemx (? @var{rx}@dots{})
+@cindex @samp{?} in rx
+Match the @var{rx}s once or not at all.@*
+Corresponding string regexp: @samp{A?}
+
+@item (*? @var{rx}@dots{})
+@cindex @samp{*?} in rx
+Match the @var{rx}s zero or more times, non-greedily.@*
+Corresponding string regexp: @samp{A*?}
+
+@item (+? @var{rx}@dots{})
+@cindex @samp{+?} in rx
+Match the @var{rx}s one or more times, non-greedily.@*
+Corresponding string regexp: @samp{A+?}
+
+@item (?? @var{rx}@dots{})
+@cindex @samp{??} in rx
+Match the @var{rx}s once or not at all, non-greedily.@*
+Corresponding string regexp: @samp{A??}
+
+@item (= @var{n} @var{rx}@dots{})
+@cindex @samp{=} in rx
+@itemx (repeat @var{n} @var{rx})
+Match the @var{rx}s exactly @var{n} times.@*
+Corresponding string regexp: @samp{A\@{@var{n}\@}}
+
+@item (>= @var{n} @var{rx}@dots{})
+@cindex @samp{>=} in rx
+Match the @var{rx}s @var{n} or more times.@*
+Corresponding string regexp: @samp{A\@{@var{n},\@}}
+
+@item (** @var{n} @var{m} @var{rx}@dots{})
+@cindex @samp{**} in rx
+@itemx (repeat @var{n} @var{m} @var{rx}@dots{})
+@cindex @samp{repeat} in rx
+Match the @var{rx}s at least @var{n} but no more than @var{m} times.@*
+Corresponding string regexp: @samp{A\@{@var{n},@var{m}\@}}
+
+@item (minimal-match @var{rx})
+@cindex @samp{minimal-match} in rx
+Match @var{rx}, with @code{zero-or-more}, @code{one-or-more} and
+@code{zero-or-one} and their synonyms @emph{except} @code{*},
+@code{+} and @code{?} using non-greedy matching.
+
+@item (maximal-match @var{rx})
+@cindex @samp{maximal-match} in rx
+Match @var{rx}, with @code{zero-or-more}, @code{one-or-more} and
+@code{zero-or-one} and their synonyms using greedy matching.
+This is the default.
+@end table
+
+@subsubheading Matching single characters
+
+@table @asis
+@item @code{(any @var{charset}@dots{})}
+@cindex @samp{any} in rx
+@itemx @code{(char @var{charset}@dots{})}
+@cindex @samp{char} in rx
+@itemx @code{(in @var{charset}@dots{})}
+@cindex @samp{in} in rx
+Match a single character from one of the @var{charset}s.
+Each @var{charset} is a character, a string representing the set of
+its characters, a range or a character class. A range is either a
+hyphen-separated string like @code{"A-Z"}, or a cons of characters
+like @code{(?A . ?Z)}.
+
+Note that hyphen (@code{-}) is special in strings in this construct,
+since it acts as a range separator. To include a hyphen, add it as a
+separate character or single-character string.@*
+Corresponding string regexp: @samp{[@dots{}]}
+
+@item @code{(not @var{charspec})}
+@cindex @samp{not} in rx
+Match a character not included in @var{charspec}. @var{charspec} can
+be an @code{any}, @code{syntax} or @code{category} form, or a
+character class.@*
+Corresponding string regexp: @samp{[^@dots{}]}, @samp{\S@var{code}},
+@samp{\C@var{code}}
+
+@item @code{not-newline}, @code{nonl}
+@cindex @samp{not-newline} in rx
+@cindex @samp{nonl} in rx
+Match any character except a newline.@*
+Corresponding string regexp: @samp{.} (dot)
+
+@item @code{anything}
+@cindex @samp{anything} in rx
+Match any character.@*
+Corresponding string regexp: @samp{.\|\n} (for example)
+
+@item character class
+@cindex character class in rx
+Match a character from a named character class:
+
+@table @asis
+@item @code{alpha}, @code{alphabetic}, @code{letter}
+Match alphabetic characters. More precisely, match characters whose
+Unicode @samp{general-category} property indicates that they are
+alphabetic.
+
+@item @code{alnum}, @code{alphanumeric}
+Match alphabetic characters and digits. More precisely, match
+characters whose Unicode @samp{general-category} property indicates
+that they are alphabetic or decimal digits.
+
+@item @code{digit}, @code{numeric}, @code{num}
+Match the digits 0--9.
+
+@item @code{xdigit}, @code{hex-digit}, @code{hex}
+Match 0--9, A--F and a--f.
+
+@item @code{cntrl}, @code{control}
+Match any character whose code is in the range 0--31.
+
+@item @code{space}, @code{whitespace}, @code{white}
+Match any character that has whitespace syntax.
+
+@item @code{lower}, @code{lower-case}
+Match anything lower-case, as determined by the current case table.
+If @code{case-fold-search} is non-nil, this also matches any
+upper-case letter.
+
+@item @code{upper}, @code{upper-case}
+Match anything upper-case, as determined by the current case table.
+If @code{case-fold-search} is non-nil, this also matches any
+lower-case letter.
+
+@item @code{graph}, @code{graphic}
+Match any character except whitespace, ASCII and non-ASCII control
+characters, surrogates, and codepoints unassigned by Unicode, as
+indicated by the Unicode @samp{general-category} property.
+
+@item @code{print}, @code{printing}
+Match whitespace or a character matched by @code{graph}.
+
+@item @code{punct}, @code{punctuation}
+Match any punctuation character. (At present, for multibyte
+characters, anything that has non-word syntax.)
+
+@item @code{word}, @code{wordchar}
+Match any character that has word syntax (@pxref{Syntax Class Table}).
+@end table
+
+Corresponding string regexp: @samp{[[:@var{class}:]]}
+
+@item @code{(syntax @var{syntax})}
+@cindex @samp{syntax} in rx
+Match a character with syntax @var{syntax}, being one of the following
+names:
+
+@multitable {@code{close-parenthesis}} {Syntax character}
+@headitem Syntax name @tab Syntax character
+@item @code{whitespace} @tab @code{-}
+@item @code{punctuation} @tab @code{.}
+@item @code{word} @tab @code{w}
+@item @code{symbol} @tab @code{_}
+@item @code{open-parenthesis} @tab @code{(}
+@item @code{close-parenthesis} @tab @code{)}
+@item @code{expression-prefix} @tab @code{'}
+@item @code{string-quote} @tab @code{"}
+@item @code{paired-delimiter} @tab @code{$}
+@item @code{escape} @tab @code{\}
+@item @code{character-quote} @tab @code{/}
+@item @code{comment-start} @tab @code{<}
+@item @code{comment-end} @tab @code{>}
+@item @code{string-delimiter} @tab @code{|}
+@item @code{comment-delimiter} @tab @code{!}
+@end multitable
+
+@xref{Syntax Class Table} for details. Please note that
+@code{(syntax punctuation)} is @emph{not} equivalent to the character class
+@code{punctuation}.@*
+Corresponding string regexp: @samp{\s@var{code}}
+
+@item @code{(category @var{category})}
+@cindex @samp{category} in rx
+Match a character in category @var{category}, which is either one of
+the names below or its category character.
+
+@multitable {@code{vowel-modifying-diacritical-mark}} {Category character}
+@headitem Category name @tab Category character
+@item @code{space-for-indent} @tab space
+@item @code{base} @tab @code{.}
+@item @code{consonant} @tab @code{0}
+@item @code{base-vowel} @tab @code{1}
+@item @code{upper-diacritical-mark} @tab @code{2}
+@item @code{lower-diacritical-mark} @tab @code{3}
+@item @code{tone-mark} @tab @code{4}
+@item @code{symbol} @tab @code{5}
+@item @code{digit} @tab @code{6}
+@item @code{vowel-modifying-diacritical-mark} @tab @code{7}
+@item @code{vowel-sign} @tab @code{8}
+@item @code{semivowel-lower} @tab @code{9}
+@item @code{not-at-end-of-line} @tab @code{<}
+@item @code{not-at-beginning-of-line} @tab @code{>}
+@item @code{alpha-numeric-two-byte} @tab @code{A}
+@item @code{chinese-two-byte} @tab @code{C}
+@item @code{greek-two-byte} @tab @code{G}
+@item @code{japanese-hiragana-two-byte} @tab @code{H}
+@item @code{indian-two-byte} @tab @code{I}
+@item @code{japanese-katakana-two-byte} @tab @code{K}
+@item @code{strong-left-to-right} @tab @code{L}
+@item @code{korean-hangul-two-byte} @tab @code{N}
+@item @code{strong-right-to-left} @tab @code{R}
+@item @code{cyrillic-two-byte} @tab @code{Y}
+@item @code{combining-diacritic} @tab @code{^}
+@item @code{ascii} @tab @code{a}
+@item @code{arabic} @tab @code{b}
+@item @code{chinese} @tab @code{c}
+@item @code{ethiopic} @tab @code{e}
+@item @code{greek} @tab @code{g}
+@item @code{korean} @tab @code{h}
+@item @code{indian} @tab @code{i}
+@item @code{japanese} @tab @code{j}
+@item @code{japanese-katakana} @tab @code{k}
+@item @code{latin} @tab @code{l}
+@item @code{lao} @tab @code{o}
+@item @code{tibetan} @tab @code{q}
+@item @code{japanese-roman} @tab @code{r}
+@item @code{thai} @tab @code{t}
+@item @code{vietnamese} @tab @code{v}
+@item @code{hebrew} @tab @code{w}
+@item @code{cyrillic} @tab @code{y}
+@item @code{can-break} @tab @code{|}
+@end multitable
+
+For more information about currently defined categories, run the command
+@kbd{M-x describe-categories @key{RET}}. @xref{Categories} for how
+to define new categories.@*
+Corresponding string regexp: @samp{\c@var{code}}
+@end table
+
+@subsubheading Zero-width assertions
+
+These all match the empty string, but only in specific places.
+
+@table @asis
+@item @code{line-start}, @code{bol}
+@cindex @samp{line-start} in rx
+@cindex @samp{bol} in rx
+Match at the beginning of a line.@*
+Corresponding string regexp: @samp{^}
+
+@item @code{line-end}, @code{eol}
+@cindex @samp{line-end} in rx
+@cindex @samp{eol} in rx
+Match at the end of a line.@*
+Corresponding string regexp: @samp{$}
+
+@item @code{string-start}, @code{bos}, @code{buffer-start}, @code{bot}
+@cindex @samp{string-start} in rx
+@cindex @samp{bos} in rx
+@cindex @samp{buffer-start} in rx
+@cindex @samp{bot} in rx
+Match at the start of the string or buffer being matched against.@*
+Corresponding string regexp: @samp{\`}
+
+@item @code{string-end}, @code{eos}, @code{buffer-end}, @code{eot}
+@cindex @samp{string-end} in rx
+@cindex @samp{eos} in rx
+@cindex @samp{buffer-end} in rx
+@cindex @samp{eot} in rx
+Match at the end of the string or buffer being matched against.@*
+Corresponding string regexp: @samp{\'}
+
+@item @code{point}
+@cindex @samp{point} in rx
+Matches at point.@*
+Corresponding string regexp: @samp{\=}
+
+@item @code{word-start}
+@cindex @samp{word-start} in rx
+Matches at the beginning of a word.@*
+Corresponding string regexp: @samp{\<}
+
+@item @code{word-end}
+@cindex @samp{word-end} in rx
+Matches at the end of a word.@*
+Corresponding string regexp: @samp{\>}
+
+@item @code{word-boundary}
+@cindex @samp{word-boundary} in rx
+Matches at the beginning or end of a word.@*
+Corresponding string regexp: @samp{\b}
+
+@item @code{not-word-boundary}
+@cindex @samp{not-word-boundary} in rx
+Matches anywhere but at the beginning or end of a word.@*
+Corresponding string regexp: @samp{\B}
+
+@item @code{symbol-start}
+@cindex @samp{symbol-start} in rx
+Matches at the beginning of a symbol.@*
+Corresponding string regexp: @samp{\_<}
+
+@item @code{symbol-end}
+@cindex @samp{symbol-end} in rx
+Matches at the end of a symbol.@*
+Corresponding string regexp: @samp{\_>}
+@end table
+
+@subsubheading Capture groups
+
+@table @code
+@item (group @var{rx}@dots{})
+@cindex @samp{group} in rx
+@itemx (submatch @var{rx}@dots{})
+@cindex @samp{submatch} in rx
+Match the @var{rx}s, making the matched text and position accessible
+in the match data. The first group in a regexp is numbered 1;
+subsequent groups will be numbered one higher than the previous
+group.@*
+Corresponding string regexp: @samp{\(@dots{}\)}
+
+@item (group-n @var{n} @var{rx}@dots{})
+@cindex @samp{group-n} in rx
+@itemx (submatch-n @var{n} @var{rx}@dots{})
+@cindex @samp{submatch-n} in rx
+Like @code{group}, but explicitly assign the group number @var{n}.
+@var{n} must be positive.@*
+Corresponding string regexp: @samp{\(?@var{n}:@dots{}\)}
+
+@item (backref @var{n})
+@cindex @samp{backref} in rx
+Match the text previously matched by group number @var{n}.
+@var{n} must be positive and less than 10.@*
+Corresponding string regexp: @samp{\@var{n}}
+@end table
+
+@subsubheading Dynamic inclusion
+
+@table @code
+@item (literal @var{expr})
+@cindex @samp{literal} in rx
+Match the literal string that is the result from evaluating the Lisp
+expression @var{expr}. The evaluation takes place at call time, in
+the current lexical environment.
+
+@item (regexp @var{expr})
+@cindex @samp{regexp} in rx
+@itemx (regex @var{expr})
+@cindex @samp{regex} in rx
+Match the string regexp that is the result from evaluating the Lisp
+expression @var{expr}. The evaluation takes place at call time, in
+the current lexical environment.
+
+@item (eval @var{expr})
+@cindex @samp{eval} in rx
+Match the rx form that is the result from evaluating the Lisp
+expression @var{expr}. The evaluation takes place at macro-expansion
+time for @code{rx}, at call time for @code{rx-to-string},
+in the current global environment.
+@end table
+
+@node Rx Functions
+@subsubsection Functions and macros using @code{rx} regexps
+
+@defmac rx rx-expr@dots{}
+Translate the @var{rx-expr}s to a string regexp, as if they were the
+body of a @code{(seq @dots{})} form. The @code{rx} macro expands to a
+string constant, or, if @code{literal} or @code{regexp} forms are
+used, a Lisp expression that evaluates to a string.
+@end defmac
+
+@defun rx-to-string rx-expr &optional no-group
+Translate @var{rx-expr} to a string regexp which is returned.
+If @var{no-group} is absent or nil, bracket the result in a
+non-capturing group, @samp{\(?:@dots{}\)}, if necessary to ensure that
+a postfix operator appended to it will apply to the whole expression.
+
+Arguments to @code{literal} and @code{regexp} forms in @var{rx-expr}
+must be string literals.
+@end defun
+
+The @code{pcase} macro can use @code{rx} expressions as patterns
+directly; @pxref{pcase Macro}.
+
@node Regexp Functions
@subsection Regular Expression Functions
--
2.20.1 (Apple Git-117)
* bug#36496: [PATCH] Describe the rx notation in the lisp manual
2019-07-04 12:13 bug#36496: [PATCH] Describe the rx notation in the lisp manual Mattias Engdegård
@ 2019-07-04 14:59 ` Drew Adams
2019-07-04 16:28 ` Eli Zaretskii
1 sibling, 0 replies; 26+ messages in thread
From: Drew Adams @ 2019-07-04 14:59 UTC (permalink / raw)
To: Mattias Engdegård, 36496
> The rx notation is useful and complex enough to merit inclusion in the
> manual.
>
> Right now, it's mainly described in the `rx' doc string, which is fairly
> well-written but quite long and a bit unstructured. Describing it in the
> manual permits a different pace and style of exposition, the inclusion of
> examples and related information, structured into separate sections with
> cross-references.
Indeed. Good initiative!
Thanks for working on this.
Like `cl-loop' and Unix or GNU/Linux `find',
`rx' is practically a language unto itself.
But fortunately (unlike `cl-loop') it is
quite Lispy.
* bug#36496: [PATCH] Describe the rx notation in the lisp manual
2019-07-04 12:13 bug#36496: [PATCH] Describe the rx notation in the lisp manual Mattias Engdegård
2019-07-04 14:59 ` Drew Adams
@ 2019-07-04 16:28 ` Eli Zaretskii
2019-07-05 14:13 ` Mattias Engdegård
2019-07-06 0:10 ` Richard Stallman
1 sibling, 2 replies; 26+ messages in thread
From: Eli Zaretskii @ 2019-07-04 16:28 UTC (permalink / raw)
To: Mattias Engdegård; +Cc: 36496
> From: Mattias Engdegård <mattiase@acm.org>
> Date: Thu, 4 Jul 2019 14:13:26 +0200
>
> The rx notation is useful and complex enough to merit inclusion in the manual.
>
> Right now, it's mainly described in the `rx' doc string, which is fairly well-written but quite long and a bit unstructured. Describing it in the manual permits a different pace and style of exposition, the inclusion of examples and related information, structured into separate sections with cross-references.
>
> Proposed patch attached. It covers all rx features, functions, macros, including the pcase pattern, and a mention of the corresponding string regexp constructs.
This is a large section. The ELisp reference is already a large book,
printed in two separate volumes. So I think if we want to include
this section, it will have to be on a separate file that is
conditionally included @ifnottex.
Alternatively, we could make this a separate manual.
> The existing `rx' doc string can be left unchanged, or reduced to something more concise, perhaps without a description of the entire rx language but with a manual reference. Suggestions are welcome.
Yes, the doc string should be reduced to the summary of the
constructs.
> +@table @code
> +@item (let @var{ref} @var{rx-expr}@dots{})
> +Bind the name @var{ref} to a submatch that matches @var{rx-expr}@enddots{}.
^^^^^^^^^^^^^^^^^^^^^^^
"Bind the symbol @var{ref}", no?
> +@example
> +@group
> +(rx "/*" ; Initial /*
> + (zero-or-more
> + (or (not (any "*")) ; Either non-*,
> + (seq "*" ; or * followed by
> + (not (any "/"))))) ; non-/
> + (one-or-more "*") ; At least one star,
> + "/") ; and the final /
> +@end group
> +@end example
> +
> +or, using shorter synonyms and written more compactly,
This last line needs @noindent before it.
> +@table @asis
> +@item @code{"some-string"}
Why @code{"..."} and not @samp{...}? The latter will look better both
in print and in Info format.
> +Corresponding string regexp: @samp{AB@dots{}} (subexpressions in sequence).
^^^^^^^^^^^^^^^^
I think this should use @samp{@var{a}@var{b}@dots{}} instead. And
likewise for the other "corresponding string regexps". The reason is
that neither A nor B stand for themselves, literally, they are
meta-variables.
> +Match the @var{rx}s once or not at all.@*
"Match @var{rx} or an empty string" sounds better to me.
> +Match the @var{rx}s zero or more times, non-greedily.@*
I would add here a cross-reference to where greedy matching is
described.
> +@item @code{(any @var{charset}@dots{})}
Please don't call this "charset", as that term is already taken by a
very different creature in Emacs. I suggest "character set" instead.
> +Each @var{charset} is a character, a string representing the set of
> +its characters, a range or a character class. A range is either a
> +hyphen-separated string like @code{"A-Z"}, or a cons of characters
> +like @code{(?A . ?Z)}.
Again, a cross-reference to where "character class" is described would be
good here, as would a @cindex entry for "character class in rx".
> +@item @code{space}, @code{whitespace}, @code{white}
> +Match any character that has whitespace syntax.
Only ASCII or also non-ASCII? This should be spelled out.
> +@xref{Syntax Class Table} for details. Please note that
^
Comma missing there.
> +@kbd{M-x describe-categories @key{RET}}. @xref{Categories} for how
^
Likewise.
Thanks.
* bug#36496: [PATCH] Describe the rx notation in the lisp manual
2019-07-04 16:28 ` Eli Zaretskii
@ 2019-07-05 14:13 ` Mattias Engdegård
2019-07-06 9:08 ` Eli Zaretskii
2019-07-06 0:10 ` Richard Stallman
1 sibling, 1 reply; 26+ messages in thread
From: Mattias Engdegård @ 2019-07-05 14:13 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 36496
[-- Attachment #1: Type: text/plain, Size: 4773 bytes --]
On 4 July 2019 at 18:28, Eli Zaretskii <eliz@gnu.org> wrote:
>
> This is a large section. The ELisp reference is already a large book,
> printed in two separate volumes. So I think if we want to include
> this section, it will have to be on a separate file that is
> conditionally included @ifnottex.
>
> Alternatively, we could make this a separate manual.
It is about 7-8 pages in all. One page could be saved by combining the character class descriptions with the existing ones; they are basically the same. However, that would probably preclude separation into separate files or manuals.
The category names also take up about one page, but that information isn't available anywhere else, since those names are specific to rx. (It would be nice if the names were defined along with the categories, but that isn't the case at present.)
I would prefer @ifnottex to having a separate manual, since one of the points is to make rx feel like a part of elisp and a genuine, practical alternative to regexp strings rather than an add-on. For example, the "Complex Regexp Example" turned out to be a good place for an rx version.
The revised patch (attached) does not separate the contents, because I wanted to hear your opinion on the matter first.
>> The existing `rx' doc string can be left unchanged, or reduced to something more concise, perhaps without a description of the entire rx language but with a manual reference. Suggestions are welcome.
>
> Yes, the doc string should be reduced to the summary of the
> constructs.
Good, let's do that when the changes to the manual are done.
>> +Bind the name @var{ref} to a submatch that matches @var{rx-expr}@enddots{}.
> ^^^^^^^^^^^^^^^^^^^^^^^
> "Bind the symbol @var{ref}", no?
Yes, thank you.
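To illustrate the `let' construct discussed here, a small sketch of the `rx' pcase pattern follows. It assumes an Emacs that provides the pattern as described in the patch; the function name `my-parse-key-value' and the key=value input format are invented for the example.

```elisp
(require 'rx)

;; Hypothetical helper, for illustration only.
(defun my-parse-key-value (string)
  "Parse STRING of the form KEY=NUMBER into (KEY . NUMBER), or return nil."
  (pcase string
    ;; `key' and `val' are symbols bound to the submatch strings
    ;; inside the body of this clause.
    ((rx string-start
         (let key (one-or-more alpha))
         "="
         (let val (one-or-more digit))
         string-end)
     (cons key (string-to-number val)))
    (_ nil)))

(my-parse-key-value "width=80")   ; => ("width" . 80)
(my-parse-key-value "width=")     ; => nil
```

The submatches are bound as ordinary Lisp variables, which avoids the usual `match-string' bookkeeping after `string-match'.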
>> +or, using shorter synonyms and written more compactly,
>
> This last line needs @noindent before it.
Added, and in another place.
>> +@table @asis
>> +@item @code{"some-string"}
>
> Why @code{"..."} and not @samp{...}? The latter will look better both
> in print and in Info format.
I looked at the result in all formats (PDF, Info, HTML) and came to the opposite conclusion: @code{"..."} makes it clear that the item is a string literal. It's not a strongly held opinion, however.
>> +Corresponding string regexp: @samp{AB@dots{}} (subexpressions in sequence).
> ^^^^^^^^^^^^^^^^
> I think this should use @samp{@var{a}@var{b}@dots{}} instead. And
> likewise for the other "corresponding string regexps". The reason is
> that neither A nor B stand for themselves, literally, they are
> meta-variables.
Right; again I made experiments, and ended up with @samp{@var{A}@var{B}@dots{}}. The upper-case variables looked much better in print and HTML.
>> +Match the @var{rx}s once or not at all.@*
>
> "Match @var{rx} or an empty string" sounds better to me.
Much better, thank you. Changed in all places.
>> +Match the @var{rx}s zero or more times, non-greedily.@*
>
> I would add here a cross-reference to where greedy matching is
> described.
Done, with a separate subsubheading for the non-greedy constructs.
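As a quick illustration of the difference, the sketch below compares greedy and non-greedy repetition on the same input. The helper `my-first-match' is a made-up name used only for this example.

```elisp
(require 'rx)

;; Hypothetical helper, for illustration only.
(defun my-first-match (regexp string)
  "Return the text of the first match of REGEXP in STRING, or nil."
  (when (string-match regexp string)
    (match-string 0 string)))

;; Greedy: `*' takes as much as it can.
(my-first-match (rx "<" (* nonl) ">") "<a><b>")     ; => "<a><b>"

;; Non-greedy: `*?' takes as little as it can.
(my-first-match (rx "<" (*? nonl) ">") "<a><b>")    ; => "<a>"

;; `minimal-match' makes the long-named operators non-greedy.
(my-first-match (rx "<" (minimal-match (zero-or-more nonl)) ">")
                "<a><b>")                           ; => "<a>"
```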
>> +@item @code{(any @var{charset}@dots{})}
>
> Please don't call this "charset", as that term is already taken by a
> very different creature in Emacs. I suggest "character set" instead.
Yes, I ended up using "set" since it's shorter and even better in this case.
>> +Each @var{charset} is a character, a string representing the set of
>> +its characters, a range or a character class. A range is either a
>> +hyphen-separated string like @code{"A-Z"}, or a cons of characters
>> +like @code{(?A . ?Z)}.
>
> Again, a cross-reference to where "character class" is described would be
> good here, as would a @cindex entry for "character class in rx".
Done; the cross-reference is just a "see below" since it's very near.
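For reference, here is a small sketch of the argument forms that `any' accepts according to the quoted description: a range string, a cons of characters, and a hyphen given as its own string so that it is not read as a range separator. The name `my-ident-regexp' is hypothetical.

```elisp
(require 'rx)

;; Hypothetical pattern, for illustration only: lower-case letters,
;; digits and hyphens.
(defconst my-ident-regexp
  (rx string-start
      (one-or-more (any "a-z"       ; range written as a string
                        (?0 . ?9)   ; range written as a cons of characters
                        "-"))       ; literal hyphen, as a separate string
      string-end))

(string-match-p my-ident-regexp "foo-42")   ; => 0
(string-match-p my-ident-regexp "foo_42")   ; => nil (underscore not in the set)
```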
>> +@item @code{space}, @code{whitespace}, @code{white}
>> +Match any character that has whitespace syntax.
>
> Only ASCII or also non-ASCII? This should be spelled out.
It's a matter of the syntax table; I used the exact formulation of the existing char class description.
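A minimal sketch of that dependence follows. It assumes that the `space' class is translated to [[:space:]] and that matching consults the syntax table in effect at the time; the names `my-space-p' and `my-underscore-is-space-table' are invented for the example.

```elisp
(require 'rx)

;; Hypothetical helper, for illustration only.
(defun my-space-p (string table)
  "Return non-nil if STRING starts with a whitespace char under TABLE."
  (with-syntax-table table
    (string-match-p (rx string-start space) string)))

;; Hypothetical syntax table in which `_' has whitespace syntax.
(defvar my-underscore-is-space-table
  (let ((table (make-syntax-table)))
    (modify-syntax-entry ?_ " " table)
    table))

;; The same regexp gives different answers under different tables.
(my-space-p "_" (standard-syntax-table))
(my-space-p "_" my-underscore-is-space-table)
```

Whether a given non-ASCII character matches is decided the same way, by the syntax it has in the current table.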
>> +@xref{Syntax Class Table} for details. Please note that
> ^
> Comma missing there.
Ah, yes. Apparently, a comma is inserted automatically in the TeX version, so that we get the desired "See Section XIV, page 123, for details"; this is documented. In the info and html versions there is no page number, so a comma doesn't feel like proper English: "See Section XIV, for details" has a distinct German tone to my ears.
Explicit comma after @xref seems to be common in the Emacs manuals, so rather than to fight it out I castled the clauses.
[-- Attachment #2: 0001-Describe-the-rx-notation-in-the-elisp-manual-bug-364.patch --]
[-- Type: application/octet-stream, Size: 23535 bytes --]
From fde854686146a1642c958e2871c4b376b1fe09a1 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mattias=20Engdeg=C3=A5rd?= <mattiase@acm.org>
Date: Thu, 4 Jul 2019 13:01:52 +0200
Subject: [PATCH] Describe the rx notation in the elisp manual (bug#36496)
* doc/lispref/searching.texi (Regular Expressions): New menu entry.
(Regexp Example): Add rx form of the example.
(Rx Notation, Rx Constructs, Rx Functions): New nodes.
* doc/lispref/control.texi (pcase Macro): Describe the rx pattern.
---
doc/lispref/control.texi | 23 ++
doc/lispref/searching.texi | 559 +++++++++++++++++++++++++++++++++++++
2 files changed, 582 insertions(+)
diff --git a/doc/lispref/control.texi b/doc/lispref/control.texi
index e308d68b75..625964774d 100644
--- a/doc/lispref/control.texi
+++ b/doc/lispref/control.texi
@@ -618,6 +618,29 @@ pcase Macro
to @var{body-forms} (thus avoiding an evaluation error on match),
if any of the sub-patterns let-binds a set of symbols,
they @emph{must} all bind the same set of symbols.
+
+@anchor{rx in pcase}
+@item (rx @var{rx-expr}@dots{})
+Matches strings against the regexp @var{rx-expr}@dots{}, using the
+@code{rx} regexp notation (@pxref{Rx Notation}), as if by
+@code{string-match}.
+
+In addition to the usual @code{rx} syntax, @var{rx-expr}@dots{} can
+contain the following constructs:
+
+@table @code
+@item (let @var{ref} @var{rx-expr}@dots{})
+Bind the symbol @var{ref} to a submatch that matches
+@var{rx-expr}@enddots{}. @var{ref} is bound in @var{body-forms} to
+the string of the submatch or nil, but can also be used in
+@code{backref}.
+
+@item (backref @var{ref})
+Like the standard @code{backref} construct, but @var{ref} can here
+also be a name introduced by a previous @code{(let @var{ref} @dots{})}
+construct.
+@end table
+
@end table
@anchor{pcase-example-0}
diff --git a/doc/lispref/searching.texi b/doc/lispref/searching.texi
index ef1cffc446..40a9cb523b 100644
--- a/doc/lispref/searching.texi
+++ b/doc/lispref/searching.texi
@@ -254,6 +254,7 @@ Regular Expressions
@menu
* Syntax of Regexps:: Rules for writing regular expressions.
* Regexp Example:: Illustrates regular expression syntax.
+* Rx Notation:: An alternative, structured regexp notation.
* Regexp Functions:: Functions for operating on regular expressions.
@end menu
@@ -359,6 +360,7 @@ Regexp Special
preceding expression either once or not at all. For example,
@samp{ca?r} matches @samp{car} or @samp{cr}; nothing else.
+@anchor{Non-greedy repetition}
@item @samp{*?}, @samp{+?}, @samp{??}
@cindex non-greedy repetition characters in regexp
These are @dfn{non-greedy} variants of the operators @samp{*}, @samp{+}
@@ -951,6 +953,563 @@ Regexp Example
beyond the minimum needed to end a sentence.
@end table
+In the @code{rx} notation (@pxref{Rx Notation}), the regexp could be written
+
+@example
+@group
+(rx (any ".?!") ; Punctuation ending sentence.
+ (zero-or-more (any "\"')]@}")) ; Closing quotes or brackets.
+ (or line-end
+ (seq " " line-end)
+ "\t"
+ " ") ; Two spaces.
+ (zero-or-more (any "\t\n "))) ; Optional extra whitespace.
+@end group
+@end example
+
+Since @code{rx} regexps are just S-expressions, they can be formatted
+and commented as such.
+
+@node Rx Notation
+@subsection The @code{rx} Structured Regexp Notation
+@cindex rx
+@cindex regexp syntax
+
+ As an alternative to the string-based syntax, Emacs provides the
+structured @code{rx} notation based on Lisp S-expressions. This
+notation is usually easier to read, write and maintain than regexp
+strings, and can be indented and commented freely. It requires a
+conversion into string form since that is what regexp functions
+expect, but that conversion typically takes place during
+byte-compilation rather than when the Lisp code using the regexp is
+run.
+
+ Here is an @code{rx} regexp@footnote{It could be written much more
+simply with non-greedy operators (how?), but that would make the
+example less interesting.} that matches a block comment in the C
+programming language:
+
+@example
+@group
+(rx "/*" ; Initial /*
+ (zero-or-more
+ (or (not (any "*")) ; Either non-*,
+ (seq "*" ; or * followed by
+ (not (any "/"))))) ; non-/
+ (one-or-more "*") ; At least one star,
+ "/") ; and the final /
+@end group
+@end example
+
+@noindent
+or, using shorter synonyms and written more compactly,
+
+@example
+@group
+(rx "/*"
+ (* (| (not (any "*"))
+ (: "*" (not (any "/")))))
+ (+ "*") "/")
+@end group
+@end example
+
+@noindent
+In conventional string syntax, it would be written
+
+@example
+"/\\*\\(?:[^*]\\|\\*[^/]\\)*\\*+/"
+@end example
+
+The @code{rx} notation is mainly useful in Lisp code; it cannot be
+used in most interactive situations where a regexp is requested, such
+as when running @code{query-replace-regexp} or in variable
+customisation.
+
+@menu
+* Rx Constructs:: Constructs valid in rx forms.
+* Rx Functions:: Functions and macros that use rx forms.
+@end menu
+
+@node Rx Constructs
+@subsubsection Constructs in @code{rx} regexps
+
+The various forms in @code{rx} regexps are described below. The
+shorthand @var{rx} represents any @code{rx} form, and @var{rx}@dots{}
+means one or more @code{rx} forms. Where the corresponding string
+regexp syntax is given, @var{A}, @var{B}, @dots{} are string regexp
+subexpressions.
+@c With the new implementation of rx, this can be changed from
+@c 'one or more' to 'zero or more'.
+
+@subsubheading Literals
+
+@table @asis
+@item @code{"some-string"}
+Match the string @samp{some-string} literally. There are no
+characters with special meaning, unlike in string regexps.
+
+@item @code{?C}
+Match the character @samp{C} literally.
+@end table
+
+@subsubheading Fundamental structure
+
+@table @asis
+@item @code{(seq @var{rx}@dots{})}
+@cindex @code{seq} in rx
+@itemx @code{(sequence @var{rx}@dots{})}
+@cindex @code{sequence} in rx
+@itemx @code{(: @var{rx}@dots{})}
+@cindex @code{:} in rx
+@itemx @code{(and @var{rx}@dots{})}
+@cindex @code{and} in rx
+Match the @var{rx}s in sequence. Without arguments, the expression
+matches the empty string.@*
+Corresponding string regexp: @samp{@var{A}@var{B}@dots{}}
+(subexpressions in sequence).
+
+@item @code{(or @var{rx}@dots{})}
+@cindex @code{or} in rx
+@itemx @code{(| @var{rx}@dots{})}
+@cindex @code{|} in rx
+Match exactly one of the @var{rx}s, trying from left to right.
+Without arguments, the expression will not match anything at all.@*
+Corresponding string regexp: @samp{@var{A}\|@var{B}\|@dots{}}.
+@end table
+
+@subsubheading Repetition
+
+@table @code
+@item (zero-or-more @var{rx}@dots{})
+@cindex @code{zero-or-more} in rx
+@itemx (0+ @var{rx}@dots{})
+@cindex @code{0+} in rx
+@itemx (* @var{rx}@dots{})
+@cindex @code{*} in rx
+Match the @var{rx}s zero or more times.@*
+Corresponding string regexp: @samp{@var{A}*}
+
+@item (one-or-more @var{rx}@dots{})
+@cindex @code{one-or-more} in rx
+@itemx (1+ @var{rx}@dots{})
+@cindex @code{1+} in rx
+@itemx (+ @var{rx}@dots{})
+@cindex @code{+} in rx
+Match the @var{rx}s one or more times.@*
+Corresponding string regexp: @samp{@var{A}+}
+
+@item (zero-or-one @var{rx}@dots{})
+@cindex @code{zero-or-one} in rx
+@itemx (optional @var{rx}@dots{})
+@cindex @code{optional} in rx
+@itemx (opt @var{rx}@dots{})
+@cindex @code{opt} in rx
+@itemx (? @var{rx}@dots{})
+@cindex @code{?} in rx
+Match the @var{rx}s once or an empty string.@*
+Corresponding string regexp: @samp{@var{A}?}
+
+@item (= @var{n} @var{rx}@dots{})
+@cindex @code{=} in rx
+@itemx (repeat @var{n} @var{rx})
+Match the @var{rx}s exactly @var{n} times.@*
+Corresponding string regexp: @samp{@var{A}\@{@var{n}\@}}
+
+@item (>= @var{n} @var{rx}@dots{})
+@cindex @code{>=} in rx
+Match the @var{rx}s @var{n} or more times.@*
+Corresponding string regexp: @samp{@var{A}\@{@var{n},\@}}
+
+@item (** @var{n} @var{m} @var{rx}@dots{})
+@cindex @code{**} in rx
+@itemx (repeat @var{n} @var{m} @var{rx}@dots{})
+@cindex @code{repeat} in rx
+Match the @var{rx}s at least @var{n} but no more than @var{m} times.@*
+Corresponding string regexp: @samp{@var{A}\@{@var{n},@var{m}\@}}
+@end table
+
+@subsubheading Non-greedy repetition
+
+Normally, repetition forms are greedy, in that they attempt to match
+as many times as possible. The following three forms are non-greedy; they
+try to match as few times as possible (@pxref{Non-greedy repetition}).
+
+@table @code
+@item (*? @var{rx}@dots{})
+@cindex @code{*?} in rx
+Match the @var{rx}s zero or more times, non-greedily.@*
+Corresponding string regexp: @samp{@var{A}*?}
+
+@item (+? @var{rx}@dots{})
+@cindex @code{+?} in rx
+Match the @var{rx}s one or more times, non-greedily.@*
+Corresponding string regexp: @samp{@var{A}+?}
+
+@item (?? @var{rx}@dots{})
+@cindex @code{??} in rx
+Match the @var{rx}s or an empty string, non-greedily.@*
+Corresponding string regexp: @samp{@var{A}??}
+@end table
+
+The greediness of some repetition forms can be controlled using the
+following constructs. However, it is usually better to use the
+explicit non-greedy forms above instead.
+
+@table @code
+@item (minimal-match @var{rx})
+@cindex @code{minimal-match} in rx
+Match @var{rx}, with @code{zero-or-more}, @code{0+},
+@code{one-or-more}, @code{1+}, @code{zero-or-one}, @code{opt} and
+@code{optional} using non-greedy matching.
+
+@item (maximal-match @var{rx})
+@cindex @code{maximal-match} in rx
+Match @var{rx}, with @code{zero-or-more}, @code{0+},
+@code{one-or-more}, @code{1+}, @code{zero-or-one}, @code{opt} and
+@code{optional} using greedy matching. This is the default.
+@end table
+
+@subsubheading Matching single characters
+
+@table @asis
+@item @code{(any @var{set}@dots{})}
+@cindex @code{any} in rx
+@itemx @code{(char @var{set}@dots{})}
+@cindex @code{char} in rx
+@itemx @code{(in @var{set}@dots{})}
+@cindex @code{in} in rx
+@cindex character class in rx
+Match a single character from one of the @var{set}s. Each @var{set}
+is a character, a string representing the set of its characters, a
+range or a character class (see below). A range is either a
+hyphen-separated string like @code{"A-Z"}, or a cons of characters
+like @code{(?A . ?Z)}.
+
+Note that hyphen (@code{-}) is special in strings in this construct,
+since it acts as a range separator. To include a hyphen, add it as a
+separate character or single-character string.@*
+Corresponding string regexp: @samp{[@dots{}]}
+
+@item @code{(not @var{charspec})}
+@cindex @code{not} in rx
+Match a character not included in @var{charspec}. @var{charspec} can
+be an @code{any}, @code{syntax} or @code{category} form, or a
+character class.@*
+Corresponding string regexp: @samp{[^@dots{}]}, @samp{\S@var{code}},
+@samp{\C@var{code}}
+
+@item @code{not-newline}, @code{nonl}
+@cindex @code{not-newline} in rx
+@cindex @code{nonl} in rx
+Match any character except a newline.@*
+Corresponding string regexp: @samp{.} (dot)
+
+@item @code{anything}
+@cindex @code{anything} in rx
+Match any character.@*
+Corresponding string regexp: @samp{.\|\n} (for example)
+
+@item character class
+@cindex character class in rx
+Match a character from a named character class:
+
+@table @asis
+@item @code{alpha}, @code{alphabetic}, @code{letter}
+Match alphabetic characters. More precisely, match characters whose
+Unicode @samp{general-category} property indicates that they are
+alphabetic.
+
+@item @code{alnum}, @code{alphanumeric}
+Match alphabetic characters and digits. More precisely, match
+characters whose Unicode @samp{general-category} property indicates
+that they are alphabetic or decimal digits.
+
+@item @code{digit}, @code{numeric}, @code{num}
+Match the digits @samp{0}--@samp{9}.
+
+@item @code{xdigit}, @code{hex-digit}, @code{hex}
+Match the hexadecimal digits @samp{0}--@samp{9}, @samp{A}--@samp{F}
+and @samp{a}--@samp{f}.
+
+@item @code{cntrl}, @code{control}
+Match any character whose code is in the range 0--31.
+
+@item @code{blank}
+Match horizontal whitespace. More precisely, match characters whose
+Unicode @samp{general-category} property indicates that they are
+spacing separators.
+
+@item @code{space}, @code{whitespace}, @code{white}
+Match any character that has whitespace syntax
+(@pxref{Syntax Class Table}).
+
+@item @code{lower}, @code{lower-case}
+Match anything lower-case, as determined by the current case table.
+If @code{case-fold-search} is non-nil, this also matches any
+upper-case letter.
+
+@item @code{upper}, @code{upper-case}
+Match anything upper-case, as determined by the current case table.
+If @code{case-fold-search} is non-nil, this also matches any
+lower-case letter.
+
+@item @code{graph}, @code{graphic}
+Match any character except whitespace, @acronym{ASCII} and
+non-@acronym{ASCII} control characters, surrogates, and codepoints
+unassigned by Unicode, as indicated by the Unicode
+@samp{general-category} property.
+
+@item @code{print}, @code{printing}
+Match whitespace or a character matched by @code{graph}.
+
+@item @code{punct}, @code{punctuation}
+Match any punctuation character. (At present, for multibyte
+characters, anything that has non-word syntax.)
+
+@item @code{word}, @code{wordchar}
+Match any character that has word syntax (@pxref{Syntax Class Table}).
+
+@item @code{ascii}
+Match any @acronym{ASCII} character (codes 0--127).
+
+@item @code{nonascii}
+Match any non-@acronym{ASCII} character (but not raw bytes).
+@end table
+
+Corresponding string regexp: @samp{[[:@var{class}:]]}
+
+@item @code{(syntax @var{syntax})}
+@cindex @code{syntax} in rx
+Match a character with syntax @var{syntax}, being one of the following
+names:
+
+@multitable {@code{close-parenthesis}} {Syntax character}
+@headitem Syntax name @tab Syntax character
+@item @code{whitespace} @tab @code{-}
+@item @code{punctuation} @tab @code{.}
+@item @code{word} @tab @code{w}
+@item @code{symbol} @tab @code{_}
+@item @code{open-parenthesis} @tab @code{(}
+@item @code{close-parenthesis} @tab @code{)}
+@item @code{expression-prefix} @tab @code{'}
+@item @code{string-quote} @tab @code{"}
+@item @code{paired-delimiter} @tab @code{$}
+@item @code{escape} @tab @code{\}
+@item @code{character-quote} @tab @code{/}
+@item @code{comment-start} @tab @code{<}
+@item @code{comment-end} @tab @code{>}
+@item @code{string-delimiter} @tab @code{|}
+@item @code{comment-delimiter} @tab @code{!}
+@end multitable
+
+For details, @pxref{Syntax Class Table}. Please note that
+@code{(syntax punctuation)} is @emph{not} equivalent to the character class
+@code{punctuation}.@*
+Corresponding string regexp: @samp{\s@var{code}}
+
+@item @code{(category @var{category})}
+@cindex @code{category} in rx
+Match a character in category @var{category}, which is either one of
+the names below or its category character.
+
+@multitable {@code{vowel-modifying-diacritical-mark}} {Category character}
+@headitem Category name @tab Category character
+@item @code{space-for-indent} @tab space
+@item @code{base} @tab @code{.}
+@item @code{consonant} @tab @code{0}
+@item @code{base-vowel} @tab @code{1}
+@item @code{upper-diacritical-mark} @tab @code{2}
+@item @code{lower-diacritical-mark} @tab @code{3}
+@item @code{tone-mark} @tab @code{4}
+@item @code{symbol} @tab @code{5}
+@item @code{digit} @tab @code{6}
+@item @code{vowel-modifying-diacritical-mark} @tab @code{7}
+@item @code{vowel-sign} @tab @code{8}
+@item @code{semivowel-lower} @tab @code{9}
+@item @code{not-at-end-of-line} @tab @code{<}
+@item @code{not-at-beginning-of-line} @tab @code{>}
+@item @code{alpha-numeric-two-byte} @tab @code{A}
+@item @code{chinese-two-byte} @tab @code{C}
+@item @code{greek-two-byte} @tab @code{G}
+@item @code{japanese-hiragana-two-byte} @tab @code{H}
+@item @code{indian-two-byte} @tab @code{I}
+@item @code{japanese-katakana-two-byte} @tab @code{K}
+@item @code{strong-left-to-right} @tab @code{L}
+@item @code{korean-hangul-two-byte} @tab @code{N}
+@item @code{strong-right-to-left} @tab @code{R}
+@item @code{cyrillic-two-byte} @tab @code{Y}
+@item @code{combining-diacritic} @tab @code{^}
+@item @code{ascii} @tab @code{a}
+@item @code{arabic} @tab @code{b}
+@item @code{chinese} @tab @code{c}
+@item @code{ethiopic} @tab @code{e}
+@item @code{greek} @tab @code{g}
+@item @code{korean} @tab @code{h}
+@item @code{indian} @tab @code{i}
+@item @code{japanese} @tab @code{j}
+@item @code{japanese-katakana} @tab @code{k}
+@item @code{latin} @tab @code{l}
+@item @code{lao} @tab @code{o}
+@item @code{tibetan} @tab @code{q}
+@item @code{japanese-roman} @tab @code{r}
+@item @code{thai} @tab @code{t}
+@item @code{vietnamese} @tab @code{v}
+@item @code{hebrew} @tab @code{w}
+@item @code{cyrillic} @tab @code{y}
+@item @code{can-break} @tab @code{|}
+@end multitable
+
+For more information about currently defined categories, run the
+command @kbd{M-x describe-categories @key{RET}}. For how to define
+new categories, @pxref{Categories}.@*
+Corresponding string regexp: @samp{\c@var{code}}
+@end table
+
+@subsubheading Zero-width assertions
+
+These all match the empty string, but only in specific places.
+
+@table @asis
+@item @code{line-start}, @code{bol}
+@cindex @code{line-start} in rx
+@cindex @code{bol} in rx
+Match at the beginning of a line.@*
+Corresponding string regexp: @samp{^}
+
+@item @code{line-end}, @code{eol}
+@cindex @code{line-end} in rx
+@cindex @code{eol} in rx
+Match at the end of a line.@*
+Corresponding string regexp: @samp{$}
+
+@item @code{string-start}, @code{bos}, @code{buffer-start}, @code{bot}
+@cindex @code{string-start} in rx
+@cindex @code{bos} in rx
+@cindex @code{buffer-start} in rx
+@cindex @code{bot} in rx
+Match at the start of the string or buffer being matched against.@*
+Corresponding string regexp: @samp{\`}
+
+@item @code{string-end}, @code{eos}, @code{buffer-end}, @code{eot}
+@cindex @code{string-end} in rx
+@cindex @code{eos} in rx
+@cindex @code{buffer-end} in rx
+@cindex @code{eot} in rx
+Match at the end of the string or buffer being matched against.@*
+Corresponding string regexp: @samp{\'}
+
+@item @code{point}
+@cindex @code{point} in rx
+Match at point.@*
+Corresponding string regexp: @samp{\=}
+
+@item @code{word-start}
+@cindex @code{word-start} in rx
+Match at the beginning of a word.@*
+Corresponding string regexp: @samp{\<}
+
+@item @code{word-end}
+@cindex @code{word-end} in rx
+Match at the end of a word.@*
+Corresponding string regexp: @samp{\>}
+
+@item @code{word-boundary}
+@cindex @code{word-boundary} in rx
+Match at the beginning or end of a word.@*
+Corresponding string regexp: @samp{\b}
+
+@item @code{not-word-boundary}
+@cindex @code{not-word-boundary} in rx
+Match anywhere but at the beginning or end of a word.@*
+Corresponding string regexp: @samp{\B}
+
+@item @code{symbol-start}
+@cindex @code{symbol-start} in rx
+Match at the beginning of a symbol.@*
+Corresponding string regexp: @samp{\_<}
+
+@item @code{symbol-end}
+@cindex @code{symbol-end} in rx
+Match at the end of a symbol.@*
+Corresponding string regexp: @samp{\_>}
+@end table
+
+@subsubheading Capture groups
+
+@table @code
+@item (group @var{rx}@dots{})
+@cindex @code{group} in rx
+@itemx (submatch @var{rx}@dots{})
+@cindex @code{submatch} in rx
+Match the @var{rx}s, making the matched text and position accessible
+in the match data. The first group in a regexp is numbered 1;
+subsequent groups will be numbered one higher than the previous
+group.@*
+Corresponding string regexp: @samp{\(@dots{}\)}
+
+@item (group-n @var{n} @var{rx}@dots{})
+@cindex @code{group-n} in rx
+@itemx (submatch-n @var{n} @var{rx}@dots{})
+@cindex @code{submatch-n} in rx
+Like @code{group}, but explicitly assign the group number @var{n}.
+@var{n} must be positive.@*
+Corresponding string regexp: @samp{\(?@var{n}:@dots{}\)}
+
+@item (backref @var{n})
+@cindex @code{backref} in rx
+Match the text previously matched by group number @var{n}.
+@var{n} must be in the range 1--9.@*
+Corresponding string regexp: @samp{\@var{n}}
+@end table
+
+@subsubheading Dynamic inclusion
+
+@table @code
+@item (literal @var{expr})
+@cindex @code{literal} in rx
+Match the literal string that is the result from evaluating the Lisp
+expression @var{expr}. The evaluation takes place at call time, in
+the current lexical environment.
+
+@item (regexp @var{expr})
+@cindex @code{regexp} in rx
+@itemx (regex @var{expr})
+@cindex @code{regex} in rx
+Match the string regexp that is the result from evaluating the Lisp
+expression @var{expr}. The evaluation takes place at call time, in
+the current lexical environment.
+
+@item (eval @var{expr})
+@cindex @code{eval} in rx
+Match the rx form that is the result from evaluating the Lisp
+expression @var{expr}. The evaluation takes place at macro-expansion
+time for @code{rx}, at call time for @code{rx-to-string},
+in the current global environment.
+@end table
+
+@node Rx Functions
+@subsubsection Functions and macros using @code{rx} regexps
+
+@defmac rx rx-expr@dots{}
+Translate the @var{rx-expr}s to a string regexp, as if they were the
+body of a @code{(seq @dots{})} form. The @code{rx} macro expands to a
+string constant, or, if @code{literal} or @code{regexp} forms are
+used, a Lisp expression that evaluates to a string.
+@end defmac
+
+@defun rx-to-string rx-expr &optional no-group
+Translate @var{rx-expr} to a string regexp which is returned.
+If @var{no-group} is absent or nil, bracket the result in a
+non-capturing group, @samp{\(?:@dots{}\)}, if necessary to ensure that
+a postfix operator appended to it will apply to the whole expression.
+
+Arguments to @code{literal} and @code{regexp} forms in @var{rx-expr}
+must be string literals.
+@end defun
+
+The @code{pcase} macro can use @code{rx} expressions as patterns
+directly; @pxref{rx in pcase}.
+
@node Regexp Functions
@subsection Regular Expression Functions
--
2.20.1 (Apple Git-117)
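
For readers who want to try the constructs the patch documents, here is a small sketch. It is not part of the patch. It assumes an Emacs recent enough to provide the rx pattern for pcase, and the names my-c-comment-rx, my-first-c-comment and my-split-assignment are invented for the example. It uses the block-comment regexp from the new Rx Notation node and the (let REF ...) construct described for pcase:

```elisp
(require 'rx)

;; The C block-comment regexp from the manual text, in long-form names.
(defconst my-c-comment-rx
  (rx "/*"
      (zero-or-more
       (or (not (any "*"))            ; either a non-*,
           (seq "*" (not (any "/"))))) ; or a * followed by a non-/
      (one-or-more "*")
      "/"))

(defun my-first-c-comment (string)
  "Return the first C block comment found in STRING, or nil."
  (when (string-match my-c-comment-rx string)
    (match-string 0 string)))

(defun my-split-assignment (string)
  "Split a string like \"NAME=DIGITS\" into a list (NAME DIGITS), or nil."
  (pcase string
    ((rx (let name (one-or-more alpha))
         "="
         (let value (one-or-more digit)))
     (list name value))))

(my-first-c-comment "x = 1; /* set x **/ y = 2;")
(my-split-assignment "width=80")
```

The first call should return the comment text including its delimiters, and the second a list of the two submatches. Since rx expands to a string constant here, my-c-comment-rx can be passed to any function that expects a string regexp.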
* bug#36496: [PATCH] Describe the rx notation in the lisp manual
2019-07-04 16:28 ` Eli Zaretskii
2019-07-05 14:13 ` Mattias Engdegård
@ 2019-07-06 0:10 ` Richard Stallman
2019-07-06 6:47 ` Eli Zaretskii
1 sibling, 1 reply; 26+ messages in thread
From: Richard Stallman @ 2019-07-06 0:10 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: mattiase, 36496
[[[ To any NSA and FBI agents reading my email: please consider ]]]
[[[ whether defending the US Constitution against all enemies, ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]
The ideal goal of rx is to make that the way most people write regexps
for Emacs. That would be an improvement because rx syntax is more
understandable. If this happens, we will need to have rx syntax in the
Emacs Lisp Manual.
In the past, various practical factors have made rx somewhat inconvenient,
and that prevented rx from competing with the regexp syntax.
Recently we have made some improvements in rx; are they enough to
make rx a real competitor for regexps?
--
Dr Richard Stallman
President, Free Software Foundation (https://gnu.org, https://fsf.org)
Internet Hall-of-Famer (https://internethalloffame.org)
* bug#36496: [PATCH] Describe the rx notation in the lisp manual
2019-07-06 0:10 ` Richard Stallman
@ 2019-07-06 6:47 ` Eli Zaretskii
2019-07-06 23:59 ` Richard Stallman
0 siblings, 1 reply; 26+ messages in thread
From: Eli Zaretskii @ 2019-07-06 6:47 UTC (permalink / raw)
To: rms; +Cc: mattiase, 36496
> From: Richard Stallman <rms@gnu.org>
> Cc: mattiase@acm.org, 36496@debbugs.gnu.org
> Date: Fri, 05 Jul 2019 20:10:26 -0400
>
> In the past, various practical factors have made rx somewhat inconvenient,
> and that prevented rx from competing with the regexp syntax.
> Recently we have made some improvements in rx; are they enough to
> make rx a real competitor for regexps?
I cannot answer the question without knowing which practical factors
made rx inconvenient in the past. Where can one find this
information?
* bug#36496: [PATCH] Describe the rx notation in the lisp manual
2019-07-05 14:13 ` Mattias Engdegård
@ 2019-07-06 9:08 ` Eli Zaretskii
2019-07-06 11:33 ` Mattias Engdegård
0 siblings, 1 reply; 26+ messages in thread
From: Eli Zaretskii @ 2019-07-06 9:08 UTC (permalink / raw)
To: Mattias Engdegård; +Cc: 36496
> From: Mattias Engdegård <mattiase@acm.org>
> Date: Fri, 5 Jul 2019 16:13:52 +0200
> Cc: 36496@debbugs.gnu.org
>
> > This is a large section. The ELisp reference is already a large book,
> > printed in two separate volumes. So I think if we want to include
> > this section, it will have to be on a separate file that is
> > conditionally included @ifnottex.
> >
> > Alternatively, we could make this a separate manual.
>
> It is about 7-8 pages in all.
It's more than 2500 lines. We have in doc/misc/ separate manuals much
smaller than this. So making a separate manual out of this is not
radically different from what we have already.
> One page could be saved by combining the character class descriptions with the existing ones; they are basically the same. However, that would probably preclude separation into separate files or manuals.
>
> The category names also take up about one page, but that information isn't available anywhere else, since those names are specific to rx. (It would be nice if the names were defined along with the categories, but that isn't the case at present.)
I don't think we should go out of our way to make this text shorter.
It is well written, and doesn't waste words, so any attempt to make it
shorter will IMO make it less useful.
> I would prefer @ifnottex to having a separate manual
Either alternative is fine with me.
> The revised patch (attached) does not separate the contents, because I wanted to hear your opinion on the matter first.
Opinion on which matter? On whether or not to make it a separate manual?
If so, you now have my opinion.
> >> +@xref{Syntax Class Table} for details. Please note that
> > ^
> > Comma missing there.
>
> Ah, yes. Apparently, a comma is inserted automatically in the TeX version, so that we get the desired "See Section XIV, page 123, for details"; this is documented. In the info and html versions there is no page number, so a comma doesn't feel like proper English: "See Section XIV, for details" has a distinct German tone to my ears.
> Explicit comma after @xref seems to be common in the Emacs manuals, so rather than to fight it out I castled the clauses.
The comma is common because older versions of makeinfo insisted on
having it, and would complain if there weren't one. The latest
versions no longer complain, but we would still like to support the
old versions, as they are ~15 times faster, so some people still keep
them around.
Thanks.
* bug#36496: [PATCH] Describe the rx notation in the lisp manual
2019-07-06 9:08 ` Eli Zaretskii
@ 2019-07-06 11:33 ` Mattias Engdegård
2019-07-06 11:41 ` Eli Zaretskii
` (2 more replies)
0 siblings, 3 replies; 26+ messages in thread
From: Mattias Engdegård @ 2019-07-06 11:33 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 36496
[-- Attachment #1: Type: text/plain, Size: 1303 bytes --]
6 juli 2019 kl. 11.08 skrev Eli Zaretskii <eliz@gnu.org>:
>
>> It is about 7-8 pages in all.
>
> It's more than 2500 lines. We have in doc/misc/ separate manuals much
> smaller than this. So making a separate manual out of this is not
> radically different from what we have already.
It was a visual count of printed pages in the pdf; a lot of lines in the source are mark-up.
In any case, the attached patch has @ifnottex added to it. I didn't move the text to a separate file, since there was no existing "lispref-extras" document to put them in. In addition, some of the additions were to existing sections (pcase, and the complex regexp example).
> Opinion on which matter? On whether or not to make it a separate manual?
> If so, you now have my opinion.
Thanks, that's what I meant.
> The comma is common because older versions of makeinfo insisted on
> having it, and would complain if there weren't one. The latest
> versions no longer complain, but we would still like to support the
> old versions, as they are ~15 times faster, so some people still keep
> them around.
Thank you very much for clearing that up; I always wondered.
Also attached is a patch for replacing the rx doc string with a condensed summary. I basically copied the one I wrote for ry.
[-- Attachment #2: 0001-Describe-the-rx-notation-in-the-elisp-manual-bug-364.patch --]
[-- Type: application/octet-stream, Size: 23647 bytes --]
From d5c54a21127a92e7dec1c03c262de551f3a76e27 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mattias=20Engdeg=C3=A5rd?= <mattiase@acm.org>
Date: Thu, 4 Jul 2019 13:01:52 +0200
Subject: [PATCH] Describe the rx notation in the elisp manual (bug#36496)
* doc/lispref/searching.texi (Regular Expressions): New menu entry.
(Regexp Example): Add rx form of the example.
(Rx Notation, Rx Constructs, Rx Functions): New nodes.
* doc/lispref/control.texi (pcase Macro): Describe the rx pattern.
---
doc/lispref/control.texi | 25 ++
doc/lispref/searching.texi | 565 +++++++++++++++++++++++++++++++++++++
2 files changed, 590 insertions(+)
diff --git a/doc/lispref/control.texi b/doc/lispref/control.texi
index e308d68b75..de6cd9301f 100644
--- a/doc/lispref/control.texi
+++ b/doc/lispref/control.texi
@@ -618,6 +618,31 @@ pcase Macro
to @var{body-forms} (thus avoiding an evaluation error on match),
if any of the sub-patterns let-binds a set of symbols,
they @emph{must} all bind the same set of symbols.
+
+@ifnottex
+@anchor{rx in pcase}
+@item (rx @var{rx-expr}@dots{})
+Matches strings against the regexp @var{rx-expr}@dots{}, using the
+@code{rx} regexp notation (@pxref{Rx Notation}), as if by
+@code{string-match}.
+
+In addition to the usual @code{rx} syntax, @var{rx-expr}@dots{} can
+contain the following constructs:
+
+@table @code
+@item (let @var{ref} @var{rx-expr}@dots{})
+Bind the symbol @var{ref} to a submatch that matches
+@var{rx-expr}@enddots{}. @var{ref} is bound in @var{body-forms} to
+the string of the submatch or nil, but can also be used in
+@code{backref}.
+
+@item (backref @var{ref})
+Like the standard @code{backref} construct, but @var{ref} can here
+also be a name introduced by a previous @code{(let @var{ref} @dots{})}
+construct.
+@end table
+@end ifnottex
+
@end table
@anchor{pcase-example-0}
diff --git a/doc/lispref/searching.texi b/doc/lispref/searching.texi
index ef1cffc446..17c4790f5e 100644
--- a/doc/lispref/searching.texi
+++ b/doc/lispref/searching.texi
@@ -254,6 +254,9 @@ Regular Expressions
@menu
* Syntax of Regexps:: Rules for writing regular expressions.
* Regexp Example:: Illustrates regular expression syntax.
+@ifnottex
+* Rx Notation:: An alternative, structured regexp notation.
+@end ifnottex
* Regexp Functions:: Functions for operating on regular expressions.
@end menu
@@ -359,6 +362,7 @@ Regexp Special
preceding expression either once or not at all. For example,
@samp{ca?r} matches @samp{car} or @samp{cr}; nothing else.
+@anchor{Non-greedy repetition}
@item @samp{*?}, @samp{+?}, @samp{??}
@cindex non-greedy repetition characters in regexp
These are @dfn{non-greedy} variants of the operators @samp{*}, @samp{+}
@@ -951,6 +955,567 @@ Regexp Example
beyond the minimum needed to end a sentence.
@end table
+@ifnottex
+In the @code{rx} notation (@pxref{Rx Notation}), the regexp could be written
+
+@example
+@group
+(rx (any ".?!") ; Punctuation ending sentence.
+ (zero-or-more (any "\"')]@}")) ; Closing quotes or brackets.
+ (or line-end
+ (seq " " line-end)
+ "\t"
+ " ") ; Two spaces.
+ (zero-or-more (any "\t\n "))) ; Optional extra whitespace.
+@end group
+@end example
+
+Since @code{rx} regexps are just S-expressions, they can be formatted
+and commented as such.
+@end ifnottex
+
+@ifnottex
+@node Rx Notation
+@subsection The @code{rx} Structured Regexp Notation
+@cindex rx
+@cindex regexp syntax
+
+ As an alternative to the string-based syntax, Emacs provides the
+structured @code{rx} notation based on Lisp S-expressions. This
+notation is usually easier to read, write and maintain than regexp
+strings, and can be indented and commented freely. It requires a
+conversion into string form since that is what regexp functions
+expect, but that conversion typically takes place during
+byte-compilation rather than when the Lisp code using the regexp is
+run.
+
+ Here is an @code{rx} regexp@footnote{It could be written much more
+simply with non-greedy operators (how?), but that would make the
+example less interesting.} that matches a block comment in the C
+programming language:
+
+@example
+@group
+(rx "/*" ; Initial /*
+ (zero-or-more
+ (or (not (any "*")) ; Either non-*,
+ (seq "*" ; or * followed by
+ (not (any "/"))))) ; non-/
+ (one-or-more "*") ; At least one star,
+ "/") ; and the final /
+@end group
+@end example
+
+@noindent
+or, using shorter synonyms and written more compactly,
+
+@example
+@group
+(rx "/*"
+ (* (| (not (any "*"))
+ (: "*" (not (any "/")))))
+ (+ "*") "/")
+@end group
+@end example
+
+@noindent
+In conventional string syntax, it would be written
+
+@example
+"/\\*\\(?:[^*]\\|\\*[^/]\\)*\\*+/"
+@end example
+
+The @code{rx} notation is mainly useful in Lisp code; it cannot be
+used in most interactive situations where a regexp is requested, such
+as when running @code{query-replace-regexp} or in variable
+customization.
+
+@menu
+* Rx Constructs:: Constructs valid in rx forms.
+* Rx Functions:: Functions and macros that use rx forms.
+@end menu
+
+@node Rx Constructs
+@subsubsection Constructs in @code{rx} regexps
+
+The various forms in @code{rx} regexps are described below. The
+shorthand @var{rx} represents any @code{rx} form, and @var{rx}@dots{}
+means one or more @code{rx} forms. Where the corresponding string
+regexp syntax is given, @var{A}, @var{B}, @dots{} are string regexp
+subexpressions.
+@c With the new implementation of rx, this can be changed from
+@c 'one or more' to 'zero or more'.
+
+@subsubheading Literals
+
+@table @asis
+@item @code{"some-string"}
+Match the string @samp{some-string} literally. There are no
+characters with special meaning, unlike in string regexps.
+
+@item @code{?C}
+Match the character @samp{C} literally.
+@end table
+
+@subsubheading Fundamental structure
+
+@table @asis
+@item @code{(seq @var{rx}@dots{})}
+@cindex @code{seq} in rx
+@itemx @code{(sequence @var{rx}@dots{})}
+@cindex @code{sequence} in rx
+@itemx @code{(: @var{rx}@dots{})}
+@cindex @code{:} in rx
+@itemx @code{(and @var{rx}@dots{})}
+@cindex @code{and} in rx
+Match the @var{rx}s in sequence. Without arguments, the expression
+matches the empty string.@*
+Corresponding string regexp: @samp{@var{A}@var{B}@dots{}}
+(subexpressions in sequence).
+
+@item @code{(or @var{rx}@dots{})}
+@cindex @code{or} in rx
+@itemx @code{(| @var{rx}@dots{})}
+@cindex @code{|} in rx
+Match exactly one of the @var{rx}s, trying from left to right.
+Without arguments, the expression will not match anything at all.@*
+Corresponding string regexp: @samp{@var{A}\|@var{B}\|@dots{}}.
+@end table
+
+@subsubheading Repetition
+
+@table @code
+@item (zero-or-more @var{rx}@dots{})
+@cindex @code{zero-or-more} in rx
+@itemx (0+ @var{rx}@dots{})
+@cindex @code{0+} in rx
+@itemx (* @var{rx}@dots{})
+@cindex @code{*} in rx
+Match the @var{rx}s zero or more times.@*
+Corresponding string regexp: @samp{@var{A}*}
+
+@item (one-or-more @var{rx}@dots{})
+@cindex @code{one-or-more} in rx
+@itemx (1+ @var{rx}@dots{})
+@cindex @code{1+} in rx
+@itemx (+ @var{rx}@dots{})
+@cindex @code{+} in rx
+Match the @var{rx}s one or more times.@*
+Corresponding string regexp: @samp{@var{A}+}
+
+@item (zero-or-one @var{rx}@dots{})
+@cindex @code{zero-or-one} in rx
+@itemx (optional @var{rx}@dots{})
+@cindex @code{optional} in rx
+@itemx (opt @var{rx}@dots{})
+@cindex @code{opt} in rx
+@itemx (? @var{rx}@dots{})
+@cindex @code{?} in rx
+Match the @var{rx}s once or an empty string.@*
+Corresponding string regexp: @samp{@var{A}?}
+
+@item (= @var{n} @var{rx}@dots{})
+@cindex @code{=} in rx
+@itemx (repeat @var{n} @var{rx})
+Match the @var{rx}s exactly @var{n} times.@*
+Corresponding string regexp: @samp{@var{A}\@{@var{n}\@}}
+
+@item (>= @var{n} @var{rx}@dots{})
+@cindex @code{>=} in rx
+Match the @var{rx}s @var{n} or more times.@*
+Corresponding string regexp: @samp{@var{A}\@{@var{n},\@}}
+
+@item (** @var{n} @var{m} @var{rx}@dots{})
+@cindex @code{**} in rx
+@itemx (repeat @var{n} @var{m} @var{rx}@dots{})
+@cindex @code{repeat} in rx
+Match the @var{rx}s at least @var{n} but no more than @var{m} times.@*
+Corresponding string regexp: @samp{@var{A}\@{@var{n},@var{m}\@}}
+@end table
+
+@subsubheading Non-greedy repetition
+
+Normally, repetition forms are greedy, in that they attempt to match
+as many times as possible. The following three forms are non-greedy; they
+try to match as few times as possible (@pxref{Non-greedy repetition}).
+
+@table @code
+@item (*? @var{rx}@dots{})
+@cindex @code{*?} in rx
+Match the @var{rx}s zero or more times, non-greedily.@*
+Corresponding string regexp: @samp{@var{A}*?}
+
+@item (+? @var{rx}@dots{})
+@cindex @code{+?} in rx
+Match the @var{rx}s one or more times, non-greedily.@*
+Corresponding string regexp: @samp{@var{A}+?}
+
+@item (?? @var{rx}@dots{})
+@cindex @code{??} in rx
+Match the @var{rx}s or an empty string, non-greedily.@*
+Corresponding string regexp: @samp{@var{A}??}
+@end table
+
+The greediness of some repetition forms can be controlled using the
+following constructs. However, it is usually better to use the
+explicit non-greedy forms above instead.
+
+@table @code
+@item (minimal-match @var{rx})
+@cindex @code{minimal-match} in rx
+Match @var{rx}, with @code{zero-or-more}, @code{0+},
+@code{one-or-more}, @code{1+}, @code{zero-or-one}, @code{opt} and
+@code{optional} using non-greedy matching.
+
+@item (maximal-match @var{rx})
+@cindex @code{maximal-match} in rx
+Match @var{rx}, with @code{zero-or-more}, @code{0+},
+@code{one-or-more}, @code{1+}, @code{zero-or-one}, @code{opt} and
+@code{optional} using greedy matching. This is the default.
+@end table
+
+@subsubheading Matching single characters
+
+@table @asis
+@item @code{(any @var{set}@dots{})}
+@cindex @code{any} in rx
+@itemx @code{(char @var{set}@dots{})}
+@cindex @code{char} in rx
+@itemx @code{(in @var{set}@dots{})}
+@cindex @code{in} in rx
+@cindex character class in rx
+Match a single character from one of the @var{set}s. Each @var{set}
+is a character, a string representing the set of its characters, a
+range or a character class (see below). A range is either a
+hyphen-separated string like @code{"A-Z"}, or a cons of characters
+like @code{(?A . ?Z)}.
+
+Note that hyphen (@code{-}) is special in strings in this construct,
+since it acts as a range separator. To include a hyphen, add it as a
+separate character or single-character string.@*
+Corresponding string regexp: @samp{[@dots{}]}
+
+@item @code{(not @var{charspec})}
+@cindex @code{not} in rx
+Match a character not included in @var{charspec}. @var{charspec} can
+be an @code{any}, @code{syntax} or @code{category} form, or a
+character class.@*
+Corresponding string regexp: @samp{[^@dots{}]}, @samp{\S@var{code}},
+@samp{\C@var{code}}
+
+@item @code{not-newline}, @code{nonl}
+@cindex @code{not-newline} in rx
+@cindex @code{nonl} in rx
+Match any character except a newline.@*
+Corresponding string regexp: @samp{.} (dot)
+
+@item @code{anything}
+@cindex @code{anything} in rx
+Match any character.@*
+Corresponding string regexp: @samp{.\|\n} (for example)
+
+@item character class
+@cindex character class in rx
+Match a character from a named character class:
+
+@table @asis
+@item @code{alpha}, @code{alphabetic}, @code{letter}
+Match alphabetic characters. More precisely, match characters whose
+Unicode @samp{general-category} property indicates that they are
+alphabetic.
+
+@item @code{alnum}, @code{alphanumeric}
+Match alphabetic characters and digits. More precisely, match
+characters whose Unicode @samp{general-category} property indicates
+that they are alphabetic or decimal digits.
+
+@item @code{digit}, @code{numeric}, @code{num}
+Match the digits @samp{0}--@samp{9}.
+
+@item @code{xdigit}, @code{hex-digit}, @code{hex}
+Match the hexadecimal digits @samp{0}--@samp{9}, @samp{A}--@samp{F}
+and @samp{a}--@samp{f}.
+
+@item @code{cntrl}, @code{control}
+Match any character whose code is in the range 0--31.
+
+@item @code{blank}
+Match horizontal whitespace. More precisely, match characters whose
+Unicode @samp{general-category} property indicates that they are
+spacing separators.
+
+@item @code{space}, @code{whitespace}, @code{white}
+Match any character that has whitespace syntax
+(@pxref{Syntax Class Table}).
+
+@item @code{lower}, @code{lower-case}
+Match anything lower-case, as determined by the current case table.
+If @code{case-fold-search} is non-@code{nil}, this also matches any
+upper-case letter.
+
+@item @code{upper}, @code{upper-case}
+Match anything upper-case, as determined by the current case table.
+If @code{case-fold-search} is non-@code{nil}, this also matches any
+lower-case letter.
+
+@item @code{graph}, @code{graphic}
+Match any character except whitespace, @acronym{ASCII} and
+non-@acronym{ASCII} control characters, surrogates, and codepoints
+unassigned by Unicode, as indicated by the Unicode
+@samp{general-category} property.
+
+@item @code{print}, @code{printing}
+Match whitespace or a character matched by @code{graph}.
+
+@item @code{punct}, @code{punctuation}
+Match any punctuation character. (At present, for multibyte
+characters, anything that has non-word syntax.)
+
+@item @code{word}, @code{wordchar}
+Match any character that has word syntax (@pxref{Syntax Class Table}).
+
+@item @code{ascii}
+Match any @acronym{ASCII} character (codes 0--127).
+
+@item @code{nonascii}
+Match any non-@acronym{ASCII} character (but not raw bytes).
+@end table
+
+Corresponding string regexp: @samp{[[:@var{class}:]]}
+
+@item @code{(syntax @var{syntax})}
+@cindex @code{syntax} in rx
+Match a character with syntax @var{syntax}, being one of the following
+names:
+
+@multitable {@code{close-parenthesis}} {Syntax character}
+@headitem Syntax name @tab Syntax character
+@item @code{whitespace} @tab @code{-}
+@item @code{punctuation} @tab @code{.}
+@item @code{word} @tab @code{w}
+@item @code{symbol} @tab @code{_}
+@item @code{open-parenthesis} @tab @code{(}
+@item @code{close-parenthesis} @tab @code{)}
+@item @code{expression-prefix} @tab @code{'}
+@item @code{string-quote} @tab @code{"}
+@item @code{paired-delimiter} @tab @code{$}
+@item @code{escape} @tab @code{\}
+@item @code{character-quote} @tab @code{/}
+@item @code{comment-start} @tab @code{<}
+@item @code{comment-end} @tab @code{>}
+@item @code{string-delimiter} @tab @code{|}
+@item @code{comment-delimiter} @tab @code{!}
+@end multitable
+
+For details, @pxref{Syntax Class Table}. Please note that
+@code{(syntax punctuation)} is @emph{not} equivalent to the character class
+@code{punctuation}.@*
+Corresponding string regexp: @samp{\s@var{code}}
+
+@item @code{(category @var{category})}
+@cindex @code{category} in rx
+Match a character in category @var{category}, which is either one of
+the names below or its category character.
+
+@multitable {@code{vowel-modifying-diacritical-mark}} {Category character}
+@headitem Category name @tab Category character
+@item @code{space-for-indent} @tab space
+@item @code{base} @tab @code{.}
+@item @code{consonant} @tab @code{0}
+@item @code{base-vowel} @tab @code{1}
+@item @code{upper-diacritical-mark} @tab @code{2}
+@item @code{lower-diacritical-mark} @tab @code{3}
+@item @code{tone-mark} @tab @code{4}
+@item @code{symbol} @tab @code{5}
+@item @code{digit} @tab @code{6}
+@item @code{vowel-modifying-diacritical-mark} @tab @code{7}
+@item @code{vowel-sign} @tab @code{8}
+@item @code{semivowel-lower} @tab @code{9}
+@item @code{not-at-end-of-line} @tab @code{<}
+@item @code{not-at-beginning-of-line} @tab @code{>}
+@item @code{alpha-numeric-two-byte} @tab @code{A}
+@item @code{chinese-two-byte} @tab @code{C}
+@item @code{greek-two-byte} @tab @code{G}
+@item @code{japanese-hiragana-two-byte} @tab @code{H}
+@item @code{indian-two-byte} @tab @code{I}
+@item @code{japanese-katakana-two-byte} @tab @code{K}
+@item @code{strong-left-to-right} @tab @code{L}
+@item @code{korean-hangul-two-byte} @tab @code{N}
+@item @code{strong-right-to-left} @tab @code{R}
+@item @code{cyrillic-two-byte} @tab @code{Y}
+@item @code{combining-diacritic} @tab @code{^}
+@item @code{ascii} @tab @code{a}
+@item @code{arabic} @tab @code{b}
+@item @code{chinese} @tab @code{c}
+@item @code{ethiopic} @tab @code{e}
+@item @code{greek} @tab @code{g}
+@item @code{korean} @tab @code{h}
+@item @code{indian} @tab @code{i}
+@item @code{japanese} @tab @code{j}
+@item @code{japanese-katakana} @tab @code{k}
+@item @code{latin} @tab @code{l}
+@item @code{lao} @tab @code{o}
+@item @code{tibetan} @tab @code{q}
+@item @code{japanese-roman} @tab @code{r}
+@item @code{thai} @tab @code{t}
+@item @code{vietnamese} @tab @code{v}
+@item @code{hebrew} @tab @code{w}
+@item @code{cyrillic} @tab @code{y}
+@item @code{can-break} @tab @code{|}
+@end multitable
+
+For more information about currently defined categories, run the
+command @kbd{M-x describe-categories @key{RET}}. For how to define
+new categories, @pxref{Categories}.@*
+Corresponding string regexp: @samp{\c@var{code}}
+@end table
+
+@subsubheading Zero-width assertions
+
+These all match the empty string, but only in specific places.
+
+@table @asis
+@item @code{line-start}, @code{bol}
+@cindex @code{line-start} in rx
+@cindex @code{bol} in rx
+Match at the beginning of a line.@*
+Corresponding string regexp: @samp{^}
+
+@item @code{line-end}, @code{eol}
+@cindex @code{line-end} in rx
+@cindex @code{eol} in rx
+Match at the end of a line.@*
+Corresponding string regexp: @samp{$}
+
+@item @code{string-start}, @code{bos}, @code{buffer-start}, @code{bot}
+@cindex @code{string-start} in rx
+@cindex @code{bos} in rx
+@cindex @code{buffer-start} in rx
+@cindex @code{bot} in rx
+Match at the start of the string or buffer being matched against.@*
+Corresponding string regexp: @samp{\`}
+
+@item @code{string-end}, @code{eos}, @code{buffer-end}, @code{eot}
+@cindex @code{string-end} in rx
+@cindex @code{eos} in rx
+@cindex @code{buffer-end} in rx
+@cindex @code{eot} in rx
+Match at the end of the string or buffer being matched against.@*
+Corresponding string regexp: @samp{\'}
+
+@item @code{point}
+@cindex @code{point} in rx
+Match at point.@*
+Corresponding string regexp: @samp{\=}
+
+@item @code{word-start}
+@cindex @code{word-start} in rx
+Match at the beginning of a word.@*
+Corresponding string regexp: @samp{\<}
+
+@item @code{word-end}
+@cindex @code{word-end} in rx
+Match at the end of a word.@*
+Corresponding string regexp: @samp{\>}
+
+@item @code{word-boundary}
+@cindex @code{word-boundary} in rx
+Match at the beginning or end of a word.@*
+Corresponding string regexp: @samp{\b}
+
+@item @code{not-word-boundary}
+@cindex @code{not-word-boundary} in rx
+Match anywhere but at the beginning or end of a word.@*
+Corresponding string regexp: @samp{\B}
+
+@item @code{symbol-start}
+@cindex @code{symbol-start} in rx
+Match at the beginning of a symbol.@*
+Corresponding string regexp: @samp{\_<}
+
+@item @code{symbol-end}
+@cindex @code{symbol-end} in rx
+Match at the end of a symbol.@*
+Corresponding string regexp: @samp{\_>}
+@end table
+
+@subsubheading Capture groups
+
+@table @code
+@item (group @var{rx}@dots{})
+@cindex @code{group} in rx
+@itemx (submatch @var{rx}@dots{})
+@cindex @code{submatch} in rx
+Match the @var{rx}s, making the matched text and position accessible
+in the match data. The first group in a regexp is numbered 1;
+subsequent groups will be numbered one higher than the previous
+group.@*
+Corresponding string regexp: @samp{\(@dots{}\)}
+
+@item (group-n @var{n} @var{rx}@dots{})
+@cindex @code{group-n} in rx
+@itemx (submatch-n @var{n} @var{rx}@dots{})
+@cindex @code{submatch-n} in rx
+Like @code{group}, but explicitly assign the group number @var{n}.
+@var{n} must be positive.@*
+Corresponding string regexp: @samp{\(?@var{n}:@dots{}\)}
+
+@item (backref @var{n})
+@cindex @code{backref} in rx
+Match the text previously matched by group number @var{n}.
+@var{n} must be in the range 1--9.@*
+Corresponding string regexp: @samp{\@var{n}}
+@end table
+
+@subsubheading Dynamic inclusion
+
+@table @code
+@item (literal @var{expr})
+@cindex @code{literal} in rx
+Match the literal string that is the result from evaluating the Lisp
+expression @var{expr}. The evaluation takes place at call time, in
+the current lexical environment.
+
+@item (regexp @var{expr})
+@cindex @code{regexp} in rx
+@itemx (regex @var{expr})
+@cindex @code{regex} in rx
+Match the string regexp that is the result from evaluating the Lisp
+expression @var{expr}. The evaluation takes place at call time, in
+the current lexical environment.
+
+@item (eval @var{expr})
+@cindex @code{eval} in rx
+Match the rx form that is the result from evaluating the Lisp
+expression @var{expr}. The evaluation takes place at macro-expansion
+time for @code{rx}, at call time for @code{rx-to-string},
+in the current global environment.
+@end table
+
+@node Rx Functions
+@subsubsection Functions and macros using @code{rx} regexps
+
+@defmac rx rx-expr@dots{}
+Translate the @var{rx-expr}s to a string regexp, as if they were the
+body of a @code{(seq @dots{})} form. The @code{rx} macro expands to a
+string constant, or, if @code{literal} or @code{regexp} forms are
+used, a Lisp expression that evaluates to a string.
+@end defmac
+
+@defun rx-to-string rx-expr &optional no-group
+Translate @var{rx-expr} to a string regexp which is returned.
+If @var{no-group} is absent or @code{nil}, bracket the result in a
+non-capturing group, @samp{\(?:@dots{}\)}, if necessary to ensure that
+a postfix operator appended to it will apply to the whole expression.
+
+Arguments to @code{literal} and @code{regexp} forms in @var{rx-expr}
+must be string literals.
+@end defun
+
+The @code{pcase} macro can use @code{rx} expressions as patterns
+directly; @pxref{rx in pcase}.
+@end ifnottex
+
@node Regexp Functions
@subsection Regular Expression Functions
--
2.20.1 (Apple Git-117)
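For readers following the manual patch above, here is a minimal sketch (not part of the patch) of how the block-comment regexp from the new Rx Notation node behaves when evaluated. The names my-c-comment-rx and my-sample and the sample text are made up for illustration. The last form is one possible answer to the footnote's "(how?)", assuming the `*?' and `anything' constructs behave as the patch documents them.

```elisp
(require 'rx)

;; Hypothetical name; the rx form itself is copied from the patch.
(defconst my-c-comment-rx
  (rx "/*"
      (* (| (not (any "*"))
            (: "*" (not (any "/")))))
      (+ "*") "/")
  "Regexp matching a C block comment.")

;; Hypothetical sample input containing two comments.
(defconst my-sample "int x; /* a * b */ int y; /* c */")

;; The rx form and the string form from the patch find the same match.
(and (string-match my-c-comment-rx my-sample)
     (match-string 0 my-sample))          ; => "/* a * b */"

(and (string-match "/\\*\\(?:[^*]\\|\\*[^/]\\)*\\*+/" my-sample)
     (match-string 0 my-sample))          ; => "/* a * b */"

;; One possible non-greedy formulation (an assumption, not from the patch):
;; stop at the first "*/" instead of spelling out what may precede it.
(and (string-match (rx "/*" (*? anything) "*/") my-sample)
     (match-string 0 my-sample))          ; => "/* a * b */"
```

Comparing match results rather than the generated strings keeps the sketch independent of how a particular rx implementation chooses to spell its output.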
[-- Attachment #3: 0001-Shorter-rx-doc-string-bug-36496.patch --]
[-- Type: application/octet-stream, Size: 16805 bytes --]
From 48b2baa78e62a97ca2d621c5be643e6b539d78bf Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mattias=20Engdeg=C3=A5rd?= <mattiase@acm.org>
Date: Sat, 6 Jul 2019 13:22:15 +0200
Subject: [PATCH] Shorter `rx' doc string (bug#36496)
* lisp/emacs-lisp/rx.el (rx): Replace long description with a condensed
summary of the rx syntax, with reference to the manual section.
---
lisp/emacs-lisp/rx.el | 416 ++++++++++--------------------------------
1 file changed, 95 insertions(+), 321 deletions(-)
diff --git a/lisp/emacs-lisp/rx.el b/lisp/emacs-lisp/rx.el
index 24dd6cbf1d..e65460c39d 100644
--- a/lisp/emacs-lisp/rx.el
+++ b/lisp/emacs-lisp/rx.el
@@ -959,327 +959,101 @@ rx-to-string
;;;###autoload
(defmacro rx (&rest regexps)
"Translate regular expressions REGEXPS in sexp form to a regexp string.
-REGEXPS is a non-empty sequence of forms of the sort listed below.
-
-Note that `rx' is a Lisp macro; when used in a Lisp program being
-compiled, the translation is performed by the compiler. The
-`literal' and `regexp' forms accept subforms that will evaluate
-to strings, in addition to constant strings. If REGEXPS include
-such forms, then the result is an expression which returns a
-regexp string, rather than a regexp string directly. See
-`rx-to-string' for performing translation completely at run time.
-
-The following are valid subforms of regular expressions in sexp
-notation.
-
-STRING
- matches string STRING literally.
-
-CHAR
- matches character CHAR literally.
-
-`not-newline', `nonl'
- matches any character except a newline.
-
-`anything'
- matches any character
-
-`(any SET ...)'
-`(in SET ...)'
-`(char SET ...)'
- matches any character in SET .... SET may be a character or string.
- Ranges of characters can be specified as `A-Z' in strings.
- Ranges may also be specified as conses like `(?A . ?Z)'.
- Reversed ranges like `Z-A' and `(?Z . ?A)' are not permitted.
-
- SET may also be the name of a character class: `digit',
- `control', `hex-digit', `blank', `graph', `print', `alnum',
- `alpha', `ascii', `nonascii', `lower', `punct', `space', `upper',
- `word', or one of their synonyms.
-
-`(not (any SET ...))'
- matches any character not in SET ...
-
-`line-start', `bol'
- matches the empty string, but only at the beginning of a line
- in the text being matched
-
-`line-end', `eol'
- is similar to `line-start' but matches only at the end of a line
-
-`string-start', `bos', `bot'
- matches the empty string, but only at the beginning of the
- string being matched against.
-
-`string-end', `eos', `eot'
- matches the empty string, but only at the end of the
- string being matched against.
-
-`buffer-start'
- matches the empty string, but only at the beginning of the
- buffer being matched against. Actually equivalent to `string-start'.
-
-`buffer-end'
- matches the empty string, but only at the end of the
- buffer being matched against. Actually equivalent to `string-end'.
-
-`point'
- matches the empty string, but only at point.
-
-`word-start', `bow'
- matches the empty string, but only at the beginning of a word.
-
-`word-end', `eow'
- matches the empty string, but only at the end of a word.
-
-`word-boundary'
- matches the empty string, but only at the beginning or end of a
- word.
-
-`(not word-boundary)'
-`not-word-boundary'
- matches the empty string, but not at the beginning or end of a
- word.
-
-`symbol-start'
- matches the empty string, but only at the beginning of a symbol.
-
-`symbol-end'
- matches the empty string, but only at the end of a symbol.
-
-`digit', `numeric', `num'
- matches 0 through 9.
-
-`control', `cntrl'
- matches any character whose code is in the range 0-31.
-
-`hex-digit', `hex', `xdigit'
- matches 0 through 9, a through f and A through F.
-
-`blank'
- matches horizontal whitespace, as defined by Annex C of the
- Unicode Technical Standard #18. In particular, it matches
- spaces, tabs, and other characters whose Unicode
- `general-category' property indicates they are spacing
- separators.
-
-`graphic', `graph'
- matches graphic characters--everything except whitespace, ASCII
- and non-ASCII control characters, surrogates, and codepoints
- unassigned by Unicode.
-
-`printing', `print'
- matches whitespace and graphic characters.
-
-`alphanumeric', `alnum'
- matches alphabetic characters and digits. For multibyte characters,
- it matches characters whose Unicode `general-category' property
- indicates they are alphabetic or decimal number characters.
-
-`letter', `alphabetic', `alpha'
- matches alphabetic characters. For multibyte characters,
- it matches characters whose Unicode `general-category' property
- indicates they are alphabetic characters.
-
-`ascii'
- matches ASCII (unibyte) characters.
-
-`nonascii'
- matches non-ASCII (multibyte) characters.
-
-`lower', `lower-case'
- matches anything lower-case, as determined by the current case
- table. If `case-fold-search' is non-nil, this also matches any
- upper-case letter.
-
-`upper', `upper-case'
- matches anything upper-case, as determined by the current case
- table. If `case-fold-search' is non-nil, this also matches any
- lower-case letter.
-
-`punctuation', `punct'
- matches punctuation. (But at present, for multibyte characters,
- it matches anything that has non-word syntax.)
-
-`space', `whitespace', `white'
- matches anything that has whitespace syntax.
-
-`word', `wordchar'
- matches anything that has word syntax.
-
-`not-wordchar'
- matches anything that has non-word syntax.
-
-`(syntax SYNTAX)'
- matches a character with syntax SYNTAX. SYNTAX must be one
- of the following symbols, or a symbol corresponding to the syntax
- character, e.g. `\\.' for `\\s.'.
-
- `whitespace' (\\s- in string notation)
- `punctuation' (\\s.)
- `word' (\\sw)
- `symbol' (\\s_)
- `open-parenthesis' (\\s()
- `close-parenthesis' (\\s))
- `expression-prefix' (\\s')
- `string-quote' (\\s\")
- `paired-delimiter' (\\s$)
- `escape' (\\s\\)
- `character-quote' (\\s/)
- `comment-start' (\\s<)
- `comment-end' (\\s>)
- `string-delimiter' (\\s|)
- `comment-delimiter' (\\s!)
-
-`(not (syntax SYNTAX))'
- matches a character that doesn't have syntax SYNTAX.
-
-`(category CATEGORY)'
- matches a character with category CATEGORY. CATEGORY must be
- either a character to use for C, or one of the following symbols.
-
- `space-for-indent' (\\c\\s in string notation)
- `base' (\\c.)
- `consonant' (\\c0)
- `base-vowel' (\\c1)
- `upper-diacritical-mark' (\\c2)
- `lower-diacritical-mark' (\\c3)
- `tone-mark' (\\c4)
- `symbol' (\\c5)
- `digit' (\\c6)
- `vowel-modifying-diacritical-mark' (\\c7)
- `vowel-sign' (\\c8)
- `semivowel-lower' (\\c9)
- `not-at-end-of-line' (\\c<)
- `not-at-beginning-of-line' (\\c>)
- `alpha-numeric-two-byte' (\\cA)
- `chinese-two-byte' (\\cC)
- `greek-two-byte' (\\cG)
- `japanese-hiragana-two-byte' (\\cH)
- `indian-two-byte' (\\cI)
- `japanese-katakana-two-byte' (\\cK)
- `strong-left-to-right' (\\cL)
- `korean-hangul-two-byte' (\\cN)
- `strong-right-to-left' (\\cR)
- `cyrillic-two-byte' (\\cY)
- `combining-diacritic' (\\c^)
- `ascii' (\\ca)
- `arabic' (\\cb)
- `chinese' (\\cc)
- `ethiopic' (\\ce)
- `greek' (\\cg)
- `korean' (\\ch)
- `indian' (\\ci)
- `japanese' (\\cj)
- `japanese-katakana' (\\ck)
- `latin' (\\cl)
- `lao' (\\co)
- `tibetan' (\\cq)
- `japanese-roman' (\\cr)
- `thai' (\\ct)
- `vietnamese' (\\cv)
- `hebrew' (\\cw)
- `cyrillic' (\\cy)
- `can-break' (\\c|)
-
-`(not (category CATEGORY))'
- matches a character that doesn't have category CATEGORY.
-
-`(and SEXP1 SEXP2 ...)'
-`(: SEXP1 SEXP2 ...)'
-`(seq SEXP1 SEXP2 ...)'
-`(sequence SEXP1 SEXP2 ...)'
- matches what SEXP1 matches, followed by what SEXP2 matches, etc.
- Without arguments, matches the empty string.
-
-`(submatch SEXP1 SEXP2 ...)'
-`(group SEXP1 SEXP2 ...)'
- like `and', but makes the match accessible with `match-end',
- `match-beginning', and `match-string'.
-
-`(submatch-n N SEXP1 SEXP2 ...)'
-`(group-n N SEXP1 SEXP2 ...)'
- like `group', but make it an explicitly-numbered group with
- group number N.
-
-`(or SEXP1 SEXP2 ...)'
-`(| SEXP1 SEXP2 ...)'
- matches anything that matches SEXP1 or SEXP2, etc. If all
- args are strings, use `regexp-opt' to optimize the resulting
- regular expression. Without arguments, never matches anything.
-
-`(minimal-match SEXP)'
- produce a non-greedy regexp for SEXP. Normally, regexps matching
- zero or more occurrences of something are \"greedy\" in that they
- match as much as they can, as long as the overall regexp can
- still match. A non-greedy regexp matches as little as possible.
-
-`(maximal-match SEXP)'
- produce a greedy regexp for SEXP. This is the default.
-
-Below, `SEXP ...' represents a sequence of regexp forms, treated as if
-enclosed in `(and ...)'.
-
-`(zero-or-more SEXP ...)'
-`(0+ SEXP ...)'
- matches zero or more occurrences of what SEXP ... matches.
-
-`(* SEXP ...)'
- like `zero-or-more', but always produces a greedy regexp, independent
- of `rx-greedy-flag'.
-
-`(*? SEXP ...)'
- like `zero-or-more', but always produces a non-greedy regexp,
- independent of `rx-greedy-flag'.
-
-`(one-or-more SEXP ...)'
-`(1+ SEXP ...)'
- matches one or more occurrences of SEXP ...
-
-`(+ SEXP ...)'
- like `one-or-more', but always produces a greedy regexp.
-
-`(+? SEXP ...)'
- like `one-or-more', but always produces a non-greedy regexp.
-
-`(zero-or-one SEXP ...)'
-`(optional SEXP ...)'
-`(opt SEXP ...)'
- matches zero or one occurrences of A.
-
-`(? SEXP ...)'
- like `zero-or-one', but always produces a greedy regexp.
-
-`(?? SEXP ...)'
- like `zero-or-one', but always produces a non-greedy regexp.
-
-`(repeat N SEXP)'
-`(= N SEXP ...)'
- matches N occurrences.
-
-`(>= N SEXP ...)'
- matches N or more occurrences.
-
-`(repeat N M SEXP)'
-`(** N M SEXP ...)'
- matches N to M occurrences.
-
-`(backref N)'
- matches what was matched previously by submatch N.
-
-`(literal STRING-EXPR)'
- matches STRING-EXPR literally, where STRING-EXPR is any lisp
- expression that evaluates to a string.
-
-`(regexp REGEXP-EXPR)'
- include REGEXP-EXPR in string notation in the result, where
- REGEXP-EXPR is any lisp expression that evaluates to a
- string containing a valid regexp.
-
-`(eval FORM)'
- evaluate FORM and insert result. If result is a string,
- `regexp-quote' it. Note that FORM is evaluated during
- macroexpansion."
+Each argument is one of the forms below; RX is a subform, and RX... stands
+for one or more RXs. For details, see Info node `(elisp) Rx Notation'.
+See `rx-to-string' for the corresponding function.
+
+STRING Match a literal string.
+CHAR Match a literal character.
+
+(seq RX...) Match the RXs in sequence. Alias: :, sequence, and
+(or RX...) Match one of the RXs. Alias: |
+
+(zero-or-more RX...) Match RXs zero or more times. Alias: *, 0+
+(one-or-more RX...) Match RXs one or more times. Alias: +, 1+
+(zero-or-one RX...) Match RXs or the empty string. Alias: ?, opt, optional
+(*? RX...) Match RXs zero or more times, non-greedily.
+(+? RX...) Match RXs one or more times, non-greedily.
+(?? RX...) Match RXs or the empty string, non-greedily.
+(= N RX...) Match RXs exactly N times.
+(>= N RX...) Match RXs N or more times.
+(** N M RX...) Match RXs N to M times. Alias: repeat
+(minimal-match RX) Match RX, with zero-or-more, one-or-more,
+ zero-or-one, 0+, 1+, opt, and optional
+ using non-greedy matching.
+(maximal-match RX) Match RX, with zero-or-more, one-or-more,
+ zero-or-one, 0+, 1+, opt, and optional
+ using greedy matching, which is the default.
+
+(any SET...) Match a character from one of the SETs. Each SET is a
+ character, a string, a range as string \"A-Z\" or cons
+ (?A . ?Z), or a character class (see below). Alias: in, char
+(not CHARSPEC) Match one character not matched by CHARSPEC. CHARSPEC
+ can be (any ...), (syntax ...), (category ...),
+ or a character class.
+not-newline Match any character except a newline. Alias: nonl
+anything Match any character.
+
+CHARCLASS Match a character from a character class. One of:
+ alpha, alphabetic, letter alphabetic characters (defined by Unicode)
+ alnum, alphanumeric alphabetic or decimal digit chars (Unicode)
+ digit, numeric, num 0-9
+ xdigit, hex-digit, hex 0-9, A-F, a-f
+ cntrl, control ASCII codes 0-31
+ blank horizontal whitespace (Unicode)
+ space, whitespace, white chars with whitespace syntax
+ lower, lower-case lower-case chars, from current case table
+ upper, upper-case upper-case chars, from current case table
+ graph, graphic graphic characters (Unicode)
+ print, printing whitespace or graphic (Unicode)
+ punct, punctuation not control, space, letter or digit (ASCII);
+ not word syntax (non-ASCII)
+ word, wordchar characters with word syntax
+ ascii ASCII characters (codes 0-127)
+ nonascii non-ASCII characters (but not raw bytes)
+
+(syntax SYNTAX) Match a character with syntax SYNTAX, being one of:
+ whitespace, punctuation, word, symbol, open-parenthesis,
+ close-parenthesis, expression-prefix, string-quote,
+ paired-delimiter, escape, character-quote, comment-start,
+ comment-end, string-delimiter, comment-delimiter
+
+(category CAT) Match a character in category CAT, being one of:
+ space-for-indent, base, consonant, base-vowel,
+ upper-diacritical-mark, lower-diacritical-mark, tone-mark, symbol,
+ digit, vowel-modifying-diacritical-mark, vowel-sign,
+ semivowel-lower, not-at-end-of-line, not-at-beginning-of-line,
+ alpha-numeric-two-byte, chinese-two-byte, greek-two-byte,
+ japanese-hiragana-two-byte, indian-two-byte,
+ japanese-katakana-two-byte, strong-left-to-right,
+ korean-hangul-two-byte, strong-right-to-left, cyrillic-two-byte,
+ combining-diacritic, ascii, arabic, chinese, ethiopic, greek,
+ korean, indian, japanese, japanese-katakana, latin, lao,
+ tibetan, japanese-roman, thai, vietnamese, hebrew, cyrillic,
+ can-break
+
+Zero-width assertions: these all match the empty string in specific places.
+ line-start at the beginning of a line. Alias: bol
+ line-end at the end of a line. Alias: eol
+ string-start at the start of the string or buffer.
+ Alias: buffer-start, bos, bot
+ string-end at the end of the string or buffer.
+ Alias: buffer-end, eos, eot
+ point at point.
+ word-start at the beginning of a word.
+ word-end at the end of a word.
+ word-boundary at the beginning or end of a word.
+ not-word-boundary not at the beginning or end of a word.
+ symbol-start at the beginning of a symbol.
+ symbol-end at the end of a symbol.
+
+(group RX...) Match RXs and define a capture group. Alias: submatch
+(group-n N RX...) Match RXs and define capture group N. Alias: submatch-n
+(backref N) Match the text that capture group N matched.
+
+(literal EXPR) Match the literal string from evaluating the EXPR at run time.
+(regexp EXPR) Match the string regexp from evaluating EXPR at run time.
+(eval EXPR) Match the rx sexp from evaluating EXPR at compile time."
(let* ((rx--compile-to-lisp t)
(re (cond ((null regexps)
(error "No regexp"))
--
2.20.1 (Apple Git-117)
* bug#36496: [PATCH] Describe the rx notation in the lisp manual
2019-07-06 11:33 ` Mattias Engdegård
@ 2019-07-06 11:41 ` Eli Zaretskii
2019-07-06 18:56 ` Mattias Engdegård
2019-07-06 11:59 ` Noam Postavsky
2019-07-06 23:56 ` Richard Stallman
2 siblings, 1 reply; 26+ messages in thread
From: Eli Zaretskii @ 2019-07-06 11:41 UTC (permalink / raw)
To: Mattias Engdegård; +Cc: 36496
> From: Mattias Engdegård <mattiase@acm.org>
> Date: Sat, 6 Jul 2019 13:33:35 +0200
> Cc: 36496@debbugs.gnu.org
>
> diff --git a/doc/lispref/searching.texi b/doc/lispref/searching.texi
> index ef1cffc446..17c4790f5e 100644
> --- a/doc/lispref/searching.texi
> +++ b/doc/lispref/searching.texi
> @@ -254,6 +254,9 @@ Regular Expressions
> @menu
> * Syntax of Regexps:: Rules for writing regular expressions.
> * Regexp Example:: Illustrates regular expression syntax.
> +@ifnottex
> +* Rx Notation:: An alternative, structured regexp notation.
> +@end ifnottex
> * Regexp Functions:: Functions for operating on regular expressions.
> @end menu
I believe you need the same conditional addition in elisp.texi, in the
detailed menu there.
> * lisp/emacs-lisp/rx.el (rx): Replace long description with a condensed
> summary of the rx syntax, with reference to the manual section.
This is OK, but it is inconsistent wrt whether each construct's
description ends in a period. I suggest to end them all with a
period.
Thanks.
* bug#36496: [PATCH] Describe the rx notation in the lisp manual
2019-07-06 11:33 ` Mattias Engdegård
2019-07-06 11:41 ` Eli Zaretskii
@ 2019-07-06 11:59 ` Noam Postavsky
2019-07-06 23:56 ` Richard Stallman
2 siblings, 0 replies; 26+ messages in thread
From: Noam Postavsky @ 2019-07-06 11:59 UTC (permalink / raw)
To: Mattias Engdegård; +Cc: 36496
Mattias Engdegård <mattiase@acm.org> writes:
> +(zero-or-more RX...) Match RXs zero or more times. Alias: *, 0+
> +(one-or-more RX...) Match RXs one or more times. Alias: +, 1+
> +(zero-or-one RX...) Match RXs or the empty string. Alias: ?, opt, optional
*, +, and ? are not exact aliases of the above: they're always greedy
(as opposed to depending on rx-greedy-flag). I think it's a bit
confusing to rely on the description of minimal-match and maximal-match
to explain that.
> +(minimal-match RX) Match RX, with zero-or-more, one-or-more,
> + zero-or-one, 0+, 1+, opt, and optional
> + using non-greedy matching.
> +(maximal-match RX) Match RX, with zero-or-more, one-or-more,
> + zero-or-one, 0+, 1+, opt, and optional
> + using greedy matching, which is the default.
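For illustration, a minimal sketch of the difference, assuming the rx semantics described above (the long names follow `minimal-match', the short operators do not). The strings in the comments are what the translation is expected to produce under that assumption, not output copied from a session.

```elisp
(require 'rx)

;; The long name picks up the enclosing minimal-match
;; and is translated to the non-greedy operator.
(rx (minimal-match (zero-or-more "a")))   ; expected: "a*?"

;; The short form ignores minimal-match and stays greedy.
(rx (minimal-match (* "a")))              ; expected: "a*"
```

So `*' and `zero-or-more' only coincide as long as no `minimal-match' wrapper is involved.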
* bug#36496: [PATCH] Describe the rx notation in the lisp manual
2019-07-06 11:41 ` Eli Zaretskii
@ 2019-07-06 18:56 ` Mattias Engdegård
2019-07-06 19:10 ` Eli Zaretskii
2019-07-06 19:12 ` Noam Postavsky
0 siblings, 2 replies; 26+ messages in thread
From: Mattias Engdegård @ 2019-07-06 18:56 UTC (permalink / raw)
To: Eli Zaretskii, Noam Postavsky; +Cc: 36496
[-- Attachment #1: Type: text/plain, Size: 1689 bytes --]
On 6 July 2019, at 13:41, Eli Zaretskii <eliz@gnu.org> wrote:
>
> I believe you need the same conditional addition in elisp.texi, in the
> detailed menu there.
Thank you, forgot that one. Added.
>> * lisp/emacs-lisp/rx.el (rx): Replace long description with a condensed
>> summary of the rx syntax, with reference to the manual section.
>
> This is OK, but it is inconsistent wrt whether each construct's
> description ends in a period. I suggest to end them all with a
> period.
Added, except at the end of the lists of aliases which looked better with a minimum of punctuation (and weren't sentences to begin with).
On 6 July 2019, at 13:59, Noam Postavsky <npostavs@gmail.com> wrote:
>
> *, +, and ? are not exact aliases of the above: they're always greedy
> (as opposed to depending on rx-greedy-flag). I think it's a bit
> confusing to rely on the description of minimal-match and maximal-match
> to explain that.
Ah, you called out my little white lie. They are synonyms in practice, because almost nobody uses minimal-match, probably for good reasons. (xr used to generate {minimal|maximal}-match, but it was decidedly less readable so it got changed.)
Yet you are right in the sense that the documentation should not lie or wilfully obscure the workings. There appears to be no good solution, because the underlying design isn't very good. It might be different if minimal-match affected the entire expression inside, including (or ...) and (** ...), but that will have to wait for the next big engine.
The new patch versions describe the semantics more objectively, while still advising the user to steer clear of minimal-match. Good enough?
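As a rough sketch of the limitation mentioned above, assuming the behaviour described in this thread (`minimal-match' affects only the long-named repetition forms), bounded repetition is left untouched, whereas the explicit non-greedy operators need no wrapper at all. The strings in the comments are the expected translations under that assumption.

```elisp
(require 'rx)

;; minimal-match does not reach bounded repetition: ** stays greedy.
(rx (minimal-match (** 1 3 "a")))   ; expected: "a\\{1,3\\}"

;; The explicit non-greedy operators state the intent directly.
(rx (*? "a") (+? "b"))              ; expected: "a*?b+?"
```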
[-- Attachment #2: 0001-Describe-the-rx-notation-in-the-elisp-manual-bug-364.patch --]
[-- Type: application/octet-stream, Size: 24814 bytes --]
From 8c01cf75ec3043c9f7ac5c3d8766616bf6a47e1e Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mattias=20Engdeg=C3=A5rd?= <mattiase@acm.org>
Date: Thu, 4 Jul 2019 13:01:52 +0200
Subject: [PATCH 1/2] Describe the rx notation in the elisp manual (bug#36496)
The additions are excluded from the print version to avoid making it
thicker.
* doc/lispref/elisp.texi (Top): New menu entry.
* doc/lispref/searching.texi (Regular Expressions): New menu entry.
(Regexp Example): Add rx form of the example.
(Rx Notation, Rx Constructs, Rx Functions): New nodes.
* doc/lispref/control.texi (pcase Macro): Describe the rx pattern.
---
doc/lispref/control.texi | 25 ++
doc/lispref/elisp.texi | 3 +
doc/lispref/searching.texi | 573 +++++++++++++++++++++++++++++++++++++
3 files changed, 601 insertions(+)
diff --git a/doc/lispref/control.texi b/doc/lispref/control.texi
index e308d68b75..de6cd9301f 100644
--- a/doc/lispref/control.texi
+++ b/doc/lispref/control.texi
@@ -618,6 +618,31 @@ pcase Macro
to @var{body-forms} (thus avoiding an evaluation error on match),
if any of the sub-patterns let-binds a set of symbols,
they @emph{must} all bind the same set of symbols.
+
+@ifnottex
+@anchor{rx in pcase}
+@item (rx @var{rx-expr}@dots{})
+Matches strings against the regexp @var{rx-expr}@dots{}, using the
+@code{rx} regexp notation (@pxref{Rx Notation}), as if by
+@code{string-match}.
+
+In addition to the usual @code{rx} syntax, @var{rx-expr}@dots{} can
+contain the following constructs:
+
+@table @code
+@item (let @var{ref} @var{rx-expr}@dots{})
+Bind the symbol @var{ref} to a submatch that matches
+@var{rx-expr}@enddots{}. @var{ref} is bound in @var{body-forms} to
+the string of the submatch or @code{nil}, but can also be used in
+@code{backref}.
+
+@item (backref @var{ref})
+Like the standard @code{backref} construct, but @var{ref} can here
+also be a name introduced by a previous @code{(let @var{ref} @dots{})}
+construct.
+@end table
+@end ifnottex
+
@end table
@anchor{pcase-example-0}
diff --git a/doc/lispref/elisp.texi b/doc/lispref/elisp.texi
index e18759654d..c86f7f3dfb 100644
--- a/doc/lispref/elisp.texi
+++ b/doc/lispref/elisp.texi
@@ -1298,6 +1298,9 @@ Top
* Syntax of Regexps:: Rules for writing regular expressions.
* Regexp Example:: Illustrates regular expression syntax.
+@ifnottex
+* Rx Notation:: An alternative, structured regexp notation.
+@end ifnottex
* Regexp Functions:: Functions for operating on regular expressions.
Syntax of Regular Expressions
diff --git a/doc/lispref/searching.texi b/doc/lispref/searching.texi
index ef1cffc446..f95c9bf976 100644
--- a/doc/lispref/searching.texi
+++ b/doc/lispref/searching.texi
@@ -254,6 +254,9 @@ Regular Expressions
@menu
* Syntax of Regexps:: Rules for writing regular expressions.
* Regexp Example:: Illustrates regular expression syntax.
+@ifnottex
+* Rx Notation:: An alternative, structured regexp notation.
+@end ifnottex
* Regexp Functions:: Functions for operating on regular expressions.
@end menu
@@ -359,6 +362,7 @@ Regexp Special
preceding expression either once or not at all. For example,
@samp{ca?r} matches @samp{car} or @samp{cr}; nothing else.
+@anchor{Non-greedy repetition}
@item @samp{*?}, @samp{+?}, @samp{??}
@cindex non-greedy repetition characters in regexp
These are @dfn{non-greedy} variants of the operators @samp{*}, @samp{+}
@@ -951,6 +955,575 @@ Regexp Example
beyond the minimum needed to end a sentence.
@end table
+@ifnottex
+In the @code{rx} notation (@pxref{Rx Notation}), the regexp could be written
+
+@example
+@group
+(rx (any ".?!") ; Punctuation ending sentence.
+ (zero-or-more (any "\"')]@}")) ; Closing quotes or brackets.
+ (or line-end
+ (seq " " line-end)
+ "\t"
+ " ") ; Two spaces.
+ (zero-or-more (any "\t\n "))) ; Optional extra whitespace.
+@end group
+@end example
+
+Since @code{rx} regexps are just S-expressions, they can be formatted
+and commented as such.
+@end ifnottex
+
+@ifnottex
+@node Rx Notation
+@subsection The @code{rx} Structured Regexp Notation
+@cindex rx
+@cindex regexp syntax
+
+ As an alternative to the string-based syntax, Emacs provides the
+structured @code{rx} notation based on Lisp S-expressions. This
+notation is usually easier to read, write and maintain than regexp
+strings, and can be indented and commented freely. It requires a
+conversion into string form since that is what regexp functions
+expect, but that conversion typically takes place during
+byte-compilation rather than when the Lisp code using the regexp is
+run.
+
+ Here is an @code{rx} regexp@footnote{It could be written much
+more simply with non-greedy operators (how?), but that would make the
+example less interesting.} that matches a block comment in the C
+programming language:
+
+@example
+@group
+(rx "/*" ; Initial /*
+ (zero-or-more
+ (or (not (any "*")) ; Either non-*,
+ (seq "*" ; or * followed by
+ (not (any "/"))))) ; non-/
+ (one-or-more "*") ; At least one star,
+ "/") ; and the final /
+@end group
+@end example
+
+@noindent
+or, using shorter synonyms and written more compactly,
+
+@example
+@group
+(rx "/*"
+ (* (| (not (any "*"))
+ (: "*" (not (any "/")))))
+ (+ "*") "/")
+@end group
+@end example
+
+@noindent
+In conventional string syntax, it would be written
+
+@example
+"/\\*\\(?:[^*]\\|\\*[^/]\\)*\\*+/"
+@end example
+
+The @code{rx} notation is mainly useful in Lisp code; it cannot be
+used in most interactive situations where a regexp is requested, such
+as when running @code{query-replace-regexp} or in variable
+customization.
+
+@menu
+* Rx Constructs:: Constructs valid in rx forms.
+* Rx Functions:: Functions and macros that use rx forms.
+@end menu
+
+@node Rx Constructs
+@subsubsection Constructs in @code{rx} regexps
+
+The various forms in @code{rx} regexps are described below. The
+shorthand @var{rx} represents any @code{rx} form, and @var{rx}@dots{}
+means one or more @code{rx} forms. Where the corresponding string
+regexp syntax is given, @var{A}, @var{B}, @dots{} are string regexp
+subexpressions.
+@c With the new implementation of rx, this can be changed from
+@c 'one or more' to 'zero or more'.
+
+@subsubheading Literals
+
+@table @asis
+@item @code{"some-string"}
+Match the string @samp{some-string} literally. There are no
+characters with special meaning, unlike in string regexps.
+
+@item @code{?C}
+Match the character @samp{C} literally.
+@end table
+
+@subsubheading Sequence and alternative
+
+@table @asis
+@item @code{(seq @var{rx}@dots{})}
+@cindex @code{seq} in rx
+@itemx @code{(sequence @var{rx}@dots{})}
+@cindex @code{sequence} in rx
+@itemx @code{(: @var{rx}@dots{})}
+@cindex @code{:} in rx
+@itemx @code{(and @var{rx}@dots{})}
+@cindex @code{and} in rx
+Match the @var{rx}s in sequence. Without arguments, the expression
+matches the empty string.@*
+Corresponding string regexp: @samp{@var{A}@var{B}@dots{}}
+(subexpressions in sequence).
+
+@item @code{(or @var{rx}@dots{})}
+@cindex @code{or} in rx
+@itemx @code{(| @var{rx}@dots{})}
+@cindex @code{|} in rx
+Match exactly one of the @var{rx}s, trying from left to right.
+Without arguments, the expression will not match anything at all.@*
+Corresponding string regexp: @samp{@var{A}\|@var{B}\|@dots{}}.
+@end table
+
+@subsubheading Repetition
+
+Normally, repetition forms are greedy, in that they attempt to match
+as many times as possible. Some forms are non-greedy; they try to
+match as few times as possible (@pxref{Non-greedy repetition}).
+
+@table @code
+@item (zero-or-more @var{rx}@dots{})
+@cindex @code{zero-or-more} in rx
+@itemx (0+ @var{rx}@dots{})
+@cindex @code{0+} in rx
+Match the @var{rx}s zero or more times. Greedy by default.@*
+Corresponding string regexp: @samp{@var{A}*} (greedy),
+@samp{@var{A}*?} (non-greedy)
+
+@item (one-or-more @var{rx}@dots{})
+@cindex @code{one-or-more} in rx
+@itemx (1+ @var{rx}@dots{})
+@cindex @code{1+} in rx
+Match the @var{rx}s one or more times. Greedy by default.@*
+Corresponding string regexp: @samp{@var{A}+} (greedy),
+@samp{@var{A}+?} (non-greedy)
+
+@item (zero-or-one @var{rx}@dots{})
+@cindex @code{zero-or-one} in rx
+@itemx (optional @var{rx}@dots{})
+@cindex @code{optional} in rx
+@itemx (opt @var{rx}@dots{})
+@cindex @code{opt} in rx
+Match the @var{rx}s once or an empty string. Greedy by default.@*
+Corresponding string regexp: @samp{@var{A}?} (greedy),
+@samp{@var{A}??} (non-greedy).
+
+@item (* @var{rx}@dots{})
+@cindex @code{*} in rx
+Match the @var{rx}s zero or more times. Greedy.@*
+Corresponding string regexp: @samp{@var{A}*}
+
+@item (+ @var{rx}@dots{})
+@cindex @code{+} in rx
+Match the @var{rx}s one or more times. Greedy.@*
+Corresponding string regexp: @samp{@var{A}+}
+
+@item (? @var{rx}@dots{})
+@cindex @code{?} in rx
+Match the @var{rx}s once or an empty string. Greedy.@*
+Corresponding string regexp: @samp{@var{A}?}
+
+@item (*? @var{rx}@dots{})
+@cindex @code{*?} in rx
+Match the @var{rx}s zero or more times. Non-greedy.@*
+Corresponding string regexp: @samp{@var{A}*?}
+
+@item (+? @var{rx}@dots{})
+@cindex @code{+?} in rx
+Match the @var{rx}s one or more times. Non-greedy.@*
+Corresponding string regexp: @samp{@var{A}+?}
+
+@item (?? @var{rx}@dots{})
+@cindex @code{??} in rx
+Match the @var{rx}s or an empty string. Non-greedy.@*
+Corresponding string regexp: @samp{@var{A}??}
+
+@item (= @var{n} @var{rx}@dots{})
+@cindex @code{=} in rx
+@itemx (repeat @var{n} @var{rx})
+Match the @var{rx}s exactly @var{n} times.@*
+Corresponding string regexp: @samp{@var{A}\@{@var{n}\@}}
+
+@item (>= @var{n} @var{rx}@dots{})
+@cindex @code{>=} in rx
+Match the @var{rx}s @var{n} or more times. Greedy.@*
+Corresponding string regexp: @samp{@var{A}\@{@var{n},\@}}
+
+@item (** @var{n} @var{m} @var{rx}@dots{})
+@cindex @code{**} in rx
+@itemx (repeat @var{n} @var{m} @var{rx}@dots{})
+@cindex @code{repeat} in rx
+Match the @var{rx}s at least @var{n} but no more than @var{m} times. Greedy.@*
+Corresponding string regexp: @samp{@var{A}\@{@var{n},@var{m}\@}}
+@end table
+
+The greediness of some repetition forms can be controlled using the
+following constructs. However, it is usually better to use the
+explicit non-greedy forms above when such matching is required.
+
+@table @code
+@item (minimal-match @var{rx})
+@cindex @code{minimal-match} in rx
+Match @var{rx}, with @code{zero-or-more}, @code{0+},
+@code{one-or-more}, @code{1+}, @code{zero-or-one}, @code{opt} and
+@code{optional} using non-greedy matching.
+
+@item (maximal-match @var{rx})
+@cindex @code{maximal-match} in rx
+Match @var{rx}, with @code{zero-or-more}, @code{0+},
+@code{one-or-more}, @code{1+}, @code{zero-or-one}, @code{opt} and
+@code{optional} using greedy matching. This is the default.
+@end table
+
+@subsubheading Matching single characters
+
+@table @asis
+@item @code{(any @var{set}@dots{})}
+@cindex @code{any} in rx
+@itemx @code{(char @var{set}@dots{})}
+@cindex @code{char} in rx
+@itemx @code{(in @var{set}@dots{})}
+@cindex @code{in} in rx
+@cindex character class in rx
+Match a single character from one of the @var{set}s. Each @var{set}
+is a character, a string representing the set of its characters, a
+range or a character class (see below). A range is either a
+hyphen-separated string like @code{"A-Z"}, or a cons of characters
+like @code{(?A . ?Z)}.
+
+Note that hyphen (@code{-}) is special in strings in this construct,
+since it acts as a range separator. To include a hyphen, add it as a
+separate character or single-character string.@*
+Corresponding string regexp: @samp{[@dots{}]}
+
+@item @code{(not @var{charspec})}
+@cindex @code{not} in rx
+Match a character not included in @var{charspec}. @var{charspec} can
+be an @code{any}, @code{syntax} or @code{category} form, or a
+character class.@*
+Corresponding string regexp: @samp{[^@dots{}]}, @samp{\S@var{code}},
+@samp{\C@var{code}}
+
+@item @code{not-newline}, @code{nonl}
+@cindex @code{not-newline} in rx
+@cindex @code{nonl} in rx
+Match any character except a newline.@*
+Corresponding string regexp: @samp{.} (dot)
+
+@item @code{anything}
+@cindex @code{anything} in rx
+Match any character.@*
+Corresponding string regexp: @samp{.\|\n} (for example)
+
+@item character class
+@cindex character class in rx
+Match a character from a named character class:
+
+@table @asis
+@item @code{alpha}, @code{alphabetic}, @code{letter}
+Match alphabetic characters. More precisely, match characters whose
+Unicode @samp{general-category} property indicates that they are
+alphabetic.
+
+@item @code{alnum}, @code{alphanumeric}
+Match alphabetic characters and digits. More precisely, match
+characters whose Unicode @samp{general-category} property indicates
+that they are alphabetic or decimal digits.
+
+@item @code{digit}, @code{numeric}, @code{num}
+Match the digits @samp{0}--@samp{9}.
+
+@item @code{xdigit}, @code{hex-digit}, @code{hex}
+Match the hexadecimal digits @samp{0}--@samp{9}, @samp{A}--@samp{F}
+and @samp{a}--@samp{f}.
+
+@item @code{cntrl}, @code{control}
+Match any character whose code is in the range 0--31.
+
+@item @code{blank}
+Match horizontal whitespace. More precisely, match characters whose
+Unicode @samp{general-category} property indicates that they are
+spacing separators.
+
+@item @code{space}, @code{whitespace}, @code{white}
+Match any character that has whitespace syntax
+(@pxref{Syntax Class Table}).
+
+@item @code{lower}, @code{lower-case}
+Match anything lower-case, as determined by the current case table.
+If @code{case-fold-search} is non-@code{nil}, this also matches any
+upper-case letter.
+
+@item @code{upper}, @code{upper-case}
+Match anything upper-case, as determined by the current case table.
+If @code{case-fold-search} is non-@code{nil}, this also matches any
+lower-case letter.
+
+@item @code{graph}, @code{graphic}
+Match any character except whitespace, @acronym{ASCII} and
+non-@acronym{ASCII} control characters, surrogates, and codepoints
+unassigned by Unicode, as indicated by the Unicode
+@samp{general-category} property.
+
+@item @code{print}, @code{printing}
+Match whitespace or a character matched by @code{graph}.
+
+@item @code{punct}, @code{punctuation}
+Match any punctuation character. (At present, for multibyte
+characters, anything that has non-word syntax.)
+
+@item @code{word}, @code{wordchar}
+Match any character that has word syntax (@pxref{Syntax Class Table}).
+
+@item @code{ascii}
+Match any @acronym{ASCII} character (codes 0--127).
+
+@item @code{nonascii}
+Match any non-@acronym{ASCII} character (but not raw bytes).
+@end table
+
+Corresponding string regexp: @samp{[[:@var{class}:]]}
+
+@item @code{(syntax @var{syntax})}
+@cindex @code{syntax} in rx
+Match a character with syntax @var{syntax}, being one of the following
+names:
+
+@multitable {@code{close-parenthesis}} {Syntax character}
+@headitem Syntax name @tab Syntax character
+@item @code{whitespace} @tab @code{-}
+@item @code{punctuation} @tab @code{.}
+@item @code{word} @tab @code{w}
+@item @code{symbol} @tab @code{_}
+@item @code{open-parenthesis} @tab @code{(}
+@item @code{close-parenthesis} @tab @code{)}
+@item @code{expression-prefix} @tab @code{'}
+@item @code{string-quote} @tab @code{"}
+@item @code{paired-delimiter} @tab @code{$}
+@item @code{escape} @tab @code{\}
+@item @code{character-quote} @tab @code{/}
+@item @code{comment-start} @tab @code{<}
+@item @code{comment-end} @tab @code{>}
+@item @code{string-delimiter} @tab @code{|}
+@item @code{comment-delimiter} @tab @code{!}
+@end multitable
+
+For details, @pxref{Syntax Class Table}. Please note that
+@code{(syntax punctuation)} is @emph{not} equivalent to the character class
+@code{punctuation}.@*
+Corresponding string regexp: @samp{\s@var{code}}
+
+@item @code{(category @var{category})}
+@cindex @code{category} in rx
+Match a character in category @var{category}, which is either one of
+the names below or its category character.
+
+@multitable {@code{vowel-modifying-diacritical-mark}} {Category character}
+@headitem Category name @tab Category character
+@item @code{space-for-indent} @tab space
+@item @code{base} @tab @code{.}
+@item @code{consonant} @tab @code{0}
+@item @code{base-vowel} @tab @code{1}
+@item @code{upper-diacritical-mark} @tab @code{2}
+@item @code{lower-diacritical-mark} @tab @code{3}
+@item @code{tone-mark} @tab @code{4}
+@item @code{symbol} @tab @code{5}
+@item @code{digit} @tab @code{6}
+@item @code{vowel-modifying-diacritical-mark} @tab @code{7}
+@item @code{vowel-sign} @tab @code{8}
+@item @code{semivowel-lower} @tab @code{9}
+@item @code{not-at-end-of-line} @tab @code{<}
+@item @code{not-at-beginning-of-line} @tab @code{>}
+@item @code{alpha-numeric-two-byte} @tab @code{A}
+@item @code{chinese-two-byte} @tab @code{C}
+@item @code{greek-two-byte} @tab @code{G}
+@item @code{japanese-hiragana-two-byte} @tab @code{H}
+@item @code{indian-two-byte} @tab @code{I}
+@item @code{japanese-katakana-two-byte} @tab @code{K}
+@item @code{strong-left-to-right} @tab @code{L}
+@item @code{korean-hangul-two-byte} @tab @code{N}
+@item @code{strong-right-to-left} @tab @code{R}
+@item @code{cyrillic-two-byte} @tab @code{Y}
+@item @code{combining-diacritic} @tab @code{^}
+@item @code{ascii} @tab @code{a}
+@item @code{arabic} @tab @code{b}
+@item @code{chinese} @tab @code{c}
+@item @code{ethiopic} @tab @code{e}
+@item @code{greek} @tab @code{g}
+@item @code{korean} @tab @code{h}
+@item @code{indian} @tab @code{i}
+@item @code{japanese} @tab @code{j}
+@item @code{japanese-katakana} @tab @code{k}
+@item @code{latin} @tab @code{l}
+@item @code{lao} @tab @code{o}
+@item @code{tibetan} @tab @code{q}
+@item @code{japanese-roman} @tab @code{r}
+@item @code{thai} @tab @code{t}
+@item @code{vietnamese} @tab @code{v}
+@item @code{hebrew} @tab @code{w}
+@item @code{cyrillic} @tab @code{y}
+@item @code{can-break} @tab @code{|}
+@end multitable
+
+For more information about currently defined categories, run the
+command @kbd{M-x describe-categories @key{RET}}. For how to define
+new categories, @pxref{Categories}.@*
+Corresponding string regexp: @samp{\c@var{code}}
+@end table
+
+@subsubheading Zero-width assertions
+
+These all match the empty string, but only in specific places.
+
+@table @asis
+@item @code{line-start}, @code{bol}
+@cindex @code{line-start} in rx
+@cindex @code{bol} in rx
+Match at the beginning of a line.@*
+Corresponding string regexp: @samp{^}
+
+@item @code{line-end}, @code{eol}
+@cindex @code{line-end} in rx
+@cindex @code{eol} in rx
+Match at the end of a line.@*
+Corresponding string regexp: @samp{$}
+
+@item @code{string-start}, @code{bos}, @code{buffer-start}, @code{bot}
+@cindex @code{string-start} in rx
+@cindex @code{bos} in rx
+@cindex @code{buffer-start} in rx
+@cindex @code{bot} in rx
+Match at the start of the string or buffer being matched against.@*
+Corresponding string regexp: @samp{\`}
+
+@item @code{string-end}, @code{eos}, @code{buffer-end}, @code{eot}
+@cindex @code{string-end} in rx
+@cindex @code{eos} in rx
+@cindex @code{buffer-end} in rx
+@cindex @code{eot} in rx
+Match at the end of the string or buffer being matched against.@*
+Corresponding string regexp: @samp{\'}
+
+@item @code{point}
+@cindex @code{point} in rx
+Match at point.@*
+Corresponding string regexp: @samp{\=}
+
+@item @code{word-start}
+@cindex @code{word-start} in rx
+Match at the beginning of a word.@*
+Corresponding string regexp: @samp{\<}
+
+@item @code{word-end}
+@cindex @code{word-end} in rx
+Match at the end of a word.@*
+Corresponding string regexp: @samp{\>}
+
+@item @code{word-boundary}
+@cindex @code{word-boundary} in rx
+Match at the beginning or end of a word.@*
+Corresponding string regexp: @samp{\b}
+
+@item @code{not-word-boundary}
+@cindex @code{not-word-boundary} in rx
+Match anywhere but at the beginning or end of a word.@*
+Corresponding string regexp: @samp{\B}
+
+@item @code{symbol-start}
+@cindex @code{symbol-start} in rx
+Match at the beginning of a symbol.@*
+Corresponding string regexp: @samp{\_<}
+
+@item @code{symbol-end}
+@cindex @code{symbol-end} in rx
+Match at the end of a symbol.@*
+Corresponding string regexp: @samp{\_>}
+@end table
+
+@subsubheading Capture groups
+
+@table @code
+@item (group @var{rx}@dots{})
+@cindex @code{group} in rx
+@itemx (submatch @var{rx}@dots{})
+@cindex @code{submatch} in rx
+Match the @var{rx}s, making the matched text and position accessible
+in the match data. The first group in a regexp is numbered 1;
+subsequent groups will be numbered one higher than the previous
+group.@*
+Corresponding string regexp: @samp{\(@dots{}\)}
+
+@item (group-n @var{n} @var{rx}@dots{})
+@cindex @code{group-n} in rx
+@itemx (submatch-n @var{n} @var{rx}@dots{})
+@cindex @code{submatch-n} in rx
+Like @code{group}, but explicitly assign the group number @var{n}.
+@var{n} must be positive.@*
+Corresponding string regexp: @samp{\(?@var{n}:@dots{}\)}
+
+@item (backref @var{n})
+@cindex @code{backref} in rx
+Match the text previously matched by group number @var{n}.
+@var{n} must be in the range 1--9.@*
+Corresponding string regexp: @samp{\@var{n}}
+@end table
+
+@subsubheading Dynamic inclusion
+
+@table @code
+@item (literal @var{expr})
+@cindex @code{literal} in rx
+Match the literal string that is the result from evaluating the Lisp
+expression @var{expr}. The evaluation takes place at call time, in
+the current lexical environment.
+
+@item (regexp @var{expr})
+@cindex @code{regexp} in rx
+@itemx (regex @var{expr})
+@cindex @code{regex} in rx
+Match the string regexp that is the result from evaluating the Lisp
+expression @var{expr}. The evaluation takes place at call time, in
+the current lexical environment.
+
+@item (eval @var{expr})
+@cindex @code{eval} in rx
+Match the rx form that is the result from evaluating the Lisp
+expression @var{expr}. The evaluation takes place at macro-expansion
+time for @code{rx}, at call time for @code{rx-to-string},
+in the current global environment.
+@end table
+
+@node Rx Functions
+@subsubsection Functions and macros using @code{rx} regexps
+
+@defmac rx rx-expr@dots{}
+Translate the @var{rx-expr}s to a string regexp, as if they were the
+body of a @code{(seq @dots{})} form. The @code{rx} macro expands to a
+string constant, or, if @code{literal} or @code{regexp} forms are
+used, a Lisp expression that evaluates to a string.
+@end defmac
+
+@defun rx-to-string rx-expr &optional no-group
+Translate @var{rx-expr} to a string regexp which is returned.
+If @var{no-group} is absent or @code{nil}, bracket the result in a
+non-capturing group, @samp{\(?:@dots{}\)}, if necessary to ensure that
+a postfix operator appended to it will apply to the whole expression.
+
+Arguments to @code{literal} and @code{regexp} forms in @var{rx-expr}
+must be string literals.
+@end defun
+
+The @code{pcase} macro can use @code{rx} expressions as patterns
+directly; @pxref{rx in pcase}.
+@end ifnottex
+
@node Regexp Functions
@subsection Regular Expression Functions
--
2.20.1 (Apple Git-117)
[-- Attachment #3: 0002-Shorter-rx-doc-string-bug-36496.patch --]
[-- Type: application/octet-stream, Size: 16907 bytes --]
From 8cf5f5583a5c042ea856155eb9a78b21fc38310f Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mattias=20Engdeg=C3=A5rd?= <mattiase@acm.org>
Date: Sat, 6 Jul 2019 13:22:15 +0200
Subject: [PATCH 2/2] Shorter `rx' doc string (bug#36496)
* lisp/emacs-lisp/rx.el (rx): Replace long description with a condensed
summary of the rx syntax, with reference to the manual section.
---
lisp/emacs-lisp/rx.el | 417 ++++++++++--------------------------------
1 file changed, 96 insertions(+), 321 deletions(-)
diff --git a/lisp/emacs-lisp/rx.el b/lisp/emacs-lisp/rx.el
index 24dd6cbf1d..8fccf9c470 100644
--- a/lisp/emacs-lisp/rx.el
+++ b/lisp/emacs-lisp/rx.el
@@ -959,327 +959,102 @@ rx-to-string
;;;###autoload
(defmacro rx (&rest regexps)
"Translate regular expressions REGEXPS in sexp form to a regexp string.
-REGEXPS is a non-empty sequence of forms of the sort listed below.
-
-Note that `rx' is a Lisp macro; when used in a Lisp program being
-compiled, the translation is performed by the compiler. The
-`literal' and `regexp' forms accept subforms that will evaluate
-to strings, in addition to constant strings. If REGEXPS include
-such forms, then the result is an expression which returns a
-regexp string, rather than a regexp string directly. See
-`rx-to-string' for performing translation completely at run time.
-
-The following are valid subforms of regular expressions in sexp
-notation.
-
-STRING
- matches string STRING literally.
-
-CHAR
- matches character CHAR literally.
-
-`not-newline', `nonl'
- matches any character except a newline.
-
-`anything'
- matches any character
-
-`(any SET ...)'
-`(in SET ...)'
-`(char SET ...)'
- matches any character in SET .... SET may be a character or string.
- Ranges of characters can be specified as `A-Z' in strings.
- Ranges may also be specified as conses like `(?A . ?Z)'.
- Reversed ranges like `Z-A' and `(?Z . ?A)' are not permitted.
-
- SET may also be the name of a character class: `digit',
- `control', `hex-digit', `blank', `graph', `print', `alnum',
- `alpha', `ascii', `nonascii', `lower', `punct', `space', `upper',
- `word', or one of their synonyms.
-
-`(not (any SET ...))'
- matches any character not in SET ...
-
-`line-start', `bol'
- matches the empty string, but only at the beginning of a line
- in the text being matched
-
-`line-end', `eol'
- is similar to `line-start' but matches only at the end of a line
-
-`string-start', `bos', `bot'
- matches the empty string, but only at the beginning of the
- string being matched against.
-
-`string-end', `eos', `eot'
- matches the empty string, but only at the end of the
- string being matched against.
-
-`buffer-start'
- matches the empty string, but only at the beginning of the
- buffer being matched against. Actually equivalent to `string-start'.
-
-`buffer-end'
- matches the empty string, but only at the end of the
- buffer being matched against. Actually equivalent to `string-end'.
-
-`point'
- matches the empty string, but only at point.
-
-`word-start', `bow'
- matches the empty string, but only at the beginning of a word.
-
-`word-end', `eow'
- matches the empty string, but only at the end of a word.
-
-`word-boundary'
- matches the empty string, but only at the beginning or end of a
- word.
-
-`(not word-boundary)'
-`not-word-boundary'
- matches the empty string, but not at the beginning or end of a
- word.
-
-`symbol-start'
- matches the empty string, but only at the beginning of a symbol.
-
-`symbol-end'
- matches the empty string, but only at the end of a symbol.
-
-`digit', `numeric', `num'
- matches 0 through 9.
-
-`control', `cntrl'
- matches any character whose code is in the range 0-31.
-
-`hex-digit', `hex', `xdigit'
- matches 0 through 9, a through f and A through F.
-
-`blank'
- matches horizontal whitespace, as defined by Annex C of the
- Unicode Technical Standard #18. In particular, it matches
- spaces, tabs, and other characters whose Unicode
- `general-category' property indicates they are spacing
- separators.
-
-`graphic', `graph'
- matches graphic characters--everything except whitespace, ASCII
- and non-ASCII control characters, surrogates, and codepoints
- unassigned by Unicode.
-
-`printing', `print'
- matches whitespace and graphic characters.
-
-`alphanumeric', `alnum'
- matches alphabetic characters and digits. For multibyte characters,
- it matches characters whose Unicode `general-category' property
- indicates they are alphabetic or decimal number characters.
-
-`letter', `alphabetic', `alpha'
- matches alphabetic characters. For multibyte characters,
- it matches characters whose Unicode `general-category' property
- indicates they are alphabetic characters.
-
-`ascii'
- matches ASCII (unibyte) characters.
-
-`nonascii'
- matches non-ASCII (multibyte) characters.
-
-`lower', `lower-case'
- matches anything lower-case, as determined by the current case
- table. If `case-fold-search' is non-nil, this also matches any
- upper-case letter.
-
-`upper', `upper-case'
- matches anything upper-case, as determined by the current case
- table. If `case-fold-search' is non-nil, this also matches any
- lower-case letter.
-
-`punctuation', `punct'
- matches punctuation. (But at present, for multibyte characters,
- it matches anything that has non-word syntax.)
-
-`space', `whitespace', `white'
- matches anything that has whitespace syntax.
-
-`word', `wordchar'
- matches anything that has word syntax.
-
-`not-wordchar'
- matches anything that has non-word syntax.
-
-`(syntax SYNTAX)'
- matches a character with syntax SYNTAX. SYNTAX must be one
- of the following symbols, or a symbol corresponding to the syntax
- character, e.g. `\\.' for `\\s.'.
-
- `whitespace' (\\s- in string notation)
- `punctuation' (\\s.)
- `word' (\\sw)
- `symbol' (\\s_)
- `open-parenthesis' (\\s()
- `close-parenthesis' (\\s))
- `expression-prefix' (\\s')
- `string-quote' (\\s\")
- `paired-delimiter' (\\s$)
- `escape' (\\s\\)
- `character-quote' (\\s/)
- `comment-start' (\\s<)
- `comment-end' (\\s>)
- `string-delimiter' (\\s|)
- `comment-delimiter' (\\s!)
-
-`(not (syntax SYNTAX))'
- matches a character that doesn't have syntax SYNTAX.
-
-`(category CATEGORY)'
- matches a character with category CATEGORY. CATEGORY must be
- either a character to use for C, or one of the following symbols.
-
- `space-for-indent' (\\c\\s in string notation)
- `base' (\\c.)
- `consonant' (\\c0)
- `base-vowel' (\\c1)
- `upper-diacritical-mark' (\\c2)
- `lower-diacritical-mark' (\\c3)
- `tone-mark' (\\c4)
- `symbol' (\\c5)
- `digit' (\\c6)
- `vowel-modifying-diacritical-mark' (\\c7)
- `vowel-sign' (\\c8)
- `semivowel-lower' (\\c9)
- `not-at-end-of-line' (\\c<)
- `not-at-beginning-of-line' (\\c>)
- `alpha-numeric-two-byte' (\\cA)
- `chinese-two-byte' (\\cC)
- `greek-two-byte' (\\cG)
- `japanese-hiragana-two-byte' (\\cH)
- `indian-two-byte' (\\cI)
- `japanese-katakana-two-byte' (\\cK)
- `strong-left-to-right' (\\cL)
- `korean-hangul-two-byte' (\\cN)
- `strong-right-to-left' (\\cR)
- `cyrillic-two-byte' (\\cY)
- `combining-diacritic' (\\c^)
- `ascii' (\\ca)
- `arabic' (\\cb)
- `chinese' (\\cc)
- `ethiopic' (\\ce)
- `greek' (\\cg)
- `korean' (\\ch)
- `indian' (\\ci)
- `japanese' (\\cj)
- `japanese-katakana' (\\ck)
- `latin' (\\cl)
- `lao' (\\co)
- `tibetan' (\\cq)
- `japanese-roman' (\\cr)
- `thai' (\\ct)
- `vietnamese' (\\cv)
- `hebrew' (\\cw)
- `cyrillic' (\\cy)
- `can-break' (\\c|)
-
-`(not (category CATEGORY))'
- matches a character that doesn't have category CATEGORY.
-
-`(and SEXP1 SEXP2 ...)'
-`(: SEXP1 SEXP2 ...)'
-`(seq SEXP1 SEXP2 ...)'
-`(sequence SEXP1 SEXP2 ...)'
- matches what SEXP1 matches, followed by what SEXP2 matches, etc.
- Without arguments, matches the empty string.
-
-`(submatch SEXP1 SEXP2 ...)'
-`(group SEXP1 SEXP2 ...)'
- like `and', but makes the match accessible with `match-end',
- `match-beginning', and `match-string'.
-
-`(submatch-n N SEXP1 SEXP2 ...)'
-`(group-n N SEXP1 SEXP2 ...)'
- like `group', but make it an explicitly-numbered group with
- group number N.
-
-`(or SEXP1 SEXP2 ...)'
-`(| SEXP1 SEXP2 ...)'
- matches anything that matches SEXP1 or SEXP2, etc. If all
- args are strings, use `regexp-opt' to optimize the resulting
- regular expression. Without arguments, never matches anything.
-
-`(minimal-match SEXP)'
- produce a non-greedy regexp for SEXP. Normally, regexps matching
- zero or more occurrences of something are \"greedy\" in that they
- match as much as they can, as long as the overall regexp can
- still match. A non-greedy regexp matches as little as possible.
-
-`(maximal-match SEXP)'
- produce a greedy regexp for SEXP. This is the default.
-
-Below, `SEXP ...' represents a sequence of regexp forms, treated as if
-enclosed in `(and ...)'.
-
-`(zero-or-more SEXP ...)'
-`(0+ SEXP ...)'
- matches zero or more occurrences of what SEXP ... matches.
-
-`(* SEXP ...)'
- like `zero-or-more', but always produces a greedy regexp, independent
- of `rx-greedy-flag'.
-
-`(*? SEXP ...)'
- like `zero-or-more', but always produces a non-greedy regexp,
- independent of `rx-greedy-flag'.
-
-`(one-or-more SEXP ...)'
-`(1+ SEXP ...)'
- matches one or more occurrences of SEXP ...
-
-`(+ SEXP ...)'
- like `one-or-more', but always produces a greedy regexp.
-
-`(+? SEXP ...)'
- like `one-or-more', but always produces a non-greedy regexp.
-
-`(zero-or-one SEXP ...)'
-`(optional SEXP ...)'
-`(opt SEXP ...)'
- matches zero or one occurrences of A.
-
-`(? SEXP ...)'
- like `zero-or-one', but always produces a greedy regexp.
-
-`(?? SEXP ...)'
- like `zero-or-one', but always produces a non-greedy regexp.
-
-`(repeat N SEXP)'
-`(= N SEXP ...)'
- matches N occurrences.
-
-`(>= N SEXP ...)'
- matches N or more occurrences.
-
-`(repeat N M SEXP)'
-`(** N M SEXP ...)'
- matches N to M occurrences.
-
-`(backref N)'
- matches what was matched previously by submatch N.
-
-`(literal STRING-EXPR)'
- matches STRING-EXPR literally, where STRING-EXPR is any lisp
- expression that evaluates to a string.
-
-`(regexp REGEXP-EXPR)'
- include REGEXP-EXPR in string notation in the result, where
- REGEXP-EXPR is any lisp expression that evaluates to a
- string containing a valid regexp.
-
-`(eval FORM)'
- evaluate FORM and insert result. If result is a string,
- `regexp-quote' it. Note that FORM is evaluated during
- macroexpansion."
+Each argument is one of the forms below; RX is a subform, and RX... stands
+for one or more RXs. For details, see Info node `(elisp) Rx Notation'.
+See `rx-to-string' for the corresponding function.
+
+STRING Match a literal string.
+CHAR Match a literal character.
+
+(seq RX...) Match the RXs in sequence. Alias: :, sequence, and
+(or RX...) Match one of the RXs. Alias: |
+
+(zero-or-more RX...) Match RXs zero or more times. Alias: 0+
+(one-or-more RX...) Match RXs one or more times. Alias: 1+
+(zero-or-one RX...) Match RXs or the empty string. Alias: opt, optional
+(* RX...) Match RXs zero or more times; greedy.
+(+ RX...) Match RXs one or more times; greedy.
+(? RX...) Match RXs or the empty string; greedy.
+(*? RX...) Match RXs zero or more times; non-greedy.
+(+? RX...) Match RXs one or more times; non-greedy.
+(?? RX...) Match RXs or the empty string; non-greedy.
+(= N RX...) Match RXs exactly N times.
+(>= N RX...) Match RXs N or more times.
+(** N M RX...) Match RXs N to M times. Alias: repeat
+(minimal-match RX) Match RX, with zero-or-more, one-or-more, zero-or-one
+ and aliases using non-greedy matching.
+(maximal-match RX) Match RX, with zero-or-more, one-or-more, zero-or-one
+ and aliases using greedy matching, which is the default.
+
+(any SET...) Match a character from one of the SETs. Each SET is a
+ character, a string, a range as string \"A-Z\" or cons
+ (?A . ?Z), or a character class (see below). Alias: in, char
+(not CHARSPEC) Match one character not matched by CHARSPEC. CHARSPEC
+ can be (any ...), (syntax ...), (category ...),
+ or a character class.
+not-newline Match any character except a newline. Alias: nonl
+anything Match any character.
+
+CHARCLASS Match a character from a character class. One of:
+ alpha, alphabetic, letter Alphabetic characters (defined by Unicode).
+ alnum, alphanumeric Alphabetic or decimal digit chars (Unicode).
+ digit, numeric, num 0-9.
+ xdigit, hex-digit, hex 0-9, A-F, a-f.
+ cntrl, control ASCII codes 0-31.
+ blank Horizontal whitespace (Unicode).
+ space, whitespace, white Chars with whitespace syntax.
+ lower, lower-case Lower-case chars, from current case table.
+ upper, upper-case Upper-case chars, from current case table.
+ graph, graphic Graphic characters (Unicode).
+ print, printing Whitespace or graphic (Unicode).
+ punct, punctuation Not control, space, letter or digit (ASCII);
+ not word syntax (non-ASCII).
+ word, wordchar Characters with word syntax.
+ ascii ASCII characters (codes 0-127).
+ nonascii Non-ASCII characters (but not raw bytes).
+
+(syntax SYNTAX) Match a character with syntax SYNTAX, being one of:
+ whitespace, punctuation, word, symbol, open-parenthesis,
+ close-parenthesis, expression-prefix, string-quote,
+ paired-delimiter, escape, character-quote, comment-start,
+ comment-end, string-delimiter, comment-delimiter
+
+(category CAT) Match a character in category CAT, being one of:
+ space-for-indent, base, consonant, base-vowel,
+ upper-diacritical-mark, lower-diacritical-mark, tone-mark, symbol,
+ digit, vowel-modifying-diacritical-mark, vowel-sign,
+ semivowel-lower, not-at-end-of-line, not-at-beginning-of-line,
+ alpha-numeric-two-byte, chinese-two-byte, greek-two-byte,
+ japanese-hiragana-two-byte, indian-two-byte,
+ japanese-katakana-two-byte, strong-left-to-right,
+ korean-hangul-two-byte, strong-right-to-left, cyrillic-two-byte,
+ combining-diacritic, ascii, arabic, chinese, ethiopic, greek,
+ korean, indian, japanese, japanese-katakana, latin, lao,
+ tibetan, japanese-roman, thai, vietnamese, hebrew, cyrillic,
+ can-break
+
+Zero-width assertions: these all match the empty string in specific places.
+ line-start At the beginning of a line. Alias: bol
+ line-end At the end of a line. Alias: eol
+ string-start At the start of the string or buffer.
+ Alias: buffer-start, bos, bot
+ string-end At the end of the string or buffer.
+ Alias: buffer-end, eos, eot
+ point At point.
+ word-start At the beginning of a word.
+ word-end At the end of a word.
+ word-boundary At the beginning or end of a word.
+ not-word-boundary Not at the beginning or end of a word.
+ symbol-start At the beginning of a symbol.
+ symbol-end At the end of a symbol.
+
+(group RX...) Match RXs and define a capture group. Alias: submatch
+(group-n N RX...) Match RXs and define capture group N. Alias: submatch-n
+(backref N) Match the text that capture group N matched.
+
+(literal EXPR) Match the literal string from evaluating EXPR at run time.
+(regexp EXPR) Match the string regexp from evaluating EXPR at run time.
+(eval EXPR) Match the rx sexp from evaluating EXPR at compile time."
(let* ((rx--compile-to-lisp t)
(re (cond ((null regexps)
(error "No regexp"))
--
2.20.1 (Apple Git-117)
* bug#36496: [PATCH] Describe the rx notation in the lisp manual
2019-07-06 18:56 ` Mattias Engdegård
@ 2019-07-06 19:10 ` Eli Zaretskii
2019-07-06 19:45 ` Mattias Engdegård
2019-07-06 19:12 ` Noam Postavsky
1 sibling, 1 reply; 26+ messages in thread
From: Eli Zaretskii @ 2019-07-06 19:10 UTC (permalink / raw)
To: Mattias Engdegård; +Cc: npostavs, 36496
> From: Mattias Engdegård <mattiase@acm.org>
> Date: Sat, 6 Jul 2019 20:56:57 +0200
> Cc: 36496@debbugs.gnu.org
>
> >> * lisp/emacs-lisp/rx.el (rx): Replace long description with a condensed
> >> summary of the rx syntax, with reference to the manual section.
> >
> > This is OK, but it is inconsistent wrt whether each construct's
> > description ends in a period. I suggest to end them all with a
> > period.
>
> Added, except at the end of the lists of aliases which looked better with a minimum of punctuation (and weren't sentences to begin with).
It still looks jarring:
+(seq RX...) Match the RXs in sequence. Alias: :, sequence, and
+(or RX...) Match one of the RXs. Alias: |
+
+(zero-or-more RX...) Match RXs zero or more times. Alias: 0+
+(one-or-more RX...) Match RXs one or more times. Alias: 1+
+(zero-or-one RX...) Match RXs or the empty string. Alias: opt, optional
Honestly, they look like incorrect English: a sentence, starting with
a capital letter, but not ending with a period. I hope you will
reconsider.
* bug#36496: [PATCH] Describe the rx notation in the lisp manual
2019-07-06 18:56 ` Mattias Engdegård
2019-07-06 19:10 ` Eli Zaretskii
@ 2019-07-06 19:12 ` Noam Postavsky
1 sibling, 0 replies; 26+ messages in thread
From: Noam Postavsky @ 2019-07-06 19:12 UTC (permalink / raw)
To: Mattias Engdegård; +Cc: 36496
Mattias Engdegård <mattiase@acm.org> writes:
> Ah, you called out my little white lie. They are synonyms in practice,
> because almost nobody uses minimal-match, probably for good
> reasons. (xr used to generate {minimal|maximal}-match, but it was
> decidedly less readable so it got changed.)
>
> Yet you are right in the sense that the documentation should not lie
> or wilfully obscure the workings. There appears to be no good
> solution, because the underlying design isn't very good. It might be
> different if minimal-match affected the entire expression inside,
> including (or ...) and (** ...), but that will have to wait for the
> next big engine.
>
> The new patch versions describe the semantics more objectively, while
> still recommending the user to stay clear of minimal-match. Good
> enough?
> +(zero-or-more RX...) Match RXs zero or more times. Alias: 0+
> +(one-or-more RX...) Match RXs one or more times. Alias: 1+
> +(zero-or-one RX...) Match RXs or the empty string. Alias: opt, optional
> +(* RX...) Match RXs zero or more times; greedy.
> +(+ RX...) Match RXs one or more times; greedy.
> +(? RX...) Match RXs or the empty string; greedy.
Yep, that looks fine.
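For readers following along, the semantics described in the quoted summary can be checked with a short sketch. It assumes the rx behaviour stated in the doc string above: the long-named operators follow the enclosing `minimal-match'/`maximal-match', while the symbolic ones have fixed greediness. The comments show the string each form is expected to produce.

```elisp
(require 'rx)

;; Long-named operators are greedy by default...
(rx (zero-or-more "a"))                   ; "a*"
;; ...and become non-greedy inside `minimal-match'.
(rx (minimal-match (zero-or-more "a")))   ; "a*?"

;; The symbolic operators ignore the enclosing setting.
(rx (minimal-match (* "a")))              ; "a*"  (always greedy)
(rx (*? "a"))                             ; "a*?" (always non-greedy)

;; Bounded repetition has no non-greedy form, so `minimal-match'
;; leaves it unchanged, as noted in the quoted message.
(rx (minimal-match (** 1 2 "a")))         ; "a\\{1,2\\}"
```

This is why writing `*?', `+?' and `??' directly is usually clearer than wrapping a subexpression in `minimal-match'.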
* bug#36496: [PATCH] Describe the rx notation in the lisp manual
2019-07-06 19:10 ` Eli Zaretskii
@ 2019-07-06 19:45 ` Mattias Engdegård
2019-07-07 2:29 ` Eli Zaretskii
0 siblings, 1 reply; 26+ messages in thread
From: Mattias Engdegård @ 2019-07-06 19:45 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: Noam Postavsky, 36496
[-- Attachment #1: Type: text/plain, Size: 245 bytes --]
On 6 July 2019 at 21:10, Eli Zaretskii <eliz@gnu.org> wrote:
>
> Honestly, they look like incorrect English: a sentence, starting with
> a capital letter, but not ending with a period. I hope you will
> reconsider.
Very well, full stops added.
[-- Attachment #2: 0002-Shorter-rx-doc-string-bug-36496.patch --]
[-- Type: application/octet-stream, Size: 16921 bytes --]
From 584c325f1488df5c25b69c84222034f0d9a74e9e Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mattias=20Engdeg=C3=A5rd?= <mattiase@acm.org>
Date: Sat, 6 Jul 2019 13:22:15 +0200
Subject: [PATCH 2/2] Shorter `rx' doc string (bug#36496)
* lisp/emacs-lisp/rx.el (rx): Replace long description with a condensed
summary of the rx syntax, with reference to the manual section.
---
lisp/emacs-lisp/rx.el | 417 ++++++++++--------------------------------
1 file changed, 96 insertions(+), 321 deletions(-)
diff --git a/lisp/emacs-lisp/rx.el b/lisp/emacs-lisp/rx.el
index 24dd6cbf1d..249529e54e 100644
--- a/lisp/emacs-lisp/rx.el
+++ b/lisp/emacs-lisp/rx.el
@@ -959,327 +959,102 @@ rx-to-string
;;;###autoload
(defmacro rx (&rest regexps)
"Translate regular expressions REGEXPS in sexp form to a regexp string.
-REGEXPS is a non-empty sequence of forms of the sort listed below.
-
-Note that `rx' is a Lisp macro; when used in a Lisp program being
-compiled, the translation is performed by the compiler. The
-`literal' and `regexp' forms accept subforms that will evaluate
-to strings, in addition to constant strings. If REGEXPS include
-such forms, then the result is an expression which returns a
-regexp string, rather than a regexp string directly. See
-`rx-to-string' for performing translation completely at run time.
-
-The following are valid subforms of regular expressions in sexp
-notation.
-
-STRING
- matches string STRING literally.
-
-CHAR
- matches character CHAR literally.
-
-`not-newline', `nonl'
- matches any character except a newline.
-
-`anything'
- matches any character
-
-`(any SET ...)'
-`(in SET ...)'
-`(char SET ...)'
- matches any character in SET .... SET may be a character or string.
- Ranges of characters can be specified as `A-Z' in strings.
- Ranges may also be specified as conses like `(?A . ?Z)'.
- Reversed ranges like `Z-A' and `(?Z . ?A)' are not permitted.
-
- SET may also be the name of a character class: `digit',
- `control', `hex-digit', `blank', `graph', `print', `alnum',
- `alpha', `ascii', `nonascii', `lower', `punct', `space', `upper',
- `word', or one of their synonyms.
-
-`(not (any SET ...))'
- matches any character not in SET ...
-
-`line-start', `bol'
- matches the empty string, but only at the beginning of a line
- in the text being matched
-
-`line-end', `eol'
- is similar to `line-start' but matches only at the end of a line
-
-`string-start', `bos', `bot'
- matches the empty string, but only at the beginning of the
- string being matched against.
-
-`string-end', `eos', `eot'
- matches the empty string, but only at the end of the
- string being matched against.
-
-`buffer-start'
- matches the empty string, but only at the beginning of the
- buffer being matched against. Actually equivalent to `string-start'.
-
-`buffer-end'
- matches the empty string, but only at the end of the
- buffer being matched against. Actually equivalent to `string-end'.
-
-`point'
- matches the empty string, but only at point.
-
-`word-start', `bow'
- matches the empty string, but only at the beginning of a word.
-
-`word-end', `eow'
- matches the empty string, but only at the end of a word.
-
-`word-boundary'
- matches the empty string, but only at the beginning or end of a
- word.
-
-`(not word-boundary)'
-`not-word-boundary'
- matches the empty string, but not at the beginning or end of a
- word.
-
-`symbol-start'
- matches the empty string, but only at the beginning of a symbol.
-
-`symbol-end'
- matches the empty string, but only at the end of a symbol.
-
-`digit', `numeric', `num'
- matches 0 through 9.
-
-`control', `cntrl'
- matches any character whose code is in the range 0-31.
-
-`hex-digit', `hex', `xdigit'
- matches 0 through 9, a through f and A through F.
-
-`blank'
- matches horizontal whitespace, as defined by Annex C of the
- Unicode Technical Standard #18. In particular, it matches
- spaces, tabs, and other characters whose Unicode
- `general-category' property indicates they are spacing
- separators.
-
-`graphic', `graph'
- matches graphic characters--everything except whitespace, ASCII
- and non-ASCII control characters, surrogates, and codepoints
- unassigned by Unicode.
-
-`printing', `print'
- matches whitespace and graphic characters.
-
-`alphanumeric', `alnum'
- matches alphabetic characters and digits. For multibyte characters,
- it matches characters whose Unicode `general-category' property
- indicates they are alphabetic or decimal number characters.
-
-`letter', `alphabetic', `alpha'
- matches alphabetic characters. For multibyte characters,
- it matches characters whose Unicode `general-category' property
- indicates they are alphabetic characters.
-
-`ascii'
- matches ASCII (unibyte) characters.
-
-`nonascii'
- matches non-ASCII (multibyte) characters.
-
-`lower', `lower-case'
- matches anything lower-case, as determined by the current case
- table. If `case-fold-search' is non-nil, this also matches any
- upper-case letter.
-
-`upper', `upper-case'
- matches anything upper-case, as determined by the current case
- table. If `case-fold-search' is non-nil, this also matches any
- lower-case letter.
-
-`punctuation', `punct'
- matches punctuation. (But at present, for multibyte characters,
- it matches anything that has non-word syntax.)
-
-`space', `whitespace', `white'
- matches anything that has whitespace syntax.
-
-`word', `wordchar'
- matches anything that has word syntax.
-
-`not-wordchar'
- matches anything that has non-word syntax.
-
-`(syntax SYNTAX)'
- matches a character with syntax SYNTAX. SYNTAX must be one
- of the following symbols, or a symbol corresponding to the syntax
- character, e.g. `\\.' for `\\s.'.
-
- `whitespace' (\\s- in string notation)
- `punctuation' (\\s.)
- `word' (\\sw)
- `symbol' (\\s_)
- `open-parenthesis' (\\s()
- `close-parenthesis' (\\s))
- `expression-prefix' (\\s')
- `string-quote' (\\s\")
- `paired-delimiter' (\\s$)
- `escape' (\\s\\)
- `character-quote' (\\s/)
- `comment-start' (\\s<)
- `comment-end' (\\s>)
- `string-delimiter' (\\s|)
- `comment-delimiter' (\\s!)
-
-`(not (syntax SYNTAX))'
- matches a character that doesn't have syntax SYNTAX.
-
-`(category CATEGORY)'
- matches a character with category CATEGORY. CATEGORY must be
- either a character to use for C, or one of the following symbols.
-
- `space-for-indent' (\\c\\s in string notation)
- `base' (\\c.)
- `consonant' (\\c0)
- `base-vowel' (\\c1)
- `upper-diacritical-mark' (\\c2)
- `lower-diacritical-mark' (\\c3)
- `tone-mark' (\\c4)
- `symbol' (\\c5)
- `digit' (\\c6)
- `vowel-modifying-diacritical-mark' (\\c7)
- `vowel-sign' (\\c8)
- `semivowel-lower' (\\c9)
- `not-at-end-of-line' (\\c<)
- `not-at-beginning-of-line' (\\c>)
- `alpha-numeric-two-byte' (\\cA)
- `chinese-two-byte' (\\cC)
- `greek-two-byte' (\\cG)
- `japanese-hiragana-two-byte' (\\cH)
- `indian-two-byte' (\\cI)
- `japanese-katakana-two-byte' (\\cK)
- `strong-left-to-right' (\\cL)
- `korean-hangul-two-byte' (\\cN)
- `strong-right-to-left' (\\cR)
- `cyrillic-two-byte' (\\cY)
- `combining-diacritic' (\\c^)
- `ascii' (\\ca)
- `arabic' (\\cb)
- `chinese' (\\cc)
- `ethiopic' (\\ce)
- `greek' (\\cg)
- `korean' (\\ch)
- `indian' (\\ci)
- `japanese' (\\cj)
- `japanese-katakana' (\\ck)
- `latin' (\\cl)
- `lao' (\\co)
- `tibetan' (\\cq)
- `japanese-roman' (\\cr)
- `thai' (\\ct)
- `vietnamese' (\\cv)
- `hebrew' (\\cw)
- `cyrillic' (\\cy)
- `can-break' (\\c|)
-
-`(not (category CATEGORY))'
- matches a character that doesn't have category CATEGORY.
-
-`(and SEXP1 SEXP2 ...)'
-`(: SEXP1 SEXP2 ...)'
-`(seq SEXP1 SEXP2 ...)'
-`(sequence SEXP1 SEXP2 ...)'
- matches what SEXP1 matches, followed by what SEXP2 matches, etc.
- Without arguments, matches the empty string.
-
-`(submatch SEXP1 SEXP2 ...)'
-`(group SEXP1 SEXP2 ...)'
- like `and', but makes the match accessible with `match-end',
- `match-beginning', and `match-string'.
-
-`(submatch-n N SEXP1 SEXP2 ...)'
-`(group-n N SEXP1 SEXP2 ...)'
- like `group', but make it an explicitly-numbered group with
- group number N.
-
-`(or SEXP1 SEXP2 ...)'
-`(| SEXP1 SEXP2 ...)'
- matches anything that matches SEXP1 or SEXP2, etc. If all
- args are strings, use `regexp-opt' to optimize the resulting
- regular expression. Without arguments, never matches anything.
-
-`(minimal-match SEXP)'
- produce a non-greedy regexp for SEXP. Normally, regexps matching
- zero or more occurrences of something are \"greedy\" in that they
- match as much as they can, as long as the overall regexp can
- still match. A non-greedy regexp matches as little as possible.
-
-`(maximal-match SEXP)'
- produce a greedy regexp for SEXP. This is the default.
-
-Below, `SEXP ...' represents a sequence of regexp forms, treated as if
-enclosed in `(and ...)'.
-
-`(zero-or-more SEXP ...)'
-`(0+ SEXP ...)'
- matches zero or more occurrences of what SEXP ... matches.
-
-`(* SEXP ...)'
- like `zero-or-more', but always produces a greedy regexp, independent
- of `rx-greedy-flag'.
-
-`(*? SEXP ...)'
- like `zero-or-more', but always produces a non-greedy regexp,
- independent of `rx-greedy-flag'.
-
-`(one-or-more SEXP ...)'
-`(1+ SEXP ...)'
- matches one or more occurrences of SEXP ...
-
-`(+ SEXP ...)'
- like `one-or-more', but always produces a greedy regexp.
-
-`(+? SEXP ...)'
- like `one-or-more', but always produces a non-greedy regexp.
-
-`(zero-or-one SEXP ...)'
-`(optional SEXP ...)'
-`(opt SEXP ...)'
- matches zero or one occurrences of A.
-
-`(? SEXP ...)'
- like `zero-or-one', but always produces a greedy regexp.
-
-`(?? SEXP ...)'
- like `zero-or-one', but always produces a non-greedy regexp.
-
-`(repeat N SEXP)'
-`(= N SEXP ...)'
- matches N occurrences.
-
-`(>= N SEXP ...)'
- matches N or more occurrences.
-
-`(repeat N M SEXP)'
-`(** N M SEXP ...)'
- matches N to M occurrences.
-
-`(backref N)'
- matches what was matched previously by submatch N.
-
-`(literal STRING-EXPR)'
- matches STRING-EXPR literally, where STRING-EXPR is any lisp
- expression that evaluates to a string.
-
-`(regexp REGEXP-EXPR)'
- include REGEXP-EXPR in string notation in the result, where
- REGEXP-EXPR is any lisp expression that evaluates to a
- string containing a valid regexp.
-
-`(eval FORM)'
- evaluate FORM and insert result. If result is a string,
- `regexp-quote' it. Note that FORM is evaluated during
- macroexpansion."
+Each argument is one of the forms below; RX is a subform, and RX... stands
+for one or more RXs. For details, see Info node `(elisp) Rx Notation'.
+See `rx-to-string' for the corresponding function.
+
+STRING Match a literal string.
+CHAR Match a literal character.
+
+(seq RX...) Match the RXs in sequence. Alias: :, sequence, and.
+(or RX...) Match one of the RXs. Alias: |.
+
+(zero-or-more RX...) Match RXs zero or more times. Alias: 0+.
+(one-or-more RX...) Match RXs one or more times. Alias: 1+.
+(zero-or-one RX...) Match RXs or the empty string. Alias: opt, optional.
+(* RX...) Match RXs zero or more times; greedy.
+(+ RX...) Match RXs one or more times; greedy.
+(? RX...) Match RXs or the empty string; greedy.
+(*? RX...) Match RXs zero or more times; non-greedy.
+(+? RX...) Match RXs one or more times; non-greedy.
+(?? RX...) Match RXs or the empty string; non-greedy.
+(= N RX...) Match RXs exactly N times.
+(>= N RX...) Match RXs N or more times.
+(** N M RX...) Match RXs N to M times. Alias: repeat.
+(minimal-match RX) Match RX, with zero-or-more, one-or-more, zero-or-one
+ and aliases using non-greedy matching.
+(maximal-match RX) Match RX, with zero-or-more, one-or-more, zero-or-one
+ and aliases using greedy matching, which is the default.
+
+(any SET...) Match a character from one of the SETs. Each SET is a
+ character, a string, a range as string \"A-Z\" or cons
+ (?A . ?Z), or a character class (see below). Alias: in, char.
+(not CHARSPEC) Match one character not matched by CHARSPEC. CHARSPEC
+ can be (any ...), (syntax ...), (category ...),
+ or a character class.
+not-newline Match any character except a newline. Alias: nonl.
+anything Match any character.
+
+CHARCLASS Match a character from a character class. One of:
+ alpha, alphabetic, letter Alphabetic characters (defined by Unicode).
+ alnum, alphanumeric Alphabetic or decimal digit chars (Unicode).
+ digit, numeric, num 0-9.
+ xdigit, hex-digit, hex 0-9, A-F, a-f.
+ cntrl, control ASCII codes 0-31.
+ blank Horizontal whitespace (Unicode).
+ space, whitespace, white Chars with whitespace syntax.
+ lower, lower-case Lower-case chars, from current case table.
+ upper, upper-case Upper-case chars, from current case table.
+ graph, graphic Graphic characters (Unicode).
+ print, printing Whitespace or graphic (Unicode).
+ punct, punctuation Not control, space, letter or digit (ASCII);
+ not word syntax (non-ASCII).
+ word, wordchar Characters with word syntax.
+ ascii ASCII characters (codes 0-127).
+ nonascii Non-ASCII characters (but not raw bytes).
+
+(syntax SYNTAX) Match a character with syntax SYNTAX, being one of:
+ whitespace, punctuation, word, symbol, open-parenthesis,
+ close-parenthesis, expression-prefix, string-quote,
+ paired-delimiter, escape, character-quote, comment-start,
+ comment-end, string-delimiter, comment-delimiter
+
+(category CAT) Match a character in category CAT, being one of:
+ space-for-indent, base, consonant, base-vowel,
+ upper-diacritical-mark, lower-diacritical-mark, tone-mark, symbol,
+ digit, vowel-modifying-diacritical-mark, vowel-sign,
+ semivowel-lower, not-at-end-of-line, not-at-beginning-of-line,
+ alpha-numeric-two-byte, chinese-two-byte, greek-two-byte,
+ japanese-hiragana-two-byte, indian-two-byte,
+ japanese-katakana-two-byte, strong-left-to-right,
+ korean-hangul-two-byte, strong-right-to-left, cyrillic-two-byte,
+ combining-diacritic, ascii, arabic, chinese, ethiopic, greek,
+ korean, indian, japanese, japanese-katakana, latin, lao,
+ tibetan, japanese-roman, thai, vietnamese, hebrew, cyrillic,
+ can-break
+
+Zero-width assertions: these all match the empty string in specific places.
+ line-start At the beginning of a line. Alias: bol.
+ line-end At the end of a line. Alias: eol.
+ string-start At the start of the string or buffer.
+ Alias: buffer-start, bos, bot.
+ string-end At the end of the string or buffer.
+ Alias: buffer-end, eos, eot.
+ point At point.
+ word-start At the beginning of a word.
+ word-end At the end of a word.
+ word-boundary At the beginning or end of a word.
+ not-word-boundary Not at the beginning or end of a word.
+ symbol-start At the beginning of a symbol.
+ symbol-end At the end of a symbol.
+
+(group RX...) Match RXs and define a capture group. Alias: submatch.
+(group-n N RX...) Match RXs and define capture group N. Alias: submatch-n.
+(backref N) Match the text that capture group N matched.
+
+(literal EXPR) Match the literal string from evaluating EXPR at run time.
+(regexp EXPR) Match the string regexp from evaluating EXPR at run time.
+(eval EXPR) Match the rx sexp from evaluating EXPR at compile time."
(let* ((rx--compile-to-lisp t)
(re (cond ((null regexps)
(error "No regexp"))
--
2.20.1 (Apple Git-117)
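As a rough illustration of the last three entries in the condensed doc string (`literal', `regexp' and `eval'), here is a minimal sketch. It assumes an Emacs whose rx supports the `literal' form, as the patched doc string describes; the helper name `my-match-exact-p' is hypothetical and used only for this example.

```elisp
(require 'rx)

;; `literal': EXPR is evaluated at run time and matched verbatim,
;; so regexp-special characters in it are quoted.
(defun my-match-exact-p (needle haystack)
  "Return non-nil if HAYSTACK is exactly the string NEEDLE.
Hypothetical helper, for illustration only."
  (string-match-p (rx string-start (literal needle) string-end)
                  haystack))

(my-match-exact-p "a.b" "a.b")   ; 0   (match at position 0)
(my-match-exact-p "a.b" "axb")   ; nil (the dot is not a wildcard)

;; `regexp': splice in an existing string regexp unchanged.
(rx "id=" (regexp "[0-9]+"))     ; "id=[0-9]+"

;; `eval': EXPR is evaluated when the macro is expanded, and its
;; value is then treated as an rx form.
(rx (eval (list '+ "a")))        ; "a+"
```

Because `literal' and `regexp' with non-constant arguments are evaluated at run time, the `rx' call expands to an expression that builds the regexp string, not to a constant string.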
* bug#36496: [PATCH] Describe the rx notation in the lisp manual
2019-07-06 11:33 ` Mattias Engdegård
2019-07-06 11:41 ` Eli Zaretskii
2019-07-06 11:59 ` Noam Postavsky
@ 2019-07-06 23:56 ` Richard Stallman
2 siblings, 0 replies; 26+ messages in thread
From: Richard Stallman @ 2019-07-06 23:56 UTC (permalink / raw)
To: Mattias Engdegård; +Cc: 36496
[[[ To any NSA and FBI agents reading my email: please consider ]]]
[[[ whether defending the US Constitution against all enemies, ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]
> >> It is about 7-8 pages in all.
Would it be feasible to format it differently so that it comes out as
fewer pages? That might be possible by using Texinfo a different way.
Or perhaps not. That is why I am asking.
Here's another idea: document the two syntaxes in a single table,
where each item says how to do the job in a regexp
and how to do it in rx.
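To make the idea concrete, a few such paired entries might look like the sketch below. The variable name `my-rx-regexp-pairs' is hypothetical; each element pairs a string regexp with the rx form expected to produce the same string.

```elisp
(require 'rx)

;; Each cons cell is (STRING-REGEXP . RX-RESULT); the two halves of
;; every pair should be `equal'.
(defconst my-rx-regexp-pairs
  (list (cons "a+"            (rx (+ "a")))
        (cons "[[:digit:]]*"  (rx (* digit)))
        (cons "^foo"          (rx line-start "foo"))
        (cons "[^0-9]"        (rx (not (any "0-9"))))
        (cons "\\(ab\\)\\1"   (rx (group "ab") (backref 1))))
  "Hypothetical sample of regexp/rx equivalents, for illustration only.")
```

A table in the manual could carry the same information in two columns, one per notation.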
--
Dr Richard Stallman
President, Free Software Foundation (https://gnu.org, https://fsf.org)
Internet Hall-of-Famer (https://internethalloffame.org)
* bug#36496: [PATCH] Describe the rx notation in the lisp manual
2019-07-06 6:47 ` Eli Zaretskii
@ 2019-07-06 23:59 ` Richard Stallman
2019-07-07 0:36 ` Drew Adams
0 siblings, 1 reply; 26+ messages in thread
From: Richard Stallman @ 2019-07-06 23:59 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: mattiase, 36496
> > In the past, various practical factors have made rx somewhat inconvenient,
> > and that prevented rx from competing with the regexp syntax.
> > Recently we have made some improvements in rx; are they enough to
> > make rx a real competitor for regexps?
> I cannot answer the question without knowing which practical factors
> made rx inconvenient in the past. Where can one find this
> information?
I don't know. I think people discussed it in the
past -- perhaps on emacs-devel. I don't remember details.
What's clear is that rx didn't replace regexp syntax in the past.
There had to be reasons.
--
Dr Richard Stallman
President, Free Software Foundation (https://gnu.org, https://fsf.org)
Internet Hall-of-Famer (https://internethalloffame.org)
* bug#36496: [PATCH] Describe the rx notation in the lisp manual
2019-07-06 23:59 ` Richard Stallman
@ 2019-07-07 0:36 ` Drew Adams
2019-07-07 23:51 ` Richard Stallman
0 siblings, 1 reply; 26+ messages in thread
From: Drew Adams @ 2019-07-07 0:36 UTC (permalink / raw)
To: rms, Eli Zaretskii; +Cc: mattiase, 36496
> > I cannot answer the question without knowing which practical factors
> > made rx inconvenient in the past. Where can one find this
> > information?
>
> I don't know. I think people discussed it in the
> past -- perhaps on emacs-devel. I don't remember details.
>
> What's clear is that rx didn't replace regexp syntax in the past.
> There had to be reasons.
I don't want to sidetrack this thread. But one of
the things mentioned in some previous threads about
`rx' was that some people (including me) thought it
would be great if you could invoke a command on a
regexp (e.g. a regexp string in code) and have an
equivalent `rx' expression pop up, for inspection
and understanding.
A regexp string can be very concise (advantage),
even if obtuse (disadvantage). Much of the time one
doesn't need to dig into the content of the regexp.
It would be nice to be able to have only the result
of `rx' in the code and be able to get its `rx'
expression on demand.
In sum, I'd say that one advantage of a regexp is
its concision. But when you need or want to grok
it, it's good to be able to get its `rx' sexp.
With such a feature people could use `rx' or its
result in code, au choix. And they could see the
`rx' equivalent for a regexp on demand.
This is orthogonal to having good doc for `rx'.
I mention it only because the question came up of
disadvantages of `rx' (reasons why it might not
replace a regexp).
(Another reason, if it's true, would be if there
are some regexp constructs that `rx' cannot
handle/reproduce.)
* bug#36496: [PATCH] Describe the rx notation in the lisp manual
2019-07-06 19:45 ` Mattias Engdegård
@ 2019-07-07 2:29 ` Eli Zaretskii
2019-07-07 11:31 ` Mattias Engdegård
0 siblings, 1 reply; 26+ messages in thread
From: Eli Zaretskii @ 2019-07-07 2:29 UTC (permalink / raw)
To: Mattias Engdegård; +Cc: npostavs, 36496
> From: Mattias Engdegård <mattiase@acm.org>
> Date: Sat, 6 Jul 2019 21:45:58 +0200
> Cc: Noam Postavsky <npostavs@gmail.com>, 36496@debbugs.gnu.org
>
> > Honestly, they look like incorrect English: a sentence, starting with
> > a capital letter, but not ending with a period. I hope you will
> > reconsider.
>
> Very well, full stops added.
Thanks, LGTM.
* bug#36496: [PATCH] Describe the rx notation in the lisp manual
2019-07-07 2:29 ` Eli Zaretskii
@ 2019-07-07 11:31 ` Mattias Engdegård
2019-07-07 14:33 ` Eli Zaretskii
2022-04-25 15:12 ` Lars Ingebrigtsen
0 siblings, 2 replies; 26+ messages in thread
From: Mattias Engdegård @ 2019-07-07 11:31 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: 36496, Noam Postavsky, Richard Stallman
7 juli 2019 kl. 04.29 skrev Eli Zaretskii <eliz@gnu.org>:
> Thanks, LGTM.
Thanks for reviewing! Pushed to master.
7 juli 2019 kl. 01.56 skrev Richard Stallman <rms@gnu.org>:
> Would it be feasible to format it differently so that it comes out as
> fewer pages? That might be possible by using Texinfo a different way.
One way, already mentioned, would be to merge the character class descriptions for rx and string regexps. That would save about one page, at the cost of making the list slightly messier since rx has synonyms for each item which are not legal in string regexps ([:digit:] vs `digit', `numeric' and `num').
Eliding the categories table would save another page, if we accept a reference to other formats or the rx doc string.
I see little room beyond that. We could remove the examples, but they are short and doing so would make an already dry section even drier.
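The synonym point can be illustrated with a small sketch. It assumes a reasonably recent Emacs where `rx' is available, and it relies only on the three names given in the message (`digit', `numeric' and `num'); the variable name is hypothetical.

```elisp
(require 'rx)

;; In rx, `digit', `numeric' and `num' are synonyms for the same
;; character class, whereas a string regexp has only [:digit:].
(defconst my-digit-forms          ; hypothetical name, for illustration
  (list (rx digit) (rx numeric) (rx num)))

;; All three elements are the same regexp string as "[[:digit:]]".
(string-match-p (rx (one-or-more digit)) "abc123")
```

A merged table would therefore need to list one string-regexp spelling next to several rx spellings for each class.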
> Here's another idea: document the two syntaxes in a single table,
> where each item says how to do the job in a regexp
> and how to do it in rx.
Probably not a bad idea and one that would put the syntaxes on equal footing, but it involves a complete rewrite of all the regexp sections; more than I am prepared to do right now.
Should this bug be kept open for possible improvements to the print manual, or can we regard that as a separate issue?
* bug#36496: [PATCH] Describe the rx notation in the lisp manual
2019-07-07 11:31 ` Mattias Engdegård
@ 2019-07-07 14:33 ` Eli Zaretskii
2022-04-25 15:12 ` Lars Ingebrigtsen
1 sibling, 0 replies; 26+ messages in thread
From: Eli Zaretskii @ 2019-07-07 14:33 UTC (permalink / raw)
To: Mattias Engdegård; +Cc: 36496, npostavs, rms
> From: Mattias Engdegård <mattiase@acm.org>
> Date: Sun, 7 Jul 2019 13:31:06 +0200
> Cc: Noam Postavsky <npostavs@gmail.com>, 36496@debbugs.gnu.org,
> Richard Stallman <rms@gnu.org>
>
> Should this bug be kept open for possible improvements to the print manual, or can we regard that as a separate issue?
It doesn't really matter, IMO. I'd say leave this bug open, as
opening another doesn't seem to be worth the hassle.
Thanks.
* bug#36496: [PATCH] Describe the rx notation in the lisp manual
2019-07-07 0:36 ` Drew Adams
@ 2019-07-07 23:51 ` Richard Stallman
2019-07-08 0:56 ` Drew Adams
2019-07-08 23:44 ` Richard Stallman
0 siblings, 2 replies; 26+ messages in thread
From: Richard Stallman @ 2019-07-07 23:51 UTC (permalink / raw)
To: Drew Adams; +Cc: mattiase, 36496
[[[ To any NSA and FBI agents reading my email: please consider ]]]
[[[ whether defending the US Constitution against all enemies, ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]
> would be great if you could invoke a command on a
> regexp (e.g. a regexp string in code) and have an
> equivalent `rx' expression pop up, for inspection
> and understanding.
I agree. That would make rx much more convenient for people who like
the shortness of some regexps. It could be part of Lisp mode, so you
could use this on a regexp constant in a source file.
I suspect that the long-windedness of rx input is a substantial
deterrent to its use. It may be better for complex patterns but worse
for simple ones.
> It would be nice to be able to have only the result
> of `rx' in the code and be able to get its `rx'
> expression on demand.
I think it would be clearer, usually, for Lisp source to have the rx
form. That would help people get used to rx. For complex patterns,
the rx form is easier to understand and change.
What would people think of making all the functions that want a regexp
accept an rx input equivalently? If the arg is not a string, treat it
as rx format. Compilation could convert a constant non-string, for
such args, to a regexp string.
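As a rough sketch of what "if the arg is not a string, treat it as rx format" could look like at run time, the following uses the existing function `rx-to-string'. The helper `my-regexp-or-rx' is hypothetical and is not part of Emacs; the compile-time conversion of constants mentioned above is not shown.

```elisp
(require 'rx)

(defun my-regexp-or-rx (pattern)
  "Return PATTERN as a regexp string.
PATTERN is either a regexp string, which is returned unchanged, or an
rx form given as Lisp data, which is translated at run time.
This is a hypothetical helper for illustration only."
  (if (stringp pattern)
      pattern
    ;; The second argument asks `rx-to-string' not to wrap the
    ;; result in an extra shy group.
    (rx-to-string pattern t)))

;; Both calls search for the same pattern.
(string-match-p (my-regexp-or-rx "ab+c") "xabbbc")
(string-match-p (my-regexp-or-rx '(seq "a" (one-or-more "b") "c")) "xabbbc")
```

A function that wants a regexp could call such a helper on its argument first; whether that is preferable to writing `(rx ...)' at the call site is exactly the question under discussion.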
Commands that read a regexp using the minibuffer could offer a key to
say that you are entering rx format. The only problem is, which key
would it be?
--
Dr Richard Stallman
President, Free Software Foundation (https://gnu.org, https://fsf.org)
Internet Hall-of-Famer (https://internethalloffame.org)
* bug#36496: [PATCH] Describe the rx notation in the lisp manual
2019-07-07 23:51 ` Richard Stallman
@ 2019-07-08 0:56 ` Drew Adams
2019-07-08 23:46 ` Richard Stallman
2019-07-08 23:44 ` Richard Stallman
1 sibling, 1 reply; 26+ messages in thread
From: Drew Adams @ 2019-07-08 0:56 UTC (permalink / raw)
To: rms; +Cc: mattiase, 36496
> > would be great if you could invoke a command on a
> > regexp (e.g. a regexp string in code) and have an
> > equivalent `rx' expression pop up, for inspection
> > and understanding.
>
> I agree. That would make rx much more convenient for people who like
> the shortness of some regexps.
It would also help someone understand a complex regexp.
It could also help someone learn about regexps by, in
effect, analyzing them (on demand).
It would also be good to be able to select _part_ of a
complex regexp - a part that is itself a valid regexp,
and use such an inspection command on just that part,
to show what `rx' it corresponds to. IOW, select some
text, not necessarily a string, and (if it's a valid
regexp) get its `rx' form.
> It could be part of Lisp mode, so you
> could use this on a regexp constant in a source file.
>
> I suspect that the long-windedness of rx input is a substantial
> deterrent to its use. It may be better for complex patterns but worse
> for simple ones.
>
> > It would be nice to be able to have only the result
> > of `rx' in the code and be able to get its `rx'
> > expression on demand.
>
> I think it would be clearer, usually, for Lisp source to have the rx
> form. That would help people get used to rx. For complex patterns,
> the rx form is easier to understand and change.
>
> What would people think of making all the functions that want a regexp
> accept an rx input equivalently? If the arg is not a string, treat it
> as rx format.
Do you mean they'd accept a quoted `rx' form (list)?
What would a use case be - as opposed to accepting
the result of macro-expanding such a form? Assuming
there's a good use case, maybe so.
[But there may be some functions that already have a
(different) interpretation of a list value for the
same arg that could alternatively be a regexp string.
(So maybe not "all" such functions.)]
> Compilation could convert a constant non-string, for
> such args, to a regexp string.
Same question as above, about the use case for a
quoted `rx'-form arg (versus macro-expanding it to
provide a regexp string arg).
Even assuming such a use case, should the compiler
assume that _every_ such list arg should be compiled
to a regexp string?
And wouldn't such compile-time conversion just
amount to macro-expanding it? I guess I might be
missing your point/suggestion.
> Commands that read a regexp using the minibuffer could offer a key to
> say that you are entering rx format.
Sounds good to me.
> The only problem is, which key would it be?
Some non-repeatable key. Some key that can't be
used (by default) to edit minibuffer text. Maybe
something like `C-x x'?
* bug#36496: [PATCH] Describe the rx notation in the lisp manual
2019-07-07 23:51 ` Richard Stallman
2019-07-08 0:56 ` Drew Adams
@ 2019-07-08 23:44 ` Richard Stallman
1 sibling, 0 replies; 26+ messages in thread
From: Richard Stallman @ 2019-07-08 23:44 UTC (permalink / raw)
To: drew.adams, mattiase, 36496; +Cc: rms
[[[ To any NSA and FBI agents reading my email: please consider ]]]
[[[ whether defending the US Constitution against all enemies, ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]
Can a regexp string be used as a part of an rx pattern?
For instance, can you do this?
(or "foo+" "bar+")
If not, that might be good to add, so people can get the benefits of
both formats. You could use the simple regexp constructs that are
easy to read, and use rx instead of the uglier complex regexp
constructs.
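For reference, a minimal sketch of how this behaves with the `regexp' form listed in the rx doc string quoted earlier in this thread. A bare string inside an rx form is matched literally, while `(regexp ...)' splices in a string regexp. The variable names are hypothetical.

```elisp
(require 'rx)

;; Strings inside an rx form are literal: "foo+" here means the four
;; characters f, o, o, + and not "fo" followed by one or more "o".
(defconst my-literal-re (rx (or "foo+" "bar+")))

;; The `regexp' form embeds a string regexp in the rx form.
(defconst my-mixed-re (rx (or (regexp "foo+") (regexp "bar+"))))

(string-match-p my-literal-re "foooo")   ; no match: there is no literal "+"
(string-match-p my-mixed-re "foooo")     ; matches at the start of the string
```

So the two notations can already be mixed, at the cost of writing `regexp' around each string that is meant as a regexp.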
--
Dr Richard Stallman
President, Free Software Foundation (https://gnu.org, https://fsf.org)
Internet Hall-of-Famer (https://internethalloffame.org)
* bug#36496: [PATCH] Describe the rx notation in the lisp manual
2019-07-08 0:56 ` Drew Adams
@ 2019-07-08 23:46 ` Richard Stallman
2019-07-09 0:19 ` Drew Adams
0 siblings, 1 reply; 26+ messages in thread
From: Richard Stallman @ 2019-07-08 23:46 UTC (permalink / raw)
To: Drew Adams; +Cc: mattiase, 36496
[[[ To any NSA and FBI agents reading my email: please consider ]]]
[[[ whether defending the US Constitution against all enemies, ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]
> Do you mean they'd accept a quoted `rx' form (list)?
> What would a use case be - as opposed to accepting
> the result of macro-expanding such a form? Assuming
> there's a good use case, maybe so.
Quoting is a little more brief than writing (rx ...).
> [But there may be some functions that already have a
> (different) interpretation of a list value for the
> same arg that could alternatively be a regexp string.
> (So maybe not "all" such functions.)]
Are there any? If so, it would be desirable to change them.
> Even assuming such a use case, should the compiler
> assume that _every_ such list arg should be compiled
> to a regexp string?
Why not? Is there any case in which it would be better
to translate the rx to a regexp at run time?
> > The only problem is, which key would it be?
> Some non-repeatable key. Some key that can't be
> used (by default) to edit minibuffer text. Maybe
> something like `C-x x'?
Is there any reasonable one-character key?
--
Dr Richard Stallman
President, Free Software Foundation (https://gnu.org, https://fsf.org)
Internet Hall-of-Famer (https://internethalloffame.org)
* bug#36496: [PATCH] Describe the rx notation in the lisp manual
2019-07-08 23:46 ` Richard Stallman
@ 2019-07-09 0:19 ` Drew Adams
0 siblings, 0 replies; 26+ messages in thread
From: Drew Adams @ 2019-07-09 0:19 UTC (permalink / raw)
To: rms; +Cc: mattiase, 36496
> > Do you mean they'd accept a quoted `rx' form (list)?
> > What would a use case be - as opposed to accepting
> > the result of macro-expanding such a form? Assuming
> > there's a good use case, maybe so.
>
> Quoting is a little more brief than writing (rx ...).
Sorry, but I don't really understand. I know little
about `rx'.
> > [But there may be some functions that already have a
> > (different) interpretation of a list value for the
> > same arg that could alternatively be a regexp string.
> > (So maybe not "all" such functions.)]
>
> Are there any? If so, it would be desirable to change them.
I was responding to this from you:
What would people think of making all the functions that want a regexp
accept an rx input equivalently? If the arg is not a string, treat it
as rx format. Compilation could convert a constant non-string, for
such args, to a regexp string.
I see now that you said "an rx input", so presumably
not just a list as arg but a list with car `rx'. I
was thinking you meant just a list. I'd bet there
are some functions that accept an arg that can be a
(nonempty) list or a string, and maybe even a regexp
string. If there are then some minor adjustment
could be called for; that's all.
> > Even assuming such a use case, should the compiler
> > assume that _every_ such list arg should be compiled
> > to a regexp string?
>
> Why not? Is there any case in which it would be better
> to translate the rx to a regexp at run time?
I think I misunderstood you, and might still.
Still, there's a difference between passing
(quote SOME-MACRO-SEXP) as arg and passing
SOME-MACRO-SEXP. We have `quote' for a reason.
If you want macro-expansion at compile time
why wouldn't you just pass (rx...) as the arg,
instead of (quote (rx...))?
But you know all this better than I, so no doubt
I'm just missing your point - in which case feel
free to ignore.
> > > The only problem is, which key would it be?
>
> > Some non-repeatable key. Some key that can't be
> > used (by default) to edit minibuffer text. Maybe
> > something like `C-x x'?
>
> Is there any reasonable one-character key?
There are lots of 1-char keys that are not defined
in a minibuffer keymap by default. And perhaps
even some that are defined there by default but
that aren't useful for reading a regexp (so could
be co-opted when reading regexp input, if needed).
As just one example, `M-R' is not defined (`M-r'
is). `M-R' is `move-to-window-line-top-bottom',
which isn't so useful in a minibuffer window.
(It is a repeatable key, BTW, and its global
binding is a repeatable command. But that command
isn't very useful in the minibuffer.)
* bug#36496: [PATCH] Describe the rx notation in the lisp manual
2019-07-07 11:31 ` Mattias Engdegård
2019-07-07 14:33 ` Eli Zaretskii
@ 2022-04-25 15:12 ` Lars Ingebrigtsen
1 sibling, 0 replies; 26+ messages in thread
From: Lars Ingebrigtsen @ 2022-04-25 15:12 UTC (permalink / raw)
To: Mattias Engdegård; +Cc: Noam Postavsky, Richard Stallman, 36496
Mattias Engdegård <mattiase@acm.org> writes:
> Thanks for reviewing! Pushed to master.
The discussion then turned to other matters, but as far as I can tell,
the issue was fixed (i.e., Mattias added rx documentation to the
manual), so I'm closing this bug report.
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no