bug#36496: [PATCH] Describe the rx notation in the lisp manual

unofficial mirror of bug-gnu-emacs@gnu.org 
 help / color / mirror / code / Atom feed

From: "Mattias Engdegård" <mattiase@acm.org>
To: Eli Zaretskii <eliz@gnu.org>
Cc: 36496@debbugs.gnu.org
Subject: bug#36496: [PATCH] Describe the rx notation in the lisp manual
Date: Sat, 6 Jul 2019 13:33:35 +0200	[thread overview]
Message-ID: <F6C1EF17-20F5-4551-B9CE-6C1D6C404EBD@acm.org> (raw)
In-Reply-To: <83r2738r9j.fsf@gnu.org>

[-- Attachment #1: Type: text/plain, Size: 1303 bytes --]

6 juli 2019 kl. 11.08 skrev Eli Zaretskii <eliz@gnu.org>:
> 
>> It is about 7-8 pages in all.
> 
> It's more that 2500 lines.  We have in doc/misc/ separate manuals much
> smaller than this.  So making a separate manual out of this is not
> radically different from what we have already.

It was a visual count of printed pages in the pdf; a lot of lines in the source are mark-up.
In any case, the attached patch has @ifnottex added to it. I didn't move the text to a separate file, since there was no existing "lispref-extras" document to put them in. In addition, some of the additions were to existing sections (pcase, and the complex regexp example).

> Opinion on which matter? on whether or not make it a separate manual?
> If so, you now have my opinion.

Thanks, that's what I meant.

> The comma is common because older versions of makeinfo insisted on
> having it, and would complain if there weren't one.  The latest
> versions no longer complain, but we would still like to support the
> old versions, as they are ~15 times faster, so some people still keep
> them around.

Thank you very much for clearing that up; I always wondered.

Also attached is a patch for replacing the rx doc string with a condensed summary.  I basically copied the one I wrote for ry.


[-- Attachment #2: 0001-Describe-the-rx-notation-in-the-elisp-manual-bug-364.patch --]
[-- Type: application/octet-stream, Size: 23647 bytes --]

From d5c54a21127a92e7dec1c03c262de551f3a76e27 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mattias=20Engdeg=C3=A5rd?= <mattiase@acm.org>
Date: Thu, 4 Jul 2019 13:01:52 +0200
Subject: [PATCH] Describe the rx notation in the elisp manual (bug#36496)

* doc/lispref/searching.texi (Regular Expressions): New menu entry.
(Regexp Example): Add rx form of the example.
(Rx Notation, Rx Constructs, Rx Functions): New nodes.
* doc/lispref/control.texi (pcase Macro): Describe the rx pattern.
---
 doc/lispref/control.texi   |  25 ++
 doc/lispref/searching.texi | 565 +++++++++++++++++++++++++++++++++++++
 2 files changed, 590 insertions(+)

diff --git a/doc/lispref/control.texi b/doc/lispref/control.texi
index e308d68b75..de6cd9301f 100644
--- a/doc/lispref/control.texi
+++ b/doc/lispref/control.texi
@@ -618,6 +618,31 @@ pcase Macro
 to @var{body-forms} (thus avoiding an evaluation error on match),
 if any of the sub-patterns let-binds a set of symbols,
 they @emph{must} all bind the same set of symbols.
+
+@ifnottex
+@anchor{rx in pcase}
+@item (rx @var{rx-expr}@dots{})
+Matches strings against the regexp @var{rx-expr}@dots{}, using the
+@code{rx} regexp notation (@pxref{Rx Notation}), as if by
+@code{string-match}.
+
+In addition to the usual @code{rx} syntax, @var{rx-expr}@dots{} can
+contain the following constructs:
+
+@table @code
+@item (let @var{ref} @var{rx-expr}@dots{})
+Bind the symbol @var{ref} to a submatch that matches
+@var{rx-expr}@enddots{}.  @var{ref} is bound in @var{body-forms} to
+the string of the submatch or nil, but can also be used in
+@code{backref}.
+
+@item (backref @var{ref})
+Like the standard @code{backref} construct, but @var{ref} can here
+also be a name introduced by a previous @code{(let @var{ref} @dots{})}
+construct.
+@end table
+@end ifnottex
+
 @end table
 
 @anchor{pcase-example-0}
diff --git a/doc/lispref/searching.texi b/doc/lispref/searching.texi
index ef1cffc446..17c4790f5e 100644
--- a/doc/lispref/searching.texi
+++ b/doc/lispref/searching.texi
@@ -254,6 +254,9 @@ Regular Expressions
 @menu
 * Syntax of Regexps::       Rules for writing regular expressions.
 * Regexp Example::          Illustrates regular expression syntax.
+@ifnottex
+* Rx Notation::             An alternative, structured regexp notation.
+@end ifnottex
 * Regexp Functions::        Functions for operating on regular expressions.
 @end menu
 
@@ -359,6 +362,7 @@ Regexp Special
 preceding expression either once or not at all.  For example,
 @samp{ca?r} matches @samp{car} or @samp{cr}; nothing else.
 
+@anchor{Non-greedy repetition}
 @item @samp{*?}, @samp{+?}, @samp{??}
 @cindex non-greedy repetition characters in regexp
 These are @dfn{non-greedy} variants of the operators @samp{*}, @samp{+}
@@ -951,6 +955,567 @@ Regexp Example
 beyond the minimum needed to end a sentence.
 @end table
 
+@ifnottex
+In the @code{rx} notation (@pxref{Rx Notation}), the regexp could be written
+
+@example
+@group
+(rx (any ".?!")                    ; Punctuation ending sentence.
+    (zero-or-more (any "\"')]@}"))  ; Closing quotes or brackets.
+    (or line-end
+        (seq " " line-end)
+        "\t"
+        "  ")                      ; Two spaces.
+    (zero-or-more (any "\t\n ")))  ; Optional extra whitespace.
+@end group
+@end example
+
+Since @code{rx} regexps are just S-expressions, they can be formatted
+and commented as such.
+@end ifnottex
+
+@ifnottex
+@node Rx Notation
+@subsection The @code{rx} Structured Regexp Notation
+@cindex rx
+@cindex regexp syntax
+
+  As an alternative to the string-based syntax, Emacs provides the
+structured @code{rx} notation based on Lisp S-expressions.  This
+notation is usually easier to read, write and maintain than regexp
+strings, and can be indented and commented freely.  It requires a
+conversion into string form since that is what regexp functions
+expect, but that conversion typically takes place during
+byte-compilation rather than when the Lisp code using the regexp is
+run.
+
+  Here is an @code{rx} regexp@footnote{It could be written much
+simpler with non-greedy operators (how?), but that would make the
+example less interesting.} that matches a block comment in the C
+programming language:
+
+@example
+@group
+(rx "/*"                          ; Initial /*
+    (zero-or-more
+     (or (not (any "*"))          ;  Either non-*,
+         (seq "*"                 ;  or * followed by
+              (not (any "/")))))  ;  non-/
+    (one-or-more "*")             ; At least one star,
+    "/")                          ; and the final /
+@end group
+@end example
+
+@noindent
+or, using shorter synonyms and written more compactly,
+
+@example
+@group
+(rx "/*"
+    (* (| (not (any "*"))
+          (: "*" (not (any "/")))))
+    (+ "*") "/")
+@end group
+@end example
+
+@noindent
+In conventional string syntax, it would be written
+
+@example
+"/\\*\\(?:[^*]\\|\\*[^/]\\)*\\*+/"
+@end example
+
+The @code{rx} notation is mainly useful in Lisp code; it cannot be
+used in most interactive situations where a regexp is requested, such
+as when running @code{query-replace-regexp} or in variable
+customisation.
+
+@menu
+* Rx Constructs::       Constructs valid in rx forms.
+* Rx Functions::        Functions and macros that use rx forms.
+@end menu
+
+@node Rx Constructs
+@subsubsection Constructs in @code{rx} regexps
+
+The various forms in @code{rx} regexps are described below.  The
+shorthand @var{rx} represents any @code{rx} form, and @var{rx}@dots{}
+means one or more @code{rx} forms.  Where the corresponding string
+regexp syntax is given, @var{A}, @var{B}, @dots{} are string regexp
+subexpressions.
+@c With the new implementation of rx, this can be changed from
+@c 'one or more' to 'zero or more'.
+
+@subsubheading Literals
+
+@table @asis
+@item @code{"some-string"}
+Match the string @samp{some-string} literally.  There are no
+characters with special meaning, unlike in string regexps.
+
+@item @code{?C}
+Match the character @samp{C} literally.
+@end table
+
+@subsubheading Fundamental structure
+
+@table @asis
+@item @code{(seq @var{rx}@dots{})}
+@cindex @code{seq} in rx
+@itemx @code{(sequence @var{rx}@dots{})}
+@cindex @code{sequence} in rx
+@itemx @code{(: @var{rx}@dots{})}
+@cindex @code{:} in rx
+@itemx @code{(and @var{rx}@dots{})}
+@cindex @code{and} in rx
+Match the @var{rx}s in sequence.  Without arguments, the expression
+matches the empty string.@*
+Corresponding string regexp: @samp{@var{A}@var{B}@dots{}}
+(subexpressions in sequence).
+
+@item @code{(or @var{rx}@dots{})}
+@cindex @code{or} in rx
+@itemx @code{(| @var{rx}@dots{})}
+@cindex @code{|} in rx
+Match exactly one of the @var{rx}s, trying from left to right.
+Without arguments, the expression will not match anything at all.@*
+Corresponding string regexp: @samp{@var{A}\|@var{B}\|@dots{}}.
+@end table
+
+@subsubheading Repetition
+
+@table @code
+@item (zero-or-more @var{rx}@dots{})
+@cindex @code{zero-or-more} in rx
+@itemx (0+ @var{rx}@dots{})
+@cindex @code{0+} in rx
+@itemx (* @var{rx}@dots{})
+@cindex @code{*} in rx
+Match the @var{rx}s zero or more times.@*
+Corresponding string regexp: @samp{@var{A}*}
+
+@item (one-or-more @var{rx}@dots{})
+@cindex @code{one-or-more} in rx
+@itemx (1+ @var{rx}@dots{})
+@cindex @code{1+} in rx
+@itemx (+ @var{rx}@dots{})
+@cindex @code{+} in rx
+Match the @var{rx}s one or more times.@*
+Corresponding string regexp: @samp{@var{A}+}
+
+@item (zero-or-one @var{rx}@dots{})
+@cindex @code{zero-or-one} in rx
+@itemx (optional @var{rx}@dots{})
+@cindex @code{optional} in rx
+@itemx (opt @var{rx}@dots{})
+@cindex @code{opt} in rx
+@itemx (? @var{rx}@dots{})
+@cindex @code{?} in rx
+Match the @var{rx}s once or an empty string.@*
+Corresponding string regexp: @samp{@var{A}?}
+
+@item (= @var{n} @var{rx}@dots{})
+@cindex @code{=} in rx
+@itemx (repeat @var{n} @var{rx})
+Match the @var{rx}s exactly @var{n} times.@*
+Corresponding string regexp: @samp{@var{A}\@{@var{n}\@}}
+
+@item (>= @var{n} @var{rx}@dots{})
+@cindex @code{>=} in rx
+Match the @var{rx}s @var{n} or more times.@*
+Corresponding string regexp: @samp{@var{A}\@{@var{n},\@}}
+
+@item (** @var{n} @var{m} @var{rx}@dots{})
+@cindex @code{**} in rx
+@itemx (repeat @var{n} @var{m} @var{rx}@dots{})
+@cindex @code{repeat} in rx
+Match the @var{rx}s at least @var{n} but no more than @var{m} times.@*
+Corresponding string regexp: @samp{@var{A}\@{@var{n},@var{m}\@}}
+@end table
+
+@subsubheading Non-greedy repetition
+
+Normally, repetition forms are greedy, in that they attempt to match
+as many times as possible.  The following three forms are non-greedy; they
+try to match as few times as possible (@pxref{Non-greedy repetition}).
+
+@table @code
+@item (*? @var{rx}@dots{})
+@cindex @code{*?} in rx
+Match the @var{rx}s zero or more times, non-greedily.@*
+Corresponding string regexp: @samp{@var{A}*?}
+
+@item (+? @var{rx}@dots{})
+@cindex @code{+?} in rx
+Match the @var{rx}s one or more times, non-greedily.@*
+Corresponding string regexp: @samp{@var{A}+?}
+
+@item (?? @var{rx}@dots{})
+@cindex @code{??} in rx
+Match the @var{rx}s or an empty string, non-greedily.@*
+Corresponding string regexp: @samp{@var{A}??}
+@end table
+
+The greediness of some repetition forms can be controlled using the
+following constructs.  However, it is usually better to use the
+explicit non-greedy forms above instead.
+
+@table @code
+@item (minimal-match @var{rx})
+@cindex @code{minimal-match} in rx
+Match @var{rx}, with @code{zero-or-more}, @code{0+},
+@code{one-or-more}, @code{1+}, @code{zero-or-one}, @code{opt} and
+@code{option} using non-greedy matching.
+
+@item (maximal-match @var{rx})
+@cindex @code{maximal-match} in rx
+Match @var{rx}, with @code{zero-or-more}, @code{0+},
+@code{one-or-more}, @code{1+}, @code{zero-or-one}, @code{opt} and
+@code{option} using non-greedy matching.  This is the default.
+@end table
+
+@subsubheading Matching single characters
+
+@table @asis
+@item @code{(any @var{set}@dots{})}
+@cindex @code{any} in rx
+@itemx @code{(char @var{set}@dots{})}
+@cindex @code{char} in rx
+@itemx @code{(in @var{set}@dots{})}
+@cindex @code{in} in rx
+@cindex character class in rx
+Match a single character from one of the @var{set}s.  Each @var{set}
+is a character, a string representing the set of its characters, a
+range or a character class (see below).  A range is either a
+hyphen-separated string like @code{"A-Z"}, or a cons of characters
+like @code{(?A . ?Z)}.
+
+Note that hyphen (@code{-}) is special in strings in this construct,
+since it acts as a range separator.  To include a hyphen, add it as a
+separate character or single-character string.@*
+Corresponding string regexp: @samp{[@dots{}]}
+
+@item @code{(not @var{charspec})}
+@cindex @code{not} in rx
+Match a character not included in @var{charspec}.  @var{charspec} can
+be an @code{any}, @code{syntax} or @code{category} form, or a
+character class.@*
+Corresponding string regexp: @samp{[^@dots{}]}, @samp{\S@var{code}},
+@samp{\C@var{code}}
+
+@item @code{not-newline}, @code{nonl}
+@cindex @code{not-newline} in rx
+@cindex @code{nonl} in rx
+Match any character except a newline.@*
+Corresponding string regexp: @samp{.} (dot)
+
+@item @code{anything}
+@cindex @code{anything} in rx
+Match any character.@*
+Corresponding string regexp: @samp{.\|\n} (for example)
+
+@item character class
+@cindex character class in rx
+Match a character from a named character class:
+
+@table @asis
+@item @code{alpha}, @code{alphabetic}, @code{letter}
+Match alphabetic characters.  More precisely, match characters whose
+Unicode @samp{general-category} property indicates that they are
+alphabetic.
+
+@item @code{alnum}, @code{alphanumeric}
+Match alphabetic characters and digits.  More precisely, match
+characters whose Unicode @samp{general-category} property indicates
+that they are alphabetic or decimal digits.
+
+@item @code{digit}, @code{numeric}, @code{num}
+Match the digits @samp{0}--@samp{9}.
+
+@item @code{xdigit}, @code{hex-digit}, @code{hex}
+Match the hexadecimal digits @samp{0}--@samp{9}, @samp{A}--@samp{F}
+and @samp{a}--@samp{f}.
+
+@item @code{cntrl}, @code{control}
+Match any character whose code is in the range 0--31.
+
+@item @code{blank}
+Match horizontal whitespace.  More precisely, match characters whose
+Unicode @samp{general-category} property indicates that they are
+spacing separators.
+
+@item @code{space}, @code{whitespace}, @code{white}
+Match any character that has whitespace syntax
+(@pxref{Syntax Class Table}).
+
+@item @code{lower}, @code{lower-case}
+Match anything lower-case, as determined by the current case table.
+If @code{case-fold-search} is non-nil, this also matches any
+upper-case letter.
+
+@item @code{upper}, @code{upper-case}
+Match anything upper-case, as determined by the current case table.
+If @code{case-fold-search} is non-nil, this also matches any
+lower-case letter.
+
+@item @code{graph}, @code{graphic}
+Match any character except whitespace, @acronym{ASCII} and
+non-@acronym{ASCII} control characters, surrogates, and codepoints
+unassigned by Unicode, as indicated by the Unicode
+@samp{general-category} property.
+
+@item @code{print}, @code{printing}
+Match whitespace or a character matched by @code{graph}.
+
+@item @code{punct}, @code{punctuation}
+Match any punctuation character.  (At present, for multibyte
+characters, anything that has non-word syntax.)
+
+@item @code{word}, @code{wordchar}
+Match any character that has word syntax (@pxref{Syntax Class Table}).
+
+@item @code{ascii}
+Match any @acronym{ASCII} character (codes 0--127).
+
+@item @code{nonascii}
+Match any non-@acronym{ASCII} character (but not raw bytes).
+@end table
+
+Corresponding string regexp: @samp{[[:@var{class}:]]}
+
+@item @code{(syntax @var{syntax})}
+@cindex @code{syntax} in rx
+Match a character with syntax @var{syntax}, being one of the following
+names:
+
+@multitable {@code{close-parenthesis}} {Syntax character}
+@headitem Syntax name          @tab Syntax character
+@item @code{whitespace}        @tab @code{-}
+@item @code{punctuation}       @tab @code{.}
+@item @code{word}              @tab @code{w}
+@item @code{symbol}            @tab @code{_}
+@item @code{open-parenthesis}  @tab @code{(}
+@item @code{close-parenthesis} @tab @code{)}
+@item @code{expression-prefix} @tab @code{'}
+@item @code{string-quote}      @tab @code{"}
+@item @code{paired-delimiter}  @tab @code{$}
+@item @code{escape}            @tab @code{\}
+@item @code{character-quote}   @tab @code{/}
+@item @code{comment-start}     @tab @code{<}
+@item @code{comment-end}       @tab @code{>}
+@item @code{string-delimiter}  @tab @code{|}
+@item @code{comment-delimiter} @tab @code{!}
+@end multitable
+
+For details, @pxref{Syntax Class Table}.  Please note that
+@code{(syntax punctuation)} is @emph{not} equivalent to the character class
+@code{punctuation}.@*
+Corresponding string regexp: @samp{\s@var{code}}
+
+@item @code {(category @var{category})}
+@cindex @code{category} in rx
+Match a character in category @var{category}, which is either one of
+the names below or its category character.
+
+@multitable {@code{vowel-modifying-diacritical-mark}} {Category character}
+@headitem Category name                       @tab Category character
+@item @code{space-for-indent}                 @tab space
+@item @code{base}                             @tab @code{.}
+@item @code{consonant}                        @tab @code{0}
+@item @code{base-vowel}                       @tab @code{1}
+@item @code{upper-diacritical-mark}           @tab @code{2}
+@item @code{lower-diacritical-mark}           @tab @code{3}
+@item @code{tone-mark}                        @tab @code{4}
+@item @code{symbol}                           @tab @code{5}
+@item @code{digit}                            @tab @code{6}
+@item @code{vowel-modifying-diacritical-mark} @tab @code{7}
+@item @code{vowel-sign}                       @tab @code{8}
+@item @code{semivowel-lower}                  @tab @code{9}
+@item @code{not-at-end-of-line}               @tab @code{<}
+@item @code{not-at-beginning-of-line}         @tab @code{>}
+@item @code{alpha-numeric-two-byte}           @tab @code{A}
+@item @code{chinese-two-byte}                 @tab @code{C}
+@item @code{greek-two-byte}                   @tab @code{G}
+@item @code{japanese-hiragana-two-byte}       @tab @code{H}
+@item @code{indian-two-byte}                  @tab @code{I}
+@item @code{japanese-katakana-two-byte}       @tab @code{K}
+@item @code{strong-left-to-right}             @tab @code{L}
+@item @code{korean-hangul-two-byte}           @tab @code{N}
+@item @code{strong-right-to-left}             @tab @code{R}
+@item @code{cyrillic-two-byte}                @tab @code{Y}
+@item @code{combining-diacritic}              @tab @code{^}
+@item @code{ascii}                            @tab @code{a}
+@item @code{arabic}                           @tab @code{b}
+@item @code{chinese}                          @tab @code{c}
+@item @code{ethiopic}                         @tab @code{e}
+@item @code{greek}                            @tab @code{g}
+@item @code{korean}                           @tab @code{h}
+@item @code{indian}                           @tab @code{i}
+@item @code{japanese}                         @tab @code{j}
+@item @code{japanese-katakana}                @tab @code{k}
+@item @code{latin}                            @tab @code{l}
+@item @code{lao}                              @tab @code{o}
+@item @code{tibetan}                          @tab @code{q}
+@item @code{japanese-roman}                   @tab @code{r}
+@item @code{thai}                             @tab @code{t}
+@item @code{vietnamese}                       @tab @code{v}
+@item @code{hebrew}                           @tab @code{w}
+@item @code{cyrillic}                         @tab @code{y}
+@item @code{can-break}                        @tab @code{|}
+@end multitable
+
+For more information about currently defined categories, run the
+command @kbd{M-x describe-categories @key{RET}}.  For how to define
+new categories, @pxref{Categories}.@*
+Corresponding string regexp: @samp{\c@var{code}}
+@end table
+
+@subsubheading Zero-width assertions
+
+These all match the empty string, but only in specific places.
+
+@table @asis
+@item @code{line-start}, @code{bol}
+@cindex @code{line-start} in rx
+@cindex @code{bol} in rx
+Match at the beginning of a line.@*
+Corresponding string regexp: @samp{^}
+
+@item @code{line-end}, @code{eol}
+@cindex @code{line-end} in rx
+@cindex @code{eol} in rx
+Match at the end of a line.@*
+Corresponding string regexp: @samp{$}
+
+@item @code{string-start}, @code{bos}, @code{buffer-start}, @code{bot}
+@cindex @code{string-start} in rx
+@cindex @code{bos} in rx
+@cindex @code{buffer-start} in rx
+@cindex @code{bot} in rx
+Match at the start of the string or buffer being matched against.@*
+Corresponding string regexp: @samp{\`}
+
+@item @code{string-end}, @code{eos}, @code{buffer-end}, @code{eot}
+@cindex @code{string-end} in rx
+@cindex @code{eos} in rx
+@cindex @code{buffer-end} in rx
+@cindex @code{eot} in rx
+Match at the end of the string or buffer being matched against.@*
+Corresponding string regexp: @samp{\'}
+
+@item @code{point}
+@cindex @code{point} in rx
+Match at point.@*
+Corresponding string regexp: @samp{\=}
+
+@item @code{word-start}
+@cindex @code{word-start} in rx
+Match at the beginning of a word.@*
+Corresponding string regexp: @samp{\<}
+
+@item @code{word-end}
+@cindex @code{word-end} in rx
+Match at the end of a word.@*
+Corresponding string regexp: @samp{\>}
+
+@item @code{word-boundary}
+@cindex @code{word-boundary} in rx
+Match at the beginning or end of a word.@*
+Corresponding string regexp: @samp{\b}
+
+@item @code{not-word-boundary}
+@cindex @code{not-word-boundary} in rx
+Match anywhere but at the beginning or end of a word.@*
+Corresponding string regexp: @samp{\B}
+
+@item @code{symbol-start}
+@cindex @code{symbol-start} in rx
+Match at the beginning of a symbol.@*
+Corresponding string regexp: @samp{\_<}
+
+@item @code{symbol-end}
+@cindex @code{symbol-end} in rx
+Match at the end of a symbol.@*
+Corresponding string regexp: @samp{\_>}
+@end table
+
+@subsubheading Capture groups
+
+@table @code
+@item (group @var{rx}@dots{})
+@cindex @code{group} in rx
+@itemx (submatch @var{rx}@dots{})
+@cindex @code{submatch} in rx
+Match the @var{rx}s, making the matched text and position accessible
+in the match data.  The first group in a regexp is numbered 1;
+subsequent groups will be numbered one higher than the previous
+group.@*
+Corresponding string regexp: @samp{\(@dots{}\)}
+
+@item (group-n @var{n} @var{rx}@dots{})
+@cindex @code{group-n} in rx
+@itemx (submatch-n @var{n} @var{rx}@dots{})
+@cindex @code{submatch-n} in rx
+Like @code{group}, but explicitly assign the group number @var{n}.
+@var{n} must be positive.@*
+Corresponding string regexp: @samp{\(?@var{n}:@dots{}\)}
+
+@item (backref @var{n})
+@cindex @code{backref} in rx
+Match the text previously matched by group number @var{n}.
+@var{n} must be in the range 1--9.@*
+Corresponding string regexp: @samp{\@var{n}}
+@end table
+
+@subsubheading Dynamic inclusion
+
+@table @code
+@item (literal @var{expr})
+@cindex @code{literal} in rx
+Match the literal string that is the result from evaluating the Lisp
+expression @var{expr}.  The evaluation takes place at call time, in
+the current lexical environment.
+
+@item (regexp @var{expr})
+@cindex @code{regexp} in rx
+@itemx (regex @var{expr})
+@cindex @code{regex} in rx
+Match the string regexp that is the result from evaluating the Lisp
+expression @var{expr}.  The evaluation takes place at call time, in
+the current lexical environment.
+
+@item (eval @var{expr})
+@cindex @code{eval} in rx
+Match the rx form that is the result from evaluating the Lisp
+expression @var{expr}.  The evaluation takes place at macro-expansion
+time for @code{rx}, at call time for @code{rx-to-string},
+in the current global environment.
+@end table
+
+@node Rx Functions
+@subsubsection Functions and macros using @code{rx} regexps
+
+@defmac rx rx-expr@dots{}
+Translate the @var{rx-expr}s to a string regexp, as if they were the
+body of a @code{(seq @dots{})} form.  The @code{rx} macro expands to a
+string constant, or, if @code{literal} or @code{regexp} forms are
+used, a Lisp expression that evaluates to a string.
+@end defmac
+
+@defun rx-to-string rx-expr &optional no-group
+Translate @var{rx-expr} to a string regexp which is returned.
+If @var{no-group} is absent or nil, bracket the result in a
+non-capturing group, @samp{\(?:@dots{}\)}, if necessary to ensure that
+a postfix operator appended to it will apply to the whole expression.
+
+Arguments to @code{literal} and @code{regexp} forms in @var{rx-expr}
+must be string literals.
+@end defun
+
+The @code{pcase} macro can use @code{rx} expressions as patterns
+directly; @pxref{rx in pcase}.
+@end ifnottex
+
 @node Regexp Functions
 @subsection Regular Expression Functions
 
-- 
2.20.1 (Apple Git-117)


[-- Attachment #3: 0001-Shorter-rx-doc-string-bug-36496.patch --]
[-- Type: application/octet-stream, Size: 16805 bytes --]

From 48b2baa78e62a97ca2d621c5be643e6b539d78bf Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mattias=20Engdeg=C3=A5rd?= <mattiase@acm.org>
Date: Sat, 6 Jul 2019 13:22:15 +0200
Subject: [PATCH] Shorter `rx' doc string (bug#36496)

* lisp/emacs-lisp/rx.el (rx): Replace long description with a condensed
summary of the rx syntax, with reference to the manual section.
---
 lisp/emacs-lisp/rx.el | 416 ++++++++++--------------------------------
 1 file changed, 95 insertions(+), 321 deletions(-)

diff --git a/lisp/emacs-lisp/rx.el b/lisp/emacs-lisp/rx.el
index 24dd6cbf1d..e65460c39d 100644
--- a/lisp/emacs-lisp/rx.el
+++ b/lisp/emacs-lisp/rx.el
@@ -959,327 +959,101 @@ rx-to-string
 ;;;###autoload
 (defmacro rx (&rest regexps)
   "Translate regular expressions REGEXPS in sexp form to a regexp string.
-REGEXPS is a non-empty sequence of forms of the sort listed below.
-
-Note that `rx' is a Lisp macro; when used in a Lisp program being
-compiled, the translation is performed by the compiler.  The
-`literal' and `regexp' forms accept subforms that will evaluate
-to strings, in addition to constant strings.  If REGEXPS include
-such forms, then the result is an expression which returns a
-regexp string, rather than a regexp string directly.  See
-`rx-to-string' for performing translation completely at run time.
-
-The following are valid subforms of regular expressions in sexp
-notation.
-
-STRING
-     matches string STRING literally.
-
-CHAR
-     matches character CHAR literally.
-
-`not-newline', `nonl'
-     matches any character except a newline.
-
-`anything'
-     matches any character
-
-`(any SET ...)'
-`(in SET ...)'
-`(char SET ...)'
-     matches any character in SET ....  SET may be a character or string.
-     Ranges of characters can be specified as `A-Z' in strings.
-     Ranges may also be specified as conses like `(?A . ?Z)'.
-     Reversed ranges like `Z-A' and `(?Z . ?A)' are not permitted.
-
-     SET may also be the name of a character class: `digit',
-     `control', `hex-digit', `blank', `graph', `print', `alnum',
-     `alpha', `ascii', `nonascii', `lower', `punct', `space', `upper',
-     `word', or one of their synonyms.
-
-`(not (any SET ...))'
-     matches any character not in SET ...
-
-`line-start', `bol'
-     matches the empty string, but only at the beginning of a line
-     in the text being matched
-
-`line-end', `eol'
-     is similar to `line-start' but matches only at the end of a line
-
-`string-start', `bos', `bot'
-     matches the empty string, but only at the beginning of the
-     string being matched against.
-
-`string-end', `eos', `eot'
-     matches the empty string, but only at the end of the
-     string being matched against.
-
-`buffer-start'
-     matches the empty string, but only at the beginning of the
-     buffer being matched against.  Actually equivalent to `string-start'.
-
-`buffer-end'
-     matches the empty string, but only at the end of the
-     buffer being matched against.  Actually equivalent to `string-end'.
-
-`point'
-     matches the empty string, but only at point.
-
-`word-start', `bow'
-     matches the empty string, but only at the beginning of a word.
-
-`word-end', `eow'
-     matches the empty string, but only at the end of a word.
-
-`word-boundary'
-     matches the empty string, but only at the beginning or end of a
-     word.
-
-`(not word-boundary)'
-`not-word-boundary'
-     matches the empty string, but not at the beginning or end of a
-     word.
-
-`symbol-start'
-     matches the empty string, but only at the beginning of a symbol.
-
-`symbol-end'
-     matches the empty string, but only at the end of a symbol.
-
-`digit', `numeric', `num'
-     matches 0 through 9.
-
-`control', `cntrl'
-     matches any character whose code is in the range 0-31.
-
-`hex-digit', `hex', `xdigit'
-     matches 0 through 9, a through f and A through F.
-
-`blank'
-     matches horizontal whitespace, as defined by Annex C of the
-     Unicode Technical Standard #18.  In particular, it matches
-     spaces, tabs, and other characters whose Unicode
-     `general-category' property indicates they are spacing
-     separators.
-
-`graphic', `graph'
-     matches graphic characters--everything except whitespace, ASCII
-     and non-ASCII control characters, surrogates, and codepoints
-     unassigned by Unicode.
-
-`printing', `print'
-     matches whitespace and graphic characters.
-
-`alphanumeric', `alnum'
-     matches alphabetic characters and digits.  For multibyte characters,
-     it matches characters whose Unicode `general-category' property
-     indicates they are alphabetic or decimal number characters.
-
-`letter', `alphabetic', `alpha'
-     matches alphabetic characters.  For multibyte characters,
-     it matches characters whose Unicode `general-category' property
-     indicates they are alphabetic characters.
-
-`ascii'
-     matches ASCII (unibyte) characters.
-
-`nonascii'
-     matches non-ASCII (multibyte) characters.
-
-`lower', `lower-case'
-     matches anything lower-case, as determined by the current case
-     table.  If `case-fold-search' is non-nil, this also matches any
-     upper-case letter.
-
-`upper', `upper-case'
-     matches anything upper-case, as determined by the current case
-     table.  If `case-fold-search' is non-nil, this also matches any
-     lower-case letter.
-
-`punctuation', `punct'
-     matches punctuation.  (But at present, for multibyte characters,
-     it matches anything that has non-word syntax.)
-
-`space', `whitespace', `white'
-     matches anything that has whitespace syntax.
-
-`word', `wordchar'
-     matches anything that has word syntax.
-
-`not-wordchar'
-     matches anything that has non-word syntax.
-
-`(syntax SYNTAX)'
-     matches a character with syntax SYNTAX.  SYNTAX must be one
-     of the following symbols, or a symbol corresponding to the syntax
-     character, e.g. `\\.' for `\\s.'.
-
-     `whitespace'		(\\s- in string notation)
-     `punctuation'		(\\s.)
-     `word'			(\\sw)
-     `symbol'			(\\s_)
-     `open-parenthesis'		(\\s()
-     `close-parenthesis'	(\\s))
-     `expression-prefix'	(\\s')
-     `string-quote'		(\\s\")
-     `paired-delimiter'		(\\s$)
-     `escape'			(\\s\\)
-     `character-quote'		(\\s/)
-     `comment-start'		(\\s<)
-     `comment-end'		(\\s>)
-     `string-delimiter'		(\\s|)
-     `comment-delimiter'	(\\s!)
-
-`(not (syntax SYNTAX))'
-     matches a character that doesn't have syntax SYNTAX.
-
-`(category CATEGORY)'
-     matches a character with category CATEGORY.  CATEGORY must be
-     either a character to use for C, or one of the following symbols.
-
-     `space-for-indent'                 (\\c\\s in string notation)
-     `base'                             (\\c.)
-     `consonant'			(\\c0)
-     `base-vowel'			(\\c1)
-     `upper-diacritical-mark'		(\\c2)
-     `lower-diacritical-mark'		(\\c3)
-     `tone-mark'		        (\\c4)
-     `symbol'			        (\\c5)
-     `digit'			        (\\c6)
-     `vowel-modifying-diacritical-mark'	(\\c7)
-     `vowel-sign'			(\\c8)
-     `semivowel-lower'			(\\c9)
-     `not-at-end-of-line'		(\\c<)
-     `not-at-beginning-of-line'		(\\c>)
-     `alpha-numeric-two-byte'		(\\cA)
-     `chinese-two-byte'			(\\cC)
-     `greek-two-byte'			(\\cG)
-     `japanese-hiragana-two-byte'	(\\cH)
-     `indian-two-byte'			(\\cI)
-     `japanese-katakana-two-byte'	(\\cK)
-     `strong-left-to-right'             (\\cL)
-     `korean-hangul-two-byte'		(\\cN)
-     `strong-right-to-left'             (\\cR)
-     `cyrillic-two-byte'		(\\cY)
-     `combining-diacritic'		(\\c^)
-     `ascii'				(\\ca)
-     `arabic'				(\\cb)
-     `chinese'				(\\cc)
-     `ethiopic'				(\\ce)
-     `greek'				(\\cg)
-     `korean'				(\\ch)
-     `indian'				(\\ci)
-     `japanese'				(\\cj)
-     `japanese-katakana'		(\\ck)
-     `latin'				(\\cl)
-     `lao'				(\\co)
-     `tibetan'				(\\cq)
-     `japanese-roman'			(\\cr)
-     `thai'				(\\ct)
-     `vietnamese'			(\\cv)
-     `hebrew'				(\\cw)
-     `cyrillic'				(\\cy)
-     `can-break'			(\\c|)
-
-`(not (category CATEGORY))'
-     matches a character that doesn't have category CATEGORY.
-
-`(and SEXP1 SEXP2 ...)'
-`(: SEXP1 SEXP2 ...)'
-`(seq SEXP1 SEXP2 ...)'
-`(sequence SEXP1 SEXP2 ...)'
-     matches what SEXP1 matches, followed by what SEXP2 matches, etc.
-     Without arguments, matches the empty string.
-
-`(submatch SEXP1 SEXP2 ...)'
-`(group SEXP1 SEXP2 ...)'
-     like `and', but makes the match accessible with `match-end',
-     `match-beginning', and `match-string'.
-
-`(submatch-n N SEXP1 SEXP2 ...)'
-`(group-n N SEXP1 SEXP2 ...)'
-     like `group', but make it an explicitly-numbered group with
-     group number N.
-
-`(or SEXP1 SEXP2 ...)'
-`(| SEXP1 SEXP2 ...)'
-     matches anything that matches SEXP1 or SEXP2, etc.  If all
-     args are strings, use `regexp-opt' to optimize the resulting
-     regular expression.  Without arguments, never matches anything.
-
-`(minimal-match SEXP)'
-     produce a non-greedy regexp for SEXP.  Normally, regexps matching
-     zero or more occurrences of something are \"greedy\" in that they
-     match as much as they can, as long as the overall regexp can
-     still match.  A non-greedy regexp matches as little as possible.
-
-`(maximal-match SEXP)'
-     produce a greedy regexp for SEXP.  This is the default.
-
-Below, `SEXP ...' represents a sequence of regexp forms, treated as if
-enclosed in `(and ...)'.
-
-`(zero-or-more SEXP ...)'
-`(0+ SEXP ...)'
-     matches zero or more occurrences of what SEXP ... matches.
-
-`(* SEXP ...)'
-     like `zero-or-more', but always produces a greedy regexp, independent
-     of `rx-greedy-flag'.
-
-`(*? SEXP ...)'
-     like `zero-or-more', but always produces a non-greedy regexp,
-     independent of `rx-greedy-flag'.
-
-`(one-or-more SEXP ...)'
-`(1+ SEXP ...)'
-     matches one or more occurrences of SEXP ...
-
-`(+ SEXP ...)'
-     like `one-or-more', but always produces a greedy regexp.
-
-`(+? SEXP ...)'
-     like `one-or-more', but always produces a non-greedy regexp.
-
-`(zero-or-one SEXP ...)'
-`(optional SEXP ...)'
-`(opt SEXP ...)'
-     matches zero or one occurrences of A.
-
-`(? SEXP ...)'
-     like `zero-or-one', but always produces a greedy regexp.
-
-`(?? SEXP ...)'
-     like `zero-or-one', but always produces a non-greedy regexp.
-
-`(repeat N SEXP)'
-`(= N SEXP ...)'
-     matches N occurrences.
-
-`(>= N SEXP ...)'
-     matches N or more occurrences.
-
-`(repeat N M SEXP)'
-`(** N M SEXP ...)'
-     matches N to M occurrences.
-
-`(backref N)'
-     matches what was matched previously by submatch N.
-
-`(literal STRING-EXPR)'
-     matches STRING-EXPR literally, where STRING-EXPR is any lisp
-     expression that evaluates to a string.
-
-`(regexp REGEXP-EXPR)'
-     include REGEXP-EXPR in string notation in the result, where
-     REGEXP-EXPR is any lisp expression that evaluates to a
-     string containing a valid regexp.
-
-`(eval FORM)'
-     evaluate FORM and insert result.  If result is a string,
-     `regexp-quote' it.  Note that FORM is evaluated during
-     macroexpansion."
+Each argument is one of the forms below; RX is a subform, and RX... stands
+for one or more RXs.  For details, see Info node `(elisp) Rx Notation'.
+See `rx-to-string' for the corresponding function.
+
+STRING         Match a literal string.
+CHAR           Match a literal character.
+
+(seq RX...)    Match the RXs in sequence.  Alias: :, sequence, and
+(or RX...)     Match one of the RXs.  Alias: |
+
+(zero-or-more RX...) Match RXs zero or more times.  Alias: *, 0+
+(one-or-more RX...)  Match RXs one or more times.  Alias: +, 1+
+(zero-or-one RX...)  Match RXs or the empty string.  Alias: ?, opt, optional
+(*? RX...)      Match RXs zero or more times, non-greedily.
+(+? RX...)      Match RXs one or more times, non-greedily.
+(?? RX...)      Match RXs or the empty string, non-greedily.
+(= N RX...)     Match RXs exactly N times.
+(>= N RX...)    Match RXs N or more times.
+(** N M RX...)  Match RXs N to M times.  Alias: repeat
+(minimal-match RX)  Match RX, with zero-or-more, one-or-more,
+                zero-or-one, 0+, 1+, opt, and optional
+                using non-greedy matching.
+(maximal-match RX)  Match RX, with zero-or-more, one-or-more,
+                zero-or-one, 0+, 1+, opt, and optional
+                using greedy matching, which is the default.
+
+(any SET...)    Match a character from one of the SETs.  Each SET is a
+                character, a string, a range as string \"A-Z\" or cons
+                (?A . ?Z), or a character class (see below).  Alias: in, char
+(not CHARSPEC)  Match one character not matched by CHARSPEC.  CHARSPEC
+                can be (any ...), (syntax ...), (category ...),
+                or a character class.
+not-newline     Match any character except a newline.  Alias: nonl
+anything        Match any character.
+
+CHARCLASS       Match a character from a character class.  One of:
+ alpha, alphabetic, letter   alphabetic characters (defined by Unicode)
+ alnum, alphanumeric         alphabetic or decimal digit chars (Unicode)
+ digit numeric, num          0-9
+ xdigit, hex-digit, hex      0-9, A-F, a-f
+ cntrl, control              ASCII codes 0-31
+ blank                       horizontal whitespace (Unicode)
+ space, whitespace, white    chars with whitespace syntax
+ lower, lower-case           lower-case chars, from current case table
+ upper, upper-case           upper-case chars, from current case table
+ graph, graphic              graphic characters (Unicode)
+ print, printing             whitespace or graphic (Unicode)
+ punct, punctuation          not control, space, letter or digit (ASCII);
+                              not word syntax (non-ASCII)
+ word, wordchar              characters with word syntax
+ ascii                       ASCII characters (codes 0-127)
+ nonascii                    non-ASCII characters (but not raw bytes)
+
+(syntax SYNTAX)  Match a character with syntax SYNTAX, being one of:
+  whitespace, punctuation, word, symbol, open-parenthesis,
+  close-parenthesis, expression-prefix, string-quote,
+  paired-delimiter, escape, character-quote, comment-start,
+  comment-end, string-delimiter, comment-delimiter
+
+(category CAT)   Match a character in category CAT, being one of:
+  space-for-indent, base, consonant, base-vowel,
+  upper-diacritical-mark, lower-diacritical-mark, tone-mark, symbol,
+  digit, vowel-modifying-diacritical-mark, vowel-sign,
+  semivowel-lower, not-at-end-of-line, not-at-beginning-of-line,
+  alpha-numeric-two-byte, chinese-two-byte, greek-two-byte,
+  japanese-hiragana-two-byte, indian-two-byte,
+  japanese-katakana-two-byte, strong-left-to-right,
+  korean-hangul-two-byte, strong-right-to-left, cyrillic-two-byte,
+  combining-diacritic, ascii, arabic, chinese, ethiopic, greek,
+  korean, indian, japanese, japanese-katakana, latin, lao,
+  tibetan, japanese-roman, thai, vietnamese, hebrew, cyrillic,
+  can-break
+
+Zero-width assertions: these all match the empty string in specific places.
+ line-start         at the beginning of a line.  Alias: bol
+ line-end           at the end of a line.  Alias: eol
+ string-start       at the start of the string or buffer.
+                     Alias: buffer-start, bos, bot
+ string-end         at the end of the string or buffer.
+                     Alias: buffer-end, eos, eot
+ point              at point.
+ word-start         at the beginning of a word.
+ word-end           at the end of a word.
+ word-boundary      at the beginning or end of a word.
+ not-word-boundary  not at the beginning or end of a word.
+ symbol-start       at the beginning of a symbol.
+ symbol-end         at the end of a symbol.
+
+(group RX...)  Match RXs and define a capture group.  Alias: submatch
+(group-n N RX...) Match RXs and define capture group N.  Alias: submatch-n
+(backref N)    Match the text that capture group N matched.
+
+(literal EXPR) Match the literal string from evaluating the EXPR at run time.
+(regexp EXPR)  Match the string regexp from evaluating EXPR at run time.
+(eval EXPR)    Match the rx sexp from evaluating EXPR at compile time."
   (let* ((rx--compile-to-lisp t)
          (re (cond ((null regexps)
                     (error "No regexp"))
-- 
2.20.1 (Apple Git-117)

next prev parent reply	other threads:[~2019-07-06 11:33 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-07-04 12:13 bug#36496: [PATCH] Describe the rx notation in the lisp manual Mattias Engdegård
2019-07-04 14:59 ` Drew Adams
2019-07-04 16:28 ` Eli Zaretskii
2019-07-05 14:13   ` Mattias Engdegård
2019-07-06  9:08     ` Eli Zaretskii
2019-07-06 11:33       ` Mattias Engdegård [this message]
2019-07-06 11:41         ` Eli Zaretskii
2019-07-06 18:56           ` Mattias Engdegård
2019-07-06 19:10             ` Eli Zaretskii
2019-07-06 19:45               ` Mattias Engdegård
2019-07-07  2:29                 ` Eli Zaretskii
2019-07-07 11:31                   ` Mattias Engdegård
2019-07-07 14:33                     ` Eli Zaretskii
2022-04-25 15:12                     ` Lars Ingebrigtsen
2019-07-06 19:12             ` Noam Postavsky
2019-07-06 11:59         ` Noam Postavsky
2019-07-06 23:56         ` Richard Stallman
2019-07-06  0:10   ` Richard Stallman
2019-07-06  6:47     ` Eli Zaretskii
2019-07-06 23:59       ` Richard Stallman
2019-07-07  0:36         ` Drew Adams
2019-07-07 23:51           ` Richard Stallman
2019-07-08  0:56             ` Drew Adams
2019-07-08 23:46               ` Richard Stallman
2019-07-09  0:19                 ` Drew Adams
2019-07-08 23:44             ` Richard Stallman

find likely ancestor, descendant, or conflicting patches for this message:
dfblob:e308d68b7 dfblob:ef1cffc44 dfblob:24dd6cbf1 dfblob:de6cd9301
dfblob:17c4790f5 dfblob:e65460c39
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://www.gnu.org/software/emacs/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=F6C1EF17-20F5-4551-B9CE-6C1D6C404EBD@acm.org \
    --to=mattiase@acm.org \
    --cc=36496@debbugs.gnu.org \
    --cc=eliz@gnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).