* bug#41970: Suggestions for corrections to Emacs and Elisp manuals
@ 2020-06-20 20:44 Jay Bingham
2020-06-20 21:50 ` Drew Adams
2022-05-09 11:39 ` Lars Ingebrigtsen
0 siblings, 2 replies; 3+ messages in thread
From: Jay Bingham @ 2020-06-20 20:44 UTC (permalink / raw)
To: 41970
Information about the operators and constructs used to create regular
expressions is contained in two locations in the Info manuals, one in
the Emacs manual (section _15.6 Syntax of Regular Expressions_), the
other in the Elisp manual (section _34.3.1.1 Special Characters in
Regular Expressions_). The first paragraph in section 15.6 of the Emacs
manual provides the justification for maintaining two versions of the
material, even though the two versions contain mostly the same
information. There are legitimate differences; however, not all of the
differences can be attributed to the "features used mainly in Lisp
programs". Here are differences that I have noticed, which I believe
should not be differences.
Section _15.6 Syntax of Regular Expressions_ of the Emacs manual contains
descriptions of the postfix repetition operators ‘\{N\}’ and ‘\{N,M\}’.
These operators are not described in section 34.3.1.1 of the Elisp manual,
but are described in section _34.3.1.3 Backslash Constructs in Regular
Expressions_, where they are defined as ‘\{M\}’ and ‘\{M,N\}’. Since the
Emacs manual also has a section for backslash constructs, _15.7
Backslash in Regular Expressions_, moving the descriptions of the
postfix repetition operators to section 15.7 and naming them as they are
named in the Elisp manual would contribute greatly to the consistency of
the two manuals. Additionally, the description of ‘\{M,N\}’ in the Elisp
manual contains information not included in the Emacs manual version
that would be appropriate to include there.
The terminology used in section _15.6 Syntax of Regular Expressions_ to
describe and discuss the ‘[ ... ]’ and ‘[^ ... ]’ constructs is
inconsistent. The first
paragraph and the final paragraph in the section both refer to these
constructs as "a character alternative", while the paragraphs describing
them call them a “character set”. In section 34.3.1.1 of the Elisp
manual the phrase used consistently to describe them and refer to them is
"a character alternative". It would increase the consistency of both
manuals to use the same terminology to describe and refer to these
constructs. A more grammatically correct phrase to describe these
features would be "a set of alternative characters" (but when have
programming nerds ever been that concerned with grammatical
correctness). Whatever phrase is used to describe and refer to these
constructs, it should be consistent throughout both manuals (the
introduction to section _34.3.1.2 Character Classes_ in the Elisp manual
included).
In both section _15.6 Syntax of Regular Expressions_ and section
_34.3.1.1 Special Characters in Regular Expressions_, near the end of each
section, is a paragraph which contains the sentence:
As a ‘\’ is not special inside a character alternative, it can never
remove the special meaning of ‘-’ or ‘]’.
In both sections, in the description of the ‘[ ... ]’ construct, is a
sentence which states that the characters ‘]’, ‘-’ and ‘^’ are special
inside character alternatives.
Shouldn't the sentences found in both sections that are cited
above include the ‘^’ character?
The construct ‘\(?NUM: ... \)’ that is described in the Elisp manual,
section _34.3.1.3 Backslash Constructs in Regular Expressions_, is not
included in the Emacs manual section _15.7 Backslash in Regular
Expressions_; it should be. However, the description of the construct in
section 34.3.1.3 should be modified to make it clear that only the
digits 1 through 9 can be used as NUM. Here is a suggestion for doing that:
‘\(?DIGIT:...\)’
is the explicitly numbered group construct. Normal groups get their
number implicitly, based on their position, which can be inconvenient.
This construct allows a specific group number (limited to the digits 1
through 9, see: ‘\DIGIT’ construct) to be assigned to the group
construct. There is no particular restriction on the numbering, e.g.,
several groups can have the same number in which case the last one to
match (i.e., the rightmost match) will be recorded. Implicitly numbered
groups always get the smallest integer larger than the largest one of
any previous group.
In the Emacs manual section _15.7 Backslash in Regular Expressions_ in
the description of the ‘\D’ construct the following sentence in the
second paragraph is misleading:
Then, later on in the regular expression, you can use ‘\’ followed by
the digit D to mean “match the same text matched the Dth time by the ‘\(
... \)’ construct”.
This does not agree with the description in the paragraphs that surround
it nor with the description of the construct in the Elisp manual,
section _34.3.1.3 Backslash Constructs in Regular Expressions_. This is
not an error introduced in version 26; it has been present since at
least version 23. It should read:
Then, later on in the regular expression, ‘\’ followed by the digit D
can be used to mean “match the same text matched by the Dth ‘\( ... \)’
construct”.
In section _15.7 Backslash in Regular Expressions_ of the Emacs manual
the descriptions for the constructs ‘\`’, ‘\'’, ‘\=’, ‘\b’, ‘\B’, ‘\<’,
‘\>’, ‘\w’, ‘\W’, ‘\_<’, ‘\_>’, ‘\sC’, ‘\SC’, ‘\cC’ and ‘\CC’ appear in
the order shown here, while in section _34.3.1.3 Backslash Constructs in
Regular Expressions_ of the Elisp manual they appear in the following
order: ‘\w’, ‘\W’, ‘\sCODE’, ‘\SCODE’, ‘\cC’, ‘\CC’, ‘\`’, ‘\'’, ‘\=’,
‘\b’, ‘\B’, ‘\<’, ‘\>’, ‘\_<’ and ‘\_>’, which groups together the
constructs that match characters and, separately, those that match empty
strings relative to positions. This grouping makes much more sense than
the apparently haphazard order used in the Emacs manual. The order in the
Emacs manual should match that of the Elisp manual.
Also in section _34.3.1.3 Backslash Constructs in Regular Expressions_
of the Elisp manual, the four constructs having placeholders, ‘\sCODE’,
‘\SCODE’, ‘\cC’ and ‘\CC’, do not use the same convention for
specifying the placeholders. Either the constructs ‘\sCODE’ and ‘\SCODE’
should be written as ‘\sC’ and ‘\SC’ or the constructs ‘\cC’ and ‘\CC’
should be written as ‘\cCODE’ and ‘\CCODE’, making the convention
consistent throughout the section. The same convention should be used in
both the Emacs manual and the Elisp manual in all constructs where
placeholders occur. I prefer the use of a mnemonic as a placeholder over the
use of a single character.
Adopting this convention would necessitate changing the ‘\{M\}’,
‘\{M,N\}’ and ‘\D’ constructs as well. I suggest the following: ‘\{NUM\}’,
‘\{MIN,MAX\}’ and ‘\DIGIT’. I prefer the convention used in the online
version of the Elisp manual where placeholders are shown in lowercase
italics. I do not know if that is possible to do or if it would conflict
with the convention of showing placeholders in all caps that is used in
function descriptions. Since it is possible to cause links to files and
the names of variables to be displayed differently in function
descriptions, it should not be difficult to define a mechanism for
displaying placeholders in italics in function descriptions.
In section _34.3.1.3 Backslash Constructs in Regular Expressions_ of the
Elisp manual, in the paragraph that introduces the regular expression
constructs that match the empty string, the word ‘consume’ would be more
appropriate than the phrase ‘use up’.
The format of the descriptions in section _34.3.1.3 Backslash Constructs
in Regular Expressions_ of the Elisp manual is not consistent. I offer
the following, to which I have attempted to add some consistency by
stating the name of the operator/construct and then describing how it is
used. The corrections and improvements mentioned above are incorporated
into what follows.
For the most part, ‘\’ followed by any character matches only that
character. However, there are several exceptions: two-character
sequences starting with ‘\’ that have special meanings. The second
character in the sequence is always an ordinary character when used on
its own. Here are the ‘\’ operators and constructs.
‘\|’
is the alternative operator. Two regular expressions A and B with ‘\|’
between them form an expression that matches either the text matched by A or
the text matched by B.
Thus, ‘foo\|bar’ matches either ‘foo’ or ‘bar’ but no other string.
‘\|’ applies to the largest possible surrounding expressions. Only a
surrounding ‘\( … \)’ grouping can limit the grouping power of ‘\|’.
When full backtracking capability is needed to handle multiple uses of
‘\|’, use the POSIX regular expression functions (see POSIX Regexps in
the Elisp manual).
‘\{/num/\}’
is the postfix number of repetitions operator. It specifies the exact
number of consecutive repetitions that the preceding regular expression
must match. For example, ‘x\{4\}’ matches only the string ‘xxxx’;
‘c[ad]\{3\}r’ matches only the eight valid strings that can be created
with two characters in three places, that is the strings: ‘caaar’,
‘caadr’, ‘cadar’, ‘caddr’, ‘cdaar’, ‘cdadr’, ‘cddar’, ‘cdddr’.
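As a rough cross-check of the example above, the eight strings can be enumerated programmatically. This is only a sketch in Python rather than Emacs Lisp; it assumes Python's `re` syntax, where the operator is written `{3}` without backslashes, and the variable names are made up for illustration.

```python
import itertools
import re

# Python's `re` spells the Emacs operator \{3\} as {3} (no backslashes).
pattern = re.compile(r"c[ad]{3}r")

# Every string of the form c???r with each ? drawn from "a" or "d".
candidates = ["c" + "".join(mid) + "r" for mid in itertools.product("ad", repeat=3)]
matches = [s for s in candidates if pattern.fullmatch(s)]

print(matches)
# A string with only two repetitions is rejected.
print(pattern.fullmatch("caar"))
```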
‘\{/min/,/max/\}’
is the postfix range of repetitions operator. It specifies the range of
consecutive repetitions, between /min/ and /max/, that the preceding regular
expression must match, i.e. at least /min/ times, but no more than
/max/ times. If /min/ is omitted, the minimum is 0, so the preceding
regular expression may match at most /max/ times; if /max/ is omitted,
there is no maximum.
‘\{0,1\}’ or ‘\{,1\}’ is equivalent to ‘?’.
‘\{0,\}’ or ‘\{,\}’ is equivalent to ‘*’.
‘\{1,\}’ is equivalent to ‘+’.
For example, ‘c[ad]\{1,2\}r’ matches only the strings: ‘car’, ‘cdr’,
‘caar’, ‘cadr’, ‘cdar’, and ‘cddr’.
The maximum value allowed for /num/, /min/ and /max/ is 2**15 − 1.
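The range example and the equivalences with ‘?’, ‘*’ and ‘+’ can be spot-checked the same way. Again this is a Python sketch, not Emacs Lisp: Python writes the operator as `{1,2}`, and only the forms with an explicit minimum are compared here, because the forms with an omitted minimum are not assumed to behave identically in both engines. The helper `same_matches` and the sample strings are hypothetical.

```python
import re

# Python's `re` writes the Emacs operator \{1,2\} as {1,2}.
range_pattern = re.compile(r"c[ad]{1,2}r")

words = ["cr", "car", "cdr", "caar", "cadr", "cdar", "cddr", "caddr"]
accepted = [w for w in words if range_pattern.fullmatch(w)]


def same_matches(a, b, samples=("", "x", "xx", "xxx", "y")):
    """Return True if patterns a and b accept exactly the same sample strings."""
    return all(bool(re.fullmatch(a, s)) == bool(re.fullmatch(b, s)) for s in samples)


print(accepted)
print(same_matches(r"x{0,1}", r"x?"))
print(same_matches(r"x{0,}", r"x*"))
print(same_matches(r"x{1,}", r"x+"))
```

Agreement on a handful of samples is only evidence, not proof, that two patterns are equivalent.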
‘\( … \)’
is the grouping construct that serves three purposes:
1. To enclose a set of ‘\|’ alternatives for other operations. Thus,
‘\(foo\|bar\)x’ matches either ‘foox’ or ‘barx’.
2. To enclose a complicated expression for the postfix operators ‘*’,
‘+’ and ‘?’ to operate on. Thus, ‘ba\(na\)*’ matches ‘bananana’,
etc., with any number of (zero or more) ‘na’ strings.
3. To record a matched substring for future reference with ‘\/digit/’
(described below).
This last application is not a consequence of the idea of a
parenthetical grouping; it is a separate feature that is assigned as a
second meaning to the same ‘\( … \)’ construct. In practice there is
usually no conflict between the two meanings; when there is a conflict,
a “shy” group (described below) can be used.
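The three purposes of the grouping construct can be illustrated with a short sketch. It is written in Python, so the group is spelled `( )` and the alternative operator `|` rather than ‘\( … \)’ and ‘\|’; the variable names are invented for the example.

```python
import re

# Emacs \(foo\|bar\)x corresponds to (foo|bar)x in Python's `re`.
alt = re.compile(r"(foo|bar)x")
# Emacs ba\(na\)* corresponds to ba(na)*.
rep = re.compile(r"ba(na)*")

# Purpose 1: the group limits the reach of the alternative operator.
alt_results = {s: bool(alt.fullmatch(s)) for s in ["foox", "barx", "foo", "x"]}

# Purpose 2: the postfix operator applies to the whole group.
rep_results = {s: bool(rep.fullmatch(s)) for s in ["ba", "bana", "bananana", "ban"]}

# Purpose 3: the group also records the substring it matched.
recorded = alt.fullmatch("barx").group(1)

print(alt_results)
print(rep_results)
print(recorded)
```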
‘\(?: … \)’
is the “shy” group construct. A shy group serves the first two purposes
of an ordinary group (controlling the nesting of other operators), but
it does not record the matched substring; it can’t be referred back to
with the ‘\/digit/’ construct (see below). This is useful in mechanically
combining regular expressions, so that groups can be added for syntactic
purposes without interfering with the numbering of the groups that are
meant to be referred to.
‘\(?/digit/: … \)’
is the explicitly numbered group construct. Normal groups get their
number implicitly, based on their position, which can be inconvenient.
This construct allows a specific group number (limited to the digits 1
through 9, see: ‘\/digit/’ construct) to be assigned to the group
construct. There is no particular restriction on the numbering, e.g.,
several groups can have the same number in which case the last one to
match (i.e., the rightmost match) will be recorded. Implicitly numbered
groups always get the smallest integer larger than the largest one of
any previous group.
‘\/digit/’
is the back reference operator. It matches the same text that matched
the /digit/th occurrence of a ‘\( … \)’ construct.
After the end of a ‘\( … \)’ construct, the matcher remembers the
beginning and end of the text matched by that construct. Later in the
regular expression, ‘\’ followed by the /digit/ can be used to match the
same text matched by the /digit/th ‘\( … \)’ construct.
The strings matching the first nine ‘\( … \)’ constructs appearing in a
regular expression are assigned numbers 1 through 9 in the order that
the open-parentheses appear in the regular expression. So ‘\1’ through
‘\9’ can be used to refer to the text matched by the corresponding ‘\( …
\)’ constructs.
For example, ‘\(.*\)\1’ matches any newline-free string that is composed
of two identical halves. The ‘\(.*\)’ matches the first half, which may
be anything, but the ‘\1’ that follows must match the same exact text.
If a ‘\( … \)’ construct matches more than once (which can easily happen
if it is followed by ‘*’), only the last match is recorded.
If a particular grouping construct in the regular expression was never
matched—for instance, if it appears inside of an alternative that wasn’t
used, or inside of a repetition that repeated zero times—then the
corresponding ‘\/digit/’ construct never matches anything. For example,
the regexp ‘\(foo\(b*\)\|lose\)\2’ cannot match ‘lose’ because the
second alternative inside the larger group matches it, which results in
‘\2’ being undefined and unable to match anything. It can match ‘foobb’,
because the first alternative matches ‘foob’ and ‘\2’ matches the second
‘b’.
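The two back reference examples above can be tried out in a small sketch. It uses Python's `re` module as a stand-in, on the assumption that its back references behave the same way for these particular patterns; the Emacs spellings are given in the comments, and the variable names are hypothetical.

```python
import re

# Emacs \(.*\)\1 corresponds to (.*)\1 in Python's `re`.
halves = re.compile(r"(.*)\1")
# Emacs \(foo\(b*\)\|lose\)\2 corresponds to (foo(b*)|lose)\2.
nested = re.compile(r"(foo(b*)|lose)\2")

# Two identical halves match; a string that cannot be split that way does not.
print(bool(halves.fullmatch("abcabc")))
print(bool(halves.fullmatch("abcabd")))

# Group 2 never takes part when the second alternative is used,
# so the back reference has nothing to match.
print(nested.fullmatch("lose"))

# For "foobb", group 2 records one "b" and \2 matches the other.
m = nested.fullmatch("foobb")
print(m.group(2))
```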
The following operators pertaining to words and syntax are controlled by
the setting of the syntax table (/See:/_Table of Syntax Classes_).
‘\w’
is the word-constituent operator, it matches any word-constituent
character. The syntax table determines which characters these
are. (/See:/_Table of Syntax Classes_)
‘\W’
is the non-word-constituent operator, it matches any character that is
not a word-constituent. (/See:/_Table of Syntax Classes_)
‘\s/code/’
is the syntax class operator, it matches any character whose syntax is
/code/. Here /code/ is a character that designates a particular syntax
class: thus, ‘w’ for word constituent, ‘-’ or ‘ ’ for whitespace, ‘.’ for
ordinary punctuation, etc. (/See:/_Table of Syntax Classes_)
‘\S/code/’
is the non syntax class operator, it matches any character whose syntax
is not /code/. (/See:/_Table of Syntax Classes_)
‘\c/code/’
is the character category operator, it matches any character that
belongs to the category /code/. For example, ‘\cc’ matches Chinese
characters, ‘\cg’ matches Greek characters, etc. For the description of
the known categories, type ‘M-x describe-categories <RET>’. (/See
also:/_Category Characters_)
‘\C/code/’
is the non character category operator, it matches any character that
does _not_ belong to category /code/. (/See:/_Category Characters_)
The following regular expression constructs match the empty string—that
is, they don't consume any characters—but whether they match depends on
the context. For all, the beginning and end of the accessible portion of
the buffer are treated as if they were the actual beginning and end of
the buffer.
‘\`’
is the beginning of string operator, it matches the empty string, but
only at the beginning of the string or buffer (or its accessible
portion) being matched against.
‘\'’
is the end of string operator, it matches the empty string, but only at
the end of the string or buffer (or its accessible portion) being
matched against.
‘\=’
is the at point operator, it matches the empty string, but only at point.
‘\b’
is the beginning or end of word operator, it matches the empty string,
but only at the beginning or end of a word. Thus, ‘\bfoo\b’ matches any
occurrence of ‘foo’ as a separate word. ‘\bballs?\b’ matches ‘ball’ or
‘balls’ as a separate word.
‘\b’ matches at the beginning or end of the buffer regardless of what
text appears next to it.
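A brief sketch of the ‘\bballs?\b’ example follows, in Python. It is only an approximation of the Emacs behavior: Python decides what counts as a word character from its own `\w` definition rather than from a syntax table, and its `\b` does not follow the buffer-edge rule stated above. The sample strings are invented for the illustration.

```python
import re

# The pattern is spelled the same way in Python's `re` as in Emacs.
word = re.compile(r"\bballs?\b")

samples = ["ball", "two balls here", "football", "ballsy", "ball-park"]
found = {s: bool(word.search(s)) for s in samples}

for text, hit in found.items():
    print(f"{text!r}: {hit}")
```

Here ‘football’ and ‘ballsy’ are rejected because ‘ball’ or ‘balls’ appears only as part of a longer word.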
‘\B’
is the middle of word operator, it matches the empty string, but _not_ at
the beginning or end of a word.
‘\<’
is the beginning of word operator, it matches the empty string, but only
at the beginning of a word; furthermore, ‘\<’ matches at the beginning
of the buffer only if a word-constituent character follows.
‘\>’
is the end of word operator, it matches the empty string, but only at
the end of a word; furthermore, ‘\>’ matches at the end of the buffer
only if the contents end with a word-constituent character.
‘\_<’
is the beginning of symbol operator, it matches the empty string, but
only at the beginning of a symbol. A symbol is a sequence of one or more
symbol-constituent characters. A symbol-constituent character is a
character whose syntax is either ‘w’ or ‘_’. It matches at the beginning
of the buffer only if a symbol-constituent character immediately follows
the beginning of the buffer. As with words, the syntax table determines
which characters are symbol-constituent.
‘\_>’
is the end of symbol operator, it matches the empty string, but only at
the end of a symbol. It matches at the end of the buffer only if a
symbol-constituent character immediately precedes the end of the buffer.
Not every string is a valid regular expression. For example, a string
that ends inside a set of alternative characters without a terminating
‘]’ is invalid, and so is a string that ends with a single ‘\’. If an
invalid regular expression is passed to any of the search functions, an
invalid-regexp error is signaled.
J C Bingham
- Georgetown, TX USA -
* bug#41970: Suggestions for corrections to Emacs and Elisp manuals
From: Drew Adams @ 2020-06-20 21:50 UTC (permalink / raw)
To: Jay Bingham, 41970
> The terminology used in section 15.6 Syntax of Regular Expressions to describe and discuss the ‘[ ... ]’ and ‘[^ ... ]’ constructs. The first paragraph and the final paragraph in the section both refer to these constructs as "a character alternative", while the paragraphs describing them call them a “character set”. In section 34.3.1.1 of the Elisp manual the phrase used consistently to describe them and refer to them is "a character alternative".
> It would increase the consistency of both manuals to use the same terminology to describe and refer to these constructs. A more grammatically correct phrase to describe these features would be "a set of alternative characters" (but when have programming nerds ever been that concerned with grammatical correctness).
A nit:
These references refer to the syntax construct [...], and not to the set of chars that it represents. It is wrong to call this construct "a character set", and it would be wrong to call it "a set of alternative characters". What it _matches_, or represents, is any _one_ char of a set of alternative chars. But the syntax construct is not a set of chars.
* bug#41970: Suggestions for corrections to Emacs and Elisp manuals
From: Lars Ingebrigtsen @ 2022-05-09 11:39 UTC (permalink / raw)
To: Jay Bingham; +Cc: 41970
Jay Bingham <binghamjc@msn.com> writes:
> Here are differences that I have noticed, which I believe should not
> be differences.
(I'm going through old bug reports that unfortunately weren't resolved
at the time.)
Thanks for the suggested improvements -- I've now adjusted these
sections in the manuals for Emacs 29 (where I agreed with the
suggestions).
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no