unofficial mirror of bug-gnu-emacs@gnu.org 
* bug#41970: Suggestions for corrections to Emacs and Elisp manuals
@ 2020-06-20 20:44 Jay Bingham
  2020-06-20 21:50 ` Drew Adams
  2022-05-09 11:39 ` Lars Ingebrigtsen
  0 siblings, 2 replies; 3+ messages in thread
From: Jay Bingham @ 2020-06-20 20:44 UTC (permalink / raw)
  To: 41970


Information about the operators and constructs used to create regular
expressions is contained in two locations in the Info manuals, one in
the Emacs manual (section _15.6 Syntax of Regular Expressions_), the
other in the Elisp manual (section _34.3.1.1 Special Characters in
Regular Expressions_). The first paragraph in section 15.6 of the Emacs
manual provides the justification for maintaining two versions of the
material, even though the two versions contain mostly the same
information. There are legitimate differences; however, not all of the
differences can be attributed to the "features used mainly in Lisp
programs". Here are differences that I have noticed, which I believe
should not be differences.

Section _15.6 Syntax of Regular Expressions_ of the Emacs manual contains
descriptions of the postfix repetition operators ‘\{N\}’ and ‘\{N,M\}’.
These operators are not described in section 34.3.1.1 of the Elisp
manual, but are described in section _34.3.1.3 Backslash Constructs in
Regular Expressions_, where they are defined as ‘\{M\}’ and ‘\{M,N\}’.
Since the Emacs manual also has a section for backslash constructs,
_15.7 Backslash in Regular Expressions_, moving the descriptions of the
postfix repetition operators to section 15.7 and naming them as they are
named in the Elisp manual would contribute greatly to the consistency of
the two manuals. Additionally, the description of ‘\{M,N\}’ in the Elisp
manual contains information not included in the Emacs manual version
that would be appropriate to include there.

The terminology used in section _15.6 Syntax of Regular Expressions_ to
describe and discuss the ‘[ ... ]’ and ‘[^ ... ]’ constructs is
inconsistent. The first paragraph and the final paragraph in the section
both refer to these constructs as "a character alternative", while the
paragraphs describing them call them a “character set”. In section
34.3.1.1 of the Elisp manual the phrase used consistently to describe
them and refer to them is "a character alternative". It would increase
the consistency of both manuals to use the same terminology to describe
and refer to these constructs. A more grammatically correct phrase to
describe these features would be "a set of alternative characters" (but
when have programming nerds ever been that concerned with grammatical
correctness). Whatever phrase is used to describe and refer to these
constructs, it should be consistent throughout both manuals (the
introduction to section _34.3.1.2 Character Classes_ in the Elisp manual
included).

In both section _15.6 Syntax of Regular Expressions_ and section
_34.3.1.1 Special Characters in Regular Expressions_, near the end of
each section, is a paragraph which contains the sentence:

As a ‘\’ is not special inside a character alternative, it can never 
remove the special meaning of ‘-’ or ‘]’.

In both sections, in the description of the ‘[ ... ]’ construct, is a
sentence which states that the characters ‘]’, ‘-’ and ‘^’ are special
inside character alternatives.

Shouldn't the sentences found in both sections that are cited
above include the '^' character?

The construct ‘\(?NUM: ... \)’ that is described in the Elisp manual,
section _34.3.1.3 Backslash Constructs in Regular Expressions_, is not
included in the Emacs manual section _15.7 Backslash in Regular
Expressions_; it should be. However, the description of the construct in
section 34.3.1.3 should be modified to make it clear that only the
digits 1 through 9 can be used as NUM. Here is a suggestion for doing that:

‘\(?DIGIT:...\)’

is the explicitly numbered group construct. Normal groups get their
number implicitly, based on their position, which can be inconvenient.
This construct allows a specific group number (limited to the digits 1
through 9, see: ‘\DIGIT’ construct) to be assigned to the group
construct. There is no particular restriction on the numbering, e.g.,
several groups can have the same number in which case the last one to
match (i.e., the rightmost match) will be recorded. Implicitly numbered
groups always get the smallest integer larger than the largest one of
any previous group.

In the Emacs manual section _15.7 Backslash in Regular Expressions_ in 
the description of the ‘\D’ construct the following sentence in the 
second paragraph is misleading:

Then, later on in the regular expression, you can use ‘\’ followed by 
the digit D to mean “match the same text matched the Dth time by the ‘\( 
... \)’ construct”.

This does not agree with the description in the paragraphs that surround 
it nor with the description of the construct in the Elisp manual, 
section _34.3.1.3 Backslash Constructs in Regular Expressions_. This is 
not an error introduced in version 26, it has been present since at 
least version 23. It should read:

Then, later on in the regular expression, ‘\’ followed by the digit D 
can be used to mean “match the same text matched by the Dth ‘\( ... \)’ 
construct”.

In section _15.7 Backslash in Regular Expressions_ of the Emacs manual
the descriptions for the constructs ‘\`’, ‘\'’, ‘\=’, ‘\b’, ‘\B’, ‘\<’,
‘\>’, ‘\w’, ‘\W’, ‘\_<’, ‘\_>’, ‘\sC’, ‘\SC’, ‘\cC’ and ‘\CC’ appear in
the order shown here, while in section _34.3.1.3 Backslash Constructs in
Regular Expressions_ of the Elisp manual they appear in the following
order: ‘\w’, ‘\W’, ‘\sCODE’, ‘\SCODE’, ‘\cC’, ‘\CC’, ‘\`’, ‘\'’, ‘\=’,
‘\b’, ‘\B’, ‘\<’, ‘\>’, ‘\_<’ and ‘\_>’, which groups the constructs
which match characters together and those which match empty strings
relative to positions together. This grouping makes much more sense than
the apparently haphazard order used in the Emacs manual. The order in the
Emacs manual should match that of the Elisp manual.

Also in section _34.3.1.3 Backslash Constructs in Regular Expressions_
of the Elisp manual, the four constructs having placeholders, ‘\sCODE’,
‘\SCODE’, ‘\cC’ and ‘\CC’, do not use the same convention for
specifying the placeholders. Either the constructs ‘\sCODE’ and ‘\SCODE’
should be written as ‘\sC’ and ‘\SC’ or the constructs ‘\cC’ and ‘\CC’
should be written as ‘\cCODE’ and ‘\CCODE’, making the convention
consistent throughout the section. The same convention should be used in
both the Emacs manual and the Elisp manual in all constructs where
placeholders occur. I prefer the use of a mnemonic as a placeholder over
the use of a single character.

Adopting this convention would necessitate changing the ‘\{M\}’,
‘\{M,N\}’ and ‘\D’ constructs as well. I suggest the following: ‘\{NUM\}’,
‘\{MIN,MAX\}’ and ‘\DIGIT’. I prefer the convention used in the online
version of the Elisp manual where placeholders are shown in lowercase
italics. I do not know if that is possible to do or if it would conflict
with the convention of showing placeholders in all caps that is used in
function descriptions. Since it is possible to cause links to files and
the names of variables to be displayed differently in function
descriptions, it should not be difficult to define a mechanism for
displaying placeholders in italics in function descriptions.

In section _34.3.1.3 Backslash Constructs in Regular Expressions_ of the
Elisp manual, in the paragraph that introduces the regular expression
constructs that match the empty string, the word ‘consume’ would be more
appropriate than the phrase ‘use up’.

The format of the descriptions in section _34.3.1.3 Backslash Constructs
in Regular Expressions_ of the Elisp manual is not consistent. I offer
the following, to which I have attempted to add some consistency by
stating the name of the operator/construct and then describing how it is
used. The corrections and improvements mentioned above are incorporated
into what follows.

For the most part, ‘\’ followed by any character matches only that 
character. However, there are several exceptions: two-character 
sequences starting with ‘\’ that have special meanings. The second 
character in the sequence is always an ordinary character when used on 
its own. Here are the ‘\’ operators and constructs.

‘\|’

is the alternative operator. Two regular expressions A and B with ‘\|’
between them form an expression that matches either the text matched by
A or the text matched by B.

Thus, ‘foo\|bar’ matches either ‘foo’ or ‘bar’ but no other string.

‘\|’ applies to the largest possible surrounding expressions. Only a 
surrounding ‘\( … \)’ grouping can limit the grouping power of ‘\|’.

When full backtracking capability is needed to handle multiple uses of 
‘\|’, use the POSIX regular expression functions (see POSIX Regexps in 
the Elisp manual).

‘\{/num/\}’

is the postfix number of repetitions operator. It specifies the exact 
number of consecutive repetitions that the preceding regular expression
must match. For example, ‘x\{4\}’ matches only the string ‘xxxx’; 
‘c[ad]\{3\}r’ matches only the eight valid strings that can be created 
with two characters in three places, that is the strings: ‘caaar’, 
‘caadr’, ‘cadar’, ‘caddr’, ‘cdaar’, ‘cdadr’, ‘cddar’, ‘cdddr’.
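
The eight strings listed above can be checked mechanically. The following is only a sketch in Python, not Emacs Lisp: it assumes Python's `re` module, which writes this operator as `{3}` without the backslashes that Emacs requires, and the variable names are illustrative.

```python
import itertools
import re

# Python spelling of the Emacs regexp c[ad]\{3\}r (braces are unescaped in Python).
pattern = re.compile(r"c[ad]{3}r")

# Build every string of the form c???r where each ? is 'a' or 'd'.
candidates = ["c" + "".join(mid) + "r" for mid in itertools.product("ad", repeat=3)]

# Keep the candidates that the whole pattern matches.
matched = [s for s in candidates if pattern.fullmatch(s)]
print(len(matched), matched)
```

Two characters in three places gives 2**3 = 8 candidates, and all of them should pass the filter.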

‘\{/min/,/max/\}’

is the postfix range of repetitions operator. It specifies the range of
consecutive repetitions between /min/ and /max/ that the preceding regular
expression must match, i.e. at least /min/ times, but no more than
/max/ times. If /min/ is omitted, the minimum is 0, so the preceding
regular expression may match at most /max/ times; if /max/ is omitted,
there is no maximum.

‘\{0,1\}’ or ‘\{,1\}’ is equivalent to ‘?’.

‘\{0,\}’ or ‘\{,\}’ is equivalent to ‘*’.

‘\{1,\}’ is equivalent to ‘+’.

For example, ‘c[ad]\{1,2\}r’ matches only the strings: ‘car’, ‘cdr’, 
‘caar’, ‘cadr’, ‘cdar’, and ‘cddr’.
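
The range example and two of the equivalences above can be exercised in the same way. Again this is a hedged sketch using Python's `re` module (unescaped braces), so it illustrates the semantics described here rather than the Emacs matcher itself; the list names are made up for the example.

```python
import re

# Python spelling of the Emacs regexp c[ad]\{1,2\}r.
ranged = re.compile(r"c[ad]{1,2}r")

accepted = ["car", "cdr", "caar", "cadr", "cdar", "cddr"]
rejected = ["cr", "caaar", "caddr"]  # zero or three repetitions fall outside 1..2

accepted_ok = all(ranged.fullmatch(s) is not None for s in accepted)
rejected_ok = all(ranged.fullmatch(s) is None for s in rejected)
print(accepted_ok, rejected_ok)

# {0,1} behaves like ?, and {1,} behaves like +, on a few sample inputs.
samples = ["", "x", "xx", "xxx"]
optional_same = all(
    bool(re.fullmatch(r"x{0,1}", s)) == bool(re.fullmatch(r"x?", s)) for s in samples
)
plus_same = all(
    bool(re.fullmatch(r"x{1,}", s)) == bool(re.fullmatch(r"x+", s)) for s in samples
)
print(optional_same, plus_same)
```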

The maximum value allowed for /num/, /min/ and /max/ is 2**15 − 1.

‘\( … \)’

is the grouping construct that serves three purposes:

 1. To enclose a set of ‘\|’ alternatives for other operations. Thus,
    ‘\(foo\|bar\)x’ matches either ‘foox’ or ‘barx’.

 2. To enclose a complicated expression for the postfix operators ‘*’,
    ‘+’ and ‘?’ to operate on. Thus, ‘ba\(na\)*’ matches ‘bananana’,
    etc., with any number of (zero or more) ‘na’ strings.

 3. To record a matched substring for future reference with ‘\/digit/’
    (described below).

This last application is not a consequence of the idea of a 
parenthetical grouping; it is a separate feature that is assigned as a 
second meaning to the same ‘\( … \)’ construct. In practice there is 
usually no conflict between the two meanings; when there is a conflict, 
a “shy” group (described below) can be used.
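
The first two purposes of grouping can be illustrated with a short sketch. It is written in Python, where the Emacs forms ‘\( … \)’ and ‘\|’ are spelled `( … )` and `|`; that transliteration is an assumption of the example, not something taken from either manual.

```python
import re

# Emacs \(foo\|bar\)x is written (foo|bar)x in Python.
grouped = re.compile(r"(foo|bar)x")
# Without the group, | applies to the largest surrounding expressions: foo OR barx.
ungrouped = re.compile(r"foo|barx")

grouped_foox = bool(grouped.fullmatch("foox"))
ungrouped_foox = bool(ungrouped.fullmatch("foox"))
print(grouped_foox, ungrouped_foox)

# Emacs ba\(na\)* : the postfix * repeats the whole group, zero or more times.
repeated = re.compile(r"ba(na)*")
results = {s: bool(repeated.fullmatch(s)) for s in ("ba", "bana", "bananana", "banan")}
print(results)
```

The ungrouped pattern shows why a surrounding group is the only way to limit the reach of the alternative operator.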

‘\(?: … \)’

is the “shy” group construct. A shy group serves the first two purposes 
of an ordinary group (controlling the nesting of other operators), but 
it does not record the matched substring; it can’t be referred back to 
with the ‘\/digit/’ construct (see below). This is useful in mechanically
combining regular expressions, so that groups can be added for syntactic 
purposes without interfering with the numbering of the groups that are 
meant to be referred to.
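
The effect of a shy group on group numbering can be seen in a small sketch. Python happens to use the same `(?: … )` spelling, minus the backslashes; the pattern and names below are hypothetical examples, not text from the manuals.

```python
import re

# Emacs \(?:foo\|bar\)\(x+\) is written (?:foo|bar)(x+) in Python.
with_shy = re.compile(r"(?:foo|bar)(x+)")
without_shy = re.compile(r"(foo|bar)(x+)")

# The shy group is not counted, so x+ is group 1 rather than group 2.
shy_count = with_shy.groups
plain_count = without_shy.groups
shy_tail = with_shy.fullmatch("barxx").group(1)
plain_tail = without_shy.fullmatch("barxx").group(2)
print(shy_count, shy_tail)
print(plain_count, plain_tail)
```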

‘\(?/digit/: … \)’

is the explicitly numbered group construct. Normal groups get their
number implicitly, based on their position, which can be inconvenient.
This construct allows a specific group number (limited to the digits 1
through 9, see: ‘\/digit/’ construct) to be assigned to the group
construct. There is no particular restriction on the numbering, e.g.,
several groups can have the same number in which case the last one to
match (i.e., the rightmost match) will be recorded. Implicitly numbered
groups always get the smallest integer larger than the largest one of
any previous group.

‘\/digit/’

is the back reference operator. It matches the same text that matched
the /digit/th occurrence of a ‘\( … \)’ construct.

After the end of a ‘\( … \)’ construct, the matcher remembers the
beginning and end of the text matched by that construct. Later in the
regular expression, ‘\’ followed by the /digit/ can be used to match the
same text matched by the /digit/th ‘\( … \)’ construct.

The strings matching the first nine ‘\( … \)’ constructs appearing in a 
regular expression are assigned numbers 1 through 9 in the order that 
the open-parentheses appear in the regular expression. So ‘\1’ through 
‘\9’ can be used to refer to the text matched by the corresponding ‘\( … 
\)’ constructs.

For example, ‘\(.*\)\1’ matches any newline-free string that is composed 
of two identical halves. The ‘\(.*\)’ matches the first half, which may 
be anything, but the ‘\1’ that follows must match the same exact text.

If a ‘\( … \)’ construct matches more than once (which can easily happen 
if it is followed by ‘*’), only the last match is recorded.

If a particular grouping construct in the regular expression was never 
matched—for instance, if it appears inside of an alternative that wasn’t 
used, or inside of a repetition that repeated zero times—then the 
corresponding ‘\digit’ construct never matches anything. For example, 
the regexp ‘\(foo\(b*\)\|lose\)\2’ cannot match ‘lose’ because the 
second alternative inside the larger group matches it, which results in 
‘\2’ being undefined and unable to match anything. It can match ‘foobb’, 
because the first alternative matches ‘foob’ and ‘\2’ matches the second 
‘b’.
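
Both back reference examples above can be tried out in a sketch. It uses Python's `re` module, where groups are written without backslashes but ‘\1’ and ‘\2’ keep their spelling. Python's treatment of a reference to an unmatched group happens to agree with the description here; that is an observation about Python, not evidence about the Emacs matcher.

```python
import re

# Emacs \(.*\)\1 : a string made of two identical halves.
halves = re.compile(r"(.*)\1")
first_half = halves.fullmatch("abcabc").group(1)
mismatch = halves.fullmatch("abcabd")
print(first_half, mismatch)

# Emacs \(foo\(b*\)\|lose\)\2 : group 2 is only set by the first alternative.
unset_ref = re.compile(r"(foo(b*)|lose)\2")
lose_result = unset_ref.fullmatch("lose")
foobb_group2 = unset_ref.fullmatch("foobb").group(2)
print(lose_result, foobb_group2)
```

For ‘foobb’, the inner group ends up holding a single ‘b’ so that the back reference can match the second one.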

The following operators pertaining to words and syntax are controlled by 
the setting of the syntax table (/See:/_Table of Syntax Classes_).

‘\w’

is the word-constituent operator, it matches any word-constituent 
character. The syntax table determines which characters these 
are. (/See:/_Table of Syntax Classes_)

‘\W’

is the non-word-constituent operator, it matches any character that is 
not a word-constituent. (/See:/_Table of Syntax Classes_)

‘\s/code/’

is the syntax class operator, it matches any character whose syntax is 
/code/. Here /code/ is a character that designates a particular syntax
class: thus, ‘w’ for word constituent, ‘-’ or ‘ ’ for whitespace, ‘.’ for
ordinary punctuation, etc. (/See:/_Table of Syntax Classes_)

‘\S/code/’

is the non syntax class operator, it matches any character whose syntax 
is not /code/. (/See:/_Table of Syntax Classes_)

‘\c/code/’

is the character category operator, it matches any character that 
belongs to the category /code/. For example, ‘\cc’ matches Chinese 
characters, ‘\cg’ matches Greek characters, etc. For the description of 
the known categories, type ‘M-x describe-categories <RET>’. (/See 
also:/_Category Characters_)

‘\C/code/’

is the non character category operator, it matches any character that 
does _not_ belong to category /code/. (/See:/_Category Characters_)

The following regular expression constructs match the empty string—that 
is, they don't consume any characters—but whether they match depends on 
the context. For all, the beginning and end of the accessible portion of 
the buffer are treated as if they were the actual beginning and end of 
the buffer.

‘\`’

is the beginning of string operator, it matches the empty string, but 
only at the beginning of the string or buffer (or its accessible 
portion) being matched against.

‘\'’

is the end of string operator, it matches the empty string, but only at 
the end of the string or buffer (or its accessible portion) being 
matched against.

‘\=’

is the at point operator, it matches the empty string, but only at point.

‘\b’

is the beginning or end of word operator, it matches the empty string, 
but only at the beginning or end of a word. Thus, ‘\bfoo\b’ matches any 
occurrence of ‘foo’ as a separate word. ‘\bballs?\b’ matches ‘ball’ or 
‘balls’ as a separate word.

‘\b’ matches at the beginning or end of the buffer regardless of what 
text appears next to it.
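
The ‘\bballs?\b’ example can be sketched as follows. Python spells ‘\b’ the same way, but its `re` module documents the boundary at either end of a string as matching only next to a word character, so the buffer-edge rule just described is not illustrated here. The sample text is invented for the example.

```python
import re

# \b has the same spelling in Python; a raw string keeps the backslash intact.
word = re.compile(r"\bballs?\b")

# Only 'ball' and 'balls' stand as separate words in this sample text.
found = word.findall("ball, balls, football, ballsy")
print(found)
```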

‘\B’

is the middle of word operator, it matches the empty string, but _not_ at
the beginning or end of a word.

‘\<’

is the beginning of word operator, it matches the empty string, but only 
at the beginning of a word; furthermore, ‘\<’ matches at the beginning 
of the buffer only if a word-constituent character follows.

‘\>’

is the end of word operator, it matches the empty string, but only at 
the end of a word; furthermore, ‘\>’ matches at the end of the buffer 
only if the contents end with a word-constituent character.

‘\_<’

is the beginning of symbol operator, it matches the empty string, but 
only at the beginning of a symbol. A symbol is a sequence of one or more 
symbol-constituent characters. A symbol-constituent character is a 
character whose syntax is either ‘w’ or ‘_’. It matches at the beginning 
of the buffer only if a symbol-constituent character immediately follows 
the beginning of the buffer. As with words, the syntax table determines 
which characters are symbol-constituent.

‘\_>’

is the end of symbol operator, it matches the empty string, but only at 
the end of a symbol. It matches at the end of the buffer only if a 
symbol-constituent character immediately precedes the end of the buffer.

Not every string is a valid regular expression. For example, a string 
that ends inside a set of alternative characters without a terminating 
‘]’ is invalid, and so is a string that ends with a single ‘\’. If an 
invalid regular expression is passed to any of the search functions, an 
invalid-regexp error is signaled.
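
The two invalid cases mentioned above can be reproduced in a sketch. In Python the analogous failure is the `re.error` exception rather than the Emacs invalid-regexp error; the helper name `is_valid` is hypothetical.

```python
import re


def is_valid(pattern: str) -> bool:
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True


unterminated = is_valid("[abc")       # bracket expression without a closing ]
trailing_backslash = is_valid("abc\\")  # pattern ends with a single backslash
well_formed = is_valid("[abc]")
print(unterminated, trailing_backslash, well_formed)
```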


J C Bingham
    - Georgetown, TX USA -
___________________________





* bug#41970: Suggestions for corrections to Emacs and Elisp manuals
  2020-06-20 20:44 bug#41970: Suggestions for corrections to Emacs and Elisp manuals Jay Bingham
@ 2020-06-20 21:50 ` Drew Adams
  2022-05-09 11:39 ` Lars Ingebrigtsen
  1 sibling, 0 replies; 3+ messages in thread
From: Drew Adams @ 2020-06-20 21:50 UTC (permalink / raw)
  To: Jay Bingham, 41970

> The terminology used in section 15.6 Syntax of Regular Expressions to describe and discuss the ‘[ ... ]’ and ‘[^ ... ]’ constructs. The first paragraph and the final paragraph in the section both refer to these constructs as "a character alternative", while the paragraphs describing them call them a “character set”. In section 34.3.1.1 of the Elisp manual the phrase used consistently to describe them and refer to them is "a character alternative".

> It would increase the consistency of both manuals to use the same terminology to describe and refer to these constructs. A more grammatically correct phrase to describe these features would be "a set of alternative characters" (but when have programming nerds ever been that concerned with grammatical correctness).

A nit:

These references refer to the syntax construct [...], and not to the set of chars that it represents.  It is wrong to call this construct "a character set", and it would be wrong to call it "a set of alternative characters".  What it _matches_, or represents, is any _one_ char of a set of alternative chars.  But the syntax construct is not a set of chars.






* bug#41970: Suggestions for corrections to Emacs and Elisp manuals
  2020-06-20 20:44 bug#41970: Suggestions for corrections to Emacs and Elisp manuals Jay Bingham
  2020-06-20 21:50 ` Drew Adams
@ 2022-05-09 11:39 ` Lars Ingebrigtsen
  1 sibling, 0 replies; 3+ messages in thread
From: Lars Ingebrigtsen @ 2022-05-09 11:39 UTC (permalink / raw)
  To: Jay Bingham; +Cc: 41970

Jay Bingham <binghamjc@msn.com> writes:

> Here are differences that I have noticed, which I believe should not
> be differences.

(I'm going through old bug reports that unfortunately weren't resolved
at the time.)

Thanks for the suggested improvements -- I've now adjusted these
sections in the manuals for Emacs 29 (where I agreed with the
suggestions).

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no




