From mboxrd@z Thu Jan 1 00:00:00 1970
Path: news.gmane.io!.POSTED.ciao.gmane.io!not-for-mail
From: Jay Bingham
Newsgroups: gmane.emacs.bugs
Subject: bug#41970: Suggestions for corrections to Emacs and Elisp manuals
Date: Sat, 20 Jun 2020 15:44:31 -0500
Mime-Version: 1.0
Content-Type: multipart/alternative; boundary="------------26B672C605FDE69FE36FD5BF"
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:68.0) Gecko/20100101 Thunderbird/68.9.0
To: 41970@debbugs.gnu.org
List-Id: "Bug reports for GNU Emacs, the Swiss army knife of text editors"
Xref: news.gmane.io gmane.emacs.bugs:182217

--------------26B672C605FDE69FE36FD5BF
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit

Information about the operators and constructs used to create regular expressions is contained in two locations in the Info manuals, one in the Emacs manual (section _15.6 Syntax of Regular Expressions_), the other in the Elisp manual (section _34.3.1.1 Special Characters in Regular Expressions_). The first paragraph in section 15.6 of the Emacs manual provides the justification for maintaining two versions of the material, even though the two versions contain mostly the same information. There are legitimate differences; however, not all of the differences can be attributed to the "features used mainly in Lisp programs". Here are differences that I have noticed, which I believe should not be differences.

Section _15.6 Syntax of Regular Expressions_ of the Emacs manual contains descriptions of the postfix repetition operators ‘\{N\}’ and ‘\{N,M\}’. These operators are not described in section 34.3.1.1 of the Elisp manual, but are described in section _34.3.1.3 Backslash Constructs in Regular Expressions_, where they are defined as ‘\{M\}’ and ‘\{M,N\}’. Since the Emacs manual also has a section for backslash constructs, _15.7 Backslash in Regular Expressions_, moving the descriptions of the postfix repetition operators to section 15.7 and naming them as they are named in the Elisp manual would contribute greatly to the consistency of the two manuals. Additionally, the description of ‘\{M,N\}’ in the Elisp manual contains information not included in the Emacs manual version that would be appropriate to include there.

The terminology used in section _15.6 Syntax of Regular Expressions_ to describe and discuss the ‘[ ... ]’ and ‘[^ ... ]’ constructs is inconsistent. The first paragraph and the final paragraph in the section both refer to these constructs as "a character alternative", while the paragraphs describing them call them a “character set”.
In section 34.3.1.1 of the Elisp manual the phrase used consistently to describe them and refer to them is "a character alternative". It would increase the consistency of both manuals to use the same terminology to describe and refer to these constructs. A more grammatically correct phrase to describe these features would be "a set of alternative characters" (but when have programming nerds ever been that concerned with grammatical correctness). Whatever phrase is used to describe and refer to these constructs, it should be consistent throughout both manuals. (This includes the introduction to section _34.3.1.2 Character Classes_ in the Elisp manual.)

In both section _15.6 Syntax of Regular Expressions_ and section _34.3.1.1 Special Characters in Regular Expressions_, near the end of each section, is a paragraph which contains the sentence:

As a ‘\’ is not special inside a character alternative, it can never remove the special meaning of ‘-’ or ‘]’.

In both sections, in the description of the ‘[ ... ]’ construct, is a sentence which states that the characters ‘]’, ‘-’ and ‘^’ are special inside character alternatives.

Shouldn't the sentences found in both sections that are cited above include the '^' character?

The construct ‘\(?NUM: ... \)’ that is described in the Elisp manual, section _34.3.1.3 Backslash Constructs in Regular Expressions_, is not included in the Emacs manual section _15.7 Backslash in Regular Expressions_; it should be. However, the description of the construct in section 34.3.1.3 should be modified to make it clear that only the digits 1 through 9 can be used as NUM. Here is a suggestion for doing that:

‘\(?DIGIT:...\)’ is the explicitly numbered group construct. Normal groups get their number implicitly, based on their position, which can be inconvenient. This construct allows a specific group number (limited to the digits 1 through 9, see: ‘\DIGIT’ construct) to be assigned to the group construct.
There is no particular restriction on the numbering, e.g., several groups can have the same number, in which case the last one to match (i.e., the rightmost match) will be recorded. Implicitly numbered groups always get the smallest integer larger than the largest one of any previous group.

In the Emacs manual section _15.7 Backslash in Regular Expressions_, in the description of the ‘\D’ construct, the following sentence in the second paragraph is misleading:

Then, later on in the regular expression, you can use ‘\’ followed by the digit D to mean “match the same text matched the Dth time by the ‘\( ... \)’ construct”.

This does not agree with the description in the paragraphs that surround it nor with the description of the construct in the Elisp manual, section _34.3.1.3 Backslash Constructs in Regular Expressions_. This is not an error introduced in version 26; it has been present since at least version 23. It should read:

Then, later on in the regular expression, ‘\’ followed by the digit D can be used to mean “match the same text matched by the Dth ‘\( ... \)’ construct”.

In section _15.7 Backslash in Regular Expressions_ of the Emacs manual the descriptions for the constructs ‘\`’, ‘\'’, ‘\=’, ‘\b’, ‘\B’, ‘\<’, ‘\>’, ‘\w’, ‘\W’, ‘\_<’, ‘\_>’, ‘\sC’, ‘\SC’, ‘\cC’ and ‘\CC’ appear in the order shown here, while in section _34.3.1.3 Backslash Constructs in Regular Expressions_ of the Elisp manual they appear in the following order: ‘\w’, ‘\W’, ‘\sCODE’, ‘\SCODE’, ‘\cC’, ‘\CC’, ‘\`’, ‘\'’, ‘\=’, ‘\b’, ‘\B’, ‘\<’, ‘\>’, ‘\_<’ and ‘\_>’, which groups together the constructs that match characters and groups together those that match empty strings relative to positions. This grouping makes much more sense than the apparently haphazard order used in the Emacs manual. The order in the Emacs manual should match that of the Elisp manual.
Also, in section _34.3.1.3 Backslash Constructs in Regular Expressions_ of the Elisp manual, the four constructs that have placeholders, ‘\sCODE’, ‘\SCODE’, ‘\cC’ and ‘\CC’, do not use the same convention for specifying the placeholders. Either the constructs ‘\sCODE’ and ‘\SCODE’ should be written as ‘\sC’ and ‘\SC’, or the constructs ‘\cC’ and ‘\CC’ should be written as ‘\cCODE’ and ‘\CCODE’, making the convention consistent throughout the section. The same convention should be used in both the Emacs manual and the Elisp manual in all constructs where placeholders occur. I prefer the use of a mnemonic as a placeholder over the use of a single character.

Adopting this convention would necessitate changing the ‘\{M\}’, ‘\{M,N\}’ and ‘\D’ constructs as well. I suggest the following: ‘\{NUM\}’, ‘\{MIN,MAX\}’ and ‘\DIGIT’. I prefer the convention used in the online version of the Elisp manual, where placeholders are shown in lowercase italics. I do not know if that is possible to do or if it would conflict with the convention of showing placeholders in all caps that is used in function descriptions. Since it is possible to cause links to files and the names of variables to be displayed differently in function descriptions, it should not be difficult to define a mechanism for displaying placeholders in italics in function descriptions.

In section _34.3.1.3 Backslash Constructs in Regular Expressions_ of the Elisp manual, in the paragraph that introduces the regular expression constructs that match the empty string, the word ‘consume’ would be more appropriate than the phrase ‘use up’.

The format of the descriptions in section _34.3.1.3 Backslash Constructs in Regular Expressions_ of the Elisp manual is not consistent. I offer you the following, to which I have attempted to add some consistency by stating the name of the operator/construct and then describing how it is used. The corrections and improvements mentioned above are incorporated into what follows.
For the most part, ‘\’ followed by any character matches only that character. However, there are several exceptions: two-character sequences starting with ‘\’ that have special meanings. The second character in the sequence is always an ordinary character when used on its own. Here are the ‘\’ operators and constructs.

‘\|’ is the alternative operator. Two regular expressions A and B with ‘\|’ between them form an expression that matches either the text matched by A or the text matched by B. Thus, ‘foo\|bar’ matches either ‘foo’ or ‘bar’ but no other string. ‘\|’ applies to the largest possible surrounding expressions. Only a surrounding ‘\( … \)’ grouping can limit the grouping power of ‘\|’. When full backtracking capability is needed to handle multiple uses of ‘\|’, use the POSIX regular expression functions (see POSIX Regexps in the Elisp manual).

‘\{/num/\}’ is the postfix number of repetitions operator. It specifies the exact number of consecutive repetitions that the preceding regular expression must match. For example, ‘x\{4\}’ matches only the string ‘xxxx’; ‘c[ad]\{3\}r’ matches only the eight valid strings that can be created with two characters in three places, that is, the strings: ‘caaar’, ‘caadr’, ‘cadar’, ‘caddr’, ‘cdaar’, ‘cdadr’, ‘cddar’, ‘cdddr’.

‘\{/min/,/max/\}’ is the postfix range of repetitions operator. It specifies the range of consecutive repetitions between /min/ and /max/ that the preceding regular expression must match, i.e., at least /min/ times, but no more than /max/ times. If /min/ is omitted, the minimum is 0, and the preceding regular expression may match at most /max/ times; if /max/ is omitted, there is no maximum. ‘\{0,1\}’ or ‘\{,1\}’ is equivalent to ‘?’. ‘\{0,\}’ or ‘\{,\}’ is equivalent to ‘*’. ‘\{1,\}’ is equivalent to ‘+’. For example, ‘c[ad]\{1,2\}r’ matches only the strings: ‘car’, ‘cdr’, ‘caar’, ‘cadr’, ‘cdar’, and ‘cddr’.

The maximum value allowed for /num/, /min/ and /max/ is 2**15 − 1.
‘\( … \)’ is the grouping construct that serves three purposes:

1. To enclose a set of ‘\|’ alternatives for other operations. Thus, ‘\(foo\|bar\)x’ matches either ‘foox’ or ‘barx’.
2. To enclose a complicated expression for the postfix operators ‘*’, ‘+’ and ‘?’ to operate on. Thus, ‘ba\(na\)*’ matches ‘bananana’, etc., with any number of (zero or more) ‘na’ strings.
3. To record a matched substring for future reference with ‘\/digit/’ (described below).

This last application is not a consequence of the idea of a parenthetical grouping; it is a separate feature that is assigned as a second meaning to the same ‘\( … \)’ construct. In practice there is usually no conflict between the two meanings; when there is a conflict, a “shy” group (described below) can be used.

‘\(?: … \)’ is the “shy” group construct. A shy group serves the first two purposes of an ordinary group (controlling the nesting of other operators), but it does not record the matched substring; it can’t be referred back to with the ‘\/digit/’ construct (see below). This is useful in mechanically combining regular expressions, so that groups can be added for syntactic purposes without interfering with the numbering of the groups that are meant to be referred to.

‘\(?/digit/: … \)’ is the explicitly numbered group construct. Normal groups get their number implicitly, based on their position, which can be inconvenient. This construct allows a specific group number (limited to the digits 1 through 9, see: ‘\/digit/’ construct) to be assigned to the group construct. There is no particular restriction on the numbering, e.g., several groups can have the same number, in which case the last one to match (i.e., the rightmost match) will be recorded. Implicitly numbered groups always get the smallest integer larger than the largest one of any previous group.

‘\/digit/’ is the back reference operator. It matches the same text that matched the /digit/th occurrence of a ‘\( … \)’ construct.
After the end of a ‘\( … \)’ construct, the matcher remembers the beginning and end of the text matched by that construct. Later in the regular expression, ‘\’ followed by the /digit/ can be used to match the same text matched by the /digit/th ‘\( … \)’ construct.

The strings matching the first nine ‘\( … \)’ constructs appearing in a regular expression are assigned numbers 1 through 9 in the order that the open-parentheses appear in the regular expression. So ‘\1’ through ‘\9’ can be used to refer to the text matched by the corresponding ‘\( … \)’ constructs.

For example, ‘\(.*\)\1’ matches any newline-free string that is composed of two identical halves. The ‘\(.*\)’ matches the first half, which may be anything, but the ‘\1’ that follows must match the same exact text.

If a ‘\( … \)’ construct matches more than once (which can easily happen if it is followed by ‘*’), only the last match is recorded.

If a particular grouping construct in the regular expression was never matched (for instance, if it appears inside of an alternative that wasn’t used, or inside of a repetition that repeated zero times), then the corresponding ‘\/digit/’ construct never matches anything. For example, the regexp ‘\(foo\(b*\)\|lose\)\2’ cannot match ‘lose’ because the second alternative inside the larger group matches it, which results in ‘\2’ being undefined and unable to match anything. It can match ‘foobb’, because the first alternative matches ‘foob’ and ‘\2’ matches the second ‘b’.

The following operators pertaining to words and syntax are controlled by the setting of the syntax table (/See:/ _Table of Syntax Classes_).

‘\w’ is the word-constituent operator; it matches any word-constituent character. The syntax table determines which characters these are. (/See:/ _Table of Syntax Classes_)

‘\W’ is the non-word-constituent operator; it matches any character that is not a word-constituent.
(/See:/ _Table of Syntax Classes_)

‘\s/code/’ is the syntax class operator; it matches any character whose syntax is /code/. Here /code/ is a character that designates a particular syntax class: thus, ‘w’ for word constituent, ‘-’ or ‘ ’ for whitespace, ‘.’ for ordinary punctuation, etc. (/See:/ _Table of Syntax Classes_)

‘\S/code/’ is the non-syntax-class operator; it matches any character whose syntax is not /code/. (/See:/ _Table of Syntax Classes_)

‘\c/code/’ is the character category operator; it matches any character that belongs to the category /code/. For example, ‘\cc’ matches Chinese characters, ‘\cg’ matches Greek characters, etc. For the description of the known categories, type ‘M-x describe-categories’. (/See also:/ _Category Characters_)

‘\C/code/’ is the non-character-category operator; it matches any character that does _not_ belong to category /code/. (/See:/ _Category Characters_)

The following regular expression constructs match the empty string (that is, they don't consume any characters), but whether they match depends on the context. For all, the beginning and end of the accessible portion of the buffer are treated as if they were the actual beginning and end of the buffer.

‘\`’ is the beginning of string operator; it matches the empty string, but only at the beginning of the string or buffer (or its accessible portion) being matched against.

‘\'’ is the end of string operator; it matches the empty string, but only at the end of the string or buffer (or its accessible portion) being matched against.

‘\=’ is the at point operator; it matches the empty string, but only at point.

‘\b’ is the beginning or end of word operator; it matches the empty string, but only at the beginning or end of a word. Thus, ‘\bfoo\b’ matches any occurrence of ‘foo’ as a separate word. ‘\bballs?\b’ matches ‘ball’ or ‘balls’ as a separate word. ‘\b’ matches at the beginning or end of the buffer regardless of what text appears next to it.
‘\B’ is the middle of word operator; it matches the empty string, but _not_ at the beginning or end of a word.

‘\<’ is the beginning of word operator; it matches the empty string, but only at the beginning of a word. Furthermore, ‘\<’ matches at the beginning of the buffer only if a word-constituent character follows.

‘\>’ is the end of word operator; it matches the empty string, but only at the end of a word. Furthermore, ‘\>’ matches at the end of the buffer only if the contents end with a word-constituent character.

‘\_<’ is the beginning of symbol operator; it matches the empty string, but only at the beginning of a symbol. A symbol is a sequence of one or more symbol-constituent characters. A symbol-constituent character is a character whose syntax is either ‘w’ or ‘_’. It matches at the beginning of the buffer only if a symbol-constituent character immediately follows the beginning of the buffer. As with words, the syntax table determines which characters are symbol-constituent.

‘\_>’ is the end of symbol operator; it matches the empty string, but only at the end of a symbol. It matches at the end of the buffer only if a symbol-constituent character immediately precedes the end of the buffer.

Not every string is a valid regular expression. For example, a string that ends inside a set of alternative characters without a terminating ‘]’ is invalid, and so is a string that ends with a single ‘\’. If an invalid regular expression is passed to any of the search functions, an invalid-regexp error is signaled.

J C Bingham
- Georgetown, TX USA -

--------------26B672C605FDE69FE36FD5BF
Content-Type: text/html; charset=utf-8
Content-Transfer-Encoding: 8bit

Information about the operators and constructs used to create regular expressions is contained in two locations in the Info manuals, one in the Emacs manual (section 15.6 Syntax of Regular Expressions), the other in the Elisp manual (section 34.3.1.1 Special Characters in Regular Expressions). The first paragraph in section 15.6 of the Emacs manual provides the justification for maintaining two versions of the material, even though the two versions contain mostly the same information. There are legitimate differences; however, not all of the differences can be attributed to the "features used mainly in Lisp programs". Here are differences that I have noticed, which I believe should not be differences.

Section 15.6 Syntax of Regular Expressions of the Emacs manual contains descriptions of the postfix repetition operators ‘\{N\}’ and ‘\{N,M\}’. These operators are not described in section 34.3.1.1 of the Elisp manual, but are described in section 34.3.1.3 Backslash Constructs in Regular Expressions, where they are defined as ‘\{M\}’ and ‘\{M,N\}’. Since the Emacs manual also has a section for backslash constructs, 15.7 Backslash in Regular Expressions, moving the descriptions of the postfix repetition operators to section 15.7 and naming them as they are named in the Elisp manual would contribute greatly to the consistency of the two manuals. Additionally, the description of ‘\{M,N\}’ in the Elisp manual contains information not included in the Emacs manual version that would be appropriate to include there.
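For illustration only, the two repetition operators can be tried from Lisp as in the sketch below. The helper name `my-whole-match-p` is made up for this example and is not an Emacs function; the doubled backslashes are Lisp string quoting, not part of the regexp syntax.

```elisp
;; Hypothetical helper: return t when REGEXP matches all of STRING.
;; The shy group keeps any \| inside REGEXP from escaping the anchors.
(defun my-whole-match-p (regexp string)
  (and (string-match-p (concat "\\`\\(?:" regexp "\\)\\'") string) t))

(my-whole-match-p "x\\{4\\}" "xxxx")          ; exactly four repetitions
(my-whole-match-p "c[ad]\\{3\\}r" "caddr")    ; exactly three of a or d
(my-whole-match-p "c[ad]\\{1,2\\}r" "cdr")    ; one or two of a or d
(my-whole-match-p "c[ad]\\{1,2\\}r" "cdddr")  ; three is too many, so nil
```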

The terminology used in section 15.6 Syntax of Regular Expressions to describe and discuss the ‘[ ... ]’ and ‘[^ ... ]’ constructs is inconsistent. The first paragraph and the final paragraph in the section both refer to these constructs as "a character alternative", while the paragraphs describing them call them a “character set”. In section 34.3.1.1 of the Elisp manual the phrase used consistently to describe them and refer to them is "a character alternative". It would increase the consistency of both manuals to use the same terminology to describe and refer to these constructs. A more grammatically correct phrase to describe these features would be "a set of alternative characters" (but when have programming nerds ever been that concerned with grammatical correctness). Whatever phrase is used to describe and refer to these constructs, it should be consistent throughout both manuals. (This includes the introduction to section 34.3.1.2 Character Classes in the Elisp manual.)

In both section 15.6 Syntax of Regular Expressions and section 34.3.1.1 Special Characters in Regular Expressions near the end of each section is a paragraph which contains the sentence:

As a ‘\’ is not special inside a character alternative, it can never remove the special meaning of ‘-’ or ‘]’.

In both sections, in the description of the ‘[ ... ]’ construct, is a sentence which states that the characters ‘]’, ‘-’ and ‘^’ are special inside character alternatives.

Shouldn't the sentences found in both sections that are cited above include the '^' character?
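As a small sketch of the behavior in question (the function name `my-bracket-demo` is hypothetical), the forms below show that ‘\’ is an ordinary character inside a character alternative, and that ‘]’, ‘-’ and ‘^’ are made literal by their position rather than by a backslash:

```elisp
;; Hypothetical demo function; each element is 0 when the match succeeds.
(defun my-bracket-demo ()
  (list
   ;; The regexp [\] is a complete alternative holding only a backslash.
   (string-match-p "\\`[\\]\\'" "\\")
   ;; ] is literal when it comes first.
   (string-match-p "\\`[]a]\\'" "]")
   ;; - is literal when it comes last.
   (string-match-p "\\`[a-]\\'" "-")
   ;; ^ is literal when it does not come first.
   (string-match-p "\\`[a^]\\'" "^")))

(my-bracket-demo)
```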

The construct ‘\(?NUM: ... \)’ that is described in the Elisp manual, section 34.3.1.3 Backslash Constructs in Regular Expressions, is not included in the Emacs manual section 15.7 Backslash in Regular Expressions; it should be. However, the description of the construct in section 34.3.1.3 should be modified to make it clear that only the digits 1 through 9 can be used as NUM. Here is a suggestion for doing that:

\(?DIGIT:...\)

is the explicitly numbered group construct. Normal groups get their number implicitly, based on their position, which can be inconvenient. This construct allows a specific group number (limited to the digits 1 through 9, see: ‘\DIGIT’ construct) to be assigned to the group construct. There is no particular restriction on the numbering, e.g., several groups can have the same number in which case the last one to match (i.e., the rightmost match) will be recorded. Implicitly numbered groups always get the smallest integer larger than the largest one of any previous group.
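A minimal sketch of explicit numbering, assuming a made-up date string and a hypothetical function name `my-explicit-groups`: the groups are numbered against their textual order, so group 1 is the second one written.

```elisp
;; Hypothetical demo: the first group written is number 2,
;; the second group written is number 1.
(defun my-explicit-groups (string)
  (when (string-match "\\(?2:[0-9]+\\)-\\(?1:[0-9]+\\)" string)
    (list (match-string 1 string) (match-string 2 string))))

(my-explicit-groups "2020-06-20")
```

With this input the match covers ‘2020-06’, so group 1 holds ‘06’ and group 2 holds ‘2020’.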

In the Emacs manual section 15.7 Backslash in Regular Expressions in the description of the ‘\D’ construct the following sentence in the second paragraph is misleading:

Then, later on in the regular expression, you can use ‘\’ followed by the digit D to mean “match the same text matched the Dth time by the ‘\( ... \)’ construct”.

This does not agree with the description in the paragraphs that surround it nor with the description of the construct in the Elisp manual, section 34.3.1.3 Backslash Constructs in Regular Expressions. This is not an error introduced in version 26; it has been present since at least version 23. It should read:

Then, later on in the regular expression, ‘\’ followed by the digit D can be used to mean “match the same text matched by the Dth ‘\( ... \)’ construct”.
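The corrected reading, that ‘\D’ refers to the text matched by the Dth group, can be checked with a sketch like the following. The function name `my-backref-demo` and the test strings are invented for illustration.

```elisp
;; Hypothetical demo: \2 must repeat whatever the second group matched.
(defun my-backref-demo ()
  (list
   ;; Group 2 ends up as "b", and \2 matches the second "b".
   (string-match-p "\\`\\(a+\\)\\(b+\\)\\2\\'" "aabb")
   ;; Nothing after the b repeats group 2, so there is no match.
   (string-match-p "\\`\\(a+\\)\\(b+\\)\\2\\'" "aaba")
   ;; Two identical halves: \1 repeats the text of group 1.
   (string-match-p "\\`\\(.*\\)\\1\\'" "abcabc")))

(my-backref-demo)
```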

In section 15.7 Backslash in Regular Expressions of the Emacs manual the descriptions for the constructs ‘\`’, ‘\'’, ‘\=’, ‘\b’, ‘\B’, ‘\<’, ‘\>’, ‘\w’, ‘\W’, ‘\_<’, ‘\_>’, ‘\sC’, ‘\SC’, ‘\cC’ and ‘\CC’ appear in the order shown here, while in section 34.3.1.3 Backslash Constructs in Regular Expressions of the Elisp manual they appear in the following order: ‘\w’, ‘\W’, ‘\sCODE’, ‘\SCODE’, ‘\cC’, ‘\CC’, ‘\`’, ‘\'’, ‘\=’, ‘\b’, ‘\B’, ‘\<’, ‘\>’, ‘\_<’ and ‘\_>’, which groups together the constructs that match characters and groups together those that match empty strings relative to positions. This grouping makes much more sense than the apparently haphazard order used in the Emacs manual. The order in the Emacs manual should match that of the Elisp manual.

Also, in section 34.3.1.3 Backslash Constructs in Regular Expressions of the Elisp manual, the four constructs having placeholders, ‘\sCODE’, ‘\SCODE’, ‘\cC’ and ‘\CC’, do not use the same convention for specifying the placeholders. Either the constructs ‘\sCODE’ and ‘\SCODE’ should be written as ‘\sC’ and ‘\SC’, or the constructs ‘\cC’ and ‘\CC’ should be written as ‘\cCODE’ and ‘\CCODE’, making the convention consistent throughout the section. The same convention should be used in both the Emacs manual and the Elisp manual in all constructs where placeholders occur. I prefer the use of a mnemonic as a placeholder over the use of a single character.

Adopting this convention would necessitate changing the ‘\{M\}’, ‘\{M,N\}’ and ‘\D’ constructs as well. I suggest the following: ‘\{NUM\}’, ‘\{MIN,MAX\}’ and ‘\DIGIT’. I prefer the convention used in the online version of the Elisp manual where placeholders are shown in lowercase italics. I do not know if that is possible to do or if it would conflict with the convention of showing placeholders in all caps that is used in function descriptions. Since it is possible to cause links to files and the names of variables to be displayed differently in function descriptions, it should not be difficult to define a mechanism for displaying placeholders in italics in function descriptions.

In section 34.3.1.3 Backslash Constructs in Regular Expressions of the Elisp manual, in the paragraph that introduces the regular expression constructs that match the empty string, the word ‘consume’ would be more appropriate than the phrase ‘use up’.

The format of the descriptions in section 34.3.1.3 Backslash Constructs in Regular Expressions of the Elisp manual is not consistent. I offer you the following, to which I have attempted to add some consistency by stating the name of the operator/construct and then describing how it is used. The corrections and improvements mentioned above are incorporated into what follows.

For the most part, ‘\’ followed by any character matches only that character. However, there are several exceptions: two-character sequences starting with ‘\’ that have special meanings. The second character in the sequence is always an ordinary character when used on its own. Here are the ‘\’ operators and constructs.

\|

is the alternative operator. Two regular expressions A and B with ‘\|’ between them form an expression that matches either the text matched by A or the text matched by B.

Thus, ‘foo\|bar’ matches either ‘foo’ or ‘bar’ but no other string.

‘\|’ applies to the largest possible surrounding expressions. Only a surrounding ‘\( … \)’ grouping can limit the grouping power of ‘\|’.

When full backtracking capability is needed to handle multiple uses of ‘\|’, use the POSIX regular expression functions (see POSIX Regexps in the Elisp manual).
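
As a quick check of the scope of ‘\|’, here is a small Emacs Lisp sketch that can be evaluated in the *scratch* buffer. It is only an illustration, not proposed manual text; note that Lisp string syntax requires each regexp backslash to be doubled.

```elisp
;; Ungrouped, the two alternatives are "\`foo" and "bar\'", so a
;; string that merely starts with "foo" still matches.
(string-match-p "\\`foo\\|bar\\'" "foox")             ; => 0

;; A surrounding group limits "\|", so both anchors now apply to
;; each alternative.
(string-match-p "\\`\\(foo\\|bar\\)\\'" "foox")       ; => nil
(string-match-p "\\`\\(foo\\|bar\\)\\'" "bar")        ; => 0
```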

\{num\}

is the postfix number of repetitions operator. It specifies the exact number of consecutive repetitions that the preceding regular expression must match. For example, ‘x\{4\}’ matches only the string ‘xxxx’; ‘c[ad]\{3\}r’ matches only the eight strings that can be formed with two characters in three places, that is: ‘caaar’, ‘caadr’, ‘cadar’, ‘caddr’, ‘cdaar’, ‘cdadr’, ‘cddar’, ‘cdddr’.
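
The exact-count behaviour can be tried with the following Emacs Lisp sketch. The ‘\`’ and ‘\'’ anchors are added here only so that the whole string has to match; they are not part of the examples in the text.

```elisp
;; "x\{4\}" needs exactly four consecutive "x" characters.
(string-match-p "\\`x\\{4\\}\\'" "xxxx")          ; => 0
(string-match-p "\\`x\\{4\\}\\'" "xxx")           ; => nil
(string-match-p "\\`x\\{4\\}\\'" "xxxxx")         ; => nil

;; "c[ad]\{3\}r" needs exactly three letters from the set [ad].
(string-match-p "\\`c[ad]\\{3\\}r\\'" "caddr")    ; => 0
(string-match-p "\\`c[ad]\\{3\\}r\\'" "cadr")     ; => nil
```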

\{min,max\}

is the postfix range of repetitions operator. It specifies the range of consecutive repetitions that the preceding regular expression must match, i.e. at least min times, but no more than max times. If min is omitted, the minimum is 0, so the preceding regular expression may match anywhere from zero to max times; if max is omitted, there is no maximum.

‘\{0,1\}’ or ‘\{,1\}’ is equivalent to ‘?’.

‘\{0,\}’ or ‘\{,\}’ is equivalent to ‘*’.

‘\{1,\}’ is equivalent to ‘+’.

For example, ‘c[ad]\{1,2\}r’ matches only the strings: ‘car’, ‘cdr’, ‘caar’, ‘cadr’, ‘cdar’, and ‘cddr’.

The maximum value allowed for num, min and max is 2**15 − 1.
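
The range forms can be checked with a sketch along these lines. The helper `my-matching-strings' is a made-up name for this illustration and is not part of Emacs; it simply filters a list of candidates by whole-string match.

```elisp
(defun my-matching-strings (regexp candidates)
  "Return the members of CANDIDATES that REGEXP matches in full.
Hypothetical helper, for this illustration only."
  (let ((anchored (concat "\\`\\(?:" regexp "\\)\\'"))
        (result nil))
    (dolist (s candidates (nreverse result))
      (when (string-match-p anchored s)
        (push s result)))))

(my-matching-strings "c[ad]\\{1,2\\}r"
                     '("cr" "car" "cdr" "cadr" "cddr" "caddr"))
;; => ("car" "cdr" "cadr" "cddr")

;; "\{,1\}" behaves like "?" and "\{1,\}" behaves like "+".
(my-matching-strings "ab\\{,1\\}" '("a" "ab" "abb"))   ; => ("a" "ab")
(my-matching-strings "ab\\{1,\\}" '("a" "ab" "abb"))   ; => ("ab" "abb")
```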

\( … \)

is the grouping construct that serves three purposes:

  1. To enclose a set of ‘\|’ alternatives for other operations. Thus, ‘\(foo\|bar\)x’ matches either ‘foox’ or ‘barx’.

  2. To enclose a complicated expression for the postfix operators ‘*’, ‘+’ and ‘?’ to operate on. Thus, ‘ba\(na\)*’ matches ‘bananana’, etc., with any number of (zero or more) ‘na’ strings.

  3. To record a matched substring for future reference with ‘\digit’ (described below).

This last application is not a consequence of the idea of a parenthetical grouping; it is a separate feature that is assigned as a second meaning to the same ‘\( … \)’ construct. In practice there is usually no conflict between the two meanings; when there is a conflict, a “shy” group (described below) can be used.

\(?: … \)

is the “shy” group construct. A shy group serves the first two purposes of an ordinary group (controlling the nesting of other operators), but it does not record the matched substring; it can’t be referred back to with the ‘\digit’ construct (see below). This is useful in mechanically combining regular expressions, so that groups can be added for syntactic purposes without interfering with the numbering of the groups that are meant to be referred to.
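
The effect of a shy group on group numbering can be seen in a sketch like the following, which uses only the standard `string-match' and `match-string' functions.

```elisp
;; With a shy group in front, "baz" is recorded as group 1.
(let ((s "barbaz"))
  (string-match "\\(?:foo\\|bar\\)\\(baz\\)" s)
  (match-string 1 s))                               ; => "baz"

;; With an ordinary group in front, "baz" becomes group 2.
(let ((s "barbaz"))
  (string-match "\\(foo\\|bar\\)\\(baz\\)" s)
  (list (match-string 1 s) (match-string 2 s)))     ; => ("bar" "baz")
```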

\(?digit: … \)

is the explicitly numbered group construct. Normal groups get their number implicitly, based on their position, which can be inconvenient. This construct allows a specific group number (limited to the digits 1 through 9, see: ‘\digit’ construct) to be assigned to the group construct. There is no particular restriction on the numbering, e.g., several groups can have the same number in which case the last one to match (i.e., the rightmost match) will be recorded. Implicitly numbered groups always get the smallest integer larger than the largest one of any previous group.
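
A sketch of explicit numbering, assuming the behaviour described above (the implicit group that follows an explicitly numbered one takes the next larger number, and two groups in different alternatives may share a number):

```elisp
;; The first group is explicitly numbered 3, so the implicit group
;; that follows it is numbered 4.
(let ((s "aabb"))
  (string-match "\\(?3:a+\\)\\(b+\\)" s)
  (list (match-string 3 s) (match-string 4 s)))     ; => ("aa" "bb")

;; Two alternatives sharing group number 1; whichever one matches
;; is recorded under that number.
(let ((s "b"))
  (string-match "\\(?1:a\\)\\|\\(?1:b\\)" s)
  (match-string 1 s))                               ; => "b"
```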

\digit

is the back reference operator. It matches the same text that matched the digitth occurrence of a ‘\( … \)’ construct.

After the end of a ‘\( … \)’ construct, the matcher remembers the beginning and end of the text matched by that construct. Later in the regular expression, ‘\’ followed by the digit can be used to match the same text matched by the digitth ‘\( … \)’ construct.

The strings matching the first nine ‘\( … \)’ constructs appearing in a regular expression are assigned numbers 1 through 9 in the order that the open-parentheses appear in the regular expression. So ‘\1’ through ‘\9’ can be used to refer to the text matched by the corresponding ‘\( … \)’ constructs.

For example, ‘\(.*\)\1’ matches any newline-free string that is composed of two identical halves. The ‘\(.*\)’ matches the first half, which may be anything, but the ‘\1’ that follows must match the same exact text.

If a ‘\( … \)’ construct matches more than once (which can easily happen if it is followed by ‘*’), only the last match is recorded.

If a particular grouping construct in the regular expression was never matched—for instance, if it appears inside of an alternative that wasn’t used, or inside of a repetition that repeated zero times—then the corresponding ‘\digit’ construct never matches anything. For example, the regexp ‘\(foo\(b*\)\|lose\)\2’ cannot match ‘lose’ because the second alternative inside the larger group matches it, which results in ‘\2’ being undefined and unable to match anything. It can match ‘foobb’, because the first alternative matches ‘foob’ and ‘\2’ matches the second ‘b’.
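
The back reference examples above can be reproduced with this Emacs Lisp sketch. The ‘\`’ and ‘\'’ anchors in the first two forms are an addition for the illustration, so that the whole string has to consist of two identical halves.

```elisp
;; Two identical halves match; a string of odd length cannot.
(string-match-p "\\`\\(.*\\)\\1\\'" "abab")               ; => 0
(string-match-p "\\`\\(.*\\)\\1\\'" "abc")                ; => nil

;; When a group repeats, only its last match is recorded.
(let ((s "abcd"))
  (string-match "\\(ab\\|cd\\)*" s)
  (match-string 1 s))                                     ; => "cd"

;; Group 2 never matches in the "lose" alternative, so "\2" fails.
(string-match-p "\\(foo\\(b*\\)\\|lose\\)\\2" "lose")     ; => nil
(string-match-p "\\(foo\\(b*\\)\\|lose\\)\\2" "foobb")    ; => 0
```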

The following operators pertaining to words and syntax are controlled by the setting of the syntax table (See: Table of Syntax Classes).

\w

is the word-constituent operator, it matches any word-constituent character. The syntax table determines which characters these are. (See: Table of Syntax Classes)

\W

is the non-word-constituent operator, it matches any character that is not a word-constituent. (See: Table of Syntax Classes)

\scode

is the syntax class operator, it matches any character whose syntax is code. Here code is a character that designates a particular syntax class: thus, ‘w’ for word constituent, ‘-’ or ‘ ’ for whitespace, ‘.’ for ordinary punctuation, etc. (See: Table of Syntax Classes)

\Scode

is the non syntax class operator, it matches any character whose syntax is not code. (See: Table of Syntax Classes)

\ccode

is the character category operator, it matches any character that belongs to the category code. For example, ‘\cc’ matches Chinese characters, ‘\cg’ matches Greek characters, etc. For the description of the known categories, type ‘M-x describe-categories <RET>’. (See also: Category Characters)

\Ccode

is the non character category operator, it matches any character that does not belong to category code. (See: Category Characters)
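
Because the word and syntax class operators depend on the syntax table in effect, the sketch below pins the table with `with-syntax-table' and `standard-syntax-table'. The results shown assume that standard table, in which letters are word constituents and the space character is whitespace.

```elisp
(with-syntax-table (standard-syntax-table)
  (list (string-match-p "\\w" "  foo")        ; first word constituent
        (string-match-p "\\W" "foo bar")      ; first non-word character
        (string-match-p "\\s-" "foo bar")     ; first whitespace character
        (string-match-p "\\S-" "  foo")))     ; first non-whitespace character
;; => (2 3 3 2)
```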

The following regular expression constructs match the empty string—that is, they don't consume any characters—but whether they match depends on the context. For all, the beginning and end of the accessible portion of the buffer are treated as if they were the actual beginning and end of the buffer.

\`

is the beginning of string operator, it matches the empty string, but only at the beginning of the string or buffer (or its accessible portion) being matched against.

\'

is the end of string operator, it matches the empty string, but only at the end of the string or buffer (or its accessible portion) being matched against.

\=

is the at point operator, it matches the empty string, but only at point.

\b

is the beginning or end of word operator, it matches the empty string, but only at the beginning or end of a word. Thus, ‘\bfoo\b’ matches any occurrence of ‘foo’ as a separate word. ‘\bballs?\b’ matches ‘ball’ or ‘balls’ as a separate word.

‘\b’ matches at the beginning or end of the buffer regardless of what text appears next to it.

\B

is the middle of word operator, it matches the empty string, but not at the beginning or end of a word.

\<

is the beginning of word operator, it matches the empty string, but only at the beginning of a word; furthermore, ‘\<’ matches at the beginning of the buffer only if a word-constituent character follows.

\>

is the end of word operator, it matches the empty string, but only at the end of a word; furthermore, ‘\>’ matches at the end of the buffer only if the contents end with a word-constituent character.

\_<

is the beginning of symbol operator, it matches the empty string, but only at the beginning of a symbol. A symbol is a sequence of one or more symbol-constituent characters. A symbol-constituent character is a character whose syntax is either ‘w’ or ‘_’. It matches at the beginning of the buffer only if a symbol-constituent character immediately follows the beginning of the buffer. As with words, the syntax table determines which characters are symbol-constituent.

\_>

is the end of symbol operator, it matches the empty string, but only at the end of a symbol. It matches at the end of the buffer only if a symbol-constituent character immediately precedes the end of the buffer.
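
The difference between the word and symbol boundary operators can be seen in a sketch like the one below. It assumes the standard syntax table, where ‘-’ has symbol syntax rather than word syntax; under a different syntax table the results may differ.

```elisp
(with-syntax-table (standard-syntax-table)
  (list (string-match-p "\\bballs?\\b" "a ball game")   ; a separate word
        (string-match-p "\\bballs?\\b" "football")      ; no boundary before "ball"
        (string-match-p "\\Boo\\B" "foot")              ; "oo" inside a word
        (string-match-p "\\<foo\\>" "foo-bar")          ; the word ends before "-"
        (string-match-p "\\_<foo\\_>" "foo-bar")))      ; the symbol continues past "-"
;; => (2 nil 1 0 nil)
```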

Not every string is a valid regular expression. For example, a string that ends inside a set of alternative characters without a terminating ‘]’ is invalid, and so is a string that ends with a single ‘\’. If an invalid regular expression is passed to any of the search functions, an invalid-regexp error is signaled.
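
A caller that wants to test a regular expression before using it can catch the error, as in this sketch. The function name `my-regexp-valid-p' is hypothetical and not part of Emacs.

```elisp
(defun my-regexp-valid-p (regexp)
  "Return t if REGEXP can be used for matching, nil otherwise.
Hypothetical helper, for this illustration only."
  (condition-case nil
      (progn (string-match-p regexp "") t)
    (invalid-regexp nil)))

(my-regexp-valid-p "[a-z]")    ; => t
(my-regexp-valid-p "[a-z")     ; => nil, no terminating "]"
(my-regexp-valid-p "foo\\")    ; => nil, ends with a single "\"
```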


J C Bingham
   - Georgetown, TX USA -
___________________________


