From: Mattias Engdegård
Newsgroups: gmane.emacs.bugs
Subject: bug#25706: 26.0.50; Slow C file fontification
Date: Wed, 9 Dec 2020 18:00:30 +0100
References: <956BCA08-0376-4FAD-B1F7-2087C03F6181@acm.org> <53CC4F6E-716E-4D4B-8903-F32CCB676163@acm.org> <05F2A660-A403-4B81-AE77-416A739160A7@acm.org>
To: Alan Mackenzie
Cc: Lars Ingebrigtsen, 25706@debbugs.gnu.org
Mime-Version: 1.0 (Mac OS X Mail 12.4 \(3445.104.17\))
Content-Type: multipart/mixed; boundary="Apple-Mail=_F377F32E-C6EB-48F2-AE3E-C1E251445E4E"

--Apple-Mail=_F377F32E-C6EB-48F2-AE3E-C1E251445E4E
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset=us-ascii

First, some Emacs regexp basics:

1. If A and B match single characters, then A\|B should be written [AB]
whenever possible. The reason is that A\|B adds a backtrack record, which
uses stack space and wastes time if matching fails later on. The cost
can be quite noticeable, as we have seen.

2. Syntax-class constructs are usually better written as character
alternatives when possible.
The \sX construct, for some X, is typically somewhat slower to match
than explicitly listing the characters to match. For example, if all you
care about are space and tab, then "\\s *" should be written "[ \t]*".

3. Unicode character classes are slower to match than ASCII-only ones.
For example, [[:alpha:]] is slower than [A-Za-z], assuming only those
characters are of interest.

4. [^...] will match \n unless \n is included in the set. For example,
"[^a]\\|$" will almost never match the $ (end-of-line) branch, because a
newline will be matched by the first branch. The only exception is at
the very end of the buffer if it is not newline-terminated, but that is
rarely worth considering for source code.

5. \r (carriage return) normally doesn't appear in buffers even if the
file uses DOS line endings. Line endings are converted into a single \n
(newline) when the buffer is read. In particular, $ does NOT match at
\r, only before \n.
When \r appears it is usually because the file contains a mixture of
line-ending styles, typically from being edited using broken tools.
Whether you want to take such files into account is a matter of
judgement; most modes don't bother.

6. Capturing groups cost more than non-capturing groups, but you
already know that.

On to specifics: here are annotations for possible improvements in
cc-langs.el. (I didn't bother about capturing groups here.)

--Apple-Mail=_F377F32E-C6EB-48F2-AE3E-C1E251445E4E
Content-Disposition: attachment; filename=cc-regexp-annot.diff
Content-Type: application/octet-stream; x-unix-mode=0644; name="cc-regexp-annot.diff"
Content-Transfer-Encoding: 7bit

diff --git a/lisp/progmodes/cc-langs.el b/lisp/progmodes/cc-langs.el
index d6089ea295..695c41fce6 100644
--- a/lisp/progmodes/cc-langs.el
+++ b/lisp/progmodes/cc-langs.el
@@ -903,6 +903,7 @@ c-opt-cpp-prefix
   ;; TODO (ACM, 2005-04-01). Amend the following to recognize escaped NLs;
   ;; amend all uses of c-opt-cpp-prefix which count regexp-depth.
   t "\\s *#\\s *"
+;;; XXX replace "\\s " with char alt, presumably [ \t] (2x)
   (java awk) nil)
 (c-lang-defvar c-opt-cpp-prefix (c-lang-const c-opt-cpp-prefix))
 
@@ -910,6 +911,7 @@ c-anchored-cpp-prefix
   "Regexp matching the prefix of a cpp directive anchored to BOL,
 in the languages that have a macro preprocessor."
t "^\\s *\\(#\\)\\s *" +;;; XXX replace "\\s " with char alt, presumably [ \t] (2x) (java awk) nil) (c-lang-defvar c-anchored-cpp-prefix (c-lang-const c-anchored-cpp-prefix)) @@ -920,6 +922,7 @@ c-opt-cpp-start t (if (c-lang-const c-opt-cpp-prefix) (concat (c-lang-const c-opt-cpp-prefix) "\\([" c-alnum "]+\\)")) +;;; XXX all cpp directives are lower-case ASCII letters; should be [a-z]+ ;; Pike, being a scripting language, recognizes hash-bangs too. pike (concat (c-lang-const c-opt-cpp-prefix) "\\([" c-alnum "]+\\|!\\)")) @@ -968,6 +971,8 @@ c-opt-cpp-macro-define-start (concat (c-lang-const c-opt-cpp-prefix) (c-lang-const c-opt-cpp-macro-define) "[ \t]+\\(\\(\\sw\\|_\\)+\\)\\(([^)]*)\\)?" +;;; XXX \\(\\sw\\|_\\)+ should be [[:word:]_]+, +;;; XXX or more likely [[:alpha:]_][[:alnum:]_]* ;; ^ ^ #defined name "\\([ \t]\\|\\\\\n\\)*"))) (c-lang-defvar c-opt-cpp-macro-define-start @@ -980,6 +985,8 @@ c-opt-cpp-macro-define-id (concat (c-lang-const c-opt-cpp-prefix) ; # (c-lang-const c-opt-cpp-macro-define) ; define "[ \t]+\\(\\sw\\|_\\)+"))) +;;; XXX \\(\\sw\\|_\\)+ should be [[:word:]_]+, +;;; XXX or more likely [[:alpha:]_][[:alnum:]_]* (c-lang-defvar c-opt-cpp-macro-define-id (c-lang-const c-opt-cpp-macro-define-id)) @@ -990,6 +997,10 @@ c-anchored-hash-define-no-parens (concat (c-lang-const c-anchored-cpp-prefix) (c-lang-const c-opt-cpp-macro-define) "[ \t]+\\(\\sw\\|_\\)+\\([^(a-zA-Z0-9_]\\|$\\)"))) +;;; XXX \\(\\sw\\|_\\)+ should be [[:word:]_]+, +;;; XXX or more likely [[:alpha:]_][[:alnum:]_]* +;;; XXX but what about the ASCII-only tail? Besides, [^(a-zA-Z0-9_] will +;;; XXX always match \n so the $ is almost never useful! (c-lang-defconst c-cpp-expr-directives "List of cpp directives (without the prefix) that are followed by an @@ -1353,6 +1364,7 @@ c-assignment-op-regexp (concat ;; Need special case for "=" since it's a prefix of "==". 
"=\\([^=]\\|$\\)" +;;; XXX [^=] matches \n so the $ is almost never useful "\\|" (c-make-keywords-re nil (c--set-difference (c-lang-const c-assignment-operators) @@ -1412,6 +1424,7 @@ c-<-pseudo-digraph-cont-regexp template opener followed by the \"::\" operator - usually." t regexp-unmatchable c++ "::\\([^:>]\\|$\\)") +;;; XXX [^:>] matches \n so the $ is almost never useful (c-lang-defvar c-<-pseudo-digraph-cont-regexp (c-lang-const c-<-pseudo-digraph-cont-regexp)) @@ -1599,6 +1612,7 @@ c-simple-ws Does not contain a \\| operator at the top level." ;; "\\s " is not enough since it doesn't match line breaks. t "\\(\\s \\|[\n\r]\\)") +;;; XXX replace with single char alt: [ \t\n\r\f] (c-lang-defconst c-simple-ws-depth ;; Number of regexp grouping parens in `c-simple-ws'. @@ -1702,6 +1716,7 @@ c-last-c-comment-end-on-line-re comments. When a match is found, submatch 1 contains the comment ender." t "\\(\\*/\\)\\([^*]\\|\\*+\\([^*/]\\|$\\)\\)*$" +;;; XXX [^*/] matches \n so the $ is almost never useful awk nil) (c-lang-defvar c-last-c-comment-end-on-line-re (c-lang-const c-last-c-comment-end-on-line-re)) @@ -1778,6 +1793,7 @@ comment-start-skip (c-lang-const c-block-comment-starter))) "\\|") "\\)\\s *")) +;;; XXX replace "\\s " with char alt, presumably [ \t] (c-lang-setvar comment-start-skip (c-lang-const comment-start-skip)) (c-lang-defconst comment-end-can-be-escaped @@ -1792,6 +1808,7 @@ c-syntactic-ws-start ;; Regexp matching any sequence that can start syntactic whitespace. ;; The only uncertain case is '#' when there are cpp directives. 
   t (concat "\\s \\|"
+;;; XXX replace "\\s " with char alt, presumably [ \t]
             (c-make-keywords-re nil
               (append (list (c-lang-const c-line-comment-starter)
                             (c-lang-const c-block-comment-starter)
@@ -1799,6 +1816,7 @@ c-syntactic-ws-start
                             "#"))
                       '("\n" "\r")))
             "\\|\\\\[\n\r]"
+;;; XXX unclear if \r is ever relevant here (2x)
             (when (memq 'gen-comment-delim c-emacs-features)
               "\\|\\s!")))
 (c-lang-defvar c-syntactic-ws-start (c-lang-const c-syntactic-ws-start))
@@ -1847,6 +1865,8 @@ c-unterminated-block-comment-regexp
                    "]"
                    "[^" (substring end 0 1) "\n\r]*"
                    "\\)*"))
+;;; XXX this is baroque, since c-block-comment-ender is either nil or "*/",
+;;; XXX so why not special case those and be done with it?
           (t
            (error "Can't handle a block comment ender of length %s"
                   (length end))))))))
@@ -1868,6 +1888,7 @@ c-block-comment-regexp
           ((= (length end) 2)
            (concat (regexp-quote (substring end 0 1)) "+"
                    (regexp-quote (substring end 1 2))))
+;;; XXX see above; c-block-comment-ender is nil or "*/"
           (t
            (error "Can't handle a block comment ender of length %s"
                   (length end))))))))
@@ -1883,6 +1904,7 @@ c-nonwhite-syntactic-ws
                        "[^\n\r]*[\n\r]"))
              (c-lang-const c-block-comment-regexp)
              "\\\\[\n\r]"
+;;; XXX \r here is probably unnecessary (3x)
              (when (memq 'gen-comment-delim c-emacs-features)
                "\\s!\\S!*\\s!"))
        "\\|"))
@@ -1927,6 +1949,7 @@ c-single-line-syntactic-ws
                     (c-lang-const c-block-comment-regexp)
                     "\\s *\\)*")
             "\\s *"))
+;;; XXX replace "\\s " with char alt, presumably [ \t] (3x)
 
 (c-lang-defconst c-single-line-syntactic-ws-depth
   ;; Number of regexp grouping parens in `c-single-line-syntactic-ws'.
@@ -3476,6 +3499,7 @@ c-type-decl-prefix-key
                "\\)"
                "\\([^=]\\|$\\)")
   pike "\\(\\*\\)\\([^=]\\|$\\)")
+;;; XXX [^=] matches \n so the $ is almost never useful (3x)
 (c-lang-defvar c-type-decl-prefix-key (c-lang-const c-type-decl-prefix-key)
   'dont-doc)
 
@@ -3498,6 +3522,7 @@ c-type-decl-operator-prefix-key
                "\\)"
                "\\([^=]\\|$\\)")
   pike "\\(\\*\\)\\([^=]\\|$\\)")
+;;; XXX [^=] matches \n so the $ is almost never useful (3x)
 (c-lang-defvar c-type-decl-operator-prefix-key
   (c-lang-const c-type-decl-operator-prefix-key))
 
@@ -3647,6 +3672,8 @@ c-pre-id-bracelist-key
 "
   t regexp-unmatchable
   c++ "new\\([^[:alnum:]_$]\\|$\\)\\|&&?\\(\\S.\\|$\\)")
+;;; XXX [^[:alnum:]_$] matches \n so the $ is almost never useful
+;;; XXX \\S. matches \n so the $ is almost never useful
 (c-lang-defvar c-pre-id-bracelist-key (c-lang-const c-pre-id-bracelist-key))
 
 (c-lang-defconst c-recognize-typeless-decls

--Apple-Mail=_F377F32E-C6EB-48F2-AE3E-C1E251445E4E--
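
Points 1 and 4 of the regexp basics at the top of this message can be
checked with a short Emacs Lisp sketch. This is only an illustration
under stated assumptions: every name prefixed with my- is hypothetical
and appears nowhere in cc-langs.el, the standard syntax table is used
instead of a CC Mode one, and the timing helper reports whatever the
local Emacs build happens to measure, so no figures are claimed here.

```elisp
;;; -*- lexical-binding: t -*-
(require 'benchmark)

;; Point 1: the same identifier pattern, once as an alternation of
;; single-character matchers and once as one character alternative.
;; Both names are hypothetical, for illustration only.
(defconst my-ident-alt "\\(?:\\sw\\|_\\)+")
(defconst my-ident-set "[[:word:]_]+")

(defun my-match-length (regexp string)
  "Return the length of the match of REGEXP at the start of STRING, or nil."
  (and (string-match (concat "\\`\\(?:" regexp "\\)") string)
       (match-end 0)))

(defun my-group-1 (regexp string)
  "Return the text of submatch 1 of REGEXP in STRING, or nil if no match."
  (and (string-match regexp string)
       (match-string 1 string)))

(defun my-time-match (regexp string)
  "Return the seconds needed to match REGEXP against STRING 1000 times."
  (car (benchmark-run 1000 (string-match-p regexp string))))

(defun my-demo ()
  "Show that both identifier forms agree, and where the $ branch is used."
  (with-syntax-table (standard-syntax-table)
    (let ((text (make-string 2000 ?a)))
      ;; Both forms accept the same 9 characters, "foo_bar42".
      (message "alt: %S  set: %S"
               (my-match-length my-ident-alt "foo_bar42 baz")
               (my-match-length my-ident-set "foo_bar42 baz"))
      ;; Point 4: [^=] consumes the newline, so submatch 1 is "\n" and
      ;; the $ branch is not taken.
      (message "before newline: %S" (my-group-1 "=\\([^=]\\|$\\)" "a =\nb"))
      ;; Only at the very end of unterminated text does $ match,
      ;; leaving submatch 1 empty.
      (message "at end of text: %S" (my-group-1 "=\\([^=]\\|$\\)" "a ="))
      ;; Timings vary by machine and Emacs version.
      (message "alt %.4fs  set %.4fs"
               (my-time-match my-ident-alt text)
               (my-time-match my-ident-set text)))))
```

Evaluating (my-demo) prints the results in the echo area and in
*Messages*. The 2000-character test string is deliberately modest;
whether a longer input makes the difference between the two forms more
visible depends on the Emacs version in use.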