From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.io!.POSTED.blaine.gmane.org!not-for-mail From: Stefan Monnier via "Bug reports for GNU Emacs, the Swiss army knife of text editors" Newsgroups: gmane.emacs.bugs Subject: bug#65726: 29.1.50; Crash in regexp engine Date: Tue, 05 Sep 2023 11:33:05 -0400 Message-ID: References: <8e1b4e50-0430-3eb3-e486-60def1e4821f@gmx.at> <83fs3u5e7u.fsf@gnu.org> <835y4q5apw.fsf@gnu.org> <776370AB-662F-4C0A-95BF-97DEA4F18F54@gmail.com> <3A0AC9E2-A420-47B6-870A-69C53FCAEF71@gmail.com> Reply-To: Stefan Monnier Mime-Version: 1.0 Content-Type: text/plain Injection-Info: ciao.gmane.io; posting-host="blaine.gmane.org:116.202.254.214"; logging-data="3532"; mail-complaints-to="usenet@ciao.gmane.io" User-Agent: Gnus/5.13 (Gnus v5.13) Cc: martin rudalics , Eli Zaretskii , 65726@debbugs.gnu.org To: Mattias =?UTF-8?Q?Engdeg=C3=A5rd?= Original-X-From: bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane-mx.org@gnu.org Tue Sep 05 17:34:22 2023 Return-path: Envelope-to: geb-bug-gnu-emacs@m.gmane-mx.org Original-Received: from lists.gnu.org ([209.51.188.17]) by ciao.gmane.io with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.92) (envelope-from ) id 1qdY4L-0000f1-Qh for geb-bug-gnu-emacs@m.gmane-mx.org; Tue, 05 Sep 2023 17:34:22 +0200 Original-Received: from localhost ([::1] helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1qdY43-0007Q3-N1; Tue, 05 Sep 2023 11:34:03 -0400 Original-Received: from eggs.gnu.org ([2001:470:142:3::10]) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1qdY42-0007PJ-1x for bug-gnu-emacs@gnu.org; Tue, 05 Sep 2023 11:34:02 -0400 Original-Received: from debbugs.gnu.org ([2001:470:142:5::43]) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128) (Exim 4.90_1) (envelope-from ) id 1qdY41-0004eJ-Pa for bug-gnu-emacs@gnu.org; Tue, 05 Sep 2023 11:34:01 -0400 Original-Received: from Debian-debbugs by debbugs.gnu.org with local (Exim 4.84_2) (envelope-from ) id 1qdY41-0001FN-PL for bug-gnu-emacs@gnu.org; Tue, 05 Sep 2023 11:34:01 -0400 X-Loop: help-debbugs@gnu.org Resent-From: Stefan Monnier Original-Sender: "Debbugs-submit" Resent-CC: bug-gnu-emacs@gnu.org Resent-Date: Tue, 05 Sep 2023 15:34:01 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 65726 X-GNU-PR-Package: emacs Original-Received: via spool by 65726-submit@debbugs.gnu.org id=B65726.16939279984728 (code B ref 65726); Tue, 05 Sep 2023 15:34:01 +0000 Original-Received: (at 65726) by debbugs.gnu.org; 5 Sep 2023 15:33:18 +0000 Original-Received: from localhost ([127.0.0.1]:57893 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1qdY3J-0001EB-I0 for submit@debbugs.gnu.org; Tue, 05 Sep 2023 11:33:18 -0400 Original-Received: from mailscanner.iro.umontreal.ca ([132.204.25.50]:37796) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1qdY3G-0001Du-Ew for 65726@debbugs.gnu.org; Tue, 05 Sep 2023 11:33:16 -0400 Original-Received: from pmg1.iro.umontreal.ca (localhost.localdomain [127.0.0.1]) by pmg1.iro.umontreal.ca (Proxmox) with ESMTP id 39677100068; Tue, 5 Sep 2023 11:33:08 -0400 (EDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=iro.umontreal.ca; s=mail; t=1693927986; bh=Y+PdumDpXf7s94Pg4qgXMk5NKSZUaj4Q/RZC78uLU/A=; h=From:To:Cc:Subject:In-Reply-To:References:Date:From; b=nhIGwAUlDzWVxmoEWvnRIxHZIBSUeIjAafcxZxorgq68jDK/s4OaqL3f2lr1xEYLB 9mMmI84Nzap5CeRZxW5fi//hkV4eWlRAc+XKbPKUB2T0YOV74zGMtMYIVhHDr9x+WJ HTs2LDcijiv9cc+G2uZM+3MG3K5dndqJLfQnuOcp9cZ35MYSDs239Jvk51EDqfld6e O2lBeFdoAwUci77v7EPSL1tIJb2zqE0lvqc0h6WWVlhaZt7L+ddOx3om4hIC01eDlC khZ8c3bVzBCmJVo8sMNuvn2cJ76gADSejtnOrsPAl7N68cWNiNhUMx9MkzYT9KEVmc lmJezE0aeEpaA== Original-Received: from mail01.iro.umontreal.ca (unknown [172.31.2.1]) by pmg1.iro.umontreal.ca (Proxmox) with ESMTP id 92BC2100033; Tue, 5 Sep 2023 11:33:06 -0400 (EDT) Original-Received: from pastel (69-165-136-223.dsl.teksavvy.com [69.165.136.223]) by mail01.iro.umontreal.ca (Postfix) with ESMTPSA id 5F83E1202A0; Tue, 5 Sep 2023 11:33:06 -0400 (EDT) In-Reply-To: ("Mattias =?UTF-8?Q?Engdeg=C3=A5rd?="'s message of "Tue, 5 Sep 2023 15:50:05 +0200") X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list X-BeenThere: bug-gnu-emacs@gnu.org List-Id: "Bug reports for GNU Emacs, the Swiss army knife of text editors" List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane-mx.org@gnu.org Original-Sender: bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane-mx.org@gnu.org Xref: news.gmane.io gmane.emacs.bugs:269360 Archived-At: >> Feel free. To me, it feels too much like testing the presence/absence >> of a very specific bug (i.e. too unlikely that we'll reintroduce that >> exact bug. It's not like it was a weird case I didn't think of). > Alright, pushed to emacs-29. It's a cheap test and offers at least some kind > of protection in the event that someone undoes your handiwork. Such things > do happen after all. I marked the backtracking test as failing because the "real" fix wil likely be too risky to put onto `emacs-29` (at least for now). >> Let's see if I can finagle something. > I liked your previous optimisation, so if you could finagle it back without > breaking anything that would be pukka. WDYT of the approach below? You can ignore the `mutually_exclusive_(charset|exactn)` thingies which just move code around (it's necessary for the patch). It's a bit ugly because the actual assumptions on which it relies aren't spelled out explicitly (and even less assert-checked at run time). The basic idea is that `done` points to the beginning of the "current loop": as long as we move forward we presume we may still be in the current loop, and when we find a jump backward it's either a jump to the beginning of a new (nested/further) loop (so we move `done`) or a jump to the same loop or to some earlier loop (which we presumably handle elsewhere). I also tried a safer option where we do: return ((p2 < done ? false : p2 == done ? true : mutually_exclusive_aux (bufp, p1, p2, p2 > p2_orig ? done : p2)) && (p2_other < done ? false : p2_other == done ? true : mutually_exclusive_aux (bufp, p1, p2_other, p2_other > p2_orig ? done : p2_other))); i.e. we only consider `done` itself as "already checked", which should definitely be safe. It also seems to be sufficient to handle things like: (should (looking-at "x*\\(=+?\\|h\\)+")) Adding an `assert (p2 >= done)` confirms that this does happen, so whether we return true or false when `p2 < done` does matter, so I guess we should go with the safer option unless we can really convince ourselves that the more optimistic option is also correct. Then again, maybe we should bite the bullet and actually keep track of the positions already visited, so we don't need any "fancy" argument. Stefan diff --git a/src/regex-emacs.c b/src/regex-emacs.c index 7e75f0ac597..0d3cf39d619 100644 --- a/src/regex-emacs.c +++ b/src/regex-emacs.c @@ -3643,19 +3643,125 @@ execute_charset (re_char **pp, int c, int corig, bool unibyte, return not; } +/* Case where `p2` points to an `exactn`. */ +static bool +mutually_exclusive_exactn (struct re_pattern_buffer *bufp, re_char *p1, + re_char *p2) +{ + bool multibyte = RE_MULTIBYTE_P (bufp); + int c + = (re_opcode_t) *p2 == endline ? '\n' + : RE_STRING_CHAR (p2 + 2, multibyte); + + if ((re_opcode_t) *p1 == exactn) + { + if (c != RE_STRING_CHAR (p1 + 2, multibyte)) + { + DEBUG_PRINT (" '%c' != '%c' => fast loop.\n", c, p1[2]); + return true; + } + } + + else if ((re_opcode_t) *p1 == charset + || (re_opcode_t) *p1 == charset_not) + { + if (!execute_charset (&p1, c, c, !multibyte || ASCII_CHAR_P (c), + Qnil)) + { + DEBUG_PRINT (" No match => fast loop.\n"); + return true; + } + } + else if ((re_opcode_t) *p1 == anychar + && c == '\n') + { + DEBUG_PRINT (" . != \\n => fast loop.\n"); + return true; + } + return false; +} + +/* Case where `p2` points to an `charset`. */ +static bool +mutually_exclusive_charset (struct re_pattern_buffer *bufp, re_char *p1, + re_char *p2) +{ + /* It is hard to list up all the character in charset + P2 if it includes multibyte character. Give up in + such case. */ + if (!RE_MULTIBYTE_P (bufp) || !CHARSET_RANGE_TABLE_EXISTS_P (p2)) + { + /* Now, we are sure that P2 has no range table. + So, for the size of bitmap in P2, 'p2[1]' is + enough. But P1 may have range table, so the + size of bitmap table of P1 is extracted by + using macro 'CHARSET_BITMAP_SIZE'. + + In a multibyte case, we know that all the character + listed in P2 is ASCII. In a unibyte case, P1 has only a + bitmap table. So, in both cases, it is enough to test + only the bitmap table of P1. */ + + if ((re_opcode_t) *p1 == charset) + { + int idx; + /* We win if the charset inside the loop + has no overlap with the one after the loop. */ + for (idx = 0; + (idx < (int) p2[1] + && idx < CHARSET_BITMAP_SIZE (p1)); + idx++) + if ((p2[2 + idx] & p1[2 + idx]) != 0) + break; + + if (idx == p2[1] + || idx == CHARSET_BITMAP_SIZE (p1)) + { + DEBUG_PRINT (" No match => fast loop.\n"); + return true; + } + } + else if ((re_opcode_t) *p1 == charset_not) + { + int idx; + /* We win if the charset_not inside the loop lists + every character listed in the charset after. */ + for (idx = 0; idx < (int) p2[1]; idx++) + if (! (p2[2 + idx] == 0 + || (idx < CHARSET_BITMAP_SIZE (p1) + && ((p2[2 + idx] & ~ p1[2 + idx]) == 0)))) + break; + + if (idx == p2[1]) + { + DEBUG_PRINT (" No match => fast loop.\n"); + return true; + } + } + } + return false; +} + /* True if "p1 matches something" implies "p2 fails". */ +/* Avoiding inf-loops: + We're trying to follow all paths reachable from `p2`, but since some + loops can match the empty string, this can loop back to `p2`. + To avoid inf-looping, we rely on a lexicographic ordering using `done` + and `p2`: at every recursion, either `done` is larger, or it + is unchanged and `p2` is larger. */ static bool -mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1, - re_char *p2) +mutually_exclusive_aux (struct re_pattern_buffer *bufp, re_char *p1, + re_char *p2, re_char *done) { re_opcode_t op2; - bool multibyte = RE_MULTIBYTE_P (bufp); unsigned char *pend = bufp->buffer + bufp->used; re_char *p2_orig = p2; eassert (p1 >= bufp->buffer && p1 < pend && p2 >= bufp->buffer && p2 <= pend); + eassert (p2 >= done); + /* Skip over open/close-group commands. If what follows this loop is a ...+ construct, look at what begins its body, since we will have to @@ -3684,98 +3790,14 @@ mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1, case endline: case exactn: - { - int c - = (re_opcode_t) *p2 == endline ? '\n' - : RE_STRING_CHAR (p2 + 2, multibyte); - - if ((re_opcode_t) *p1 == exactn) - { - if (c != RE_STRING_CHAR (p1 + 2, multibyte)) - { - DEBUG_PRINT (" '%c' != '%c' => fast loop.\n", c, p1[2]); - return true; - } - } - - else if ((re_opcode_t) *p1 == charset - || (re_opcode_t) *p1 == charset_not) - { - if (!execute_charset (&p1, c, c, !multibyte || ASCII_CHAR_P (c), - Qnil)) - { - DEBUG_PRINT (" No match => fast loop.\n"); - return true; - } - } - else if ((re_opcode_t) *p1 == anychar - && c == '\n') - { - DEBUG_PRINT (" . != \\n => fast loop.\n"); - return true; - } - } - break; + return mutually_exclusive_exactn (bufp, p1, p2); case charset: { if ((re_opcode_t) *p1 == exactn) - /* Reuse the code above. */ - return mutually_exclusive_p (bufp, p2, p1); - - /* It is hard to list up all the character in charset - P2 if it includes multibyte character. Give up in - such case. */ - else if (!multibyte || !CHARSET_RANGE_TABLE_EXISTS_P (p2)) - { - /* Now, we are sure that P2 has no range table. - So, for the size of bitmap in P2, 'p2[1]' is - enough. But P1 may have range table, so the - size of bitmap table of P1 is extracted by - using macro 'CHARSET_BITMAP_SIZE'. - - In a multibyte case, we know that all the character - listed in P2 is ASCII. In a unibyte case, P1 has only a - bitmap table. So, in both cases, it is enough to test - only the bitmap table of P1. */ - - if ((re_opcode_t) *p1 == charset) - { - int idx; - /* We win if the charset inside the loop - has no overlap with the one after the loop. */ - for (idx = 0; - (idx < (int) p2[1] - && idx < CHARSET_BITMAP_SIZE (p1)); - idx++) - if ((p2[2 + idx] & p1[2 + idx]) != 0) - break; - - if (idx == p2[1] - || idx == CHARSET_BITMAP_SIZE (p1)) - { - DEBUG_PRINT (" No match => fast loop.\n"); - return true; - } - } - else if ((re_opcode_t) *p1 == charset_not) - { - int idx; - /* We win if the charset_not inside the loop lists - every character listed in the charset after. */ - for (idx = 0; idx < (int) p2[1]; idx++) - if (! (p2[2 + idx] == 0 - || (idx < CHARSET_BITMAP_SIZE (p1) - && ((p2[2 + idx] & ~ p1[2 + idx]) == 0)))) - break; - - if (idx == p2[1]) - { - DEBUG_PRINT (" No match => fast loop.\n"); - return true; - } - } - } + return mutually_exclusive_exactn (bufp, p2, p1); + else + return mutually_exclusive_charset (bufp, p1, p2); } break; @@ -3783,9 +3805,9 @@ mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1, switch (*p1) { case exactn: + return mutually_exclusive_exactn (bufp, p2, p1); case charset: - /* Reuse the code above. */ - return mutually_exclusive_p (bufp, p2, p1); + return mutually_exclusive_charset (bufp, p2, p1); case charset_not: /* When we have two charset_not, it's very unlikely that they don't overlap. The union of the two sets of excluded @@ -3830,11 +3852,18 @@ mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1, int mcnt; p2++; EXTRACT_NUMBER_AND_INCR (mcnt, p2); - /* Don't just test `mcnt > 0` because non-greedy loops have - their test at the end with an unconditional jump at the start. */ - if (p2 + mcnt > p2_orig) /* Ensure forward progress. */ - return (mutually_exclusive_p (bufp, p1, p2) - && mutually_exclusive_p (bufp, p1, p2 + mcnt)); + re_char *p2_other = p2 + mcnt; + /* Presumably, any position up to `done` should be a position we + already considered elsewhere (because our state machines are not + arbitrary: they consists of syntaxically nested loops with limited + control flow), so we can consider those back jumps as mutually + exclusive here. */ + return ((p2 <= done + || mutually_exclusive_aux (bufp, p1, p2, + p2 > p2_orig ? done : p2)) + && (p2_other <= done + || mutually_exclusive_aux (bufp, p1, p2_other, + p2_other > p2_orig ? done : p2_other))); break; } @@ -3846,6 +3875,12 @@ mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1, return false; } +static bool +mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1, + re_char *p2) +{ + return mutually_exclusive_aux (bufp, p1, p2, min (p2, p1)); +} /* Matching routines. */