From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.io!.POSTED.blaine.gmane.org!not-for-mail From: Stefan Monnier via "Bug reports for GNU Emacs, the Swiss army knife of text editors" Newsgroups: gmane.emacs.bugs Subject: bug#65726: 29.1.50; Crash in regexp engine Date: Thu, 14 Sep 2023 10:41:30 -0400 Message-ID: References: <8e1b4e50-0430-3eb3-e486-60def1e4821f@gmx.at> <83fs3u5e7u.fsf@gnu.org> <835y4q5apw.fsf@gnu.org> <776370AB-662F-4C0A-95BF-97DEA4F18F54@gmail.com> <3A0AC9E2-A420-47B6-870A-69C53FCAEF71@gmail.com> Reply-To: Stefan Monnier Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="=-=-=" Injection-Info: ciao.gmane.io; posting-host="blaine.gmane.org:116.202.254.214"; logging-data="6316"; mail-complaints-to="usenet@ciao.gmane.io" User-Agent: Gnus/5.13 (Gnus v5.13) Cc: martin rudalics , Eli Zaretskii , 65726@debbugs.gnu.org To: Mattias =?UTF-8?Q?Engdeg=C3=A5rd?= Original-X-From: bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane-mx.org@gnu.org Thu Sep 14 16:42:15 2023 Return-path: Envelope-to: geb-bug-gnu-emacs@m.gmane-mx.org Original-Received: from lists.gnu.org ([209.51.188.17]) by ciao.gmane.io with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.92) (envelope-from ) id 1qgnXq-0001SJ-OA for geb-bug-gnu-emacs@m.gmane-mx.org; Thu, 14 Sep 2023 16:42:15 +0200 Original-Received: from localhost ([::1] helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1qgnXa-0002aD-8R; Thu, 14 Sep 2023 10:41:58 -0400 Original-Received: from eggs.gnu.org ([2001:470:142:3::10]) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1qgnXY-0002Zx-R9 for bug-gnu-emacs@gnu.org; Thu, 14 Sep 2023 10:41:56 -0400 Original-Received: from debbugs.gnu.org ([2001:470:142:5::43]) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128) (Exim 4.90_1) (envelope-from ) id 1qgnXY-0000oy-ET for bug-gnu-emacs@gnu.org; Thu, 14 Sep 2023 10:41:56 -0400 Original-Received: from Debian-debbugs by debbugs.gnu.org with local (Exim 4.84_2) (envelope-from ) id 1qgnXd-0002Sm-Vt for bug-gnu-emacs@gnu.org; Thu, 14 Sep 2023 10:42:02 -0400 X-Loop: help-debbugs@gnu.org Resent-From: Stefan Monnier Original-Sender: "Debbugs-submit" Resent-CC: bug-gnu-emacs@gnu.org Resent-Date: Thu, 14 Sep 2023 14:42:01 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 65726 X-GNU-PR-Package: emacs Original-Received: via spool by 65726-submit@debbugs.gnu.org id=B65726.16947025099450 (code B ref 65726); Thu, 14 Sep 2023 14:42:01 +0000 Original-Received: (at 65726) by debbugs.gnu.org; 14 Sep 2023 14:41:49 +0000 Original-Received: from localhost ([127.0.0.1]:40770 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1qgnXR-0002SL-01 for submit@debbugs.gnu.org; Thu, 14 Sep 2023 10:41:49 -0400 Original-Received: from mailscanner.iro.umontreal.ca ([132.204.25.50]:45440) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1qgnXM-0002S6-QK for 65726@debbugs.gnu.org; Thu, 14 Sep 2023 10:41:47 -0400 Original-Received: from pmg2.iro.umontreal.ca (localhost.localdomain [127.0.0.1]) by pmg2.iro.umontreal.ca (Proxmox) with ESMTP id 879778075E; Thu, 14 Sep 2023 10:41:33 -0400 (EDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=iro.umontreal.ca; s=mail; t=1694702491; bh=AnPra6i1Gx/JPAVeB+PwZz5Q0R4RPyQB/eX0ZbunUBQ=; h=From:To:Cc:Subject:In-Reply-To:References:Date:From; b=LKmijECB7RwXY0zZrGhVFEK0tOdXzCY+9o5aZkTEqO6CsdkjB6TNGecAakw/yeZl2 AIFFeu5sOhCKXbdd1RA7IJdhJ3ILtkyVr0BgPn16b91TFzRR0WhMhr1kOHPqW4apb8 fEg3Vi2RZztojZfSgQyI7YjUz4YT8p2vMjpkYickCEtM/hgr8W0DvMZSSiXCEcqg8t YtCryMhFndtx73BYFyBkzJR0BS2WcMCfxQsV/1MOeZbOkfzq46NCk5yeA7toLk4bmJ 7SBUH6qIong/axD/2Q2F+ndaXDO8YvpqFV8rPuxdHGtvqh/kAjBQqqVpm+awa9sq7u 0AbC/q4MiE5bg== Original-Received: from mail01.iro.umontreal.ca (unknown [172.31.2.1]) by pmg2.iro.umontreal.ca (Proxmox) with ESMTP id A435180283; Thu, 14 Sep 2023 10:41:31 -0400 (EDT) Original-Received: from pastel (unknown [104.247.229.91]) by mail01.iro.umontreal.ca (Postfix) with ESMTPSA id 73A571202C2; Thu, 14 Sep 2023 10:41:31 -0400 (EDT) In-Reply-To: ("Mattias =?UTF-8?Q?Engdeg=C3=A5rd?="'s message of "Sat, 9 Sep 2023 18:34:42 +0200") X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list X-BeenThere: bug-gnu-emacs@gnu.org List-Id: "Bug reports for GNU Emacs, the Swiss army knife of text editors" List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane-mx.org@gnu.org Original-Sender: bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane-mx.org@gnu.org Xref: news.gmane.io gmane.emacs.bugs:270439 Archived-At: --=-=-= Content-Type: text/plain > so yes, we may need to remember where we've been. (At this point > someone will inevitably point out a helpful invariant that is obvious > in hindsight. This is just my cunning attempt at making that happen.) No helpful invariant that makes it trivial, but if we keep pushing the same idea that we rely on the assumption that we just have a syntactically nested loop nest, then we can handle that as in the patch below. I.e. keep the idea I proposed of keeping track of a beg..end region that's already been handled. But now we really do have to maintain both ends (before, I only had `done` which kept track of the end of the region), and just "restart" when we jump back to a point before the "done" region. Stefan --=-=-= Content-Type: text/x-diff Content-Disposition: inline; filename=regexp.c.patch diff --git a/src/regex-emacs.c b/src/regex-emacs.c index 394ba22e9b0..a6b7faadf86 100644 --- a/src/regex-emacs.c +++ b/src/regex-emacs.c @@ -3643,19 +3643,128 @@ execute_charset (re_char **pp, int c, int corig, bool unibyte, return not; } +/* Case where `p2` points to an `exactn`. */ +static bool +mutually_exclusive_exactn (struct re_pattern_buffer *bufp, re_char *p1, + re_char *p2) +{ + bool multibyte = RE_MULTIBYTE_P (bufp); + int c + = (re_opcode_t) *p2 == endline ? '\n' + : RE_STRING_CHAR (p2 + 2, multibyte); + + if ((re_opcode_t) *p1 == exactn) + { + if (c != RE_STRING_CHAR (p1 + 2, multibyte)) + { + DEBUG_PRINT (" '%c' != '%c' => fast loop.\n", c, p1[2]); + return true; + } + } + + else if ((re_opcode_t) *p1 == charset + || (re_opcode_t) *p1 == charset_not) + { + if (!execute_charset (&p1, c, c, !multibyte || ASCII_CHAR_P (c), + Qnil)) + { + DEBUG_PRINT (" No match => fast loop.\n"); + return true; + } + } + else if ((re_opcode_t) *p1 == anychar + && c == '\n') + { + DEBUG_PRINT (" . != \\n => fast loop.\n"); + return true; + } + return false; +} + +/* Case where `p2` points to an `charset`. */ +static bool +mutually_exclusive_charset (struct re_pattern_buffer *bufp, re_char *p1, + re_char *p2) +{ + /* It is hard to list up all the character in charset + P2 if it includes multibyte character. Give up in + such case. */ + if (!RE_MULTIBYTE_P (bufp) || !CHARSET_RANGE_TABLE_EXISTS_P (p2)) + { + /* Now, we are sure that P2 has no range table. + So, for the size of bitmap in P2, 'p2[1]' is + enough. But P1 may have range table, so the + size of bitmap table of P1 is extracted by + using macro 'CHARSET_BITMAP_SIZE'. + + In a multibyte case, we know that all the character + listed in P2 is ASCII. In a unibyte case, P1 has only a + bitmap table. So, in both cases, it is enough to test + only the bitmap table of P1. */ + + if ((re_opcode_t) *p1 == charset) + { + int idx; + /* We win if the charset inside the loop + has no overlap with the one after the loop. */ + for (idx = 0; + (idx < (int) p2[1] + && idx < CHARSET_BITMAP_SIZE (p1)); + idx++) + if ((p2[2 + idx] & p1[2 + idx]) != 0) + break; + + if (idx == p2[1] + || idx == CHARSET_BITMAP_SIZE (p1)) + { + DEBUG_PRINT (" No match => fast loop.\n"); + return true; + } + } + else if ((re_opcode_t) *p1 == charset_not) + { + int idx; + /* We win if the charset_not inside the loop lists + every character listed in the charset after. */ + for (idx = 0; idx < (int) p2[1]; idx++) + if (! (p2[2 + idx] == 0 + || (idx < CHARSET_BITMAP_SIZE (p1) + && ((p2[2 + idx] & ~ p1[2 + idx]) == 0)))) + break; + + if (idx == p2[1]) + { + DEBUG_PRINT (" No match => fast loop.\n"); + return true; + } + } + } + return false; +} + /* True if "p1 matches something" implies "p2 fails". */ +/* Avoiding inf-loops: + We're trying to follow all paths reachable from `p2`, but since some + loops can match the empty string, this can loop back to `p2`. + To avoid inf-looping, we keep track of points that have been considered + "already". Instead of keeping a list of such points, we assume + that the code is a "sequential loop nest" and only keep track of + `done_beg` and `done_end` which delimit a chunk of bytecode we consider + as already considered. */ static bool -mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1, - re_char *p2) +mutually_exclusive_aux (struct re_pattern_buffer *bufp, re_char *p1, + re_char *p2, re_char *done_beg, re_char *done_end) { re_opcode_t op2; - bool multibyte = RE_MULTIBYTE_P (bufp); unsigned char *pend = bufp->buffer + bufp->used; re_char *p2_orig = p2; eassert (p1 >= bufp->buffer && p1 < pend && p2 >= bufp->buffer && p2 <= pend); + eassert (done_beg <= done_end); + eassert (done_end <= p2); + /* Skip over open/close-group commands. If what follows this loop is a ...+ construct, look at what begins its body, since we will have to @@ -3684,98 +3793,14 @@ mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1, case endline: case exactn: - { - int c - = (re_opcode_t) *p2 == endline ? '\n' - : RE_STRING_CHAR (p2 + 2, multibyte); - - if ((re_opcode_t) *p1 == exactn) - { - if (c != RE_STRING_CHAR (p1 + 2, multibyte)) - { - DEBUG_PRINT (" '%c' != '%c' => fast loop.\n", c, p1[2]); - return true; - } - } - - else if ((re_opcode_t) *p1 == charset - || (re_opcode_t) *p1 == charset_not) - { - if (!execute_charset (&p1, c, c, !multibyte || ASCII_CHAR_P (c), - Qnil)) - { - DEBUG_PRINT (" No match => fast loop.\n"); - return true; - } - } - else if ((re_opcode_t) *p1 == anychar - && c == '\n') - { - DEBUG_PRINT (" . != \\n => fast loop.\n"); - return true; - } - } - break; + return mutually_exclusive_exactn (bufp, p1, p2); case charset: { if ((re_opcode_t) *p1 == exactn) - /* Reuse the code above. */ - return mutually_exclusive_p (bufp, p2, p1); - - /* It is hard to list up all the character in charset - P2 if it includes multibyte character. Give up in - such case. */ - else if (!multibyte || !CHARSET_RANGE_TABLE_EXISTS_P (p2)) - { - /* Now, we are sure that P2 has no range table. - So, for the size of bitmap in P2, 'p2[1]' is - enough. But P1 may have range table, so the - size of bitmap table of P1 is extracted by - using macro 'CHARSET_BITMAP_SIZE'. - - In a multibyte case, we know that all the character - listed in P2 is ASCII. In a unibyte case, P1 has only a - bitmap table. So, in both cases, it is enough to test - only the bitmap table of P1. */ - - if ((re_opcode_t) *p1 == charset) - { - int idx; - /* We win if the charset inside the loop - has no overlap with the one after the loop. */ - for (idx = 0; - (idx < (int) p2[1] - && idx < CHARSET_BITMAP_SIZE (p1)); - idx++) - if ((p2[2 + idx] & p1[2 + idx]) != 0) - break; - - if (idx == p2[1] - || idx == CHARSET_BITMAP_SIZE (p1)) - { - DEBUG_PRINT (" No match => fast loop.\n"); - return true; - } - } - else if ((re_opcode_t) *p1 == charset_not) - { - int idx; - /* We win if the charset_not inside the loop lists - every character listed in the charset after. */ - for (idx = 0; idx < (int) p2[1]; idx++) - if (! (p2[2 + idx] == 0 - || (idx < CHARSET_BITMAP_SIZE (p1) - && ((p2[2 + idx] & ~ p1[2 + idx]) == 0)))) - break; - - if (idx == p2[1]) - { - DEBUG_PRINT (" No match => fast loop.\n"); - return true; - } - } - } + return mutually_exclusive_exactn (bufp, p2, p1); + else + return mutually_exclusive_charset (bufp, p1, p2); } break; @@ -3783,9 +3808,9 @@ mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1, switch (*p1) { case exactn: + return mutually_exclusive_exactn (bufp, p2, p1); case charset: - /* Reuse the code above. */ - return mutually_exclusive_p (bufp, p2, p1); + return mutually_exclusive_charset (bufp, p2, p1); case charset_not: /* When we have two charset_not, it's very unlikely that they don't overlap. The union of the two sets of excluded @@ -3830,12 +3855,30 @@ mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1, int mcnt; p2++; EXTRACT_NUMBER_AND_INCR (mcnt, p2); - /* Don't just test `mcnt > 0` because non-greedy loops have - their test at the end with an unconditional jump at the start. */ - if (p2 > p2_orig && mcnt >= 0) /* Ensure forward progress. */ - return (mutually_exclusive_p (bufp, p1, p2) - && mutually_exclusive_p (bufp, p1, p2 + mcnt)); - break; + re_char *p2_other = p2 + mcnt; + + /* Presumably, any position in `done_beg..end` should be a position we + already considered elsewhere (because our state machines are not + arbitrary: they consists of syntaxically nested loops with limited + control flow), so we can consider those back jumps as mutually + exclusive here. + This shows in the code below when `p3 > p2_orig`, i.e. when we jump + forward: in that case we bump `done_end` up to `p3` under + the assumption that any other reachable position + between `done_end` and `p3` will be checked by the *other* + call to RECURSE. + The recursion should terminate because of a lexical ordering between + `done_beg`, `done_end`, and `p2`: + At each recursion, either `done_beg` gets smaller, + or `done_beg` is unchanged and `done_end` gets larger + or they're both unchanged and `p2` gets larger. */ +#define RECURSE(p3) \ + ((p3) < done_beg ? mutually_exclusive_aux (bufp, p1, p3, p3, p3) \ + : (p3) <= done_end ? true \ + : mutually_exclusive_aux (bufp, p1, p3, done_beg, \ + (p3) > p2_orig ? done_end : (p3))) + + return RECURSE (p2) && RECURSE (p2_other); } default: @@ -3846,6 +3889,13 @@ mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1, return false; } +static bool +mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1, + re_char *p2) +{ + re_char *done = min (p2, p1); + return mutually_exclusive_aux (bufp, p1, p2, done, done); +} /* Matching routines. */ --=-=-=--