From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.io!.POSTED.blaine.gmane.org!not-for-mail From: Stefan Monnier via "Bug reports for GNU Emacs, the Swiss army knife of text editors" Newsgroups: gmane.emacs.bugs Subject: bug#65726: 29.1.50; Crash in regexp engine Date: Sun, 17 Sep 2023 22:14:37 -0400 Message-ID: References: <8e1b4e50-0430-3eb3-e486-60def1e4821f@gmx.at> <83fs3u5e7u.fsf@gnu.org> <835y4q5apw.fsf@gnu.org> <776370AB-662F-4C0A-95BF-97DEA4F18F54@gmail.com> <3A0AC9E2-A420-47B6-870A-69C53FCAEF71@gmail.com> <65105BCA-1884-4673-A8F8-03E3C7ABAAEA@gmail.com> Reply-To: Stefan Monnier Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="=-=-=" Injection-Info: ciao.gmane.io; posting-host="blaine.gmane.org:116.202.254.214"; logging-data="26605"; mail-complaints-to="usenet@ciao.gmane.io" User-Agent: Gnus/5.13 (Gnus v5.13) Cc: martin rudalics , Eli Zaretskii , 65726@debbugs.gnu.org To: Mattias =?UTF-8?Q?Engdeg=C3=A5rd?= Original-X-From: bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane-mx.org@gnu.org Mon Sep 18 04:15:19 2023 Return-path: Envelope-to: geb-bug-gnu-emacs@m.gmane-mx.org Original-Received: from lists.gnu.org ([209.51.188.17]) by ciao.gmane.io with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.92) (envelope-from ) id 1qi3nC-0006gr-FF for geb-bug-gnu-emacs@m.gmane-mx.org; Mon, 18 Sep 2023 04:15:18 +0200 Original-Received: from localhost ([::1] helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1qi3mp-0007iM-Ri; Sun, 17 Sep 2023 22:14:55 -0400 Original-Received: from eggs.gnu.org ([2001:470:142:3::10]) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1qi3mo-0007hy-Rf for bug-gnu-emacs@gnu.org; Sun, 17 Sep 2023 22:14:54 -0400 Original-Received: from debbugs.gnu.org ([2001:470:142:5::43]) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128) (Exim 4.90_1) (envelope-from ) id 1qi3mo-0006ir-K2 for bug-gnu-emacs@gnu.org; Sun, 17 Sep 2023 22:14:54 -0400 Original-Received: from Debian-debbugs by debbugs.gnu.org with local (Exim 4.84_2) (envelope-from ) id 1qi3mw-0004eF-GI for bug-gnu-emacs@gnu.org; Sun, 17 Sep 2023 22:15:02 -0400 X-Loop: help-debbugs@gnu.org Resent-From: Stefan Monnier Original-Sender: "Debbugs-submit" Resent-CC: bug-gnu-emacs@gnu.org Resent-Date: Mon, 18 Sep 2023 02:15:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 65726 X-GNU-PR-Package: emacs Original-Received: via spool by 65726-submit@debbugs.gnu.org id=B65726.169500329617837 (code B ref 65726); Mon, 18 Sep 2023 02:15:02 +0000 Original-Received: (at 65726) by debbugs.gnu.org; 18 Sep 2023 02:14:56 +0000 Original-Received: from localhost ([127.0.0.1]:51709 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1qi3mq-0004dc-34 for submit@debbugs.gnu.org; Sun, 17 Sep 2023 22:14:56 -0400 Original-Received: from mailscanner.iro.umontreal.ca ([132.204.25.50]:18907) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1qi3mn-0004dH-OP for 65726@debbugs.gnu.org; Sun, 17 Sep 2023 22:14:54 -0400 Original-Received: from pmg2.iro.umontreal.ca (localhost.localdomain [127.0.0.1]) by pmg2.iro.umontreal.ca (Proxmox) with ESMTP id 0B5A4806AC; Sun, 17 Sep 2023 22:14:40 -0400 (EDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=iro.umontreal.ca; s=mail; t=1695003278; bh=7lFsvY7TfGRN8McEO/TdFeXojCeBmAGGrvlSdTTOLhI=; h=From:To:Cc:Subject:In-Reply-To:References:Date:From; b=J/B1nAklttwN3Z1MFiIStwZfI+7Iu/SVK7AFTQS6iyRdCfqa0bdtU8KadDPrTjlG+ glBO4Bm4X+rzV/iAcmvS8OtlpMUsVW3coY91wkhiWWbbos0N+Uf8UkNsmSL+pCfTyo AAvkTBdfWr5ginbdIAZ8edi/j20WN8tmGpSIlfY6CTXJP7ErgIQp1Gar+L/smDH70w 6NqofI5prNLOwri709Wkaxffd7WYPcN0XnuoGPHYhvcLoIsmCTLxSvmIvJRJw2UU+m +Fljvh9UPTKpAlgll54QhIL9SW1G8SwQ0MzCutuyB/ulnapzErgY8roH98v2oUUWZB OqPE7B3GJC9LA== Original-Received: from mail01.iro.umontreal.ca (unknown [172.31.2.1]) by pmg2.iro.umontreal.ca (Proxmox) with ESMTP id 5D0A180250; Sun, 17 Sep 2023 22:14:38 -0400 (EDT) Original-Received: from pastel (unknown [45.72.220.249]) by mail01.iro.umontreal.ca (Postfix) with ESMTPSA id 1C969120218; Sun, 17 Sep 2023 22:14:38 -0400 (EDT) In-Reply-To: (Stefan Monnier's message of "Sat, 16 Sep 2023 11:48:02 -0400") X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list X-BeenThere: bug-gnu-emacs@gnu.org List-Id: "Bug reports for GNU Emacs, the Swiss army knife of text editors" List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane-mx.org@gnu.org Original-Sender: bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane-mx.org@gnu.org Xref: news.gmane.io gmane.emacs.bugs:270733 Archived-At: --=-=-= Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable > I'm reworking my code so it takes 2 arguments: the loop-entry and the > loop-exit. Any jump outside of those bounds should never happen so we > can assert it to check that our assumptions are right (and we can > return false when we don't compile assertions, so it's never over-optimis= tic). Hmm... not quite working yet. With the patch below I get: % make test/src/regex-emacs-tests [...] passed 30/34 regexp-multibyte-unibyte (0.000264 sec) 0: /on_failure_jump_smart to 9 3: /exactn/1/x 6: /jump to 0 9: /start_memory/1 11: /on_failure_jump to 26 14: /on_failure_jump_smart to 23 17: /exactn/1/=3D 20: /jump to 14 23: /jump to 29 26: /exactn/1/h 29: /stop_memory/1 31: /on_failure_jump_loop to 37 34: /jump to 9 37: /succeed 38: end of pattern. 38 bytes used/128 bytes allocated. re_nsub: 1 regs_alloc: 1 can_be_null: 0 p1=3D=3D3, p2=3D=3D34, loop-entry=3D=3D14, loop-exit=3D=3D26 Test regexp-tests-backtrack-optimization backtrace: looking-at("x*\\(=3D*\\|h\\)+") [...] The problem is that my code sees the `jump 9` (near the end) followed by `start_memory` and `on_failure_jump to 26` as a a backward jump that defines a loop whose body starts at 14 and whose exit point is at 26, but that 14..26 is not a loop, it's a `|` and those don't obey the assumption I made about the exit point (the 2 destinations of such an `on_failure_jump` correspond to the first and the second patterns of the `|`, i.e. entry&middle rather than entry&exit, so there *can* be jumps straight out from the first point without going through the second point =F0=9F=99=81). BTW, here's an indented version of the code: 0: /on_failure_jump_smart to 9 { 3: /exactn/1/x 6: } /jump to 0 9: { /start_memory/1 11: /on_failure_jump to 26 14: /on_failure_jump_smart to 23 { 17: /exactn/1/=3D 20: } /jump to 14 23: /jump to 29 26: /exactn/1/h 29: /stop_memory/1 31: /on_failure_jump_loop to 37 34: } /jump to 9 37: /succeed 38: end of pattern. Another problem we see here is that the main (second) loop spans 9..37 and is controlled by the `/on_failure_jump_loop to 37` but the two destination of the OFJ are not 9 and 37 but 34 and 37 =F0=9F=99=82. So maybe we should `skip_noops` before looking at the destinations to decide if it's a loop? Stefan --=-=-= Content-Type: text/x-diff Content-Disposition: inline; filename=regexp.c.patch diff --git a/src/regex-emacs.c b/src/regex-emacs.c index 55d0a6e8df8..c54f9f81890 100644 --- a/src/regex-emacs.c +++ b/src/regex-emacs.c @@ -3068,8 +3068,10 @@ analyze_first (re_char *p, re_char *pend, char *fastmap, bool multibyte) continue; case succeed_n: - /* If N == 0, it should be an on_failure_jump_loop instead. */ - DEBUG_STATEMENT (EXTRACT_NUMBER (j, p + 2); eassert (j > 0)); + /* If N == 0, it should be an on_failure_jump_loop instead. + `j` can be negative because `EXTRACT_NUMBER` extracts a + signed number whereas `succeed_n` treats it as unsigned. */ + DEBUG_STATEMENT (EXTRACT_NUMBER (j, p + 2); eassert (j != 0)); p += 4; /* We only care about one iteration of the loop, so we don't need to consider the case where this behaves like an @@ -3743,20 +3745,20 @@ mutually_exclusive_charset (struct re_pattern_buffer *bufp, re_char *p1, } /* True if "p1 matches something" implies "p2 fails". */ -/* Avoiding inf-loops: - We're trying to follow all paths reachable from `p2`, but since some +/* We're trying to follow all paths reachable from `p2`, but since some loops can match the empty string, this can loop back to `p2`. - To avoid inf-looping, we keep track of points that have been considered - "already". Instead of keeping a list of such points, `done_beg` and - `done_end` delimit a chunk of bytecode we already considered. - To guarantee termination, a lexical ordering between `done_*` and `p2` - should be obeyed: - At each recursion, either `done_beg` gets smaller, - or `done_beg` is unchanged and `done_end` gets larger - or they're both unchanged and `p2` gets larger. */ + To avoid inf-looping, we take advantage of the fact that + the bytecode we generate is made of syntactically nested loops, more + specifically, every loop has a single entry point and single exit point. + The function takes 2 more arguments (`loop_entry` and `loop_exit`). + `loop_entry` points to the sole entry point of the current loop and + `loop_exit` points to its sole exit point (when non-NULL). + The function can assume that `loop_exit` is "mutually exclusive". + Termination of recursive calls is ensured because either `loop_entry` + is larger, or it's equal but `p2` is larger. */ static bool mutually_exclusive_aux (struct re_pattern_buffer *bufp, re_char *p1, - re_char *p2, re_char *done_beg, re_char *done_end) + re_char *p2, re_char *loop_entry, re_char *loop_exit) { re_opcode_t op2; unsigned char *pend = bufp->buffer + bufp->used; @@ -3765,8 +3767,23 @@ mutually_exclusive_aux (struct re_pattern_buffer *bufp, re_char *p1, eassert (p1 >= bufp->buffer && p1 < pend && p2 >= bufp->buffer && p2 <= pend); - eassert (done_beg <= done_end); - eassert (done_end <= p2); + if (p2 == loop_exit) + /* Presumably already checked elsewhere. */ + return true; + eassert (loop_entry && p2 >= loop_entry); + /* eassert (!loop_exit || p2 < loop_exit); */ + if (p2 < loop_entry) + return false; + if (loop_exit && p2 > loop_exit) + { + void print_compiled_pattern (struct re_pattern_buffer *bufp); + print_compiled_pattern (bufp); + fprintf (stderr, "p1==%d, p2==%d, loop-entry==%d, loop-exit==%d\n", + p1 - bufp->buffer, p2 - bufp->buffer, + loop_entry - bufp->buffer, loop_exit - bufp->buffer); + error ("WOW1!"); + } + /* return false; */ /* Skip over open/close-group commands. If what follows this loop is a ...+ construct, @@ -3860,27 +3877,30 @@ mutually_exclusive_aux (struct re_pattern_buffer *bufp, re_char *p1, EXTRACT_NUMBER_AND_INCR (mcnt, p2); re_char *p2_other = p2 + mcnt; - /* When we jump backward we bump `done_end` up to `p3` under - the assumption that any other position between `done_end` - and `p3` is either: - - checked by the other call to RECURSE. - - not reachable from here (e.g. for positions before the - `on_failure_jump`), or at least not without first - jumping before `done_beg`. - This should hold because our state machines are not arbitrary: - they consists of syntaxically nested loops with limited - control flow. - FIXME: This can fail (i.e. return true when it shouldn't) - if we start generating bytecode with a different shape, - so maybe we should bite the bullet and replace done_beg/end - with an actual list of positions we've already processed. */ -#define RECURSE(p3) \ - ((p3) < done_beg ? mutually_exclusive_aux (bufp, p1, p3, p3, p3) \ - : (p3) <= done_end ? true \ - : mutually_exclusive_aux (bufp, p1, p3, done_beg, \ - (p3) > p2_orig ? done_end : (p3))) - - return RECURSE (p2) && RECURSE (p2_other); + /* If we jump backward to the entry point of the current loop + it means it's a zero-length cycle through that loop, so + this cycle itself does not break mutual-exclusion. + If we jump backward to a new loop nested within the current one + then the `p2_other` points to the exit of that inner loop + and the other RECURSE will check what happens when we leave + the loop so we can focus on checking just the loop body, + so we pass new values for `loop-entry` and `loop_exit`. + In all other cases, we just recurse "normally": if it's before + `loop_entry` it's a bug that will be caught by checks at + the entrace of the function and otherwise it's a forward jump + which doesn't need any special treatment. + FIXME: This is failsafe (can't return true when it shouldn't) + but it could be too conservative if we start generating bytecode + with a different shape, so maybe we should bite the bullet and + replace done_beg/end with an actual list of positions we've + already processed. */ +#define RECURSE(p3, p3_other) \ + ((p3) == loop_entry ? true \ + : loop_entry < (p3) && (p3) < p2_orig ? \ + mutually_exclusive_aux (bufp, p1, p3, p3, p3_other) \ + : mutually_exclusive_aux (bufp, p1, p3, loop_entry, loop_exit)) + + return RECURSE (p2, p2_other) && RECURSE (p2_other, p2); } default: @@ -3895,7 +3915,7 @@ #define RECURSE(p3) \ mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1, re_char *p2) { - return mutually_exclusive_aux (bufp, p1, p2, p2, p2); + return mutually_exclusive_aux (bufp, p1, p2, bufp->buffer, NULL); } /* Matching routines. */ --=-=-=--