From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.io!.POSTED.blaine.gmane.org!not-for-mail From: Stefan Monnier via "Bug reports for GNU Emacs, the Swiss army knife of text editors" Newsgroups: gmane.emacs.bugs Subject: bug#65726: 29.1.50; Crash in regexp engine Date: Mon, 18 Sep 2023 08:32:40 -0400 Message-ID: References: <8e1b4e50-0430-3eb3-e486-60def1e4821f@gmx.at> <83fs3u5e7u.fsf@gnu.org> <835y4q5apw.fsf@gnu.org> <776370AB-662F-4C0A-95BF-97DEA4F18F54@gmail.com> <3A0AC9E2-A420-47B6-870A-69C53FCAEF71@gmail.com> <65105BCA-1884-4673-A8F8-03E3C7ABAAEA@gmail.com> Reply-To: Stefan Monnier Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="=-=-=" Injection-Info: ciao.gmane.io; posting-host="blaine.gmane.org:116.202.254.214"; logging-data="18399"; mail-complaints-to="usenet@ciao.gmane.io" User-Agent: Gnus/5.13 (Gnus v5.13) Cc: martin rudalics , Eli Zaretskii , 65726@debbugs.gnu.org To: Mattias =?UTF-8?Q?Engdeg=C3=A5rd?= Original-X-From: bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane-mx.org@gnu.org Mon Sep 18 14:34:14 2023 Return-path: Envelope-to: geb-bug-gnu-emacs@m.gmane-mx.org Original-Received: from lists.gnu.org ([209.51.188.17]) by ciao.gmane.io with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.92) (envelope-from ) id 1qiDS9-0004U6-CU for geb-bug-gnu-emacs@m.gmane-mx.org; Mon, 18 Sep 2023 14:34:13 +0200 Original-Received: from localhost ([::1] helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1qiDRx-0001dM-8q; Mon, 18 Sep 2023 08:34:01 -0400 Original-Received: from eggs.gnu.org ([2001:470:142:3::10]) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1qiDRr-0001X4-Au for bug-gnu-emacs@gnu.org; Mon, 18 Sep 2023 08:33:55 -0400 Original-Received: from debbugs.gnu.org ([2001:470:142:5::43]) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128) (Exim 4.90_1) (envelope-from ) id 1qiDRp-0007Yu-Nr for bug-gnu-emacs@gnu.org; Mon, 18 Sep 2023 08:33:54 -0400 Original-Received: from Debian-debbugs by debbugs.gnu.org with local (Exim 4.84_2) (envelope-from ) id 1qiDRx-0002RQ-M0 for bug-gnu-emacs@gnu.org; Mon, 18 Sep 2023 08:34:01 -0400 X-Loop: help-debbugs@gnu.org Resent-From: Stefan Monnier Original-Sender: "Debbugs-submit" Resent-CC: bug-gnu-emacs@gnu.org Resent-Date: Mon, 18 Sep 2023 12:34:01 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 65726 X-GNU-PR-Package: emacs Original-Received: via spool by 65726-submit@debbugs.gnu.org id=B65726.16950403849299 (code B ref 65726); Mon, 18 Sep 2023 12:34:01 +0000 Original-Received: (at 65726) by debbugs.gnu.org; 18 Sep 2023 12:33:04 +0000 Original-Received: from localhost ([127.0.0.1]:52348 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1qiDR1-0002Pt-6n for submit@debbugs.gnu.org; Mon, 18 Sep 2023 08:33:04 -0400 Original-Received: from mailscanner.iro.umontreal.ca ([132.204.25.50]:43675) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1qiDQv-0002PE-VZ for 65726@debbugs.gnu.org; Mon, 18 Sep 2023 08:33:01 -0400 Original-Received: from pmg2.iro.umontreal.ca (localhost.localdomain [127.0.0.1]) by pmg2.iro.umontreal.ca (Proxmox) with ESMTP id E9C2A805A7; Mon, 18 Sep 2023 08:32:43 -0400 (EDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=iro.umontreal.ca; s=mail; t=1695040362; bh=nqMMCv/WQR7aMX7Mpg/kIuNXuVrihhJT6FuWj5unFtw=; h=From:To:Cc:Subject:In-Reply-To:References:Date:From; b=JZu5BQXLJh02HYw/PDSUV9ljQVtn2ykX/VinRRd2HNvFpyHMnxU/Dy9g75a/sRWsP 0cnY+pT0lXeizWgqsRXf9SlNYgijH0klWYzrFIgl1KQbgRvOsm8FuMMHmM8HBozfRI R1RYVW3eLsruxlOe1NZ9EwjrSnmCnJwT6qUsbgsI4x4J9C79PhWPX4cYIxIgkjGVXA axAeFWcUXnFjmKdAGa09f0xsc/rFmpEHhj5r4/+lib9GCQ4dnR/VpliAQBunhHKfNm 4YaoxehJKFOUyukM2QTMGsCVWfDldgHW65z+YFSIWIRgC7xMPv/roLjhG2whGwikGH UKH2h0bcTx2Bw== Original-Received: from mail01.iro.umontreal.ca (unknown [172.31.2.1]) by pmg2.iro.umontreal.ca (Proxmox) with ESMTP id 5D59080250; Mon, 18 Sep 2023 08:32:42 -0400 (EDT) Original-Received: from pastel (unknown [45.72.220.249]) by mail01.iro.umontreal.ca (Postfix) with ESMTPSA id 2993112041A; Mon, 18 Sep 2023 08:32:42 -0400 (EDT) In-Reply-To: (Stefan Monnier's message of "Sun, 17 Sep 2023 23:59:32 -0400") X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list X-BeenThere: bug-gnu-emacs@gnu.org List-Id: "Bug reports for GNU Emacs, the Swiss army knife of text editors" List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane-mx.org@gnu.org Original-Sender: bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane-mx.org@gnu.org Xref: news.gmane.io gmane.emacs.bugs:270790 Archived-At: --=-=-= Content-Type: text/plain > BTW maybe we should replace > > 31: /on_failure_jump_loop to 37 > 34: /jump to 9 > > with > > 31: /on_failure_dont_jump_loop to 9 In the mean time the patch below recognizes the above pattern, which gives me code which is safe (it should both always terminate and should never apply the optimization if it's not valid) yet does find the optimization in all the cases where I expected it. Stefan --=-=-= Content-Type: text/x-diff Content-Disposition: inline; filename=regexp.c.patch diff --git a/src/regex-emacs.c b/src/regex-emacs.c index 55d0a6e8df8..e45d0f19dbf 100644 --- a/src/regex-emacs.c +++ b/src/regex-emacs.c @@ -1923,12 +1923,22 @@ regex_compile (re_char *pattern, ptrdiff_t size, if (!zero_times_ok && simple) { /* Since simple * loops can be made faster by using - on_failure_keep_string_jump, we turn simple P+ - into PP* if P is simple. */ + on_failure_keep_string_jump, we turn P+ into PP* + if P is simple. + We can't use `top: ; OFJS exit; J top; exit:` + Because the OFJS needs to be at the beginning + since we want to be able to replace + `top: OFJS exit; ; J top; exit` + with + `top: OFKSJ exit; loop: ; J loop; exit`, + i.e. a single OFJ at the beginning of the loop + rather than once per iteration. */ unsigned char *p1, *p2; startoffset = b - laststart; GET_BUFFER_SPACE (startoffset); p1 = b; p2 = laststart; + /* This presumes that the code skipped + by `skip_one_char` is position-independent. */ while (p2 < p1) *b++ = *p2++; zero_times_ok = 1; @@ -3068,8 +3078,10 @@ analyze_first (re_char *p, re_char *pend, char *fastmap, bool multibyte) continue; case succeed_n: - /* If N == 0, it should be an on_failure_jump_loop instead. */ - DEBUG_STATEMENT (EXTRACT_NUMBER (j, p + 2); eassert (j > 0)); + /* If N == 0, it should be an on_failure_jump_loop instead. + `j` can be negative because `EXTRACT_NUMBER` extracts a + signed number whereas `succeed_n` treats it as unsigned. */ + DEBUG_STATEMENT (EXTRACT_NUMBER (j, p + 2); eassert (j != 0)); p += 4; /* We only care about one iteration of the loop, so we don't need to consider the case where this behaves like an @@ -3743,20 +3755,20 @@ mutually_exclusive_charset (struct re_pattern_buffer *bufp, re_char *p1, } /* True if "p1 matches something" implies "p2 fails". */ -/* Avoiding inf-loops: - We're trying to follow all paths reachable from `p2`, but since some +/* We're trying to follow all paths reachable from `p2`, but since some loops can match the empty string, this can loop back to `p2`. - To avoid inf-looping, we keep track of points that have been considered - "already". Instead of keeping a list of such points, `done_beg` and - `done_end` delimit a chunk of bytecode we already considered. - To guarantee termination, a lexical ordering between `done_*` and `p2` - should be obeyed: - At each recursion, either `done_beg` gets smaller, - or `done_beg` is unchanged and `done_end` gets larger - or they're both unchanged and `p2` gets larger. */ + To avoid inf-looping, we take advantage of the fact that + the bytecode we generate is made of syntactically nested loops, more + specifically, every loop has a single entry point and single exit point. + The function takes 2 more arguments (`loop_entry` and `loop_exit`). + `loop_entry` points to the sole entry point of the current loop and + `loop_exit` points to its sole exit point (when non-NULL). + The function can assume that `loop_exit` is "mutually exclusive". + Termination of recursive calls is ensured because either `loop_entry` + is larger, or it's equal but `p2` is larger. */ static bool mutually_exclusive_aux (struct re_pattern_buffer *bufp, re_char *p1, - re_char *p2, re_char *done_beg, re_char *done_end) + re_char *p2, re_char *loop_entry, re_char *loop_exit) { re_opcode_t op2; unsigned char *pend = bufp->buffer + bufp->used; @@ -3765,8 +3777,23 @@ mutually_exclusive_aux (struct re_pattern_buffer *bufp, re_char *p1, eassert (p1 >= bufp->buffer && p1 < pend && p2 >= bufp->buffer && p2 <= pend); - eassert (done_beg <= done_end); - eassert (done_end <= p2); + if (p2 == loop_exit) + /* Presumably already checked elsewhere. */ + return true; + eassert (loop_entry && p2 >= loop_entry); + /* eassert (!loop_exit || p2 < loop_exit); */ + if (p2 < loop_entry) + return false; + if (loop_exit && p2 > loop_exit) + { + void print_compiled_pattern (struct re_pattern_buffer *bufp); + print_compiled_pattern (bufp); + fprintf (stderr, "p1==%d, p2==%d, loop-entry==%d, loop-exit==%d\n", + p1 - bufp->buffer, p2 - bufp->buffer, + loop_entry - bufp->buffer, loop_exit - bufp->buffer); + error ("WOW1!"); + } + /* return false; */ /* Skip over open/close-group commands. If what follows this loop is a ...+ construct, @@ -3859,28 +3886,40 @@ mutually_exclusive_aux (struct re_pattern_buffer *bufp, re_char *p1, p2++; EXTRACT_NUMBER_AND_INCR (mcnt, p2); re_char *p2_other = p2 + mcnt; + /* For `+` loops, we often have an `on_failure_jump` that skips forward + over a subsequent `jump` for lack of an `on_failure_dont_jump` + kind of thing. Recognize this pattern since that subsequent + `jump` is the one that jumps to the loop-entry. */ + if ((re_opcode_t) p2[0] == jump && mcnt == 3) + { + EXTRACT_NUMBER (mcnt, p2 + 1); + p2 += mcnt + 3; + } - /* When we jump backward we bump `done_end` up to `p3` under - the assumption that any other position between `done_end` - and `p3` is either: - - checked by the other call to RECURSE. - - not reachable from here (e.g. for positions before the - `on_failure_jump`), or at least not without first - jumping before `done_beg`. - This should hold because our state machines are not arbitrary: - they consists of syntaxically nested loops with limited - control flow. - FIXME: This can fail (i.e. return true when it shouldn't) - if we start generating bytecode with a different shape, - so maybe we should bite the bullet and replace done_beg/end - with an actual list of positions we've already processed. */ -#define RECURSE(p3) \ - ((p3) < done_beg ? mutually_exclusive_aux (bufp, p1, p3, p3, p3) \ - : (p3) <= done_end ? true \ - : mutually_exclusive_aux (bufp, p1, p3, done_beg, \ - (p3) > p2_orig ? done_end : (p3))) - - return RECURSE (p2) && RECURSE (p2_other); + /* If we jump backward to the entry point of the current loop + it means it's a zero-length cycle through that loop, so + this cycle itself does not break mutual-exclusion. + If we jump backward to a new loop nested within the current one + then the `p2_other` points to the exit of that inner loop + and the other RECURSE will check what happens when we leave + the loop so we can focus on checking just the loop body, + so we pass new values for `loop-entry` and `loop_exit`. + In all other cases, we just recurse "normally": if it's before + `loop_entry` it's a bug that will be caught by checks at + the entrace of the function and otherwise it's a forward jump + which doesn't need any special treatment. + FIXME: This is failsafe (can't return true when it shouldn't) + but it could be too conservative if we start generating bytecode + with a different shape, so maybe we should bite the bullet and + replace done_beg/end with an actual list of positions we've + already processed. */ +#define RECURSE(p3, p3_other) \ + ((p3) == loop_entry ? true \ + : loop_entry < (p3) && (p3) < p2_orig && (p3_other) > p2_orig ? \ + mutually_exclusive_aux (bufp, p1, p3, p3, p3_other) \ + : mutually_exclusive_aux (bufp, p1, p3, loop_entry, loop_exit)) + + return RECURSE (p2, p2_other) && RECURSE (p2_other, p2); } default: @@ -3895,7 +3934,7 @@ #define RECURSE(p3) \ mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1, re_char *p2) { - return mutually_exclusive_aux (bufp, p1, p2, p2, p2); + return mutually_exclusive_aux (bufp, p1, p2, bufp->buffer, NULL); } /* Matching routines. */ --=-=-=--