From mboxrd@z Thu Jan  1 00:00:00 1970
From: Mattias Engdegård
Newsgroups: gmane.emacs.bugs
Subject: bug#64128: regexp parser zero-width assertion bugs
Date: Sat, 17 Jun 2023 14:20:27 +0200
Cc: Paul Eggert, Stefan Monnier
To: 64128@debbugs.gnu.org
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="Apple-Mail=_13FA75CF-D51B-4626-A0E8-BF4B62348CEC"
X-GNU-PR-Message: report 64128
X-GNU-PR-Package: emacs
Xref: news.gmane.io gmane.emacs.bugs:263529

--Apple-Mail=_13FA75CF-D51B-4626-A0E8-BF4B62348CEC
Content-Type: text/plain; charset=us-ascii

In Emacs regexps, some but not all zero-width assertions have the
special property that they are not treated as an element for an
immediately following ?, * or + to act upon. For example,

  \b*

matches a literal asterisk at a word boundary -- the `*` becomes literal
because it is treated as if there were nothing for it to act upon. Even
stranger:

  xy\b*

is parsed as, in rx syntax, (* "xy" word-boundary), which is remarkable:
the repetition operator encompasses several elements even though no
brackets are given.

Demo:

  (and (string-match "quack,\\b*" "quack,quack,quack,quaaaack!")
       (match-data))
  => (0 18)

Zero-width assertions that have the property:

  ^ (bol), $ (eol), \` (bos), \' (eos), \b (word-boundary),
  \B (not-word-boundary)

Zero-width assertions that do not have the property (and are treated as
any other element):

  \< (bow), \> (eow), \_< (symbol-start), \_> (symbol-end), \= (point)

These regexp patterns should be very rare in practice and are presumably
always mistakes, but it would be nice if they behaved in a way that
makes some kind of sense.

A modest improvement would be to make operators become literal after any
zero-width assertion, so that

  \<*

becomes (: word-start "*") instead of (* word-start), and

  xy\b*

becomes (: "xy" word-boundary "*") instead of (* "xy" word-boundary).

Suggested patch attached.

--Apple-Mail=_13FA75CF-D51B-4626-A0E8-BF4B62348CEC
Content-Disposition: attachment; filename=regexp-zero-width-assertion-bug.diff
Content-Type: application/octet-stream; x-unix-mode=0644; name="regexp-zero-width-assertion-bug.diff"
Content-Transfer-Encoding: 7bit

diff --git a/src/regex-emacs.c b/src/regex-emacs.c
index e3237cd425a..120a727cf74 100644
--- a/src/regex-emacs.c
+++ b/src/regex-emacs.c
@@ -1716,7 +1716,9 @@ regex_compile (re_char *pattern, ptrdiff_t size,
 
   /* Address of start of the most recently finished expression.
      This tells, e.g., postfix * where to find the start of its
-     operand.  Reset at the beginning of groups and alternatives.  */
+     operand.  Reset at the beginning of groups and alternatives,
+     and after any zero-width assertion (which should not be the target
+     of any postfix repetition operators).  */
   unsigned char *laststart = 0;
 
   /* Address of beginning of regexp, or inside of last group.  */
@@ -1847,12 +1849,14 @@ regex_compile (re_char *pattern, ptrdiff_t size,
         case '^':
           if (! (p == pattern + 1 || at_begline_loc_p (pattern, p)))
             goto normal_char;
+          laststart = 0;
           BUF_PUSH (begline);
           break;
 
         case '$':
           if (! (p == pend || at_endline_loc_p (p, pend)))
             goto normal_char;
+          laststart = 0;
           BUF_PUSH (endline);
           break;
 
@@ -1892,7 +1896,7 @@ regex_compile (re_char *pattern, ptrdiff_t size,
 
           /* Star, etc. applied to an empty pattern is equivalent
              to an empty pattern.  */
-          if (!laststart || laststart == b)
+          if (laststart == b)
             break;
 
           /* Now we know whether or not zero matches is allowed
@@ -2482,7 +2486,7 @@ regex_compile (re_char *pattern, ptrdiff_t size,
               goto normal_char;
 
             case '=':
-              laststart = b;
+              laststart = 0;
               BUF_PUSH (at_dot);
               break;
 
@@ -2523,17 +2527,17 @@ regex_compile (re_char *pattern, ptrdiff_t size,
 
 
             case '<':
-              laststart = b;
+              laststart = 0;
               BUF_PUSH (wordbeg);
               break;
 
             case '>':
-              laststart = b;
+              laststart = 0;
               BUF_PUSH (wordend);
               break;
 
             case '_':
-              laststart = b;
+              laststart = 0;
               PATFETCH (c);
               if (c == '<')
                 BUF_PUSH (symbeg);
@@ -2544,18 +2548,22 @@ regex_compile (re_char *pattern, ptrdiff_t size,
               break;
 
             case 'b':
+              laststart = 0;
               BUF_PUSH (wordbound);
               break;
 
             case 'B':
+              laststart = 0;
               BUF_PUSH (notwordbound);
               break;
 
             case '`':
+              laststart = 0;
               BUF_PUSH (begbuf);
               break;
 
             case '\'':
+              laststart = 0;
               BUF_PUSH (endbuf);
               break;
 
@@ -2597,7 +2605,7 @@ regex_compile (re_char *pattern, ptrdiff_t size,
 
               /* If followed by a repetition operator.  */
               || (p != pend
-                  && (*p == '*' || *p == '+' || *p == '?' || *p == '^'))
+                  && (*p == '*' || *p == '+' || *p == '?'))
               || (p + 1 < pend && p[0] == '\\' && p[1] == '{'))
             {
               /* Start building a new exactn.  */

--Apple-Mail=_13FA75CF-D51B-4626-A0E8-BF4B62348CEC--