From mboxrd@z Thu Jan 1 00:00:00 1970
From: Yuan Fu
Newsgroups: gmane.emacs.devel
Subject: Re: Matching regex case-sensitively in C strings?
Date: Tue, 8 Nov 2022 11:31:35 -0800
Message-ID: <5711A9D3-7BCB-44AE-8911-5E039FF5FBB8@gmail.com>
References: <218795BA-107D-4A86-9ACF-0A44BD2EC3D2@gmail.com> <83edufyoad.fsf@gnu.org> <580E87E6-DCFD-42AE-807A-339BBB3878C2@acm.org>
In-Reply-To: <580E87E6-DCFD-42AE-807A-339BBB3878C2@acm.org>
To: Mattias Engdegård
Cc: Eli Zaretskii, emacs-devel
Mime-Version: 1.0 (Mac OS X Mail 16.0 \(3696.120.41.1.1\))
Content-Type: multipart/mixed; boundary="Apple-Mail=_51E421DE-B75D-42E3-96E9-05422A99FCE6"

--Apple-Mail=_51E421DE-B75D-42E3-96E9-05422A99FCE6
Content-Transfer-Encoding: 8bit
Content-Type: text/plain; charset=utf-8

> On Nov 8, 2022, at 2:18 AM, Mattias Engdegård wrote:
>
> On 7 Nov 2022, at 21:35, Yuan Fu wrote:
>
>> fast_c_string_match_ignore_case (Lisp_Object regexp,
>>                                  const char *string, ptrdiff_t len)
>> {
>>   regexp = string_make_unibyte (regexp);
>
> This is expensive and not obviously correct when it makes a difference. Ie, no longer "fast", and may hide bugs.
> Something should be done about that.
>
>>   // Why do we need to unwind stack?
>>   specpdl_ref count = SPECPDL_INDEX ();
>
> Because freeze_pattern pushes an unwind-protect on the specpdl.
>
>>   struct regexp_cache *cache_entry
>>     = compile_pattern (regexp, 0, Vascii_canon_table, 0, 0);
>
> `Vascii_canon_table` is what makes it case-insensitive; you want to use Qnil (but you probably already know that now).
> Since this is the only thing that differs from your intended use, I suggest you generalise this subroutine with a boolean parameter.
>
>>   // What does freezing a pattern do?
>>   freeze_pattern (cache_entry);
>
> It locks the compiled pattern record to make the regexp engine reentrant (but here it also seems to be used for GC purposes; not sure about that).
>
>>   // What is re_match_object for? I see that it can be t, nil or a string.
>>   re_match_object = Qt;
>
> Described in regex-emacs.h:
>
>> /* The string or buffer being matched.
>>    It is used for looking up syntax properties.
>>
>>    If the value is a Lisp string object, match text in that string; if
>>    it's nil, match text in the current buffer; if it's t, match text
>>    in a C string.
>>
>>    This value is effectively another parameter to re_search_2 and
>>    re_match_2.  No calls into Lisp or thread switches are allowed
>>    before setting re_match_object and calling into the regex search
>>    and match functions.  These functions capture the current value of
>>    re_match_object into gl_state on entry.
>>

Thanks!
How about:

Yuan

--Apple-Mail=_51E421DE-B75D-42E3-96E9-05422A99FCE6
Content-Disposition: attachment; filename=c_string_match.diff
Content-Type: application/octet-stream; x-unix-mode=0644; name="c_string_match.diff"
Content-Transfer-Encoding: 7bit

diff --git a/src/lisp.h b/src/lisp.h
index 1e41e2064c9..d0b3f8f05a5 100644
--- a/src/lisp.h
+++ b/src/lisp.h
@@ -4772,6 +4772,8 @@ fast_string_match_ignore_case (Lisp_Object regexp, Lisp_Object string)
 
 extern ptrdiff_t fast_c_string_match_ignore_case (Lisp_Object, const char *,
                                                   ptrdiff_t);
+extern ptrdiff_t fast_c_string_match (Lisp_Object, const char *, ptrdiff_t,
+                                      bool);
 extern ptrdiff_t fast_looking_at (Lisp_Object, ptrdiff_t, ptrdiff_t,
                                   ptrdiff_t, ptrdiff_t, Lisp_Object);
 extern ptrdiff_t find_newline1 (ptrdiff_t, ptrdiff_t, ptrdiff_t, ptrdiff_t,
diff --git a/src/search.c b/src/search.c
index b5d6a442c0f..90856cf5c12 100644
--- a/src/search.c
+++ b/src/search.c
@@ -505,10 +505,29 @@ fast_string_match_internal (Lisp_Object regexp, Lisp_Object string,
 fast_c_string_match_ignore_case (Lisp_Object regexp,
                                  const char *string, ptrdiff_t len)
 {
+  return fast_c_string_match (regexp, string, len, true);
+}
+
+/* Match REGEXP against STRING, searching all of STRING and return the
+   index of the match, or negative on failure.  This does not clobber
+   the match data.  Ignore case when searching if IGNORE_CASE is true.
+
+   We assume that STRING contains single-byte characters.  */
+
+ptrdiff_t
+fast_c_string_match (Lisp_Object regexp,
+                     const char *string, ptrdiff_t len, bool ignore_case)
+{
+  /* FIXME: This is expensive and not obviously correct when it makes
+     a difference.  I.e., no longer "fast", and may hide bugs.
+     Something should be done about this.  */
   regexp = string_make_unibyte (regexp);
+  /* Record specpdl index because freeze_pattern pushes an
+     unwind-protect on the specpdl.  */
   specpdl_ref count = SPECPDL_INDEX ();
+  Lisp_Object translate_table = ignore_case ? Vascii_canon_table : Qnil;
   struct regexp_cache *cache_entry
-    = compile_pattern (regexp, 0, Vascii_canon_table, 0, 0);
+    = compile_pattern (regexp, 0, translate_table, 0, 0);
   freeze_pattern (cache_entry);
   re_match_object = Qt;
   ptrdiff_t val = re_search (&cache_entry->buf, string, len, 0, len, 0);

--Apple-Mail=_51E421DE-B75D-42E3-96E9-05422A99FCE6--
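
For readers without an Emacs build tree, the shape of the refactoring in the patch (one general matcher taking a boolean, with the ignore-case entry point reduced to a thin wrapper) can be sketched standalone. This is only an illustration under stated assumptions: it uses POSIX <regex.h> rather than the Emacs regex engine, the names c_string_match and c_string_match_ignore_case are hypothetical stand-ins, and REG_ICASE plays the role that passing Vascii_canon_table instead of Qnil plays in the patch. Unlike compile_pattern, it recompiles the pattern on every call and has no cache.

```c
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for fast_c_string_match: search STRING for
   PATTERN (a POSIX extended regexp) and return the byte offset of the
   first match, or -1 if nothing matches or PATTERN does not compile.
   Ignore case when IGNORE_CASE is true.  */
ptrdiff_t
c_string_match (const char *pattern, const char *string, bool ignore_case)
{
  /* The flag choice mirrors selecting the translate table in the patch.  */
  int cflags = REG_EXTENDED | (ignore_case ? REG_ICASE : 0);
  regex_t re;
  if (regcomp (&re, pattern, cflags) != 0)
    return -1;

  regmatch_t m;
  ptrdiff_t val = -1;
  if (regexec (&re, string, 1, &m, 0) == 0)
    val = (ptrdiff_t) m.rm_so;

  regfree (&re);
  return val;
}

/* The old entry point becomes a one-line wrapper, as in the patch.  */
ptrdiff_t
c_string_match_ignore_case (const char *pattern, const char *string)
{
  return c_string_match (pattern, string, true);
}
```

Existing callers of the ignore-case function keep working unchanged, while new callers that need case-sensitive matching pass false to the general function.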