From: Stefan Monnier
Newsgroups: gmane.emacs.devel
Subject: Re: Questionable code in handling of wordend in the regexp engine in regex-emacs.c
Date: Sat, 23 Feb 2019 18:15:55 -0500
References: <20190222164522.GB5411@ACM>
To: emacs-devel@gnu.org

> Primarily, there is an
>
>     UPDATE_SYNTAX_TABLE (charpos);
>
> before determining the syntax of the previous character, which seems OK.
> Later on, before determining the syntax of the next character, we have:
>
>     UPDATE_SYNTAX_TABLE_FORWARD (charpos);
>
> .  Between these two calls, charpos hasn't been changed.

Good spotting.

> Surely the argument to the second occurrence should be (charpos + 1)?

I believe it's instead the other one that needs to use "charpos - 1",
because UPDATE_SYNTAX_TABLE is called just before reading the char
*before* charpos (see patch below).

> Also, probably less importantly, there is
>
>     GET_CHAR_AFTER (c2, d, dummy);
>
> , whereas at the same place in the handler for case symend: we have
> instead
>
>     c2 = RE_STRING_CHAR (d, target_multibyte);
>
> .  Is the effect of these macros identical, or is one of them up to
> date, and the other one really needs updating as well, for correct
> functionality?

According to my reading of the code, they're identical in multibyte
buffers but not in unibyte buffers, where RE_STRING_CHAR just returns
a value between 0 and 255 (i.e. ASCII or Latin-1, more or less),
whereas GET_CHAR_AFTER will return either an ASCII char (0..127) or
a raw-byte char (4194176..4194303).

I think it's more correct to return a raw-byte char (4194176..4194303),
so I'd tend to think that GET_CHAR_AFTER is the better choice, but
please don't quote me on this.
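
To make the unibyte difference concrete, here is a minimal sketch. It does not reproduce the actual macros: the function names and the RAW_BYTE_OFFSET constant are made up for illustration, and the offset is simply the one that maps bytes 128..255 onto the raw-byte range 4194176..4194303 quoted above.

```c
#include <assert.h>

/* Hypothetical constant for this sketch: adding it to a byte in
   128..255 lands in the raw-byte range 4194176..4194303.  */
enum { RAW_BYTE_OFFSET = 0x3FFF00 };

/* Models the RE_STRING_CHAR behavior described above for a unibyte
   buffer: the byte value (0..255) is returned unchanged.  */
static int
unibyte_char_plain (int byte)
{
  assert (0 <= byte && byte <= 0xFF);
  return byte;
}

/* Models the GET_CHAR_AFTER behavior described above for a unibyte
   buffer: ASCII bytes are returned unchanged, and bytes 128..255
   become raw-byte chars.  */
static int
unibyte_char_raw (int byte)
{
  assert (0 <= byte && byte <= 0xFF);
  return byte < 0x80 ? byte : byte + RAW_BYTE_OFFSET;
}
```

Under these assumptions the two functions agree on every ASCII byte and disagree on every byte from 128 up, so a syntax lookup keyed on the result could see a Latin-1 code in one case and a raw-byte char in the other.
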

> I came across these whilst investigating bug #34525.  Making the
> indicated changes to regex-emacs.c sadly doesn't help solve the symptoms
> of that bug.  :-(

Does the patch below help?


        Stefan


diff --git a/src/regex-emacs.c b/src/regex-emacs.c
index b667a43a37..72fb5ec561 100644
--- a/src/regex-emacs.c
+++ b/src/regex-emacs.c
@@ -4813,7 +4813,7 @@ re_match_2_internal (struct re_pattern_buffer *bufp, re_char *string1,
 	      int dummy;
 	      ptrdiff_t offset = PTR_TO_OFFSET (d) - 1;
 	      ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset);
-	      UPDATE_SYNTAX_TABLE (charpos);
+	      UPDATE_SYNTAX_TABLE (charpos - 1);
 	      GET_CHAR_BEFORE_2 (c1, d, string1, end1, string2, end2);
 	      s1 = SYNTAX (c1);