From: Alan Mackenzie
Newsgroups: gmane.emacs.devel
Subject: Re: Questionable code in handling of wordend in the regexp engine in regex-emacs.c
Date: Mon, 25 Feb 2019 18:56:56 +0000
Message-ID: <20190225185656.GA3605@ACM>
References: <20190222164522.GB5411@ACM>
To: Stefan Monnier
Cc: emacs-devel@gnu.org

Hello, Stefan.

Sorry about the delay in replying.

On Sat, Feb 23, 2019 at 18:15:55 -0500, Stefan Monnier wrote:
> > Primarily, there is an
> >     UPDATE_SYNTAX_TABLE (charpos);
> > before determining the syntax of the previous character, which seems OK.
> > Later on, before determining the syntax of the next character, we have:
> >     UPDATE_SYNTAX_TABLE_FORWARD (charpos);
> > . Between these two calls, charpos hasn't been changed.

> Good spotting. Thanks!

> > Surely the argument to the second occurrence should be (charpos + 1)?

> I believe it's instead the other one that needs to use "charpos - 1"
> because the UPDATE_SYNTAX_TABLE is called just before reading the char
> *before* charpos (see patch below).

I don't think this is right. offset is calculated from d, and then
decremented, before calculating charpos.

> > Also, probably less importantly, there is
> >     GET_CHAR_AFTER (c2, d, dummy);
> > , whereas at the same place in the handler for case symend: we have
> > instead
> >     c2 = RE_STRING_CHAR (d, target_multibyte);
> > . Is the effect of these macros identical, or is one of them up to
> > date, and the other one really needs updating as well, for correct
> > functionality?

> According to my reading of the code, they're identical in multibyte
> buffers not in unibyte buffers where RE_STRING_CHAR just returns a value
> between 0 and 255 (i.e. ASCII or Latin-1 more or less), whereas
> GET_CHAR_AFTER will return either an ASCII char (0..127) or a raw-byte
> char (4194176..4194303).

OK.

> I think it's more correct to return a raw-byte char (4194176..4194303),
> so I'd tend to think that GET_CHAR_AFTER is the better choice, but
> please don't quote me on this.

I won't say a word!

> > I came across these whilst investigating bug #34525. Making the
> > indicated changes to regex-emacs.c sadly doesn't help solve the symptoms
> > of that bug. :-(

> Does the patch below help?

Unfortunately not, not for bug #34525. I did try it out, though. In the
mean time, I've advanced somewhat in the debugging.

> Stefan

> diff --git a/src/regex-emacs.c b/src/regex-emacs.c
> index b667a43a37..72fb5ec561 100644
> --- a/src/regex-emacs.c
> +++ b/src/regex-emacs.c
> @@ -4813,7 +4813,7 @@ re_match_2_internal (struct re_pattern_buffer *bufp, re_char *string1,
>  	      int dummy;
>  	      ptrdiff_t offset = PTR_TO_OFFSET (d) - 1;
>  	      ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset);
> -	      UPDATE_SYNTAX_TABLE (charpos);
> +	      UPDATE_SYNTAX_TABLE (charpos - 1);
>  	      GET_CHAR_BEFORE_2 (c1, d, string1, end1, string2, end2);
>  	      s1 = SYNTAX (c1);

-- 
Alan Mackenzie (Nuremberg, Germany).