From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!.POSTED.blaine.gmane.org!not-for-mail From: Alan Mackenzie Newsgroups: gmane.emacs.devel Subject: Re: Questionable code in handling of wordend in the regexp engine in regex-emacs.c Date: Fri, 1 Mar 2019 11:10:18 +0000 Message-ID: <20190301111018.GA5674@ACM> References: <20190222164522.GB5411@ACM> <20190225185656.GA3605@ACM> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Injection-Info: blaine.gmane.org; posting-host="blaine.gmane.org:195.159.176.226"; logging-data="24486"; mail-complaints-to="usenet@blaine.gmane.org" User-Agent: Mutt/1.10.1 (2018-07-13) Cc: emacs-devel@gnu.org To: Stefan Monnier Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Fri Mar 01 12:22:52 2019 Return-path: Envelope-to: ged-emacs-devel@m.gmane.org Original-Received: from lists.gnu.org ([209.51.188.17]) by blaine.gmane.org with esmtps (TLS1.0:RSA_AES_256_CBC_SHA1:256) (Exim 4.89) (envelope-from ) id 1gzgFe-0006Fy-Lc for ged-emacs-devel@m.gmane.org; Fri, 01 Mar 2019 12:22:50 +0100 Original-Received: from localhost ([127.0.0.1]:35723 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1gzgFd-0001o2-It for ged-emacs-devel@m.gmane.org; Fri, 01 Mar 2019 06:22:49 -0500 Original-Received: from eggs.gnu.org ([209.51.188.92]:33020) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1gzgEQ-0001Z6-L5 for emacs-devel@gnu.org; Fri, 01 Mar 2019 06:21:35 -0500 Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1gzg8V-0001Ur-Rd for emacs-devel@gnu.org; Fri, 01 Mar 2019 06:15:28 -0500 Original-Received: from colin.muc.de ([193.149.48.1]:52851 helo=mail.muc.de) by eggs.gnu.org with smtp (Exim 4.71) (envelope-from ) id 1gzg8V-0001TU-Jk for emacs-devel@gnu.org; Fri, 01 Mar 2019 06:15:27 -0500 Original-Received: (qmail 1694 invoked by uid 3782); 1 Mar 2019 11:15:20 -0000 Original-Received: from acm.muc.de (p4FE15D75.dip0.t-ipconnect.de [79.225.93.117]) by colin.muc.de (tmda-ofmipd) with ESMTP; Fri, 01 Mar 2019 12:15:19 +0100 Original-Received: (qmail 5724 invoked by uid 1000); 1 Mar 2019 11:10:18 -0000 Content-Disposition: inline In-Reply-To: X-Delivery-Agent: TMDA/1.1.12 (Macallan) X-Primary-Address: acm@muc.de X-detected-operating-system: by eggs.gnu.org: FreeBSD 9.x [fuzzy] X-Received-From: 193.149.48.1 X-BeenThere: emacs-devel@gnu.org X-Mailman-Version: 2.1.21 Precedence: list List-Id: "Emacs development discussions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Original-Sender: "Emacs-devel" Xref: news.gmane.org gmane.emacs.devel:233721 Archived-At: Hello, Stefan. On Mon, Feb 25, 2019 at 14:18:10 -0500, Stefan Monnier wrote: > >> > Surely the argument to the second occurrence should be (charpos + 1)? > >> I believe it's instead the other one that needs to use "charpos - 1" > >> because the UPDATE_SYNTAX_TABLE is called just before reading the char > >> *before* charpos (see patch below). > > I don't think this is right. offset is calculated from d, and then > > decremented, before calculating charpos. > Hmm... I think you're right (and the symend code does like you suggest). > This said, I find it odd that the code does: > ptrdiff_t offset = PTR_TO_OFFSET (d) - 1; > ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset); > UPDATE_SYNTAX_TABLE (charpos); > Supposedly `d` is a char* pointing to the beginning of a potentially > multibyte char, In that case `d - 1` will point "somewhere before the > end of the previous multibyte char" but not necessarily at > its beginning. Maybe the patch below would be preferable to avoid > this situation? SYNTAX_TABLE_BYTE_TO_CHAR ends up calling buf_bytepos_to_charpos (in marker.c). This latter function doesn't handle well the case of `d' being in the middle of a multibyte character; sometimes it "rounds it down", other times it "rounds it up" to a character position. I think it should be defined as rounding it down. It would be a relatively simple correction (at least, technically ;-). I think your patch "always do the arithmetic on a charpos, not a bytepos" is a very good idea indeed. But I'm still a little worried about buf_bytepos_to_charpos. Perhaps it should state that the result is undefined when the bytepos is "invalid". How many other places in Emacs call it with "invalid" byte positions? For that matter, how many charpos <-> bytepos functions are there in Emacs? Just this one? > Worse, in notwordbound we do: > ptrdiff_t offset = PTR_TO_OFFSET (d - 1); > ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset); > UPDATE_SYNTAX_TABLE (charpos); > which seems even more broken because `d` might point to the first byte > after the gap, so `d - 1` will point in the middle of the gap, so it's > simply an invalid argument to PTR_TO_OFFSET. I don't think this is right. Both `d' and `offset' are byte measurements, not character measurements, so it shouldn't matter whether the "- 1" is inside or outside the parens. However, it would be less confusing if they were both (?all) the same. > According to the definition of PTR_TO_OFFSET and POINTER_TO_OFFSET, > the result may be the same as if we did the decrement after the fact, > but it still looks fishy. WDYT? I think it is suboptimal to have both PTR_TO_OFFSET and POINTER_TO_OFFSET meaning different things in the same source file. ;-) > Stefan > diff --git a/src/regex-emacs.c b/src/regex-emacs.c > index b667a43a37..b21cba0e46 100644 > --- a/src/regex-emacs.c > +++ b/src/regex-emacs.c > @@ -4811,9 +4811,9 @@ re_match_2_internal (struct re_pattern_buffer *bufp, re_char *string1, > int c1, c2; > int s1, s2; > int dummy; > - ptrdiff_t offset = PTR_TO_OFFSET (d) - 1; > + ptrdiff_t offset = PTR_TO_OFFSET (d); > ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset); > - UPDATE_SYNTAX_TABLE (charpos); > + UPDATE_SYNTAX_TABLE (charpos - 1); > GET_CHAR_BEFORE_2 (c1, d, string1, end1, string2, end2); > s1 = SYNTAX (c1); There are eight occurrences of SYNTAX_TABLE_BYTE_TO_CHAR in regex-emacs.c. I think I will check them all, amending them as in your patch. What do you say? -- Alan Mackenzie (Nuremberg, Germany).