From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!.POSTED.blaine.gmane.org!not-for-mail From: Eli Zaretskii Newsgroups: gmane.emacs.devel Subject: Re: Questionable code in handling of wordend in the regexp engine in regex-emacs.c Date: Fri, 01 Mar 2019 15:46:14 +0200 Message-ID: <83bm2uiu6x.fsf@gnu.org> References: <20190222164522.GB5411@ACM> <20190225185656.GA3605@ACM> <20190301111018.GA5674@ACM> Injection-Info: blaine.gmane.org; posting-host="blaine.gmane.org:195.159.176.226"; logging-data="177886"; mail-complaints-to="usenet@blaine.gmane.org" Cc: monnier@iro.umontreal.ca, emacs-devel@gnu.org To: Alan Mackenzie Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Fri Mar 01 14:47:06 2019 Return-path: Envelope-to: ged-emacs-devel@m.gmane.org Original-Received: from lists.gnu.org ([209.51.188.17]) by blaine.gmane.org with esmtps (TLS1.0:RSA_AES_256_CBC_SHA1:256) (Exim 4.89) (envelope-from ) id 1gziVF-000k8j-Gk for ged-emacs-devel@m.gmane.org; Fri, 01 Mar 2019 14:47:05 +0100 Original-Received: from localhost ([127.0.0.1]:38084 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1gziVE-0002SV-6h for ged-emacs-devel@m.gmane.org; Fri, 01 Mar 2019 08:47:04 -0500 Original-Received: from eggs.gnu.org ([209.51.188.92]:59485) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1gziV7-0002SN-Tw for emacs-devel@gnu.org; Fri, 01 Mar 2019 08:46:58 -0500 Original-Received: from fencepost.gnu.org ([2001:470:142:3::e]:48318) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1gziUz-00026x-Pa; Fri, 01 Mar 2019 08:46:53 -0500 Original-Received: from [176.228.60.248] (port=2693 helo=home-c4e4a596f7) by fencepost.gnu.org with esmtpsa (TLS1.2:RSA_AES_256_CBC_SHA1:256) (Exim 4.82) (envelope-from ) id 1gziUt-0003n9-DW; Fri, 01 Mar 2019 08:46:47 -0500 In-reply-to: <20190301111018.GA5674@ACM> (message from Alan Mackenzie on Fri, 1 Mar 2019 11:10:18 +0000) X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic] X-BeenThere: emacs-devel@gnu.org X-Mailman-Version: 2.1.21 Precedence: list List-Id: "Emacs development discussions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Original-Sender: "Emacs-devel" Xref: news.gmane.org gmane.emacs.devel:233727 Archived-At: > Date: Fri, 1 Mar 2019 11:10:18 +0000 > From: Alan Mackenzie > Cc: emacs-devel@gnu.org > > SYNTAX_TABLE_BYTE_TO_CHAR ends up calling buf_bytepos_to_charpos (in > marker.c). This latter function doesn't handle well the case of `d' > being in the middle of a multibyte character; sometimes it "rounds it > down", other times it "rounds it up" to a character position. I think > it should be defined as rounding it down. It would be a relatively > simple correction (at least, technically ;-). buf_bytepos_to_charpos is not supposed to be called when the byte position is in the middle of a multibyte sequence. We have the CHAR_HEAD_P, BYTES_BY_CHAR_HEAD, and related macros for that. > For that matter, how many charpos <-> bytepos functions are there in > Emacs? Only one pair of such function exists for buffer text, and another pair for strings. > > ptrdiff_t offset = PTR_TO_OFFSET (d - 1); > > ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset); > > UPDATE_SYNTAX_TABLE (charpos); > > > which seems even more broken because `d` might point to the first byte > > after the gap, so `d - 1` will point in the middle of the gap, so it's > > simply an invalid argument to PTR_TO_OFFSET. > > I don't think this is right. Both `d' and `offset' are byte > measurements, not character measurements, so it shouldn't matter whether > the "- 1" is inside or outside the parens. However, it would be less > confusing if they were both (?all) the same. That's orthogonal. Stefan is right in that you cannot in general do pointer arithmetics on pointers into buffer text without considering the gap. You need to convert 'd' into a byte position (which is actually an offset from the beginning of buffer text), then decrement it, then convert back into a 'char *' pointer to the previous byte. The macros used for these conversions take care of skipping the gap. However, since the caller already took care to split the text into two parts, one before the gap and the other after the gap, it sounds like we don't need to bother about the gap in this case, unless 'd - 1' happens to point before the beginning of string2 argument to re_match_2_internal. > > According to the definition of PTR_TO_OFFSET and POINTER_TO_OFFSET, > > the result may be the same as if we did the decrement after the fact, > > but it still looks fishy. WDYT? > > I think it is suboptimal to have both PTR_TO_OFFSET and > POINTER_TO_OFFSET meaning different things in the same source file. ;-) Those macros hide the fact that the argument could be a Lisp string or a buffer, so I don't think I agree with you here. > > diff --git a/src/regex-emacs.c b/src/regex-emacs.c > > index b667a43a37..b21cba0e46 100644 > > --- a/src/regex-emacs.c > > +++ b/src/regex-emacs.c > > @@ -4811,9 +4811,9 @@ re_match_2_internal (struct re_pattern_buffer *bufp, re_char *string1, > > int c1, c2; > > int s1, s2; > > int dummy; > > - ptrdiff_t offset = PTR_TO_OFFSET (d) - 1; > > + ptrdiff_t offset = PTR_TO_OFFSET (d); > > ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset); > > - UPDATE_SYNTAX_TABLE (charpos); > > + UPDATE_SYNTAX_TABLE (charpos - 1); > > GET_CHAR_BEFORE_2 (c1, d, string1, end1, string2, end2); > > s1 = SYNTAX (c1); > > There are eight occurrences of SYNTAX_TABLE_BYTE_TO_CHAR in > regex-emacs.c. I think I will check them all, amending them as in your > patch. > > What do you say? I'm not Stefan, but what I say is that we should only make sure 'd' never points to the very first byte of 'string2'. If it does, then decrementing it will produce invalid results. If we cannot decide whether that situation could happen, we should add an assertion there to that effect.