From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!.POSTED.blaine.gmane.org!not-for-mail From: Alan Mackenzie Newsgroups: gmane.emacs.devel Subject: Re: Questionable code in handling of wordend in the regexp engine in regex-emacs.c Date: Fri, 1 Mar 2019 14:14:48 +0000 Message-ID: <20190301141448.GC5674@ACM> References: <20190222164522.GB5411@ACM> <20190225185656.GA3605@ACM> <20190301111018.GA5674@ACM> <83bm2uiu6x.fsf@gnu.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Injection-Info: blaine.gmane.org; posting-host="blaine.gmane.org:195.159.176.226"; logging-data="152040"; mail-complaints-to="usenet@blaine.gmane.org" User-Agent: Mutt/1.10.1 (2018-07-13) Cc: monnier@iro.umontreal.ca, emacs-devel@gnu.org To: Eli Zaretskii Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Fri Mar 01 15:36:33 2019 Return-path: Envelope-to: ged-emacs-devel@m.gmane.org Original-Received: from lists.gnu.org ([209.51.188.17]) by blaine.gmane.org with esmtps (TLS1.0:RSA_AES_256_CBC_SHA1:256) (Exim 4.89) (envelope-from ) id 1gzjH7-000dQ7-9v for ged-emacs-devel@m.gmane.org; Fri, 01 Mar 2019 15:36:33 +0100 Original-Received: from localhost ([127.0.0.1]:38793 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1gzjH6-0001lV-4Y for ged-emacs-devel@m.gmane.org; Fri, 01 Mar 2019 09:36:32 -0500 Original-Received: from eggs.gnu.org ([209.51.188.92]:39015) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1gzjDu-0007Qt-PG for emacs-devel@gnu.org; Fri, 01 Mar 2019 09:33:15 -0500 Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1gzj0y-0007P8-VG for emacs-devel@gnu.org; Fri, 01 Mar 2019 09:19:54 -0500 Original-Received: from colin.muc.de ([193.149.48.1]:19018 helo=mail.muc.de) by eggs.gnu.org with smtp (Exim 4.71) (envelope-from ) id 1gzj0y-0007OB-Me for emacs-devel@gnu.org; Fri, 01 Mar 2019 09:19:52 -0500 Original-Received: (qmail 80960 invoked by uid 3782); 1 Mar 2019 14:19:50 -0000 Original-Received: from acm.muc.de (p4FE15D75.dip0.t-ipconnect.de [79.225.93.117]) by colin.muc.de (tmda-ofmipd) with ESMTP; Fri, 01 Mar 2019 15:19:49 +0100 Original-Received: (qmail 10560 invoked by uid 1000); 1 Mar 2019 14:14:48 -0000 Content-Disposition: inline In-Reply-To: <83bm2uiu6x.fsf@gnu.org> X-Delivery-Agent: TMDA/1.1.12 (Macallan) X-Primary-Address: acm@muc.de X-detected-operating-system: by eggs.gnu.org: FreeBSD 9.x [fuzzy] X-Received-From: 193.149.48.1 X-BeenThere: emacs-devel@gnu.org X-Mailman-Version: 2.1.21 Precedence: list List-Id: "Emacs development discussions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Original-Sender: "Emacs-devel" Xref: news.gmane.org gmane.emacs.devel:233731 Archived-At: Hello, Eli. On Fri, Mar 01, 2019 at 15:46:14 +0200, Eli Zaretskii wrote: > > Date: Fri, 1 Mar 2019 11:10:18 +0000 > > From: Alan Mackenzie > > Cc: emacs-devel@gnu.org > > SYNTAX_TABLE_BYTE_TO_CHAR ends up calling buf_bytepos_to_charpos (in > > marker.c). This latter function doesn't handle well the case of `d' > > being in the middle of a multibyte character; sometimes it "rounds it > > down", other times it "rounds it up" to a character position. I think > > it should be defined as rounding it down. It would be a relatively > > simple correction (at least, technically ;-). > buf_bytepos_to_charpos is not supposed to be called when the byte > position is in the middle of a multibyte sequence. We have the > CHAR_HEAD_P, BYTES_BY_CHAR_HEAD, and related macros for that. Thanks, I didn't know that. Maybe we should put an assert into the code, like Stefan suggested. > > For that matter, how many charpos <-> bytepos functions are there in > > Emacs? > Only one pair of such function exists for buffer text, and another > pair for strings. That's good. > > > ptrdiff_t offset = PTR_TO_OFFSET (d - 1); > > > ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset); > > > UPDATE_SYNTAX_TABLE (charpos); > > > which seems even more broken because `d` might point to the first byte > > > after the gap, so `d - 1` will point in the middle of the gap, so it's > > > simply an invalid argument to PTR_TO_OFFSET. > > I don't think this is right. Both `d' and `offset' are byte > > measurements, not character measurements, so it shouldn't matter whether > > the "- 1" is inside or outside the parens. However, it would be less > > confusing if they were both (?all) the same. > That's orthogonal. Stefan is right in that you cannot in general do > pointer arithmetics on pointers into buffer text without considering > the gap. You need to convert 'd' into a byte position (which is > actually an offset from the beginning of buffer text), then decrement > it, then convert back into a 'char *' pointer to the previous byte. > The macros used for these conversions take care of skipping the gap. > However, since the caller already took care to split the text into two > parts, one before the gap and the other after the gap, it sounds like > we don't need to bother about the gap in this case, unless 'd - 1' > happens to point before the beginning of string2 argument to > re_match_2_internal. I've got rid of all the questionable "d - 1"s. All these code pieces now first do PTR_TO_OFFSET (d), then do SYNTAX_TABLE_BYTE_TO_CHAR on the result, and then any arithmetic on the result of that. (See patch below). > > > According to the definition of PTR_TO_OFFSET and POINTER_TO_OFFSET, > > > the result may be the same as if we did the decrement after the fact, > > > but it still looks fishy. WDYT? > > I think it is suboptimal to have both PTR_TO_OFFSET and > > POINTER_TO_OFFSET meaning different things in the same source file. ;-) > Those macros hide the fact that the argument could be a Lisp string or > a buffer, so I don't think I agree with you here. I just meant that having the two names so similar might be confusing. [ .... ] > > There are eight occurrences of SYNTAX_TABLE_BYTE_TO_CHAR in > > regex-emacs.c. I think I will check them all, amending them as in your > > patch. > > What do you say? > I'm not Stefan, but what I say is that we should only make sure 'd' > never points to the very first byte of 'string2'. If it does, then > decrementing it will produce invalid results. If we cannot decide > whether that situation could happen, we should add an assertion there > to that effect. I'm fairly sure it's safe, through always first doing PTR_TO_OFFSET (d), which takes care of the gap. Here's the patch (already "tested") which gets rid of the unwanted "d - 1"s: diff --git a/src/regex-emacs.c b/src/regex-emacs.c index b667a43a37..45b4f8107c 100644 --- a/src/regex-emacs.c +++ b/src/regex-emacs.c @@ -4732,8 +4732,8 @@ re_match_2_internal (struct re_pattern_buffer *bufp, re_char *string1, int c1, c2; int s1, s2; int dummy; - ptrdiff_t offset = PTR_TO_OFFSET (d - 1); - ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset); + ptrdiff_t offset = PTR_TO_OFFSET (d); + ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset) - 1; UPDATE_SYNTAX_TABLE (charpos); GET_CHAR_BEFORE_2 (c1, d, string1, end1, string2, end2); s1 = SYNTAX (c1); @@ -4811,8 +4811,8 @@ re_match_2_internal (struct re_pattern_buffer *bufp, re_char *string1, int c1, c2; int s1, s2; int dummy; - ptrdiff_t offset = PTR_TO_OFFSET (d) - 1; - ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset); + ptrdiff_t offset = PTR_TO_OFFSET (d); + ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset) - 1; UPDATE_SYNTAX_TABLE (charpos); GET_CHAR_BEFORE_2 (c1, d, string1, end1, string2, end2); s1 = SYNTAX (c1); @@ -4826,7 +4826,7 @@ re_match_2_internal (struct re_pattern_buffer *bufp, re_char *string1, { PREFETCH_NOLIMIT (); GET_CHAR_AFTER (c2, d, dummy); - UPDATE_SYNTAX_TABLE_FORWARD (charpos); + UPDATE_SYNTAX_TABLE_FORWARD (charpos + 1); s2 = SYNTAX (c2); /* ... and S2 is Sword, and WORD_BOUNDARY_P (C1, C2) @@ -4890,8 +4890,8 @@ re_match_2_internal (struct re_pattern_buffer *bufp, re_char *string1, is the character at D, and S2 is the syntax of C2. */ int c1, c2; int s1, s2; - ptrdiff_t offset = PTR_TO_OFFSET (d) - 1; - ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset); + ptrdiff_t offset = PTR_TO_OFFSET (d); + ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset) - 1; UPDATE_SYNTAX_TABLE (charpos); GET_CHAR_BEFORE_2 (c1, d, string1, end1, string2, end2); s1 = SYNTAX (c1); -- Alan Mackenzie (Nuremberg, Germany).