From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!.POSTED.blaine.gmane.org!not-for-mail
From: Alan Mackenzie <acm@muc.de>
Newsgroups: gmane.emacs.devel
Subject: Re: Questionable code in handling of wordend in the regexp engine in
	regex-emacs.c
Date: Fri, 1 Mar 2019 14:14:48 +0000
Message-ID: <20190301141448.GC5674@ACM>
References: <20190222164522.GB5411@ACM> <jwv1s3y85lm.fsf-monnier+emacs@gnu.org>
	<20190225185656.GA3605@ACM> <jwvtvgr1yc7.fsf-monnier+emacs@gnu.org>
	<20190301111018.GA5674@ACM> <83bm2uiu6x.fsf@gnu.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Injection-Info: blaine.gmane.org; posting-host="blaine.gmane.org:195.159.176.226";
	logging-data="152040"; mail-complaints-to="usenet@blaine.gmane.org"
User-Agent: Mutt/1.10.1 (2018-07-13)
Cc: monnier@iro.umontreal.ca, emacs-devel@gnu.org
To: Eli Zaretskii <eliz@gnu.org>
Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Fri Mar 01 15:36:33 2019
Return-path: <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>
Envelope-to: ged-emacs-devel@m.gmane.org
Original-Received: from lists.gnu.org ([209.51.188.17])
	by blaine.gmane.org with esmtps (TLS1.0:RSA_AES_256_CBC_SHA1:256)
	(Exim 4.89)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>)
	id 1gzjH7-000dQ7-9v
	for ged-emacs-devel@m.gmane.org; Fri, 01 Mar 2019 15:36:33 +0100
Original-Received: from localhost ([127.0.0.1]:38793 helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>)
	id 1gzjH6-0001lV-4Y
	for ged-emacs-devel@m.gmane.org; Fri, 01 Mar 2019 09:36:32 -0500
Original-Received: from eggs.gnu.org ([209.51.188.92]:39015)
	by lists.gnu.org with esmtp (Exim 4.71) (envelope-from <acm@muc.de>)
	id 1gzjDu-0007Qt-PG
	for emacs-devel@gnu.org; Fri, 01 Mar 2019 09:33:15 -0500
Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <acm@muc.de>) id 1gzj0y-0007P8-VG
	for emacs-devel@gnu.org; Fri, 01 Mar 2019 09:19:54 -0500
Original-Received: from colin.muc.de ([193.149.48.1]:19018 helo=mail.muc.de)
	by eggs.gnu.org with smtp (Exim 4.71) (envelope-from <acm@muc.de>)
	id 1gzj0y-0007OB-Me
	for emacs-devel@gnu.org; Fri, 01 Mar 2019 09:19:52 -0500
Original-Received: (qmail 80960 invoked by uid 3782); 1 Mar 2019 14:19:50 -0000
Original-Received: from acm.muc.de (p4FE15D75.dip0.t-ipconnect.de [79.225.93.117]) by
	colin.muc.de (tmda-ofmipd) with ESMTP;
	Fri, 01 Mar 2019 15:19:49 +0100
Original-Received: (qmail 10560 invoked by uid 1000); 1 Mar 2019 14:14:48 -0000
Content-Disposition: inline
In-Reply-To: <83bm2uiu6x.fsf@gnu.org>
X-Delivery-Agent: TMDA/1.1.12 (Macallan)
X-Primary-Address: acm@muc.de
X-detected-operating-system: by eggs.gnu.org: FreeBSD 9.x [fuzzy]
X-Received-From: 193.149.48.1
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.21
Precedence: list
List-Id: "Emacs development discussions." <emacs-devel.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/archive/html/emacs-devel/>
List-Post: <mailto:emacs-devel@gnu.org>
List-Help: <mailto:emacs-devel-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=subscribe>
Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Original-Sender: "Emacs-devel" <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>
Xref: news.gmane.org gmane.emacs.devel:233731
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/233731>

Hello, Eli.

On Fri, Mar 01, 2019 at 15:46:14 +0200, Eli Zaretskii wrote:
> > Date: Fri, 1 Mar 2019 11:10:18 +0000
> > From: Alan Mackenzie <acm@muc.de>
> > Cc: emacs-devel@gnu.org

> > SYNTAX_TABLE_BYTE_TO_CHAR ends up calling buf_bytepos_to_charpos (in
> > marker.c).  This latter function doesn't handle well the case of `d'
> > being in the middle of a multibyte character; sometimes it "rounds it
> > down", other times it "rounds it up" to a character position.  I think
> > it should be defined as rounding it down.  It would be a relatively
> > simple correction (at least, technically ;-).

> buf_bytepos_to_charpos is not supposed to be called when the byte
> position is in the middle of a multibyte sequence.  We have the
> CHAR_HEAD_P, BYTES_BY_CHAR_HEAD, and related macros for that.

Thanks, I didn't know that.  Maybe we should put an assert into the code,
like Stefan suggested.

> > For that matter, how many charpos <-> bytepos functions are there in
> > Emacs?

> Only one pair of such function exists for buffer text, and another
> pair for strings.

That's good.

> > > 		ptrdiff_t offset = PTR_TO_OFFSET (d - 1);
> > > 		ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset);
> > > 		UPDATE_SYNTAX_TABLE (charpos);

> > > which seems even more broken because `d` might point to the first byte
> > > after the gap, so `d - 1` will point in the middle of the gap, so it's
> > > simply an invalid argument to PTR_TO_OFFSET.

> > I don't think this is right.  Both `d' and `offset' are byte
> > measurements, not character measurements, so it shouldn't matter whether
> > the "- 1" is inside or outside the parens.  However, it would be less
> > confusing if they were both (?all) the same.

> That's orthogonal.  Stefan is right in that you cannot in general do
> pointer arithmetics on pointers into buffer text without considering
> the gap.  You need to convert 'd' into a byte position (which is
> actually an offset from the beginning of buffer text), then decrement
> it, then convert back into a 'char *' pointer to the previous byte.
> The macros used for these conversions take care of skipping the gap.

> However, since the caller already took care to split the text into two
> parts, one before the gap and the other after the gap, it sounds like
> we don't need to bother about the gap in this case, unless 'd - 1'
> happens to point before the beginning of string2 argument to
> re_match_2_internal.

I've got rid of all the questionable "d - 1"s.  All these code pieces now
first do PTR_TO_OFFSET (d), then do SYNTAX_TABLE_BYTE_TO_CHAR on the
result, and then any arithmetic on the result of that.  (See patch
below).

> > > According to the definition of PTR_TO_OFFSET and POINTER_TO_OFFSET,
> > > the result may be the same as if we did the decrement after the fact,
> > > but it still looks fishy.  WDYT?

> > I think it is suboptimal to have both PTR_TO_OFFSET and
> > POINTER_TO_OFFSET meaning different things in the same source file.  ;-)

> Those macros hide the fact that the argument could be a Lisp string or
> a buffer, so I don't think I agree with you here.

I just meant that having the two names so similar might be confusing.

[ .... ]

> > There are eight occurrences of SYNTAX_TABLE_BYTE_TO_CHAR in
> > regex-emacs.c.  I think I will check them all, amending them as in your
> > patch.

> > What do you say?

> I'm not Stefan, but what I say is that we should only make sure 'd'
> never points to the very first byte of 'string2'.  If it does, then
> decrementing it will produce invalid results.  If we cannot decide
> whether that situation could happen, we should add an assertion there
> to that effect.

I'm fairly sure it's safe, through always first doing PTR_TO_OFFSET (d),
which takes care of the gap.

Here's the patch (already "tested") which gets rid of the unwanted "d -
1"s:


diff --git a/src/regex-emacs.c b/src/regex-emacs.c
index b667a43a37..45b4f8107c 100644
--- a/src/regex-emacs.c
+++ b/src/regex-emacs.c
@@ -4732,8 +4732,8 @@ re_match_2_internal (struct re_pattern_buffer *bufp, re_char *string1,
 		int c1, c2;
 		int s1, s2;
 		int dummy;
-		ptrdiff_t offset = PTR_TO_OFFSET (d - 1);
-		ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset);
+                ptrdiff_t offset = PTR_TO_OFFSET (d);
+                ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset) - 1;
 		UPDATE_SYNTAX_TABLE (charpos);
 		GET_CHAR_BEFORE_2 (c1, d, string1, end1, string2, end2);
 		s1 = SYNTAX (c1);
@@ -4811,8 +4811,8 @@ re_match_2_internal (struct re_pattern_buffer *bufp, re_char *string1,
 	      int c1, c2;
 	      int s1, s2;
 	      int dummy;
-	      ptrdiff_t offset = PTR_TO_OFFSET (d) - 1;
-	      ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset);
+              ptrdiff_t offset = PTR_TO_OFFSET (d);
+              ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset) - 1;
 	      UPDATE_SYNTAX_TABLE (charpos);
 	      GET_CHAR_BEFORE_2 (c1, d, string1, end1, string2, end2);
 	      s1 = SYNTAX (c1);
@@ -4826,7 +4826,7 @@ re_match_2_internal (struct re_pattern_buffer *bufp, re_char *string1,
 		{
 		  PREFETCH_NOLIMIT ();
 		  GET_CHAR_AFTER (c2, d, dummy);
-		  UPDATE_SYNTAX_TABLE_FORWARD (charpos);
+                  UPDATE_SYNTAX_TABLE_FORWARD (charpos + 1);
 		  s2 = SYNTAX (c2);
 
 		  /* ... and S2 is Sword, and WORD_BOUNDARY_P (C1, C2)
@@ -4890,8 +4890,8 @@ re_match_2_internal (struct re_pattern_buffer *bufp, re_char *string1,
 		 is the character at D, and S2 is the syntax of C2.  */
 	      int c1, c2;
 	      int s1, s2;
-	      ptrdiff_t offset = PTR_TO_OFFSET (d) - 1;
-	      ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset);
+              ptrdiff_t offset = PTR_TO_OFFSET (d);
+              ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset) - 1;
 	      UPDATE_SYNTAX_TABLE (charpos);
 	      GET_CHAR_BEFORE_2 (c1, d, string1, end1, string2, end2);
 	      s1 = SYNTAX (c1);


-- 
Alan Mackenzie (Nuremberg, Germany).