Re: Questionable code in handling of wordend in the regexp engine in regex-emacs.c


From: Alan Mackenzie <acm@muc.de>
To: Eli Zaretskii <eliz@gnu.org>
Cc: monnier@iro.umontreal.ca, emacs-devel@gnu.org
Subject: Re: Questionable code in handling of wordend in the regexp engine in regex-emacs.c
Date: Fri, 1 Mar 2019 14:14:48 +0000
Message-ID: <20190301141448.GC5674@ACM>
In-Reply-To: <83bm2uiu6x.fsf@gnu.org>

Hello, Eli.

On Fri, Mar 01, 2019 at 15:46:14 +0200, Eli Zaretskii wrote:
> > Date: Fri, 1 Mar 2019 11:10:18 +0000
> > From: Alan Mackenzie <acm@muc.de>
> > Cc: emacs-devel@gnu.org

> > SYNTAX_TABLE_BYTE_TO_CHAR ends up calling buf_bytepos_to_charpos (in
> > marker.c).  This latter function doesn't handle well the case of `d'
> > being in the middle of a multibyte character; sometimes it "rounds it
> > down", other times it "rounds it up" to a character position.  I think
> > it should be defined as rounding it down.  It would be a relatively
> > simple correction (at least, technically ;-).

> buf_bytepos_to_charpos is not supposed to be called when the byte
> position is in the middle of a multibyte sequence.  We have the
> CHAR_HEAD_P, BYTES_BY_CHAR_HEAD, and related macros for that.

Thanks, I didn't know that.  Maybe we should put an assert into the code,
like Stefan suggested.
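
As a rough illustration of what such an assertion could look like, here is
a self-contained toy model.  It is not the code in marker.c: the names
toy_char_head_p and toy_bytepos_to_charpos are hypothetical, the text is a
flat array with no gap, positions are 0-based, and the encoding is assumed
to be UTF-8-like (continuation bytes match 10xxxxxx).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a "character head" test: in a UTF-8-like
   encoding, continuation bytes have the bit pattern 10xxxxxx.  */
static bool
toy_char_head_p (unsigned char byte)
{
  return (byte & 0xC0) != 0x80;
}

/* Toy byte-to-char conversion over a flat (gapless) text of NBYTES
   bytes.  BYTEPOS is a 0-based byte offset that must lie on a
   character boundary or be equal to NBYTES.  Returns the 0-based
   character offset.  */
static ptrdiff_t
toy_bytepos_to_charpos (const unsigned char *text, ptrdiff_t nbytes,
                        ptrdiff_t bytepos)
{
  assert (0 <= bytepos && bytepos <= nbytes);
  /* The kind of assertion under discussion: refuse a position in the
     middle of a multibyte sequence instead of silently rounding.  */
  assert (bytepos == nbytes || toy_char_head_p (text[bytepos]));

  ptrdiff_t charpos = 0;
  for (ptrdiff_t i = 0; i < bytepos; i++)
    if (toy_char_head_p (text[i]))
      charpos++;
  return charpos;
}
```

With the four bytes 'a', 0xC3, 0xA9, 'b', byte offsets 0, 1, 3 and 4 are
accepted, while byte offset 2 (the second byte of the two-byte character)
trips the assertion rather than being rounded in either direction.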

> > For that matter, how many charpos <-> bytepos functions are there in
> > Emacs?

> Only one pair of such functions exists for buffer text, and another
> pair for strings.

That's good.

> > > 		ptrdiff_t offset = PTR_TO_OFFSET (d - 1);
> > > 		ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset);
> > > 		UPDATE_SYNTAX_TABLE (charpos);

> > > which seems even more broken because `d` might point to the first byte
> > > after the gap, so `d - 1` will point in the middle of the gap, so it's
> > > simply an invalid argument to PTR_TO_OFFSET.

> > I don't think this is right.  Both `d' and `offset' are byte
> > measurements, not character measurements, so it shouldn't matter whether
> > the "- 1" is inside or outside the parens.  However, it would be less
> > confusing if they were both (?all) the same.

> That's orthogonal.  Stefan is right in that you cannot in general do
> pointer arithmetic on pointers into buffer text without considering
> the gap.  You need to convert 'd' into a byte position (which is
> actually an offset from the beginning of buffer text), then decrement
> it, then convert back into a 'char *' pointer to the previous byte.
> The macros used for these conversions take care of skipping the gap.

> However, since the caller already took care to split the text into two
> parts, one before the gap and the other after the gap, it sounds like
> we don't need to bother about the gap in this case, unless 'd - 1'
> happens to point before the beginning of string2 argument to
> re_match_2_internal.

I've got rid of all the questionable "d - 1"s.  All these code pieces now
first do PTR_TO_OFFSET (d), then do SYNTAX_TABLE_BYTE_TO_CHAR on the
result, and then any arithmetic on the result of that.  (See patch
below).
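
To make the difference between the two orderings concrete, here is a small
toy model.  It is only a sketch under stated assumptions: the sketch_*
names are hypothetical, positions are 0-based offsets into a flat array
(no gap, no syntax table), and the encoding is assumed to be UTF-8-like.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* True if BYTE starts a character, i.e. is not a 10xxxxxx
   continuation byte.  */
static bool
sketch_char_head_p (unsigned char byte)
{
  return (byte & 0xC0) != 0x80;
}

/* Number of characters that start within TEXT[0 .. BYTEPOS).  */
static ptrdiff_t
sketch_byte_to_char (const unsigned char *text, ptrdiff_t bytepos)
{
  ptrdiff_t n = 0;
  for (ptrdiff_t i = 0; i < bytepos; i++)
    if (sketch_char_head_p (text[i]))
      n++;
  return n;
}

/* Old shape: step back one *byte* before converting.  Reports whether
   the resulting byte offset D - 1 is on a character boundary, i.e.
   whether it would be a legitimate argument to a byte-to-char
   conversion.  */
static bool
sketch_old_offset_on_boundary (const unsigned char *text, ptrdiff_t d)
{
  assert (d > 0);
  return sketch_char_head_p (text[d - 1]);
}

/* New shape: convert the boundary-aligned byte offset D first, then
   step back one *character*.  */
static ptrdiff_t
sketch_prev_charpos (const unsigned char *text, ptrdiff_t d)
{
  assert (d > 0);
  return sketch_byte_to_char (text, d) - 1;
}
```

For the bytes 'a', 0xC3, 0xA9, 'b' with d at byte offset 3, the old shape
lands on the continuation byte 0xA9, whereas the new shape yields character
offset 1, the two-byte character immediately before d.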

> > > According to the definition of PTR_TO_OFFSET and POINTER_TO_OFFSET,
> > > the result may be the same as if we did the decrement after the fact,
> > > but it still looks fishy.  WDYT?

> > I think it is suboptimal to have both PTR_TO_OFFSET and
> > POINTER_TO_OFFSET meaning different things in the same source file.  ;-)

> Those macros hide the fact that the argument could be a Lisp string or
> a buffer, so I don't think I agree with you here.

I just meant that having the two names so similar might be confusing.

[ .... ]

> > There are eight occurrences of SYNTAX_TABLE_BYTE_TO_CHAR in
> > regex-emacs.c.  I think I will check them all, amending them as in your
> > patch.

> > What do you say?

> I'm not Stefan, but what I say is that we should only make sure 'd'
> never points to the very first byte of 'string2'.  If it does, then
> decrementing it will produce invalid results.  If we cannot decide
> whether that situation could happen, we should add an assertion there
> to that effect.

I'm fairly sure it's safe, because the code now always does
PTR_TO_OFFSET (d) first, which takes care of the gap.
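
The gap concern raised earlier in the thread can be sketched with a toy
gap buffer.  This is an illustration only, not the regex-emacs.c macros:
struct toy_gap_buffer and the toy_* functions are hypothetical, and the
text is modelled as a single array with a hole in the middle.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy gap buffer: real text occupies
   [0, gap_start) and [gap_start + gap_size, ...) of BASE.  */
struct toy_gap_buffer
{
  const unsigned char *base; /* start of the storage */
  ptrdiff_t gap_start;       /* storage index where the gap begins */
  ptrdiff_t gap_size;        /* number of unused bytes in the gap */
};

/* Convert a pointer into the storage to a gap-free byte offset.
   P must not point strictly inside the gap.  */
static ptrdiff_t
toy_ptr_to_offset (const struct toy_gap_buffer *b, const unsigned char *p)
{
  ptrdiff_t raw = p - b->base;
  assert (raw <= b->gap_start || raw >= b->gap_start + b->gap_size);
  return raw <= b->gap_start ? raw : raw - b->gap_size;
}

/* Convert a gap-free byte offset back to a pointer, skipping the gap.  */
static const unsigned char *
toy_offset_to_ptr (const struct toy_gap_buffer *b, ptrdiff_t offset)
{
  return b->base + (offset < b->gap_start ? offset : offset + b->gap_size);
}

/* Pointer to the byte of text before P: convert to an offset,
   decrement, convert back.  Plain P - 1 would be wrong when P is the
   first byte after the gap.  */
static const unsigned char *
toy_prev_byte (const struct toy_gap_buffer *b, const unsigned char *p)
{
  ptrdiff_t offset = toy_ptr_to_offset (b, p);
  assert (offset > 0);
  return toy_offset_to_ptr (b, offset - 1);
}
```

With storage "ab___cd" (gap of three bytes starting at index 2) and d
pointing at 'c', plain d - 1 points at a gap byte, while going through
the offset and back gives 'b'.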

Here's the patch (already "tested") which gets rid of the unwanted
"d - 1"s:



diff --git a/src/regex-emacs.c b/src/regex-emacs.c
index b667a43a37..45b4f8107c 100644
--- a/src/regex-emacs.c
+++ b/src/regex-emacs.c
@@ -4732,8 +4732,8 @@ re_match_2_internal (struct re_pattern_buffer *bufp, re_char *string1,
 		int c1, c2;
 		int s1, s2;
 		int dummy;
-		ptrdiff_t offset = PTR_TO_OFFSET (d - 1);
-		ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset);
+                ptrdiff_t offset = PTR_TO_OFFSET (d);
+                ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset) - 1;
 		UPDATE_SYNTAX_TABLE (charpos);
 		GET_CHAR_BEFORE_2 (c1, d, string1, end1, string2, end2);
 		s1 = SYNTAX (c1);
@@ -4811,8 +4811,8 @@ re_match_2_internal (struct re_pattern_buffer *bufp, re_char *string1,
 	      int c1, c2;
 	      int s1, s2;
 	      int dummy;
-	      ptrdiff_t offset = PTR_TO_OFFSET (d) - 1;
-	      ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset);
+              ptrdiff_t offset = PTR_TO_OFFSET (d);
+              ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset) - 1;
 	      UPDATE_SYNTAX_TABLE (charpos);
 	      GET_CHAR_BEFORE_2 (c1, d, string1, end1, string2, end2);
 	      s1 = SYNTAX (c1);
@@ -4826,7 +4826,7 @@ re_match_2_internal (struct re_pattern_buffer *bufp, re_char *string1,
 		{
 		  PREFETCH_NOLIMIT ();
 		  GET_CHAR_AFTER (c2, d, dummy);
-		  UPDATE_SYNTAX_TABLE_FORWARD (charpos);
+                  UPDATE_SYNTAX_TABLE_FORWARD (charpos + 1);
 		  s2 = SYNTAX (c2);
 
 		  /* ... and S2 is Sword, and WORD_BOUNDARY_P (C1, C2)
@@ -4890,8 +4890,8 @@ re_match_2_internal (struct re_pattern_buffer *bufp, re_char *string1,
 		 is the character at D, and S2 is the syntax of C2.  */
 	      int c1, c2;
 	      int s1, s2;
-	      ptrdiff_t offset = PTR_TO_OFFSET (d) - 1;
-	      ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset);
+              ptrdiff_t offset = PTR_TO_OFFSET (d);
+              ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset) - 1;
 	      UPDATE_SYNTAX_TABLE (charpos);
 	      GET_CHAR_BEFORE_2 (c1, d, string1, end1, string2, end2);
 	      s1 = SYNTAX (c1);


-- 
Alan Mackenzie (Nuremberg, Germany).

