Questionable code in handling of wordend in the regexp engine in regex-emacs.c

unofficial mirror of emacs-devel@gnu.org 
 help / color / mirror / code / Atom feed

* Questionable code in handling of wordend in the regexp engine in regex-emacs.c
@ 2019-02-22 16:45 Alan Mackenzie
  2019-02-23 23:15 ` Stefan Monnier
  0 siblings, 1 reply; 22+ messages in thread
From: Alan Mackenzie @ 2019-02-22 16:45 UTC (permalink / raw)
  To: emacs-devel

Hello, Emacs.

In function re_match_2_internal (in regex-emacs.c) in the handling of
case wordend: inside the large switch statement (approximately line 4800
in the file), there is some code, the look of which I don't like.

Primarily, there is an

    UPDATE_SYNTAX_TABLE (charpos);

before determining the syntax of the previous character, which seems OK.
Later on, before determining the syntax of the next character, we have:

    UPDATE_SYNTAX_TABLE_FORWARD (charpos);

.  Between these two calls, charpos hasn't been changed.  Surely the
argument to the second occurrence should be (charpos + 1)?

Also, probably less importantly, there is

    GET_CHAR_AFTER (c2, d, dummy);

, whereas at the same place in the handler for case symend: we have
instead

    c2 = RE_STRING_CHAR (d, target_multibyte);

.  Is the effect of these macros identical, or is one of them up to
date, and the other one really needs updating as well, for correct
functionality?

I came across these whilst investigating bug #34525.  Making the
indicated changes to regex-emacs.c sadly doesn't help solve the symptoms
of that bug.  :-(

-- 
Alan Mackenzie (Nuremberg, Germany).

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Questionable code in handling of wordend in the regexp engine in regex-emacs.c
  2019-02-22 16:45 Questionable code in handling of wordend in the regexp engine in regex-emacs.c Alan Mackenzie
@ 2019-02-23 23:15 ` Stefan Monnier
  2019-02-25 18:56   ` Alan Mackenzie
  0 siblings, 1 reply; 22+ messages in thread
From: Stefan Monnier @ 2019-02-23 23:15 UTC (permalink / raw)
  To: emacs-devel

> Primarily, there is an
>
>     UPDATE_SYNTAX_TABLE (charpos);
>
> before determining the syntax of the previous character, which seems OK.
> Later on, before determining the syntax of the next character, we have:
>
>     UPDATE_SYNTAX_TABLE_FORWARD (charpos);
>
> .  Between these two calls, charpos hasn't been changed.

Good spotting.

> Surely the argument to the second occurrence should be (charpos + 1)?

I believe it's instead the other one that needs to use "charpos - 1"
because the UPDATE_SYNTAX_TABLE is called just before reading the char
*before* charpos (see patch below).

> Also, probably less importantly, there is
>
>     GET_CHAR_AFTER (c2, d, dummy);
>
> , whereas at the same place in the handler for case symend: we have
> instead
>
>     c2 = RE_STRING_CHAR (d, target_multibyte);
>
> .  Is the effect of these macros identical, or is one of them up to
> date, and the other one really needs updating as well, for correct
> functionality?

According to my reading of the code, they're identical in multibyte
buffers not in unibyte buffers where RE_STRING_CHAR just returns a value
between 0 and 255 (i.e. ASCII or Latin-1 more or less), whereas
GET_CHAR_AFTER will return either an ASCII char (0..127) or a raw-byte
char (4194176..4194303).

I think it's more correct to return a raw-byte char (4194176..4194303),
so I'd tend to think that GET_CHAR_AFTER is the better choice, but
please don't quote me on this.

> I came across these whilst investigating bug #34525.  Making the
> indicated changes to regex-emacs.c sadly doesn't help solve the symptoms
> of that bug.  :-(

Does the patch below help?


        Stefan


diff --git a/src/regex-emacs.c b/src/regex-emacs.c
index b667a43a37..72fb5ec561 100644
--- a/src/regex-emacs.c
+++ b/src/regex-emacs.c
@@ -4813,7 +4813,7 @@ re_match_2_internal (struct re_pattern_buffer *bufp, re_char *string1,
 	      int dummy;
 	      ptrdiff_t offset = PTR_TO_OFFSET (d) - 1;
 	      ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset);
-	      UPDATE_SYNTAX_TABLE (charpos);
+	      UPDATE_SYNTAX_TABLE (charpos - 1);
 	      GET_CHAR_BEFORE_2 (c1, d, string1, end1, string2, end2);
 	      s1 = SYNTAX (c1);
 




^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: Questionable code in handling of wordend in the regexp engine in regex-emacs.c
  2019-02-23 23:15 ` Stefan Monnier
@ 2019-02-25 18:56   ` Alan Mackenzie
  2019-02-25 19:18     ` Stefan Monnier
  0 siblings, 1 reply; 22+ messages in thread
From: Alan Mackenzie @ 2019-02-25 18:56 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: emacs-devel

Hello, Stefan.

Sorry about the delay in replying.

On Sat, Feb 23, 2019 at 18:15:55 -0500, Stefan Monnier wrote:
> > Primarily, there is an

> >     UPDATE_SYNTAX_TABLE (charpos);

> > before determining the syntax of the previous character, which seems OK.
> > Later on, before determining the syntax of the next character, we have:

> >     UPDATE_SYNTAX_TABLE_FORWARD (charpos);

> > .  Between these two calls, charpos hasn't been changed.

> Good spotting.

Thanks!

> > Surely the argument to the second occurrence should be (charpos + 1)?

> I believe it's instead the other one that needs to use "charpos - 1"
> because the UPDATE_SYNTAX_TABLE is called just before reading the char
> *before* charpos (see patch below).

I don't think this is right.  offset is calculated from d, and then
decremented, before calculating charpos.

> > Also, probably less importantly, there is

> >     GET_CHAR_AFTER (c2, d, dummy);

> > , whereas at the same place in the handler for case symend: we have
> > instead

> >     c2 = RE_STRING_CHAR (d, target_multibyte);

> > .  Is the effect of these macros identical, or is one of them up to
> > date, and the other one really needs updating as well, for correct
> > functionality?

> According to my reading of the code, they're identical in multibyte
> buffers not in unibyte buffers where RE_STRING_CHAR just returns a value
> between 0 and 255 (i.e. ASCII or Latin-1 more or less), whereas
> GET_CHAR_AFTER will return either an ASCII char (0..127) or a raw-byte
> char (4194176..4194303).

OK.

> I think it's more correct to return a raw-byte char (4194176..4194303),
> so I'd tend to think that GET_CHAR_AFTER is the better choice, but
> please don't quote me on this.

I won't say a word!

> > I came across these whilst investigating bug #34525.  Making the
> > indicated changes to regex-emacs.c sadly doesn't help solve the symptoms
> > of that bug.  :-(

> Does the patch below help?

Unfortunately not, not for bug #34525.  I did try it out, though.  In the
mean time, I've advanced somewhat in the debugging.

>         Stefan


> diff --git a/src/regex-emacs.c b/src/regex-emacs.c
> index b667a43a37..72fb5ec561 100644
> --- a/src/regex-emacs.c
> +++ b/src/regex-emacs.c
> @@ -4813,7 +4813,7 @@ re_match_2_internal (struct re_pattern_buffer *bufp, re_char *string1,
>  	      int dummy;
 	      ptrdiff_t offset = PTR_TO_OFFSET (d) - 1;
>  	      ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset);
> -	      UPDATE_SYNTAX_TABLE (charpos);
> +	      UPDATE_SYNTAX_TABLE (charpos - 1);
>  	      GET_CHAR_BEFORE_2 (c1, d, string1, end1, string2, end2);
>  	      s1 = SYNTAX (c1);
 
-- 
Alan Mackenzie (Nuremberg, Germany).



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Questionable code in handling of wordend in the regexp engine in regex-emacs.c
  2019-02-25 18:56   ` Alan Mackenzie
@ 2019-02-25 19:18     ` Stefan Monnier
  2019-03-01 11:10       ` Alan Mackenzie
  0 siblings, 1 reply; 22+ messages in thread
From: Stefan Monnier @ 2019-02-25 19:18 UTC (permalink / raw)
  To: emacs-devel; +Cc: Alan Mackenzie

>> > Surely the argument to the second occurrence should be (charpos + 1)?
>> I believe it's instead the other one that needs to use "charpos - 1"
>> because the UPDATE_SYNTAX_TABLE is called just before reading the char
>> *before* charpos (see patch below).
> I don't think this is right.  offset is calculated from d, and then
> decremented, before calculating charpos.

Hmm... I think you're right (and the symend code does like you suggest).
This said, I find it odd that the code does:

	      ptrdiff_t offset = PTR_TO_OFFSET (d) - 1;
	      ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset);
	      UPDATE_SYNTAX_TABLE (charpos);

Supposedly `d` is a char* pointing to the beginning of a potentially
multibyte char, In that case `d - 1` will point "somewhere before the
end of the previous multibyte char" but not necessarily at
its beginning.  Maybe the patch below would be preferable to avoid
this situation?

Worse, in notwordbound we do:

		ptrdiff_t offset = PTR_TO_OFFSET (d - 1);
		ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset);
		UPDATE_SYNTAX_TABLE (charpos);

which seems even more broken because `d` might point to the first byte
after the gap, so `d - 1` will point in the middle of the gap, so it's
simply an invalid argument to PTR_TO_OFFSET.  According to the
definition of PTR_TO_OFFSET and POINTER_TO_OFFSET, the result may
be the same as if we did the decrement after the fact, but it still
looks fishy.  WDYT?

        Stefan

diff --git a/src/regex-emacs.c b/src/regex-emacs.c
index b667a43a37..b21cba0e46 100644
--- a/src/regex-emacs.c
+++ b/src/regex-emacs.c
@@ -4811,9 +4811,9 @@ re_match_2_internal (struct re_pattern_buffer *bufp, re_char *string1,
 	      int c1, c2;
 	      int s1, s2;
 	      int dummy;
-	      ptrdiff_t offset = PTR_TO_OFFSET (d) - 1;
+	      ptrdiff_t offset = PTR_TO_OFFSET (d);
 	      ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset);
-	      UPDATE_SYNTAX_TABLE (charpos);
+	      UPDATE_SYNTAX_TABLE (charpos - 1);
 	      GET_CHAR_BEFORE_2 (c1, d, string1, end1, string2, end2);
 	      s1 = SYNTAX (c1);

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: Questionable code in handling of wordend in the regexp engine in regex-emacs.c
  2019-02-25 19:18     ` Stefan Monnier
@ 2019-03-01 11:10       ` Alan Mackenzie
  2019-03-01 13:41         ` Stefan Monnier
  2019-03-01 13:46         ` Eli Zaretskii
  0 siblings, 2 replies; 22+ messages in thread
From: Alan Mackenzie @ 2019-03-01 11:10 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: emacs-devel

Hello, Stefan.

On Mon, Feb 25, 2019 at 14:18:10 -0500, Stefan Monnier wrote:
> >> > Surely the argument to the second occurrence should be (charpos + 1)?
> >> I believe it's instead the other one that needs to use "charpos - 1"
> >> because the UPDATE_SYNTAX_TABLE is called just before reading the char
> >> *before* charpos (see patch below).
> > I don't think this is right.  offset is calculated from d, and then
> > decremented, before calculating charpos.

> Hmm... I think you're right (and the symend code does like you suggest).
> This said, I find it odd that the code does:

> 	      ptrdiff_t offset = PTR_TO_OFFSET (d) - 1;
> 	      ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset);
> 	      UPDATE_SYNTAX_TABLE (charpos);

> Supposedly `d` is a char* pointing to the beginning of a potentially
> multibyte char, In that case `d - 1` will point "somewhere before the
> end of the previous multibyte char" but not necessarily at
> its beginning.  Maybe the patch below would be preferable to avoid
> this situation?

SYNTAX_TABLE_BYTE_TO_CHAR ends up calling buf_bytepos_to_charpos (in
marker.c).  This latter function doesn't handle well the case of `d'
being in the middle of a multibyte character; sometimes it "rounds it
down", other times it "rounds it up" to a character position.  I think
it should be defined as rounding it down.  It would be a relatively
simple correction (at least, technically ;-).

I think your patch "always do the arithmetic on a charpos, not a
bytepos" is a very good idea indeed.

But I'm still a little worried about buf_bytepos_to_charpos.  Perhaps it
should state that the result is undefined when the bytepos is "invalid".
How many other places in Emacs call it with "invalid" byte positions?
For that matter, how many charpos <-> bytepos functions are there in
Emacs?  Just this one?

> Worse, in notwordbound we do:

> 		ptrdiff_t offset = PTR_TO_OFFSET (d - 1);
> 		ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset);
> 		UPDATE_SYNTAX_TABLE (charpos);

> which seems even more broken because `d` might point to the first byte
> after the gap, so `d - 1` will point in the middle of the gap, so it's
> simply an invalid argument to PTR_TO_OFFSET.

I don't think this is right.  Both `d' and `offset' are byte
measurements, not character measurements, so it shouldn't matter whether
the "- 1" is inside or outside the parens.  However, it would be less
confusing if they were both (?all) the same.

> According to the definition of PTR_TO_OFFSET and POINTER_TO_OFFSET,
> the result may be the same as if we did the decrement after the fact,
> but it still looks fishy.  WDYT?

I think it is suboptimal to have both PTR_TO_OFFSET and
POINTER_TO_OFFSET meaning different things in the same source file.  ;-)

>         Stefan

> diff --git a/src/regex-emacs.c b/src/regex-emacs.c
> index b667a43a37..b21cba0e46 100644
> --- a/src/regex-emacs.c
> +++ b/src/regex-emacs.c
> @@ -4811,9 +4811,9 @@ re_match_2_internal (struct re_pattern_buffer *bufp, re_char *string1,
>  	      int c1, c2;
>  	      int s1, s2;
>  	      int dummy;
> -	      ptrdiff_t offset = PTR_TO_OFFSET (d) - 1;
> +	      ptrdiff_t offset = PTR_TO_OFFSET (d);
>  	      ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset);
> -	      UPDATE_SYNTAX_TABLE (charpos);
> +	      UPDATE_SYNTAX_TABLE (charpos - 1);
>  	      GET_CHAR_BEFORE_2 (c1, d, string1, end1, string2, end2);
>  	      s1 = SYNTAX (c1);

There are eight occurrences of SYNTAX_TABLE_BYTE_TO_CHAR in
regex-emacs.c.  I think I will check them all, amending them as in your
patch.

What do you say?

-- 
Alan Mackenzie (Nuremberg, Germany).

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Questionable code in handling of wordend in the regexp engine in regex-emacs.c
  2019-03-01 11:10       ` Alan Mackenzie
@ 2019-03-01 13:41         ` Stefan Monnier
  2019-03-01 13:46         ` Eli Zaretskii
  1 sibling, 0 replies; 22+ messages in thread
From: Stefan Monnier @ 2019-03-01 13:41 UTC (permalink / raw)
  To: Alan Mackenzie; +Cc: emacs-devel

> down", other times it "rounds it up" to a character position.  I think
> it should be defined as rounding it down.  It would be a relatively
> simple correction (at least, technically ;-).

When moving forward, rounding it up is more natural ;-)

> But I'm still a little worried about buf_bytepos_to_charpos.  Perhaps it
> should state that the result is undefined when the bytepos is "invalid".

Yes, I think it's the intention.  Even better would be to signal an
error (when built with --enable-checking).

> For that matter, how many charpos <-> bytepos functions are there in
> Emacs?  Just this one?

I think so, yes.

>> Worse, in notwordbound we do:
>
>> 		ptrdiff_t offset = PTR_TO_OFFSET (d - 1);
>> 		ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset);
>> 		UPDATE_SYNTAX_TABLE (charpos);
>
>> which seems even more broken because `d` might point to the first byte
>> after the gap, so `d - 1` will point in the middle of the gap, so it's
>> simply an invalid argument to PTR_TO_OFFSET.
>
> I don't think this is right.  Both `d' and `offset' are byte
> measurements, not character measurements, so it shouldn't matter whether
> the "- 1" is inside or outside the parens.  However, it would be less
> confusing if they were both (?all) the same.

The difference between `d` and `offset` is just an offset, indeed, but
it can be 2 different offsets depending on whether `d` is before or
after the gap, so what happens when `d` is within the gap depends on how
the test for "before/after the gap" is implemented.

More specifically, when `d` is N bytes before the end of the gap, the
code could consider it as being N bytes before the beginning of the
second part, or being "gap-size - N" bytes after the end of the
first part.

>> According to the definition of PTR_TO_OFFSET and POINTER_TO_OFFSET,
>> the result may be the same as if we did the decrement after the fact,
>> but it still looks fishy.  WDYT?
>
> I think it is suboptimal to have both PTR_TO_OFFSET and
> POINTER_TO_OFFSET meaning different things in the same source file.  ;-)

I'm so glad you're volunteering to clean this up.
Thank you, really.

> There are eight occurrences of SYNTAX_TABLE_BYTE_TO_CHAR in
> regex-emacs.c.  I think I will check them all, amending them as in your
> patch.
> What do you say?

Thanks,


        Stefan



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Questionable code in handling of wordend in the regexp engine in regex-emacs.c
  2019-03-01 11:10       ` Alan Mackenzie
  2019-03-01 13:41         ` Stefan Monnier
@ 2019-03-01 13:46         ` Eli Zaretskii
  2019-03-01 14:14           ` Alan Mackenzie
  1 sibling, 1 reply; 22+ messages in thread
From: Eli Zaretskii @ 2019-03-01 13:46 UTC (permalink / raw)
  To: Alan Mackenzie; +Cc: monnier, emacs-devel

> Date: Fri, 1 Mar 2019 11:10:18 +0000
> From: Alan Mackenzie <acm@muc.de>
> Cc: emacs-devel@gnu.org
> 
> SYNTAX_TABLE_BYTE_TO_CHAR ends up calling buf_bytepos_to_charpos (in
> marker.c).  This latter function doesn't handle well the case of `d'
> being in the middle of a multibyte character; sometimes it "rounds it
> down", other times it "rounds it up" to a character position.  I think
> it should be defined as rounding it down.  It would be a relatively
> simple correction (at least, technically ;-).

buf_bytepos_to_charpos is not supposed to be called when the byte
position is in the middle of a multibyte sequence.  We have the
CHAR_HEAD_P, BYTES_BY_CHAR_HEAD, and related macros for that.

> For that matter, how many charpos <-> bytepos functions are there in
> Emacs?

Only one pair of such function exists for buffer text, and another
pair for strings.

> > 		ptrdiff_t offset = PTR_TO_OFFSET (d - 1);
> > 		ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset);
> > 		UPDATE_SYNTAX_TABLE (charpos);
> 
> > which seems even more broken because `d` might point to the first byte
> > after the gap, so `d - 1` will point in the middle of the gap, so it's
> > simply an invalid argument to PTR_TO_OFFSET.
> 
> I don't think this is right.  Both `d' and `offset' are byte
> measurements, not character measurements, so it shouldn't matter whether
> the "- 1" is inside or outside the parens.  However, it would be less
> confusing if they were both (?all) the same.

That's orthogonal.  Stefan is right in that you cannot in general do
pointer arithmetics on pointers into buffer text without considering
the gap.  You need to convert 'd' into a byte position (which is
actually an offset from the beginning of buffer text), then decrement
it, then convert back into a 'char *' pointer to the previous byte.
The macros used for these conversions take care of skipping the gap.

However, since the caller already took care to split the text into two
parts, one before the gap and the other after the gap, it sounds like
we don't need to bother about the gap in this case, unless 'd - 1'
happens to point before the beginning of string2 argument to
re_match_2_internal.

> > According to the definition of PTR_TO_OFFSET and POINTER_TO_OFFSET,
> > the result may be the same as if we did the decrement after the fact,
> > but it still looks fishy.  WDYT?
> 
> I think it is suboptimal to have both PTR_TO_OFFSET and
> POINTER_TO_OFFSET meaning different things in the same source file.  ;-)

Those macros hide the fact that the argument could be a Lisp string or
a buffer, so I don't think I agree with you here.

> > diff --git a/src/regex-emacs.c b/src/regex-emacs.c
> > index b667a43a37..b21cba0e46 100644
> > --- a/src/regex-emacs.c
> > +++ b/src/regex-emacs.c
> > @@ -4811,9 +4811,9 @@ re_match_2_internal (struct re_pattern_buffer *bufp, re_char *string1,
> >  	      int c1, c2;
> >  	      int s1, s2;
> >  	      int dummy;
> > -	      ptrdiff_t offset = PTR_TO_OFFSET (d) - 1;
> > +	      ptrdiff_t offset = PTR_TO_OFFSET (d);
> >  	      ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset);
> > -	      UPDATE_SYNTAX_TABLE (charpos);
> > +	      UPDATE_SYNTAX_TABLE (charpos - 1);
> >  	      GET_CHAR_BEFORE_2 (c1, d, string1, end1, string2, end2);
> >  	      s1 = SYNTAX (c1);
>  
> There are eight occurrences of SYNTAX_TABLE_BYTE_TO_CHAR in
> regex-emacs.c.  I think I will check them all, amending them as in your
> patch.
> 
> What do you say?

I'm not Stefan, but what I say is that we should only make sure 'd'
never points to the very first byte of 'string2'.  If it does, then
decrementing it will produce invalid results.  If we cannot decide
whether that situation could happen, we should add an assertion there
to that effect.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Questionable code in handling of wordend in the regexp engine in regex-emacs.c
  2019-03-01 13:46         ` Eli Zaretskii
@ 2019-03-01 14:14           ` Alan Mackenzie
  2019-03-01 14:43             ` Eli Zaretskii
  0 siblings, 1 reply; 22+ messages in thread
From: Alan Mackenzie @ 2019-03-01 14:14 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: monnier, emacs-devel

Hello, Eli.

On Fri, Mar 01, 2019 at 15:46:14 +0200, Eli Zaretskii wrote:
> > Date: Fri, 1 Mar 2019 11:10:18 +0000
> > From: Alan Mackenzie <acm@muc.de>
> > Cc: emacs-devel@gnu.org

> > SYNTAX_TABLE_BYTE_TO_CHAR ends up calling buf_bytepos_to_charpos (in
> > marker.c).  This latter function doesn't handle well the case of `d'
> > being in the middle of a multibyte character; sometimes it "rounds it
> > down", other times it "rounds it up" to a character position.  I think
> > it should be defined as rounding it down.  It would be a relatively
> > simple correction (at least, technically ;-).

> buf_bytepos_to_charpos is not supposed to be called when the byte
> position is in the middle of a multibyte sequence.  We have the
> CHAR_HEAD_P, BYTES_BY_CHAR_HEAD, and related macros for that.

Thanks, I didn't know that.  Maybe we should put an assert into the code,
like Stefan suggested.

> > For that matter, how many charpos <-> bytepos functions are there in
> > Emacs?

> Only one pair of such function exists for buffer text, and another
> pair for strings.

That's good.

> > > 		ptrdiff_t offset = PTR_TO_OFFSET (d - 1);
> > > 		ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset);
> > > 		UPDATE_SYNTAX_TABLE (charpos);

> > > which seems even more broken because `d` might point to the first byte
> > > after the gap, so `d - 1` will point in the middle of the gap, so it's
> > > simply an invalid argument to PTR_TO_OFFSET.

> > I don't think this is right.  Both `d' and `offset' are byte
> > measurements, not character measurements, so it shouldn't matter whether
> > the "- 1" is inside or outside the parens.  However, it would be less
> > confusing if they were both (?all) the same.

> That's orthogonal.  Stefan is right in that you cannot in general do
> pointer arithmetics on pointers into buffer text without considering
> the gap.  You need to convert 'd' into a byte position (which is
> actually an offset from the beginning of buffer text), then decrement
> it, then convert back into a 'char *' pointer to the previous byte.
> The macros used for these conversions take care of skipping the gap.

> However, since the caller already took care to split the text into two
> parts, one before the gap and the other after the gap, it sounds like
> we don't need to bother about the gap in this case, unless 'd - 1'
> happens to point before the beginning of string2 argument to
> re_match_2_internal.

I've got rid of all the questionable "d - 1"s.  All these code pieces now
first do PTR_TO_OFFSET (d), then do SYNTAX_TABLE_BYTE_TO_CHAR on the
result, and then any arithmetic on the result of that.  (See patch
below).

> > > According to the definition of PTR_TO_OFFSET and POINTER_TO_OFFSET,
> > > the result may be the same as if we did the decrement after the fact,
> > > but it still looks fishy.  WDYT?

> > I think it is suboptimal to have both PTR_TO_OFFSET and
> > POINTER_TO_OFFSET meaning different things in the same source file.  ;-)

> Those macros hide the fact that the argument could be a Lisp string or
> a buffer, so I don't think I agree with you here.

I just meant that having the two names so similar might be confusing.

[ .... ]

> > There are eight occurrences of SYNTAX_TABLE_BYTE_TO_CHAR in
> > regex-emacs.c.  I think I will check them all, amending them as in your
> > patch.

> > What do you say?

> I'm not Stefan, but what I say is that we should only make sure 'd'
> never points to the very first byte of 'string2'.  If it does, then
> decrementing it will produce invalid results.  If we cannot decide
> whether that situation could happen, we should add an assertion there
> to that effect.

I'm fairly sure it's safe, through always first doing PTR_TO_OFFSET (d),
which takes care of the gap.

Here's the patch (already "tested") which gets rid of the unwanted "d -
1"s:



diff --git a/src/regex-emacs.c b/src/regex-emacs.c
index b667a43a37..45b4f8107c 100644
--- a/src/regex-emacs.c
+++ b/src/regex-emacs.c
@@ -4732,8 +4732,8 @@ re_match_2_internal (struct re_pattern_buffer *bufp, re_char *string1,
 		int c1, c2;
 		int s1, s2;
 		int dummy;
-		ptrdiff_t offset = PTR_TO_OFFSET (d - 1);
-		ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset);
+                ptrdiff_t offset = PTR_TO_OFFSET (d);
+                ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset) - 1;
 		UPDATE_SYNTAX_TABLE (charpos);
 		GET_CHAR_BEFORE_2 (c1, d, string1, end1, string2, end2);
 		s1 = SYNTAX (c1);
@@ -4811,8 +4811,8 @@ re_match_2_internal (struct re_pattern_buffer *bufp, re_char *string1,
 	      int c1, c2;
 	      int s1, s2;
 	      int dummy;
-	      ptrdiff_t offset = PTR_TO_OFFSET (d) - 1;
-	      ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset);
+              ptrdiff_t offset = PTR_TO_OFFSET (d);
+              ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset) - 1;
 	      UPDATE_SYNTAX_TABLE (charpos);
 	      GET_CHAR_BEFORE_2 (c1, d, string1, end1, string2, end2);
 	      s1 = SYNTAX (c1);
@@ -4826,7 +4826,7 @@ re_match_2_internal (struct re_pattern_buffer *bufp, re_char *string1,
 		{
 		  PREFETCH_NOLIMIT ();
 		  GET_CHAR_AFTER (c2, d, dummy);
-		  UPDATE_SYNTAX_TABLE_FORWARD (charpos);
+                  UPDATE_SYNTAX_TABLE_FORWARD (charpos + 1);
 		  s2 = SYNTAX (c2);
 
 		  /* ... and S2 is Sword, and WORD_BOUNDARY_P (C1, C2)
@@ -4890,8 +4890,8 @@ re_match_2_internal (struct re_pattern_buffer *bufp, re_char *string1,
 		 is the character at D, and S2 is the syntax of C2.  */
 	      int c1, c2;
 	      int s1, s2;
-	      ptrdiff_t offset = PTR_TO_OFFSET (d) - 1;
-	      ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset);
+              ptrdiff_t offset = PTR_TO_OFFSET (d);
+              ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset) - 1;
 	      UPDATE_SYNTAX_TABLE (charpos);
 	      GET_CHAR_BEFORE_2 (c1, d, string1, end1, string2, end2);
 	      s1 = SYNTAX (c1);


-- 
Alan Mackenzie (Nuremberg, Germany).



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: Questionable code in handling of wordend in the regexp engine in regex-emacs.c
  2019-03-01 14:14           ` Alan Mackenzie
@ 2019-03-01 14:43             ` Eli Zaretskii
  2019-03-01 14:58               ` Alan Mackenzie
  0 siblings, 1 reply; 22+ messages in thread
From: Eli Zaretskii @ 2019-03-01 14:43 UTC (permalink / raw)
  To: Alan Mackenzie; +Cc: monnier, emacs-devel

> Date: Fri, 1 Mar 2019 14:14:48 +0000
> From: Alan Mackenzie <acm@muc.de>
> Cc: monnier@iro.umontreal.ca, emacs-devel@gnu.org
> 
> > buf_bytepos_to_charpos is not supposed to be called when the byte
> > position is in the middle of a multibyte sequence.  We have the
> > CHAR_HEAD_P, BYTES_BY_CHAR_HEAD, and related macros for that.
> 
> Thanks, I didn't know that.  Maybe we should put an assert into the code,
> like Stefan suggested.

We could try.



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Questionable code in handling of wordend in the regexp engine in regex-emacs.c
  2019-03-01 14:43             ` Eli Zaretskii
@ 2019-03-01 14:58               ` Alan Mackenzie
  2019-03-01 16:22                 ` Eli Zaretskii
  0 siblings, 1 reply; 22+ messages in thread
From: Alan Mackenzie @ 2019-03-01 14:58 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: monnier, emacs-devel

Hello, Eli.

On Fri, Mar 01, 2019 at 16:43:38 +0200, Eli Zaretskii wrote:
> > Date: Fri, 1 Mar 2019 14:14:48 +0000
> > From: Alan Mackenzie <acm@muc.de>
> > Cc: monnier@iro.umontreal.ca, emacs-devel@gnu.org

> > > buf_bytepos_to_charpos is not supposed to be called when the byte
> > > position is in the middle of a multibyte sequence.  We have the
> > > CHAR_HEAD_P, BYTES_BY_CHAR_HEAD, and related macros for that.

> > Thanks, I didn't know that.  Maybe we should put an assert into the code,
> > like Stefan suggested.

> We could try.

How about this, as a first approximation?



diff --git a/src/marker.c b/src/marker.c
index 36d6b10c74..9faeca49f4 100644
--- a/src/marker.c
+++ b/src/marker.c
@@ -311,7 +311,8 @@ buf_charpos_to_bytepos (struct buffer *b, ptrdiff_t charpos)
     }									\
 }
 
-/* Return the character position corresponding to BYTEPOS in B.  */
+/* Return the character position corresponding to BYTEPOS in B.
+   BYTEPOS must be at a character boundary.  */
 
 ptrdiff_t
 buf_bytepos_to_charpos (struct buffer *b, ptrdiff_t bytepos)
@@ -370,6 +371,8 @@ buf_bytepos_to_charpos (struct buffer *b, ptrdiff_t bytepos)
 	  best_below++;
 	  BUF_INC_POS (b, best_below_byte);
 	}
+      /* Check BYTEPOS was at a character boundary. */
+      eassert (best_below_byte == bytepos);
 
       /* If this position is quite far from the nearest known position,
 	 cache the correspondence by creating a marker here.
@@ -397,6 +400,8 @@ buf_bytepos_to_charpos (struct buffer *b, ptrdiff_t bytepos)
 	  best_above--;
 	  BUF_DEC_POS (b, best_above_byte);
 	}
+      /* Check BYTEPOS was at a character boundary. */
+      eassert (best_above_byte == bytepos);
 
       /* If this position is quite far from the nearest known position,
 	 cache the correspondence by creating a marker here.


-- 
Alan Mackenzie (Nuremberg, Germany).



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: Questionable code in handling of wordend in the regexp engine in regex-emacs.c
  2019-03-01 14:58               ` Alan Mackenzie
@ 2019-03-01 16:22                 ` Eli Zaretskii
  2019-03-01 16:38                   ` Alan Mackenzie
  0 siblings, 1 reply; 22+ messages in thread
From: Eli Zaretskii @ 2019-03-01 16:22 UTC (permalink / raw)
  To: Alan Mackenzie; +Cc: monnier, emacs-devel

> Date: Fri, 1 Mar 2019 14:58:56 +0000
> Cc: monnier@iro.umontreal.ca, emacs-devel@gnu.org
> From: Alan Mackenzie <acm@muc.de>
> 
> > > Thanks, I didn't know that.  Maybe we should put an assert into the code,
> > > like Stefan suggested.
> 
> > We could try.
> 
> How about this, as a first approximation?
> [...]
> +      /* Check BYTEPOS was at a character boundary. */
> +      eassert (best_below_byte == bytepos);

Actually, what I had in mind was a simple

  eassert (CHAR_HEAD_P (BUF_FETCH_BYTE (b, bytepos)));

right at the beginning of buf_bytepos_to_charpos.  But maybe if you
explain why you wanted a different assertion, I will change my mind.



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Questionable code in handling of wordend in the regexp engine in regex-emacs.c
  2019-03-01 16:22                 ` Eli Zaretskii
@ 2019-03-01 16:38                   ` Alan Mackenzie
  2019-03-01 19:16                     ` Alan Mackenzie
  0 siblings, 1 reply; 22+ messages in thread
From: Alan Mackenzie @ 2019-03-01 16:38 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: monnier, emacs-devel

Hello, Eli.

On Fri, Mar 01, 2019 at 18:22:39 +0200, Eli Zaretskii wrote:
> > Date: Fri, 1 Mar 2019 14:58:56 +0000
> > Cc: monnier@iro.umontreal.ca, emacs-devel@gnu.org
> > From: Alan Mackenzie <acm@muc.de>

> > > > Thanks, I didn't know that.  Maybe we should put an assert into the code,
> > > > like Stefan suggested.

> > > We could try.

> > How about this, as a first approximation?
> > [...]
> > +      /* Check BYTEPOS was at a character boundary. */
> > +      eassert (best_below_byte == bytepos);

> Actually, what I had in mind was a simple

>   eassert (CHAR_HEAD_P (BUF_FETCH_BYTE (b, bytepos)));

> right at the beginning of buf_bytepos_to_charpos.  But maybe if you
> explain why you wanted a different assertion, I will change my mind.

I had forgotten that it was possible to determine that a UTF8 byte was
at the start of a character.

I think your version is simpler and better.  Lets use it!

But first, I'll commit the fix to bug #34525, including the problems
with "d - 1".  That will be one fewer spot where that new assertion will
get triggered.

-- 
Alan Mackenzie (Nuremberg, Germany).



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Questionable code in handling of wordend in the regexp engine in regex-emacs.c
  2019-03-01 16:38                   ` Alan Mackenzie
@ 2019-03-01 19:16                     ` Alan Mackenzie
  2019-03-01 19:31                       ` Eli Zaretskii
  0 siblings, 1 reply; 22+ messages in thread
From: Alan Mackenzie @ 2019-03-01 19:16 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: monnier, emacs-devel

Hello again, Eli.

On Fri, Mar 01, 2019 at 16:38:25 +0000, Alan Mackenzie wrote:
> On Fri, Mar 01, 2019 at 18:22:39 +0200, Eli Zaretskii wrote:
> > > Date: Fri, 1 Mar 2019 14:58:56 +0000
> > > Cc: monnier@iro.umontreal.ca, emacs-devel@gnu.org
> > > From: Alan Mackenzie <acm@muc.de>

[ .... ]

> > Actually, what I had in mind was a simple

> >   eassert (CHAR_HEAD_P (BUF_FETCH_BYTE (b, bytepos)));

> > right at the beginning of buf_bytepos_to_charpos.

[ .... ]

I'm experimenting with:


diff --git a/src/marker.c b/src/marker.c
index b58051a8c2..89b6ca57f4 100644
--- a/src/marker.c
+++ b/src/marker.c
@@ -322,6 +322,9 @@ buf_bytepos_to_charpos (struct buffer *b, ptrdiff_t bytepos)
   ptrdiff_t distance = BYTECHAR_DISTANCE_INITIAL;
 
   eassert (BUF_BEG_BYTE (b) <= bytepos && bytepos <= BUF_Z_BYTE (b));
+  /* Check bytepos is not in the middle of a character. */
+  eassert (bytepos >= BUF_Z_BYTE (b)
+           || CHAR_HEAD_P (BUF_FETCH_BYTE (b, bytepos)));
 
   best_above = BUF_Z (b);
   best_above_byte = BUF_Z_BYTE (b);


After configuring with --enable-checking and building, I tried make
check.  The tests errored out with this bytepos check three times.  In:

    src/coding-tests.log
    lisp/epg-tests.log
    lisp/emacs-lisp/package-tests.log

.  Quite possibly there is just one bug here, but there might be two or
three.  I think it would be best to track it/them down before committing
the change to marker.c.

-- 
Alan Mackenzie (Nuremberg, Germany).



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: Questionable code in handling of wordend in the regexp engine in regex-emacs.c
  2019-03-01 19:16                     ` Alan Mackenzie
@ 2019-03-01 19:31                       ` Eli Zaretskii
  2019-03-02 11:16                         ` Alan Mackenzie
  0 siblings, 1 reply; 22+ messages in thread
From: Eli Zaretskii @ 2019-03-01 19:31 UTC (permalink / raw)
  To: Alan Mackenzie; +Cc: monnier, emacs-devel

> Date: Fri, 1 Mar 2019 19:16:07 +0000
> Cc: monnier@iro.umontreal.ca, emacs-devel@gnu.org
> From: Alan Mackenzie <acm@muc.de>
> 
> +  /* Check bytepos is not in the middle of a character. */
> +  eassert (bytepos >= BUF_Z_BYTE (b)
> +           || CHAR_HEAD_P (BUF_FETCH_BYTE (b, bytepos)));

LGTM.

> After configuring with --enable-checking and building, I tried make
> check.  The tests errored out with this bytepos check three times.  In:
> 
>     src/coding-tests.log
>     lisp/epg-tests.log
>     lisp/emacs-lisp/package-tests.log
> 
> .  Quite possibly there is just one bug here, but there might be two or
> three.  I think it would be best to track it/them down before committing
> the change to marker.c.

I agree.  Let me know if you need help with that.

Thanks.




^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Questionable code in handling of wordend in the regexp engine in regex-emacs.c
  2019-03-01 19:31                       ` Eli Zaretskii
@ 2019-03-02 11:16                         ` Alan Mackenzie
  2019-03-02 12:18                           ` Eli Zaretskii
  2019-03-02 12:21                           ` Eli Zaretskii
  0 siblings, 2 replies; 22+ messages in thread
From: Alan Mackenzie @ 2019-03-02 11:16 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: monnier, emacs-devel

Hello, Eli.

On Fri, Mar 01, 2019 at 21:31:37 +0200, Eli Zaretskii wrote:
> > Date: Fri, 1 Mar 2019 19:16:07 +0000
> > Cc: monnier@iro.umontreal.ca, emacs-devel@gnu.org
> > From: Alan Mackenzie <acm@muc.de>

> > +  /* Check bytepos is not in the middle of a character. */
> > +  eassert (bytepos >= BUF_Z_BYTE (b)
> > +           || CHAR_HEAD_P (BUF_FETCH_BYTE (b, bytepos)));

> LGTM.

> > After configuring with --enable-checking and building, I tried make
> > check.  The tests errored out with this bytepos check three times.  In:

> >     src/coding-tests.log
> >     lisp/epg-tests.log
> >     lisp/emacs-lisp/package-tests.log

> > .  Quite possibly there is just one bug here, but there might be two or
> > three.  I think it would be best to track it/them down before committing
> > the change to marker.c.

> I agree.  Let me know if you need help with that.

On the coding tests, we fail in one of them when bytepos == 1, and the
first byte in the buffer is 0xa4.  The Lisp call stack at the time is

Lisp Backtrace:
"decode-coding-region" (0xc35b5cb0)
"progn" (0xc35b5d68)
"unwind-protect" (0xc35b5ea8)
"save-current-buffer" (0xc35b6018)
"let" (0xc35b6208)
0xd985d350 Lisp type 3
"ert--run-test-internal" (0xc35b69b0)
"ert-run-test" (0xc35b6ed8)
"ert-run-or-rerun-test" (0xc35b7410)
"ert-run-tests" (0xc35b7938)
"ert-run-tests-interactively" (0xc35b7f30)
"funcall-interactively" (0xc35b7f28)
"call-interactively" (0xc35b8310)
"command-execute" (0xc35b8838)
"execute-extended-command" (0xc35b8f20)
"funcall-interactively" (0xc35b8f18)
"call-interactively" (0xc35b9300)
"command-execute" (0xc35b9808)

, and the first few lines of the C backtrace are:

(gdb) backtrace
#0  terminate_due_to_signal (sig=6, backtrace_limit=2147483647) at emacs.c:370
#1  0x000055a4d8603595 in die (msg=0x55a4d87637a8 "bytepos >= BUF_Z_BYTE (b) || CHAR_HEAD_P (BUF_FETCH_BYTE (b, bytepos))", file=0x55a4d8763728 "marker.c", line=327) at alloc.c:7442
#2  0x000055a4d85b42cf in buf_bytepos_to_charpos (b=0x55a4da051710, bytepos=1) at marker.c:326
#3  0x000055a4d85ab93f in move_gap_both (charpos=1, bytepos=1) at insdel.c:92
#4  0x000055a4d84ca5d9 in decode_coding_object (coding=0x7ffcc35b5950, src_object=XIL(0x55a4da051715), from=1, from_byte=1, to=5, to_byte=5, dst_object=XIL(0x55a4da051715)) at coding.c:8072
#5  0x000055a4d84cff9f in code_convert_region (start=make_number(1), end=make_number(5), coding_system=XIL(0x2a1906036aa0), dst_object=XIL(0x55a4da051715), encodep=false, norecord=false) at coding.c:9371
#6  0x000055a4d84d005a in Fdecode_coding_region (start=make_number(1), end=make_number(5), coding_system=XIL(0x2a1906036aa0), destination=XIL(0)) at coding.c:9401
#7  0x000055a4d86370e1 in eval_sub (form=XIL(0x55a4d981ded3)) at eval.c:2325

.  More precisely, the first few bytes of the buffer are:

    a4 a2 0d 0a 00 00 00 00

, and I suspect the buffer is just 4 bytes long (but I'm too lazy to
check it properly).

I have a suspicion that the CHAR_HEAD_P test isn't valid here.  I'm
guessing that we're converting an external coding, something like an
MS-DOS 8-bit coding, to internal UTF-8-like coding, so we can't use
CHAR_HEAD_P.

In fact, is buf_bytepos_to_charpos the Right Thing to use here?

-- 
Alan Mackenzie (Nuremberg, Germany).



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Questionable code in handling of wordend in the regexp engine in regex-emacs.c
  2019-03-02 11:16                         ` Alan Mackenzie
@ 2019-03-02 12:18                           ` Eli Zaretskii
  2019-03-02 13:18                             ` Alan Mackenzie
  2019-03-02 12:21                           ` Eli Zaretskii
  1 sibling, 1 reply; 22+ messages in thread
From: Eli Zaretskii @ 2019-03-02 12:18 UTC (permalink / raw)
  To: Alan Mackenzie; +Cc: monnier, emacs-devel

> Date: Sat, 2 Mar 2019 11:16:40 +0000
> Cc: monnier@iro.umontreal.ca, emacs-devel@gnu.org
> From: Alan Mackenzie <acm@muc.de>
> 
> On the coding tests, we fail in one of them when bytepos == 1, and the
> first byte in the buffer is 0xa4.  The Lisp call stack at the time is
> 
> Lisp Backtrace:
> "decode-coding-region" (0xc35b5cb0)

Ah, yes.  We should do this to prevent such false alarms:

  eassert (NILP (BVAR (b, enable_multibyte_characters))
           || bytepos >= BUF_Z_BYTE (b)
	   || CHAR_HEAD_P (BUF_FETCH_BYTE (b, bytepos)));

IOW, this test is irrelevant in unibyte buffers.

Thanks.



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Questionable code in handling of wordend in the regexp engine in regex-emacs.c
  2019-03-02 12:18                           ` Eli Zaretskii
@ 2019-03-02 13:18                             ` Alan Mackenzie
  2019-03-02 13:37                               ` Eli Zaretskii
  2019-03-04 17:25                               ` Eli Zaretskii
  0 siblings, 2 replies; 22+ messages in thread
From: Alan Mackenzie @ 2019-03-02 13:18 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: monnier, emacs-devel

Hello, Eli.

On Sat, Mar 02, 2019 at 14:18:00 +0200, Eli Zaretskii wrote:
> > Date: Sat, 2 Mar 2019 11:16:40 +0000
> > Cc: monnier@iro.umontreal.ca, emacs-devel@gnu.org
> > From: Alan Mackenzie <acm@muc.de>

> > On the coding tests, we fail in one of them when bytepos == 1, and the
> > first byte in the buffer is 0xa4.  The Lisp call stack at the time is

> > Lisp Backtrace:
> > "decode-coding-region" (0xc35b5cb0)

> Ah, yes.  We should do this to prevent such false alarms:

>   eassert (NILP (BVAR (b, enable_multibyte_characters))
>            || bytepos >= BUF_Z_BYTE (b)
> 	   || CHAR_HEAD_P (BUF_FETCH_BYTE (b, bytepos)));

> IOW, this test is irrelevant in unibyte buffers.

Instead I moved the eassert to after the bit where it checks for unibyte
buffers, giving this:



diff --git a/src/marker.c b/src/marker.c
index b58051a8c2..0b2e1bf5c6 100644
--- a/src/marker.c
+++ b/src/marker.c
@@ -332,6 +332,10 @@ buf_bytepos_to_charpos (struct buffer *b, ptrdiff_t bytepos)
   if (best_above == best_above_byte)
     return bytepos;
 
+  /* Check bytepos is not in the middle of a character. */
+  eassert (bytepos >= BUF_Z_BYTE (b)
+           || CHAR_HEAD_P (BUF_FETCH_BYTE (b, bytepos)));
+
   best_below = BEG;
   best_below_byte = BEG_BYTE;
 

I now no longer see the failed easserts in make check.

So I'll commit this sometime (real life is a bit urgent right now).

> Thanks.

-- 
Alan Mackenzie (Nuremberg, Germany).



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: Questionable code in handling of wordend in the regexp engine in regex-emacs.c
  2019-03-02 13:18                             ` Alan Mackenzie
@ 2019-03-02 13:37                               ` Eli Zaretskii
  2019-03-04 17:25                               ` Eli Zaretskii
  1 sibling, 0 replies; 22+ messages in thread
From: Eli Zaretskii @ 2019-03-02 13:37 UTC (permalink / raw)
  To: Alan Mackenzie; +Cc: monnier, emacs-devel

> Date: Sat, 2 Mar 2019 13:18:01 +0000
> Cc: monnier@iro.umontreal.ca, emacs-devel@gnu.org
> From: Alan Mackenzie <acm@muc.de>
> 
> Instead I moved the eassert to after the bit where it checks for unibyte
> buffers, giving this:
> 
> diff --git a/src/marker.c b/src/marker.c
> index b58051a8c2..0b2e1bf5c6 100644
> --- a/src/marker.c
> +++ b/src/marker.c
> @@ -332,6 +332,10 @@ buf_bytepos_to_charpos (struct buffer *b, ptrdiff_t bytepos)
>    if (best_above == best_above_byte)
>      return bytepos;
>  
> +  /* Check bytepos is not in the middle of a character. */
> +  eassert (bytepos >= BUF_Z_BYTE (b)
> +           || CHAR_HEAD_P (BUF_FETCH_BYTE (b, bytepos)));
> +
>    best_below = BEG;
>    best_below_byte = BEG_BYTE;
>  
> 
> I now no longer see the failed easserts in make check.
> 
> So I'll commit this sometime (real life is a bit urgent right now).

Great, thanks.



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Questionable code in handling of wordend in the regexp engine in regex-emacs.c
  2019-03-02 13:18                             ` Alan Mackenzie
  2019-03-02 13:37                               ` Eli Zaretskii
@ 2019-03-04 17:25                               ` Eli Zaretskii
  2019-03-05 10:51                                 ` Alan Mackenzie
  1 sibling, 1 reply; 22+ messages in thread
From: Eli Zaretskii @ 2019-03-04 17:25 UTC (permalink / raw)
  To: Alan Mackenzie; +Cc: monnier, emacs-devel

> Date: Sat, 2 Mar 2019 13:18:01 +0000
> Cc: monnier@iro.umontreal.ca, emacs-devel@gnu.org
> From: Alan Mackenzie <acm@muc.de>
> 
> >   eassert (NILP (BVAR (b, enable_multibyte_characters))
> >            || bytepos >= BUF_Z_BYTE (b)
> > 	   || CHAR_HEAD_P (BUF_FETCH_BYTE (b, bytepos)));
> 
> > IOW, this test is irrelevant in unibyte buffers.
> 
> Instead I moved the eassert to after the bit where it checks for unibyte
> buffers, giving this:
> 
> 
> 
> diff --git a/src/marker.c b/src/marker.c
> index b58051a8c2..0b2e1bf5c6 100644
> --- a/src/marker.c
> +++ b/src/marker.c
> @@ -332,6 +332,10 @@ buf_bytepos_to_charpos (struct buffer *b, ptrdiff_t bytepos)
>    if (best_above == best_above_byte)
>      return bytepos;
>  
> +  /* Check bytepos is not in the middle of a character. */
> +  eassert (bytepos >= BUF_Z_BYTE (b)
> +           || CHAR_HEAD_P (BUF_FETCH_BYTE (b, bytepos)));
> +
>    best_below = BEG;
>    best_below_byte = BEG_BYTE;
>  
> 
> I now no longer see the failed easserts in make check.
> 
> So I'll commit this sometime (real life is a bit urgent right now).

I was forced to disable this assertion for now: I bootstrapped today a
clean checkout, and several jobs that run during the bootstrap
triggered the assertion.  It turns out there's one legitimate use case
when bytepos _can_ be in the middle of a multibyte sequence: when we
convert a buffer from unibyte to multibyte.  There are comments to
that effect in set_intervals_multibyte_1.

I see 2 possible ways to handle this: (1) remove the assertion for
good, or (2) change buf_bytepos_to_charpos to accept one more
argument, telling it whether to make this check, and then modify all
the callers except those in set_intervals_multibyte_1 to pass 'true'
as that argument.

Thoughts?



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Questionable code in handling of wordend in the regexp engine in regex-emacs.c
  2019-03-04 17:25                               ` Eli Zaretskii
@ 2019-03-05 10:51                                 ` Alan Mackenzie
  2019-03-05 16:26                                   ` Eli Zaretskii
  0 siblings, 1 reply; 22+ messages in thread
From: Alan Mackenzie @ 2019-03-05 10:51 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: monnier, emacs-devel

Hello, Eli.

On Mon, Mar 04, 2019 at 19:25:57 +0200, Eli Zaretskii wrote:
> > Date: Sat, 2 Mar 2019 13:18:01 +0000
> > Cc: monnier@iro.umontreal.ca, emacs-devel@gnu.org
> > From: Alan Mackenzie <acm@muc.de>

> > Instead I moved the eassert to after the bit where it checks for unibyte
> > buffers, giving this:



> > diff --git a/src/marker.c b/src/marker.c
> > index b58051a8c2..0b2e1bf5c6 100644
> > --- a/src/marker.c
> > +++ b/src/marker.c
> > @@ -332,6 +332,10 @@ buf_bytepos_to_charpos (struct buffer *b, ptrdiff_t bytepos)
> >    if (best_above == best_above_byte)
> >      return bytepos;
> > 
> > +  /* Check bytepos is not in the middle of a character. */
> > +  eassert (bytepos >= BUF_Z_BYTE (b)
> > +           || CHAR_HEAD_P (BUF_FETCH_BYTE (b, bytepos)));
> > +
> >    best_below = BEG;
> >    best_below_byte = BEG_BYTE;

[ .... ]

> I was forced to disable this assertion for now: I bootstrapped today a
> clean checkout, and several jobs that run during the bootstrap
> triggered the assertion.  It turns out there's one legitimate use case
> when bytepos _can_ be in the middle of a multibyte sequence: when we
> convert a buffer from unibyte to multibyte.  There are comments to
> that effect in set_intervals_multibyte_1.

> I see 2 possible ways to handle this: (1) remove the assertion for
> good, or (2) change buf_bytepos_to_charpos to accept one more
> argument, telling it whether to make this check, and then modify all
> the callers except those in set_intervals_multibyte_1 to pass 'true'
> as that argument.

> Thoughts?

First of all, sorry I wasn't here yesterday to deal with it.

I don't think I like alternative (2) - it's ugly, and how much do we
really need this eassert anyway?  It's turned out not to be such a good
idea after all.  I would favour alternative (1), just removing the thing
altogether.

-- 
Alan Mackenzie (Nuremberg, Germany).



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Questionable code in handling of wordend in the regexp engine in regex-emacs.c
  2019-03-05 10:51                                 ` Alan Mackenzie
@ 2019-03-05 16:26                                   ` Eli Zaretskii
  0 siblings, 0 replies; 22+ messages in thread
From: Eli Zaretskii @ 2019-03-05 16:26 UTC (permalink / raw)
  To: Alan Mackenzie; +Cc: monnier, emacs-devel

> Date: Tue, 5 Mar 2019 10:51:50 +0000
> Cc: monnier@iro.umontreal.ca, emacs-devel@gnu.org
> From: Alan Mackenzie <acm@muc.de>
> 
> > I see 2 possible ways to handle this: (1) remove the assertion for
> > good, or (2) change buf_bytepos_to_charpos to accept one more
> > argument, telling it whether to make this check, and then modify all
> > the callers except those in set_intervals_multibyte_1 to pass 'true'
> > as that argument.
> 
> > Thoughts?
> 
> First of all, sorry I wasn't here yesterday to deal with it.

No sweat: since this popped up in my own build, it was very easy to
understand the reasons.

> I don't think I like alternative (2) - it's ugly, and how much do we
> really need this eassert anyway?  It's turned out not to be such a good
> idea after all.  I would favour alternative (1), just removing the thing
> altogether.

Agreed; done.

Thanks.



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Questionable code in handling of wordend in the regexp engine in regex-emacs.c
  2019-03-02 11:16                         ` Alan Mackenzie
  2019-03-02 12:18                           ` Eli Zaretskii
@ 2019-03-02 12:21                           ` Eli Zaretskii
  1 sibling, 0 replies; 22+ messages in thread
From: Eli Zaretskii @ 2019-03-02 12:21 UTC (permalink / raw)
  To: Alan Mackenzie; +Cc: monnier, emacs-devel

> Date: Sat, 2 Mar 2019 11:16:40 +0000
> From: Alan Mackenzie <acm@muc.de>
> Cc: monnier@iro.umontreal.ca, emacs-devel@gnu.org

Forgot to answer this part:

> In fact, is buf_bytepos_to_charpos the Right Thing to use here?

Yes, because decode-coding-region must also work in a multibyte
buffer.  The code in buf_bytepos_to_charpos takes care of the unibyte
use case right away, returning a trivial 1:1 mapping:

  /* If this buffer has as many characters as bytes,
     each character must be one byte.
     This takes care of the case where enable-multibyte-characters is nil.  */
  if (best_above == best_above_byte)
    return bytepos;



^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2019-03-05 16:26 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2019-02-22 16:45 Questionable code in handling of wordend in the regexp engine in regex-emacs.c Alan Mackenzie
2019-02-23 23:15 ` Stefan Monnier
2019-02-25 18:56   ` Alan Mackenzie
2019-02-25 19:18     ` Stefan Monnier
2019-03-01 11:10       ` Alan Mackenzie
2019-03-01 13:41         ` Stefan Monnier
2019-03-01 13:46         ` Eli Zaretskii
2019-03-01 14:14           ` Alan Mackenzie
2019-03-01 14:43             ` Eli Zaretskii
2019-03-01 14:58               ` Alan Mackenzie
2019-03-01 16:22                 ` Eli Zaretskii
2019-03-01 16:38                   ` Alan Mackenzie
2019-03-01 19:16                     ` Alan Mackenzie
2019-03-01 19:31                       ` Eli Zaretskii
2019-03-02 11:16                         ` Alan Mackenzie
2019-03-02 12:18                           ` Eli Zaretskii
2019-03-02 13:18                             ` Alan Mackenzie
2019-03-02 13:37                               ` Eli Zaretskii
2019-03-04 17:25                               ` Eli Zaretskii
2019-03-05 10:51                                 ` Alan Mackenzie
2019-03-05 16:26                                   ` Eli Zaretskii
2019-03-02 12:21                           ` Eli Zaretskii

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).