* Questionable code in handling of wordend in the regexp engine in regex-emacs.c @ 2019-02-22 16:45 Alan Mackenzie 2019-02-23 23:15 ` Stefan Monnier 0 siblings, 1 reply; 22+ messages in thread From: Alan Mackenzie @ 2019-02-22 16:45 UTC (permalink / raw) To: emacs-devel Hello, Emacs. In function re_match_2_internal (in regex-emacs.c) in the handling of case wordend: inside the large switch statement (approximately line 4800 in the file), there is some code, the look of which I don't like. Primarily, there is an UPDATE_SYNTAX_TABLE (charpos); before determining the syntax of the previous character, which seems OK. Later on, before determining the syntax of the next character, we have: UPDATE_SYNTAX_TABLE_FORWARD (charpos); . Between these two calls, charpos hasn't been changed. Surely the argument to the second occurrence should be (charpos + 1)? Also, probably less importantly, there is GET_CHAR_AFTER (c2, d, dummy); , whereas at the same place in the handler for case symend: we have instead c2 = RE_STRING_CHAR (d, target_multibyte); . Is the effect of these macros identical, or is one of them up to date, and the other one really needs updating as well, for correct functionality? I came across these whilst investigating bug #34525. Making the indicated changes to regex-emacs.c sadly doesn't help solve the symptoms of that bug. :-( -- Alan Mackenzie (Nuremberg, Germany). ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Questionable code in handling of wordend in the regexp engine in regex-emacs.c 2019-02-22 16:45 Questionable code in handling of wordend in the regexp engine in regex-emacs.c Alan Mackenzie @ 2019-02-23 23:15 ` Stefan Monnier 2019-02-25 18:56 ` Alan Mackenzie 0 siblings, 1 reply; 22+ messages in thread From: Stefan Monnier @ 2019-02-23 23:15 UTC (permalink / raw) To: emacs-devel > Primarily, there is an > > UPDATE_SYNTAX_TABLE (charpos); > > before determining the syntax of the previous character, which seems OK. > Later on, before determining the syntax of the next character, we have: > > UPDATE_SYNTAX_TABLE_FORWARD (charpos); > > . Between these two calls, charpos hasn't been changed. Good spotting. > Surely the argument to the second occurrence should be (charpos + 1)? I believe it's instead the other one that needs to use "charpos - 1" because the UPDATE_SYNTAX_TABLE is called just before reading the char *before* charpos (see patch below). > Also, probably less importantly, there is > > GET_CHAR_AFTER (c2, d, dummy); > > , whereas at the same place in the handler for case symend: we have > instead > > c2 = RE_STRING_CHAR (d, target_multibyte); > > . Is the effect of these macros identical, or is one of them up to > date, and the other one really needs updating as well, for correct > functionality? According to my reading of the code, they're identical in multibyte buffers not in unibyte buffers where RE_STRING_CHAR just returns a value between 0 and 255 (i.e. ASCII or Latin-1 more or less), whereas GET_CHAR_AFTER will return either an ASCII char (0..127) or a raw-byte char (4194176..4194303). I think it's more correct to return a raw-byte char (4194176..4194303), so I'd tend to think that GET_CHAR_AFTER is the better choice, but please don't quote me on this. > I came across these whilst investigating bug #34525. Making the > indicated changes to regex-emacs.c sadly doesn't help solve the symptoms > of that bug. :-( Does the patch below help? Stefan diff --git a/src/regex-emacs.c b/src/regex-emacs.c index b667a43a37..72fb5ec561 100644 --- a/src/regex-emacs.c +++ b/src/regex-emacs.c @@ -4813,7 +4813,7 @@ re_match_2_internal (struct re_pattern_buffer *bufp, re_char *string1, int dummy; ptrdiff_t offset = PTR_TO_OFFSET (d) - 1; ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset); - UPDATE_SYNTAX_TABLE (charpos); + UPDATE_SYNTAX_TABLE (charpos - 1); GET_CHAR_BEFORE_2 (c1, d, string1, end1, string2, end2); s1 = SYNTAX (c1); ^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: Questionable code in handling of wordend in the regexp engine in regex-emacs.c 2019-02-23 23:15 ` Stefan Monnier @ 2019-02-25 18:56 ` Alan Mackenzie 2019-02-25 19:18 ` Stefan Monnier 0 siblings, 1 reply; 22+ messages in thread From: Alan Mackenzie @ 2019-02-25 18:56 UTC (permalink / raw) To: Stefan Monnier; +Cc: emacs-devel Hello, Stefan. Sorry about the delay in replying. On Sat, Feb 23, 2019 at 18:15:55 -0500, Stefan Monnier wrote: > > Primarily, there is an > > UPDATE_SYNTAX_TABLE (charpos); > > before determining the syntax of the previous character, which seems OK. > > Later on, before determining the syntax of the next character, we have: > > UPDATE_SYNTAX_TABLE_FORWARD (charpos); > > . Between these two calls, charpos hasn't been changed. > Good spotting. Thanks! > > Surely the argument to the second occurrence should be (charpos + 1)? > I believe it's instead the other one that needs to use "charpos - 1" > because the UPDATE_SYNTAX_TABLE is called just before reading the char > *before* charpos (see patch below). I don't think this is right. offset is calculated from d, and then decremented, before calculating charpos. > > Also, probably less importantly, there is > > GET_CHAR_AFTER (c2, d, dummy); > > , whereas at the same place in the handler for case symend: we have > > instead > > c2 = RE_STRING_CHAR (d, target_multibyte); > > . Is the effect of these macros identical, or is one of them up to > > date, and the other one really needs updating as well, for correct > > functionality? > According to my reading of the code, they're identical in multibyte > buffers not in unibyte buffers where RE_STRING_CHAR just returns a value > between 0 and 255 (i.e. ASCII or Latin-1 more or less), whereas > GET_CHAR_AFTER will return either an ASCII char (0..127) or a raw-byte > char (4194176..4194303). OK. > I think it's more correct to return a raw-byte char (4194176..4194303), > so I'd tend to think that GET_CHAR_AFTER is the better choice, but > please don't quote me on this. I won't say a word! > > I came across these whilst investigating bug #34525. Making the > > indicated changes to regex-emacs.c sadly doesn't help solve the symptoms > > of that bug. :-( > Does the patch below help? Unfortunately not, not for bug #34525. I did try it out, though. In the mean time, I've advanced somewhat in the debugging. > Stefan > diff --git a/src/regex-emacs.c b/src/regex-emacs.c > index b667a43a37..72fb5ec561 100644 > --- a/src/regex-emacs.c > +++ b/src/regex-emacs.c > @@ -4813,7 +4813,7 @@ re_match_2_internal (struct re_pattern_buffer *bufp, re_char *string1, > int dummy; ptrdiff_t offset = PTR_TO_OFFSET (d) - 1; > ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset); > - UPDATE_SYNTAX_TABLE (charpos); > + UPDATE_SYNTAX_TABLE (charpos - 1); > GET_CHAR_BEFORE_2 (c1, d, string1, end1, string2, end2); > s1 = SYNTAX (c1); -- Alan Mackenzie (Nuremberg, Germany). ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Questionable code in handling of wordend in the regexp engine in regex-emacs.c 2019-02-25 18:56 ` Alan Mackenzie @ 2019-02-25 19:18 ` Stefan Monnier 2019-03-01 11:10 ` Alan Mackenzie 0 siblings, 1 reply; 22+ messages in thread From: Stefan Monnier @ 2019-02-25 19:18 UTC (permalink / raw) To: emacs-devel; +Cc: Alan Mackenzie >> > Surely the argument to the second occurrence should be (charpos + 1)? >> I believe it's instead the other one that needs to use "charpos - 1" >> because the UPDATE_SYNTAX_TABLE is called just before reading the char >> *before* charpos (see patch below). > I don't think this is right. offset is calculated from d, and then > decremented, before calculating charpos. Hmm... I think you're right (and the symend code does like you suggest). This said, I find it odd that the code does: ptrdiff_t offset = PTR_TO_OFFSET (d) - 1; ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset); UPDATE_SYNTAX_TABLE (charpos); Supposedly `d` is a char* pointing to the beginning of a potentially multibyte char, In that case `d - 1` will point "somewhere before the end of the previous multibyte char" but not necessarily at its beginning. Maybe the patch below would be preferable to avoid this situation? Worse, in notwordbound we do: ptrdiff_t offset = PTR_TO_OFFSET (d - 1); ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset); UPDATE_SYNTAX_TABLE (charpos); which seems even more broken because `d` might point to the first byte after the gap, so `d - 1` will point in the middle of the gap, so it's simply an invalid argument to PTR_TO_OFFSET. According to the definition of PTR_TO_OFFSET and POINTER_TO_OFFSET, the result may be the same as if we did the decrement after the fact, but it still looks fishy. WDYT? Stefan diff --git a/src/regex-emacs.c b/src/regex-emacs.c index b667a43a37..b21cba0e46 100644 --- a/src/regex-emacs.c +++ b/src/regex-emacs.c @@ -4811,9 +4811,9 @@ re_match_2_internal (struct re_pattern_buffer *bufp, re_char *string1, int c1, c2; int s1, s2; int dummy; - ptrdiff_t offset = PTR_TO_OFFSET (d) - 1; + ptrdiff_t offset = PTR_TO_OFFSET (d); ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset); - UPDATE_SYNTAX_TABLE (charpos); + UPDATE_SYNTAX_TABLE (charpos - 1); GET_CHAR_BEFORE_2 (c1, d, string1, end1, string2, end2); s1 = SYNTAX (c1); ^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: Questionable code in handling of wordend in the regexp engine in regex-emacs.c 2019-02-25 19:18 ` Stefan Monnier @ 2019-03-01 11:10 ` Alan Mackenzie 2019-03-01 13:41 ` Stefan Monnier 2019-03-01 13:46 ` Eli Zaretskii 0 siblings, 2 replies; 22+ messages in thread From: Alan Mackenzie @ 2019-03-01 11:10 UTC (permalink / raw) To: Stefan Monnier; +Cc: emacs-devel Hello, Stefan. On Mon, Feb 25, 2019 at 14:18:10 -0500, Stefan Monnier wrote: > >> > Surely the argument to the second occurrence should be (charpos + 1)? > >> I believe it's instead the other one that needs to use "charpos - 1" > >> because the UPDATE_SYNTAX_TABLE is called just before reading the char > >> *before* charpos (see patch below). > > I don't think this is right. offset is calculated from d, and then > > decremented, before calculating charpos. > Hmm... I think you're right (and the symend code does like you suggest). > This said, I find it odd that the code does: > ptrdiff_t offset = PTR_TO_OFFSET (d) - 1; > ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset); > UPDATE_SYNTAX_TABLE (charpos); > Supposedly `d` is a char* pointing to the beginning of a potentially > multibyte char, In that case `d - 1` will point "somewhere before the > end of the previous multibyte char" but not necessarily at > its beginning. Maybe the patch below would be preferable to avoid > this situation? SYNTAX_TABLE_BYTE_TO_CHAR ends up calling buf_bytepos_to_charpos (in marker.c). This latter function doesn't handle well the case of `d' being in the middle of a multibyte character; sometimes it "rounds it down", other times it "rounds it up" to a character position. I think it should be defined as rounding it down. It would be a relatively simple correction (at least, technically ;-). I think your patch "always do the arithmetic on a charpos, not a bytepos" is a very good idea indeed. But I'm still a little worried about buf_bytepos_to_charpos. Perhaps it should state that the result is undefined when the bytepos is "invalid". How many other places in Emacs call it with "invalid" byte positions? For that matter, how many charpos <-> bytepos functions are there in Emacs? Just this one? > Worse, in notwordbound we do: > ptrdiff_t offset = PTR_TO_OFFSET (d - 1); > ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset); > UPDATE_SYNTAX_TABLE (charpos); > which seems even more broken because `d` might point to the first byte > after the gap, so `d - 1` will point in the middle of the gap, so it's > simply an invalid argument to PTR_TO_OFFSET. I don't think this is right. Both `d' and `offset' are byte measurements, not character measurements, so it shouldn't matter whether the "- 1" is inside or outside the parens. However, it would be less confusing if they were both (?all) the same. > According to the definition of PTR_TO_OFFSET and POINTER_TO_OFFSET, > the result may be the same as if we did the decrement after the fact, > but it still looks fishy. WDYT? I think it is suboptimal to have both PTR_TO_OFFSET and POINTER_TO_OFFSET meaning different things in the same source file. ;-) > Stefan > diff --git a/src/regex-emacs.c b/src/regex-emacs.c > index b667a43a37..b21cba0e46 100644 > --- a/src/regex-emacs.c > +++ b/src/regex-emacs.c > @@ -4811,9 +4811,9 @@ re_match_2_internal (struct re_pattern_buffer *bufp, re_char *string1, > int c1, c2; > int s1, s2; > int dummy; > - ptrdiff_t offset = PTR_TO_OFFSET (d) - 1; > + ptrdiff_t offset = PTR_TO_OFFSET (d); > ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset); > - UPDATE_SYNTAX_TABLE (charpos); > + UPDATE_SYNTAX_TABLE (charpos - 1); > GET_CHAR_BEFORE_2 (c1, d, string1, end1, string2, end2); > s1 = SYNTAX (c1); There are eight occurrences of SYNTAX_TABLE_BYTE_TO_CHAR in regex-emacs.c. I think I will check them all, amending them as in your patch. What do you say? -- Alan Mackenzie (Nuremberg, Germany). ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Questionable code in handling of wordend in the regexp engine in regex-emacs.c 2019-03-01 11:10 ` Alan Mackenzie @ 2019-03-01 13:41 ` Stefan Monnier 2019-03-01 13:46 ` Eli Zaretskii 1 sibling, 0 replies; 22+ messages in thread From: Stefan Monnier @ 2019-03-01 13:41 UTC (permalink / raw) To: Alan Mackenzie; +Cc: emacs-devel > down", other times it "rounds it up" to a character position. I think > it should be defined as rounding it down. It would be a relatively > simple correction (at least, technically ;-). When moving forward, rounding it up is more natural ;-) > But I'm still a little worried about buf_bytepos_to_charpos. Perhaps it > should state that the result is undefined when the bytepos is "invalid". Yes, I think it's the intention. Even better would be to signal an error (when built with --enable-checking). > For that matter, how many charpos <-> bytepos functions are there in > Emacs? Just this one? I think so, yes. >> Worse, in notwordbound we do: > >> ptrdiff_t offset = PTR_TO_OFFSET (d - 1); >> ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset); >> UPDATE_SYNTAX_TABLE (charpos); > >> which seems even more broken because `d` might point to the first byte >> after the gap, so `d - 1` will point in the middle of the gap, so it's >> simply an invalid argument to PTR_TO_OFFSET. > > I don't think this is right. Both `d' and `offset' are byte > measurements, not character measurements, so it shouldn't matter whether > the "- 1" is inside or outside the parens. However, it would be less > confusing if they were both (?all) the same. The difference between `d` and `offset` is just an offset, indeed, but it can be 2 different offsets depending on whether `d` is before or after the gap, so what happens when `d` is within the gap depends on how the test for "before/after the gap" is implemented. More specifically, when `d` is N bytes before the end of the gap, the code could consider it as being N bytes before the beginning of the second part, or being "gap-size - N" bytes after the end of the first part. >> According to the definition of PTR_TO_OFFSET and POINTER_TO_OFFSET, >> the result may be the same as if we did the decrement after the fact, >> but it still looks fishy. WDYT? > > I think it is suboptimal to have both PTR_TO_OFFSET and > POINTER_TO_OFFSET meaning different things in the same source file. ;-) I'm so glad you're volunteering to clean this up. Thank you, really. > There are eight occurrences of SYNTAX_TABLE_BYTE_TO_CHAR in > regex-emacs.c. I think I will check them all, amending them as in your > patch. > What do you say? Thanks, Stefan ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Questionable code in handling of wordend in the regexp engine in regex-emacs.c 2019-03-01 11:10 ` Alan Mackenzie 2019-03-01 13:41 ` Stefan Monnier @ 2019-03-01 13:46 ` Eli Zaretskii 2019-03-01 14:14 ` Alan Mackenzie 1 sibling, 1 reply; 22+ messages in thread From: Eli Zaretskii @ 2019-03-01 13:46 UTC (permalink / raw) To: Alan Mackenzie; +Cc: monnier, emacs-devel > Date: Fri, 1 Mar 2019 11:10:18 +0000 > From: Alan Mackenzie <acm@muc.de> > Cc: emacs-devel@gnu.org > > SYNTAX_TABLE_BYTE_TO_CHAR ends up calling buf_bytepos_to_charpos (in > marker.c). This latter function doesn't handle well the case of `d' > being in the middle of a multibyte character; sometimes it "rounds it > down", other times it "rounds it up" to a character position. I think > it should be defined as rounding it down. It would be a relatively > simple correction (at least, technically ;-). buf_bytepos_to_charpos is not supposed to be called when the byte position is in the middle of a multibyte sequence. We have the CHAR_HEAD_P, BYTES_BY_CHAR_HEAD, and related macros for that. > For that matter, how many charpos <-> bytepos functions are there in > Emacs? Only one pair of such function exists for buffer text, and another pair for strings. > > ptrdiff_t offset = PTR_TO_OFFSET (d - 1); > > ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset); > > UPDATE_SYNTAX_TABLE (charpos); > > > which seems even more broken because `d` might point to the first byte > > after the gap, so `d - 1` will point in the middle of the gap, so it's > > simply an invalid argument to PTR_TO_OFFSET. > > I don't think this is right. Both `d' and `offset' are byte > measurements, not character measurements, so it shouldn't matter whether > the "- 1" is inside or outside the parens. However, it would be less > confusing if they were both (?all) the same. That's orthogonal. Stefan is right in that you cannot in general do pointer arithmetics on pointers into buffer text without considering the gap. You need to convert 'd' into a byte position (which is actually an offset from the beginning of buffer text), then decrement it, then convert back into a 'char *' pointer to the previous byte. The macros used for these conversions take care of skipping the gap. However, since the caller already took care to split the text into two parts, one before the gap and the other after the gap, it sounds like we don't need to bother about the gap in this case, unless 'd - 1' happens to point before the beginning of string2 argument to re_match_2_internal. > > According to the definition of PTR_TO_OFFSET and POINTER_TO_OFFSET, > > the result may be the same as if we did the decrement after the fact, > > but it still looks fishy. WDYT? > > I think it is suboptimal to have both PTR_TO_OFFSET and > POINTER_TO_OFFSET meaning different things in the same source file. ;-) Those macros hide the fact that the argument could be a Lisp string or a buffer, so I don't think I agree with you here. > > diff --git a/src/regex-emacs.c b/src/regex-emacs.c > > index b667a43a37..b21cba0e46 100644 > > --- a/src/regex-emacs.c > > +++ b/src/regex-emacs.c > > @@ -4811,9 +4811,9 @@ re_match_2_internal (struct re_pattern_buffer *bufp, re_char *string1, > > int c1, c2; > > int s1, s2; > > int dummy; > > - ptrdiff_t offset = PTR_TO_OFFSET (d) - 1; > > + ptrdiff_t offset = PTR_TO_OFFSET (d); > > ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset); > > - UPDATE_SYNTAX_TABLE (charpos); > > + UPDATE_SYNTAX_TABLE (charpos - 1); > > GET_CHAR_BEFORE_2 (c1, d, string1, end1, string2, end2); > > s1 = SYNTAX (c1); > > There are eight occurrences of SYNTAX_TABLE_BYTE_TO_CHAR in > regex-emacs.c. I think I will check them all, amending them as in your > patch. > > What do you say? I'm not Stefan, but what I say is that we should only make sure 'd' never points to the very first byte of 'string2'. If it does, then decrementing it will produce invalid results. If we cannot decide whether that situation could happen, we should add an assertion there to that effect. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Questionable code in handling of wordend in the regexp engine in regex-emacs.c 2019-03-01 13:46 ` Eli Zaretskii @ 2019-03-01 14:14 ` Alan Mackenzie 2019-03-01 14:43 ` Eli Zaretskii 0 siblings, 1 reply; 22+ messages in thread From: Alan Mackenzie @ 2019-03-01 14:14 UTC (permalink / raw) To: Eli Zaretskii; +Cc: monnier, emacs-devel Hello, Eli. On Fri, Mar 01, 2019 at 15:46:14 +0200, Eli Zaretskii wrote: > > Date: Fri, 1 Mar 2019 11:10:18 +0000 > > From: Alan Mackenzie <acm@muc.de> > > Cc: emacs-devel@gnu.org > > SYNTAX_TABLE_BYTE_TO_CHAR ends up calling buf_bytepos_to_charpos (in > > marker.c). This latter function doesn't handle well the case of `d' > > being in the middle of a multibyte character; sometimes it "rounds it > > down", other times it "rounds it up" to a character position. I think > > it should be defined as rounding it down. It would be a relatively > > simple correction (at least, technically ;-). > buf_bytepos_to_charpos is not supposed to be called when the byte > position is in the middle of a multibyte sequence. We have the > CHAR_HEAD_P, BYTES_BY_CHAR_HEAD, and related macros for that. Thanks, I didn't know that. Maybe we should put an assert into the code, like Stefan suggested. > > For that matter, how many charpos <-> bytepos functions are there in > > Emacs? > Only one pair of such function exists for buffer text, and another > pair for strings. That's good. > > > ptrdiff_t offset = PTR_TO_OFFSET (d - 1); > > > ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset); > > > UPDATE_SYNTAX_TABLE (charpos); > > > which seems even more broken because `d` might point to the first byte > > > after the gap, so `d - 1` will point in the middle of the gap, so it's > > > simply an invalid argument to PTR_TO_OFFSET. > > I don't think this is right. Both `d' and `offset' are byte > > measurements, not character measurements, so it shouldn't matter whether > > the "- 1" is inside or outside the parens. However, it would be less > > confusing if they were both (?all) the same. > That's orthogonal. Stefan is right in that you cannot in general do > pointer arithmetics on pointers into buffer text without considering > the gap. You need to convert 'd' into a byte position (which is > actually an offset from the beginning of buffer text), then decrement > it, then convert back into a 'char *' pointer to the previous byte. > The macros used for these conversions take care of skipping the gap. > However, since the caller already took care to split the text into two > parts, one before the gap and the other after the gap, it sounds like > we don't need to bother about the gap in this case, unless 'd - 1' > happens to point before the beginning of string2 argument to > re_match_2_internal. I've got rid of all the questionable "d - 1"s. All these code pieces now first do PTR_TO_OFFSET (d), then do SYNTAX_TABLE_BYTE_TO_CHAR on the result, and then any arithmetic on the result of that. (See patch below). > > > According to the definition of PTR_TO_OFFSET and POINTER_TO_OFFSET, > > > the result may be the same as if we did the decrement after the fact, > > > but it still looks fishy. WDYT? > > I think it is suboptimal to have both PTR_TO_OFFSET and > > POINTER_TO_OFFSET meaning different things in the same source file. ;-) > Those macros hide the fact that the argument could be a Lisp string or > a buffer, so I don't think I agree with you here. I just meant that having the two names so similar might be confusing. [ .... ] > > There are eight occurrences of SYNTAX_TABLE_BYTE_TO_CHAR in > > regex-emacs.c. I think I will check them all, amending them as in your > > patch. > > What do you say? > I'm not Stefan, but what I say is that we should only make sure 'd' > never points to the very first byte of 'string2'. If it does, then > decrementing it will produce invalid results. If we cannot decide > whether that situation could happen, we should add an assertion there > to that effect. I'm fairly sure it's safe, through always first doing PTR_TO_OFFSET (d), which takes care of the gap. Here's the patch (already "tested") which gets rid of the unwanted "d - 1"s: diff --git a/src/regex-emacs.c b/src/regex-emacs.c index b667a43a37..45b4f8107c 100644 --- a/src/regex-emacs.c +++ b/src/regex-emacs.c @@ -4732,8 +4732,8 @@ re_match_2_internal (struct re_pattern_buffer *bufp, re_char *string1, int c1, c2; int s1, s2; int dummy; - ptrdiff_t offset = PTR_TO_OFFSET (d - 1); - ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset); + ptrdiff_t offset = PTR_TO_OFFSET (d); + ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset) - 1; UPDATE_SYNTAX_TABLE (charpos); GET_CHAR_BEFORE_2 (c1, d, string1, end1, string2, end2); s1 = SYNTAX (c1); @@ -4811,8 +4811,8 @@ re_match_2_internal (struct re_pattern_buffer *bufp, re_char *string1, int c1, c2; int s1, s2; int dummy; - ptrdiff_t offset = PTR_TO_OFFSET (d) - 1; - ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset); + ptrdiff_t offset = PTR_TO_OFFSET (d); + ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset) - 1; UPDATE_SYNTAX_TABLE (charpos); GET_CHAR_BEFORE_2 (c1, d, string1, end1, string2, end2); s1 = SYNTAX (c1); @@ -4826,7 +4826,7 @@ re_match_2_internal (struct re_pattern_buffer *bufp, re_char *string1, { PREFETCH_NOLIMIT (); GET_CHAR_AFTER (c2, d, dummy); - UPDATE_SYNTAX_TABLE_FORWARD (charpos); + UPDATE_SYNTAX_TABLE_FORWARD (charpos + 1); s2 = SYNTAX (c2); /* ... and S2 is Sword, and WORD_BOUNDARY_P (C1, C2) @@ -4890,8 +4890,8 @@ re_match_2_internal (struct re_pattern_buffer *bufp, re_char *string1, is the character at D, and S2 is the syntax of C2. */ int c1, c2; int s1, s2; - ptrdiff_t offset = PTR_TO_OFFSET (d) - 1; - ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset); + ptrdiff_t offset = PTR_TO_OFFSET (d); + ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset) - 1; UPDATE_SYNTAX_TABLE (charpos); GET_CHAR_BEFORE_2 (c1, d, string1, end1, string2, end2); s1 = SYNTAX (c1); -- Alan Mackenzie (Nuremberg, Germany). ^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: Questionable code in handling of wordend in the regexp engine in regex-emacs.c 2019-03-01 14:14 ` Alan Mackenzie @ 2019-03-01 14:43 ` Eli Zaretskii 2019-03-01 14:58 ` Alan Mackenzie 0 siblings, 1 reply; 22+ messages in thread From: Eli Zaretskii @ 2019-03-01 14:43 UTC (permalink / raw) To: Alan Mackenzie; +Cc: monnier, emacs-devel > Date: Fri, 1 Mar 2019 14:14:48 +0000 > From: Alan Mackenzie <acm@muc.de> > Cc: monnier@iro.umontreal.ca, emacs-devel@gnu.org > > > buf_bytepos_to_charpos is not supposed to be called when the byte > > position is in the middle of a multibyte sequence. We have the > > CHAR_HEAD_P, BYTES_BY_CHAR_HEAD, and related macros for that. > > Thanks, I didn't know that. Maybe we should put an assert into the code, > like Stefan suggested. We could try. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Questionable code in handling of wordend in the regexp engine in regex-emacs.c 2019-03-01 14:43 ` Eli Zaretskii @ 2019-03-01 14:58 ` Alan Mackenzie 2019-03-01 16:22 ` Eli Zaretskii 0 siblings, 1 reply; 22+ messages in thread From: Alan Mackenzie @ 2019-03-01 14:58 UTC (permalink / raw) To: Eli Zaretskii; +Cc: monnier, emacs-devel Hello, Eli. On Fri, Mar 01, 2019 at 16:43:38 +0200, Eli Zaretskii wrote: > > Date: Fri, 1 Mar 2019 14:14:48 +0000 > > From: Alan Mackenzie <acm@muc.de> > > Cc: monnier@iro.umontreal.ca, emacs-devel@gnu.org > > > buf_bytepos_to_charpos is not supposed to be called when the byte > > > position is in the middle of a multibyte sequence. We have the > > > CHAR_HEAD_P, BYTES_BY_CHAR_HEAD, and related macros for that. > > Thanks, I didn't know that. Maybe we should put an assert into the code, > > like Stefan suggested. > We could try. How about this, as a first approximation? diff --git a/src/marker.c b/src/marker.c index 36d6b10c74..9faeca49f4 100644 --- a/src/marker.c +++ b/src/marker.c @@ -311,7 +311,8 @@ buf_charpos_to_bytepos (struct buffer *b, ptrdiff_t charpos) } \ } -/* Return the character position corresponding to BYTEPOS in B. */ +/* Return the character position corresponding to BYTEPOS in B. + BYTEPOS must be at a character boundary. */ ptrdiff_t buf_bytepos_to_charpos (struct buffer *b, ptrdiff_t bytepos) @@ -370,6 +371,8 @@ buf_bytepos_to_charpos (struct buffer *b, ptrdiff_t bytepos) best_below++; BUF_INC_POS (b, best_below_byte); } + /* Check BYTEPOS was at a character boundary. */ + eassert (best_below_byte == bytepos); /* If this position is quite far from the nearest known position, cache the correspondence by creating a marker here. @@ -397,6 +400,8 @@ buf_bytepos_to_charpos (struct buffer *b, ptrdiff_t bytepos) best_above--; BUF_DEC_POS (b, best_above_byte); } + /* Check BYTEPOS was at a character boundary. */ + eassert (best_above_byte == bytepos); /* If this position is quite far from the nearest known position, cache the correspondence by creating a marker here. -- Alan Mackenzie (Nuremberg, Germany). ^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: Questionable code in handling of wordend in the regexp engine in regex-emacs.c 2019-03-01 14:58 ` Alan Mackenzie @ 2019-03-01 16:22 ` Eli Zaretskii 2019-03-01 16:38 ` Alan Mackenzie 0 siblings, 1 reply; 22+ messages in thread From: Eli Zaretskii @ 2019-03-01 16:22 UTC (permalink / raw) To: Alan Mackenzie; +Cc: monnier, emacs-devel > Date: Fri, 1 Mar 2019 14:58:56 +0000 > Cc: monnier@iro.umontreal.ca, emacs-devel@gnu.org > From: Alan Mackenzie <acm@muc.de> > > > > Thanks, I didn't know that. Maybe we should put an assert into the code, > > > like Stefan suggested. > > > We could try. > > How about this, as a first approximation? > [...] > + /* Check BYTEPOS was at a character boundary. */ > + eassert (best_below_byte == bytepos); Actually, what I had in mind was a simple eassert (CHAR_HEAD_P (BUF_FETCH_BYTE (b, bytepos))); right at the beginning of buf_bytepos_to_charpos. But maybe if you explain why you wanted a different assertion, I will change my mind. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Questionable code in handling of wordend in the regexp engine in regex-emacs.c 2019-03-01 16:22 ` Eli Zaretskii @ 2019-03-01 16:38 ` Alan Mackenzie 2019-03-01 19:16 ` Alan Mackenzie 0 siblings, 1 reply; 22+ messages in thread From: Alan Mackenzie @ 2019-03-01 16:38 UTC (permalink / raw) To: Eli Zaretskii; +Cc: monnier, emacs-devel Hello, Eli. On Fri, Mar 01, 2019 at 18:22:39 +0200, Eli Zaretskii wrote: > > Date: Fri, 1 Mar 2019 14:58:56 +0000 > > Cc: monnier@iro.umontreal.ca, emacs-devel@gnu.org > > From: Alan Mackenzie <acm@muc.de> > > > > Thanks, I didn't know that. Maybe we should put an assert into the code, > > > > like Stefan suggested. > > > We could try. > > How about this, as a first approximation? > > [...] > > + /* Check BYTEPOS was at a character boundary. */ > > + eassert (best_below_byte == bytepos); > Actually, what I had in mind was a simple > eassert (CHAR_HEAD_P (BUF_FETCH_BYTE (b, bytepos))); > right at the beginning of buf_bytepos_to_charpos. But maybe if you > explain why you wanted a different assertion, I will change my mind. I had forgotten that it was possible to determine that a UTF8 byte was at the start of a character. I think your version is simpler and better. Lets use it! But first, I'll commit the fix to bug #34525, including the problems with "d - 1". That will be one fewer spot where that new assertion will get triggered. -- Alan Mackenzie (Nuremberg, Germany). ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Questionable code in handling of wordend in the regexp engine in regex-emacs.c 2019-03-01 16:38 ` Alan Mackenzie @ 2019-03-01 19:16 ` Alan Mackenzie 2019-03-01 19:31 ` Eli Zaretskii 0 siblings, 1 reply; 22+ messages in thread From: Alan Mackenzie @ 2019-03-01 19:16 UTC (permalink / raw) To: Eli Zaretskii; +Cc: monnier, emacs-devel Hello again, Eli. On Fri, Mar 01, 2019 at 16:38:25 +0000, Alan Mackenzie wrote: > On Fri, Mar 01, 2019 at 18:22:39 +0200, Eli Zaretskii wrote: > > > Date: Fri, 1 Mar 2019 14:58:56 +0000 > > > Cc: monnier@iro.umontreal.ca, emacs-devel@gnu.org > > > From: Alan Mackenzie <acm@muc.de> [ .... ] > > Actually, what I had in mind was a simple > > eassert (CHAR_HEAD_P (BUF_FETCH_BYTE (b, bytepos))); > > right at the beginning of buf_bytepos_to_charpos. [ .... ] I'm experimenting with: diff --git a/src/marker.c b/src/marker.c index b58051a8c2..89b6ca57f4 100644 --- a/src/marker.c +++ b/src/marker.c @@ -322,6 +322,9 @@ buf_bytepos_to_charpos (struct buffer *b, ptrdiff_t bytepos) ptrdiff_t distance = BYTECHAR_DISTANCE_INITIAL; eassert (BUF_BEG_BYTE (b) <= bytepos && bytepos <= BUF_Z_BYTE (b)); + /* Check bytepos is not in the middle of a character. */ + eassert (bytepos >= BUF_Z_BYTE (b) + || CHAR_HEAD_P (BUF_FETCH_BYTE (b, bytepos))); best_above = BUF_Z (b); best_above_byte = BUF_Z_BYTE (b); After configuring with --enable-checking and building, I tried make check. The tests errored out with this bytepos check three times. In: src/coding-tests.log lisp/epg-tests.log lisp/emacs-lisp/package-tests.log . Quite possibly there is just one bug here, but there might be two or three. I think it would be best to track it/them down before committing the change to marker.c. -- Alan Mackenzie (Nuremberg, Germany). ^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: Questionable code in handling of wordend in the regexp engine in regex-emacs.c 2019-03-01 19:16 ` Alan Mackenzie @ 2019-03-01 19:31 ` Eli Zaretskii 2019-03-02 11:16 ` Alan Mackenzie 0 siblings, 1 reply; 22+ messages in thread From: Eli Zaretskii @ 2019-03-01 19:31 UTC (permalink / raw) To: Alan Mackenzie; +Cc: monnier, emacs-devel > Date: Fri, 1 Mar 2019 19:16:07 +0000 > Cc: monnier@iro.umontreal.ca, emacs-devel@gnu.org > From: Alan Mackenzie <acm@muc.de> > > + /* Check bytepos is not in the middle of a character. */ > + eassert (bytepos >= BUF_Z_BYTE (b) > + || CHAR_HEAD_P (BUF_FETCH_BYTE (b, bytepos))); LGTM. > After configuring with --enable-checking and building, I tried make > check. The tests errored out with this bytepos check three times. In: > > src/coding-tests.log > lisp/epg-tests.log > lisp/emacs-lisp/package-tests.log > > . Quite possibly there is just one bug here, but there might be two or > three. I think it would be best to track it/them down before committing > the change to marker.c. I agree. Let me know if you need help with that. Thanks. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Questionable code in handling of wordend in the regexp engine in regex-emacs.c 2019-03-01 19:31 ` Eli Zaretskii @ 2019-03-02 11:16 ` Alan Mackenzie 2019-03-02 12:18 ` Eli Zaretskii 2019-03-02 12:21 ` Eli Zaretskii 0 siblings, 2 replies; 22+ messages in thread From: Alan Mackenzie @ 2019-03-02 11:16 UTC (permalink / raw) To: Eli Zaretskii; +Cc: monnier, emacs-devel Hello, Eli. On Fri, Mar 01, 2019 at 21:31:37 +0200, Eli Zaretskii wrote: > > Date: Fri, 1 Mar 2019 19:16:07 +0000 > > Cc: monnier@iro.umontreal.ca, emacs-devel@gnu.org > > From: Alan Mackenzie <acm@muc.de> > > + /* Check bytepos is not in the middle of a character. */ > > + eassert (bytepos >= BUF_Z_BYTE (b) > > + || CHAR_HEAD_P (BUF_FETCH_BYTE (b, bytepos))); > LGTM. > > After configuring with --enable-checking and building, I tried make > > check. The tests errored out with this bytepos check three times. In: > > src/coding-tests.log > > lisp/epg-tests.log > > lisp/emacs-lisp/package-tests.log > > . Quite possibly there is just one bug here, but there might be two or > > three. I think it would be best to track it/them down before committing > > the change to marker.c. > I agree. Let me know if you need help with that. On the coding tests, we fail in one of them when bytepos == 1, and the first byte in the buffer is 0xa4. The Lisp call stack at the time is Lisp Backtrace: "decode-coding-region" (0xc35b5cb0) "progn" (0xc35b5d68) "unwind-protect" (0xc35b5ea8) "save-current-buffer" (0xc35b6018) "let" (0xc35b6208) 0xd985d350 Lisp type 3 "ert--run-test-internal" (0xc35b69b0) "ert-run-test" (0xc35b6ed8) "ert-run-or-rerun-test" (0xc35b7410) "ert-run-tests" (0xc35b7938) "ert-run-tests-interactively" (0xc35b7f30) "funcall-interactively" (0xc35b7f28) "call-interactively" (0xc35b8310) "command-execute" (0xc35b8838) "execute-extended-command" (0xc35b8f20) "funcall-interactively" (0xc35b8f18) "call-interactively" (0xc35b9300) "command-execute" (0xc35b9808) , and the first few lines of the C backtrace are: (gdb) backtrace #0 terminate_due_to_signal (sig=6, backtrace_limit=2147483647) at emacs.c:370 #1 0x000055a4d8603595 in die (msg=0x55a4d87637a8 "bytepos >= BUF_Z_BYTE (b) || CHAR_HEAD_P (BUF_FETCH_BYTE (b, bytepos))", file=0x55a4d8763728 "marker.c", line=327) at alloc.c:7442 #2 0x000055a4d85b42cf in buf_bytepos_to_charpos (b=0x55a4da051710, bytepos=1) at marker.c:326 #3 0x000055a4d85ab93f in move_gap_both (charpos=1, bytepos=1) at insdel.c:92 #4 0x000055a4d84ca5d9 in decode_coding_object (coding=0x7ffcc35b5950, src_object=XIL(0x55a4da051715), from=1, from_byte=1, to=5, to_byte=5, dst_object=XIL(0x55a4da051715)) at coding.c:8072 #5 0x000055a4d84cff9f in code_convert_region (start=make_number(1), end=make_number(5), coding_system=XIL(0x2a1906036aa0), dst_object=XIL(0x55a4da051715), encodep=false, norecord=false) at coding.c:9371 #6 0x000055a4d84d005a in Fdecode_coding_region (start=make_number(1), end=make_number(5), coding_system=XIL(0x2a1906036aa0), destination=XIL(0)) at coding.c:9401 #7 0x000055a4d86370e1 in eval_sub (form=XIL(0x55a4d981ded3)) at eval.c:2325 . More precisely, the first few bytes of the buffer are: a4 a2 0d 0a 00 00 00 00 , and I suspect the buffer is just 4 bytes long (but I'm too lazy to check it properly). I have a suspicion that the CHAR_HEAD_P test isn't valid here. I'm guessing that we're converting an external coding, something like an MS-DOS 8-bit coding, to internal UTF-8-like coding, so we can't use CHAR_HEAD_P. In fact, is buf_bytepos_to_charpos the Right Thing to use here? -- Alan Mackenzie (Nuremberg, Germany). ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Questionable code in handling of wordend in the regexp engine in regex-emacs.c 2019-03-02 11:16 ` Alan Mackenzie @ 2019-03-02 12:18 ` Eli Zaretskii 2019-03-02 13:18 ` Alan Mackenzie 2019-03-02 12:21 ` Eli Zaretskii 1 sibling, 1 reply; 22+ messages in thread From: Eli Zaretskii @ 2019-03-02 12:18 UTC (permalink / raw) To: Alan Mackenzie; +Cc: monnier, emacs-devel > Date: Sat, 2 Mar 2019 11:16:40 +0000 > Cc: monnier@iro.umontreal.ca, emacs-devel@gnu.org > From: Alan Mackenzie <acm@muc.de> > > On the coding tests, we fail in one of them when bytepos == 1, and the > first byte in the buffer is 0xa4. The Lisp call stack at the time is > > Lisp Backtrace: > "decode-coding-region" (0xc35b5cb0) Ah, yes. We should do this to prevent such false alarms: eassert (NILP (BVAR (b, enable_multibyte_characters)) || bytepos >= BUF_Z_BYTE (b) || CHAR_HEAD_P (BUF_FETCH_BYTE (b, bytepos))); IOW, this test is irrelevant in unibyte buffers. Thanks. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Questionable code in handling of wordend in the regexp engine in regex-emacs.c 2019-03-02 12:18 ` Eli Zaretskii @ 2019-03-02 13:18 ` Alan Mackenzie 2019-03-02 13:37 ` Eli Zaretskii 2019-03-04 17:25 ` Eli Zaretskii 0 siblings, 2 replies; 22+ messages in thread From: Alan Mackenzie @ 2019-03-02 13:18 UTC (permalink / raw) To: Eli Zaretskii; +Cc: monnier, emacs-devel Hello, Eli. On Sat, Mar 02, 2019 at 14:18:00 +0200, Eli Zaretskii wrote: > > Date: Sat, 2 Mar 2019 11:16:40 +0000 > > Cc: monnier@iro.umontreal.ca, emacs-devel@gnu.org > > From: Alan Mackenzie <acm@muc.de> > > On the coding tests, we fail in one of them when bytepos == 1, and the > > first byte in the buffer is 0xa4. The Lisp call stack at the time is > > Lisp Backtrace: > > "decode-coding-region" (0xc35b5cb0) > Ah, yes. We should do this to prevent such false alarms: > eassert (NILP (BVAR (b, enable_multibyte_characters)) > || bytepos >= BUF_Z_BYTE (b) > || CHAR_HEAD_P (BUF_FETCH_BYTE (b, bytepos))); > IOW, this test is irrelevant in unibyte buffers. Instead I moved the eassert to after the bit where it checks for unibyte buffers, giving this: diff --git a/src/marker.c b/src/marker.c index b58051a8c2..0b2e1bf5c6 100644 --- a/src/marker.c +++ b/src/marker.c @@ -332,6 +332,10 @@ buf_bytepos_to_charpos (struct buffer *b, ptrdiff_t bytepos) if (best_above == best_above_byte) return bytepos; + /* Check bytepos is not in the middle of a character. */ + eassert (bytepos >= BUF_Z_BYTE (b) + || CHAR_HEAD_P (BUF_FETCH_BYTE (b, bytepos))); + best_below = BEG; best_below_byte = BEG_BYTE; I now no longer see the failed easserts in make check. So I'll commit this sometime (real life is a bit urgent right now). > Thanks. -- Alan Mackenzie (Nuremberg, Germany). ^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: Questionable code in handling of wordend in the regexp engine in regex-emacs.c 2019-03-02 13:18 ` Alan Mackenzie @ 2019-03-02 13:37 ` Eli Zaretskii 2019-03-04 17:25 ` Eli Zaretskii 1 sibling, 0 replies; 22+ messages in thread From: Eli Zaretskii @ 2019-03-02 13:37 UTC (permalink / raw) To: Alan Mackenzie; +Cc: monnier, emacs-devel > Date: Sat, 2 Mar 2019 13:18:01 +0000 > Cc: monnier@iro.umontreal.ca, emacs-devel@gnu.org > From: Alan Mackenzie <acm@muc.de> > > Instead I moved the eassert to after the bit where it checks for unibyte > buffers, giving this: > > diff --git a/src/marker.c b/src/marker.c > index b58051a8c2..0b2e1bf5c6 100644 > --- a/src/marker.c > +++ b/src/marker.c > @@ -332,6 +332,10 @@ buf_bytepos_to_charpos (struct buffer *b, ptrdiff_t bytepos) > if (best_above == best_above_byte) > return bytepos; > > + /* Check bytepos is not in the middle of a character. */ > + eassert (bytepos >= BUF_Z_BYTE (b) > + || CHAR_HEAD_P (BUF_FETCH_BYTE (b, bytepos))); > + > best_below = BEG; > best_below_byte = BEG_BYTE; > > > I now no longer see the failed easserts in make check. > > So I'll commit this sometime (real life is a bit urgent right now). Great, thanks. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Questionable code in handling of wordend in the regexp engine in regex-emacs.c 2019-03-02 13:18 ` Alan Mackenzie 2019-03-02 13:37 ` Eli Zaretskii @ 2019-03-04 17:25 ` Eli Zaretskii 2019-03-05 10:51 ` Alan Mackenzie 1 sibling, 1 reply; 22+ messages in thread From: Eli Zaretskii @ 2019-03-04 17:25 UTC (permalink / raw) To: Alan Mackenzie; +Cc: monnier, emacs-devel > Date: Sat, 2 Mar 2019 13:18:01 +0000 > Cc: monnier@iro.umontreal.ca, emacs-devel@gnu.org > From: Alan Mackenzie <acm@muc.de> > > > eassert (NILP (BVAR (b, enable_multibyte_characters)) > > || bytepos >= BUF_Z_BYTE (b) > > || CHAR_HEAD_P (BUF_FETCH_BYTE (b, bytepos))); > > > IOW, this test is irrelevant in unibyte buffers. > > Instead I moved the eassert to after the bit where it checks for unibyte > buffers, giving this: > > > > diff --git a/src/marker.c b/src/marker.c > index b58051a8c2..0b2e1bf5c6 100644 > --- a/src/marker.c > +++ b/src/marker.c > @@ -332,6 +332,10 @@ buf_bytepos_to_charpos (struct buffer *b, ptrdiff_t bytepos) > if (best_above == best_above_byte) > return bytepos; > > + /* Check bytepos is not in the middle of a character. */ > + eassert (bytepos >= BUF_Z_BYTE (b) > + || CHAR_HEAD_P (BUF_FETCH_BYTE (b, bytepos))); > + > best_below = BEG; > best_below_byte = BEG_BYTE; > > > I now no longer see the failed easserts in make check. > > So I'll commit this sometime (real life is a bit urgent right now). I was forced to disable this assertion for now: I bootstrapped today a clean checkout, and several jobs that run during the bootstrap triggered the assertion. It turns out there's one legitimate use case when bytepos _can_ be in the middle of a multibyte sequence: when we convert a buffer from unibyte to multibyte. There are comments to that effect in set_intervals_multibyte_1. I see 2 possible ways to handle this: (1) remove the assertion for good, or (2) change buf_bytepos_to_charpos to accept one more argument, telling it whether to make this check, and then modify all the callers except those in set_intervals_multibyte_1 to pass 'true' as that argument. Thoughts? ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Questionable code in handling of wordend in the regexp engine in regex-emacs.c 2019-03-04 17:25 ` Eli Zaretskii @ 2019-03-05 10:51 ` Alan Mackenzie 2019-03-05 16:26 ` Eli Zaretskii 0 siblings, 1 reply; 22+ messages in thread From: Alan Mackenzie @ 2019-03-05 10:51 UTC (permalink / raw) To: Eli Zaretskii; +Cc: monnier, emacs-devel Hello, Eli. On Mon, Mar 04, 2019 at 19:25:57 +0200, Eli Zaretskii wrote: > > Date: Sat, 2 Mar 2019 13:18:01 +0000 > > Cc: monnier@iro.umontreal.ca, emacs-devel@gnu.org > > From: Alan Mackenzie <acm@muc.de> > > Instead I moved the eassert to after the bit where it checks for unibyte > > buffers, giving this: > > diff --git a/src/marker.c b/src/marker.c > > index b58051a8c2..0b2e1bf5c6 100644 > > --- a/src/marker.c > > +++ b/src/marker.c > > @@ -332,6 +332,10 @@ buf_bytepos_to_charpos (struct buffer *b, ptrdiff_t bytepos) > > if (best_above == best_above_byte) > > return bytepos; > > > > + /* Check bytepos is not in the middle of a character. */ > > + eassert (bytepos >= BUF_Z_BYTE (b) > > + || CHAR_HEAD_P (BUF_FETCH_BYTE (b, bytepos))); > > + > > best_below = BEG; > > best_below_byte = BEG_BYTE; [ .... ] > I was forced to disable this assertion for now: I bootstrapped today a > clean checkout, and several jobs that run during the bootstrap > triggered the assertion. It turns out there's one legitimate use case > when bytepos _can_ be in the middle of a multibyte sequence: when we > convert a buffer from unibyte to multibyte. There are comments to > that effect in set_intervals_multibyte_1. > I see 2 possible ways to handle this: (1) remove the assertion for > good, or (2) change buf_bytepos_to_charpos to accept one more > argument, telling it whether to make this check, and then modify all > the callers except those in set_intervals_multibyte_1 to pass 'true' > as that argument. > Thoughts? First of all, sorry I wasn't here yesterday to deal with it. I don't think I like alternative (2) - it's ugly, and how much do we really need this eassert anyway? It's turned out not to be such a good idea after all. I would favour alternative (1), just removing the thing altogether. -- Alan Mackenzie (Nuremberg, Germany). ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Questionable code in handling of wordend in the regexp engine in regex-emacs.c 2019-03-05 10:51 ` Alan Mackenzie @ 2019-03-05 16:26 ` Eli Zaretskii 0 siblings, 0 replies; 22+ messages in thread From: Eli Zaretskii @ 2019-03-05 16:26 UTC (permalink / raw) To: Alan Mackenzie; +Cc: monnier, emacs-devel > Date: Tue, 5 Mar 2019 10:51:50 +0000 > Cc: monnier@iro.umontreal.ca, emacs-devel@gnu.org > From: Alan Mackenzie <acm@muc.de> > > > I see 2 possible ways to handle this: (1) remove the assertion for > > good, or (2) change buf_bytepos_to_charpos to accept one more > > argument, telling it whether to make this check, and then modify all > > the callers except those in set_intervals_multibyte_1 to pass 'true' > > as that argument. > > > Thoughts? > > First of all, sorry I wasn't here yesterday to deal with it. No sweat: since this popped up in my own build, it was very easy to understand the reasons. > I don't think I like alternative (2) - it's ugly, and how much do we > really need this eassert anyway? It's turned out not to be such a good > idea after all. I would favour alternative (1), just removing the thing > altogether. Agreed; done. Thanks. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Questionable code in handling of wordend in the regexp engine in regex-emacs.c 2019-03-02 11:16 ` Alan Mackenzie 2019-03-02 12:18 ` Eli Zaretskii @ 2019-03-02 12:21 ` Eli Zaretskii 1 sibling, 0 replies; 22+ messages in thread From: Eli Zaretskii @ 2019-03-02 12:21 UTC (permalink / raw) To: Alan Mackenzie; +Cc: monnier, emacs-devel > Date: Sat, 2 Mar 2019 11:16:40 +0000 > From: Alan Mackenzie <acm@muc.de> > Cc: monnier@iro.umontreal.ca, emacs-devel@gnu.org Forgot to answer this part: > In fact, is buf_bytepos_to_charpos the Right Thing to use here? Yes, because decode-coding-region must also work in a multibyte buffer. The code in buf_bytepos_to_charpos takes care of the unibyte use case right away, returning a trivial 1:1 mapping: /* If this buffer has as many characters as bytes, each character must be one byte. This takes care of the case where enable-multibyte-characters is nil. */ if (best_above == best_above_byte) return bytepos; ^ permalink raw reply [flat|nested] 22+ messages in thread
end of thread, other threads:[~2019-03-05 16:26 UTC | newest] Thread overview: 22+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2019-02-22 16:45 Questionable code in handling of wordend in the regexp engine in regex-emacs.c Alan Mackenzie 2019-02-23 23:15 ` Stefan Monnier 2019-02-25 18:56 ` Alan Mackenzie 2019-02-25 19:18 ` Stefan Monnier 2019-03-01 11:10 ` Alan Mackenzie 2019-03-01 13:41 ` Stefan Monnier 2019-03-01 13:46 ` Eli Zaretskii 2019-03-01 14:14 ` Alan Mackenzie 2019-03-01 14:43 ` Eli Zaretskii 2019-03-01 14:58 ` Alan Mackenzie 2019-03-01 16:22 ` Eli Zaretskii 2019-03-01 16:38 ` Alan Mackenzie 2019-03-01 19:16 ` Alan Mackenzie 2019-03-01 19:31 ` Eli Zaretskii 2019-03-02 11:16 ` Alan Mackenzie 2019-03-02 12:18 ` Eli Zaretskii 2019-03-02 13:18 ` Alan Mackenzie 2019-03-02 13:37 ` Eli Zaretskii 2019-03-04 17:25 ` Eli Zaretskii 2019-03-05 10:51 ` Alan Mackenzie 2019-03-05 16:26 ` Eli Zaretskii 2019-03-02 12:21 ` Eli Zaretskii
Code repositories for project(s) associated with this public inbox https://git.savannah.gnu.org/cgit/emacs.git This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).