From: Alan Mackenzie <acm@muc.de>
To: Eli Zaretskii <eliz@gnu.org>
Cc: monnier@iro.umontreal.ca, emacs-devel@gnu.org
Subject: Re: Questionable code in handling of wordend in the regexp engine in regex-emacs.c
Date: Fri, 1 Mar 2019 14:14:48 +0000
Message-ID: <20190301141448.GC5674@ACM>
In-Reply-To: <83bm2uiu6x.fsf@gnu.org>
Hello, Eli.
On Fri, Mar 01, 2019 at 15:46:14 +0200, Eli Zaretskii wrote:
> > Date: Fri, 1 Mar 2019 11:10:18 +0000
> > From: Alan Mackenzie <acm@muc.de>
> > Cc: emacs-devel@gnu.org
> > SYNTAX_TABLE_BYTE_TO_CHAR ends up calling buf_bytepos_to_charpos (in
> > marker.c). This latter function doesn't handle well the case of `d'
> > being in the middle of a multibyte character; sometimes it "rounds it
> > down", other times it "rounds it up" to a character position. I think
> > it should be defined as rounding it down. It would be a relatively
> > simple correction (at least, technically ;-).
> buf_bytepos_to_charpos is not supposed to be called when the byte
> position is in the middle of a multibyte sequence. We have the
> CHAR_HEAD_P, BYTES_BY_CHAR_HEAD, and related macros for that.
Thanks, I didn't know that. Maybe we should put an assert into the code,
like Stefan suggested.
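
As an illustration of what such an assertion could look like, here is a minimal sketch in plain C. It is not the Emacs code: `is_char_head`, `checked_byte_to_char`, and `demo_text` are hypothetical names, the text is assumed to be UTF-8-style (continuation bytes of the form 10xxxxxx), and character positions are 0-based for simplicity.

```c
#include <assert.h>
#include <stddef.h>

/* True if BYTE starts a character, i.e. it is not a 10xxxxxx
   continuation byte.  (Assumption: UTF-8-style encoding.)  */
static int is_char_head(unsigned char byte)
{
  return (byte & 0xC0) != 0x80;
}

/* Hypothetical checked conversion: return the 0-based character index
   of the character that starts at BYTE_OFF.  The assertions reject a
   byte offset that lies outside the text or in the middle of a
   multibyte sequence, instead of silently rounding it.  */
static ptrdiff_t checked_byte_to_char(const unsigned char *text,
                                      ptrdiff_t len, ptrdiff_t byte_off)
{
  assert(byte_off >= 0 && byte_off <= len);
  assert(byte_off == len || is_char_head(text[byte_off]));

  ptrdiff_t chars = 0;
  for (ptrdiff_t i = 0; i < byte_off; i++)
    if (is_char_head(text[i]))
      chars++;
  return chars;
}

/* Demo text: 'a', then U+00E9 encoded as C3 A9, then 'b'.  */
static const unsigned char demo_text[] = { 'a', 0xC3, 0xA9, 'b' };
```

With this sketch, `checked_byte_to_char (demo_text, 4, 3)` is accepted because offset 3 is the start of 'b', whereas offset 2 (the second byte of the two-byte character) would trip the assertion.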
> > For that matter, how many charpos <-> bytepos functions are there in
> > Emacs?
> Only one pair of such functions exists for buffer text, and another
> pair for strings.
That's good.
> > > ptrdiff_t offset = PTR_TO_OFFSET (d - 1);
> > > ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset);
> > > UPDATE_SYNTAX_TABLE (charpos);
> > > which seems even more broken because `d` might point to the first byte
> > > after the gap, so `d - 1` will point in the middle of the gap, so it's
> > > simply an invalid argument to PTR_TO_OFFSET.
> > I don't think this is right. Both `d' and `offset' are byte
> > measurements, not character measurements, so it shouldn't matter whether
> > the "- 1" is inside or outside the parens. However, it would be less
> > confusing if they were both (?all) the same.
> That's orthogonal. Stefan is right in that you cannot in general do
> pointer arithmetic on pointers into buffer text without considering
> the gap. You need to convert 'd' into a byte position (which is
> actually an offset from the beginning of buffer text), then decrement
> it, then convert back into a 'char *' pointer to the previous byte.
> The macros used for these conversions take care of skipping the gap.
> However, since the caller already took care to split the text into two
> parts, one before the gap and the other after the gap, it sounds like
> we don't need to bother about the gap in this case, unless 'd - 1'
> happens to point before the beginning of string2 argument to
> re_match_2_internal.
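
The convert, decrement, convert-back sequence described above can be sketched with a toy gap buffer. This is only a simplified model under stated assumptions, not the Emacs implementation: `struct gap_buffer`, `ptr_to_offset`, `offset_to_ptr`, `prev_byte`, `in_gap`, and the demo data are all hypothetical, and the split into string1/string2 that re_match_2_internal receives is not modelled.

```c
#include <stddef.h>

/* Hypothetical gap buffer: text bytes, then a gap, then more text.  */
struct gap_buffer {
  const unsigned char *base;  /* start of the storage */
  ptrdiff_t gap_start;        /* byte offset at which the gap begins */
  ptrdiff_t gap_size;         /* length of the gap in bytes */
};

/* Convert a pointer to a text byte into a byte offset from the start
   of the text, skipping the gap.  P must not point into the gap.  */
static ptrdiff_t ptr_to_offset(const struct gap_buffer *b,
                               const unsigned char *p)
{
  ptrdiff_t raw = p - b->base;
  return raw >= b->gap_start + b->gap_size ? raw - b->gap_size : raw;
}

/* Convert a byte offset back into a pointer, skipping the gap.  */
static const unsigned char *offset_to_ptr(const struct gap_buffer *b,
                                          ptrdiff_t off)
{
  return b->base + (off >= b->gap_start ? off + b->gap_size : off);
}

/* Pointer to the text byte before P: convert, decrement, convert back.  */
static const unsigned char *prev_byte(const struct gap_buffer *b,
                                      const unsigned char *p)
{
  return offset_to_ptr(b, ptr_to_offset(b, p) - 1);
}

/* True if P points into the gap, i.e. at no text byte at all.  */
static int in_gap(const struct gap_buffer *b, const unsigned char *p)
{
  ptrdiff_t raw = p - b->base;
  return raw >= b->gap_start && raw < b->gap_start + b->gap_size;
}

/* Demo: "ab", a 3-byte gap, then "cd".  */
static const unsigned char demo_storage[] = { 'a', 'b', 0, 0, 0, 'c', 'd' };
static const struct gap_buffer demo = { demo_storage, 2, 3 };
```

In this model, if `d` is `demo_storage + 5` (the 'c' just after the gap), then `d - 1` lands inside the gap, while `prev_byte (&demo, d)` yields `demo_storage + 1`, the 'b' before the gap.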
I've got rid of all the questionable "d - 1"s. All these code pieces now
first do PTR_TO_OFFSET (d), then do SYNTAX_TABLE_BYTE_TO_CHAR on the
result, and then any arithmetic on the result of that. (See patch
below).
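
To show why the order of the two steps matters, here is a small self-contained sketch. It is not taken from regex-emacs.c: the two converters are hypothetical stand-ins for a byte-to-char function whose behaviour in the middle of a multibyte character is not pinned down, one rounding down and one rounding up. The text is assumed to be UTF-8-style and positions are 0-based.

```c
#include <stddef.h>

/* True if BYTE starts a character (not a 10xxxxxx continuation byte).  */
static int is_char_head(unsigned char byte)
{
  return (byte & 0xC0) != 0x80;
}

/* Number of character heads in TEXT[0 .. BYTE_OFF - 1].  */
static ptrdiff_t heads_before(const unsigned char *text, ptrdiff_t byte_off)
{
  ptrdiff_t heads = 0;
  for (ptrdiff_t i = 0; i < byte_off; i++)
    if (is_char_head(text[i]))
      heads++;
  return heads;
}

/* Hypothetical converter that rounds a mid-character offset UP to the
   next character.  */
static ptrdiff_t byte_to_char_up(const unsigned char *text, ptrdiff_t len,
                                 ptrdiff_t byte_off)
{
  (void) len;
  return heads_before(text, byte_off);
}

/* Hypothetical converter that rounds a mid-character offset DOWN to
   the character containing it.  */
static ptrdiff_t byte_to_char_down(const unsigned char *text, ptrdiff_t len,
                                   ptrdiff_t byte_off)
{
  ptrdiff_t n = heads_before(text, byte_off);
  if (byte_off < len && !is_char_head(text[byte_off]))
    n--;  /* mid-character: fall back to the containing character */
  return n;
}

/* Demo text: 'a', then U+00E9 encoded as C3 A9, then 'b'.  */
static const unsigned char demo_text[] = { 'a', 0xC3, 0xA9, 'b' };
```

Take `d` at byte offset 3 (the 'b'). Subtracting in bytes first gives offset 2, which is mid-character, so the two converters disagree (1 versus 2). Converting offset 3 first and then subtracting 1 gives character 1 with either converter, which is the character before `d`.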
> > > According to the definition of PTR_TO_OFFSET and POINTER_TO_OFFSET,
> > > the result may be the same as if we did the decrement after the fact,
> > > but it still looks fishy. WDYT?
> > I think it is suboptimal to have both PTR_TO_OFFSET and
> > POINTER_TO_OFFSET meaning different things in the same source file. ;-)
> Those macros hide the fact that the argument could be a Lisp string or
> a buffer, so I don't think I agree with you here.
I just meant that having the two names so similar might be confusing.
[ .... ]
> > There are eight occurrences of SYNTAX_TABLE_BYTE_TO_CHAR in
> > regex-emacs.c. I think I will check them all, amending them as in your
> > patch.
> > What do you say?
> I'm not Stefan, but what I say is that we should only make sure 'd'
> never points to the very first byte of 'string2'. If it does, then
> decrementing it will produce invalid results. If we cannot decide
> whether that situation could happen, we should add an assertion there
> to that effect.
I'm fairly sure it's safe, since the code now always does PTR_TO_OFFSET (d)
first, which takes care of the gap.
Here's the patch (already "tested"), which gets rid of the unwanted
"d - 1"s:
diff --git a/src/regex-emacs.c b/src/regex-emacs.c
index b667a43a37..45b4f8107c 100644
--- a/src/regex-emacs.c
+++ b/src/regex-emacs.c
@@ -4732,8 +4732,8 @@ re_match_2_internal (struct re_pattern_buffer *bufp, re_char *string1,
int c1, c2;
int s1, s2;
int dummy;
- ptrdiff_t offset = PTR_TO_OFFSET (d - 1);
- ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset);
+ ptrdiff_t offset = PTR_TO_OFFSET (d);
+ ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset) - 1;
UPDATE_SYNTAX_TABLE (charpos);
GET_CHAR_BEFORE_2 (c1, d, string1, end1, string2, end2);
s1 = SYNTAX (c1);
@@ -4811,8 +4811,8 @@ re_match_2_internal (struct re_pattern_buffer *bufp, re_char *string1,
int c1, c2;
int s1, s2;
int dummy;
- ptrdiff_t offset = PTR_TO_OFFSET (d) - 1;
- ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset);
+ ptrdiff_t offset = PTR_TO_OFFSET (d);
+ ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset) - 1;
UPDATE_SYNTAX_TABLE (charpos);
GET_CHAR_BEFORE_2 (c1, d, string1, end1, string2, end2);
s1 = SYNTAX (c1);
@@ -4826,7 +4826,7 @@ re_match_2_internal (struct re_pattern_buffer *bufp, re_char *string1,
{
PREFETCH_NOLIMIT ();
GET_CHAR_AFTER (c2, d, dummy);
- UPDATE_SYNTAX_TABLE_FORWARD (charpos);
+ UPDATE_SYNTAX_TABLE_FORWARD (charpos + 1);
s2 = SYNTAX (c2);
/* ... and S2 is Sword, and WORD_BOUNDARY_P (C1, C2)
@@ -4890,8 +4890,8 @@ re_match_2_internal (struct re_pattern_buffer *bufp, re_char *string1,
is the character at D, and S2 is the syntax of C2. */
int c1, c2;
int s1, s2;
- ptrdiff_t offset = PTR_TO_OFFSET (d) - 1;
- ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset);
+ ptrdiff_t offset = PTR_TO_OFFSET (d);
+ ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset) - 1;
UPDATE_SYNTAX_TABLE (charpos);
GET_CHAR_BEFORE_2 (c1, d, string1, end1, string2, end2);
s1 = SYNTAX (c1);
--
Alan Mackenzie (Nuremberg, Germany).