unofficial mirror of emacs-devel@gnu.org 
 help / color / mirror / code / Atom feed
From: Alan Mackenzie <acm@muc.de>
To: Stefan Monnier <monnier@iro.umontreal.ca>
Cc: emacs-devel@gnu.org
Subject: Re: Questionable code in handling of wordend in the regexp engine in regex-emacs.c
Date: Fri, 1 Mar 2019 11:10:18 +0000	[thread overview]
Message-ID: <20190301111018.GA5674@ACM> (raw)
In-Reply-To: <jwvtvgr1yc7.fsf-monnier+emacs@gnu.org>

Hello, Stefan.

On Mon, Feb 25, 2019 at 14:18:10 -0500, Stefan Monnier wrote:
> >> > Surely the argument to the second occurrence should be (charpos + 1)?
> >> I believe it's instead the other one that needs to use "charpos - 1"
> >> because the UPDATE_SYNTAX_TABLE is called just before reading the char
> >> *before* charpos (see patch below).
> > I don't think this is right.  offset is calculated from d, and then
> > decremented, before calculating charpos.

> Hmm... I think you're right (and the symend code does like you suggest).
> This said, I find it odd that the code does:

> 	      ptrdiff_t offset = PTR_TO_OFFSET (d) - 1;
> 	      ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset);
> 	      UPDATE_SYNTAX_TABLE (charpos);

> Supposedly `d` is a char* pointing to the beginning of a potentially
> multibyte char, In that case `d - 1` will point "somewhere before the
> end of the previous multibyte char" but not necessarily at
> its beginning.  Maybe the patch below would be preferable to avoid
> this situation?

SYNTAX_TABLE_BYTE_TO_CHAR ends up calling buf_bytepos_to_charpos (in
marker.c).  This latter function doesn't handle well the case of `d'
being in the middle of a multibyte character; sometimes it "rounds it
down", other times it "rounds it up" to a character position.  I think
it should be defined as rounding it down.  It would be a relatively
simple correction (at least, technically ;-).

I think your patch "always do the arithmetic on a charpos, not a
bytepos" is a very good idea indeed.

But I'm still a little worried about buf_bytepos_to_charpos.  Perhaps it
should state that the result is undefined when the bytepos is "invalid".
How many other places in Emacs call it with "invalid" byte positions?
For that matter, how many charpos <-> bytepos functions are there in
Emacs?  Just this one?

> Worse, in notwordbound we do:

> 		ptrdiff_t offset = PTR_TO_OFFSET (d - 1);
> 		ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset);
> 		UPDATE_SYNTAX_TABLE (charpos);

> which seems even more broken because `d` might point to the first byte
> after the gap, so `d - 1` will point in the middle of the gap, so it's
> simply an invalid argument to PTR_TO_OFFSET.

I don't think this is right.  Both `d' and `offset' are byte
measurements, not character measurements, so it shouldn't matter whether
the "- 1" is inside or outside the parens.  However, it would be less
confusing if they were both (?all) the same.

> According to the definition of PTR_TO_OFFSET and POINTER_TO_OFFSET,
> the result may be the same as if we did the decrement after the fact,
> but it still looks fishy.  WDYT?

I think it is suboptimal to have both PTR_TO_OFFSET and
POINTER_TO_OFFSET meaning different things in the same source file.  ;-)

>         Stefan


> diff --git a/src/regex-emacs.c b/src/regex-emacs.c
> index b667a43a37..b21cba0e46 100644
> --- a/src/regex-emacs.c
> +++ b/src/regex-emacs.c
> @@ -4811,9 +4811,9 @@ re_match_2_internal (struct re_pattern_buffer *bufp, re_char *string1,
>  	      int c1, c2;
>  	      int s1, s2;
>  	      int dummy;
> -	      ptrdiff_t offset = PTR_TO_OFFSET (d) - 1;
> +	      ptrdiff_t offset = PTR_TO_OFFSET (d);
>  	      ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset);
> -	      UPDATE_SYNTAX_TABLE (charpos);
> +	      UPDATE_SYNTAX_TABLE (charpos - 1);
>  	      GET_CHAR_BEFORE_2 (c1, d, string1, end1, string2, end2);
>  	      s1 = SYNTAX (c1);
 
There are eight occurrences of SYNTAX_TABLE_BYTE_TO_CHAR in
regex-emacs.c.  I think I will check them all, amending them as in your
patch.

What do you say?

-- 
Alan Mackenzie (Nuremberg, Germany).



  reply	other threads:[~2019-03-01 11:10 UTC|newest]

Thread overview: 22+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-02-22 16:45 Questionable code in handling of wordend in the regexp engine in regex-emacs.c Alan Mackenzie
2019-02-23 23:15 ` Stefan Monnier
2019-02-25 18:56   ` Alan Mackenzie
2019-02-25 19:18     ` Stefan Monnier
2019-03-01 11:10       ` Alan Mackenzie [this message]
2019-03-01 13:41         ` Stefan Monnier
2019-03-01 13:46         ` Eli Zaretskii
2019-03-01 14:14           ` Alan Mackenzie
2019-03-01 14:43             ` Eli Zaretskii
2019-03-01 14:58               ` Alan Mackenzie
2019-03-01 16:22                 ` Eli Zaretskii
2019-03-01 16:38                   ` Alan Mackenzie
2019-03-01 19:16                     ` Alan Mackenzie
2019-03-01 19:31                       ` Eli Zaretskii
2019-03-02 11:16                         ` Alan Mackenzie
2019-03-02 12:18                           ` Eli Zaretskii
2019-03-02 13:18                             ` Alan Mackenzie
2019-03-02 13:37                               ` Eli Zaretskii
2019-03-04 17:25                               ` Eli Zaretskii
2019-03-05 10:51                                 ` Alan Mackenzie
2019-03-05 16:26                                   ` Eli Zaretskii
2019-03-02 12:21                           ` Eli Zaretskii

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://www.gnu.org/software/emacs/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190301111018.GA5674@ACM \
    --to=acm@muc.de \
    --cc=emacs-devel@gnu.org \
    --cc=monnier@iro.umontreal.ca \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).