From: Stefan Monnier
Newsgroups: gmane.emacs.devel
Subject: Re: Questionable code in handling of wordend in the regexp engine in regex-emacs.c
Date: Mon, 25 Feb 2019 14:18:10 -0500
References: <20190222164522.GB5411@ACM> <20190225185656.GA3605@ACM>
Cc: Alan Mackenzie
To: emacs-devel@gnu.org

>> > Surely the argument to the second occurrence should be (charpos + 1)?
>> I believe it's instead the other one that needs to use "charpos - 1"
>> because the UPDATE_SYNTAX_TABLE is called just before reading the char
>> *before* charpos (see patch below).
> I don't think this is right.  offset is calculated from d, and then
> decremented, before calculating charpos.

Hmm... I think you're right (and the symend code does as you suggest).
This said, I find it odd that the code does:

    ptrdiff_t offset = PTR_TO_OFFSET (d) - 1;
    ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset);
    UPDATE_SYNTAX_TABLE (charpos);

Supposedly `d` is a char* pointing to the beginning of a potentially
multibyte char.  In that case `d - 1` will point "somewhere before the
end of the previous multibyte char" but not necessarily at its
beginning.  Maybe the patch below would be preferable to avoid
this situation?

Worse, in notwordbound we do:

    ptrdiff_t offset = PTR_TO_OFFSET (d - 1);
    ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset);
    UPDATE_SYNTAX_TABLE (charpos);

which seems even more broken because `d` might point to the first byte
after the gap, so `d - 1` will point in the middle of the gap, so it's
simply an invalid argument to PTR_TO_OFFSET.
According to the definition of PTR_TO_OFFSET and POINTER_TO_OFFSET, the
result may be the same as if we did the decrement after the fact, but it
still looks fishy.

WDYT?


        Stefan


diff --git a/src/regex-emacs.c b/src/regex-emacs.c
index b667a43a37..b21cba0e46 100644
--- a/src/regex-emacs.c
+++ b/src/regex-emacs.c
@@ -4811,9 +4811,9 @@ re_match_2_internal (struct re_pattern_buffer *bufp, re_char *string1,
             int c1, c2;
             int s1, s2;
             int dummy;
-            ptrdiff_t offset = PTR_TO_OFFSET (d) - 1;
+            ptrdiff_t offset = PTR_TO_OFFSET (d);
             ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset);
-            UPDATE_SYNTAX_TABLE (charpos);
+            UPDATE_SYNTAX_TABLE (charpos - 1);
             GET_CHAR_BEFORE_2 (c1, d, string1, end1, string2, end2);
             s1 = SYNTAX (c1);
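
As a standalone illustration of the `d - 1` concern discussed above (this
is not Emacs code): the sketch below assumes plain UTF-8 text held in one
contiguous array with no gap. The helpers `is_continuation_byte` and
`prev_char_start_offset` are hypothetical names introduced only for this
example. It shows that the byte immediately before a character boundary
may be a continuation byte, so a byte offset computed by subtracting 1
does not necessarily land on the first byte of the previous character.

```c
#include <assert.h>
#include <stddef.h>

/* True if BYTE is a UTF-8 continuation byte (bit pattern 10xxxxxx).  */
static int
is_continuation_byte (unsigned char byte)
{
  return (byte & 0xC0) == 0x80;
}

/* Hypothetical helper, for illustration only.  P points at the first
   byte of a character inside the UTF-8 text beginning at START.
   Return the byte offset (relative to START) of the first byte of the
   character that precedes P.  Assumes contiguous text, i.e. no gap.  */
static ptrdiff_t
prev_char_start_offset (const unsigned char *start, const unsigned char *p)
{
  assert (p > start);   /* There must be a preceding character.  */
  do
    p--;                /* Step back one byte at a time ...  */
  while (p > start && is_continuation_byte (*p));
  return p - start;     /* ... until we reach a leading byte.  */
}
```

For the four-byte text 'a', 0xC3, 0xA9, 'b' with `p` pointing at 'b'
(byte offset 3), the byte at offset 2 is a continuation byte, whereas the
helper returns offset 1, the first byte of the two-byte character.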