From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!.POSTED.blaine.gmane.org!not-for-mail From: Stefan Monnier Newsgroups: gmane.emacs.devel Subject: Re: Questionable code in handling of wordend in the regexp engine in regex-emacs.c Date: Fri, 01 Mar 2019 08:41:31 -0500 Message-ID: References: <20190222164522.GB5411@ACM> <20190225185656.GA3605@ACM> <20190301111018.GA5674@ACM> Mime-Version: 1.0 Content-Type: text/plain Injection-Info: blaine.gmane.org; posting-host="blaine.gmane.org:195.159.176.226"; logging-data="154089"; mail-complaints-to="usenet@blaine.gmane.org" User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/27.0.50 (gnu/linux) Cc: emacs-devel@gnu.org To: Alan Mackenzie Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Fri Mar 01 14:41:58 2019 Return-path: Envelope-to: ged-emacs-devel@m.gmane.org Original-Received: from lists.gnu.org ([209.51.188.17]) by blaine.gmane.org with esmtps (TLS1.0:RSA_AES_256_CBC_SHA1:256) (Exim 4.89) (envelope-from ) id 1gziQG-000dw4-QQ for ged-emacs-devel@m.gmane.org; Fri, 01 Mar 2019 14:41:56 +0100 Original-Received: from localhost ([127.0.0.1]:38019 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1gziQF-0001Jg-J2 for ged-emacs-devel@m.gmane.org; Fri, 01 Mar 2019 08:41:55 -0500 Original-Received: from eggs.gnu.org ([209.51.188.92]:58804) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1gziQ5-0001F2-Tu for emacs-devel@gnu.org; Fri, 01 Mar 2019 08:41:46 -0500 Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1gziQ3-0006Ld-MU for emacs-devel@gnu.org; Fri, 01 Mar 2019 08:41:45 -0500 Original-Received: from chene.dit.umontreal.ca ([132.204.246.20]:40752) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1gziQ1-00066f-W3 for emacs-devel@gnu.org; Fri, 01 Mar 2019 08:41:42 -0500 Original-Received: from pastel.home (lechon.iro.umontreal.ca [132.204.27.242]) by chene.dit.umontreal.ca (8.14.7/8.14.1) with ESMTP id x21DfWwG031318; Fri, 1 Mar 2019 08:41:32 -0500 Original-Received: by pastel.home (Postfix, from userid 20848) id C99B769F62; Fri, 1 Mar 2019 08:41:31 -0500 (EST) In-Reply-To: <20190301111018.GA5674@ACM> (Alan Mackenzie's message of "Fri, 1 Mar 2019 11:10:18 +0000") X-NAI-Spam-Flag: NO X-NAI-Spam-Threshold: 5 X-NAI-Spam-Score: 0 X-NAI-Spam-Rules: 2 Rules triggered EDT_SA_DN_PASS=0, RV6494=0 X-NAI-Spam-Version: 2.3.0.9418 : core <6494> : inlines <7025> : streams <1814445> : uri <2804333> X-detected-operating-system: by eggs.gnu.org: Genre and OS details not recognized. X-Received-From: 132.204.246.20 X-BeenThere: emacs-devel@gnu.org X-Mailman-Version: 2.1.21 Precedence: list List-Id: "Emacs development discussions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Original-Sender: "Emacs-devel" Xref: news.gmane.org gmane.emacs.devel:233726 Archived-At: > down", other times it "rounds it up" to a character position. I think > it should be defined as rounding it down. It would be a relatively > simple correction (at least, technically ;-). When moving forward, rounding it up is more natural ;-) > But I'm still a little worried about buf_bytepos_to_charpos. Perhaps it > should state that the result is undefined when the bytepos is "invalid". Yes, I think it's the intention. Even better would be to signal an error (when built with --enable-checking). > For that matter, how many charpos <-> bytepos functions are there in > Emacs? Just this one? I think so, yes. >> Worse, in notwordbound we do: > >> ptrdiff_t offset = PTR_TO_OFFSET (d - 1); >> ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset); >> UPDATE_SYNTAX_TABLE (charpos); > >> which seems even more broken because `d` might point to the first byte >> after the gap, so `d - 1` will point in the middle of the gap, so it's >> simply an invalid argument to PTR_TO_OFFSET. > > I don't think this is right. Both `d' and `offset' are byte > measurements, not character measurements, so it shouldn't matter whether > the "- 1" is inside or outside the parens. However, it would be less > confusing if they were both (?all) the same. The difference between `d` and `offset` is just an offset, indeed, but it can be 2 different offsets depending on whether `d` is before or after the gap, so what happens when `d` is within the gap depends on how the test for "before/after the gap" is implemented. More specifically, when `d` is N bytes before the end of the gap, the code could consider it as being N bytes before the beginning of the second part, or being "gap-size - N" bytes after the end of the first part. >> According to the definition of PTR_TO_OFFSET and POINTER_TO_OFFSET, >> the result may be the same as if we did the decrement after the fact, >> but it still looks fishy. WDYT? > > I think it is suboptimal to have both PTR_TO_OFFSET and > POINTER_TO_OFFSET meaning different things in the same source file. ;-) I'm so glad you're volunteering to clean this up. Thank you, really. > There are eight occurrences of SYNTAX_TABLE_BYTE_TO_CHAR in > regex-emacs.c. I think I will check them all, amending them as in your > patch. > What do you say? Thanks, Stefan