unofficial mirror of bug-gnu-emacs@gnu.org 
 help / color / mirror / code / Atom feed
* bug#15107: 24.3; replace-regexp-in-string wrong on \`
@ 2013-08-15 22:15 Kevin Ryde
  2016-03-06  4:18 ` bug#15107: reproduced on emacs-25 branch Michael Wright
  2016-08-30 23:57 ` bug#15107: [PATCH] Add replace-regexp-in-string regression test Erik Anderson
  0 siblings, 2 replies; 10+ messages in thread
From: Kevin Ryde @ 2013-08-15 22:15 UTC (permalink / raw)
  To: 15107

replace-regexp-in-string behaves incorrectly if a regexp has \` among
its matches.

    (replace-regexp-in-string "\\`\\|X" "Z" "--XX--" t t)
    =>
    "Z--ZXZX--"

where I expected

    "Z--ZZ--"

This seems to be due to the optimization in replace-regexp-in-string
which re-matches on the matched substring.  \' can match the substring
where it did not match in the middle of the full string.  In the example
above "X" is the match in the full string, but on taking that "X" as a
substring it can match "\\`".

Probably similar mismatches on the substring occur for things like \' ^
$ \b \< etc.  Maybe the comment in the code about munging the match data
would be a better way.





In GNU Emacs 24.3.1 (i486-pc-linux-gnu, X toolkit, Xaw3d scroll bars)
 of 2013-05-29 on blah.blah, modified by Debian
System Description:	Debian GNU/Linux testing/unstable

Configured using:
 `configure '--build' 'i486-linux-gnu' '--build' 'i486-linux-gnu'
 '--prefix=/usr' '--sharedstatedir=/var/lib' '--libexecdir=/usr/lib'
 '--localstatedir=/var/lib' '--infodir=/usr/share/info'
 '--mandir=/usr/share/man' '--with-pop=yes'
 '--enable-locallisppath=/etc/emacs24:/etc/emacs:/usr/local/share/emacs/24.3/site-lisp:/usr/local/share/emacs/site-lisp:/usr/share/emacs/24.3/site-lisp:/usr/share/emacs/site-lisp'
 '--with-crt-dir=/usr/lib/i386-linux-gnu' '--with-x=yes'
 '--with-x-toolkit=lucid' '--with-toolkit-scroll-bars' '--without-gconf'
 'build_alias=i486-linux-gnu' 'CFLAGS=-g -O2 -fstack-protector
 --param=ssp-buffer-size=4 -Wformat -Werror=format-security -Wall'
 'LDFLAGS=-Wl,-z,relro -Wl,-znocombreloc'
 'CPPFLAGS=-D_FORTIFY_SOURCE=2''

Important settings:
  value of $LANG: en_AU
  locale-coding-system: iso-latin-1-unix
  default enable-multibyte-characters: t





^ permalink raw reply	[flat|nested] 10+ messages in thread

* bug#15107: reproduced on emacs-25 branch
  2013-08-15 22:15 bug#15107: 24.3; replace-regexp-in-string wrong on \` Kevin Ryde
@ 2016-03-06  4:18 ` Michael Wright
  2016-08-30 23:57 ` bug#15107: [PATCH] Add replace-regexp-in-string regression test Erik Anderson
  1 sibling, 0 replies; 10+ messages in thread
From: Michael Wright @ 2016-03-06  4:18 UTC (permalink / raw)
  To: 15107

[-- Attachment #1: Type: text/plain, Size: 761 bytes --]

Kevin Ryde <user42@zip.com.au> writes:

> replace-regexp-in-string behaves incorrectly if a regexp has \` among
> its matches.
>
>     (replace-regexp-in-string "\\`\\|X" "Z" "--XX--" t t)
>     =>
>     "Z--ZXZX--"
>
> where I expected
>
>     "Z--ZZ--"
>
> This seems to be due to the optimization in replace-regexp-in-string
> which re-matches on the matched substring.  \' can match the substring
> where it did not match in the middle of the full string.  In the example
> above "X" is the match in the full string, but on taking that "X" as a
> substring it can match "\\`".

I built the emacs-25 git branch I recreated the above bug today.

GNU Emacs 25.0.92.1 (x86_64-apple-darwin13.4.0, NS appkit-1265.21 Version
10.9.5 (Build 13F1507))
 of 2016-03-05

[-- Attachment #2: Type: text/html, Size: 1234 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* bug#15107: [PATCH] Add replace-regexp-in-string regression test
  2013-08-15 22:15 bug#15107: 24.3; replace-regexp-in-string wrong on \` Kevin Ryde
  2016-03-06  4:18 ` bug#15107: reproduced on emacs-25 branch Michael Wright
@ 2016-08-30 23:57 ` Erik Anderson
  2016-08-31 14:24   ` Eli Zaretskii
  2016-09-01 15:46   ` Eli Zaretskii
  1 sibling, 2 replies; 10+ messages in thread
From: Erik Anderson @ 2016-08-30 23:57 UTC (permalink / raw)
  To: 15107

[-- Attachment #1: Type: text/plain, Size: 1281 bytes --]

I can confirm the buggy behavior on emacs 24.5.1 and 25.1.50.1 for Kevin's
example as well as:

(replace-regexp-in-string "^.\\| ." #'upcase "foo bar")
> "Foo bar"  (should be "Foo Bar")

Some close variants which behave correctly:

(replace-regexp-in-string "^F\\| ." #'upcase "foo bar")
> "Foo Bar"
(replace-regexp-in-string "^.K\\| ." #'upcase "ok corral")
> "OK Corral"
(replace-regexp-in-string "^..\\| ." #'upcase "ok corral")
> "OK Corral"

This was discussed here:
http://emacs.stackexchange.com/questions/26590/replace-regexp-in-string-stops-replacement-with

Here is a regression test for when someone has a chance to tackle this:

---
 test/lisp/subr-tests.el | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/test/lisp/subr-tests.el b/test/lisp/subr-tests.el
index ce21290..969d5c2 100644
--- a/test/lisp/subr-tests.el
+++ b/test/lisp/subr-tests.el
@@ -224,5 +224,8 @@
               (error-message-string (should-error (version-to-list
"beta22_8alpha3")))
               "Invalid version syntax: `beta22_8alpha3' (must start with a
number)"))))

+(ert-deftest replace-regexp-in-string-test ()
+  (should (equal (replace-regexp-in-string "^.\\| ." #'upcase "foo bar")
"Foo Bar")))
+
 (provide 'subr-tests)
 ;;; subr-tests.el ends here
-- 

Regards,
Erik Anderson.

[-- Attachment #2: Type: text/html, Size: 2034 bytes --]

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* bug#15107: [PATCH] Add replace-regexp-in-string regression test
  2016-08-30 23:57 ` bug#15107: [PATCH] Add replace-regexp-in-string regression test Erik Anderson
@ 2016-08-31 14:24   ` Eli Zaretskii
  2016-08-31 14:36     ` Erik Anderson
  2016-09-01 15:46   ` Eli Zaretskii
  1 sibling, 1 reply; 10+ messages in thread
From: Eli Zaretskii @ 2016-08-31 14:24 UTC (permalink / raw)
  To: Erik Anderson; +Cc: 15107

> From: Erik Anderson <erikbpanderson@gmail.com>
> Date: Tue, 30 Aug 2016 23:57:35 +0000
> 
> I can confirm the buggy behavior on emacs 24.5.1 and 25.1.50.1 for Kevin's example as well as:
> 
> (replace-regexp-in-string "^.\\| ." #'upcase "foo bar")
> > "Foo bar"  (should be "Foo Bar")

Maybe I'm missing something, but I don't see why this is a bug.  The
input string "foo bar" matches the "^." alternative in its entirety,
so there's no reason to expect Emacs to apply 'upcase' twice.





^ permalink raw reply	[flat|nested] 10+ messages in thread

* bug#15107: [PATCH] Add replace-regexp-in-string regression test
  2016-08-31 14:24   ` Eli Zaretskii
@ 2016-08-31 14:36     ` Erik Anderson
  2016-08-31 15:01       ` Eli Zaretskii
  0 siblings, 1 reply; 10+ messages in thread
From: Erik Anderson @ 2016-08-31 14:36 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 15107

[-- Attachment #1: Type: text/plain, Size: 794 bytes --]

Per the replace-regexp-in-string docstring: "Replace all matches for REGEXP
with REP in STRING."

My email was a comment to an existing open bug from 2013-08-15:
http://debbugs.gnu.org/cgi/bugreport.cgi?bug=15107

On Wed, Aug 31, 2016 at 9:25 AM Eli Zaretskii <eliz@gnu.org> wrote:

> > From: Erik Anderson <erikbpanderson@gmail.com>
> > Date: Tue, 30 Aug 2016 23:57:35 +0000
> >
> > I can confirm the buggy behavior on emacs 24.5.1 and 25.1.50.1 for
> Kevin's example as well as:
> >
> > (replace-regexp-in-string "^.\\| ." #'upcase "foo bar")
> > > "Foo bar"  (should be "Foo Bar")
>
> Maybe I'm missing something, but I don't see why this is a bug.  The
> input string "foo bar" matches the "^." alternative in its entirety,
> so there's no reason to expect Emacs to apply 'upcase' twice.
>

[-- Attachment #2: Type: text/html, Size: 1343 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* bug#15107: [PATCH] Add replace-regexp-in-string regression test
  2016-08-31 14:36     ` Erik Anderson
@ 2016-08-31 15:01       ` Eli Zaretskii
  2016-08-31 15:13         ` Noam Postavsky
  0 siblings, 1 reply; 10+ messages in thread
From: Eli Zaretskii @ 2016-08-31 15:01 UTC (permalink / raw)
  To: Erik Anderson; +Cc: 15107

> From: Erik Anderson <erikbpanderson@gmail.com>
> Date: Wed, 31 Aug 2016 14:36:06 +0000
> Cc: 15107@debbugs.gnu.org
> 
> Per the replace-regexp-in-string docstring: "Replace all matches for REGEXP with REP in STRING."

Yes, and there is a single match in this case, so a single
replacement.  The _entire_ input string matches the regexp, so after
that match there's nothing else left to match.

What am I missing?






^ permalink raw reply	[flat|nested] 10+ messages in thread

* bug#15107: [PATCH] Add replace-regexp-in-string regression test
  2016-08-31 15:01       ` Eli Zaretskii
@ 2016-08-31 15:13         ` Noam Postavsky
  2016-08-31 15:32           ` Erik Anderson
  2016-08-31 16:04           ` Eli Zaretskii
  0 siblings, 2 replies; 10+ messages in thread
From: Noam Postavsky @ 2016-08-31 15:13 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Erik Anderson, 15107

On Wed, Aug 31, 2016 at 11:01 AM, Eli Zaretskii <eliz@gnu.org> wrote:
>> From: Erik Anderson <erikbpanderson@gmail.com>
>> Date: Wed, 31 Aug 2016 14:36:06 +0000
>> Cc: 15107@debbugs.gnu.org
>>
>> Per the replace-regexp-in-string docstring: "Replace all matches for REGEXP with REP in STRING."
>
> Yes, and there is a single match in this case, so a single
> replacement.  The _entire_ input string matches the regexp, so after
> that match there's nothing else left to match.
>
> What am I missing?

"^." matches only the first character of "foo bar", but maybe you have
a different idea of "matches" than I do. I would consider "^..*" to
match the whole string.





^ permalink raw reply	[flat|nested] 10+ messages in thread

* bug#15107: [PATCH] Add replace-regexp-in-string regression test
  2016-08-31 15:13         ` Noam Postavsky
@ 2016-08-31 15:32           ` Erik Anderson
  2016-08-31 16:04           ` Eli Zaretskii
  1 sibling, 0 replies; 10+ messages in thread
From: Erik Anderson @ 2016-08-31 15:32 UTC (permalink / raw)
  To: Noam Postavsky, Eli Zaretskii; +Cc: 15107

[-- Attachment #1: Type: text/plain, Size: 1615 bytes --]

I suspect that since ".*" is such a commonly used term in regexps, Eli
might be misreading the regexp.

From the Emacs manual on regular expression special characters:
"‘.’ (Period)

is a special character that matches any single character except a newline.
Using concatenation, we can make regular expressions like ‘a.b’, which
matches any three-character string that begins with ‘a’ and ends with ‘b’."
You can verify the behavior of "."

(string-match "^." "No greedy modifiers here")
(match-data)
> (0 1)

(string-match "^.*" "This has a greedy modifier")
(match-data)
> (0 26)

This is a helpful document:
https://www.gnu.org/software/emacs/manual/html_node/elisp/Regexp-Special.html#Regexp-Special

Further discussion should be moved off this list.

-Erik.

On Wed, Aug 31, 2016 at 10:13 AM Noam Postavsky <
npostavs@users.sourceforge.net> wrote:

> On Wed, Aug 31, 2016 at 11:01 AM, Eli Zaretskii <eliz@gnu.org> wrote:
> >> From: Erik Anderson <erikbpanderson@gmail.com>
> >> Date: Wed, 31 Aug 2016 14:36:06 +0000
> >> Cc: 15107@debbugs.gnu.org
> >>
> >> Per the replace-regexp-in-string docstring: "Replace all matches for
> REGEXP with REP in STRING."
> >
> > Yes, and there is a single match in this case, so a single
> > replacement.  The _entire_ input string matches the regexp, so after
> > that match there's nothing else left to match.
> >
> > What am I missing?
>
> "^." matches only the first character of "foo bar", but maybe you have
> a different idea of "matches" than I do. I would consider "^..*" to
> match the whole string.
>

[-- Attachment #2: Type: text/html, Size: 2610 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* bug#15107: [PATCH] Add replace-regexp-in-string regression test
  2016-08-31 15:13         ` Noam Postavsky
  2016-08-31 15:32           ` Erik Anderson
@ 2016-08-31 16:04           ` Eli Zaretskii
  1 sibling, 0 replies; 10+ messages in thread
From: Eli Zaretskii @ 2016-08-31 16:04 UTC (permalink / raw)
  To: Noam Postavsky; +Cc: erikbpanderson, 15107

> From: Noam Postavsky <npostavs@users.sourceforge.net>
> Date: Wed, 31 Aug 2016 11:13:05 -0400
> Cc: Erik Anderson <erikbpanderson@gmail.com>, 15107@debbugs.gnu.org
> 
> "^." matches only the first character of "foo bar", but maybe you have
> a different idea of "matches" than I do. I would consider "^..*" to
> match the whole string.

OMG, I was sure the * was there!

Sorry.





^ permalink raw reply	[flat|nested] 10+ messages in thread

* bug#15107: [PATCH] Add replace-regexp-in-string regression test
  2016-08-30 23:57 ` bug#15107: [PATCH] Add replace-regexp-in-string regression test Erik Anderson
  2016-08-31 14:24   ` Eli Zaretskii
@ 2016-09-01 15:46   ` Eli Zaretskii
  1 sibling, 0 replies; 10+ messages in thread
From: Eli Zaretskii @ 2016-09-01 15:46 UTC (permalink / raw)
  To: Erik Anderson; +Cc: 15107

> From: Erik Anderson <erikbpanderson@gmail.com>
> Date: Tue, 30 Aug 2016 23:57:35 +0000
> 
> (replace-regexp-in-string "^.\\| ." #'upcase "foo bar")
> > "Foo bar"  (should be "Foo Bar")

It looks like an algorithmic design flaw.  Here's the relevant part of
replace-regexp-in-string:

      (while (and (< start l) (string-match regexp string start))
	(setq mb (match-beginning 0)
	      me (match-end 0))
	;; If we matched the empty string, make sure we advance by one char
	(when (= me mb) (setq me (min l (1+ mb))))
	;; Generate a replacement for the matched substring.
	;; Operate only on the substring to minimize string consing.
	;; Set up match data for the substring for replacement;
	;; presumably this is likely to be faster than munging the
	;; match data directly in Lisp.
	(string-match regexp (setq str (substring string mb me)))
	(setq matches
	      (cons (replace-match (if (stringp rep)
				       rep
				     (funcall rep (match-string 0 str)))
				   fixedcase literal str subexp)

As you see, it first matches the (rest of the) string against REGEXP,
then takes the substring that matched, and matches that substring
again.  But the evident assumption that the match in the substring
will yield the same result is false.  In this case, the substring of
"oo bar" that matches "^.\\| ." is " b", but matching it again against
the same regexp yields just " ", because the first alternative
matches.  So 'upcase' is applied to the blank, and the rest is
history.





^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2016-09-01 15:46 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-08-15 22:15 bug#15107: 24.3; replace-regexp-in-string wrong on \` Kevin Ryde
2016-03-06  4:18 ` bug#15107: reproduced on emacs-25 branch Michael Wright
2016-08-30 23:57 ` bug#15107: [PATCH] Add replace-regexp-in-string regression test Erik Anderson
2016-08-31 14:24   ` Eli Zaretskii
2016-08-31 14:36     ` Erik Anderson
2016-08-31 15:01       ` Eli Zaretskii
2016-08-31 15:13         ` Noam Postavsky
2016-08-31 15:32           ` Erik Anderson
2016-08-31 16:04           ` Eli Zaretskii
2016-09-01 15:46   ` Eli Zaretskii

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).