unofficial mirror of bug-gnu-emacs@gnu.org 
 help / color / mirror / code / Atom feed
From: "Mattias Engdegård" <mattiase@acm.org>
To: Lars Ingebrigtsen <larsi@gnus.org>
Cc: 43598@debbugs.gnu.org
Subject: bug#43598: replace-in-string: finishing touches
Date: Fri, 25 Sep 2020 12:42:06 +0200	[thread overview]
Message-ID: <2CFAAACA-2FD3-44C2-B12E-E49DAA968115@acm.org> (raw)
In-Reply-To: <87blhuu3w9.fsf@gnus.org>

[-- Attachment #1: Type: text/plain, Size: 1343 bytes --]

25 sep. 2020 kl. 01.54 skrev Lars Ingebrigtsen <larsi@gnus.org>:

> I went ahead and checked in a new C-level function string-search, which
> should be an efficient way to search for strings in strings (using
> memmem, which Emacs has via Gnulib?), and this fixed these corner cases.

Thank you! Here are some proposed tweaks (diff attached):

1. Check the range of the START-POS argument so that we don't crash.
The permitted range is [0..N] where N is (length HAYSTACK), thus we permit a start right after the last character but no further.
We could also return nil in these cases but I think an error is more useful.

2. Make the docs more precise about various things.

3. Slight simplification of the implementation logic to avoid testing the same conditions multiple times.

4. More tests, especially for edge cases. Can't have too many!
One test still fails:

 (string-search "ø" "\303\270")

which should return nil but currently matches.
I think it's wrong to convert the needle to unibyte (using Fstring_as_unibyte) in this case, but I haven't decided what the best solution would be.

We should also consider the optimisations:
- If SCHARS(needle)>SCHARS(haystack) then no match is possible.
- If either needle or haystack is all-ASCII (all bytes in 0..127), then we can use memmem without conversion.


[-- Attachment #2: string-search.diff --]
[-- Type: application/octet-stream, Size: 5058 bytes --]

diff --git a/doc/lispref/strings.texi b/doc/lispref/strings.texi
index 6eb3d6f310..0f157c39d6 100644
--- a/doc/lispref/strings.texi
+++ b/doc/lispref/strings.texi
@@ -660,8 +660,10 @@ Text Comparison
 Return the position of the first instance of @var{needle} in
 @var{haystack}, both of which are strings.  If @var{start-pos} is
 non-@code{nil}, start searching from that position in @var{needle}.
+Return @code{nil} if no match was found.
 This function only considers the characters in the strings when doing
-the comparison; text properties are ignored.
+the comparison; text properties are ignored.  Matching is always
+case-sensitive.
 @end defun
 
 @defun compare-strings string1 start1 end1 string2 start2 end2 &optional ignore-case
diff --git a/src/fns.c b/src/fns.c
index 3927e4306e..2f64d95576 100644
--- a/src/fns.c
+++ b/src/fns.c
@@ -5456,15 +5456,18 @@ DEFUN ("buffer-hash", Fbuffer_hash, Sbuffer_hash, 0, 1, 0,
 
 DEFUN ("string-search", Fstring_search, Sstring_search, 2, 3, 0,
        doc: /* Search for the string NEEDLE in the string HAYSTACK.
-The return value is the position of the first instance of NEEDLE in
-HAYSTACK.
+The return value is the position of the first occurrence of NEEDLE in
+HAYSTACK, or nil if no match was found.
 
 The optional START-POS argument says where to start searching in
-HAYSTACK.  If not given, start at the beginning. */)
+HAYSTACK and defaults to zero (start at the beginning).
+It must be between zero and the length of HAYSTACK, inclusive.
+
+Case is always significant and text properties are ignored. */)
   (register Lisp_Object needle, Lisp_Object haystack, Lisp_Object start_pos)
 {
   ptrdiff_t start_byte = 0, haybytes;
-  char *res = NULL, *haystart;
+  char *res, *haystart;
 
   CHECK_STRING (needle);
   CHECK_STRING (haystack);
@@ -5472,7 +5475,10 @@ DEFUN ("string-search", Fstring_search, Sstring_search, 2, 3, 0,
   if (!NILP (start_pos))
     {
       CHECK_FIXNUM (start_pos);
-      start_byte = string_char_to_byte (haystack, XFIXNUM (start_pos));
+      EMACS_INT start = XFIXNUM (start_pos);
+      if (start < 0 || start > SCHARS (haystack))
+        xsignal1 (Qargs_out_of_range, start_pos);
+      start_byte = string_char_to_byte (haystack, start);
     }
 
   haystart = SSDATA (haystack) + start_byte;
@@ -5481,13 +5487,13 @@ DEFUN ("string-search", Fstring_search, Sstring_search, 2, 3, 0,
   if (STRING_MULTIBYTE (haystack) == STRING_MULTIBYTE (needle))
     res = memmem (haystart, haybytes,
 		  SSDATA (needle), SBYTES (needle));
-  else if (STRING_MULTIBYTE (haystack) && !STRING_MULTIBYTE (needle))
+  else if (STRING_MULTIBYTE (haystack))  /* unibyte needle */
     {
       Lisp_Object multi_needle = string_to_multibyte (needle);
       res = memmem (haystart, haybytes,
 		    SSDATA (multi_needle), SBYTES (multi_needle));
     }
-  else if (!STRING_MULTIBYTE (haystack) && STRING_MULTIBYTE (needle))
+  else                        /* unibyte haystack, multibyte needle */
     {
       Lisp_Object uni_needle = Fstring_as_unibyte (needle);
       res = memmem (haystart, haybytes,
diff --git a/test/src/fns-tests.el b/test/src/fns-tests.el
index 8c2b1300dc..323743d842 100644
--- a/test/src/fns-tests.el
+++ b/test/src/fns-tests.el
@@ -907,6 +907,12 @@ string-search
   (should (equal (string-search "foo" "foobarzot") 0))
   (should (not (string-search "fooz" "foobarzot")))
   (should (not (string-search "zot" "foobarzo")))
+  (should (equal (string-search "ab" "ab") 0))
+  (should (equal (string-search "ab\0" "ab") nil))
+  (should (equal (string-search "ab" "abababab" 3) 4))
+  (should (equal (string-search "ab" "ababac" 3) nil))
+  (let ((case-fold-search t))
+    (should (equal (string-search "ab" "AB") nil)))
 
   (should (equal
            (string-search (make-string 2 130)
@@ -923,4 +929,26 @@ string-search
   (should (not (string-search (make-string 1 255) "a\377ø")))
   (should (not (string-search (make-string 1 255) "a\377a")))
 
-  (should (equal (string-search "fóo" "zotfóo") 3)))
+  (should (equal (string-search "fóo" "zotfóo") 3))
+
+  (should (equal (string-search (string-to-multibyte "\377") "ab\377c") 2))
+  (should (equal (string-search "\303" "aøb") nil))
+  (should (equal (string-search "\270" "aøb") nil))
+  ;; This test currently fails, but it shouldn't!
+  ;;(should (equal (string-search "ø" "\303\270") nil))
+
+  (should-error (string-search "a" "abc" -1))
+  (should-error (string-search "a" "abc" 4))
+  (should-error (string-search "a" "abc" 100000000000))
+
+  (should (equal (string-search "a" "aaa" 3) nil))
+  (should (equal (string-search "\0" "") nil))
+
+  (should (equal (string-search "" "") 0))
+  (should-error (string-search "" "" 1))
+  (should (equal (string-search "" "abc") 0))
+  (should (equal (string-search "" "abc" 2) 2))
+  (should (equal (string-search "" "abc" 3) 3))
+  (should-error (string-search "" "abc" 4))
+  (should-error (string-search "" "abc" -1))
+  )

  reply	other threads:[~2020-09-25 10:42 UTC|newest]

Thread overview: 28+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-09-24 20:52 bug#43598: replace-in-string: finishing touches Mattias Engdegård
2020-09-24 21:12 ` Lars Ingebrigtsen
2020-09-24 21:19 ` Lars Ingebrigtsen
2020-09-24 23:18   ` Lars Ingebrigtsen
2020-09-25  9:21     ` Eli Zaretskii
2020-09-25 10:09       ` Lars Ingebrigtsen
2020-09-24 23:54 ` Lars Ingebrigtsen
2020-09-25 10:42   ` Mattias Engdegård [this message]
2020-09-25 11:11     ` Lars Ingebrigtsen
2020-09-25 11:22       ` Mattias Engdegård
2020-09-25 11:32         ` Lars Ingebrigtsen
2020-09-27  0:03         ` Lars Ingebrigtsen
2020-09-27  0:34           ` Lars Ingebrigtsen
2020-09-27  8:45             ` Mattias Engdegård
2020-09-28  3:41             ` Richard Stallman
2020-09-28  9:40               ` Mattias Engdegård
2020-09-29  3:29                 ` Richard Stallman
2020-09-29  4:12                   ` Eli Zaretskii
2020-09-27 11:12           ` Mattias Engdegård
2020-09-27 11:48             ` Lars Ingebrigtsen
2020-09-27 11:57               ` Mattias Engdegård
2020-09-27 12:02                 ` Lars Ingebrigtsen
2020-09-27 16:14                   ` Mattias Engdegård
2020-09-27 16:19                     ` Eli Zaretskii
2020-09-27 16:41                       ` Lars Ingebrigtsen
2020-09-27 16:48                         ` Eli Zaretskii
2020-09-26 22:44     ` Lars Ingebrigtsen
2020-09-26 22:25 ` Lars Ingebrigtsen

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://www.gnu.org/software/emacs/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=2CFAAACA-2FD3-44C2-B12E-E49DAA968115@acm.org \
    --to=mattiase@acm.org \
    --cc=43598@debbugs.gnu.org \
    --cc=larsi@gnus.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).