[PATCH] add 'string-distance' to calculate Levenshtein distance

unofficial mirror of emacs-devel@gnu.org 
 help / color / mirror / code / Atom feed

* [PATCH] add 'string-distance' to calculate Levenshtein distance
@ 2018-04-14  2:35 Chen Bin
  2018-04-14  7:05 ` Eli Zaretskii
  2018-04-14 17:18 ` Nathan Moreau
  0 siblings, 2 replies; 24+ messages in thread
From: Chen Bin @ 2018-04-14  2:35 UTC (permalink / raw)
  To: emacs-devel

[-- Attachment #1: Type: text/plain, Size: 590 bytes --]

Hello,
Attached patch implemented 'string-distance' is much faster then
existing Lisp implementation like 'org-babel-edit-distance'.

I benchmarked by below snippet:

  (setq s1 "projs1/emacs/etc/imeees/icon/emacs-document.svg")
  (setq s2 "projs2/test/etc/images/icons/emacs-document23.svg")
  (let* ((gc-cons-threshold most-positive-fixnum))
    (message "%S vs %S"
             (benchmark-run-compiled 100
               (org-babel-edit-distance s1 s2))
             (benchmark-run-compiled 100
               (string-distance s1 s2))))

Got:
(0.506494223 0 0.0) vs (0.001317414 0 0.0)


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-add-api-string-distance.patch --]
[-- Type: text/x-diff, Size: 2238 bytes --]

From bb32afb18bd06e6843921205987b23641145f51a Mon Sep 17 00:00:00 2001
From: Chen Bin <chenbin.sh@gmail.com>
Date: Sat, 14 Apr 2018 10:33:44 +1000
Subject: [PATCH] add api string-distance

---
 src/fns.c | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/src/fns.c b/src/fns.c
index 94b9d98..b63d4f4 100644
--- a/src/fns.c
+++ b/src/fns.c
@@ -41,6 +41,8 @@ along with GNU Emacs.  If not, see <https://www.gnu.org/licenses/>.  */
 # define gnutls_rnd w32_gnutls_rnd
 #endif
 
+#define min3(a, b, c) ((a) < (b) ? ((a) < (c) ? (a) : (c)) : ((b) < (c) ? (b) : (c)))
+
 static void sort_vector_copy (Lisp_Object, ptrdiff_t,
 			      Lisp_Object *restrict, Lisp_Object *restrict);
 enum equal_kind { EQUAL_NO_QUIT, EQUAL_PLAIN, EQUAL_INCLUDING_PROPERTIES };
@@ -153,6 +155,33 @@ If STRING is multibyte, this may be greater than the length of STRING.  */)
   return make_number (SBYTES (string));
 }
 
+DEFUN ("string-distance", Fstring_distance, Sstring_distance, 2, 2, 0,
+       doc: /* Return Levenshtein distance of two strings.
+Case is significant, but text properties are ignored. */)
+  (register Lisp_Object string1, Lisp_Object string2)
+{
+  CHECK_STRING (string1);
+  CHECK_STRING (string2);
+
+  char *s1 = SSDATA (string1);
+  char *s2 = SSDATA (string2);
+  unsigned int s1len, s2len, x, y, lastdiag, olddiag;
+  s1len = strlen(s1);
+  s2len = strlen(s2);
+  unsigned int column[s1len+1];
+  for (y = 1; y <= s1len; y++)
+    column[y] = y;
+  for (x = 1; x <= s2len; x++) {
+    column[0] = x;
+    for (y = 1, lastdiag = x-1; y <= s1len; y++) {
+      olddiag = column[y];
+      column[y] = min3(column[y] + 1, column[y-1] + 1, lastdiag + (s1[y-1] == s2[x-1] ? 0 : 1));
+      lastdiag = olddiag;
+    }
+  }
+  return make_number(column[s1len]);
+}
+
 DEFUN ("string-equal", Fstring_equal, Sstring_equal, 2, 2, 0,
        doc: /* Return t if two strings have identical contents.
 Case is significant, but text properties are ignored.
@@ -5226,6 +5255,7 @@ this variable.  */);
   defsubr (&Slength);
   defsubr (&Ssafe_length);
   defsubr (&Sstring_bytes);
+  defsubr (&Sstring_distance);
   defsubr (&Sstring_equal);
   defsubr (&Scompare_strings);
   defsubr (&Sstring_lessp);
-- 
2.16.3


[-- Attachment #3: Type: text/plain, Size: 50 bytes --]


-- 
Best Regards,
Chen Bin

--
Help me, help you

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH] add 'string-distance' to calculate Levenshtein distance
  2018-04-14  2:35 [PATCH] add 'string-distance' to calculate Levenshtein distance Chen Bin
@ 2018-04-14  7:05 ` Eli Zaretskii
       [not found]   ` <87lgdq831h.fsf@gmail.com>
  2018-04-14 17:18 ` Nathan Moreau
  1 sibling, 1 reply; 24+ messages in thread
From: Eli Zaretskii @ 2018-04-14  7:05 UTC (permalink / raw)
  To: Chen Bin; +Cc: emacs-devel

> From: Chen Bin <chenbin.sh@gmail.com>
> Date: Sat, 14 Apr 2018 12:35:08 +1000
> 
> Attached patch implemented 'string-distance' is much faster then
> existing Lisp implementation like 'org-babel-edit-distance'.

Thanks, please see a few comments below.

> +DEFUN ("string-distance", Fstring_distance, Sstring_distance, 2, 2, 0,
> +       doc: /* Return Levenshtein distance of two strings.

Please mention the arguments in the first line of the doc string.

> +Case is significant, but text properties are ignored. */)
> +  (register Lisp_Object string1, Lisp_Object string2)

We don't like to use 'register' in new code, we prefer leaving this to
the compiler.

> +  CHECK_STRING (string1);
> +  CHECK_STRING (string2);
> +
> +  char *s1 = SSDATA (string1);
> +  char *s2 = SSDATA (string2);
> +  unsigned int s1len, s2len, x, y, lastdiag, olddiag;
> +  s1len = strlen(s1);
> +  s2len = strlen(s2);

I presume this function is supposed to work only on strings in their
internal representation (a.k.a. "multibyte strings"), so the function
should check that and signal an error if that's not so.
Alternatively, check that either both strings are unibyte or both are
multibyte, and signal an error if not.

> +  unsigned int column[s1len+1];

I'm not sure this is portable enough, but even if it is, it's not a
good idea to unconditionally make automatic variables in this case,
because s1len and s2len could be very large, in which case you will
get stack overflow.  Please use SAFE_ALLOCA instead.

> +  for (y = 1; y <= s1len; y++)
> +    column[y] = y;
> +  for (x = 1; x <= s2len; x++) {
> +    column[0] = x;
> +    for (y = 1, lastdiag = x-1; y <= s1len; y++) {
> +      olddiag = column[y];
> +      column[y] = min3(column[y] + 1, column[y-1] + 1, lastdiag + (s1[y-1] == s2[x-1] ? 0 : 1));
> +      lastdiag = olddiag;

Is this algorithm really better than allocating just the diag matrix
(or part thereof), and accessing the string data directly, without
copying them inti local arrays?

> +  return make_number(column[s1len]);

Style: we leave a single blank between the function's name and the
following opening parenthesis (you have a few of these scattered
through the code).

Finally, this new primitives needs documentation (NEWS and the ELisp
manual), and also a couple of tests in the test suite.

Thanks again for working on this.

^ permalink raw reply	[flat|nested] 24+ messages in thread

[parent not found: <87lgdq831h.fsf@gmail.com>]

* Re: [PATCH] add 'string-distance' to calculate Levenshtein distance
       [not found]   ` <87lgdq831h.fsf@gmail.com>
@ 2018-04-14 13:24     ` Eli Zaretskii
  2018-04-14 16:40       ` Chen Bin
  0 siblings, 1 reply; 24+ messages in thread
From: Eli Zaretskii @ 2018-04-14 13:24 UTC (permalink / raw)
  To: Chen Bin; +Cc: emacs-devel

[Please CC the mailing list when you respond, so others could see your
messages.]

> From: Chen Bin <chenbin.sh@gmail.com>
> Date: Sat, 14 Apr 2018 21:01:46 +1000
> 
> Hi, Eli,
> Thanks for the review.
> 
> I fixed most issues except two things.
> 
> 1. In Asia, it's possible to cacluate distance between one unibyte and
> one multibyte string. As a Chinese, I might create a document containing
> Hanzi characters whose file name is obviously multibyte string. I may 
> need get the distance of this document to a file named "README.txt". 

If you mean unibyte pure-ASCII strings, then I agree.  But that
doesn't mean we should avoid the text altogether, because we might
compute non-zero distance between a string and its encoded unibyte
variant, which will confuse users.  At the very least the doc string
should say something about this.

> 2. Algorithm is based on https://en.wikibooks.org/w/index.php?title=Algorithm_Implementation/Strings/Levenshtein_distance&stable=0#C
> It's optimized to use O(min(m,n)) space instead of O(mn).
> Say you compare two string whose string length is 512 bytes.
> You only need allocate 512 bytes instead of 262K (512*512)
> in memory.
> 
> Please check attached patch for latest code.
> 
> --- a/etc/NEWS
> +++ b/etc/NEWS
> @@ -463,6 +463,8 @@ x-lost-selection-hooks, x-sent-selection-hooks
>  +++
>  ** New function assoc-delete-all.
>  
> +** New function string-distance.

This should mention Levenshtein distance.

> +DEFUN ("string-distance", Fstring_distance, Sstring_distance, 2, 2, 0,
> +       doc: /* Return Levenshtein distance of STRING1 and STRING2.
                                              ^^^^^^^^^^^^^^^^^^^^^^
"between STRING1 and STRING2"

> +  unsigned int s1len, s2len, x, y, lastdiag, olddiag;

These variables should be declared EMACS_INT, not unsigned int,
because the size of Emacs strings can be larger than UINT_MAX,
especially on 64-bit systems.

> +  unsigned int *column = SAFE_ALLOCA ((s1len + 1) * sizeof (unsigned int));

Likewise here.

> +  char *s1 = SSDATA (string1);
> +  char *s2 = SSDATA (string2);
> +
> +  unsigned int s1len, s2len, x, y, lastdiag, olddiag;
> +  s1len = strlen(s1);
> +  s2len = strlen(s2);

You could optimize the code by using SCHARS and SBYTES, instead of
calling strlen.

> +(ert-deftest subr-tests--string-distance ()
> +  "Test `string-distance' behavior."
> +  (should (equal 1 (string-distance "heelo" "hello")))
> +  (should (equal 2 (string-distance "aeelo" "hello")))
> +  (should (equal 0 (string-distance "ab" "ab")))
> +  (should (equal 1 (string-distance "ab" "abc"))))

Could you please add a test or two with non-ASCII characters?

Thanks.



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] add 'string-distance' to calculate Levenshtein distance
  2018-04-14 13:24     ` Eli Zaretskii
@ 2018-04-14 16:40       ` Chen Bin
  2018-04-14 17:08         ` Eli Zaretskii
  0 siblings, 1 reply; 24+ messages in thread
From: Chen Bin @ 2018-04-14 16:40 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

[-- Attachment #1: Type: text/plain, Size: 507 bytes --]

Correct me if I'm wrong.

I read cod eand found definion of Lisp_String:
  struct GCALIGNED Lisp_String
  {
    ptrdiff_t size;
    ptrdiff_t size_byte;
    INTERVAL intervals;		/* Text properties in this string.  */
    unsigned char *data;
  };

I understand string text is encoded in UTF8 format and is stored in
'Lisp_String::data'. There is actually no difference between unibyte
and multibyte text since UTF8 is compatible with ASCII and we only deal
with 'data' field.

I attached the latest patch.


[-- Attachment #2: 0001-add-api-string-distance.patch --]
[-- Type: text/x-diff, Size: 3588 bytes --]

From 53e6358e1346afd64bad261823f92ae9619b93d2 Mon Sep 17 00:00:00 2001
From: Chen Bin <chenbin.sh@gmail.com>
Date: Sun, 15 Apr 2018 02:20:29 +1000
Subject: [PATCH] add api string-distance

---
 etc/NEWS                |  2 ++
 src/fns.c               | 35 +++++++++++++++++++++++++++++++++++
 test/lisp/subr-tests.el | 10 ++++++++++
 3 files changed, 47 insertions(+)

diff --git a/etc/NEWS b/etc/NEWS
index 12b72eb..75cc92d 100644
--- a/etc/NEWS
+++ b/etc/NEWS
@@ -463,6 +463,8 @@ x-lost-selection-hooks, x-sent-selection-hooks
 +++
 ** New function assoc-delete-all.
 
+** New function string-distance to calculate Levenshtein distance between two strings.
+
 ** 'print-quoted' now defaults to t, so if you want to see
 (quote x) instead of 'x you will have to bind it to nil where applicable.
 
diff --git a/src/fns.c b/src/fns.c
index 94b9d98..f149001 100644
--- a/src/fns.c
+++ b/src/fns.c
@@ -153,6 +153,40 @@ If STRING is multibyte, this may be greater than the length of STRING.  */)
   return make_number (SBYTES (string));
 }
 
+DEFUN ("string-distance", Fstring_distance, Sstring_distance, 2, 2, 0,
+       doc: /* Return Levenshtein distance between STRING1 and STRING2.
+Case is significant, but text properties are ignored.
+Strings are treated as byte arrays when calculating distance. */)
+  (Lisp_Object string1, Lisp_Object string2)
+{
+  CHECK_STRING (string1);
+  CHECK_STRING (string2);
+
+  char *s1 = SSDATA (string1);
+  char *s2 = SSDATA (string2);
+
+  ptrdiff_t s1len, s2len, x, y, lastdiag, olddiag;
+  s1len = SBYTES (string1);
+  s2len = SBYTES (string2);
+
+  USE_SAFE_ALLOCA;
+  ptrdiff_t *column = SAFE_ALLOCA ((s1len + 1) * sizeof (ptrdiff_t));
+  for (y = 1; y <= s1len; y++)
+    column[y] = y;
+  for (x = 1; x <= s2len; x++)
+    {
+      column[0] = x;
+      for (y = 1, lastdiag = x - 1; y <= s1len; y++)
+        {
+          olddiag = column[y];
+          column[y] = min (min (column[y] + 1, column[y-1] + 1), lastdiag + (s1[y-1] == s2[x-1] ? 0 : 1));
+          lastdiag = olddiag;
+        }
+    }
+  SAFE_FREE ();
+  return make_number (column[s1len]);
+}
+
 DEFUN ("string-equal", Fstring_equal, Sstring_equal, 2, 2, 0,
        doc: /* Return t if two strings have identical contents.
 Case is significant, but text properties are ignored.
@@ -5226,6 +5260,7 @@ this variable.  */);
   defsubr (&Slength);
   defsubr (&Ssafe_length);
   defsubr (&Sstring_bytes);
+  defsubr (&Sstring_distance);
   defsubr (&Sstring_equal);
   defsubr (&Scompare_strings);
   defsubr (&Sstring_lessp);
diff --git a/test/lisp/subr-tests.el b/test/lisp/subr-tests.el
index 52b61d9..40a7727 100644
--- a/test/lisp/subr-tests.el
+++ b/test/lisp/subr-tests.el
@@ -281,6 +281,16 @@ subr-test--frames-1
   (should (equal (string-match-p "\\`[[:blank:]]\\'" "\u3000") 0))
   (should-not (string-match-p "\\`[[:blank:]]\\'" "\N{LINE SEPARATOR}")))
 
+(ert-deftest subr-tests--string-distance ()
+  "Test `string-distance' behavior."
+  (should (equal 1 (string-distance "heelo" "hello")))
+  (should (equal 2 (string-distance "aeelo" "hello")))
+  (should (equal 0 (string-distance "ab" "ab")))
+  (should (equal 1 (string-distance "ab" "abc")))
+  ;; string containing unicode character (Hanzi)
+  (should (equal 6 (string-distance "ab" "ab我她")))
+  (should (equal 3 (string-distance "我" "她"))))
+
 (ert-deftest subr-tests--dolist--wrong-number-of-args ()
   "Test that `dolist' doesn't accept wrong types or length of SPEC,
 cf. Bug#25477."
-- 
2.16.3


[-- Attachment #3: Type: text/plain, Size: 3094 bytes --]


>>>>> "Eli" == Eli Zaretskii <eliz@gnu.org> writes:

    Eli> [Please CC the mailing list when you respond, so others could
    Eli> see your messages.]

    >> From: Chen Bin <chenbin.sh@gmail.com> Date: Sat, 14 Apr 2018
    >> 21:01:46 +1000
    >> 
    >> Hi, Eli, Thanks for the review.
    >> 
    >> I fixed most issues except two things.
    >> 
    >> 1. In Asia, it's possible to cacluate distance between one
    >> unibyte and one multibyte string. As a Chinese, I might create a
    >> document containing Hanzi characters whose file name is obviously
    >> multibyte string. I may need get the distance of this document to
    >> a file named "README.txt".

    Eli> If you mean unibyte pure-ASCII strings, then I agree.  But that
    Eli> doesn't mean we should avoid the text altogether, because we
    Eli> might compute non-zero distance between a string and its
    Eli> encoded unibyte variant, which will confuse users.  At the very
    Eli> least the doc string should say something about this.

    >> 2. Algorithm is based on
    >> https://en.wikibooks.org/w/index.php?title=Algorithm_Implementation/Strings/Levenshtein_distance&stable=0#C
    >> It's optimized to use O(min(m,n)) space instead of O(mn).  Say
    >> you compare two string whose string length is 512 bytes.  You
    >> only need allocate 512 bytes instead of 262K (512*512) in memory.
    >> 
    >> Please check attached patch for latest code.
    >> 
    >> --- a/etc/NEWS +++ b/etc/NEWS @@ -463,6 +463,8 @@
    >> x-lost-selection-hooks, x-sent-selection-hooks +++ ** New
    >> function assoc-delete-all.
    >> 
    >> +** New function string-distance.

    Eli> This should mention Levenshtein distance.

    >> +DEFUN ("string-distance", Fstring_distance, Sstring_distance, 2,
    >> 2, 0, + doc: /* Return Levenshtein distance of STRING1 and
    >> STRING2.
    Eli>                                               ^^^^^^^^^^^^^^^^^^^^^^
    Eli> "between STRING1 and STRING2"

    >> + unsigned int s1len, s2len, x, y, lastdiag, olddiag;

    Eli> These variables should be declared EMACS_INT, not unsigned int,
    Eli> because the size of Emacs strings can be larger than UINT_MAX,
    Eli> especially on 64-bit systems.

    >> + unsigned int *column = SAFE_ALLOCA ((s1len + 1) * sizeof
    >> (unsigned int));

    Eli> Likewise here.

    >> + char *s1 = SSDATA (string1); + char *s2 = SSDATA (string2); + +
    >> unsigned int s1len, s2len, x, y, lastdiag, olddiag; + s1len =
    >> strlen(s1); + s2len = strlen(s2);

    Eli> You could optimize the code by using SCHARS and SBYTES, instead
    Eli> of calling strlen.

    >> +(ert-deftest subr-tests--string-distance () + "Test
    >> `string-distance' behavior."  + (should (equal 1 (string-distance
    >> "heelo" "hello"))) + (should (equal 2 (string-distance "aeelo"
    >> "hello"))) + (should (equal 0 (string-distance "ab" "ab"))) +
    >> (should (equal 1 (string-distance "ab" "abc"))))

    Eli> Could you please add a test or two with non-ASCII characters?

    Eli> Thanks.

-- 
Best Regards,
Chen Bin

--
Help me, help you

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH] add 'string-distance' to calculate Levenshtein distance
  2018-04-14 16:40       ` Chen Bin
@ 2018-04-14 17:08         ` Eli Zaretskii
  2018-04-15  7:15           ` Chen Bin
  0 siblings, 1 reply; 24+ messages in thread
From: Eli Zaretskii @ 2018-04-14 17:08 UTC (permalink / raw)
  To: Chen Bin; +Cc: emacs-devel

> From: Chen Bin <chenbin.sh@gmail.com>
> Cc: emacs-devel@gnu.org
> Date: Sun, 15 Apr 2018 02:40:18 +1000
> 
> Correct me if I'm wrong.
> 
> I read cod eand found definion of Lisp_String:
>   struct GCALIGNED Lisp_String
>   {
>     ptrdiff_t size;
>     ptrdiff_t size_byte;
>     INTERVAL intervals;		/* Text properties in this string.  */
>     unsigned char *data;
>   };
> 
> I understand string text is encoded in UTF8 format and is stored in
> 'Lisp_String::data'. There is actually no difference between unibyte
> and multibyte text since UTF8 is compatible with ASCII and we only deal
> with 'data' field.

No, that's incorrect.  The difference does exist, it just all but
disappear for unibyte strings encoded in UTF-8.  But if you encode a
string in some other encoding, like Latin-1, you will see a very
different stream of bytes.

> I attached the latest patch.

Thanks.

> +  ;; string containing unicode character (Hanzi)
> +  (should (equal 6 (string-distance "ab" "ab我她")))
> +  (should (equal 3 (string-distance "我" "她"))))

Should the distance be measured in bytes or in characters?  I think
it's the latter, in which case the implementation should work in
characters, not bytes.



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] add 'string-distance' to calculate Levenshtein distance
  2018-04-14 17:08         ` Eli Zaretskii
@ 2018-04-15  7:15           ` Chen Bin
  2018-04-15 14:47             ` Eli Zaretskii
  2018-04-15 18:53             ` Paul Eggert
  0 siblings, 2 replies; 24+ messages in thread
From: Chen Bin @ 2018-04-15  7:15 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

[-- Attachment #1: Type: text/plain, Size: 1038 bytes --]

I attached patch for latest code.

As required, now by default we compare the string by character. So
it can handle multibyte character correctly. 

'string-distance' has third optional parameter 'bytecompare' which
use byte comparing instead of default character comparing.

I did read this thread carefully and I did know different algorithms
exist to calculate string distance.

But I feel only my code is the right solution to our original problem.

Our original problem is to provide a faster C solution to calculate
*Levenshtein distance*.

It will replace 3rd party Lisp solution like 'org-babel-edit-distance'.

As 'org-babel-edit-distance' documented, it will "Return the edit
(levenshtein) distance between strings S1 S2". So the problem here is to
calculate *Levenshtein distance*.

Other algorithms have their own definition of "string distance" so they
are not for *Levenshtein distance*.

Here is link of "Comparison of String Distance Algorithms":

https://www.joyofdata.de/blog/comparison-of-string-distance-algorithms/



[-- Attachment #2: 0001-add-api-string-distance.patch --]
[-- Type: text/x-diff, Size: 4711 bytes --]

From 657ee7f92e5bfd6f5138e9d0b030bd80ff963b16 Mon Sep 17 00:00:00 2001
From: Chen Bin <chenbin.sh@gmail.com>
Date: Sun, 15 Apr 2018 02:20:29 +1000
Subject: [PATCH] add api string-distance

---
 etc/NEWS                |  2 ++
 src/fns.c               | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
 test/lisp/subr-tests.el | 18 ++++++++++++++++++
 3 files changed, 69 insertions(+)

diff --git a/etc/NEWS b/etc/NEWS
index 0c4daee9ac..7c9f3feaa7 100644
--- a/etc/NEWS
+++ b/etc/NEWS
@@ -484,6 +484,8 @@ x-lost-selection-hooks, x-sent-selection-hooks
 +++
 ** New function assoc-delete-all.
 
+** New function string-distance to calculate Levenshtein distance between two strings.
+
 ** 'print-quoted' now defaults to t, so if you want to see
 (quote x) instead of 'x you will have to bind it to nil where applicable.
 
diff --git a/src/fns.c b/src/fns.c
index 94b9d984f0..ce1e5be1ed 100644
--- a/src/fns.c
+++ b/src/fns.c
@@ -153,6 +153,54 @@ If STRING is multibyte, this may be greater than the length of STRING.  */)
   return make_number (SBYTES (string));
 }
 
+DEFUN ("string-distance", Fstring_distance, Sstring_distance, 2, 3, 0,
+       doc: /* Return Levenshtein distance between STRING1 and STRING2.
+If BYTECOMPARE is nil, we compare character of strings.
+If BYTECOMPARE is t, we compare byte of strings.
+Case is significant, but text properties are ignored. */)
+  (Lisp_Object string1, Lisp_Object string2, Lisp_Object bytecompare)
+{
+  CHECK_STRING (string1);
+  CHECK_STRING (string2);
+
+  bool use_bytecompare = !NILP(bytecompare);
+  bool same;
+  char *s1 = SSDATA (string1);
+  char *s2 = SSDATA (string2);
+
+  ptrdiff_t len1 = use_bytecompare? SBYTES (string1) : SCHARS (string1);
+  ptrdiff_t len2 = use_bytecompare? SBYTES (string2) : SCHARS (string2);
+  ptrdiff_t x, y, lastdiag, olddiag;
+  unsigned short *ws1 = 0; /* 16 bit unicode character */
+  unsigned short *ws2 = 0; /* 16 bit unicode character */
+  if(!use_bytecompare)
+    {
+      /* convert utf-8 byte stream to 16 bit unicode array */
+      string1 = code_convert_string_norecord (string1, Qutf_16le, 1);
+      ws1 = (unsigned short *) SDATA (string1);
+      string2 = code_convert_string_norecord (string2, Qutf_16le, 1);
+      ws2 = (unsigned short *) SDATA (string2);
+    }
+
+  USE_SAFE_ALLOCA;
+  ptrdiff_t *column = SAFE_ALLOCA ((len1 + 1) * sizeof (ptrdiff_t));
+  for (y = 1; y <= len1; y++)
+    column[y] = y;
+  for (x = 1; x <= len2; x++)
+    {
+      column[0] = x;
+      for (y = 1, lastdiag = x - 1; y <= len1; y++)
+        {
+          olddiag = column[y];
+          same = use_bytecompare? (s1[y-1] == s2[x-1]) : (ws1[y-1] == ws2[x-1]);
+          column[y] = min (min (column[y] + 1, column[y-1] + 1), lastdiag + (same? 0 : 1));
+          lastdiag = olddiag;
+        }
+    }
+  SAFE_FREE ();
+  return make_number (column[len1]);
+}
+
 DEFUN ("string-equal", Fstring_equal, Sstring_equal, 2, 2, 0,
        doc: /* Return t if two strings have identical contents.
 Case is significant, but text properties are ignored.
@@ -5226,6 +5274,7 @@ this variable.  */);
   defsubr (&Slength);
   defsubr (&Ssafe_length);
   defsubr (&Sstring_bytes);
+  defsubr (&Sstring_distance);
   defsubr (&Sstring_equal);
   defsubr (&Scompare_strings);
   defsubr (&Sstring_lessp);
diff --git a/test/lisp/subr-tests.el b/test/lisp/subr-tests.el
index 52b61d9fb9..1310f5b2a5 100644
--- a/test/lisp/subr-tests.el
+++ b/test/lisp/subr-tests.el
@@ -281,6 +281,24 @@ subr-test--frames-1
   (should (equal (string-match-p "\\`[[:blank:]]\\'" "\u3000") 0))
   (should-not (string-match-p "\\`[[:blank:]]\\'" "\N{LINE SEPARATOR}")))
 
+(ert-deftest subr-tests--string-distance ()
+  "Test `string-distance' behavior."
+  ;; ASCII characters are always fine
+  (should (equal 1 (string-distance "heelo" "hello")))
+  (should (equal 2 (string-distance "aeelo" "hello")))
+  (should (equal 0 (string-distance "ab" "ab" t)))
+  (should (equal 1 (string-distance "ab" "abc" t)))
+
+  ;; string containing hanzi character, compare by byte
+  (should (equal 6 (string-distance "ab" "ab我她" t)))
+  (should (equal 3 (string-distance "ab" "a我b" t)))
+  (should (equal 3 (string-distance "我" "她" t))
+
+  ;; string containing hanzi character, compare by character
+  (should (equal 2 (string-distance "ab" "ab我她")))
+  (should (equal 1 (string-distance "ab" "a我b")))
+  (should (equal 1 (string-distance "我" "她"))))
+
 (ert-deftest subr-tests--dolist--wrong-number-of-args ()
   "Test that `dolist' doesn't accept wrong types or length of SPEC,
 cf. Bug#25477."
-- 
2.16.3


[-- Attachment #3: Type: text/plain, Size: 1476 bytes --]

>>>>> "Eli" == Eli Zaretskii <eliz@gnu.org> writes:

    >> From: Chen Bin <chenbin.sh@gmail.com> Cc: emacs-devel@gnu.org
    >> Date: Sun, 15 Apr 2018 02:40:18 +1000
    >> 
    >> Correct me if I'm wrong.
    >> 
    >> I read cod eand found definion of Lisp_String: struct GCALIGNED
    >> Lisp_String { ptrdiff_t size; ptrdiff_t size_byte; INTERVAL
    >> intervals; /* Text properties in this string.  */ unsigned char
    >> *data; };
    >> 
    >> I understand string text is encoded in UTF8 format and is stored
    >> in 'Lisp_String::data'. There is actually no difference between
    >> unibyte and multibyte text since UTF8 is compatible with ASCII
    >> and we only deal with 'data' field.

    Eli> No, that's incorrect.  The difference does exist, it just all
    Eli> but disappear for unibyte strings encoded in UTF-8.  But if you
    Eli> encode a string in some other encoding, like Latin-1, you will
    Eli> see a very different stream of bytes.

    >> I attached the latest patch.

    Eli> Thanks.

    >> + ;; string containing unicode character (Hanzi) + (should (equal
    >> 6 (string-distance "ab" "ab我她"))) + (should (equal 3
    >> (string-distance "我" "她"))))

    Eli> Should the distance be measured in bytes or in characters?  I
    Eli> think it's the latter, in which case the implementation should
    Eli> work in characters, not bytes.

-- 
Best Regards,
Chen Bin

--
Help me, help you

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH] add 'string-distance' to calculate Levenshtein distance
  2018-04-15  7:15           ` Chen Bin
@ 2018-04-15 14:47             ` Eli Zaretskii
       [not found]               ` <CAAE-R+-RDWvyrv+uqHszzh6VMH6An3disOw=PyPWaTnUTHDOCw@mail.gmail.com>
  2018-04-15 18:53             ` Paul Eggert
  1 sibling, 1 reply; 24+ messages in thread
From: Eli Zaretskii @ 2018-04-15 14:47 UTC (permalink / raw)
  To: Chen Bin; +Cc: emacs-devel

> From: Chen Bin <chenbin.sh@gmail.com>
> Cc: emacs-devel@gnu.org
> Date: Sun, 15 Apr 2018 17:15:49 +1000
> 
> I attached patch for latest code.

Thanks, we are getting close.

> +DEFUN ("string-distance", Fstring_distance, Sstring_distance, 2, 3, 0,
> +       doc: /* Return Levenshtein distance between STRING1 and STRING2.
> +If BYTECOMPARE is nil, we compare character of strings.
> +If BYTECOMPARE is t, we compare byte of strings.
> +Case is significant, but text properties are ignored. */)
> +  (Lisp_Object string1, Lisp_Object string2, Lisp_Object bytecompare)

I question the need for the BYTECOMPARE flag.  Emacs's editing
operations work in characters, not in bytes.  There's insert-byte, but
no delete-byte or replace-byte (although applications which should for
some reason need that could implement that, albeit not conveniently).
The byte-level operations are not exposed to Lisp for a good reason:
Emacs is primarily a text-processing environment, and text is
processed in character units.

So I think you should remove that option, unless you can explain why
you think it's needed and in what situations.

> +  unsigned short *ws1 = 0; /* 16 bit unicode character */
> +  unsigned short *ws2 = 0; /* 16 bit unicode character */
> +  if(!use_bytecompare)
> +    {
> +      /* convert utf-8 byte stream to 16 bit unicode array */
> +      string1 = code_convert_string_norecord (string1, Qutf_16le, 1);
> +      ws1 = (unsigned short *) SDATA (string1);
> +      string2 = code_convert_string_norecord (string2, Qutf_16le, 1);
> +      ws2 = (unsigned short *) SDATA (string2);
> +    }

Conversion to UTF-16 burns cycles, and the function will do the wrong
thing for characters beyond the BMP as result, because you compare two
16-bit words instead of full characters.

Instead, please use the macros defined on character.h, either
CHAR_STRING_ADVANCE of FETCH_STRING_CHAR_ADVANCE (or their non-ADVANCE
counterparts), whichever better suits your coding style and needs.
These macros produce the full Unicode codepoint of the string
characters, and you can then compare them without bumping into the
problem with UTF-8 or UTF-16 encoding.  The *-ADVANCE variants also
advance the pointer or index to the next character as side effect,
which is handy when examining successive characters in a loop.

^ permalink raw reply	[flat|nested] 24+ messages in thread

[parent not found: <CAAE-R+-RDWvyrv+uqHszzh6VMH6An3disOw=PyPWaTnUTHDOCw@mail.gmail.com>]

[parent not found: <83k1t72b2o.fsf@gnu.org>]

* Re: [PATCH] add 'string-distance' to calculate Levenshtein distance
       [not found]                 ` <83k1t72b2o.fsf@gnu.org>
@ 2018-04-17  2:43                   ` chen bin
  2018-04-17 15:44                     ` Eli Zaretskii
       [not found]                   ` <CAAE-R+8s++_LRcQCLX60Z=TQeQHdtbM5X1k525bfNnnPSLDvRw@mail.gmail.com>
  1 sibling, 1 reply; 24+ messages in thread
From: chen bin @ 2018-04-17  2:43 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

'bytecompare' only applies the original Levenshtein distance algorihtm
to the ascii strings. It's not enabled by default.

As we agreed, it has some pleasant side effect for East Asian. It's not right
and wrong thing. Just side effect of UTF-8 encoding.

Weighted distance is NOT doable for only one reason:
You can't increase character weight of certain race or religion in Emacs.
It's politically wrong.

So I suggest keep both unicode and ascii algorithm.

I will study the macros you mentioned tonight and figure out a new solution
at the end of day.


On Tue, Apr 17, 2018 at 5:41 AM, Eli Zaretskii <eliz@gnu.org> wrote:
>> From: chen bin <chenbin.sh@gmail.com>
>> Date: Tue, 17 Apr 2018 00:29:39 +1000
>>
>> To be honest I'm 100% confident about my current solution. I'm
>> actually an expert in dealing unicode related C programming.
>
> I appreciate your expertise, that's not the issue here.  The main
> issue is how to support all the range of characters that Emacs wants
> to support.
>
>> It's standard practice to convert UTF-8 character to UTF-16
>> character before any further string operation.
>>
>> To anwser your questions, I have to introduce East Asian culture at first.
>>
>> China/Japan/Korea use same Hanzi (Chinese characters). China's
>> national standard GBK defines
>> 21886 characters used Eas Asians.
>>
>> UTF-16 is compatilbe with GBK.
>>
>> Even most educated Chinese knows only 6000~8000 characters.
>>
>> UTF-32 exists because there are another 50000 ancient Hanzi which no
>> Chinese uses.
>>
>> Among 99089 characters of Unicode v5.0, 71226 characters are Hanzi.
>>
>> In short, you got about 22000 non-hanzi characters and 21000 Hanzi
>> characters. UTF-16 is totally fine
>> to deal with it.
>
> UTF-16 is fine for characters within the BMP, yes.  But Emacs supports
> much more, see below.
>
>> Say my name is 陈斌, my colleague's name is 张斌. Only one Hanzi difference, right?
>>
>> While reading file named "陈斌's doucment0", I want to list other
>> documents with similar file names.
>> Using my byte compare algorithm, "陈斌's document1", "陈斌's document2" is
>> on the top of the sorted
>> document list. Make sense.
>>
>> Without byte comparing, my family name "陈" is not more important than
>> an latin character
>> So files like "张斌's document1", "张斌's document2" is on the top of the list now.
>
> You are saying that a difference of 1 character has more "weight" for
> some characters than for others.  That might be so, but using the
> number of bytes in the UTF-8 representation as that weight makes very
> little sense to me, because the length of the UTF-8 sequence for a
> given character is determined by reasons that are largely historical:
> it depends on when was a particular character introduced into Unicode.
> It has nothing to do with "importance" of a character.
>
> Thus, your example might produce a sensible result when comparing
> ASCII characters vs non-ASCII characters, but will make much less
> sense when comparing one non-ASCII character with another non-ASCII
> character.  For example, emoji characters, which need 4 bytes in UTF-8
> sequences, will be considered twice as "important" as the above Hanzi
> characters of your example (and 4 times as important as ASCII).  The
> results will be totally unexpected and unpredictable (unless the user
> knows the entire Unicode database by heart).
>
> So I think features such as what you want to achieve with the above
> example will have to use some other additional feature, like "weighted
> difference", and using the byte length as a kind of surrogate "weight"
> for this purpose is IMO not a good idea, even though it could look
> like that at first glance.
>
>> 2. As mentioned, China has national standard GBK which is compatible
>> with UTF-16.
>> Every character is a 16 bit integer. It's common practice to convert
>> UTF-8 string (or other encoded string)
>> to UTF-16 format before any further string operation. Or else, it's
>> impossbile to display Japanese, Korean, Chinese
>> in one document (considering tyical user operations, text
>> searching/replacing ...)
>>
>> I've been doing this since day one of my career because I'm Chinese.
>
> As I said, this is okay for characters in the BMP.  But Emacs must
> support much more.
>
> First, we support the latest Unicode 10.0, which added a lot of
> characters outside the BMP.  You have the emoji there, you have many
> new symbols there (the Alchemichal Symbols, Geometric Shapes, and
> Supplemental Arrows-C blocks), Domino Tiles, Mathematical
> Alphanumerics, Musical symbols, and many scripts that there's no
> reason to tell people we don't support in Emacs.
>
> Moreover, Emacs's internal representation of raw 8-bit bytes uses the
> area #x3FFF00..#x3FFFFF, which need 5 bytes in its UTF-8 sequence.  We
> cannot exclude the possibility that strings we are comparing have
> embedded raw bytes in them, this happens in practice and must work as
> expected.
>
>> The integer is called wide character (wchar_t in standard C, or WCHAR
>> in Windows GUI SDK),
>
> Windows indeed uses a 16-bit wchar_t type, which is why its support
> for characters beyond the BMP is problematic and in some cases simply
> non-existent, because standard C library functions that accept a
> single wide character argument are unable to cope with UTF-16 encoding
> that requires 2 16-bit words to represent a character.  So, for
> example, you don't have a way of looking for a character outside the
> BMP in a wide character string by calling the standard function
> wcschr, because its 2nd argument is a single wchar_t wide character,
> which can only express characters inside the BMP.
>
> Emacs cannot tolerate such limitations, not in the year 2018.
>
> So using implementations that only support the BMP is really a
> non-starter for Emacs.  We cannot accept such code.
>
>> Now allow me explain why I can't use CHAR_STRING_ADVANCE. The first
>> parameter 'c' is ALREADY a wide character.
>
> I apologize for my stupid typo.  I meant STRING_CHAR_ADVANCE, not
> CHAR_STRING_ADVANCE.  At least I didn't make such a mistake in the 2nd
> macro I mentioned, FETCH_STRING_CHAR_ADVANCE...
>
>> It's efficient to extract characters into one unsigned short array in
>> one shot.
>
> It is indeed quite efficient, but it requires one more pass of the
> entire string, which is unnecessary.  And anyway, the main reason not
> to convert to UTF-16 is what I wrote above about supporting characters
> beyond the BMP, not the extra loop.
>
>> Or else, I need convert character index to byte offset first. So I
>> need get byte offset by calling 'string_char_to_byte'
>> twice to get offset of two elements for comparing, The element coould
>> could be 1, 2, 3 bytes. So I still need convert bytes
>>  into a 4 byte integer for comparing.
>
> The macros I mentioned already do that for you, they manage both
> character offsets and byte offsets efficiently (and they are much more
> efficient than string_char_to_byte because they advance sequentially,
> whereas string_char_to_byte is for random access to a string).
>
> Thanks.
>
> P.S.  Please, let's return this discussion to the list, where it
> belongs.  I'm disappointed that this was again a private email, and
> the list isn't CC'ed.  This discussion should be public, not private.



-- 
help me, help you.



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] add 'string-distance' to calculate Levenshtein distance
  2018-04-17  2:43                   ` chen bin
@ 2018-04-17 15:44                     ` Eli Zaretskii
  2018-04-18  7:11                       ` chen bin
  0 siblings, 1 reply; 24+ messages in thread
From: Eli Zaretskii @ 2018-04-17 15:44 UTC (permalink / raw)
  To: chen bin; +Cc: emacs-devel

> From: chen bin <chenbin.sh@gmail.com>
> Date: Tue, 17 Apr 2018 12:43:33 +1000
> Cc: emacs-devel@gnu.org
> 
> Weighted distance is NOT doable for only one reason:
> You can't increase character weight of certain race or religion in Emacs.
> It's politically wrong.

If the weight comes from the user/Lisp program which uses this
primitive, then whatever politics is involved is up to the caller.



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] add 'string-distance' to calculate Levenshtein distance
  2018-04-17 15:44                     ` Eli Zaretskii
@ 2018-04-18  7:11                       ` chen bin
  0 siblings, 0 replies; 24+ messages in thread
From: chen bin @ 2018-04-18  7:11 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

Sure.

What's your opinion on my latest patch?

On Wed, Apr 18, 2018 at 1:44 AM, Eli Zaretskii <eliz@gnu.org> wrote:
>> From: chen bin <chenbin.sh@gmail.com>
>> Date: Tue, 17 Apr 2018 12:43:33 +1000
>> Cc: emacs-devel@gnu.org
>>
>> Weighted distance is NOT doable for only one reason:
>> You can't increase character weight of certain race or religion in Emacs.
>> It's politically wrong.
>
> If the weight comes from the user/Lisp program which uses this
> primitive, then whatever politics is involved is up to the caller.



-- 
help me, help you.



^ permalink raw reply	[flat|nested] 24+ messages in thread

[parent not found: <CAAE-R+8s++_LRcQCLX60Z=TQeQHdtbM5X1k525bfNnnPSLDvRw@mail.gmail.com>]

[parent not found: <83bmei36dw.fsf@gnu.org>]

* Re: [PATCH] add 'string-distance' to calculate Levenshtein distance
       [not found]                     ` <83bmei36dw.fsf@gnu.org>
@ 2018-04-17 12:31                       ` chen bin
  2018-04-19  8:05                         ` Eli Zaretskii
  0 siblings, 1 reply; 24+ messages in thread
From: chen bin @ 2018-04-17 12:31 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

[-- Attachment #1: Type: text/plain, Size: 1220 bytes --]

Hi, Eli,
As you suggested, I re-write the code using 'FETCH_STRING_CHAR_ADVANCE'.

I did some peformance test.

My original wide character array solution is two time fast as current solution.

But I understand now we avoid repeated memory allocation for wide
character array. So the generic performance looks OK to me.

I also implemented the byte comparing version. It's 4 times as fast. And I do
need use it to compare file path in my package 'counsel-etags'.

That's the major reason why I started this task

The fille path couldn't contain any funny characters (emoji). so
it'sperfectly fine
to use byte comparing version.

Regards,
Chen

On Tue, Apr 17, 2018 at 12:37 PM, Eli Zaretskii <eliz@gnu.org> wrote:
>> From: chen bin <chenbin.sh@gmail.com>
>> Date: Tue, 17 Apr 2018 11:32:23 +1000
>>
>> I  intentionally avoid cc `emacs-devel` in my previous mail  because I
>> feel `culture/race` is sensitive thing.
>
> Please don't.  There's no sensitivity about this stuff on emacs-devel,
> in my long experience with Emacs development.
>
>> I will study your mail and figure out a better solution.
>
> Thanks.  Feel free to ask more questions, and sorry again for my
> stupid typo mistake.
>



-- 
help me, help you.

[-- Attachment #2: 0001-add-api-string-distance.patch --]
[-- Type: text/x-patch, Size: 4751 bytes --]

From e785b766cf412c5d6b0c2d8a3f6e575675390c83 Mon Sep 17 00:00:00 2001
From: Chen Bin <chenbin.sh@gmail.com>
Date: Sun, 15 Apr 2018 02:20:29 +1000
Subject: [PATCH] add api string-distance

---
 etc/NEWS                |  2 ++
 src/fns.c               | 61 +++++++++++++++++++++++++++++++++++++++++++++++++
 test/lisp/subr-tests.el | 18 +++++++++++++++
 3 files changed, 81 insertions(+)

diff --git a/etc/NEWS b/etc/NEWS
index 5aa92e2991..3cce2c48c7 100644
--- a/etc/NEWS
+++ b/etc/NEWS
@@ -490,6 +490,8 @@ x-lost-selection-hooks, x-sent-selection-hooks
 +++
 ** New function assoc-delete-all.
 
+** New function string-distance to calculate Levenshtein distance between two strings.
+
 ** 'print-quoted' now defaults to t, so if you want to see
 (quote x) instead of 'x you will have to bind it to nil where applicable.
 
diff --git a/src/fns.c b/src/fns.c
index 94b9d984f0..ce779b2a79 100644
--- a/src/fns.c
+++ b/src/fns.c
@@ -153,6 +153,66 @@ If STRING is multibyte, this may be greater than the length of STRING.  */)
   return make_number (SBYTES (string));
 }
 
+DEFUN ("string-distance", Fstring_distance, Sstring_distance, 2, 3, 0,
+       doc: /* Return Levenshtein distance between STRING1 and STRING2.
+If BYTECOMPARE is nil, we compare character of strings.
+If BYTECOMPARE is t, we compare byte of strings.
+Comparing by byte is faster and non-ascii characters has weighted distance.
+Case is significant, but text properties are ignored. */)
+  (Lisp_Object string1, Lisp_Object string2, Lisp_Object bytecompare)
+{
+  CHECK_STRING (string1);
+  CHECK_STRING (string2);
+
+  bool use_bytecompare = !NILP(bytecompare);
+  ptrdiff_t len1 = SCHARS (string1);
+  ptrdiff_t len2 = SCHARS (string2);
+  ptrdiff_t x, y, lastdiag, olddiag;
+
+  USE_SAFE_ALLOCA;
+  ptrdiff_t *column = SAFE_ALLOCA ((len1 + 1) * sizeof (ptrdiff_t));
+  for (y = 1; y <= len1; y++)
+    column[y] = y;
+
+  if (use_bytecompare)
+    {
+      char *s1 = SSDATA (string1);
+      char *s2 = SSDATA (string2);
+
+      for (x = 1; x <= len2; x++)
+        {
+          column[0] = x;
+          for (y = 1, lastdiag = x - 1; y <= len1; y++)
+            {
+              olddiag = column[y];
+              column[y] = min (min (column[y] + 1, column[y-1] + 1), lastdiag + (s1[y-1] == s2[x-1]? 0 : 1));
+              lastdiag = olddiag;
+            }
+        }
+    }
+  else
+    {
+      int c1, c2;
+      ptrdiff_t i1, i1_byte, i2, i2_byte;
+      i2 = i2_byte = 0;
+      for (x = 1; x <= len2; x++)
+        {
+          column[0] = x;
+          FETCH_STRING_CHAR_ADVANCE (c2, string2, i2, i2_byte);
+          i1 = i1_byte = 0;
+          for (y = 1, lastdiag = x - 1; y <= len1; y++)
+            {
+              olddiag = column[y];
+              FETCH_STRING_CHAR_ADVANCE (c1, string1, i1, i1_byte);
+              column[y] = min (min (column[y] + 1, column[y-1] + 1), lastdiag + (c1 == c2? 0 : 1));
+              lastdiag = olddiag;
+            }
+        }
+    }
+  SAFE_FREE ();
+  return make_number (column[len1]);
+}
+
 DEFUN ("string-equal", Fstring_equal, Sstring_equal, 2, 2, 0,
        doc: /* Return t if two strings have identical contents.
 Case is significant, but text properties are ignored.
@@ -5226,6 +5286,7 @@ this variable.  */);
   defsubr (&Slength);
   defsubr (&Ssafe_length);
   defsubr (&Sstring_bytes);
+  defsubr (&Sstring_distance);
   defsubr (&Sstring_equal);
   defsubr (&Scompare_strings);
   defsubr (&Sstring_lessp);
diff --git a/test/lisp/subr-tests.el b/test/lisp/subr-tests.el
index 52b61d9fb9..1310f5b2a5 100644
--- a/test/lisp/subr-tests.el
+++ b/test/lisp/subr-tests.el
@@ -281,6 +281,24 @@ subr-test--frames-1
   (should (equal (string-match-p "\\`[[:blank:]]\\'" "\u3000") 0))
   (should-not (string-match-p "\\`[[:blank:]]\\'" "\N{LINE SEPARATOR}")))
 
+(ert-deftest subr-tests--string-distance ()
+  "Test `string-distance' behavior."
+  ;; ASCII characters are always fine
+  (should (equal 1 (string-distance "heelo" "hello")))
+  (should (equal 2 (string-distance "aeelo" "hello")))
+  (should (equal 0 (string-distance "ab" "ab" t)))
+  (should (equal 1 (string-distance "ab" "abc" t)))
+
+  ;; string containing hanzi character, compare by byte
+  (should (equal 6 (string-distance "ab" "ab我她" t)))
+  (should (equal 3 (string-distance "ab" "a我b" t)))
+  (should (equal 3 (string-distance "我" "她" t))
+
+  ;; string containing hanzi character, compare by character
+  (should (equal 2 (string-distance "ab" "ab我她")))
+  (should (equal 1 (string-distance "ab" "a我b")))
+  (should (equal 1 (string-distance "我" "她"))))
+
 (ert-deftest subr-tests--dolist--wrong-number-of-args ()
   "Test that `dolist' doesn't accept wrong types or length of SPEC,
 cf. Bug#25477."
-- 
2.16.3


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH] add 'string-distance' to calculate Levenshtein distance
  2018-04-17 12:31                       ` chen bin
@ 2018-04-19  8:05                         ` Eli Zaretskii
  2018-04-19 14:55                           ` chen bin
  0 siblings, 1 reply; 24+ messages in thread
From: Eli Zaretskii @ 2018-04-19  8:05 UTC (permalink / raw)
  To: chen bin; +Cc: emacs-devel

> From: chen bin <chenbin.sh@gmail.com>
> Date: Tue, 17 Apr 2018 22:31:20 +1000
> Cc: emacs-devel@gnu.org
> 
> As you suggested, I re-write the code using 'FETCH_STRING_CHAR_ADVANCE'.

Thanks.

> I also implemented the byte comparing version. It's 4 times as fast. And I do
> need use it to compare file path in my package 'counsel-etags'.

File names are just strings for this purpose, and they can potentially
include any non-zero characters.  So I don't see why they are special.

> The fille path couldn't contain any funny characters (emoji). so
> it'sperfectly fine
> to use byte comparing version.

File names can very well include emoji and other "funny" characters,
Emacs supports that on all modern systems (including even MS-Windows).

> diff --git a/etc/NEWS b/etc/NEWS
> index 5aa92e2991..3cce2c48c7 100644
> --- a/etc/NEWS
> +++ b/etc/NEWS
> @@ -490,6 +490,8 @@ x-lost-selection-hooks, x-sent-selection-hooks
>  +++
>  ** New function assoc-delete-all.
>  
> +** New function string-distance to calculate Levenshtein distance between two strings.

This long line should be filled using the fill-column setting we use
in NEWS.  Even better, make the header a short summary, like

  ** New function 'string-distance'

and then describe its functionality in a separate sentence that starts
immediately below that header.

> +DEFUN ("string-distance", Fstring_distance, Sstring_distance, 2, 3, 0,
> +       doc: /* Return Levenshtein distance between STRING1 and STRING2.
> +If BYTECOMPARE is nil, we compare character of strings.
> +If BYTECOMPARE is t, we compare byte of strings.

Please lose the "we" part, it's inappropriate in documentation,
because it describes what Emacs does.

> +Comparing by byte is faster and non-ascii characters has weighted distance.

I would delete this sentence, it is IMO confusing more than anything
else.  (And I still think the bytewise comparison is not needed.)

> +  bool use_bytecompare = !NILP(bytecompare);
                                ^^
Space between these 2 characters.

> +  else
> +    {
> +      int c1, c2;
> +      ptrdiff_t i1, i1_byte, i2, i2_byte;
> +      i2 = i2_byte = 0;
> +      for (x = 1; x <= len2; x++)

Please move the initialization of i2 and i2_byte into the for-loop
initializer (suing the comma operator).

> +          i1 = i1_byte = 0;
> +          for (y = 1, lastdiag = x - 1; y <= len1; y++)

Likewise here with i1 and i1_byte.

Thanks.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] add 'string-distance' to calculate Levenshtein distance
  2018-04-19  8:05                         ` Eli Zaretskii
@ 2018-04-19 14:55                           ` chen bin
  2018-04-20  4:37                             ` chen bin
                                               ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: chen bin @ 2018-04-19 14:55 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

[-- Attachment #1: Type: text/plain, Size: 2821 bytes --]

Hi, Eli,
I attached the new patch.

A software project usually contains nomal directory and file names.
Linux kernel source, for example.

My package `counsel-etags` will use byte compare version.

Regards,
Chen

On Thu, Apr 19, 2018 at 6:05 PM, Eli Zaretskii <eliz@gnu.org> wrote:
>> From: chen bin <chenbin.sh@gmail.com>
>> Date: Tue, 17 Apr 2018 22:31:20 +1000
>> Cc: emacs-devel@gnu.org
>>
>> As you suggested, I re-write the code using 'FETCH_STRING_CHAR_ADVANCE'.
>
> Thanks.
>
>> I also implemented the byte comparing version. It's 4 times as fast. And I do
>> need use it to compare file path in my package 'counsel-etags'.
>
> File names are just strings for this purpose, and they can potentially
> include any non-zero characters.  So I don't see why they are special.
>
>> The fille path couldn't contain any funny characters (emoji). so
>> it'sperfectly fine
>> to use byte comparing version.
>
> File names can very well include emoji and other "funny" characters,
> Emacs supports that on all modern systems (including even MS-Windows).
>
>> diff --git a/etc/NEWS b/etc/NEWS
>> index 5aa92e2991..3cce2c48c7 100644
>> --- a/etc/NEWS
>> +++ b/etc/NEWS
>> @@ -490,6 +490,8 @@ x-lost-selection-hooks, x-sent-selection-hooks
>>  +++
>>  ** New function assoc-delete-all.
>>
>> +** New function string-distance to calculate Levenshtein distance between two strings.
>
> This long line should be filled using the fill-column setting we use
> in NEWS.  Even better, make the header a short summary, like
>
>   ** New function 'string-distance'
>
> and then describe its functionality in a separate sentence that starts
> immediately below that header.
>
>> +DEFUN ("string-distance", Fstring_distance, Sstring_distance, 2, 3, 0,
>> +       doc: /* Return Levenshtein distance between STRING1 and STRING2.
>> +If BYTECOMPARE is nil, we compare character of strings.
>> +If BYTECOMPARE is t, we compare byte of strings.
>
> Please lose the "we" part, it's inappropriate in documentation,
> because it describes what Emacs does.
>
>> +Comparing by byte is faster and non-ascii characters has weighted distance.
>
> I would delete this sentence, it is IMO confusing more than anything
> else.  (And I still think the bytewise comparison is not needed.)
>
>> +  bool use_bytecompare = !NILP(bytecompare);
>                                 ^^
> Space between these 2 characters.
>
>> +  else
>> +    {
>> +      int c1, c2;
>> +      ptrdiff_t i1, i1_byte, i2, i2_byte;
>> +      i2 = i2_byte = 0;
>> +      for (x = 1; x <= len2; x++)
>
> Please move the initialization of i2 and i2_byte into the for-loop
> initializer (suing the comma operator).
>
>> +          i1 = i1_byte = 0;
>> +          for (y = 1, lastdiag = x - 1; y <= len1; y++)
>
> Likewise here with i1 and i1_byte.
>
> Thanks.



-- 
help me, help you.

[-- Attachment #2: 0001-add-api-string-distance.patch --]
[-- Type: text/x-patch, Size: 4734 bytes --]

From d42338ce5ecb592d12ac4108119282cf6a0a552b Mon Sep 17 00:00:00 2001
From: Chen Bin <chenbin.sh@gmail.com>
Date: Fri, 20 Apr 2018 00:38:29 +1000
Subject: [PATCH] add api string-distance

---
 etc/NEWS                |  3 +++
 src/fns.c               | 61 +++++++++++++++++++++++++++++++++++++++++++++++++
 test/lisp/subr-tests.el | 18 +++++++++++++++
 3 files changed, 82 insertions(+)

diff --git a/etc/NEWS b/etc/NEWS
index 5aa92e2991..4efb7e983e 100644
--- a/etc/NEWS
+++ b/etc/NEWS
@@ -490,6 +490,9 @@ x-lost-selection-hooks, x-sent-selection-hooks
 +++
 ** New function assoc-delete-all.
 
+** New function string-distance to calculate Levenshtein distance
+between two strings.
+
 ** 'print-quoted' now defaults to t, so if you want to see
 (quote x) instead of 'x you will have to bind it to nil where applicable.
 
diff --git a/src/fns.c b/src/fns.c
index 94b9d984f0..753dc88ed6 100644
--- a/src/fns.c
+++ b/src/fns.c
@@ -153,6 +153,66 @@ If STRING is multibyte, this may be greater than the length of STRING.  */)
   return make_number (SBYTES (string));
 }
 
+DEFUN ("string-distance", Fstring_distance, Sstring_distance, 2, 3, 0,
+       doc: /* Return Levenshtein distance between STRING1 and STRING2.
+If BYTECOMPARE is nil, compare character of strings.
+If BYTECOMPARE is t, compare byte of strings.
+Case is significant, but text properties are ignored. */)
+  (Lisp_Object string1, Lisp_Object string2, Lisp_Object bytecompare)
+
+{
+  CHECK_STRING (string1);
+  CHECK_STRING (string2);
+
+  bool use_byte_compare = !NILP (bytecompare);
+  ptrdiff_t len1 = use_byte_compare? SBYTES (string1) : SCHARS (string1);
+  ptrdiff_t len2 = use_byte_compare? SBYTES (string2) : SCHARS (string2);
+  ptrdiff_t x, y, lastdiag, olddiag;
+
+  USE_SAFE_ALLOCA;
+  ptrdiff_t *column = SAFE_ALLOCA ((len1 + 1) * sizeof (ptrdiff_t));
+  for (y = 1; y <= len1; y++)
+    column[y] = y;
+
+  if (use_byte_compare)
+    {
+      char *s1 = SSDATA (string1);
+      char *s2 = SSDATA (string2);
+
+      for (x = 1; x <= len2; x++)
+        {
+          column[0] = x;
+          for (y = 1, lastdiag = x - 1; y <= len1; y++)
+            {
+              olddiag = column[y];
+              column[y] = min (min (column[y] + 1, column[y-1] + 1), lastdiag + (s1[y-1] == s2[x-1]? 0 : 1));
+              lastdiag = olddiag;
+            }
+        }
+    }
+  else
+    {
+      int c1, c2;
+      ptrdiff_t i1, i1_byte, i2 = 0, i2_byte = 0;
+      for (x = 1; x <= len2; x++)
+        {
+          column[0] = x;
+          FETCH_STRING_CHAR_ADVANCE (c2, string2, i2, i2_byte);
+          i1 = i1_byte = 0;
+          for (y = 1, lastdiag = x - 1; y <= len1; y++)
+            {
+              olddiag = column[y];
+              FETCH_STRING_CHAR_ADVANCE (c1, string1, i1, i1_byte);
+              column[y] = min (min (column[y] + 1, column[y-1] + 1), lastdiag + (c1 == c2? 0 : 1));
+              lastdiag = olddiag;
+            }
+        }
+    }
+
+  SAFE_FREE ();
+  return make_number (column[len1]);
+}
+
 DEFUN ("string-equal", Fstring_equal, Sstring_equal, 2, 2, 0,
        doc: /* Return t if two strings have identical contents.
 Case is significant, but text properties are ignored.
@@ -5226,6 +5286,7 @@ this variable.  */);
   defsubr (&Slength);
   defsubr (&Ssafe_length);
   defsubr (&Sstring_bytes);
+  defsubr (&Sstring_distance);
   defsubr (&Sstring_equal);
   defsubr (&Scompare_strings);
   defsubr (&Sstring_lessp);
diff --git a/test/lisp/subr-tests.el b/test/lisp/subr-tests.el
index 52b61d9fb9..1310f5b2a5 100644
--- a/test/lisp/subr-tests.el
+++ b/test/lisp/subr-tests.el
@@ -281,6 +281,24 @@ subr-test--frames-1
   (should (equal (string-match-p "\\`[[:blank:]]\\'" "\u3000") 0))
   (should-not (string-match-p "\\`[[:blank:]]\\'" "\N{LINE SEPARATOR}")))
 
+(ert-deftest subr-tests--string-distance ()
+  "Test `string-distance' behavior."
+  ;; ASCII characters are always fine
+  (should (equal 1 (string-distance "heelo" "hello")))
+  (should (equal 2 (string-distance "aeelo" "hello")))
+  (should (equal 0 (string-distance "ab" "ab" t)))
+  (should (equal 1 (string-distance "ab" "abc" t)))
+
+  ;; string containing hanzi character, compare by byte
+  (should (equal 6 (string-distance "ab" "ab我她" t)))
+  (should (equal 3 (string-distance "ab" "a我b" t)))
+  (should (equal 3 (string-distance "我" "她" t))
+
+  ;; string containing hanzi character, compare by character
+  (should (equal 2 (string-distance "ab" "ab我她")))
+  (should (equal 1 (string-distance "ab" "a我b")))
+  (should (equal 1 (string-distance "我" "她"))))
+
 (ert-deftest subr-tests--dolist--wrong-number-of-args ()
   "Test that `dolist' doesn't accept wrong types or length of SPEC,
 cf. Bug#25477."
-- 
2.16.3


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH] add 'string-distance' to calculate Levenshtein distance
  2018-04-19 14:55                           ` chen bin
@ 2018-04-20  4:37                             ` chen bin
  2018-04-20  6:01                             ` Thien-Thi Nguyen
  2018-04-20 10:47                             ` chen bin
  2 siblings, 0 replies; 24+ messages in thread
From: chen bin @ 2018-04-20  4:37 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

Please hold on, I will send you a new patch tonight.

On Fri, Apr 20, 2018 at 12:55 AM, chen bin <chenbin.sh@gmail.com> wrote:
> Hi, Eli,
> I attached the new patch.
>
> A software project usually contains nomal directory and file names.
> Linux kernel source, for example.
>
> My package `counsel-etags` will use byte compare version.
>
> Regards,
> Chen
>
> On Thu, Apr 19, 2018 at 6:05 PM, Eli Zaretskii <eliz@gnu.org> wrote:
>>> From: chen bin <chenbin.sh@gmail.com>
>>> Date: Tue, 17 Apr 2018 22:31:20 +1000
>>> Cc: emacs-devel@gnu.org
>>>
>>> As you suggested, I re-write the code using 'FETCH_STRING_CHAR_ADVANCE'.
>>
>> Thanks.
>>
>>> I also implemented the byte comparing version. It's 4 times as fast. And I do
>>> need use it to compare file path in my package 'counsel-etags'.
>>
>> File names are just strings for this purpose, and they can potentially
>> include any non-zero characters.  So I don't see why they are special.
>>
>>> The fille path couldn't contain any funny characters (emoji). so
>>> it'sperfectly fine
>>> to use byte comparing version.
>>
>> File names can very well include emoji and other "funny" characters,
>> Emacs supports that on all modern systems (including even MS-Windows).
>>
>>> diff --git a/etc/NEWS b/etc/NEWS
>>> index 5aa92e2991..3cce2c48c7 100644
>>> --- a/etc/NEWS
>>> +++ b/etc/NEWS
>>> @@ -490,6 +490,8 @@ x-lost-selection-hooks, x-sent-selection-hooks
>>>  +++
>>>  ** New function assoc-delete-all.
>>>
>>> +** New function string-distance to calculate Levenshtein distance between two strings.
>>
>> This long line should be filled using the fill-column setting we use
>> in NEWS.  Even better, make the header a short summary, like
>>
>>   ** New function 'string-distance'
>>
>> and then describe its functionality in a separate sentence that starts
>> immediately below that header.
>>
>>> +DEFUN ("string-distance", Fstring_distance, Sstring_distance, 2, 3, 0,
>>> +       doc: /* Return Levenshtein distance between STRING1 and STRING2.
>>> +If BYTECOMPARE is nil, we compare character of strings.
>>> +If BYTECOMPARE is t, we compare byte of strings.
>>
>> Please lose the "we" part, it's inappropriate in documentation,
>> because it describes what Emacs does.
>>
>>> +Comparing by byte is faster and non-ascii characters has weighted distance.
>>
>> I would delete this sentence, it is IMO confusing more than anything
>> else.  (And I still think the bytewise comparison is not needed.)
>>
>>> +  bool use_bytecompare = !NILP(bytecompare);
>>                                 ^^
>> Space between these 2 characters.
>>
>>> +  else
>>> +    {
>>> +      int c1, c2;
>>> +      ptrdiff_t i1, i1_byte, i2, i2_byte;
>>> +      i2 = i2_byte = 0;
>>> +      for (x = 1; x <= len2; x++)
>>
>> Please move the initialization of i2 and i2_byte into the for-loop
>> initializer (suing the comma operator).
>>
>>> +          i1 = i1_byte = 0;
>>> +          for (y = 1, lastdiag = x - 1; y <= len1; y++)
>>
>> Likewise here with i1 and i1_byte.
>>
>> Thanks.
>
>
>
> --
> help me, help you.



-- 
help me, help you.



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] add 'string-distance' to calculate Levenshtein distance
  2018-04-19 14:55                           ` chen bin
  2018-04-20  4:37                             ` chen bin
@ 2018-04-20  6:01                             ` Thien-Thi Nguyen
  2018-04-20 10:47                             ` chen bin
  2 siblings, 0 replies; 24+ messages in thread
From: Thien-Thi Nguyen @ 2018-04-20  6:01 UTC (permalink / raw)
  To: chen bin; +Cc: emacs-devel

[-- Attachment #1: Type: text/plain, Size: 1178 bytes --]

() chen bin <chenbin.sh@gmail.com>
() Fri, 20 Apr 2018 00:55:15 +1000

   A software project usually contains nomal directory and file
   names.

What you say is perfectly true.

However, i think that truth tends to lead programmers astray, so
that they grow "inwards" (to cater for the "norm", which over
time becomes ever more refined (restricted) in its definition),
and not "upwards" (to stretch the mind (and thus code that is
after all merely mindstuff writ stylishly) to consider what is
similar and what is different, in many (ideally, all) cases).

Of course, "perfect is the enemy of good" so they say, and one
line of code in the repo is worth two in *scratch*, so as always
it's a question of balance.

Anyway, glad to see this functionality added to Emacs.

 friends see eye to eye.
 always?  no.  but heart to heart:
 character distance.

-- 
Thien-Thi Nguyen -----------------------------------------------
 (defun responsep (query)
   (pcase (context query)
     (`(technical ,ml) (correctp ml))
     ...))                              748E A0E8 1CB8 A748 9BFA
--------------------------------------- 6CE4 6703 2224 4C80 7502

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] add 'string-distance' to calculate Levenshtein distance
  2018-04-19 14:55                           ` chen bin
  2018-04-20  4:37                             ` chen bin
  2018-04-20  6:01                             ` Thien-Thi Nguyen
@ 2018-04-20 10:47                             ` chen bin
  2018-04-21  7:22                               ` Eli Zaretskii
  2 siblings, 1 reply; 24+ messages in thread
From: chen bin @ 2018-04-20 10:47 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

[-- Attachment #1: Type: text/plain, Size: 3149 bytes --]

Hi, Eli,

Attached is latest patch. It will use byte compare aglorithm if both
strings are not multi-byte strings.

Regards,
Chen

On Fri, Apr 20, 2018 at 12:55 AM, chen bin <chenbin.sh@gmail.com> wrote:
> Hi, Eli,
> I attached the new patch.
>
> A software project usually contains nomal directory and file names.
> Linux kernel source, for example.
>
> My package `counsel-etags` will use byte compare version.
>
> Regards,
> Chen
>
> On Thu, Apr 19, 2018 at 6:05 PM, Eli Zaretskii <eliz@gnu.org> wrote:
>>> From: chen bin <chenbin.sh@gmail.com>
>>> Date: Tue, 17 Apr 2018 22:31:20 +1000
>>> Cc: emacs-devel@gnu.org
>>>
>>> As you suggested, I re-write the code using 'FETCH_STRING_CHAR_ADVANCE'.
>>
>> Thanks.
>>
>>> I also implemented the byte comparing version. It's 4 times as fast. And I do
>>> need use it to compare file path in my package 'counsel-etags'.
>>
>> File names are just strings for this purpose, and they can potentially
>> include any non-zero characters.  So I don't see why they are special.
>>
>>> The fille path couldn't contain any funny characters (emoji). so
>>> it'sperfectly fine
>>> to use byte comparing version.
>>
>> File names can very well include emoji and other "funny" characters,
>> Emacs supports that on all modern systems (including even MS-Windows).
>>
>>> diff --git a/etc/NEWS b/etc/NEWS
>>> index 5aa92e2991..3cce2c48c7 100644
>>> --- a/etc/NEWS
>>> +++ b/etc/NEWS
>>> @@ -490,6 +490,8 @@ x-lost-selection-hooks, x-sent-selection-hooks
>>>  +++
>>>  ** New function assoc-delete-all.
>>>
>>> +** New function string-distance to calculate Levenshtein distance between two strings.
>>
>> This long line should be filled using the fill-column setting we use
>> in NEWS.  Even better, make the header a short summary, like
>>
>>   ** New function 'string-distance'
>>
>> and then describe its functionality in a separate sentence that starts
>> immediately below that header.
>>
>>> +DEFUN ("string-distance", Fstring_distance, Sstring_distance, 2, 3, 0,
>>> +       doc: /* Return Levenshtein distance between STRING1 and STRING2.
>>> +If BYTECOMPARE is nil, we compare character of strings.
>>> +If BYTECOMPARE is t, we compare byte of strings.
>>
>> Please lose the "we" part, it's inappropriate in documentation,
>> because it describes what Emacs does.
>>
>>> +Comparing by byte is faster and non-ascii characters has weighted distance.
>>
>> I would delete this sentence, it is IMO confusing more than anything
>> else.  (And I still think the bytewise comparison is not needed.)
>>
>>> +  bool use_bytecompare = !NILP(bytecompare);
>>                                 ^^
>> Space between these 2 characters.
>>
>>> +  else
>>> +    {
>>> +      int c1, c2;
>>> +      ptrdiff_t i1, i1_byte, i2, i2_byte;
>>> +      i2 = i2_byte = 0;
>>> +      for (x = 1; x <= len2; x++)
>>
>> Please move the initialization of i2 and i2_byte into the for-loop
>> initializer (suing the comma operator).
>>
>>> +          i1 = i1_byte = 0;
>>> +          for (y = 1, lastdiag = x - 1; y <= len1; y++)
>>
>> Likewise here with i1 and i1_byte.
>>
>> Thanks.
>
>
>
> --
> help me, help you.



-- 
help me, help you.

[-- Attachment #2: 0001-add-api-string-distance.patch --]
[-- Type: text/x-patch, Size: 4802 bytes --]

From 77e34cb3e336c764cd79b67b25464f44f76da6f9 Mon Sep 17 00:00:00 2001
From: Chen Bin <chenbin.sh@gmail.com>
Date: Fri, 20 Apr 2018 00:38:29 +1000
Subject: [PATCH] add api string-distance

---
 etc/NEWS                |  3 +++
 src/fns.c               | 62 +++++++++++++++++++++++++++++++++++++++++++++++++
 test/lisp/subr-tests.el | 18 ++++++++++++++
 3 files changed, 83 insertions(+)

diff --git a/etc/NEWS b/etc/NEWS
index 5aa92e2991..4efb7e983e 100644
--- a/etc/NEWS
+++ b/etc/NEWS
@@ -490,6 +490,9 @@ x-lost-selection-hooks, x-sent-selection-hooks
 +++
 ** New function assoc-delete-all.
 
+** New function string-distance to calculate Levenshtein distance
+between two strings.
+
 ** 'print-quoted' now defaults to t, so if you want to see
 (quote x) instead of 'x you will have to bind it to nil where applicable.
 
diff --git a/src/fns.c b/src/fns.c
index 94b9d984f0..6e851c8555 100644
--- a/src/fns.c
+++ b/src/fns.c
@@ -153,6 +153,67 @@ If STRING is multibyte, this may be greater than the length of STRING.  */)
   return make_number (SBYTES (string));
 }
 
+DEFUN ("string-distance", Fstring_distance, Sstring_distance, 2, 3, 0,
+       doc: /* Return Levenshtein distance between STRING1 and STRING2.
+If BYTECOMPARE is nil, compare character of strings.
+If BYTECOMPARE is t, compare byte of strings.
+Case is significant, but text properties are ignored. */)
+  (Lisp_Object string1, Lisp_Object string2, Lisp_Object bytecompare)
+
+{
+  CHECK_STRING (string1);
+  CHECK_STRING (string2);
+
+  bool use_byte_compare = !NILP (bytecompare)
+    || (!STRING_MULTIBYTE (string1) && !STRING_MULTIBYTE (string2));
+  ptrdiff_t len1 = use_byte_compare? SBYTES (string1) : SCHARS (string1);
+  ptrdiff_t len2 = use_byte_compare? SBYTES (string2) : SCHARS (string2);
+  ptrdiff_t x, y, lastdiag, olddiag;
+
+  USE_SAFE_ALLOCA;
+  ptrdiff_t *column = SAFE_ALLOCA ((len1 + 1) * sizeof (ptrdiff_t));
+  for (y = 1; y <= len1; y++)
+    column[y] = y;
+
+  if (use_byte_compare)
+    {
+      char *s1 = SSDATA (string1);
+      char *s2 = SSDATA (string2);
+
+      for (x = 1; x <= len2; x++)
+        {
+          column[0] = x;
+          for (y = 1, lastdiag = x - 1; y <= len1; y++)
+            {
+              olddiag = column[y];
+              column[y] = min (min (column[y] + 1, column[y-1] + 1), lastdiag + (s1[y-1] == s2[x-1]? 0 : 1));
+              lastdiag = olddiag;
+            }
+        }
+    }
+  else
+    {
+      int c1, c2;
+      ptrdiff_t i1, i1_byte, i2 = 0, i2_byte = 0;
+      for (x = 1; x <= len2; x++)
+        {
+          column[0] = x;
+          FETCH_STRING_CHAR_ADVANCE (c2, string2, i2, i2_byte);
+          i1 = i1_byte = 0;
+          for (y = 1, lastdiag = x - 1; y <= len1; y++)
+            {
+              olddiag = column[y];
+              FETCH_STRING_CHAR_ADVANCE (c1, string1, i1, i1_byte);
+              column[y] = min (min (column[y] + 1, column[y-1] + 1), lastdiag + (c1 == c2? 0 : 1));
+              lastdiag = olddiag;
+            }
+        }
+    }
+
+  SAFE_FREE ();
+  return make_number (column[len1]);
+}
+
 DEFUN ("string-equal", Fstring_equal, Sstring_equal, 2, 2, 0,
        doc: /* Return t if two strings have identical contents.
 Case is significant, but text properties are ignored.
@@ -5226,6 +5287,7 @@ this variable.  */);
   defsubr (&Slength);
   defsubr (&Ssafe_length);
   defsubr (&Sstring_bytes);
+  defsubr (&Sstring_distance);
   defsubr (&Sstring_equal);
   defsubr (&Scompare_strings);
   defsubr (&Sstring_lessp);
diff --git a/test/lisp/subr-tests.el b/test/lisp/subr-tests.el
index 52b61d9fb9..1310f5b2a5 100644
--- a/test/lisp/subr-tests.el
+++ b/test/lisp/subr-tests.el
@@ -281,6 +281,24 @@ subr-test--frames-1
   (should (equal (string-match-p "\\`[[:blank:]]\\'" "\u3000") 0))
   (should-not (string-match-p "\\`[[:blank:]]\\'" "\N{LINE SEPARATOR}")))
 
+(ert-deftest subr-tests--string-distance ()
+  "Test `string-distance' behavior."
+  ;; ASCII characters are always fine
+  (should (equal 1 (string-distance "heelo" "hello")))
+  (should (equal 2 (string-distance "aeelo" "hello")))
+  (should (equal 0 (string-distance "ab" "ab" t)))
+  (should (equal 1 (string-distance "ab" "abc" t)))
+
+  ;; string containing hanzi character, compare by byte
+  (should (equal 6 (string-distance "ab" "ab我她" t)))
+  (should (equal 3 (string-distance "ab" "a我b" t)))
+  (should (equal 3 (string-distance "我" "她" t))
+
+  ;; string containing hanzi character, compare by character
+  (should (equal 2 (string-distance "ab" "ab我她")))
+  (should (equal 1 (string-distance "ab" "a我b")))
+  (should (equal 1 (string-distance "我" "她"))))
+
 (ert-deftest subr-tests--dolist--wrong-number-of-args ()
   "Test that `dolist' doesn't accept wrong types or length of SPEC,
 cf. Bug#25477."
-- 
2.16.3


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH] add 'string-distance' to calculate Levenshtein distance
  2018-04-20 10:47                             ` chen bin
@ 2018-04-21  7:22                               ` Eli Zaretskii
  2018-04-21 20:47                                 ` Juri Linkov
  2018-04-28  7:36                                 ` Eli Zaretskii
  0 siblings, 2 replies; 24+ messages in thread
From: Eli Zaretskii @ 2018-04-21  7:22 UTC (permalink / raw)
  To: chen bin; +Cc: emacs-devel

> From: chen bin <chenbin.sh@gmail.com>
> Date: Fri, 20 Apr 2018 20:47:23 +1000
> Cc: emacs-devel@gnu.org
> 
> Attached is latest patch. It will use byte compare aglorithm if both
> strings are not multi-byte strings.

Thanks, let's give people a few days to comment.



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] add 'string-distance' to calculate Levenshtein distance
  2018-04-21  7:22                               ` Eli Zaretskii
@ 2018-04-21 20:47                                 ` Juri Linkov
  2018-04-28  7:36                                 ` Eli Zaretskii
  1 sibling, 0 replies; 24+ messages in thread
From: Juri Linkov @ 2018-04-21 20:47 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: chen bin, emacs-devel

>> Attached is latest patch. It will use byte compare aglorithm if both
>> strings are not multi-byte strings.
>
> Thanks, let's give people a few days to comment.

I have no comments on this implementation, but I wonder if it's possible
to adapt this algorithm to implement fuzzy search with the distance?
This supposes there is a customizable option that defines a preferred
default distance, and then the search functions search text in the buffer
to find the strings similar to the search string.  I don't know if it
the same how is implemented for “Similarity search” in LibreOffice Writer.
But it seems this could find the same text as char-fold search matching
equivalent characters and ignoring diacritics or glyphless characters
like “word joiner”.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] add 'string-distance' to calculate Levenshtein distance
  2018-04-21  7:22                               ` Eli Zaretskii
  2018-04-21 20:47                                 ` Juri Linkov
@ 2018-04-28  7:36                                 ` Eli Zaretskii
  2018-05-06  9:53                                   ` chen bin
  1 sibling, 1 reply; 24+ messages in thread
From: Eli Zaretskii @ 2018-04-28  7:36 UTC (permalink / raw)
  To: chenbin.sh; +Cc: emacs-devel

> Date: Sat, 21 Apr 2018 10:22:54 +0300
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: emacs-devel@gnu.org
> 
> > From: chen bin <chenbin.sh@gmail.com>
> > Date: Fri, 20 Apr 2018 20:47:23 +1000
> > Cc: emacs-devel@gnu.org
> > 
> > Attached is latest patch. It will use byte compare aglorithm if both
> > strings are not multi-byte strings.
> 
> Thanks, let's give people a few days to comment.

No further comments, so I pushed your changes.

A few comments for your future contributions:

  . Please include with the changes a commit log message, formatted as
    described in CONTRIBUTE.

  . We encourage to include changes for the relevant manuals with the
    code changes; please try doing that in the future.  If you do
    include the relevant manual changes, mark the NEWS entry with
    "+++" to indicate that.

  . The tests should be in the file named after the file where the
    relevant feature(s) is/are defined.  If the code is in src/FOO.c,
    the tests should be in test/src /FOO-tests.el; if the code is in
    lisp/foo/BAR.el, the tests should be in test/lisp/foo/BAR-tests.el.

Thanks.



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] add 'string-distance' to calculate Levenshtein distance
  2018-04-28  7:36                                 ` Eli Zaretskii
@ 2018-05-06  9:53                                   ` chen bin
  0 siblings, 0 replies; 24+ messages in thread
From: chen bin @ 2018-05-06  9:53 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

Noted. Thanks

On Sat, Apr 28, 2018 at 5:36 PM, Eli Zaretskii <eliz@gnu.org> wrote:
>> Date: Sat, 21 Apr 2018 10:22:54 +0300
>> From: Eli Zaretskii <eliz@gnu.org>
>> Cc: emacs-devel@gnu.org
>>
>> > From: chen bin <chenbin.sh@gmail.com>
>> > Date: Fri, 20 Apr 2018 20:47:23 +1000
>> > Cc: emacs-devel@gnu.org
>> >
>> > Attached is latest patch. It will use byte compare aglorithm if both
>> > strings are not multi-byte strings.
>>
>> Thanks, let's give people a few days to comment.
>
> No further comments, so I pushed your changes.
>
> A few comments for your future contributions:
>
>   . Please include with the changes a commit log message, formatted as
>     described in CONTRIBUTE.
>
>   . We encourage to include changes for the relevant manuals with the
>     code changes; please try doing that in the future.  If you do
>     include the relevant manual changes, mark the NEWS entry with
>     "+++" to indicate that.
>
>   . The tests should be in the file named after the file where the
>     relevant feature(s) is/are defined.  If the code is in src/FOO.c,
>     the tests should be in test/src /FOO-tests.el; if the code is in
>     lisp/foo/BAR.el, the tests should be in test/lisp/foo/BAR-tests.el.
>
> Thanks.



-- 
help me, help you.



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] add 'string-distance' to calculate Levenshtein distance
  2018-04-15  7:15           ` Chen Bin
  2018-04-15 14:47             ` Eli Zaretskii
@ 2018-04-15 18:53             ` Paul Eggert
  1 sibling, 0 replies; 24+ messages in thread
From: Paul Eggert @ 2018-04-15 18:53 UTC (permalink / raw)
  To: Chen Bin, Eli Zaretskii; +Cc: emacs-devel

On 04/15/2018 12:15 AM, Chen Bin wrote:
> As 'org-babel-edit-distance' documented, it will "Return the edit
> (levenshtein) distance between strings S1 S2". So the problem here is to
> calculate*Levenshtein distance*.

First, I doubt whether the callers care whether the code computes 
Levenshtein distance, LCS distance, or some other reasonable 
string-distance measure. Second, the Myers-Ukkonen algorithm does 
compute Levenshtein distance; see, for example:

Papamichail D, Papamichail G. Improved algorithms for approximate string 
matching. BMC Bioinformatics. 2009; 10(Suppl 1): S10. 
https://dx.doi.org/10.1186/1471-2105-10-S1-S10 
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2648743/

I don't offhand know whether diffseq.h uses the original Myers-Ukkonen 
algorithm or one of Myers's variations with a different distance 
measure, but if it's the latter and if users really care then we should 
be able to change the algorithm to match the requirements.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] add 'string-distance' to calculate Levenshtein distance
  2018-04-14  2:35 [PATCH] add 'string-distance' to calculate Levenshtein distance Chen Bin
  2018-04-14  7:05 ` Eli Zaretskii
@ 2018-04-14 17:18 ` Nathan Moreau
  2018-04-14 17:36   ` Paul Eggert
  1 sibling, 1 reply; 24+ messages in thread
From: Nathan Moreau @ 2018-04-14 17:18 UTC (permalink / raw)
  To: Chen Bin; +Cc: emacs-devel

Hi,

What is the difference with the code present in lib/diffseq.h?



On 14 April 2018 at 04:35, Chen Bin <chenbin.sh@gmail.com> wrote:
> Hello,
> Attached patch implemented 'string-distance' is much faster then
> existing Lisp implementation like 'org-babel-edit-distance'.
>
> I benchmarked by below snippet:
>
>   (setq s1 "projs1/emacs/etc/imeees/icon/emacs-document.svg")
>   (setq s2 "projs2/test/etc/images/icons/emacs-document23.svg")
>   (let* ((gc-cons-threshold most-positive-fixnum))
>     (message "%S vs %S"
>              (benchmark-run-compiled 100
>                (org-babel-edit-distance s1 s2))
>              (benchmark-run-compiled 100
>                (string-distance s1 s2))))
>
> Got:
> (0.506494223 0 0.0) vs (0.001317414 0 0.0)
>
>
>
> --
> Best Regards,
> Chen Bin
>
> --
> Help me, help you
>



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] add 'string-distance' to calculate Levenshtein distance
  2018-04-14 17:18 ` Nathan Moreau
@ 2018-04-14 17:36   ` Paul Eggert
  2018-04-15 18:17     ` Andreas Politz
  0 siblings, 1 reply; 24+ messages in thread
From: Paul Eggert @ 2018-04-14 17:36 UTC (permalink / raw)
  To: Nathan Moreau, Chen Bin; +Cc: emacs-devel

Nathan Moreau wrote:
> What is the difference with the code present in lib/diffseq.h?

lib/diffseq.h uses the Myers-Ukkonen algorithm that scales better for the common 
case where strings are closely related. If the two strings are length N and 
their Levenshtein distance is D (where D is much less than N), then 
lib/diffseq.h is O(N*D) whereas the proposed algorithm is O(N**2).

So yes, it'd be better if the code used lib/diffseq.h rather than rolled its own 
algorithm.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] add 'string-distance' to calculate Levenshtein distance
  2018-04-14 17:36   ` Paul Eggert
@ 2018-04-15 18:17     ` Andreas Politz
  0 siblings, 0 replies; 24+ messages in thread
From: Andreas Politz @ 2018-04-15 18:17 UTC (permalink / raw)
  To: Paul Eggert; +Cc: Nathan Moreau, emacs-devel, Chen Bin

Paul Eggert <eggert@cs.ucla.edu> writes:

> lib/diffseq.h uses the Myers-Ukkonen algorithm that scales better for
> the common case where strings are closely related. If the two strings
> are length N and their Levenshtein distance is D (where D is much less
> than N), then lib/diffseq.h is O(N*D) whereas the proposed algorithm
> is O(N**2).

The Ukkonen algorithm also allows for D to be an input parameter in form
of a maximal distance it should search for.  This makes this observation
even more important, since most callers, I presume, are only interested
in string-pairs with distances below some threshold.

Incidentally the (only) application of the mentioned Org function
(org-babel-edit-distance) uses D=2 (in Emacs 25.3.1).

Andreas

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2018-05-06  9:53 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2018-04-14  2:35 [PATCH] add 'string-distance' to calculate Levenshtein distance Chen Bin
2018-04-14  7:05 ` Eli Zaretskii
     [not found]   ` <87lgdq831h.fsf@gmail.com>
2018-04-14 13:24     ` Eli Zaretskii
2018-04-14 16:40       ` Chen Bin
2018-04-14 17:08         ` Eli Zaretskii
2018-04-15  7:15           ` Chen Bin
2018-04-15 14:47             ` Eli Zaretskii
     [not found]               ` <CAAE-R+-RDWvyrv+uqHszzh6VMH6An3disOw=PyPWaTnUTHDOCw@mail.gmail.com>
     [not found]                 ` <83k1t72b2o.fsf@gnu.org>
2018-04-17  2:43                   ` chen bin
2018-04-17 15:44                     ` Eli Zaretskii
2018-04-18  7:11                       ` chen bin
     [not found]                   ` <CAAE-R+8s++_LRcQCLX60Z=TQeQHdtbM5X1k525bfNnnPSLDvRw@mail.gmail.com>
     [not found]                     ` <83bmei36dw.fsf@gnu.org>
2018-04-17 12:31                       ` chen bin
2018-04-19  8:05                         ` Eli Zaretskii
2018-04-19 14:55                           ` chen bin
2018-04-20  4:37                             ` chen bin
2018-04-20  6:01                             ` Thien-Thi Nguyen
2018-04-20 10:47                             ` chen bin
2018-04-21  7:22                               ` Eli Zaretskii
2018-04-21 20:47                                 ` Juri Linkov
2018-04-28  7:36                                 ` Eli Zaretskii
2018-05-06  9:53                                   ` chen bin
2018-04-15 18:53             ` Paul Eggert
2018-04-14 17:18 ` Nathan Moreau
2018-04-14 17:36   ` Paul Eggert
2018-04-15 18:17     ` Andreas Politz

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).