From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!.POSTED!not-for-mail From: Chen Bin Newsgroups: gmane.emacs.devel Subject: Re: [PATCH] add 'string-distance' to calculate Levenshtein distance Date: Sun, 15 Apr 2018 17:15:49 +1000 Message-ID: <87o9il0wka.fsf@gmail.com> References: <87vacuecrn.fsf@gmail.com> <83po3246ah.fsf@gnu.org> <87lgdq831h.fsf@gmail.com> <83muy553ae.fsf@gnu.org> <87o9ilhhcd.fsf@gmail.com> <83d0z14sws.fsf@gnu.org> NNTP-Posting-Host: blaine.gmane.org Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="=-=-=" X-Trace: blaine.gmane.org 1523776443 6913 195.159.176.226 (15 Apr 2018 07:14:03 GMT) X-Complaints-To: usenet@blaine.gmane.org NNTP-Posting-Date: Sun, 15 Apr 2018 07:14:03 +0000 (UTC) User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/25.2 (gnu/linux) Cc: emacs-devel@gnu.org To: Eli Zaretskii Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Sun Apr 15 09:13:59 2018 Return-path: Envelope-to: ged-emacs-devel@m.gmane.org Original-Received: from lists.gnu.org ([208.118.235.17]) by blaine.gmane.org with esmtp (Exim 4.84_2) (envelope-from ) id 1f7brK-0001en-KU for ged-emacs-devel@m.gmane.org; Sun, 15 Apr 2018 09:13:58 +0200 Original-Received: from localhost ([::1]:47698 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1f7btR-00006R-26 for ged-emacs-devel@m.gmane.org; Sun, 15 Apr 2018 03:16:09 -0400 Original-Received: from eggs.gnu.org ([2001:4830:134:3::10]:44494) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1f7btJ-00006D-58 for emacs-devel@gnu.org; Sun, 15 Apr 2018 03:16:02 -0400 Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1f7btF-00034g-V4 for emacs-devel@gnu.org; Sun, 15 Apr 2018 03:16:01 -0400 Original-Received: from mail-pf0-x243.google.com ([2607:f8b0:400e:c00::243]:40675) by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_128_CBC_SHA1:16) (Exim 4.71) (envelope-from ) id 1f7btF-00033d-Jv; Sun, 15 Apr 2018 03:15:57 -0400 Original-Received: by mail-pf0-x243.google.com with SMTP id y66so9115391pfi.7; Sun, 15 Apr 2018 00:15:57 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:references:date:in-reply-to:message-id :user-agent:mime-version; bh=uWXTvcBBx/N6eYQ2S5BpTh/YF21NLMxbxt5LBIObM5Q=; b=EO1fJ3eD/x2ziOLYzTXo/s56yVmMXhQ83o5te0nibcBouJKMWCi7+ECYiWN0eH+fMQ SCsbqFKpoA6JFdhkZT6weLvF0B7NKiCv3SzzZWABeL9ZdJjC+VABe4Wl8JogosISs7mJ jhS6PoAK/rZrpWWq/bAo8rsP1n23rY91zwp5RuUC6h+zklyE+gULpfhlsjPK+Frt28hQ dXXKLsVQosKF3lnu5k4CCDQYBPx5/rp3gy40K1pY7YxP6Kfs1eyc297Q43j+lGs+Rggf mtcEZyWpBNWVUfeFd40zvVaivti6pXNyozUH22p0BgON2RmbV7THYEUpoLr6yF/yef4B AZLA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:references:date:in-reply-to :message-id:user-agent:mime-version; bh=uWXTvcBBx/N6eYQ2S5BpTh/YF21NLMxbxt5LBIObM5Q=; b=s8Oh9M7eG3iWG6Q7EoY+JjRnPg1FKBh844AYM46sPGVXEas3Sc26dThSKWl9LlfQTg 0Ku5CAkV5nJ5e1temxnyhsQGfcxiWHvxETVRRkm82rqWhWFXxTSbmlioDYvyegrr5RIM rOJ/Hl3QKUo1TLR47cNf0eSKgcmONd3bGKwrF7agGuiRvzBFjOxZw4/pWVJ6pqy50EMj 1GOaLlKy+i5wlJAjqYiAy4b1pTVWW9fyjFTIucqO7dFaTKz0OQqR62pDTrqBe69mh5B6 wBCHPNw5nsR4MFQErO+wGZ5ZTifcJRNRBLmF0cRjJb1Iq9PDZboQGGjzXPUHHJ2Al7nR 8S7w== X-Gm-Message-State: ALQs6tAYIYXU6XA28QkIqivXeVg8dTQsTT4JjdVXC60tFknh35GWxGPK UMbQ89yVUUvaxxCb6L9UtGlOXw== X-Google-Smtp-Source: AIpwx49WDOWxgptm0JHIT4GU0R2H2UZ8dBa14HJTln4g10gHCoFS7yUXNN3U9+xk1+5PXBDuf/xB0w== X-Received: by 10.99.113.20 with SMTP id m20mr1171692pgc.144.1523776555747; Sun, 15 Apr 2018 00:15:55 -0700 (PDT) Original-Received: from sydneypc.homepc ([115.187.209.174]) by smtp.gmail.com with ESMTPSA id x25sm4827146pfn.124.2018.04.15.00.15.52 (version=TLS1_2 cipher=ECDHE-RSA-CHACHA20-POLY1305 bits=256/256); Sun, 15 Apr 2018 00:15:53 -0700 (PDT) In-Reply-To: <83d0z14sws.fsf@gnu.org> (Eli Zaretskii's message of "Sat, 14 Apr 2018 20:08:51 +0300") X-detected-operating-system: by eggs.gnu.org: Genre and OS details not recognized. X-Received-From: 2607:f8b0:400e:c00::243 X-BeenThere: emacs-devel@gnu.org X-Mailman-Version: 2.1.21 Precedence: list List-Id: "Emacs development discussions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Original-Sender: "Emacs-devel" Xref: news.gmane.org gmane.emacs.devel:224615 Archived-At: --=-=-= Content-Type: text/plain I attached patch for latest code. As required, now by default we compare the string by character. So it can handle multibyte character correctly. 'string-distance' has third optional parameter 'bytecompare' which use byte comparing instead of default character comparing. I did read this thread carefully and I did know different algorithms exist to calculate string distance. But I feel only my code is the right solution to our original problem. Our original problem is to provide a faster C solution to calculate *Levenshtein distance*. It will replace 3rd party Lisp solution like 'org-babel-edit-distance'. As 'org-babel-edit-distance' documented, it will "Return the edit (levenshtein) distance between strings S1 S2". So the problem here is to calculate *Levenshtein distance*. Other algorithms have their own definition of "string distance" so they are not for *Levenshtein distance*. Here is link of "Comparison of String Distance Algorithms": https://www.joyofdata.de/blog/comparison-of-string-distance-algorithms/ --=-=-= Content-Type: text/x-diff; charset=utf-8 Content-Disposition: inline; filename=0001-add-api-string-distance.patch Content-Transfer-Encoding: quoted-printable >From 657ee7f92e5bfd6f5138e9d0b030bd80ff963b16 Mon Sep 17 00:00:00 2001 From: Chen Bin Date: Sun, 15 Apr 2018 02:20:29 +1000 Subject: [PATCH] add api string-distance --- etc/NEWS | 2 ++ src/fns.c | 49 +++++++++++++++++++++++++++++++++++++++++++++= ++++ test/lisp/subr-tests.el | 18 ++++++++++++++++++ 3 files changed, 69 insertions(+) diff --git a/etc/NEWS b/etc/NEWS index 0c4daee9ac..7c9f3feaa7 100644 --- a/etc/NEWS +++ b/etc/NEWS @@ -484,6 +484,8 @@ x-lost-selection-hooks, x-sent-selection-hooks +++ ** New function assoc-delete-all. =20 +** New function string-distance to calculate Levenshtein distance between = two strings. + ** 'print-quoted' now defaults to t, so if you want to see (quote x) instead of 'x you will have to bind it to nil where applicable. =20 diff --git a/src/fns.c b/src/fns.c index 94b9d984f0..ce1e5be1ed 100644 --- a/src/fns.c +++ b/src/fns.c @@ -153,6 +153,54 @@ If STRING is multibyte, this may be greater than the l= ength of STRING. */) return make_number (SBYTES (string)); } =20 +DEFUN ("string-distance", Fstring_distance, Sstring_distance, 2, 3, 0, + doc: /* Return Levenshtein distance between STRING1 and STRING2. +If BYTECOMPARE is nil, we compare character of strings. +If BYTECOMPARE is t, we compare byte of strings. +Case is significant, but text properties are ignored. */) + (Lisp_Object string1, Lisp_Object string2, Lisp_Object bytecompare) +{ + CHECK_STRING (string1); + CHECK_STRING (string2); + + bool use_bytecompare =3D !NILP(bytecompare); + bool same; + char *s1 =3D SSDATA (string1); + char *s2 =3D SSDATA (string2); + + ptrdiff_t len1 =3D use_bytecompare? SBYTES (string1) : SCHARS (string1); + ptrdiff_t len2 =3D use_bytecompare? SBYTES (string2) : SCHARS (string2); + ptrdiff_t x, y, lastdiag, olddiag; + unsigned short *ws1 =3D 0; /* 16 bit unicode character */ + unsigned short *ws2 =3D 0; /* 16 bit unicode character */ + if(!use_bytecompare) + { + /* convert utf-8 byte stream to 16 bit unicode array */ + string1 =3D code_convert_string_norecord (string1, Qutf_16le, 1); + ws1 =3D (unsigned short *) SDATA (string1); + string2 =3D code_convert_string_norecord (string2, Qutf_16le, 1); + ws2 =3D (unsigned short *) SDATA (string2); + } + + USE_SAFE_ALLOCA; + ptrdiff_t *column =3D SAFE_ALLOCA ((len1 + 1) * sizeof (ptrdiff_t)); + for (y =3D 1; y <=3D len1; y++) + column[y] =3D y; + for (x =3D 1; x <=3D len2; x++) + { + column[0] =3D x; + for (y =3D 1, lastdiag =3D x - 1; y <=3D len1; y++) + { + olddiag =3D column[y]; + same =3D use_bytecompare? (s1[y-1] =3D=3D s2[x-1]) : (ws1[y-1] = =3D=3D ws2[x-1]); + column[y] =3D min (min (column[y] + 1, column[y-1] + 1), lastdia= g + (same? 0 : 1)); + lastdiag =3D olddiag; + } + } + SAFE_FREE (); + return make_number (column[len1]); +} + DEFUN ("string-equal", Fstring_equal, Sstring_equal, 2, 2, 0, doc: /* Return t if two strings have identical contents. Case is significant, but text properties are ignored. @@ -5226,6 +5274,7 @@ this variable. */); defsubr (&Slength); defsubr (&Ssafe_length); defsubr (&Sstring_bytes); + defsubr (&Sstring_distance); defsubr (&Sstring_equal); defsubr (&Scompare_strings); defsubr (&Sstring_lessp); diff --git a/test/lisp/subr-tests.el b/test/lisp/subr-tests.el index 52b61d9fb9..1310f5b2a5 100644 --- a/test/lisp/subr-tests.el +++ b/test/lisp/subr-tests.el @@ -281,6 +281,24 @@ subr-test--frames-1 (should (equal (string-match-p "\\`[[:blank:]]\\'" "\u3000") 0)) (should-not (string-match-p "\\`[[:blank:]]\\'" "\N{LINE SEPARATOR}"))) =20 +(ert-deftest subr-tests--string-distance () + "Test `string-distance' behavior." + ;; ASCII characters are always fine + (should (equal 1 (string-distance "heelo" "hello"))) + (should (equal 2 (string-distance "aeelo" "hello"))) + (should (equal 0 (string-distance "ab" "ab" t))) + (should (equal 1 (string-distance "ab" "abc" t))) + + ;; string containing hanzi character, compare by byte + (should (equal 6 (string-distance "ab" "ab=E6=88=91=E5=A5=B9" t))) + (should (equal 3 (string-distance "ab" "a=E6=88=91b" t))) + (should (equal 3 (string-distance "=E6=88=91" "=E5=A5=B9" t)) + + ;; string containing hanzi character, compare by character + (should (equal 2 (string-distance "ab" "ab=E6=88=91=E5=A5=B9"))) + (should (equal 1 (string-distance "ab" "a=E6=88=91b"))) + (should (equal 1 (string-distance "=E6=88=91" "=E5=A5=B9")))) + (ert-deftest subr-tests--dolist--wrong-number-of-args () "Test that `dolist' doesn't accept wrong types or length of SPEC, cf. Bug#25477." --=20 2.16.3 --=-=-= Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable >>>>> "Eli" =3D=3D Eli Zaretskii writes: >> From: Chen Bin Cc: emacs-devel@gnu.org >> Date: Sun, 15 Apr 2018 02:40:18 +1000 >>=20 >> Correct me if I'm wrong. >>=20 >> I read cod eand found definion of Lisp_String: struct GCALIGNED >> Lisp_String { ptrdiff_t size; ptrdiff_t size_byte; INTERVAL >> intervals; /* Text properties in this string. */ unsigned char >> *data; }; >>=20 >> I understand string text is encoded in UTF8 format and is stored >> in 'Lisp_String::data'. There is actually no difference between >> unibyte and multibyte text since UTF8 is compatible with ASCII >> and we only deal with 'data' field. Eli> No, that's incorrect. The difference does exist, it just all Eli> but disappear for unibyte strings encoded in UTF-8. But if you Eli> encode a string in some other encoding, like Latin-1, you will Eli> see a very different stream of bytes. >> I attached the latest patch. Eli> Thanks. >> + ;; string containing unicode character (Hanzi) + (should (equal >> 6 (string-distance "ab" "ab=E6=88=91=E5=A5=B9"))) + (should (equal 3 >> (string-distance "=E6=88=91" "=E5=A5=B9")))) Eli> Should the distance be measured in bytes or in characters? I Eli> think it's the latter, in which case the implementation should Eli> work in characters, not bytes. --=20 Best Regards, Chen Bin -- Help me, help you --=-=-=--