From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!.POSTED!not-for-mail From: Chen Bin Newsgroups: gmane.emacs.devel Subject: Re: [PATCH] add 'string-distance' to calculate Levenshtein distance Date: Sun, 15 Apr 2018 02:40:18 +1000 Message-ID: <87o9ilhhcd.fsf@gmail.com> References: <87vacuecrn.fsf@gmail.com> <83po3246ah.fsf@gnu.org> <87lgdq831h.fsf@gmail.com> <83muy553ae.fsf@gnu.org> NNTP-Posting-Host: blaine.gmane.org Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="=-=-=" X-Trace: blaine.gmane.org 1523723913 31516 195.159.176.226 (14 Apr 2018 16:38:33 GMT) X-Complaints-To: usenet@blaine.gmane.org NNTP-Posting-Date: Sat, 14 Apr 2018 16:38:33 +0000 (UTC) User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/25.2 (gnu/linux) Cc: emacs-devel@gnu.org To: Eli Zaretskii Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Sat Apr 14 18:38:29 2018 Return-path: Envelope-to: ged-emacs-devel@m.gmane.org Original-Received: from lists.gnu.org ([208.118.235.17]) by blaine.gmane.org with esmtp (Exim 4.84_2) (envelope-from ) id 1f7OC4-00088d-No for ged-emacs-devel@m.gmane.org; Sat, 14 Apr 2018 18:38:28 +0200 Original-Received: from localhost ([::1]:48062 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1f7OEB-0002Hu-EV for ged-emacs-devel@m.gmane.org; Sat, 14 Apr 2018 12:40:39 -0400 Original-Received: from eggs.gnu.org ([2001:4830:134:3::10]:56119) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1f7OE1-0002Gr-64 for emacs-devel@gnu.org; Sat, 14 Apr 2018 12:40:30 -0400 Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1f7ODy-0000uq-0i for emacs-devel@gnu.org; Sat, 14 Apr 2018 12:40:29 -0400 Original-Received: from mail-pf0-x243.google.com ([2607:f8b0:400e:c00::243]:32855) by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_128_CBC_SHA1:16) (Exim 4.71) (envelope-from ) id 1f7ODx-0000ty-N6; Sat, 14 Apr 2018 12:40:25 -0400 Original-Received: by mail-pf0-x243.google.com with SMTP id f15so8451102pfn.0; Sat, 14 Apr 2018 09:40:25 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:references:date:in-reply-to:message-id :user-agent:mime-version; bh=AcDye9eDBcp7O23rz2R7R4sUqDBf0Jz47tNRTlXQT44=; b=CperxNhSyRBW/kfUMrlNawBs55sJTqZ51kI3wJuR875h8QSHoVpJQnoEYIP4CPJVh3 ASEJVB4NocDYI8yt1CnjdVeFarbAR/RF3VeeI//JXKbuKYAsnWMxVUq/fDkHValNBqsa i0JE+SyaY3l39zpYDoxAcar7QsImpa1we8dgcONPFo4M5F74jNrlIte3ogmn+eR+aTK+ THkqhg0HdMmnbYSFZGMDPDoOkhuSMN7rSlD8JwVidIyLDWAhpTMQBYThf7/WMVvAxzXT 3ylxsS4BczZ9AxwKF0B7xnS4YmhJiHPej935tb+d7OEgwXZFI/36NBolDHrlVxZ46GAW vqbA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:references:date:in-reply-to :message-id:user-agent:mime-version; bh=AcDye9eDBcp7O23rz2R7R4sUqDBf0Jz47tNRTlXQT44=; b=KubkJu6UE9GIRbSgSm8jEjiCPKtRGG+7QpdLUl+8/U0ypXA6yzJIJpMvbZT9imo4M9 oEHPuwOEL42S2DOznXt9Eu2LR36h4RgkxmNv8cj3Q5Fg2t+v0tOru7wJiJ1TWxLTo0Uy Y8WyoSB36Q6jQPrQvWVFYnZbl303fYMp5n43bKiUQRB7VvDapjNZBFtOgy3UJzHahfuU 3rQ7gus0HSuA2PPncymtef2qnlJ9rhWXaHKput2uYN9sqdzKM+26g64xjuhSDLh2J6LI BB9QrQGrDbhu+/adHVjDDLtPzG17KRaKA1nZMy/984ZiJonc45d8Wx6IFff7iRQ4At5V Gj2g== X-Gm-Message-State: ALQs6tDgtD6QkkDAJ2wEIYQMaoIyDFYVSzckHl2wbheabAO69k1sE+yn h/y4JIHG2P4Q2iVSu4F2Xwr4cw== X-Google-Smtp-Source: AIpwx4++TgOSO/Ax6HTHRPJHY2sby6g+5V9WcnP38EYYVwf3e722ZKNIfeVIuPP/+edXbpwq03ITRw== X-Received: by 10.98.21.209 with SMTP id 200mr15514808pfv.232.1523724024098; Sat, 14 Apr 2018 09:40:24 -0700 (PDT) Original-Received: from sydneypc.homepc ([115.187.209.174]) by smtp.gmail.com with ESMTPSA id l80sm18368078pfk.73.2018.04.14.09.40.20 (version=TLS1_2 cipher=ECDHE-RSA-CHACHA20-POLY1305 bits=256/256); Sat, 14 Apr 2018 09:40:22 -0700 (PDT) In-Reply-To: <83muy553ae.fsf@gnu.org> (Eli Zaretskii's message of "Sat, 14 Apr 2018 16:24:41 +0300") X-detected-operating-system: by eggs.gnu.org: Genre and OS details not recognized. X-Received-From: 2607:f8b0:400e:c00::243 X-BeenThere: emacs-devel@gnu.org X-Mailman-Version: 2.1.21 Precedence: list List-Id: "Emacs development discussions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Original-Sender: "Emacs-devel" Xref: news.gmane.org gmane.emacs.devel:224598 Archived-At: --=-=-= Content-Type: text/plain Correct me if I'm wrong. I read cod eand found definion of Lisp_String: struct GCALIGNED Lisp_String { ptrdiff_t size; ptrdiff_t size_byte; INTERVAL intervals; /* Text properties in this string. */ unsigned char *data; }; I understand string text is encoded in UTF8 format and is stored in 'Lisp_String::data'. There is actually no difference between unibyte and multibyte text since UTF8 is compatible with ASCII and we only deal with 'data' field. I attached the latest patch. --=-=-= Content-Type: text/x-diff; charset=utf-8 Content-Disposition: inline; filename=0001-add-api-string-distance.patch Content-Transfer-Encoding: quoted-printable >From 53e6358e1346afd64bad261823f92ae9619b93d2 Mon Sep 17 00:00:00 2001 From: Chen Bin Date: Sun, 15 Apr 2018 02:20:29 +1000 Subject: [PATCH] add api string-distance --- etc/NEWS | 2 ++ src/fns.c | 35 +++++++++++++++++++++++++++++++++++ test/lisp/subr-tests.el | 10 ++++++++++ 3 files changed, 47 insertions(+) diff --git a/etc/NEWS b/etc/NEWS index 12b72eb..75cc92d 100644 --- a/etc/NEWS +++ b/etc/NEWS @@ -463,6 +463,8 @@ x-lost-selection-hooks, x-sent-selection-hooks +++ ** New function assoc-delete-all. =20 +** New function string-distance to calculate Levenshtein distance between = two strings. + ** 'print-quoted' now defaults to t, so if you want to see (quote x) instead of 'x you will have to bind it to nil where applicable. =20 diff --git a/src/fns.c b/src/fns.c index 94b9d98..f149001 100644 --- a/src/fns.c +++ b/src/fns.c @@ -153,6 +153,40 @@ If STRING is multibyte, this may be greater than the l= ength of STRING. */) return make_number (SBYTES (string)); } =20 +DEFUN ("string-distance", Fstring_distance, Sstring_distance, 2, 2, 0, + doc: /* Return Levenshtein distance between STRING1 and STRING2. +Case is significant, but text properties are ignored. +Strings are treated as byte arrays when calculating distance. */) + (Lisp_Object string1, Lisp_Object string2) +{ + CHECK_STRING (string1); + CHECK_STRING (string2); + + char *s1 =3D SSDATA (string1); + char *s2 =3D SSDATA (string2); + + ptrdiff_t s1len, s2len, x, y, lastdiag, olddiag; + s1len =3D SBYTES (string1); + s2len =3D SBYTES (string2); + + USE_SAFE_ALLOCA; + ptrdiff_t *column =3D SAFE_ALLOCA ((s1len + 1) * sizeof (ptrdiff_t)); + for (y =3D 1; y <=3D s1len; y++) + column[y] =3D y; + for (x =3D 1; x <=3D s2len; x++) + { + column[0] =3D x; + for (y =3D 1, lastdiag =3D x - 1; y <=3D s1len; y++) + { + olddiag =3D column[y]; + column[y] =3D min (min (column[y] + 1, column[y-1] + 1), lastdia= g + (s1[y-1] =3D=3D s2[x-1] ? 0 : 1)); + lastdiag =3D olddiag; + } + } + SAFE_FREE (); + return make_number (column[s1len]); +} + DEFUN ("string-equal", Fstring_equal, Sstring_equal, 2, 2, 0, doc: /* Return t if two strings have identical contents. Case is significant, but text properties are ignored. @@ -5226,6 +5260,7 @@ this variable. */); defsubr (&Slength); defsubr (&Ssafe_length); defsubr (&Sstring_bytes); + defsubr (&Sstring_distance); defsubr (&Sstring_equal); defsubr (&Scompare_strings); defsubr (&Sstring_lessp); diff --git a/test/lisp/subr-tests.el b/test/lisp/subr-tests.el index 52b61d9..40a7727 100644 --- a/test/lisp/subr-tests.el +++ b/test/lisp/subr-tests.el @@ -281,6 +281,16 @@ subr-test--frames-1 (should (equal (string-match-p "\\`[[:blank:]]\\'" "\u3000") 0)) (should-not (string-match-p "\\`[[:blank:]]\\'" "\N{LINE SEPARATOR}"))) =20 +(ert-deftest subr-tests--string-distance () + "Test `string-distance' behavior." + (should (equal 1 (string-distance "heelo" "hello"))) + (should (equal 2 (string-distance "aeelo" "hello"))) + (should (equal 0 (string-distance "ab" "ab"))) + (should (equal 1 (string-distance "ab" "abc"))) + ;; string containing unicode character (Hanzi) + (should (equal 6 (string-distance "ab" "ab=E6=88=91=E5=A5=B9"))) + (should (equal 3 (string-distance "=E6=88=91" "=E5=A5=B9")))) + (ert-deftest subr-tests--dolist--wrong-number-of-args () "Test that `dolist' doesn't accept wrong types or length of SPEC, cf. Bug#25477." --=20 2.16.3 --=-=-= Content-Type: text/plain >>>>> "Eli" == Eli Zaretskii writes: Eli> [Please CC the mailing list when you respond, so others could Eli> see your messages.] >> From: Chen Bin Date: Sat, 14 Apr 2018 >> 21:01:46 +1000 >> >> Hi, Eli, Thanks for the review. >> >> I fixed most issues except two things. >> >> 1. In Asia, it's possible to cacluate distance between one >> unibyte and one multibyte string. As a Chinese, I might create a >> document containing Hanzi characters whose file name is obviously >> multibyte string. I may need get the distance of this document to >> a file named "README.txt". Eli> If you mean unibyte pure-ASCII strings, then I agree. But that Eli> doesn't mean we should avoid the text altogether, because we Eli> might compute non-zero distance between a string and its Eli> encoded unibyte variant, which will confuse users. At the very Eli> least the doc string should say something about this. >> 2. Algorithm is based on >> https://en.wikibooks.org/w/index.php?title=Algorithm_Implementation/Strings/Levenshtein_distance&stable=0#C >> It's optimized to use O(min(m,n)) space instead of O(mn). Say >> you compare two string whose string length is 512 bytes. You >> only need allocate 512 bytes instead of 262K (512*512) in memory. >> >> Please check attached patch for latest code. >> >> --- a/etc/NEWS +++ b/etc/NEWS @@ -463,6 +463,8 @@ >> x-lost-selection-hooks, x-sent-selection-hooks +++ ** New >> function assoc-delete-all. >> >> +** New function string-distance. Eli> This should mention Levenshtein distance. >> +DEFUN ("string-distance", Fstring_distance, Sstring_distance, 2, >> 2, 0, + doc: /* Return Levenshtein distance of STRING1 and >> STRING2. Eli> ^^^^^^^^^^^^^^^^^^^^^^ Eli> "between STRING1 and STRING2" >> + unsigned int s1len, s2len, x, y, lastdiag, olddiag; Eli> These variables should be declared EMACS_INT, not unsigned int, Eli> because the size of Emacs strings can be larger than UINT_MAX, Eli> especially on 64-bit systems. >> + unsigned int *column = SAFE_ALLOCA ((s1len + 1) * sizeof >> (unsigned int)); Eli> Likewise here. >> + char *s1 = SSDATA (string1); + char *s2 = SSDATA (string2); + + >> unsigned int s1len, s2len, x, y, lastdiag, olddiag; + s1len = >> strlen(s1); + s2len = strlen(s2); Eli> You could optimize the code by using SCHARS and SBYTES, instead Eli> of calling strlen. >> +(ert-deftest subr-tests--string-distance () + "Test >> `string-distance' behavior." + (should (equal 1 (string-distance >> "heelo" "hello"))) + (should (equal 2 (string-distance "aeelo" >> "hello"))) + (should (equal 0 (string-distance "ab" "ab"))) + >> (should (equal 1 (string-distance "ab" "abc")))) Eli> Could you please add a test or two with non-ASCII characters? Eli> Thanks. -- Best Regards, Chen Bin -- Help me, help you --=-=-=--