From: Eli Zaretskii
Newsgroups: gmane.emacs.devel
Subject: Re: [PATCH] add 'string-distance' to calculate Levenshtein distance
Date: Sat, 14 Apr 2018 16:24:41 +0300
Message-ID: <83muy553ae.fsf@gnu.org>
References: <87vacuecrn.fsf@gmail.com> <83po3246ah.fsf@gnu.org> <87lgdq831h.fsf@gmail.com>
To: Chen Bin
Cc: emacs-devel@gnu.org
In-reply-to: <87lgdq831h.fsf@gmail.com> (message from Chen Bin on Sat, 14 Apr 2018 21:01:46 +1000)

[Please CC the mailing list when you respond, so others could see
your messages.]

> From: Chen Bin
> Date: Sat, 14 Apr 2018 21:01:46 +1000
>
> Hi, Eli,
> Thanks for the review.
>
> I fixed most issues except two things.
>
> 1. In Asia, it's possible to calculate distance between one unibyte and
> one multibyte string. As a Chinese, I might create a document containing
> Hanzi characters whose file name is obviously multibyte string. I may
> need get the distance of this document to a file named "README.txt".

If you mean unibyte pure-ASCII strings, then I agree.  But that
doesn't mean we should avoid the text altogether, because we might
compute non-zero distance between a string and its encoded unibyte
variant, which will confuse users.  At the very least the doc string
should say something about this.

> 2. Algorithm is based on https://en.wikibooks.org/w/index.php?title=Algorithm_Implementation/Strings/Levenshtein_distance&stable=0#C
> It's optimized to use O(min(m,n)) space instead of O(mn).
> Say you compare two strings whose length is 512 bytes.
> You only need allocate 512 bytes instead of 262K (512*512)
> in memory.
>
> Please check attached patch for latest code.
>
> --- a/etc/NEWS
> +++ b/etc/NEWS
> @@ -463,6 +463,8 @@ x-lost-selection-hooks, x-sent-selection-hooks
>  +++
>  ** New function assoc-delete-all.
>
> +** New function string-distance.

This should mention Levenshtein distance.
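
For reference, the single-column technique described in point 2 above
can be sketched as standalone C.  This is only an illustration under
stated assumptions, not the code from the patch: the name
levenshtein_bytes is hypothetical, malloc stands in for SAFE_ALLOCA,
the comparison is on raw bytes, and lengths are passed in as
ptrdiff_t instead of being measured with strlen.  The arguments are
swapped when needed so that the column is sized by the shorter input.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper, not the patch's code: Levenshtein distance
   between two byte buffers, keeping a single column of
   min (LEN1, LEN2) + 1 cells.  Return -1 if allocation fails.  */
ptrdiff_t
levenshtein_bytes (const char *s1, ptrdiff_t len1,
                   const char *s2, ptrdiff_t len2)
{
  /* The distance is symmetric, so make S1 the shorter buffer.  */
  if (len1 > len2)
    {
      const char *ts = s1; s1 = s2; s2 = ts;
      ptrdiff_t tl = len1; len1 = len2; len2 = tl;
    }

  ptrdiff_t *column = malloc ((len1 + 1) * sizeof *column);
  if (!column)
    return -1;

  /* Distance from the empty prefix of S2 to each prefix of S1.  */
  for (ptrdiff_t y = 0; y <= len1; y++)
    column[y] = y;

  for (ptrdiff_t x = 1; x <= len2; x++)
    {
      /* LASTDIAG holds the cell diagonally up-left of column[y].  */
      ptrdiff_t lastdiag = column[0];
      column[0] = x;
      for (ptrdiff_t y = 1; y <= len1; y++)
        {
          ptrdiff_t olddiag = column[y];
          ptrdiff_t best = column[y] + 1;            /* skip a byte of S2 */
          if (column[y - 1] + 1 < best)
            best = column[y - 1] + 1;                /* skip a byte of S1 */
          ptrdiff_t sub = lastdiag + (s1[y - 1] != s2[x - 1]);
          if (sub < best)
            best = sub;                              /* match or substitute */
          column[y] = best;
          lastdiag = olddiag;
        }
    }

  ptrdiff_t result = column[len1];
  free (column);
  return result;
}
```

With the inputs from the proposed ERT test, calling
levenshtein_bytes ("heelo", 5, "hello", 5) yields 1 and
levenshtein_bytes ("ab", 2, "abc", 3) yields 1.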
> +DEFUN ("string-distance", Fstring_distance, Sstring_distance, 2, 2, 0,
> +       doc: /* Return Levenshtein distance of STRING1 and STRING2.
                                              ^^^^^^^^^^^^^^^^^^^^^^
"between STRING1 and STRING2"

> +  unsigned int s1len, s2len, x, y, lastdiag, olddiag;

These variables should be declared EMACS_INT, not unsigned int,
because the size of Emacs strings can be larger than UINT_MAX,
especially on 64-bit systems.

> +  unsigned int *column = SAFE_ALLOCA ((s1len + 1) * sizeof (unsigned int));

Likewise here.

> +  char *s1 = SSDATA (string1);
> +  char *s2 = SSDATA (string2);
> +
> +  unsigned int s1len, s2len, x, y, lastdiag, olddiag;
> +  s1len = strlen(s1);
> +  s2len = strlen(s2);

You could optimize the code by using SCHARS and SBYTES, instead of
calling strlen.

> +(ert-deftest subr-tests--string-distance ()
> +  "Test `string-distance' behavior."
> +  (should (equal 1 (string-distance "heelo" "hello")))
> +  (should (equal 2 (string-distance "aeelo" "hello")))
> +  (should (equal 0 (string-distance "ab" "ab")))
> +  (should (equal 1 (string-distance "ab" "abc"))))

Could you please add a test or two with non-ASCII characters?

Thanks.
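
To illustrate why non-ASCII tests matter, here is a rough standalone
sketch of how a byte-wise and a character-wise distance can differ
for the same pair of strings.  It assumes the inputs are valid,
NUL-terminated UTF-8, and every name in it (decode_utf8,
expand_bytes, distance_units, utf8_distance) is hypothetical; it does
not use SCHARS, SBYTES, or any other Emacs internals.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Decode S, assumed to be valid NUL-terminated UTF-8, into OUT, which
   must have room for strlen (S) elements.  Return the number of code
   points written.  No validation is attempted.  */
static ptrdiff_t
decode_utf8 (const char *s, uint32_t *out)
{
  const unsigned char *p = (const unsigned char *) s;
  ptrdiff_t n = 0;
  while (*p)
    {
      uint32_t c;
      int extra;
      if (*p < 0x80)
        { c = *p; extra = 0; }
      else if ((*p & 0xE0) == 0xC0)
        { c = *p & 0x1F; extra = 1; }
      else if ((*p & 0xF0) == 0xE0)
        { c = *p & 0x0F; extra = 2; }
      else
        { c = *p & 0x07; extra = 3; }
      p++;
      while (extra-- > 0 && *p)
        c = (c << 6) | (*p++ & 0x3F);
      out[n++] = c;
    }
  return n;
}

/* Copy each byte of S into OUT as its own unit; return the count.  */
static ptrdiff_t
expand_bytes (const char *s, uint32_t *out)
{
  ptrdiff_t n = (ptrdiff_t) strlen (s);
  for (ptrdiff_t i = 0; i < n; i++)
    out[i] = (unsigned char) s[i];
  return n;
}

/* Single-column Levenshtein distance over arrays of units.
   Return -1 if allocation fails.  */
static ptrdiff_t
distance_units (const uint32_t *a, ptrdiff_t alen,
                const uint32_t *b, ptrdiff_t blen)
{
  ptrdiff_t *column = malloc ((alen + 1) * sizeof *column);
  if (!column)
    return -1;
  for (ptrdiff_t y = 0; y <= alen; y++)
    column[y] = y;
  for (ptrdiff_t x = 1; x <= blen; x++)
    {
      ptrdiff_t lastdiag = column[0];
      column[0] = x;
      for (ptrdiff_t y = 1; y <= alen; y++)
        {
          ptrdiff_t olddiag = column[y];
          ptrdiff_t best = column[y] + 1;
          if (column[y - 1] + 1 < best)
            best = column[y - 1] + 1;
          ptrdiff_t sub = lastdiag + (a[y - 1] != b[x - 1]);
          if (sub < best)
            best = sub;
          column[y] = best;
          lastdiag = olddiag;
        }
    }
  ptrdiff_t result = column[alen];
  free (column);
  return result;
}

/* Distance between S1 and S2, counting characters if BY_CHARS is
   nonzero and raw bytes otherwise.  Return -1 if allocation fails.  */
ptrdiff_t
utf8_distance (const char *s1, const char *s2, int by_chars)
{
  uint32_t *u1 = malloc ((strlen (s1) + 1) * sizeof *u1);
  uint32_t *u2 = malloc ((strlen (s2) + 1) * sizeof *u2);
  ptrdiff_t result = -1;
  if (u1 && u2)
    {
      ptrdiff_t n1 = by_chars ? decode_utf8 (s1, u1) : expand_bytes (s1, u1);
      ptrdiff_t n2 = by_chars ? decode_utf8 (s2, u2) : expand_bytes (s2, u2);
      result = distance_units (u1, n1, u2, n2);
    }
  free (u1);
  free (u2);
  return result;
}
```

For "caf\xc3\xa9" (the UTF-8 encoding of "café") against "cafe",
utf8_distance returns 1 when counting characters and 2 when counting
bytes, which is the kind of discrepancy a non-ASCII test would catch.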