From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!.POSTED!not-for-mail From: Eli Zaretskii Newsgroups: gmane.emacs.devel Subject: Re: [PATCH] add 'string-distance' to calculate Levenshtein distance Date: Sat, 14 Apr 2018 10:05:10 +0300 Message-ID: <83po3246ah.fsf@gnu.org> References: <87vacuecrn.fsf@gmail.com> Reply-To: Eli Zaretskii NNTP-Posting-Host: blaine.gmane.org X-Trace: blaine.gmane.org 1523689432 30066 195.159.176.226 (14 Apr 2018 07:03:52 GMT) X-Complaints-To: usenet@blaine.gmane.org NNTP-Posting-Date: Sat, 14 Apr 2018 07:03:52 +0000 (UTC) Cc: emacs-devel@gnu.org To: Chen Bin Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Sat Apr 14 09:03:48 2018 Return-path: Envelope-to: ged-emacs-devel@m.gmane.org Original-Received: from lists.gnu.org ([208.118.235.17]) by blaine.gmane.org with esmtp (Exim 4.84_2) (envelope-from ) id 1f7FDv-0007ip-EH for ged-emacs-devel@m.gmane.org; Sat, 14 Apr 2018 09:03:47 +0200 Original-Received: from localhost ([::1]:38458 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1f7FG2-0001wt-5i for ged-emacs-devel@m.gmane.org; Sat, 14 Apr 2018 03:05:58 -0400 Original-Received: from eggs.gnu.org ([2001:4830:134:3::10]:39302) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1f7FFO-0001uB-0T for emacs-devel@gnu.org; Sat, 14 Apr 2018 03:05:19 -0400 Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1f7FFK-0007qF-Qm for emacs-devel@gnu.org; Sat, 14 Apr 2018 03:05:17 -0400 Original-Received: from fencepost.gnu.org ([2001:4830:134:3::e]:43799) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1f7FFK-0007pt-CQ; Sat, 14 Apr 2018 03:05:14 -0400 Original-Received: from [176.228.60.248] (port=3714 helo=home-c4e4a596f7) by fencepost.gnu.org with esmtpsa (TLS1.2:RSA_AES_256_CBC_SHA1:256) (Exim 4.82) (envelope-from ) id 1f7FFG-0004c0-MQ; Sat, 14 Apr 2018 03:05:11 -0400 In-reply-to: <87vacuecrn.fsf@gmail.com> (message from Chen Bin on Sat, 14 Apr 2018 12:35:08 +1000) X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic] X-Received-From: 2001:4830:134:3::e X-BeenThere: emacs-devel@gnu.org X-Mailman-Version: 2.1.21 Precedence: list List-Id: "Emacs development discussions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Original-Sender: "Emacs-devel" Xref: news.gmane.org gmane.emacs.devel:224583 Archived-At: > From: Chen Bin > Date: Sat, 14 Apr 2018 12:35:08 +1000 > > Attached patch implemented 'string-distance' is much faster then > existing Lisp implementation like 'org-babel-edit-distance'. Thanks, please see a few comments below. > +DEFUN ("string-distance", Fstring_distance, Sstring_distance, 2, 2, 0, > + doc: /* Return Levenshtein distance of two strings. Please mention the arguments in the first line of the doc string. > +Case is significant, but text properties are ignored. */) > + (register Lisp_Object string1, Lisp_Object string2) We don't like to use 'register' in new code, we prefer leaving this to the compiler. > + CHECK_STRING (string1); > + CHECK_STRING (string2); > + > + char *s1 = SSDATA (string1); > + char *s2 = SSDATA (string2); > + unsigned int s1len, s2len, x, y, lastdiag, olddiag; > + s1len = strlen(s1); > + s2len = strlen(s2); I presume this function is supposed to work only on strings in their internal representation (a.k.a. "multibyte strings"), so the function should check that and signal an error if that's not so. Alternatively, check that either both strings are unibyte or both are multibyte, and signal an error if not. > + unsigned int column[s1len+1]; I'm not sure this is portable enough, but even if it is, it's not a good idea to unconditionally make automatic variables in this case, because s1len and s2len could be very large, in which case you will get stack overflow. Please use SAFE_ALLOCA instead. > + for (y = 1; y <= s1len; y++) > + column[y] = y; > + for (x = 1; x <= s2len; x++) { > + column[0] = x; > + for (y = 1, lastdiag = x-1; y <= s1len; y++) { > + olddiag = column[y]; > + column[y] = min3(column[y] + 1, column[y-1] + 1, lastdiag + (s1[y-1] == s2[x-1] ? 0 : 1)); > + lastdiag = olddiag; Is this algorithm really better than allocating just the diag matrix (or part thereof), and accessing the string data directly, without copying them inti local arrays? > + return make_number(column[s1len]); Style: we leave a single blank between the function's name and the following opening parenthesis (you have a few of these scattered through the code). Finally, this new primitives needs documentation (NEWS and the ELisp manual), and also a couple of tests in the test suite. Thanks again for working on this.