From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!.POSTED!not-for-mail From: Eli Zaretskii Newsgroups: gmane.emacs.devel Subject: Re: [PATCH] add 'string-distance' to calculate Levenshtein distance Date: Sun, 15 Apr 2018 17:47:00 +0300 Message-ID: <83bmek4jdn.fsf@gnu.org> References: <87vacuecrn.fsf@gmail.com> <83po3246ah.fsf@gnu.org> <87lgdq831h.fsf@gmail.com> <83muy553ae.fsf@gnu.org> <87o9ilhhcd.fsf@gmail.com> <83d0z14sws.fsf@gnu.org> <87o9il0wka.fsf@gmail.com> Reply-To: Eli Zaretskii NNTP-Posting-Host: blaine.gmane.org X-Trace: blaine.gmane.org 1523803503 9059 195.159.176.226 (15 Apr 2018 14:45:03 GMT) X-Complaints-To: usenet@blaine.gmane.org NNTP-Posting-Date: Sun, 15 Apr 2018 14:45:03 +0000 (UTC) Cc: emacs-devel@gnu.org To: Chen Bin Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Sun Apr 15 16:44:59 2018 Return-path: Envelope-to: ged-emacs-devel@m.gmane.org Original-Received: from lists.gnu.org ([208.118.235.17]) by blaine.gmane.org with esmtp (Exim 4.84_2) (envelope-from ) id 1f7itm-0002Ds-8d for ged-emacs-devel@m.gmane.org; Sun, 15 Apr 2018 16:44:58 +0200 Original-Received: from localhost ([::1]:47733 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1f7ivs-0004CK-Sa for ged-emacs-devel@m.gmane.org; Sun, 15 Apr 2018 10:47:08 -0400 Original-Received: from eggs.gnu.org ([2001:4830:134:3::10]:41578) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1f7ivl-0004Bs-IZ for emacs-devel@gnu.org; Sun, 15 Apr 2018 10:47:02 -0400 Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1f7ivi-0005PA-Ga for emacs-devel@gnu.org; Sun, 15 Apr 2018 10:47:01 -0400 Original-Received: from fencepost.gnu.org ([2001:4830:134:3::e]:45537) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1f7ivi-0005P2-Ck; Sun, 15 Apr 2018 10:46:58 -0400 Original-Received: from [176.228.60.248] (port=2999 helo=home-c4e4a596f7) by fencepost.gnu.org with esmtpsa (TLS1.2:RSA_AES_256_CBC_SHA1:256) (Exim 4.82) (envelope-from ) id 1f7ivh-0007rX-OK; Sun, 15 Apr 2018 10:46:58 -0400 In-reply-to: <87o9il0wka.fsf@gmail.com> (message from Chen Bin on Sun, 15 Apr 2018 17:15:49 +1000) X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic] X-Received-From: 2001:4830:134:3::e X-BeenThere: emacs-devel@gnu.org X-Mailman-Version: 2.1.21 Precedence: list List-Id: "Emacs development discussions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Original-Sender: "Emacs-devel" Xref: news.gmane.org gmane.emacs.devel:224618 Archived-At: > From: Chen Bin > Cc: emacs-devel@gnu.org > Date: Sun, 15 Apr 2018 17:15:49 +1000 > > I attached patch for latest code. Thanks, we are getting close. > +DEFUN ("string-distance", Fstring_distance, Sstring_distance, 2, 3, 0, > + doc: /* Return Levenshtein distance between STRING1 and STRING2. > +If BYTECOMPARE is nil, we compare character of strings. > +If BYTECOMPARE is t, we compare byte of strings. > +Case is significant, but text properties are ignored. */) > + (Lisp_Object string1, Lisp_Object string2, Lisp_Object bytecompare) I question the need for the BYTECOMPARE flag. Emacs's editing operations work in characters, not in bytes. There's insert-byte, but no delete-byte or replace-byte (although applications which should for some reason need that could implement that, albeit not conveniently). The byte-level operations are not exposed to Lisp for a good reason: Emacs is primarily a text-processing environment, and text is processed in character units. So I think you should remove that option, unless you can explain why you think it's needed and in what situations. > + unsigned short *ws1 = 0; /* 16 bit unicode character */ > + unsigned short *ws2 = 0; /* 16 bit unicode character */ > + if(!use_bytecompare) > + { > + /* convert utf-8 byte stream to 16 bit unicode array */ > + string1 = code_convert_string_norecord (string1, Qutf_16le, 1); > + ws1 = (unsigned short *) SDATA (string1); > + string2 = code_convert_string_norecord (string2, Qutf_16le, 1); > + ws2 = (unsigned short *) SDATA (string2); > + } Conversion to UTF-16 burns cycles, and the function will do the wrong thing for characters beyond the BMP as result, because you compare two 16-bit words instead of full characters. Instead, please use the macros defined on character.h, either CHAR_STRING_ADVANCE of FETCH_STRING_CHAR_ADVANCE (or their non-ADVANCE counterparts), whichever better suits your coding style and needs. These macros produce the full Unicode codepoint of the string characters, and you can then compare them without bumping into the problem with UTF-8 or UTF-16 encoding. The *-ADVANCE variants also advance the pointer or index to the next character as side effect, which is handy when examining successive characters in a loop.