From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!.POSTED!not-for-mail
From: Eli Zaretskii <eliz@gnu.org>
Newsgroups: gmane.emacs.devel
Subject: Re: [PATCH] add 'string-distance' to calculate Levenshtein distance
Date: Sun, 15 Apr 2018 17:47:00 +0300
Message-ID: <83bmek4jdn.fsf@gnu.org>
References: <87vacuecrn.fsf@gmail.com> <83po3246ah.fsf@gnu.org>
	<87lgdq831h.fsf@gmail.com> <83muy553ae.fsf@gnu.org>
	<87o9ilhhcd.fsf@gmail.com> <83d0z14sws.fsf@gnu.org>
	<87o9il0wka.fsf@gmail.com>
Reply-To: Eli Zaretskii <eliz@gnu.org>
NNTP-Posting-Host: blaine.gmane.org
X-Trace: blaine.gmane.org 1523803503 9059 195.159.176.226 (15 Apr 2018 14:45:03 GMT)
X-Complaints-To: usenet@blaine.gmane.org
NNTP-Posting-Date: Sun, 15 Apr 2018 14:45:03 +0000 (UTC)
Cc: emacs-devel@gnu.org
To: Chen Bin <chenbin.sh@gmail.com>
Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Sun Apr 15 16:44:59 2018
Return-path: <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>
Envelope-to: ged-emacs-devel@m.gmane.org
Original-Received: from lists.gnu.org ([208.118.235.17])
	by blaine.gmane.org with esmtp (Exim 4.84_2)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>)
	id 1f7itm-0002Ds-8d
	for ged-emacs-devel@m.gmane.org; Sun, 15 Apr 2018 16:44:58 +0200
Original-Received: from localhost ([::1]:47733 helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>)
	id 1f7ivs-0004CK-Sa
	for ged-emacs-devel@m.gmane.org; Sun, 15 Apr 2018 10:47:08 -0400
Original-Received: from eggs.gnu.org ([2001:4830:134:3::10]:41578)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <eliz@gnu.org>) id 1f7ivl-0004Bs-IZ
	for emacs-devel@gnu.org; Sun, 15 Apr 2018 10:47:02 -0400
Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <eliz@gnu.org>) id 1f7ivi-0005PA-Ga
	for emacs-devel@gnu.org; Sun, 15 Apr 2018 10:47:01 -0400
Original-Received: from fencepost.gnu.org ([2001:4830:134:3::e]:45537)
	by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from <eliz@gnu.org>)
	id 1f7ivi-0005P2-Ck; Sun, 15 Apr 2018 10:46:58 -0400
Original-Received: from [176.228.60.248] (port=2999 helo=home-c4e4a596f7)
	by fencepost.gnu.org with esmtpsa (TLS1.2:RSA_AES_256_CBC_SHA1:256)
	(Exim 4.82) (envelope-from <eliz@gnu.org>)
	id 1f7ivh-0007rX-OK; Sun, 15 Apr 2018 10:46:58 -0400
In-reply-to: <87o9il0wka.fsf@gmail.com> (message from Chen Bin on Sun, 15 Apr
	2018 17:15:49 +1000)
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic]
X-Received-From: 2001:4830:134:3::e
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.21
Precedence: list
List-Id: "Emacs development discussions." <emacs-devel.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/archive/html/emacs-devel/>
List-Post: <mailto:emacs-devel@gnu.org>
List-Help: <mailto:emacs-devel-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=subscribe>
Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Original-Sender: "Emacs-devel" <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>
Xref: news.gmane.org gmane.emacs.devel:224618
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/224618>

> From: Chen Bin <chenbin.sh@gmail.com>
> Cc: emacs-devel@gnu.org
> Date: Sun, 15 Apr 2018 17:15:49 +1000
> 
> I attached patch for latest code.

Thanks, we are getting close.

> +DEFUN ("string-distance", Fstring_distance, Sstring_distance, 2, 3, 0,
> +       doc: /* Return Levenshtein distance between STRING1 and STRING2.
> +If BYTECOMPARE is nil, we compare character of strings.
> +If BYTECOMPARE is t, we compare byte of strings.
> +Case is significant, but text properties are ignored. */)
> +  (Lisp_Object string1, Lisp_Object string2, Lisp_Object bytecompare)

I question the need for the BYTECOMPARE flag.  Emacs's editing
operations work in characters, not in bytes.  There's insert-byte, but
no delete-byte or replace-byte (although applications which should for
some reason need that could implement that, albeit not conveniently).
The byte-level operations are not exposed to Lisp for a good reason:
Emacs is primarily a text-processing environment, and text is
processed in character units.

So I think you should remove that option, unless you can explain why
you think it's needed and in what situations.

> +  unsigned short *ws1 = 0; /* 16 bit unicode character */
> +  unsigned short *ws2 = 0; /* 16 bit unicode character */
> +  if(!use_bytecompare)
> +    {
> +      /* convert utf-8 byte stream to 16 bit unicode array */
> +      string1 = code_convert_string_norecord (string1, Qutf_16le, 1);
> +      ws1 = (unsigned short *) SDATA (string1);
> +      string2 = code_convert_string_norecord (string2, Qutf_16le, 1);
> +      ws2 = (unsigned short *) SDATA (string2);
> +    }

Conversion to UTF-16 burns cycles, and the function will do the wrong
thing for characters beyond the BMP as result, because you compare two
16-bit words instead of full characters.

Instead, please use the macros defined on character.h, either
CHAR_STRING_ADVANCE of FETCH_STRING_CHAR_ADVANCE (or their non-ADVANCE
counterparts), whichever better suits your coding style and needs.
These macros produce the full Unicode codepoint of the string
characters, and you can then compare them without bumping into the
problem with UTF-8 or UTF-16 encoding.  The *-ADVANCE variants also
advance the pointer or index to the next character as side effect,
which is handy when examining successive characters in a loop.