From: Eli Zaretskii
Newsgroups: gmane.emacs.devel
Subject: Re: [PATCH] add 'string-distance' to calculate Levenshtein distance
Date: Sat, 14 Apr 2018 20:08:51 +0300
Message-ID: <83d0z14sws.fsf@gnu.org>
References: <87vacuecrn.fsf@gmail.com> <83po3246ah.fsf@gnu.org> <87lgdq831h.fsf@gmail.com> <83muy553ae.fsf@gnu.org> <87o9ilhhcd.fsf@gmail.com>
In-reply-to: <87o9ilhhcd.fsf@gmail.com> (message from Chen Bin on Sun, 15 Apr 2018 02:40:18 +1000)
To: Chen Bin
Cc: emacs-devel@gnu.org

> From: Chen Bin
> Cc: emacs-devel@gnu.org
> Date: Sun, 15 Apr 2018 02:40:18 +1000
>
> Correct me if I'm wrong.
>
> I read the code and found the definition of Lisp_String:
>   struct GCALIGNED Lisp_String
>   {
>     ptrdiff_t size;
>     ptrdiff_t size_byte;
>     INTERVAL intervals;	/* Text properties in this string.  */
>     unsigned char *data;
>   };
>
> I understand string text is encoded in UTF8 format and is stored in
> 'Lisp_String::data'. There is actually no difference between unibyte
> and multibyte text since UTF8 is compatible with ASCII and we only deal
> with 'data' field.

No, that's incorrect.  The difference does exist; it just all but
disappears for unibyte strings encoded in UTF-8.  But if you encode a
string in some other encoding, like Latin-1, you will see a very
different stream of bytes.

> I attached the latest patch.

Thanks.

> +  ;; string containing unicode character (Hanzi)
> +  (should (equal 6 (string-distance "ab" "ab我她")))
> +  (should (equal 3 (string-distance "我" "她"))))

Should the distance be measured in bytes or in characters?  I think
it's the latter, in which case the implementation should work in
characters, not bytes.
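
To illustrate the bytes-versus-characters question, here is a minimal C sketch.  It is not the code from the patch, nor what Emacs does internally.  The names utf8_decode, utf8_to_codepoints, levenshtein and utf8_string_distance are hypothetical, and the input is assumed to be well-formed, NUL-terminated UTF-8.  The same Levenshtein routine is run either over raw bytes or over decoded code points.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Decode one UTF-8 sequence at S into *CP; return the bytes consumed.
   Assumption: S points at well-formed UTF-8 (no validation is done).  */
static size_t
utf8_decode (const unsigned char *s, unsigned int *cp)
{
  if (s[0] < 0x80)
    { *cp = s[0]; return 1; }
  if ((s[0] & 0xE0) == 0xC0)
    { *cp = ((s[0] & 0x1Fu) << 6) | (s[1] & 0x3Fu); return 2; }
  if ((s[0] & 0xF0) == 0xE0)
    {
      *cp = ((s[0] & 0x0Fu) << 12) | ((s[1] & 0x3Fu) << 6) | (s[2] & 0x3Fu);
      return 3;
    }
  *cp = ((s[0] & 0x07u) << 18) | ((s[1] & 0x3Fu) << 12)
        | ((s[2] & 0x3Fu) << 6) | (s[3] & 0x3Fu);
  return 4;
}

/* Decode the whole string S into OUT, which must have room for at
   least strlen (S) elements.  Return the number of code points.  */
static size_t
utf8_to_codepoints (const char *s, unsigned int *out)
{
  const unsigned char *p = (const unsigned char *) s;
  size_t n = 0;
  while (*p)
    p += utf8_decode (p, &out[n++]);
  return n;
}

/* Levenshtein distance between two arrays of units, using a single
   row of the dynamic-programming table.  Returns (size_t) -1 if
   memory allocation fails.  */
static size_t
levenshtein (const unsigned int *a, size_t la,
             const unsigned int *b, size_t lb)
{
  size_t *row = malloc ((lb + 1) * sizeof *row);
  if (!row)
    return (size_t) -1;
  for (size_t j = 0; j <= lb; j++)
    row[j] = j;
  for (size_t i = 1; i <= la; i++)
    {
      size_t diag = row[0];	/* value of d[i-1][j-1] */
      row[0] = i;
      for (size_t j = 1; j <= lb; j++)
        {
          size_t up = row[j];	/* value of d[i-1][j] */
          size_t best = diag + (a[i - 1] != b[j - 1]); /* substitution */
          if (up + 1 < best)
            best = up + 1;		/* deletion */
          if (row[j - 1] + 1 < best)
            best = row[j - 1] + 1;	/* insertion */
          row[j] = best;
          diag = up;
        }
    }
  size_t d = row[lb];
  free (row);
  return d;
}

/* Distance between S1 and S2, counted in bytes if BY_BYTES is nonzero,
   otherwise in characters (code points).  */
size_t
utf8_string_distance (const char *s1, const char *s2, int by_bytes)
{
  size_t n1 = strlen (s1), n2 = strlen (s2);
  unsigned int *a = malloc ((n1 + 1) * sizeof *a);
  unsigned int *b = malloc ((n2 + 1) * sizeof *b);
  size_t d = (size_t) -1;
  if (a && b)
    {
      if (by_bytes)
        {
          for (size_t i = 0; i < n1; i++)
            a[i] = (unsigned char) s1[i];
          for (size_t i = 0; i < n2; i++)
            b[i] = (unsigned char) s2[i];
        }
      else
        {
          n1 = utf8_to_codepoints (s1, a);
          n2 = utf8_to_codepoints (s2, b);
        }
      d = levenshtein (a, n1, b, n2);
    }
  free (a);
  free (b);
  return d;
}
```

Under these assumptions, comparing "ab" with "ab我她" gives 6 when counting bytes and 2 when counting characters, and "我" versus "她" gives 3 and 1 respectively.  The values 6 and 3 expected by the quoted tests are therefore byte counts.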