unofficial mirror of emacs-devel@gnu.org 
 help / color / mirror / code / Atom feed
From: Eli Zaretskii <eliz@gnu.org>
To: Chen Bin <chenbin.sh@gmail.com>
Cc: emacs-devel@gnu.org
Subject: Re: [PATCH] add 'string-distance' to calculate Levenshtein distance
Date: Sun, 15 Apr 2018 17:47:00 +0300	[thread overview]
Message-ID: <83bmek4jdn.fsf@gnu.org> (raw)
In-Reply-To: <87o9il0wka.fsf@gmail.com> (message from Chen Bin on Sun, 15 Apr 2018 17:15:49 +1000)

> From: Chen Bin <chenbin.sh@gmail.com>
> Cc: emacs-devel@gnu.org
> Date: Sun, 15 Apr 2018 17:15:49 +1000
> 
> I attached patch for latest code.

Thanks, we are getting close.

> +DEFUN ("string-distance", Fstring_distance, Sstring_distance, 2, 3, 0,
> +       doc: /* Return Levenshtein distance between STRING1 and STRING2.
> +If BYTECOMPARE is nil, we compare character of strings.
> +If BYTECOMPARE is t, we compare byte of strings.
> +Case is significant, but text properties are ignored. */)
> +  (Lisp_Object string1, Lisp_Object string2, Lisp_Object bytecompare)

I question the need for the BYTECOMPARE flag.  Emacs's editing
operations work in characters, not in bytes.  There's insert-byte, but
no delete-byte or replace-byte (although applications which should for
some reason need that could implement that, albeit not conveniently).
The byte-level operations are not exposed to Lisp for a good reason:
Emacs is primarily a text-processing environment, and text is
processed in character units.

So I think you should remove that option, unless you can explain why
you think it's needed and in what situations.

> +  unsigned short *ws1 = 0; /* 16 bit unicode character */
> +  unsigned short *ws2 = 0; /* 16 bit unicode character */
> +  if(!use_bytecompare)
> +    {
> +      /* convert utf-8 byte stream to 16 bit unicode array */
> +      string1 = code_convert_string_norecord (string1, Qutf_16le, 1);
> +      ws1 = (unsigned short *) SDATA (string1);
> +      string2 = code_convert_string_norecord (string2, Qutf_16le, 1);
> +      ws2 = (unsigned short *) SDATA (string2);
> +    }

Conversion to UTF-16 burns cycles, and the function will do the wrong
thing for characters beyond the BMP as result, because you compare two
16-bit words instead of full characters.

Instead, please use the macros defined on character.h, either
CHAR_STRING_ADVANCE of FETCH_STRING_CHAR_ADVANCE (or their non-ADVANCE
counterparts), whichever better suits your coding style and needs.
These macros produce the full Unicode codepoint of the string
characters, and you can then compare them without bumping into the
problem with UTF-8 or UTF-16 encoding.  The *-ADVANCE variants also
advance the pointer or index to the next character as side effect,
which is handy when examining successive characters in a loop.



  reply	other threads:[~2018-04-15 14:47 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-04-14  2:35 [PATCH] add 'string-distance' to calculate Levenshtein distance Chen Bin
2018-04-14  7:05 ` Eli Zaretskii
     [not found]   ` <87lgdq831h.fsf@gmail.com>
2018-04-14 13:24     ` Eli Zaretskii
2018-04-14 16:40       ` Chen Bin
2018-04-14 17:08         ` Eli Zaretskii
2018-04-15  7:15           ` Chen Bin
2018-04-15 14:47             ` Eli Zaretskii [this message]
     [not found]               ` <CAAE-R+-RDWvyrv+uqHszzh6VMH6An3disOw=PyPWaTnUTHDOCw@mail.gmail.com>
     [not found]                 ` <83k1t72b2o.fsf@gnu.org>
2018-04-17  2:43                   ` chen bin
2018-04-17 15:44                     ` Eli Zaretskii
2018-04-18  7:11                       ` chen bin
     [not found]                   ` <CAAE-R+8s++_LRcQCLX60Z=TQeQHdtbM5X1k525bfNnnPSLDvRw@mail.gmail.com>
     [not found]                     ` <83bmei36dw.fsf@gnu.org>
2018-04-17 12:31                       ` chen bin
2018-04-19  8:05                         ` Eli Zaretskii
2018-04-19 14:55                           ` chen bin
2018-04-20  4:37                             ` chen bin
2018-04-20  6:01                             ` Thien-Thi Nguyen
2018-04-20 10:47                             ` chen bin
2018-04-21  7:22                               ` Eli Zaretskii
2018-04-21 20:47                                 ` Juri Linkov
2018-04-28  7:36                                 ` Eli Zaretskii
2018-05-06  9:53                                   ` chen bin
2018-04-15 18:53             ` Paul Eggert
2018-04-14 17:18 ` Nathan Moreau
2018-04-14 17:36   ` Paul Eggert
2018-04-15 18:17     ` Andreas Politz

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://www.gnu.org/software/emacs/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=83bmek4jdn.fsf@gnu.org \
    --to=eliz@gnu.org \
    --cc=chenbin.sh@gmail.com \
    --cc=emacs-devel@gnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).