>>>>> "Eli" == Eli Zaretskii writes: Eli> [Please CC the mailing list when you respond, so others could Eli> see your messages.] >> From: Chen Bin Date: Sat, 14 Apr 2018 >> 21:01:46 +1000 >> >> Hi, Eli, Thanks for the review. >> >> I fixed most issues except two things. >> >> 1. In Asia, it's possible to cacluate distance between one >> unibyte and one multibyte string. As a Chinese, I might create a >> document containing Hanzi characters whose file name is obviously >> multibyte string. I may need get the distance of this document to >> a file named "README.txt". Eli> If you mean unibyte pure-ASCII strings, then I agree. But that Eli> doesn't mean we should avoid the text altogether, because we Eli> might compute non-zero distance between a string and its Eli> encoded unibyte variant, which will confuse users. At the very Eli> least the doc string should say something about this. >> 2. Algorithm is based on >> https://en.wikibooks.org/w/index.php?title=Algorithm_Implementation/Strings/Levenshtein_distance&stable=0#C >> It's optimized to use O(min(m,n)) space instead of O(mn). Say >> you compare two string whose string length is 512 bytes. You >> only need allocate 512 bytes instead of 262K (512*512) in memory. >> >> Please check attached patch for latest code. >> >> --- a/etc/NEWS +++ b/etc/NEWS @@ -463,6 +463,8 @@ >> x-lost-selection-hooks, x-sent-selection-hooks +++ ** New >> function assoc-delete-all. >> >> +** New function string-distance. Eli> This should mention Levenshtein distance. >> +DEFUN ("string-distance", Fstring_distance, Sstring_distance, 2, >> 2, 0, + doc: /* Return Levenshtein distance of STRING1 and >> STRING2. Eli> ^^^^^^^^^^^^^^^^^^^^^^ Eli> "between STRING1 and STRING2" >> + unsigned int s1len, s2len, x, y, lastdiag, olddiag; Eli> These variables should be declared EMACS_INT, not unsigned int, Eli> because the size of Emacs strings can be larger than UINT_MAX, Eli> especially on 64-bit systems. >> + unsigned int *column = SAFE_ALLOCA ((s1len + 1) * sizeof >> (unsigned int)); Eli> Likewise here. >> + char *s1 = SSDATA (string1); + char *s2 = SSDATA (string2); + + >> unsigned int s1len, s2len, x, y, lastdiag, olddiag; + s1len = >> strlen(s1); + s2len = strlen(s2); Eli> You could optimize the code by using SCHARS and SBYTES, instead Eli> of calling strlen. >> +(ert-deftest subr-tests--string-distance () + "Test >> `string-distance' behavior." + (should (equal 1 (string-distance >> "heelo" "hello"))) + (should (equal 2 (string-distance "aeelo" >> "hello"))) + (should (equal 0 (string-distance "ab" "ab"))) + >> (should (equal 1 (string-distance "ab" "abc")))) Eli> Could you please add a test or two with non-ASCII characters? Eli> Thanks. -- Best Regards, Chen Bin -- Help me, help you