From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!.POSTED!not-for-mail From: chen bin Newsgroups: gmane.emacs.devel Subject: Re: [PATCH] add 'string-distance' to calculate Levenshtein distance Date: Tue, 17 Apr 2018 12:43:33 +1000 Message-ID: References: <87vacuecrn.fsf@gmail.com> <83po3246ah.fsf@gnu.org> <87lgdq831h.fsf@gmail.com> <83muy553ae.fsf@gnu.org> <87o9ilhhcd.fsf@gmail.com> <83d0z14sws.fsf@gnu.org> <87o9il0wka.fsf@gmail.com> <83bmek4jdn.fsf@gnu.org> <83k1t72b2o.fsf@gnu.org> NNTP-Posting-Host: blaine.gmane.org Mime-Version: 1.0 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Trace: blaine.gmane.org 1523932901 7440 195.159.176.226 (17 Apr 2018 02:41:41 GMT) X-Complaints-To: usenet@blaine.gmane.org NNTP-Posting-Date: Tue, 17 Apr 2018 02:41:41 +0000 (UTC) Cc: emacs-devel@gnu.org To: Eli Zaretskii Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Tue Apr 17 04:41:37 2018 Return-path: Envelope-to: ged-emacs-devel@m.gmane.org Original-Received: from lists.gnu.org ([208.118.235.17]) by blaine.gmane.org with esmtp (Exim 4.84_2) (envelope-from ) id 1f8GYq-0001qQ-OB for ged-emacs-devel@m.gmane.org; Tue, 17 Apr 2018 04:41:36 +0200 Original-Received: from localhost ([::1]:34036 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1f8Gax-0001Aw-6I for ged-emacs-devel@m.gmane.org; Mon, 16 Apr 2018 22:43:47 -0400 Original-Received: from eggs.gnu.org ([2001:4830:134:3::10]:58434) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1f8Gaq-0001AF-CI for emacs-devel@gnu.org; Mon, 16 Apr 2018 22:43:41 -0400 Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1f8Gao-0006Z9-VP for emacs-devel@gnu.org; Mon, 16 Apr 2018 22:43:40 -0400 Original-Received: from mail-lf0-x233.google.com ([2a00:1450:4010:c07::233]:43968) by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_128_CBC_SHA1:16) (Exim 4.71) (envelope-from ) id 1f8Gal-0006KS-Qy; Mon, 16 Apr 2018 22:43:36 -0400 Original-Received: by mail-lf0-x233.google.com with SMTP id v207-v6so25025423lfa.10; Mon, 16 Apr 2018 19:43:35 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :cc:content-transfer-encoding; bh=61UGZXa0jWpKPwauOrDcuYEep9HjbNR9rHtpSaDEo1k=; b=qwYyh+DyLRK24zxcIJcDmSAUo47aEzU9MfokcQ4zoyx0q+BHRkwnashbI4GBMYGfvb srhPQvFAmzYyC7li6dN5wgXOCASjtyAXMiX08+iG6st2h92vHW4qLkNBHgb7I2z/jVFm kuOoRfwnY/eeuyYYwfoybCZRJcWlNA2W+V3kQ24L2fTfMT+tc2ia9L9G6l5g1Q0+2kbF Wxi07AWPUJ8GOukmRk5T/qNaCYatujUBeGU6wkEDCT2eI43GmQmBavGVJRNyt8Qyo/FI 9fi7i/g4l3fsrlnXgP9IpKv53IEWXzI6lTAXgDlyQyaFYDX6aPOsFjyPX0i8LPnPVcDM Aidw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:in-reply-to:references:from:date :message-id:subject:to:cc:content-transfer-encoding; bh=61UGZXa0jWpKPwauOrDcuYEep9HjbNR9rHtpSaDEo1k=; b=FCpcigN+eUumZel8llBxF+MfylGwFXvDhWQsjRCQuiuqvNuAzpbRLKZlPoyKgJ77Zx sNCrMYrHrtWs4S1OpWZU3uZNWtg5HqTaDo0eW+s6Mb0/GFr7AvbUT3ITlTtdtkXWhL38 iVrJJlq3tn+VPdoBCFBJtlpJWAzFXXy8mJ/pgInF62dxzLY4BqxE3JXp3t/ta5uY/pzZ ZVk4ivxdWXip+HfKSK65uFjzAL3yattW80hkXe385U4ku/w8ySqoNfEUBwqv41Ka8Ey3 Ou/6Z87fKkRIVlOQvjanHxugRDyymWxi2KJVcIM6s8MUxP8NSWjeRPS/JTXICQIZ4dtw /I2Q== X-Gm-Message-State: ALQs6tDUZdKw7upfewVIGMDnXSBID1fwSamKH2GZTsZOnE6WWsVOKV0E JcvPRWDGs9BBFvUT5LGvn37q1aK2brB7k6syuGHjLQ== X-Google-Smtp-Source: AIpwx49mXkFxNEJy/Y418zzYJaC6aKV+qzvzH+xQtuQ9WtrwqyUTzbdtA/WBh4H/pnq2esu5MyArRJFaR8GUytUzt+0= X-Received: by 10.46.154.205 with SMTP id p13mr145874ljj.60.1523933014017; Mon, 16 Apr 2018 19:43:34 -0700 (PDT) Original-Received: by 10.46.20.93 with HTTP; Mon, 16 Apr 2018 19:43:33 -0700 (PDT) In-Reply-To: <83k1t72b2o.fsf@gnu.org> X-detected-operating-system: by eggs.gnu.org: Genre and OS details not recognized. X-Received-From: 2a00:1450:4010:c07::233 X-BeenThere: emacs-devel@gnu.org X-Mailman-Version: 2.1.21 Precedence: list List-Id: "Emacs development discussions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Original-Sender: "Emacs-devel" Xref: news.gmane.org gmane.emacs.devel:224696 Archived-At: 'bytecompare' only applies the original Levenshtein distance algorihtm to the ascii strings. It's not enabled by default. As we agreed, it has some pleasant side effect for East Asian. It's not rig= ht and wrong thing. Just side effect of UTF-8 encoding. Weighted distance is NOT doable for only one reason: You can't increase character weight of certain race or religion in Emacs. It's politically wrong. So I suggest keep both unicode and ascii algorithm. I will study the macros you mentioned tonight and figure out a new solution at the end of day. On Tue, Apr 17, 2018 at 5:41 AM, Eli Zaretskii wrote: >> From: chen bin >> Date: Tue, 17 Apr 2018 00:29:39 +1000 >> >> To be honest I'm 100% confident about my current solution. I'm >> actually an expert in dealing unicode related C programming. > > I appreciate your expertise, that's not the issue here. The main > issue is how to support all the range of characters that Emacs wants > to support. > >> It's standard practice to convert UTF-8 character to UTF-16 >> character before any further string operation. >> >> To anwser your questions, I have to introduce East Asian culture at firs= t. >> >> China/Japan/Korea use same Hanzi (Chinese characters). China's >> national standard GBK defines >> 21886 characters used Eas Asians. >> >> UTF-16 is compatilbe with GBK. >> >> Even most educated Chinese knows only 6000~8000 characters. >> >> UTF-32 exists because there are another 50000 ancient Hanzi which no >> Chinese uses. >> >> Among 99089 characters of Unicode v5.0, 71226 characters are Hanzi. >> >> In short, you got about 22000 non-hanzi characters and 21000 Hanzi >> characters. UTF-16 is totally fine >> to deal with it. > > UTF-16 is fine for characters within the BMP, yes. But Emacs supports > much more, see below. > >> Say my name is =E9=99=88=E6=96=8C, my colleague's name is =E5=BC=A0=E6= =96=8C. Only one Hanzi difference, right? >> >> While reading file named "=E9=99=88=E6=96=8C's doucment0", I want to lis= t other >> documents with similar file names. >> Using my byte compare algorithm, "=E9=99=88=E6=96=8C's document1", "=E9= =99=88=E6=96=8C's document2" is >> on the top of the sorted >> document list. Make sense. >> >> Without byte comparing, my family name "=E9=99=88" is not more important= than >> an latin character >> So files like "=E5=BC=A0=E6=96=8C's document1", "=E5=BC=A0=E6=96=8C's do= cument2" is on the top of the list now. > > You are saying that a difference of 1 character has more "weight" for > some characters than for others. That might be so, but using the > number of bytes in the UTF-8 representation as that weight makes very > little sense to me, because the length of the UTF-8 sequence for a > given character is determined by reasons that are largely historical: > it depends on when was a particular character introduced into Unicode. > It has nothing to do with "importance" of a character. > > Thus, your example might produce a sensible result when comparing > ASCII characters vs non-ASCII characters, but will make much less > sense when comparing one non-ASCII character with another non-ASCII > character. For example, emoji characters, which need 4 bytes in UTF-8 > sequences, will be considered twice as "important" as the above Hanzi > characters of your example (and 4 times as important as ASCII). The > results will be totally unexpected and unpredictable (unless the user > knows the entire Unicode database by heart). > > So I think features such as what you want to achieve with the above > example will have to use some other additional feature, like "weighted > difference", and using the byte length as a kind of surrogate "weight" > for this purpose is IMO not a good idea, even though it could look > like that at first glance. > >> 2. As mentioned, China has national standard GBK which is compatible >> with UTF-16. >> Every character is a 16 bit integer. It's common practice to convert >> UTF-8 string (or other encoded string) >> to UTF-16 format before any further string operation. Or else, it's >> impossbile to display Japanese, Korean, Chinese >> in one document (considering tyical user operations, text >> searching/replacing ...) >> >> I've been doing this since day one of my career because I'm Chinese. > > As I said, this is okay for characters in the BMP. But Emacs must > support much more. > > First, we support the latest Unicode 10.0, which added a lot of > characters outside the BMP. You have the emoji there, you have many > new symbols there (the Alchemichal Symbols, Geometric Shapes, and > Supplemental Arrows-C blocks), Domino Tiles, Mathematical > Alphanumerics, Musical symbols, and many scripts that there's no > reason to tell people we don't support in Emacs. > > Moreover, Emacs's internal representation of raw 8-bit bytes uses the > area #x3FFF00..#x3FFFFF, which need 5 bytes in its UTF-8 sequence. We > cannot exclude the possibility that strings we are comparing have > embedded raw bytes in them, this happens in practice and must work as > expected. > >> The integer is called wide character (wchar_t in standard C, or WCHAR >> in Windows GUI SDK), > > Windows indeed uses a 16-bit wchar_t type, which is why its support > for characters beyond the BMP is problematic and in some cases simply > non-existent, because standard C library functions that accept a > single wide character argument are unable to cope with UTF-16 encoding > that requires 2 16-bit words to represent a character. So, for > example, you don't have a way of looking for a character outside the > BMP in a wide character string by calling the standard function > wcschr, because its 2nd argument is a single wchar_t wide character, > which can only express characters inside the BMP. > > Emacs cannot tolerate such limitations, not in the year 2018. > > So using implementations that only support the BMP is really a > non-starter for Emacs. We cannot accept such code. > >> Now allow me explain why I can't use CHAR_STRING_ADVANCE. The first >> parameter 'c' is ALREADY a wide character. > > I apologize for my stupid typo. I meant STRING_CHAR_ADVANCE, not > CHAR_STRING_ADVANCE. At least I didn't make such a mistake in the 2nd > macro I mentioned, FETCH_STRING_CHAR_ADVANCE... > >> It's efficient to extract characters into one unsigned short array in >> one shot. > > It is indeed quite efficient, but it requires one more pass of the > entire string, which is unnecessary. And anyway, the main reason not > to convert to UTF-16 is what I wrote above about supporting characters > beyond the BMP, not the extra loop. > >> Or else, I need convert character index to byte offset first. So I >> need get byte offset by calling 'string_char_to_byte' >> twice to get offset of two elements for comparing, The element coould >> could be 1, 2, 3 bytes. So I still need convert bytes >> into a 4 byte integer for comparing. > > The macros I mentioned already do that for you, they manage both > character offsets and byte offsets efficiently (and they are much more > efficient than string_char_to_byte because they advance sequentially, > whereas string_char_to_byte is for random access to a string). > > Thanks. > > P.S. Please, let's return this discussion to the list, where it > belongs. I'm disappointed that this was again a private email, and > the list isn't CC'ed. This discussion should be public, not private. --=20 help me, help you.