From: Eli Zaretskii
Newsgroups: gmane.emacs.devel
Subject: Re: idn.el and confusables.txt
Date: Sun, 15 May 2011 01:56:02 -0400
To: Ted Zlatanov
Cc: emacs-devel@gnu.org
In-reply-to: <8739kg7o63.fsf@lifelogs.com> (message from Ted Zlatanov on Sat, 14 May 2011 20:22:44 -0500)

> From: Ted Zlatanov
> Date: Sat, 14 May 2011 20:22:44 -0500
>
> EZ> Should it be a list or a string? How would you use this mapping?
>
> It could be any type of sequence, I guess. Strings are more compact but
> for small amounts of data (typically 1-3 characters) I'm not sure if
> that matters. For 1 character in particular I'm pretty sure it's more
> efficient to store the character directly than any sequence.
>
> markchars.el would use it as follows: look at all the characters of a
> word. If any are of a different script S2 from the majority script S1,
> highlight them (we do this now with `markchars-face-confusable').
>
> New functionality: now if any of the S2 characters are multi-script
> confusables that map to a character in the majority script S1, highlight
> them specially with the new variable
> `markchars-face-confusable-multi-script' and give them a tooltip to say
> they are confusable with a particular character.
>
> New functionality: if any of the word characters, regardless of script,
> are confusables of the single-script type, highlight them with
> `markchars-face-confusable'. But see below about normalization.

These all examine portions of a buffer ("words") for being a match to
some string or regexp. So I think having strings in the char-table
will be more convenient, because you could then use looking-at,
string=, string-match, etc.

> As a general rule I'd say that if the mapping is to a single character
> with the SL/SA single-script property, chances are it's a true
> confusable. Otherwise it could be legitimate and we'd need to convert
> the string to a normalized form, which is probably slow (do you know?)

What do you mean by "normalized form"?

> Based on all this, I think it's best to make the confusables char-table
> values atoms or sequences (strings or lists) but split them into two
> char-tables for the single-script and multi-script mappings.

If we were to implement the full IDNA protocol, would the above be
enough? Or will we need additional information?
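
For illustration only, here is a rough sketch of the lookup discussed in this message: two separate tables (single-script and multi-script) whose values are always strings, plus the majority-script check that markchars.el is described as doing. It is written in Python rather than Emacs Lisp so it can be run outside Emacs, and it is not the markchars.el implementation. The names `script_of`, `majority_script`, `classify`, `SINGLE_SCRIPT`, and `MULTI_SCRIPT` are hypothetical, the table entries are a few hand-picked samples rather than data read from confusables.txt, and the script of a character is only approximated by the first word of its Unicode name.

```python
import unicodedata
from collections import Counter

# Hypothetical sample tables; real data would be parsed from confusables.txt.
# Values are always strings, even for a single character, so they can be
# compared or searched for directly.
MULTI_SCRIPT = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
}
SINGLE_SCRIPT = {
    "\u0131": "i",  # LATIN SMALL LETTER DOTLESS I
    "m": "rn",      # a value longer than one character
}


def script_of(ch):
    """Approximate the script by the first word of the Unicode name.

    This is an assumption made for the sketch, not the real Script property.
    """
    try:
        return unicodedata.name(ch).split()[0]
    except ValueError:  # character has no name
        return "UNKNOWN"


def majority_script(word):
    """Return the most common script among the letters of WORD, or None."""
    counts = Counter(script_of(ch) for ch in word if ch.isalpha())
    return counts.most_common(1)[0][0] if counts else None


def classify(word):
    """Return (index, char, kind, target) for every flagged letter in WORD."""
    s1 = majority_script(word)
    flagged = []
    for i, ch in enumerate(word):
        if not ch.isalpha():
            continue
        if script_of(ch) != s1:
            target = MULTI_SCRIPT.get(ch)
            # Multi-script confusable: maps to text in the majority script.
            if target is not None and all(script_of(t) == s1 for t in target):
                flagged.append((i, ch, "multi-script", target))
            else:
                flagged.append((i, ch, "other-script", None))
        elif ch in SINGLE_SCRIPT:
            flagged.append((i, ch, "single-script", SINGLE_SCRIPT[ch]))
    return flagged


if __name__ == "__main__":
    # "paypal" with a Cyrillic "a" as its second letter.
    print(classify("p\u0430ypal"))
```

Keeping every value a string means a caller never has to test whether an entry is a character or a sequence before comparing it, which is the convenience argued for above. The cost is a little extra storage for the one-character case.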