From: Ted Zlatanov
Newsgroups: gmane.emacs.devel
To: emacs-devel@gnu.org
Subject: Re: idn.el and confusables.txt
Date: Sun, 15 May 2011 07:14:47 -0500
Message-ID: <87hb8w5few.fsf@lifelogs.com>

On Sun, 15 May 2011 01:56:02 -0400 Eli Zaretskii wrote:

EZ> These all examine portions of a buffer ("words") for being a match to
EZ> some string or regexp. So I think having strings in the char-table
EZ> will be more convenient, because you could then use looking-at,
EZ> string=, string-match, etc.

Oh, good point. OK, strings it is. I'll write the converter.

>> As a general rule I'd say that if the mapping is to a single character
>> with the SL/SA single-script property, chances are it's a true
>> confusable. Otherwise it could be legitimate and we'd need to convert
>> the string to a normalized form, which is probably slow (do you know?)

EZ> What do you mean by "normalized form"?

Unicode has a normalization algorithm to see if two strings are
informationally the same regardless of the combining characters and
other sequences within. But thinking about it, even if normalization
says they're the same, it's still a potential problem for the user, so
we can skip normalization and always mark those.

>> Based on all this, I think it's best to make the confusables char-table
>> values atoms or sequences (strings or lists) but split them into two
>> char-tables for the single-script and multi-script mappings.

EZ> If we were to implement the full IDNA protocol, would the above be
EZ> enough? Or will we need additional information?

Oh, all this has been for confusables (TR39) only. IDNA and uni-idn.el
will have their own needs! IIUC, Lennart used IDNA only as a character
set in markchars.el (I didn't write that functionality and he maintains
idn.el), but there are more security issues with it we may need to
handle. IDNA is better described in http://unicode.org/reports/tr46/
and the links at the end of that document (a whole bunch of RFCs).

I'm not interested in implementing the IDNA code beyond supporting the
current character set detection because I don't think IDNA is popular
enough, but maybe Lennart and others want to do it.

For further possible markchars.el functionality, take a look at
http://www.unicode.org/reports/tr36/ (Unicode Security Considerations).
It talks about the confusables issues, IDNA issues, and bidi issues
among others. It's a really good explanation of what security-related
functionality is needed from the confusables char-table and potentially
other places in Emacs.

Ted
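
A minimal sketch of the normalization point above, in Python rather
than Emacs Lisp and not taken from markchars.el or idn.el. The helper
name same_after_nfc and the sample strings are made up for
illustration. It shows that Unicode normalization (NFC here) equates a
precomposed character with its combining-sequence form, but does
nothing for a cross-script lookalike, which is why a confusables table
is still needed whatever normalization says:

```python
import unicodedata


def same_after_nfc(a: str, b: str) -> bool:
    """Return True if a and b are equal once both are NFC-normalized."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT
latin_a = "a"            # U+0061 LATIN SMALL LETTER A
cyrillic_a = "\u0430"    # U+0430 CYRILLIC SMALL LETTER A

# Different code point sequences, but canonically equivalent.
nfc_merges_combining = same_after_nfc(composed, decomposed)

# Visually confusable, but normalization keeps them distinct.
nfc_merges_cross_script = same_after_nfc(latin_a, cyrillic_a)

print(nfc_merges_combining, nfc_merges_cross_script)
```

The first comparison comes out equal and the second does not, so
normalization alone cannot stand in for the TR39 confusables data.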