From mboxrd@z Thu Jan 1 00:00:00 1970
From: Ted Zlatanov
Newsgroups: gmane.emacs.devel
Subject: Re: idn.el and confusables.txt
Date: Wed, 18 May 2011 13:15:05 -0500
Organization: =?utf-8?B?0KLQtdC+0LTQvtGAINCX0LvQsNGC0LDQvdC+0LI=?= @ Cienfuegos
Message-ID: <87aaej5106.fsf@lifelogs.com>
References: <874o5rqr5z.fsf@lifelogs.com> <87mxjjpal4.fsf@lifelogs.com>
 <87vcy6nzan.fsf@lifelogs.com> <87tydl4sjj.fsf_-_@lifelogs.com>
 <87r58pghh7.fsf_-_@lifelogs.com> <83iptdg0yr.fsf@gnu.org>
 <87y629ien3.fsf@lifelogs.com> <83aaepfiuk.fsf@gnu.org>
 <87aaepi9k2.fsf@lifelogs.com> <834o4xfd34.fsf@gnu.org>
 <8739khi54z.fsf@lifelogs.com> <83y629dmmt.fsf@gnu.org>
 <8739kg7o63.fsf@lifelogs.com> <87hb8w5few.fsf@lifelogs.com>
 <87r57xgx70.fsf@lifelogs.com>
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="=-=-="
To: emacs-devel@gnu.org
User-Agent: Gnus/5.110018 (No Gnus v0.18) Emacs/24.0.50 (gnu/linux)

--=-=-=
Content-Type: text/plain

On Tue, 17 May 2011 10:32:03 -0500 Ted Zlatanov wrote:

TZ> Here's the converter.  It reads the confusables.txt file and generates a
TZ> char-table with strings as values.  I'll package the converter and the
TZ> resulting uni-confusables.el library and put them on the GNU ELPA.

TZ> Could you tell me the best way to write uni-confusables.el?  In what
TZ> format should I provide the char-tables in the ELisp code?

The shortest format turned out to be a range enumeration, because the
native char-table dump was much bigger (700K vs. 100K).  So I wrote
`gen-confusables-write' to create the "uni-confusables.el" file that
defines the two char-tables and then populates them.  As a bonus, two
ERT tests (one per single/multiple type) are also generated dynamically
based on the data found in the confusables.txt file.

gen-confusables.el is a pretty unholy mix of Lisp and string
manipulations, but since I am the only real user I don't mind.

You can test it with
http://www.unicode.org/Public/security/revision-04/confusables.txt
(I'm not including the resulting uni-confusables.el here because it's
over 100K).

Ted

--=-=-=
Content-Type: application/emacs-lisp; charset=utf-8
Content-Disposition: attachment; filename=gen-confusables.el
Content-Transfer-Encoding: quoted-printable

;;; gen-confusables.el --- generate uni-confusables.el from confusables.txt

;; Copyright (C) 2011 Teodor Zlatanov

;; Author: Teodor Zlatanov

;; This program is free software; you can redistribute it and/or modify
;; it under the terms of the GNU General Public License as published by
;; the Free Software Foundation, either version 3 of the License, or
;; (at your option) any later version.

;; This program is distributed in the hope that it will be useful,
;; but WITHOUT ANY WARRANTY; without even the implied warranty of
;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
;; GNU General Public License for more details.

;; You should have received a copy of the GNU General Public License
;; along with this program.  If not, see .

;;; Commentary:

;;; Code:

(require 'cl)

(defvar gen-confusables-char-table-single)
(defvar gen-confusables-char-table-multiple)

(defun gen-confusables-read (file)
  "Read the confusables list in FILE into the two gen-confusables char-tables."
  (interactive "fConfusables filename: \n")
  (flet ((reader (h) (string-to-number h 16)))
    (let ((stable (make-char-table 'confusables-single-script))
          (mtable (make-char-table 'confusables-multiple-script))
          (count 0)
          (confusable-line-regexp (concat "^\\([[:xdigit:]]+\\)" ; \x+
                                          " ;\t"
                                          ;; \x+ separated by spaces
                                          "\\([[:space:][:xdigit:]]+\\)"
                                          " ;\t"
                                          "\\([SM]\\)[LA]"))) ; SL, SA, ML, MA
      (setq gen-confusables-char-table-single stable)
      (setq gen-confusables-char-table-multiple mtable)
      (with-temp-buffer
        (insert-file-contents file)
        (goto-char (point-min))
        (while (re-search-forward confusable-line-regexp nil t)
          (incf count)
          (when (and (called-interactively-p 'interactive)
                     (zerop (mod count 100)))
            (message "processed %d lines" count))
          (let* ((from (match-string 1))
                 (to (match-string 2))
                 (class (match-string 3))
                 (table (if (string-equal "S" class) stable mtable)))
            (set-char-table-range
             table
             (reader from)
             (concat (mapcar 'reader (split-string to))))))))))

(defun gen-confusables-write (file)
  "Write the char-tables filled by `gen-confusables-read' to FILE as Lisp code."
  ;; "F" rather than "f": the output file need not exist yet.
  (interactive "FDumped filename: \n")
  (let ((coding-system-for-write 'utf-8-emacs))
    (with-temp-file file
      (insert ";; Copyright (C) 1991-2009, 2010 Unicode, Inc.
;; This file was generated from the Unicode confusables list at
;; http://www.unicode.org/Public/security/revision-04/confusables.txt.
;; See lisp/international/README in the Emacs trunk
;; for the copyright and permission notice.\n\n")
      (dolist (type '(single multiple))
        (let* ((tablesym (intern (format "uni-confusables-char-table-%s" type)))
               (oursym (intern (format "gen-confusables-char-table-%s" type)))
               (ourtable (symbol-value oursym))
               (ourtablename (symbol-name oursym))
               (tablename (symbol-name tablesym))
               (prop (format "confusables-%s-script" type))
               props)
          (insert (format "(defvar %s (make-char-table '%s))\n\n"
                          tablename prop))
          (map-char-table
           (lambda (k v) (setq props (cons k (cons v props))))
           ourtable)
          (insert (format "(let ((k nil) (v nil) (ranges '%S))\n" props))
          (insert (format " (while ranges (setq k (pop ranges) v (pop ranges)) (set-char-table-range %s k v)))\n\n"
                          tablename))
          (insert (format "(ert-deftest uni-confusables-test-%s ()\n" type))
          (dolist (offset '(100 200 800 3000 3500))
            (insert (format " (should (string-equal (char-table-range %s %d) %S))\n"
                            tablename
                            (nth (* 2 offset) props)
                            (nth (1+ (* 2 offset)) props))))
          (insert ")\n\n")))
      (insert "
;; Local Variables:
;; coding: utf-8
;; no-byte-compile: t
;; End:

;; uni-confusables.el ends here"))))

(provide 'gen-confusables)

;;; gen-confusables.el ends here
--=-=-=--
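
For readers without the generated file at hand: the message describes
uni-confusables.el as a defvar for each char-table followed by a flat
list of key/value pairs that is popped in a loop.  A minimal sketch of
that shape follows.  The table name `example-confusables-table', the
subtype `example-confusables' and the two entries are hypothetical
stand-ins chosen for illustration; they are not copied from
confusables.txt or from the real generated file, whose keys may also be
ranges rather than single characters.

```elisp
;; Sketch only: mirrors the defvar + populate-loop layout that
;; `gen-confusables-write' emits, using made-up names and sample data.
(defvar example-confusables-table (make-char-table 'example-confusables)
  "Hypothetical char-table mapping a character to a look-alike string.")

(let ((k nil) (v nil)
      ;; Flat list of alternating keys and values (illustrative entries).
      (ranges '(#x0430 "a"      ; CYRILLIC SMALL LETTER A
                #x03BF "o")))   ; GREEK SMALL LETTER OMICRON
  (while ranges
    (setq k (pop ranges)
          v (pop ranges))
    (set-char-table-range example-confusables-table k v)))

;; Look up an entry; characters that were never set return nil.
(char-table-range example-confusables-table #x0430) ; => "a"
(char-table-range example-confusables-table ?x)     ; => nil
```

Storing the data as one quoted list keeps the file compact, which is
the size advantage over the native char-table dump mentioned in the
message.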