From mboxrd@z Thu Jan 1 00:00:00 1970
From: "James J. Ramsey"
Newsgroups: gmane.emacs.bugs
Subject: Possible spurious "range striding over charsets" errors
Date: Fri, 31 Dec 2004 12:09:58 -0800 (PST)
Message-ID: <20041231200958.16349.qmail@web50902.mail.yahoo.com>
Reply-To: jjramsey@pobox.com
Original-To: bug-gnu-emacs@gnu.org

I've been working on a patch for ispell.el so that it works better on UTF-8 files when used with Aspell. The patch is here:

http://sourceforge.net/tracker/index.php?func=detail&aid=945391&group_id=245&atid=300245

There is a variable called ispell-utf-8-casechars that contains a string that, when passed through the function ispell-decode-string (basically a wrapper for decode-coding-string), becomes a regular expression matching anything that is supposed to be a "letter" of a word. Here is the original value of the variable, which worked in Emacs 21.3:

(I apologize in advance if it causes horizontal scroll problems.)

(setq ispell-utf-8-casechars "[A-Za-z\303\200-\303\226\303\230-\303\266\303\270-\303\277\304\200-\304 \261\304\264-\304\276\305\201-\305\210\305\212-\305\276\306\200-\307\203\307\215 -\307\260\307\264\307\265\307\272-\310\227\311\220-\312\250\312\273-\313\201\316 \206\316\210-\316\212\316\214\316\216-\316\241\316\243-\317\216\317\220-\317\226 \317\232\317\234\317\236\317\240\317\242-\317\263\320\201-\320\214\320\216-\321\ 217\321\221-\321\234\321\236-\322\201\322\220-\323\204\323\207\323\210\323\213\3 23\214\323\220-\323\253\323\256-\323\265\323\270\323\271\324\261-\325\226\325\23 1\325\241-\326\206\327\220-\327\252\327\260-\327\262\330\241-\330\272\331\201-\3 31\212\331\261-\332\267\332\272-\332\276\333\200-\333\216\333\220-\333\223\333\2 25\333\245\333\246\340\244\205-\340\244\271\340\244\275\340\245\230-\340\245\241 \340\246\205-\340\246\214\340\246\217\340\246\220\340\246\223-\340\246\250\340\2 46\252-\340\246\260\340\246\262\340\246\266-\340\246\271\340\247\234\340\247\235 \340\247\237-\340\247\241\340\247\260\340\247\261\340\250\205-\340\250\212\340\2 50\217\340\250\220\340\250\223-\340\250\250\340\250\252-\340\250\260\340\250\262 \340\250\263\340\250\265\340\250\266\340\250\270\340\250\271\340\251\231-\340\25 1\234\340\251\236\340\251\262-\340\251\264\340\252\205-\340\252\213\340\252\215\ 340\252\217-\340\252\221\340\252\223-\340\252\250\340\252\252-\340\252\260\340\2 52\262\340\252\263\340\252\265-\340\252\271\340\252\275\340\253\240\340\254\205- \340\254\214\340\254\217\340\254\220\340\254\223-\340\254\250\340\254\252-\340\2 54\260\340\254\262\340\254\263\340\254\266-\340\254\271\340\254\275\340\255\234\ 340\255\235\340\255\237-\340\255\241\340\256\205-\340\256\212\340\256\216-\340\2 56\220\340\256\222-\340\256\225\340\256\231\340\256\232\340\256\234\340\256\236\ 340\256\237\340\256\243\340\256\244\340\256\250-\340\256\252\340\256\256-\340\25 6\265\340\256\267-\340\256\271\340\260\205-\340\260\214\340\260\216-\340\260\220 
\340\260\222-\340\260\250\340\260\252-\340\260\263\340\260\265-\340\260\271\340\ 261\240\340\261\241\340\262\205-\340\262\214\340\262\216-\340\262\220\340\262\22 2-\340\262\250\340\262\252-\340\262\263\340\262\265-\340\262\271\340\263\236\340 \263\240\340\263\241\340\264\205-\340\264\214\340\264\216-\340\264\220\340\264\2 22-\340\264\250\340\264\252-\340\264\271\340\265\240\340\265\241\340\270\201-\34 0\270\256\340\270\260\340\270\262\340\270\263\340\271\200-\340\271\205\340\272\2 01\340\272\202\340\272\204\340\272\207\340\272\210\340\272\212\340\272\215\340\2 72\224-\340\272\227\340\272\231-\340\272\237\340\272\241-\340\272\243\340\272\24 5\340\272\247\340\272\252\340\272\253\340\272\255\340\272\256\340\272\260\340\27 2\262\340\272\263\340\272\275\340\273\200-\340\273\204\340\275\200-\340\275\207\ 340\275\211-\340\275\251\341\202\240-\341\203\205\341\203\220-\341\203\266\341\2 04\200\341\204\202\341\204\203\341\204\205-\341\204\207\341\204\211\341\204\213\ 341\204\214\341\204\216-\341\204\222\341\204\274\341\204\276\341\205\200\341\205 \214\341\205\216\341\205\220\341\205\224\341\205\225\341\205\231\341\205\237-\34 1\205\241\341\205\243\341\205\245\341\205\247\341\205\251\341\205\255\341\205\25 6\341\205\262\341\205\263\341\205\265\341\206\236\341\206\250\341\206\253\341\20 6\256\341\206\257\341\206\267\341\206\270\341\206\272\341\206\274-\341\207\202\3 41\207\253\341\207\260\341\207\271\341\270\200-\341\272\233\341\272\240-\341\273 \271\341\274\200-\341\274\225\341\274\230-\341\274\235\341\274\240-\341\275\205\ 341\275\210-\341\275\215\341\275\220-\341\275\227\341\275\231\341\275\233\341\27 5\235\341\275\237-\341\275\275\341\276\200-\341\276\264\341\276\266-\341\276\274 \341\276\276\341\277\202-\341\277\204\341\277\206-\341\277\214\341\277\220-\341\ 277\223\341\277\226-\341\277\233\341\277\240-\341\277\254\341\277\262-\341\277\2 64\341\277\266-\341\277\274\342\204\246\342\204\252\342\204\253\342\204\256\342\ 
206\200-\342\206\202\343\200\207\343\200\241-\343\200\251\343\201\201-\343\202\2 24\343\202\241-\343\203\272\343\204\205-\343\204\254]")

In the patched version of ispell.el, ispell-decode-string translates the sequences of octets into the appropriate UTF-8 characters. It is the last group of characters that is of interest:

\343\200\207\343\200\241-\343\200\251\343\201\201-\343\202\224\343\202\241-\343\203\272\343\204\205-\343\204\254

These translate into UTF-8 characters in the mule-unicode-2500-33ff charset, corresponding to the Unicode code points

U+3007 (IDEOGRAPHIC NUMBER ZERO),
U+3021 to U+3029 (HANGZHOU NUMERALS),
U+3041 to U+3094 (HIRAGANA LETTERS),
U+30A1 to U+30FA (KATAKANA LETTERS), and
U+3105 to U+312C (BOPOMOFO LETTERS)

Emacs 21.3.50, however, complains that the above ranges stride over charsets. To mollify Emacs, I have to split three of the ranges and change the sequence to

\343\200\207\343\200\241-\343\200\251\343\201\201-\343\202\223\343\202\224\343\202\241-\343\203\266\343\203\267-\343\203\272\343\204\205-\343\204\251\343\204\252-\343\204\254

which corresponds to the same code points as above, but distributed as follows:

U+3007 (IDEOGRAPHIC NUMBER ZERO),
U+3021 to U+3029 (HANGZHOU NUMERALS),
U+3041 to U+3093, U+3094 (HIRAGANA LETTERS),
U+30A1 to U+30F6, U+30F7 to U+30FA (KATAKANA LETTERS), and
U+3105 to U+3129, U+312A to U+312C (BOPOMOFO LETTERS)

Interestingly enough, when I run

(decode-coding-string "\343\200\207\343\200\241-\343\200\251\343\201\201-\343\202\223\343\202\224\343\202\241-\343\203\266\343\203\267-\343\203\272\343\204\205-\343\204\251\343\204\252-\343\204\254" 'utf-8)

the result suggests that U+3021 to U+3029 (HANGZHOU NUMERALS), U+3041 to U+3093 (all but one of the HIRAGANA LETTERS), U+30A1 to U+30F6 (most of the KATAKANA LETTERS), and U+3105 to U+3129 (most of the BOPOMOFO LETTERS) are represented by double-wide characters, while the rest are represented by single-wide characters. In other words, the points at which I had to split the ranges coincide with the points at which the display width changes.

I am running the X11 version of Emacs on OS X.
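For anyone who wants to check the octet-to-code-point mapping above, here is a minimal sketch. The function names are hypothetical and are not part of ispell.el or the patch. The "U+XXXX" output assumes an Emacs in which a character's integer value is its Unicode code point (Emacs 23 and later); in 21.3.50 the internal character codes are different, so only the charset check is likely to be informative there.

```elisp
;; Hypothetical helpers for inspecting a UTF-8 octet string.
;; Assumption: Emacs 23 or later, where a decoded character's
;; integer value is its Unicode code point.

(defun my-utf-8-octets-to-code-points (octets)
  "Decode the UTF-8 bytes in OCTETS and list the resulting code points.
Each element of the returned list is a string of the form U+XXXX."
  (mapcar (lambda (ch) (format "U+%04X" ch))
          (string-to-list (decode-coding-string octets 'utf-8))))

(defun my-utf-8-octets-to-charsets (octets)
  "Decode the UTF-8 bytes in OCTETS and list each character's charset.
The charset names returned depend on the Emacs version and on the
current charset priorities, so no particular result is assumed."
  (mapcar #'char-charset
          (string-to-list (decode-coding-string octets 'utf-8))))

;; The first few octets of the sequence discussed above.  The "-"
;; is decoded as an ordinary character here; it only acts as a
;; range operator once the string is used inside a regexp bracket.
(my-utf-8-octets-to-code-points
 "\343\200\207\343\200\241-\343\200\251")
;; => ("U+3007" "U+3021" "U+002D" "U+3029")
```

Calling my-utf-8-octets-to-charsets on the two endpoints of a range (for example "\343\201\201\343\202\224" for U+3041 and U+3094) should show whether Emacs places them in different charsets, which is presumably the condition that triggers the error.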