From: Eli Zaretskii
Newsgroups: gmane.emacs.devel
Subject: Ispell and unibyte characters
Date: Sat, 17 Mar 2012 20:46:54 +0200
Message-ID: <83aa3f2hgh.fsf@gnu.org>
To: emacs-devel@gnu.org

The doc string of ispell-dictionary-alist says, inter alia:

  Each element of this list is also a list:

  (DICTIONARY-NAME CASECHARS NOT-CASECHARS OTHERCHARS MANY-OTHERCHARS-P
          ISPELL-ARGS EXTENDED-CHARACTER-MODE CHARACTER-SET)
  ...
  CASECHARS, NOT-CASECHARS, and OTHERCHARS must be unibyte strings
  containing bytes of CHARACTER-SET.  In addition, if they contain
  a non-ASCII byte, the regular expression must be a single
  `character set' construct that doesn't specify a character range
  for non-ASCII bytes.

Why the restriction to unibyte character sets?  This is quite a
serious limitation, given that the modern spellers (aspell and
hunspell) use UTF-8 as their default encoding.

The only reason for this limitation I could find is in
ispell-process-line, which assumes that the byte offsets returned by
the speller can be used to compute the character position of the
misspelled word in the buffer.  Are there any other places in
ispell.el that assume unibyte characters?  If ispell-process-line is
the only place, then it should be easy to extend it so that it
correctly handles UTF-8 in addition to unibyte character sets.

In any case, I see no reason to specify CASECHARS, NOT-CASECHARS, and
OTHERCHARS as ugly unibyte escapes, since their usage is entirely
consistent with multibyte characters: they are used to construct
regular expressions and match buffer text against those regexps.

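To illustrate the offset problem mentioned above for
ispell-process-line, here is a minimal sketch in Python (not Emacs
Lisp, and not taken from ispell.el).  It only shows why a byte offset
equals a character offset in a unibyte encoding but not in UTF-8, and
one possible way to convert between the two.  The function name
byte_to_char_offset and the sample line are hypothetical, and the
speller's actual output format is not modeled here.

```python
def byte_to_char_offset(text: str, byte_offset: int, encoding: str = "utf-8") -> int:
    """Return the character index in `text` that corresponds to
    `byte_offset` in the encoded form of `text`.

    Hypothetical helper for illustration only.
    """
    encoded = text.encode(encoding)
    if not 0 <= byte_offset <= len(encoded):
        raise ValueError("byte offset outside the encoded line")
    # Decode only the bytes that precede the offset and count the
    # resulting characters.  This raises UnicodeDecodeError if the
    # offset falls in the middle of a multibyte sequence.
    return len(encoded[:byte_offset].decode(encoding))


# A line with two non-ASCII characters before the misspelled word.
line = "naïve café wrold"

# Offset of the misspelling as a byte-oriented program would see it.
utf8_byte_offset = line.encode("utf-8").index(b"wrold")
latin1_byte_offset = line.encode("latin-1").index(b"wrold")

# Character positions recovered from those byte offsets.
utf8_char_offset = byte_to_char_offset(line, utf8_byte_offset, "utf-8")
latin1_char_offset = byte_to_char_offset(line, latin1_byte_offset, "latin-1")
```

In the Latin-1 case the byte offset can be used as a character
position directly; in the UTF-8 case it is larger by one for each
two-byte character that precedes the word, so it has to be converted
before it is applied to the buffer text.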
Did I miss something important? Any comments and pointers to my blunders are welcome.