From: Agustin Martin
Newsgroups: gmane.emacs.devel
Subject: Re: Ispell and unibyte characters
Date: Mon, 26 Mar 2012 19:39:12 +0200
Message-ID: <20120326173912.GA22306@agmartin.aq.upm.es>
References: <83aa3f2hgh.fsf@gnu.org>
To: emacs-devel@gnu.org

On Sat, Mar 17, 2012 at 08:46:54PM +0200, Eli Zaretskii wrote:
> The doc string of ispell-dictionary-alist says, inter alia:
>
> Each element of this list is also a list:
>
> (DICTIONARY-NAME CASECHARS NOT-CASECHARS OTHERCHARS MANY-OTHERCHARS-P
> ISPELL-ARGS EXTENDED-CHARACTER-MODE CHARACTER-SET)
> ...
> CASECHARS, NOT-CASECHARS, and OTHERCHARS must be unibyte strings
> containing bytes of CHARACTER-SET. In addition, if they contain
> a non-ASCII byte, the regular expression must be a single
> `character set' construct that doesn't specify a character range
> for non-ASCII bytes.
>
> Why the restriction to unibyte character sets? This is quite a
> serious limitation, given that the modern spellers (aspell and
> hunspell) use UTF-8 as their default encoding.

Hi Eli,

At least for aspell, ispell.el already uses UTF-8 as the default
communication encoding, with [:alpha:] as CASECHARS (and ^[:alpha:] as
NOT-CASECHARS). OTHERCHARS is guessed from the aspell .dat file for the
given dictionary. Since it is currently not possible to ask hunspell for
its installed dictionaries (hunspell -D does not return control to the
console), no one has tried something similar for hunspell.

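To make concrete why the communication encoding matters for word
positions: if a speller reports an offset counted in bytes of the line it
was sent, that offset stops matching the character position as soon as a
non-ASCII character precedes the word in UTF-8. The following is only a
sketch in Python, not code from ispell.el; `byte_to_char_offset` is a
hypothetical helper, and whether a given speller reports byte or
character offsets is an assumption made for the illustration.

```python
def byte_to_char_offset(line: str, byte_offset: int, encoding: str = "utf-8") -> int:
    """Convert an offset counted in encoded bytes to a character offset.

    Hypothetical helper for illustration only. It assumes byte_offset
    falls on a character boundary of the encoded line.
    """
    encoded = line.encode(encoding)
    # Decode just the prefix before the reported offset and count its characters.
    return len(encoded[:byte_offset].decode(encoding))


line = "el niño pequenno"
word = "pequenno"

# Position of the word as a byte-oriented speller would see it in UTF-8.
byte_offset = line.encode("utf-8").index(word.encode("utf-8"))

# Position of the word in terms of characters, as needed for the buffer.
char_offset = byte_to_char_offset(line, byte_offset)

# In a unibyte encoding such as Latin-1 the two offsets coincide.
latin1_offset = byte_to_char_offset(line, 8, "latin-1")
```

Here the "ñ" takes two bytes in UTF-8, so the byte offset of the word is
one larger than its character offset, while in Latin-1 both are the same.
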
> The only reason for this limitation I could find is in
> ispell-process-line, which assumes that the byte offsets returned by
> the speller can be used to compute character position of the
> misspelled word in the buffer. Are there any other places in
> ispell.el that assume unibyte characters?

I am not sure whether using UTF-8 and [:alpha:] has caused any problems
for aspell; I do not remember any reports about this.

> If ispell-process-line is the only place, then it should be easy to
> extend it so it handles correctly UTF-8 in addition to unibyte
> character sets.
>
> In any case, I see no reason to specify CASECHARS, NOT-CASECHARS, and
> OTHERCHARS as ugly unibyte escapes, since their usage is entirely
> consistent with multibyte characters: they are used to construct
> regular expressions and match buffer text against those regexps.

IIRC, the reason to use octal escapes is mostly that they are
encoding-independent. Otherwise a .emacs file may end up with mixed
unibyte/multibyte encodings. The current limitation in the docstring may
just be a leftover from old times.

I will try to look at this with the recent ispell american dictionary,
which can be called in UTF-8. Will let you know.

Regards,

-- 
Agustin