From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: Agustin Martin <agustin.martin@hispalinux.es>
Newsgroups: gmane.emacs.devel
Subject: Re: Ispell and unibyte characters
Date: Mon, 26 Mar 2012 19:39:12 +0200
Message-ID: <20120326173912.GA22306@agmartin.aq.upm.es>
References: <83aa3f2hgh.fsf@gnu.org>
NNTP-Posting-Host: plane.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
To: emacs-devel@gnu.org
Mail-Followup-To: emacs-devel@gnu.org
Content-Disposition: inline
In-Reply-To: <83aa3f2hgh.fsf@gnu.org>
User-Agent: Mutt/1.5.21 (2010-09-15)
Xref: news.gmane.org gmane.emacs.devel:149223
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/149223>

On Sat, Mar 17, 2012 at 08:46:54PM +0200, Eli Zaretskii wrote:
> The doc string of ispell-dictionary-alist says, inter alia:
> 
>   Each element of this list is also a list:
> 
>   (DICTIONARY-NAME CASECHARS NOT-CASECHARS OTHERCHARS MANY-OTHERCHARS-P
> 	  ISPELL-ARGS EXTENDED-CHARACTER-MODE CHARACTER-SET)
>   ...
>   CASECHARS, NOT-CASECHARS, and OTHERCHARS must be unibyte strings
>   containing bytes of CHARACTER-SET.  In addition, if they contain
>   a non-ASCII byte, the regular expression must be a single
>   `character set' construct that doesn't specify a character range
>   for non-ASCII bytes.
> 
> Why the restriction to unibyte character sets?  This is quite a
> serious limitation, given that the modern spellers (aspell and
> hunspell) use UTF-8 as their default encoding.

Hi Eli,

At least for aspell, ispell.el already uses UTF-8 as the default
communication encoding, with [:alpha:] as CASECHARS and ^[:alpha:] as
NOT-CASECHARS. OTHERCHARS is guessed from the aspell .dat file for the
given dictionary.

Since it is currently not possible to ask hunspell for its installed
dictionaries (hunspell -D does not return control to the console),
no one has tried something similar for hunspell.
 
> The only reason for this limitation I could find is in
> ispell-process-line, which assumes that the byte offsets returned by
> the speller can be used to compute character position of the
> misspelled word in the buffer.  Are there any other places in
> ispell.el that assume unibyte characters?

I am not sure whether using UTF-8 and [:alpha:] has caused any problems
for aspell; I do not remember any reports about this.
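
To make the offset concern quoted above concrete, here is a minimal
Python sketch (not ispell.el code). It assumes the speller reports
offsets in bytes of the encoded line, as the quoted text suggests, and
the helper name byte_to_char_offset is hypothetical. It only shows why
a byte offset stops matching the character position once a non-ASCII
character precedes the word in UTF-8:

```python
def byte_to_char_offset(line: str, byte_offset: int, encoding: str = "utf-8") -> int:
    """Convert a byte offset into the encoded line to a character offset.

    Assumes byte_offset falls on a character boundary; an offset in the
    middle of a multibyte sequence would raise UnicodeDecodeError.
    """
    encoded = line.encode(encoding)
    # Decode only the bytes before the offset and count the characters.
    return len(encoded[:byte_offset].decode(encoding))


# "caf\u00e9" ends in e-acute, which takes two bytes in UTF-8.
line = "caf\u00e9 wrod"
byte_offset = line.encode("utf-8").index(b"wrod")
char_offset = byte_to_char_offset(line, byte_offset)
```

With a unibyte charset such as latin-1 the two offsets coincide, which
is presumably why the assumption was harmless in the past.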

> If ispell-process-line is the only place, then it should be easy to
> extend it so it handles correctly UTF-8 in addition to unibyte
> character sets.
> 
> In any case, I see no reason to specify CASECHARS, NOT-CASECHARS, and
> OTHERCHARS as ugly unibyte escapes, since their usage is entirely
> consistent with multibyte characters: they are used to construct
> regular expressions and match buffer text against those regexps.  

IIRC, the reason to use octal escapes is mostly that they are
encoding-independent. Otherwise a .emacs file may end up with mixed
unibyte/multibyte encodings.
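
As a rough illustration of what "encoding independent" means here, the
following Python sketch (again, not Emacs Lisp and not taken from
ispell.el) writes one byte as an octal escape. The source text stays
pure ASCII, and the byte only becomes a particular character once a
charset is chosen, which is the role CHARACTER-SET plays in the
dictionary entry. The two charsets are arbitrary examples:

```python
# One byte written as an octal escape: value 0o341 == 225 == 0xE1.
# The source file itself contains only ASCII characters.
raw = b"\341"

# The same byte maps to different characters under different charsets.
as_latin1 = raw.decode("iso-8859-1")  # U+00E1, a with acute
as_greek = raw.decode("iso-8859-7")   # U+03B1, Greek small alpha
```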

The current limitation in the docstring may just be something left over
from old times. I will try to test with a recent ispell american
dictionary, which can be called in UTF-8, and will let you know.

Regards,

-- 
Agustin