Re: Ispell and unibyte characters

unofficial mirror of emacs-devel@gnu.org 

From: Agustin Martin <agustin.martin@hispalinux.es>
To: emacs-devel@gnu.org
Subject: Re: Ispell and unibyte characters
Date: Mon, 26 Mar 2012 19:39:12 +0200
Message-ID: <20120326173912.GA22306@agmartin.aq.upm.es>
In-Reply-To: <83aa3f2hgh.fsf@gnu.org>

On Sat, Mar 17, 2012 at 08:46:54PM +0200, Eli Zaretskii wrote:
> The doc string of ispell-dictionary-alist says, inter alia:
> 
>   Each element of this list is also a list:
> 
>   (DICTIONARY-NAME CASECHARS NOT-CASECHARS OTHERCHARS MANY-OTHERCHARS-P
> 	  ISPELL-ARGS EXTENDED-CHARACTER-MODE CHARACTER-SET)
>   ...
>   CASECHARS, NOT-CASECHARS, and OTHERCHARS must be unibyte strings
>   containing bytes of CHARACTER-SET.  In addition, if they contain
>   a non-ASCII byte, the regular expression must be a single
>   `character set' construct that doesn't specify a character range
>   for non-ASCII bytes.
> 
> Why the restriction to unibyte character sets?  This is quite a
> serious limitation, given that the modern spellers (aspell and
> hunspell) use UTF-8 as their default encoding.

Hi Eli,

At least for aspell, ispell.el already uses utf8 as the default communication
encoding and [:alpha:] as CASECHARS (and ^[:alpha:] as NOT-CASECHARS).
OTHERCHARS is guessed from the aspell .dat file for the given dictionary.

Since it is currently not possible to ask hunspell for the installed
dictionaries (hunspell -D does not return control to the console),
no one has tried something similar for hunspell.

> The only reason for this limitation I could find is in
> ispell-process-line, which assumes that the byte offsets returned by
> the speller can be used to compute character position of the
> misspelled word in the buffer.  Are there any other places in
> ispell.el that assume unibyte characters?

I am not sure whether using utf8 and [:alpha:] has caused any problems for
aspell; I do not remember any reports about this.
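
To make the offset concern quoted above concrete, here is a minimal sketch in
Python (not code from ispell.el). It assumes the speller reports the position
of a misspelled word as a byte offset into the encoded line; the function name
byte_to_char_offset and the sample sentence are hypothetical.

```python
def byte_to_char_offset(line: str, byte_offset: int, encoding: str = "utf-8") -> int:
    """Convert a byte offset into the encoded line to a character offset."""
    encoded = line.encode(encoding)
    # Decode only the bytes that precede the offset and count the characters.
    # This raises UnicodeDecodeError if the offset falls inside a multibyte
    # sequence, which would indicate a bogus offset.
    return len(encoded[:byte_offset].decode(encoding))


line = "el niño corre rapdo"

# In UTF-8 the "ñ" occupies two bytes, so the byte offset of "rapdo"
# is one larger than its character offset.
utf8_byte_offset = line.encode("utf-8").index(b"rapdo")
char_offset = byte_to_char_offset(line, utf8_byte_offset)

# In a unibyte encoding such as ISO-8859-1 both offsets coincide.
latin1_byte_offset = line.encode("iso-8859-1").index(b"rapdo")
```

With a unibyte character set the two offsets are always equal, which is
presumably why the assumption goes unnoticed there; with UTF-8 they diverge
after the first non-ASCII character on the line.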

> If ispell-process-line is the only place, then it should be easy to
> extend it so it handles correctly UTF-8 in addition to unibyte
> character sets.
> 
> In any case, I see no reason to specify CASECHARS, NOT-CASECHARS, and
> OTHERCHARS as ugly unibyte escapes, since their usage is entirely
> consistent with multibyte characters: they are used to construct
> regular expressions and match buffer text against those regexps.  

IIRC, the reason to use octal escapes is mostly that they are
encoding-independent. Otherwise, a .emacs file may end up with mixed
unibyte/multibyte encodings.
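
The encoding-independence argument can be illustrated with a short Python
sketch (again only an illustration, not Emacs code). It assumes a dictionary
whose CHARACTER-SET is ISO-8859-1 and uses "é" as a sample character; the
variable names are hypothetical.

```python
# The octal escape \351 always denotes the single byte 0xE9, and the escape
# itself is plain ASCII, so it looks the same however the file is saved.
escaped = b"\351"

# A literal "é" typed into the file is stored differently depending on the
# encoding the file is saved in.
literal = "é"
saved_as_latin1 = literal.encode("iso-8859-1")  # one byte
saved_as_utf8 = literal.encode("utf-8")         # two bytes

# Only the ISO-8859-1 form matches the byte the escape stands for.
matches_latin1 = saved_as_latin1 == escaped
matches_utf8 = saved_as_utf8 == escaped
```

So a literal character is only right if the file encoding happens to agree
with the dictionary's character set, while the escape carries no such
dependency.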

The current limitation in the docstring may just be something left over from
old times. I will try to check with a recent ispell american dictionary, which
can be called in utf8. I will let you know.

Regards,

-- 
Agustin
