From: Agustin Martin <agustin.martin@hispalinux.es>
To: emacs-devel@gnu.org
Subject: Re: Ispell and unibyte characters
Date: Wed, 28 Mar 2012 21:18:21 +0200
Message-ID: <20120328191821.GA6266@agmartin.aq.upm.es>
In-Reply-To: <E1SCGD0-0001Dm-Tu@fencepost.gnu.org>

On Mon, Mar 26, 2012 at 04:08:06PM -0400, Eli Zaretskii wrote:
> > Date: Mon, 26 Mar 2012 19:39:12 +0200
> > From: Agustin Martin <agustin.martin@hispalinux.es>
> > 
> > Hi Eli,
> 
> Thanks for responding, I was beginning to think that no one is
> interested.  In general, I find that ispell.el is in sore need of
> modernization; at least that's my conclusion so far from playing with
> hunspell (with which I want to replace my aging collection of Ispell
> and its dictionaries that I use for many years).
> 
> > At least for aspell ispell.el already uses utf8 as default communication
> > encoding and [:alpha:] as CASECHARS (and ^[:alpha:] as NOT-CASECHARS). 
> > OTHERCHARS is guessed from aspell .dat file for given dictionary.
> 
> The question is, why isn't this done for any modern speller.  The only
> one I know of that cannot handle UTF-8 is Ispell.

I think the only real remaining reason is XEmacs compatibility. AFAIK,
XEmacs does not support [:alpha:].

I thought about filtering ispell-dictionary-base-alist when it is used from
FSF Emacs, so that it uses [:alpha:] there while still keeping compatibility.
I am currently a bit busy, but at some point I may try this for Debian and
see what happens.

For XEmacs in Debian GNU/*, even changing to [:alpha:] should have a reduced
impact, since strings provided by dictionary maintainers take precedence.
Still, it is better if I can easily do the above anyway, so that [:alpha:]
is used where available.
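
To make the filtering idea concrete, here is a rough sketch in Python rather
than Emacs Lisp, so it is only a model and not code for ispell.el. The field
order mirrors an ispell-dictionary-alist entry; the names DictEntry and
modernize, the alpha_supported flag and the sample entry are made up for
illustration.

```python
from typing import List, NamedTuple, Optional


class DictEntry(NamedTuple):
    """Hypothetical model of one dictionary entry (same field order as in ispell.el)."""
    name: Optional[str]
    casechars: str
    not_casechars: str
    otherchars: str
    many_otherchars_p: bool
    ispell_args: List[str]
    extended_character_mode: Optional[str]
    character_set: str


def modernize(entries: List[DictEntry], alpha_supported: bool) -> List[DictEntry]:
    """Swap hand-written character ranges for [:alpha:] when the editor supports it."""
    if not alpha_supported:
        # XEmacs-like case: keep the entries exactly as they are.
        return list(entries)
    return [
        e._replace(casechars="[[:alpha:]]", not_casechars="[^[:alpha:]]")
        for e in entries
    ]


# Illustrative entry only; the ranges are abbreviated, not copied from ispell.el.
base = [DictEntry("castellano", "[A-Za-z]", "[^A-Za-z]", "[-]",
                  False, ["-B"], "~tex", "iso-8859-1")]

filtered = modernize(base, alpha_supported=True)
untouched = modernize(base, alpha_supported=False)
```

Only CASECHARS and NOT-CASECHARS are changed in this sketch; all other fields,
including OTHERCHARS, are left as provided.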

Once the release happens, I'd like to commit some other changes to reduce
XEmacs incompatibilities in ispell.el and flyspell.el, so that my changes for
Debian GNU/* become smaller.

> OTHERCHARS are not very important anyway, at least for languages I'm
> interested in.
> 
> > Since currently it is not possible to ask hunspell for installed
> > dictionaries (hunspell -D does not return control to the console)
> > no one tried something similar for hunspell.
> 
> In what version do you have problems with -D?

Hunspell 1.3.2. It does not return control until I press ^C. The option may
still be useful if someone wants to find out about installed hunspell
dictionaries and prepare something that uses that info, similar to what
ispell.el currently does for aspell.

> In any case, hunspell supports multiple dictionaries in the same
> session.  One can invoke it with, e.g., "-d en_US,de_DE,ru_RU,he_IL"
> and have it spell-check mixed text that uses all these languages in
> the same buffer (at least in theory; I didn't yet try that in my
> experiments).  Clearly, this can only be done with UTF-8 or some such
> as the encoding.

Right.
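
As a rough illustration of the multi-dictionary invocation quoted above, this
Python sketch only builds the argument list and does not run anything. The
helper name hunspell_argv is made up; -a (pipe mode), -i (input encoding) and
-d (dictionary list) are the hunspell options assumed here.

```python
from typing import List


def hunspell_argv(dictionaries: List[str], encoding: str = "utf-8") -> List[str]:
    """Build a hunspell command line for pipe mode with several dictionaries."""
    if not dictionaries:
        raise ValueError("at least one dictionary is required")
    # Dictionaries are passed as one comma-separated value to -d.
    return ["hunspell", "-a", "-i", encoding, "-d", ",".join(dictionaries)]


argv = hunspell_argv(["en_US", "de_DE", "ru_RU", "he_IL"])
```

A single common encoding such as UTF-8 is what makes it possible to mix these
languages in one session.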

> So I think we should deprecate usage of the unibyte characters in the
> ispell.el defaults, and simply use [:alpha:] for all languages.  As a
> bonus, we can then get rid of the ridiculously long and hard to
> maintain customization of each new dictionary you add to your
> repertory.  Just one entry will serve almost any language, or at least
> supply an excellent default.
> 
> > > The only reason for this limitation I could find is in
> > > ispell-process-line, which assumes that the byte offsets returned by
> > > the speller can be used to compute character position of the
> > > misspelled word in the buffer.  Are there any other places in
> > > ispell.el that assume unibyte characters?
> > 
> > Not sure if using utf8 and [:alpha:] has caused some problem for aspell,
> > I do not remember reports about this. 
> 
> Since I wrote that, I found that the problem was due to a bug in
> hunspell (which I fixed in my copy): it reported byte offsets of the
> misspelled words, rather than character offsets.  After fixing that
> bug, there's no issue here anymore and nothing to fix in ispell.el.
> There's a bug report with a patch about that in the hunspell bug
> tracker, so there's reason to believe this bug will be fixed in a
> future release.

You mean

http://sourceforge.net/tracker/?func=detail&aid=3178449&group_id=143754&atid=756395

I filed that bug one year ago and received no reply from hunspell
maintainers. This year I received a followup with a proposed change, but
there is still no reply to it.
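
For anyone following along, the difference between the two kinds of offset is
easy to see with a small Python sketch (not ispell.el or hunspell code; the
sample line and the helper name byte_to_char_offset are made up for
illustration). It assumes the reported byte offset falls on a character
boundary.

```python
def byte_to_char_offset(line: str, byte_offset: int, encoding: str = "utf-8") -> int:
    """Convert a byte offset into the encoded line to a character offset."""
    raw = line.encode(encoding)
    # Decoding the prefix counts characters instead of bytes; this raises
    # UnicodeDecodeError if byte_offset falls in the middle of a character.
    return len(raw[:byte_offset].decode(encoding))


line = "señor wrold"          # "wrold" stands in for the misspelled word
word = "wrold"

char_off = line.index(word)                                   # offset in characters
byte_off = line.encode("utf-8").index(word.encode("utf-8"))   # offset in bytes

recovered = byte_to_char_offset(line, byte_off)
```

Here the two offsets differ by one because "ñ" takes two bytes in UTF-8, so a
speller reporting byte offsets makes the caller point one character too far
to the right.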

There is another problem, which mostly affects reusing the ispell default
entries under hunspell:

http://sourceforge.net/tracker/?func=detail&aid=2617130&group_id=143754&atid=756395

[~ prefixed strings are treated as words in pipe mode]

It has now been open for three years. I have waited in the hope that it would
be fixed, but I think I will soon commit to Emacs the same change I use for
Debian, making sure extended-character-mode is nil for hunspell. I do not
think extended-character-mode pseudo-charsets will ever be implemented in
hunspell.
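
A simplified model of that change, sketched in Python rather than Emacs Lisp:
the helper names are hypothetical, and the pipe commands assumed are the
ispell-style "!" (terse mode), "~..." (extended character mode) and "^" (check
the rest of the line).

```python
from typing import List, Optional


def effective_extended_mode(program: str, entry_mode: Optional[str]) -> Optional[str]:
    """Drop the entry's extended-character-mode when the speller is hunspell."""
    return None if "hunspell" in program else entry_mode


def pipe_lines(text_lines: List[str], extended_mode: Optional[str]) -> List[str]:
    """List the lines sent to the speller process for the given text."""
    out = ["!"]                       # ask for terse output
    if extended_mode:
        out.append(extended_mode)     # e.g. "~latin1"; hunspell would check it as a word
    out.extend("^" + line for line in text_lines)
    return out


for_ispell = pipe_lines(["hola"], effective_extended_mode("ispell", "~latin1"))
for_hunspell = pipe_lines(["hola"], effective_extended_mode("hunspell", "~latin1"))
```

With the mode forced to nil, the "~" line is never sent to hunspell, so there
is nothing for it to misreport as a misspelling.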

-- 
Agustin


