On 19 December 2016 at 16:01, Eli Zaretskii <eliz@gnu.org> wrote:

> > From: Reuben Thomas <rrt@sc3d.org>
> > Date: Sun, 18 Dec 2016 23:39:54 +0000
> > Cc: 17742@debbugs.gnu.org
> >
> > I have not had any response to my enquiries yet, but I did some
> > research, and neither GNU Aspell nor hunspell offer any way to get this
> > information (about character classes of dictionaries) via their APIs.
>
> They provide this information in the dictionaries, and we glean it
> from there.  See ispell-parse-hunspell-affix-file and
> ispell-aspell-find-dictionary.
>

The dictionaries are not part of the API (even where the format is
documented, the location may not be fixed), so it's not a good idea to rely
on them.

Having discovered that Aspell does not provide this information (I checked
again, and ispell-aspell-find-dictionary does not find this information in
the dictionaries, except for limited information about otherchars; for
casechars and not-casechars it defaults to [:alpha:]), I shall investigate
with the hunspell maintainers.


> Maybe there's a misunderstanding: I'm talking about the CASECHARS,
> NOT-CASECHARS, and OTHERCHARS parts of the dictionary data in
> ispell-dictionary-alist.


There's no misunderstanding here; that's what I'm talking about.

> Each dictionary can (and many do) use some of the punctuation
> characters in the words it can handle.  A notable example is the
> apostrophe ' in English, used for the various suffixes that spellers
> support; similar features exist in other languages, but with possibly
> different punctuation characters.  Ispell.el must match that by using
> the speller's notion of a word, which must be independent of the
> current major mode's idea of what a word is.  This is where these
> character sets come into play, and I really cannot see how can
> ispell.el work well without using them as it does now.
>

Currently, using casechars = [[:graph:]], if I put point over part of the
string " (XP) ", and run M-x ispell-word, it says "(XP) is correct". That's
good enough for me!

Note that merely using the characters declared in the dictionary may not be
enough: I have words like SC³D (I spell my company that way) in my personal
word lists. Other users might be more imaginative, and for example have
sequences of emoji. The list of characters in the dictionary is only a
minimum.
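
A minimal sketch of the difference, in Python rather than Emacs Lisp. It is
not ispell.el code: is_alpha and is_graph are hypothetical stand-ins that only
approximate the [:alpha:] and [:graph:] regexp classes, and word_at is an
illustrative helper that mimics picking the word around point.

```python
import unicodedata


def is_alpha(ch: str) -> bool:
    # Assumption: approximates [:alpha:] as "any Unicode letter".
    return unicodedata.category(ch).startswith("L")


def is_graph(ch: str) -> bool:
    # Assumption: approximates [:graph:] as anything that is not whitespace,
    # a control character, a surrogate, or an unassigned code point.
    return not ch.isspace() and unicodedata.category(ch) not in ("Cc", "Cs", "Cn")


def word_at(text: str, pos: int, is_word_char) -> str:
    """Return the maximal run of word characters around index pos."""
    if not is_word_char(text[pos]):
        return ""
    start = pos
    while start > 0 and is_word_char(text[start - 1]):
        start -= 1
    end = pos + 1
    while end < len(text) and is_word_char(text[end]):
        end += 1
    return text[start:end]


print(word_at(" (XP) ", 2, is_graph))  # (XP)
print(word_at(" (XP) ", 2, is_alpha))  # XP
print(word_at("SC³D", 0, is_graph))    # SC³D
print(word_at("SC³D", 0, is_alpha))    # SC
```

With the letter-only class, a personal-dictionary entry such as SC³D is cut at
the superscript digit, so the speller would never be asked about the whole
word; with the broader class it is passed through intact.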


> So we do need this information.  If Enchant doesn't provide it, we
> could still use the same technique as with Aspell and Hunspell,
> provided that we can figure out which back end(s) is/are used by
> Enchant.  Is that doable?
>

Yes, that can be done, but it's fragile; that's why I'm trying to avoid
it.

> Ispell.el also supports spell-checking by words, in which case the
> above is not useful, because we need to figure out what is a word.
>

See above. It's not clear to me that we need a very precise idea of what
constitutes a word.

> Moreover, even when we send entire lines to the speller, we want to
> skip lines that include only non-word characters.


Why?

> Just look at the callers of the above-mentioned accessor functions,
> and you will see how we use them.
>

I have read this code. I see how we use them; it's just not clear to me
that it's necessary to use them thus.

> Hunspell is the most modern and sophisticated speller, we certainly
> don't want to degrade it.


No chance of that; this patch is only about Enchant.

> Also, Aspell uses the dictionaries at least
> for some of this info, see the function I pointed to above.
>

Only for otherchars, not casechars/not-casechars.

> Bottom line, this information cannot be thrown away or ignored.  It is
> important for correctly interfacing with a dictionary and for doing
> TRT as the users expect.  Any modern speller program would benefit
> from it, and therefore we should strive to provide such information to
> ispell.el whenever we possibly can.
>

It is not a question of throwing away or ignoring information: the
information is simply not available through documented channels (at least
for Enchant). Yes, one can find the underlying engine and then use that
information to (try to) find the dictionaries, but one is then making a
number of brittle assumptions. And it's not clear that the information is
actually necessary to have.

It would be helpful if you could show a situation in which using [:graph:]
for Enchant dictionaries actually misbehaves in some way.

In fact, reading Enchant's source code, I see that it uses a fixed set of
Unicode classes for its own internal equivalent of casechars. Using that would
make sense (for Enchant! again, I'm not suggesting changing how we use hunspell).

One other data point: a senior LyX maintainer, Jean-Marc Lasgouttes, agrees
with you:

https://github.com/AbiWord/enchant/issues/17#issuecomment-267924304

He says that LyX has a "bug open somewhere" that suggests using this
information (but he didn't know it was available!).

-- 
http://rrt.sc3d.org