From: Eli Zaretskii <eliz@gnu.org>
To: Reuben Thomas <rrt@sc3d.org>
Cc: 17742@debbugs.gnu.org
Subject: bug#17742: Acknowledgement (Support for enchant?)
Date: Tue, 20 Dec 2016 17:40:12 +0200
Message-ID: <834m1y4nj7.fsf@gnu.org>
In-Reply-To: <CAOnWdoinKZc9wK-KZpPVBjsGcG1Bphvt7PEu49S55mGDnoJcmg@mail.gmail.com> (message from Reuben Thomas on Mon, 19 Dec 2016 21:47:42 +0000)

> From: Reuben Thomas <rrt@sc3d.org>
> Date: Mon, 19 Dec 2016 21:47:42 +0000
> Cc: 17742@debbugs.gnu.org
> 
>     neither GNU Aspell nor hunspell offer any way to get this information (about character classes of dictionaries) via their APIs.
> 
>     They provide this information in the dictionaries, and we glean it
>     from there.  See ispell-parse-hunspell-affix-file and
>     ispell-aspell-find-dictionary.
> 
> The dictionaries are not part of the API (even where the format is documented, the location may not be fixed), so it's not a good idea to rely on them.

If there's no better way, then I see no problem in relying on the
dictionaries, and de facto the results are satisfactory.

> Having discovered that Aspell does not provide this information (I checked again, and ispell-aspell-find-dictionary does not find this information in the dictionaries, except for limited information about otherchars; for casechars and not-casechars it defaults to [:alpha:]), I shall investigate with the hunspell maintainers.

Aspell provides some of that, and there's no reason to ignore what it
does provide.

> Currently, using casechars = [[:graph:]], if I put point over part of the string " (XP) ", and run M-x ispell-word, it says "(XP) is correct". That's good enough for me!

Whether it's good enough depends on the dictionary and on what "(XP)"
means.  It could be that "(XP)", including the parentheses, is a word
the dictionary recognizes, something akin to "(C)", i.e. copyright
sign.  And it could be that the correct word is "XP", with the
parentheses acting as punctuation.  And there could be additional
alternatives.  Only the dictionary "knows" what is the right
alternative, and ispell.el should abide by the dictionary's rules, or
else it will not do what the user wants.  E.g., "XP" might be absent
from the dictionary (as in fact I find when I try that with Hunspell),
while "(XP)" is present.  So CASECHARS should be set up according to what the
dictionary expects, or you will have false positives and false
negatives.
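
To illustrate the point, here is a rough Python sketch (not ispell.el
code; the word_at helper and both regexps are hypothetical stand-ins
for CASECHARS, and the two character classes only approximate
[[:graph:]] and [[:alpha:]]).  It shows that the string picked up
around point, and hence the string the speller is asked about, depends
entirely on which characters count as word constituents:

```python
import re

def word_at(text, pos, casechars):
    """Return the maximal run of word-constituent characters around pos.

    casechars is a regexp character class; it is a hypothetical
    stand-in for the CASECHARS setting, not the real ispell.el logic.
    """
    for m in re.finditer(casechars + "+", text):
        if m.start() <= pos < m.end():
            return m.group(0)
    return None  # pos is not on a word-constituent character

text = " (XP) "
graph_like = r"[^\s]"     # rough approximation of [[:graph:]]
alpha_only = r"[^\W\d_]"  # rough approximation of [[:alpha:]]

print(word_at(text, 2, graph_like))  # (XP)
print(word_at(text, 2, alpha_only))  # XP
```

Which of the two results is the right one to send is exactly what only
the dictionary can tell us.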

> Note that merely using the characters declared in the dictionary may not be enough: I have words like SC³D (I spell my company that way) in my personal word lists. Other users might be more imaginative, and for example have sequences of emoji. The list of characters in the dictionary is only a minimum.

That's why personal word lists go together with dictionaries: they both
must use the same affix rules, so if you change to another dictionary
for the same language, your personal word list should also change, or
else you will get false negatives.

>     So we do need this information.  If Enchant doesn't provide it, we
>     could still use the same technique as with Aspell and Hunspell,
>     provided that we can figure out which back end(s) is/are used by
>     Enchant.  Is that doable?
> 
> Yes, that can be done, but it's fragile; that's why I'm trying to avoid it.

I don't see why it would be fragile with Enchant when it isn't with
its back-ends.  And avoiding even fragile methods is worse than using
them, when there's no better way of gleaning the same information, and
the information is important (as it is in this case).

>     Ispell.el also supports spell-checking by words, in which case the
>     above is not useful, because we need to figure out what is a word.
>
> See above. It's not clear to me that we need a very precise idea of what constitutes a word.

I think you are drawing overly radical conclusions from trying that
with a single word and a single dictionary.  Which string was sent to
the speller in this case, and is that the string you expected to be sent?

>     Moreover, even when we send entire lines to the speller, we want to
>     skip lines that include only non-word characters.
>
> Why?

To avoid false positives and false negatives, as explained above.
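
As a minimal sketch of what skipping such lines means (Python, purely
illustrative; lines_to_check is a hypothetical helper and the
character class is only an approximation of [[:alpha:]]), a line is
passed on to the speller only if it contains at least one
word-constituent character:

```python
import re

def lines_to_check(lines, casechars):
    """Keep only lines with at least one word-constituent character.

    casechars is a regexp character class, a hypothetical stand-in
    for CASECHARS; lines made only of other characters are skipped.
    """
    has_word_char = re.compile(casechars)
    return [line for line in lines if has_word_char.search(line)]

alpha_only = r"[^\W\d_]"  # rough approximation of [[:alpha:]]
sample = ["-----", "hello world", "*** ***", ""]
print(lines_to_check(sample, alpha_only))  # ['hello world']
```

With a class as wide as [[:graph:]], the separator lines in the
sample would no longer be skipped.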

>     Hunspell is the most modern and sophisticated speller, we certainly
>     don't want to degrade it.
> 
> No chance of that, this patch is only about Enchant.

First, Enchant could be using Hunspell as its engine, right?

And second, AFAIU this discussion started with you proposing to get rid
of CASECHARS etc., for all spellers, not just for Enchant, something
that will definitely cause degradation.

>       Also, Aspell uses the dictionaries at least
>     for some of this info, see the function I pointed to above.
>
> Only for otherchars, not casechars/not-casechars.

Partial information is better than no information, IMO.

>     Bottom line, this information cannot be thrown away or ignored.  It is
>     important for correctly interfacing with a dictionary and for doing
>     TRT as the users expect.  Any modern speller program would benefit
>     from it, and therefore we should strive to provide such information to
>     ispell.el whenever we possibly can.
>
> It is not a question of throwing away or ignoring information: the information is simply not available through documented channels (at least for Enchant). Yes, one can find the underlying engine and then use that information to (try to) find the dictionaries, but one is then making a number of brittle assumptions. And it's not clear that the information is actually necessary to have.

It sounds like the important part of our disagreement is in the last
sentence.  If so, I hope I've succeeded in changing your mind.  Failing
that, all I can suggest is to study the spelling rules of a modern
speller, such as Hunspell, and see how this information is used there.

> It would be helpful if you could show a situation in which using [:graph:] for enchant dictionaries actually misbehaves in some way.

I tried to explain that above: you will get false positives or false
negatives, and/or irrelevant or missing corrections from the speller.
For example, if you send "foo.bar", and the speller doesn't support
'.' as a word-constituent character, you will get separate
suggestions for "foo" and "bar", and won't get "foobar".
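
A toy model of the speller's side of the "foo.bar" example may make
this more concrete (Python, a sketch only; split_words and its
arguments are hypothetical, loosely modeled on the CASECHARS and
OTHERCHARS notions, and no real speller is guaranteed to tokenize
this way):

```python
import re

def split_words(line, casechars, otherchars=""):
    """Split a line into the strings that would be checked as words.

    casechars and otherchars are regexp character classes (hypothetical
    stand-ins for CASECHARS and OTHERCHARS).  An otherchars character
    may appear inside a word, but not at either end of it.
    """
    word = casechars + "+"
    if otherchars:
        pattern = f"{word}(?:{otherchars}{word})*"
    else:
        pattern = word
    return re.findall(pattern, line)

alpha_only = r"[^\W\d_]"  # rough approximation of [[:alpha:]]

# '.' is not a word constituent: two words, each checked on its own.
print(split_words("foo.bar", alpha_only))          # ['foo', 'bar']
# '.' is accepted inside words: one word, checked as a whole.
print(split_words("foo.bar", alpha_only, r"[.]"))  # ['foo.bar']
```

If ispell.el and the speller disagree about which of these two
splittings applies, the suggestions will be for the wrong strings.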

I also don't understand why you want to remove this information: it
is already there, it is no harder to get with Enchant than without
it, and the code which supports it already exists.
