On 19 December 2016 at 16:01, Eli Zaretskii wrote:

> > From: Reuben Thomas
> > Date: Sun, 18 Dec 2016 23:39:54 +0000
> > Cc: 17742@debbugs.gnu.org
> >
> > I have not had any response to my enquiries yet, but I did some
> > research, and neither GNU Aspell nor hunspell offer any way to get
> > this information (about character classes of dictionaries) via
> > their APIs.
>
> They provide this information in the dictionaries, and we glean it
> from there. See ispell-parse-hunspell-affix-file and
> ispell-aspell-find-dictionary.

The dictionaries are not part of the API (even where the format is
documented, the location may not be fixed), so it's not a good idea to
rely on them.

Having discovered that Aspell does not provide this information (I
checked again, and ispell-aspell-find-dictionary does not find this
information in the dictionaries, except for limited information about
otherchars; for casechars and not-casechars it defaults to [:alpha:]),
I shall investigate with the hunspell maintainers.

> Maybe there's a misunderstanding: I'm talking about the CASECHARS,
> NOT-CASECHARS, and OTHERCHARS parts of the dictionary data in
> ispell-dictionary-alist.

There's no misunderstanding here, that's what I'm talking about.

> Each dictionary can (and many do) use some of the punctuation
> characters in the words it can handle. A notable example is the
> apostrophe ' in English, used for the various suffixes that spellers
> support; similar features exist in other languages, but with possibly
> different punctuation characters. Ispell.el must match that by using
> the speller's notion of a word, which must be independent of the
> current major mode's idea of what a word is. This is where these
> character sets come into play, and I really cannot see how can
> ispell.el work well without using them as it does now.

Currently, using casechars = [[:graph:]], if I put point over part of
the string " (XP) ", and run M-x ispell-word, it says "(XP) is correct".
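
To make the difference between the character classes concrete, here is a
rough sketch in Python. It is not ispell.el's code: word_at, is_alpha and
is_graph are names made up for illustration, the two predicates only
approximate the regexp classes [[:alpha:]] and [[:graph:]], and otherchars
are treated exactly like casechars, which is simpler than what ispell.el
does.

```python
def is_alpha(ch):
    # Approximation of [[:alpha:]]: letters only.
    return ch.isalpha()


def is_graph(ch):
    # Approximation of [[:graph:]]: any visible, non-whitespace character.
    return ch.isprintable() and not ch.isspace()


def word_at(text, pos, casechars, otherchars=""):
    """Return the run of word characters around index pos, or None.

    casechars is a predicate on single characters; otherchars is a string
    of extra characters accepted inside words (simplified: they are
    accepted anywhere in the run, not only between casechars).
    """
    def is_word_char(ch):
        return casechars(ch) or ch in otherchars

    if not 0 <= pos < len(text) or not is_word_char(text[pos]):
        return None
    start = pos
    while start > 0 and is_word_char(text[start - 1]):
        start -= 1
    end = pos + 1
    while end < len(text) and is_word_char(text[end]):
        end += 1
    return text[start:end]


if __name__ == "__main__":
    print(word_at(" (XP) ", 2, is_graph))
    print(word_at(" (XP) ", 2, is_alpha))
    print(word_at("SC³D", 0, is_alpha))
    print(word_at("SC³D", 0, is_graph))
    print(word_at("don't stop", 1, is_alpha, "'"))
```

With the graph-like predicate the whole of "(XP)" is what would be sent
to the speller; with the alpha-like one only "XP" would be, and a word
such as SC³D would be cut at the superscript.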
That's good enough for me! Note that merely using the characters
declared in the dictionary may not be enough: I have words like SC³D (I
spell my company that way) in my personal word lists. Other users might
be more imaginative, and for example have sequences of emoji. The list
of characters in the dictionary is only a minimum.

> So we do need this information. If Enchant doesn't provide it, we
> could still use the same technique as with Aspell and Hunspell,
> provided that we can figure out which back end(s) is/are used by
> Enchant. Is that doable?

Yes, that can be done, but it's fragile; that's why I'm trying to avoid
it.

> Ispell.el also supports spell-checking by words, in which case the
> above is not useful, because we need to figure out what is a word.

See above. It's not clear to me that we need a very precise idea of
what constitutes a word.

> Moreover, even when we send entire lines to the speller, we want to
> skip lines that include only non-word characters.

Why?

> Just look at the callers of the above-mentioned accessor functions,
> and you will see how we use them.

I have read this code. I see how we use them; it's just not clear to me
that it's necessary to use them thus.

> Hunspell is the most modern and sophisticated speller, we certainly
> don't want to degrade it.

No chance of that, this patch is only about Enchant.

> Also, Aspell uses the dictionaries at least for some of this info,
> see the function I pointed to above.

Only for otherchars, not casechars/not-casechars.

> Bottom line, this information cannot be thrown away or ignored. It is
> important for correctly interfacing with a dictionary and for doing
> TRT as the users expect. Any modern speller program would benefit
> from it, and therefore we should strive to provide such information to
> ispell.el whenever we possibly can.

It is not a question of throwing away or ignoring information: the
information is simply not available through documented channels (at
least for Enchant). Yes, one can find the underlying engine and then
use that information to (try to) find the dictionaries, but one is then
making a number of brittle assumptions. And it's not clear that the
information is actually necessary to have.

It would be helpful if you could show a situation in which using
[:graph:] for Enchant dictionaries actually misbehaves in some way.

In fact, Enchant's source code shows that it uses a fixed set of Unicode
classes for its own internal equivalent of casechars. Using that would
make sense (for Enchant! again, I'm not suggesting changing how we use
hunspell).

One other data point: a senior LyX maintainer, Jean-Marc Lasgouttes,
agrees with you:

https://github.com/AbiWord/enchant/issues/17#issuecomment-267924304

He says that LyX has a "bug open somewhere" that suggests using this
information (but he didn't know it was available!).

-- 
http://rrt.sc3d.org