From: Eli Zaretskii
Newsgroups: gmane.emacs.bugs
Subject: bug#17742: Acknowledgement (Support for enchant?)
Date: Mon, 19 Dec 2016 18:01:27 +0200
Message-ID: <838trb6h7s.fsf@gnu.org>
References: <834m2hjbmr.fsf@gnu.org> <83bmwfbxaf.fsf@gnu.org> <837f73bqwv.fsf@gnu.org>
To: Reuben Thomas
Cc: 17742@debbugs.gnu.org
In-reply-to: (message from Reuben Thomas on Sun, 18 Dec 2016 23:39:54 +0000)

> From: Reuben Thomas
> Date: Sun, 18 Dec 2016 23:39:54 +0000
> Cc: 17742@debbugs.gnu.org
>
> I have not had any response to my enquiries yet, but I did some
> research, and neither GNU Aspell nor hunspell offer any way to get
> this information (about character classes of dictionaries) via their
> APIs.

They provide this information in the dictionaries, and we glean it
from there.  See ispell-parse-hunspell-affix-file and
ispell-aspell-find-dictionary.

> This suggests that they do not see a need for it. Perhaps it is worth
> confirming whether Emacs really needs this information?
>
> As far as I can see, it is used only in flyspell-word, for per-word
> spell-checking. (The only caller outside flyspell.el is erc, which has
> a FIXME saying not to call flyspell-word.) As far as I can see, the
> code assumes that words are a convenient unit to check and cache,
> though there's no definite requirement for that: in particular, the
> spelling checkers will say what words are incorrectly spelled and
> where they are without having to be given precisely the word. I guess
> that other editors and word processors work this way.

Maybe there's a misunderstanding: I'm talking about the CASECHARS,
NOT-CASECHARS, and OTHERCHARS parts of the dictionary data in
ispell-dictionary-alist.  These are definitely used in ispell.el, via
the corresponding accessor functions ispell-get-casechars,
ispell-get-not-casechars, and ispell-get-otherchars, which see.

Each dictionary can (and many do) use some of the punctuation
characters in the words it can handle.
A notable example is the apostrophe ' in English, used for the various
suffixes that spellers support; similar features exist in other
languages, but with possibly different punctuation characters.
Ispell.el must match that by using the speller's notion of a word,
which must be independent of the current major mode's idea of what a
word is.  This is where these character sets come into play, and I
really cannot see how ispell.el can work well without using them as it
does now.

So we do need this information.  If Enchant doesn't provide it, we
could still use the same technique as with Aspell and Hunspell,
provided that we can figure out which back end(s) is/are used by
Enchant.  Is that doable?

> For example, aspell.h contains the following notice about
> aspell_document_checker_process:
>
>  * The string passed in should only be split on
>  * white space characters.

Ispell.el also supports spell-checking by words, in which case the
above is not useful, because we need to figure out what is a word.
Moreover, even when we send entire lines to the speller, we want to
skip lines that include only non-word characters.  Just look at the
callers of the above-mentioned accessor functions, and you will see
how we use them.

> Basic tests using [[:alpha:]] for casechars and [^[:alpha:]] for
> not-casechars seem to work OK.

For which language and dictionary?  This will definitely do the wrong
thing for the Hunspell he_IL dictionary I have here, which says:

  WORDCHARS אבגדהוזחטיכלמנסעפצקרשתםןךףץ'"

That is, it wants ' and " to be treated as word-constituent
characters.  As another example, I can envision a dictionary of
acronyms and abbreviations, which might want to treat the period as a
word-constituent character, to support the likes of "a.k.a.".
Etc. etc. -- this is up to the dictionary to decide, and Emacs must
follow suit.

Also, please note that [:alpha:] in Emacs 25 means a much larger set
of characters than in previous versions, see NEWS.
It will in general catch strings of characters that cannot possibly be
TRT for a single-language dictionary.  E.g.,

  (string-match "[[:alpha:]]+" "aβגд") => 0

> I meant [[:graph:]] and [^[:graph:]].

This will match an even larger set in Emacs 25; I don't think we will
ever want that for spell-checking.

> Also, as I realised while preparing the patch for bug#25230, it is
> only hunspell that has special information about character classes.
> All the others just use [:alpha:]. So if it's good enough for ispell
> and aspell, can't it be good enough for enchant? (It just means that
> for now "direct Hunspell" is arguably better than "Hunspell via
> Enchant".)

Hunspell is the most modern and sophisticated speller; we certainly
don't want to degrade it.  Also, Aspell uses the dictionaries at least
for some of this info, see the function I pointed to above.  Once
again, if Enchant uses a back-end for which we know how to find this
information, we should do so.

Bottom line, this information cannot be thrown away or ignored.  It is
important for correctly interfacing with a dictionary and for doing
TRT as the users expect.  Any modern speller program would benefit
from it, and therefore we should strive to provide such information to
ispell.el whenever we possibly can.

Thanks.
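
To make the point about dictionary-specific word characters concrete,
here is a minimal sketch in Python, not the actual ispell.el code.
The helper word_regexp and its two arguments are hypothetical; they
only loosely mirror the CASECHARS and OTHERCHARS fields, and Python's
notion of a letter is not the same as Emacs's [:alpha:].

```python
import re


def word_regexp(casechars: str, otherchars: str = "") -> re.Pattern:
    """Build a regexp for one word as a (hypothetical) dictionary defines it.

    casechars  -- characters that always belong to a word
    otherchars -- characters accepted only between two casechars
    """
    case = "[" + "".join(re.escape(c) for c in casechars) + "]"
    if not otherchars:
        return re.compile(case + "+")
    other = "[" + "".join(re.escape(c) for c in otherchars) + "]"
    # A word is a run of casechars, optionally joined by single otherchars.
    return re.compile(f"{case}+(?:{other}{case}+)*")


ASCII_LETTERS = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

# Letters only, ignoring what the dictionary says about punctuation.
plain = word_regexp(ASCII_LETTERS)
# Letters plus the apostrophe that an English dictionary would declare.
english = word_regexp(ASCII_LETTERS, "'")

text = "It doesn't work"
plain_words = plain.findall(text)
english_words = english.findall(text)

# A generic "any Unicode letter" class (Python's, not Emacs's) accepts
# a string that mixes Latin, Greek, Hebrew and Cyrillic as one word.
any_letter = re.compile(r"[^\W\d_]+")
mixed = any_letter.match("aβגд").group()

print(plain_words)
print(english_words)
print(mixed)
```

With the apostrophe taken from the dictionary, "doesn't" stays one
word; with letters only, the speller would be handed "doesn" and "t"
separately.  The last pattern shows the opposite failure: a generic
letter class accepts a mixed-script string that no single-language
dictionary would treat as a word.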