* ispell.el and multilanguage lines.
From: Nikolay Kudryavtsev @ 2015-12-15 23:14 UTC
To: help-gnu-emacs@gnu.org

Here's a problem I've found with ispell.el. As a bilingual person and an
avid org-mode user, I often have lines in multiple languages. The fact
that the Russian language has a different alphabet is great for the
purpose of spell checking, since I can check English first, then switch
dictionary and check Russian. The spell checker is supposed to ignore
words not in the dictionary alphabet. ispell-dictionary-alist has all
the facilities needed for this, and it should work in theory.

Here's where it fails in practice: ispell.el uses a function called
ispell-region, which sends the buffer line by line. Whether a line gets
sent is decided by ispell-get-line, where the criterion is
(re-search-forward ispell-casechars end t) - that is, if anything in the
line matches the alphabet of the current dictionary, the whole line gets
sent. And of course the words in that line that are in a different
language get considered misspelled by the spell checker.

I checked the bug reports, but this does not seem to have been reported
yet. I'll submit it as a bug later.

There's a possible workaround if you're using hunspell: use multiple
dictionaries at the same time, like this:

    ("ru_RU,en_US" "[[:alpha:]]" "[^[:alpha:]]" "['0-9]" t
     ("-d" "ru_RU,en_US") nil koi8-r)

But this creates a mess in user dictionaries.

--
Best Regards,
Nikolay Kudryavtsev
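[Editorial sketch] The difference between line-level and word-level filtering described in the message above can be modelled in a few lines of Python. This is only a simplified illustration, not the code of ispell.el: the two character classes and the function names `lines_sent` and `words_in_alphabet` are hypothetical stand-ins for the CASECHARS entries and for the behaviour the poster describes and the behaviour he expected.

```python
import re

# Hypothetical stand-ins for the CASECHARS entries of two dictionaries.
ENGLISH_CASECHARS = "[A-Za-z]"
RUSSIAN_CASECHARS = "[А-Яа-яЁё]"


def lines_sent(lines, casechars):
    """Model of the described behaviour: a line is sent whole as soon as
    it contains at least one character matching casechars."""
    pattern = re.compile(casechars)
    return [line for line in lines if pattern.search(line)]


def words_in_alphabet(line, casechars):
    """Model of the expected behaviour: keep only the words that consist
    entirely of characters matching casechars."""
    word = re.compile(f"(?:{casechars})+")
    return [w for w in re.findall(r"\w+", line) if word.fullmatch(w)]


lines = ["TODO купить молоко", "только русский текст"]

# Line-level test: the mixed line is sent in full, Russian words included.
print(lines_sent(lines, ENGLISH_CASECHARS))

# Word-level test: each dictionary only ever sees its own alphabet.
print(words_in_alphabet(lines[0], ENGLISH_CASECHARS))
print(words_in_alphabet(lines[0], RUSSIAN_CASECHARS))
```

With the line-level test, the Russian words on the first line reach the English dictionary and would be flagged; with the word-level test they are never sent to it.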
* Re: ispell.el and multilanguage lines.
From: Eli Zaretskii @ 2015-12-16 3:37 UTC
To: help-gnu-emacs

> From: Nikolay Kudryavtsev <nikolay.kudryavtsev@gmail.com>
> Date: Wed, 16 Dec 2015 02:14:58 +0300
>
> Here's a problem I've found with ispell.el. As a bilingual person and
> an avid org-mode user, I often have lines in multiple languages. The
> fact that the Russian language has a different alphabet is great for
> the purpose of spell checking, since I can check English first, then
> switch dictionary and check Russian. The spell checker is supposed to
> ignore words not in the dictionary alphabet. ispell-dictionary-alist
> has all the facilities needed for this, and it should work in theory.
>
> Here's where it fails in practice: ispell.el uses a function called
> ispell-region, which sends the buffer line by line. Whether a line
> gets sent is decided by ispell-get-line, where the criterion is
> (re-search-forward ispell-casechars end t) - that is, if anything in
> the line matches the alphabet of the current dictionary, the whole
> line gets sent. And of course the words in that line that are in a
> different language get considered misspelled by the spell checker.

Emacs doesn't yet have a concept of the language of the text,
certainly not when several languages are mixed in a buffer.

> There's a possible workaround if you're using hunspell: use multiple
> dictionaries at the same time, like this:
>
>     ("ru_RU,en_US" "[[:alpha:]]" "[^[:alpha:]]" "['0-9]" t
>      ("-d" "ru_RU,en_US") nil koi8-r)

Yes, and the latest code base (of what will become Emacs 25.1)
supports this feature of Hunspell.

> But this creates a mess in user dictionaries.

I believe that "mess" is fixed in the current code. So I suggest you
try the latest emacs-25 branch.
* Re: ispell.el and multilanguage lines.
From: Nikolay Kudryavtsev @ 2016-03-07 1:18 UTC
To: help-gnu-emacs

Sorry for the very late reply. I was somewhat busy, plus this stuff
required fairly extensive extra checking.

> Emacs doesn't yet have a concept of the language of the text,
> certainly not when several languages are mixed in a buffer.

ispell.el has the CASECHARS and NOT-CASECHARS regular expressions in
ispell-dictionary-alist. These should be enough for the purpose of
differentiating Russian from English, if only they worked. But in
practice those regexps are not used correctly, since ispell-get-line
currently sends the whole line when re-search-forward finds CASECHARS
within it. While I'd agree that the number of use cases where CASECHARS
and NOT-CASECHARS are useful is rather limited, fixing them seems easier
than removing them.

> Yes, and the latest code base (of what will become Emacs 25.1)
> supports this feature of Hunspell.

I've tried that version. Sure, it fixes this bug
<https://debbugs.gnu.org/cgi/bugreport.cgi?bug=20495>, but I actually
never had a problem with it in the first place, since you can avoid it
in 24.5 by setting ispell-dictionary-alist manually after
ispell-set-spellchecker-params has run.

> I believe that "mess" is fixed in the current code. So I suggest you
> try the latest emacs-25 branch.

There's a really confusing bug in the way hunspell behaves with Russian
codepages on Windows, which is not really Emacs related. FYI: I'm using
the current hunspell from ezwinports. Let's start hunspell in Windows
cmd:

    chcp 1251
    hunspell -a "" -d ru_RU -i cp1251
    тестовоеслово
    testword

Both of the above result in spellchecking failures. Let's add them to
the personal dictionary:

    *тестовоеслово
    *testword
    #

Now, if you exit and start hunspell again, both words are considered
correct, since they are in your personal dictionary.

Here's where the bug comes into play. If you start some kind of bash,
be it cygwin bash or msys, and try using hunspell there:

    hunspell -a "" -d ru_RU -i utf-8

testword checks OK, but тестовоеслово fails. The same happens within
Emacs. My best guess is that this happens because of some
locale-related environment variable that bash (and Emacs) sets.

So, in the end, I'm stuck with aspell and running one spellcheck per
dictionary.

--
Best Regards,
Nikolay Kudryavtsev
* Re: ispell.el and multilanguage lines.
From: Eli Zaretskii @ 2016-03-07 16:38 UTC
To: help-gnu-emacs

> From: Nikolay Kudryavtsev <nikolay.kudryavtsev@gmail.com>
> Date: Mon, 7 Mar 2016 04:18:20 +0300
>
> There's a really confusing bug in the way hunspell behaves with
> Russian codepages on Windows, which is not really Emacs related. FYI:
> I'm using the current hunspell from ezwinports. Let's start hunspell
> in Windows cmd:
>
>     chcp 1251
>     hunspell -a "" -d ru_RU -i cp1251
>     тестовоеслово
>     testword
>
> Both of the above result in spellchecking failures. Let's add them to
> the personal dictionary:
>
>     *тестовоеслово
>     *testword
>     #
>
> Now, if you exit and start hunspell again, both words are considered
> correct, since they are in your personal dictionary.
>
> Here's where the bug comes into play. If you start some kind of bash,
> be it cygwin bash or msys, and try using hunspell there:
>
>     hunspell -a "" -d ru_RU -i utf-8
>
> testword checks OK, but тестовоеслово fails. The same happens within
> Emacs. My best guess is that this happens because of some
> locale-related environment variable that bash (and Emacs) sets.

Why don't you use utf-8 instead of cp1251?
* Re: ispell.el and multilanguage lines.
From: Nikolay Kudryavtsev @ 2016-03-07 21:10 UTC
To: help-gnu-emacs

UTF-8 does not work for Cyrillic in cmd.exe by default. Hunspell first
signals a conversion error, then exits the moment you enter any input:

    C:\>chcp 65001
    Active code page: 65001

    C:\>hunspell -a "" -d ru_RU -i utf-8
    error - iconv_open: KOI8-R -> CP65001
    error - iconv_open: KOI8-R -> CP65001
    @(#) International Ispell Version 3.2.06 (but really Hunspell 1.3.2)
    тестовоеслово

    C:\>hunspell -a "" -d ru_RU -i utf-8
    error - iconv_open: KOI8-R -> CP65001
    error - iconv_open: KOI8-R -> CP65001
    @(#) International Ispell Version 3.2.06 (but really Hunspell 1.3.2)
    testword
    *

But the problem is not with utf-8 itself - it works in general from
bash and Emacs. Only the Russian words that were entered into the
personal dictionary fail to be read correctly. You can start hunspell
with utf-8 in bash and add a Russian and an English word to the
dictionary. Exit hunspell and start it again in bash: the English word
passes, the Russian one fails. Start it from cmd with code page 1251:
both words pass.

--
Best Regards,
Nikolay Kudryavtsev
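[Editorial sketch] The three encodings that appear in this exchange (cp1251 for the console, KOI8-R in the iconv error, utf-8 from bash and Emacs) all agree on ASCII but not on Cyrillic. The Python snippet below only illustrates that general fact, using the two test words from the thread; it is not a diagnosis of what hunspell actually does with the personal dictionary, and the variable names are chosen for illustration.

```python
# The same two test words as in the thread.
russian = "тестовоеслово"
english = "testword"

codecs_used = ("cp1251", "koi8_r", "utf-8")

# Encode each word under each of the three encodings.
russian_bytes = {name: russian.encode(name) for name in codecs_used}
english_bytes = {name: english.encode(name) for name in codecs_used}

# The ASCII word is byte-for-byte identical under all three encodings.
print(len(set(english_bytes.values())))

# The Cyrillic word has a different byte sequence under each encoding.
print(len(set(russian_bytes.values())))

# Bytes written as cp1251 and then read as KOI8-R decode without an
# error, but to a different string, so a lookup for the word would miss.
misread = russian_bytes["cp1251"].decode("koi8_r")
print(misread == russian)
```

This is at least consistent with the observation that only the Russian entry is affected while the English one passes everywhere, whatever the actual cause inside hunspell turns out to be.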
* Re: ispell.el and multilanguage lines.
From: Eli Zaretskii @ 2016-03-07 21:17 UTC
To: help-gnu-emacs

> From: Nikolay Kudryavtsev <nikolay.kudryavtsev@gmail.com>
> Date: Tue, 8 Mar 2016 00:10:38 +0300
>
> UTF-8 does not work for Cyrillic in cmd.exe by default.

Who said anything about cmd? I meant to use utf-8 when invoking
Hunspell from ispell.el. That doesn't require anything from cmd.exe.

> But the problem is not with utf-8 itself - it works in general from
> bash and Emacs. Only the Russian words that were entered into the
> personal dictionary fail to be read correctly.

I think you shouldn't enter words in different languages into the same
personal dictionary. Having them encoded differently just adds insult
to injury.
* Re: ispell.el and multilanguage lines.
From: Nikolay Kudryavtsev @ 2016-03-07 22:03 UTC
To: help-gnu-emacs

> Who said anything about cmd? I meant to use utf-8 when invoking
> Hunspell from ispell.el. That doesn't require anything from cmd.exe.

Emacs uses utf-8 by default, and I didn't change it. It works in
general, just not for the Russian personal dictionary.

> I think you shouldn't enter words in different languages into the same
> personal dictionary.

I agree, but that's exactly what hunspell does when run with multiple
-d dictionaries. All words get inserted into the dictionary for the
first language specified.

--
Best Regards,
Nikolay Kudryavtsev