* ispell.el and multilanguage lines.
From: Nikolay Kudryavtsev @ 2015-12-15 23:14 UTC (permalink / raw)
To: help-gnu-emacs@gnu.org
Here's the problem I've found with ispell.el. As a bilingual person and
an avid org-mode user, I often have lines in multiple languages. The fact
that the Russian language has a different alphabet is great for the
purpose of spell checking, since I can check English first, then switch
dictionaries and check Russian. The spell checker is supposed to ignore
words not in the dictionary's alphabet. ispell-dictionary-alist has all
the facilities needed for this, and it should work in theory.
Here's where it fails in practice: ispell.el uses a function called
ispell-region, which sends the buffer line by line. Whether a line gets
sent is decided by ispell-get-line, where the criterion is
(re-search-forward ispell-casechars end t). That is, if anything in the
line matches the alphabet of the current dictionary, the whole line gets
sent. And of course the words in that line that are in a different
language get flagged as misspelled by the spell checker.
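A small Python model of the behaviour described above may help. It is
not ispell.el's code: the two CASECHARS-style patterns and the function
names are assumptions made up for illustration. It shows that a single
matching character is enough for the whole line, foreign words included,
to reach the checker.

```python
import re

# Hypothetical CASECHARS-style patterns, one per dictionary (assumed here,
# not copied from ispell-dictionary-alist).
CASECHARS = {
    "english": r"[A-Za-z]",
    "russian": r"[А-Яа-яЁё]",
}


def line_is_sent(line: str, dictionary: str) -> bool:
    """Model of the line-level test: one matching character is enough."""
    return re.search(CASECHARS[dictionary], line) is not None


def words_seen_by_checker(line: str, dictionary: str) -> list:
    """If the line is sent at all, every word in it reaches the checker."""
    if not line_is_sent(line, dictionary):
        return []
    # [^\W\d_]+ matches runs of letters in any script.
    return re.findall(r"[^\W\d_]+", line)


mixed = "Emacs это редактор"
print(words_seen_by_checker(mixed, "english"))
# prints ['Emacs', 'это', 'редактор']
```

With the English dictionary selected, the two Russian words are handed
to the checker along with "Emacs", which is where the false misspellings
come from.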
I tried checking in bug reports, but this does not seem to have been
reported yet. I'll submit it as a bug later.
There's a possible workaround if you're using hunspell, by using
multiple dictionaries at the same time, like this:
("ru_RU,en_US" "[[:alpha:]]" "[^[:alpha:]]" "['0-9]" t ("-d"
"ru_RU,en_US") nil koi8-r)
But this creates a mess in user dictionaries.
--
Best Regards,
Nikolay Kudryavtsev
* Re: ispell.el and multilanguage lines.
From: Eli Zaretskii @ 2015-12-16 3:37 UTC (permalink / raw)
To: help-gnu-emacs
> From: Nikolay Kudryavtsev <nikolay.kudryavtsev@gmail.com>
> Date: Wed, 16 Dec 2015 02:14:58 +0300
>
> Here's the problem I've found with ispell.el. As a bilingual person and
> an avid org-mode user, I often have lines in multiple languages. The fact
> that the Russian language has a different alphabet is great for the
> purpose of spell checking, since I can check English first, then switch
> dictionaries and check Russian. The spell checker is supposed to ignore
> words not in the dictionary's alphabet. ispell-dictionary-alist has all
> the facilities needed for this, and it should work in theory.
> Here's where it fails in practice: ispell.el uses a function called
> ispell-region, which sends the buffer line by line. Whether a line gets
> sent is decided by ispell-get-line, where the criterion is
> (re-search-forward ispell-casechars end t). That is, if anything in the
> line matches the alphabet of the current dictionary, the whole line gets
> sent. And of course the words in that line that are in a different
> language get flagged as misspelled by the spell checker.
Emacs doesn't yet have a concept of the language of the text,
certainly not when several languages are mixed in a buffer.
> There's a possible workaround if you're using hunspell, by using
> multiple dictionaries at the same time, like this:
> ("ru_RU,en_US" "[[:alpha:]]" "[^[:alpha:]]" "['0-9]" t ("-d"
> "ru_RU,en_US") nil koi8-r)
Yes, and the latest code base (of what will become Emacs 25.1)
supports this feature of Hunspell.
> But this creates a mess in user dictionaries.
I believe that "mess" is fixed in the current code. So I suggest you
try the latest emacs-25 branch.
* Re: ispell.el and multilanguage lines.
From: Nikolay Kudryavtsev @ 2016-03-07 1:18 UTC (permalink / raw)
To: help-gnu-emacs
Sorry for the very late reply. I was somewhat busy, and this required
fairly extensive extra checking.
> Emacs doesn't yet have a concept of the language of the text,
> certainly not when several languages are mixed in a buffer.
Ispell.el has CASECHARS and NOT-CASECHARS regular expressions in
ispell-dictionary-alist. This should be enough to differentiate Russian
from English, if only it worked. But in practice those regexps are not
used correctly, since ispell-get-line currently sends the whole line
when re-search-forward finds CASECHARS within it. While I'd agree that
the number of use cases where CASECHARS and NOT-CASECHARS are useful is
rather limited, fixing them seems easier than removing them.
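One conceivable shape for such a fix, sketched in Python rather than as
a patch to ispell.el: blank out the words that are not written entirely
in the current dictionary's alphabet before the line is sent. The
function name and the letter-run pattern are hypothetical, and real
CASECHARS/OTHERCHARS handling (apostrophes, digits) is ignored here.

```python
import re


def mask_foreign_words(line: str, casechars: str) -> str:
    """Blank out words not spelled entirely in the CASECHARS alphabet.

    Masked words become runs of spaces of the same length, so character
    offsets in the line stay valid for mapping results back to the buffer.
    """
    own_word = re.compile(rf"(?:{casechars})+")

    def keep_or_blank(match):
        word = match.group(0)
        return word if own_word.fullmatch(word) else " " * len(word)

    # [^\W\d_]+ matches runs of letters in any script.
    return re.sub(r"[^\W\d_]+", keep_or_blank, line)


line = "Emacs это редактор"
print(repr(mask_foreign_words(line, r"[A-Za-z]")))
print(repr(mask_foreign_words(line, r"[А-Яа-яЁё]")))
```

Keeping the line length unchanged is the point of padding with spaces:
the checker reports misspellings by position, so the positions of the
remaining words must not shift.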
> Yes, and the latest code base (of what will become Emacs 25.1)
> supports this feature of Hunspell.
I've tried that version. Sure, it fixes this bug
<https://debbugs.gnu.org/cgi/bugreport.cgi?bug=20495>, but I never
actually had a problem with it in the first place, since you can avoid
it in 24.5 by setting ispell-dictionary-alist manually after
ispell-set-spellchecker-params has run.
> I believe that "mess" is fixed in the current code. So I suggest you
> try the latest emacs-25 branch.
There's a really confusing bug in the way hunspell behaves with
Russian code pages on Windows that is not really Emacs-related. FYI:
I'm using the current hunspell from ezwinports. Let's start hunspell in
Windows cmd:
chcp 1251
hunspell -a "" -d ru_RU -i cp1251
тестовоеслово
testword
Both of the above result in spell-checking failures. Let's add them
to the personal dictionary:
*тестовоеслово
*testword
#
Now, if you exit and start hunspell again, both words are considered
correct, since they are in your personal dictionary.
Here's where the bug comes into play. If you start some kind of bash, be
it Cygwin bash or MSYS, and try using hunspell there:
hunspell -a "" -d ru_RU -i utf-8
testword checks OK, but тестовоеслово fails. The same happens within
Emacs. My best guess is that this is caused by some locale-related
environment variable that bash (and Emacs) sets.
So, in the end, I'm stuck with aspell and running one spell check per
dictionary.
--
Best Regards,
Nikolay Kudryavtsev
* Re: ispell.el and multilanguage lines.
From: Eli Zaretskii @ 2016-03-07 16:38 UTC (permalink / raw)
To: help-gnu-emacs
> From: Nikolay Kudryavtsev <nikolay.kudryavtsev@gmail.com>
> Date: Mon, 7 Mar 2016 04:18:20 +0300
>
> There's a really confusing bug in the way hunspell behaves with
> Russian code pages on Windows that is not really Emacs-related. FYI:
> I'm using the current hunspell from ezwinports. Let's start hunspell in
> Windows cmd:
> chcp 1251
> hunspell -a "" -d ru_RU -i cp1251
> тестовоеслово
> testword
> Both of the above result in spell-checking failures. Let's add them
> to the personal dictionary:
> *тестовоеслово
> *testword
> #
> Now, if you exit and start hunspell again, both words are considered
> correct, since they are in your personal dictionary.
>
> Here's where the bug comes into play. If you start some kind of bash, be
> it Cygwin bash or MSYS, and try using hunspell there:
> hunspell -a "" -d ru_RU -i utf-8
> testword checks OK, but тестовоеслово fails. The same happens within
> Emacs. My best guess is that this is caused by some locale-related
> environment variable that bash (and Emacs) sets.
Why don't you use utf-8 instead of cp1251?
* Re: ispell.el and multilanguage lines.
From: Nikolay Kudryavtsev @ 2016-03-07 21:10 UTC (permalink / raw)
To: help-gnu-emacs
UTF-8 does not work for Cyrillic in cmd.exe by default. Hunspell first
signals a conversion error, then exits the moment you enter any input:
C:\>chcp 65001
Active code page: 65001
C:\>hunspell -a "" -d ru_RU -i utf-8
error - iconv_open: KOI8-R -> CP65001
error - iconv_open: KOI8-R -> CP65001
@(#) International Ispell Version 3.2.06 (but really Hunspell 1.3.2)
тестовоеслово
C:\>hunspell -a "" -d ru_RU -i utf-8
error - iconv_open: KOI8-R -> CP65001
error - iconv_open: KOI8-R -> CP65001
@(#) International Ispell Version 3.2.06 (but really Hunspell 1.3.2)
testword
*
But the problem is not with UTF-8 itself; it works in general from bash
and Emacs. It's only the Russian words that were entered into the
personal dictionary that fail to be read correctly.
You can start hunspell with UTF-8 in bash and add a Russian and an
English word to the dictionary. Exit hunspell and start it again in
bash: the English word passes, the Russian one fails. Start it in a
cp1251 cmd: both words pass.
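What actually goes wrong inside hunspell is not established here. The
Python sketch below only illustrates why an ASCII word like testword
cannot be affected by any mix-up between the three encodings that appear
in this thread (cp1251, KOI8-R, UTF-8), while a Cyrillic word can. The
helper name is made up for illustration.

```python
def encoded_forms(word: str) -> dict:
    """Return the bytes of WORD under the three encodings seen in the thread."""
    return {name: word.encode(name) for name in ("cp1251", "koi8_r", "utf-8")}


for word in ("testword", "тестовоеслово"):
    forms = encoded_forms(word)
    # One distinct byte string means every encoding agrees on this word.
    same = len(set(forms.values())) == 1
    print(word, "has identical bytes in all three encodings:", same)
```

ASCII letters occupy the same byte values in all three encodings, so a
word stored under one and read back under another still matches. The
Cyrillic word has three different byte sequences, so any disagreement
about the encoding of the personal dictionary would only show up for it.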
--
Best Regards,
Nikolay Kudryavtsev
* Re: ispell.el and multilanguage lines.
From: Eli Zaretskii @ 2016-03-07 21:17 UTC (permalink / raw)
To: help-gnu-emacs
> From: Nikolay Kudryavtsev <nikolay.kudryavtsev@gmail.com>
> Date: Tue, 8 Mar 2016 00:10:38 +0300
>
> UTF-8 does not work for Cyrillic in cmd.exe by default.
Who said anything about cmd? I meant to use utf-8 when invoking
Hunspell from ispell.el. That doesn't require anything from cmd.exe.
> But the problem is not with UTF-8 itself; it works in general from bash
> and Emacs. It's only the Russian words that were entered into the
> personal dictionary that fail to be read correctly.
I think you shouldn't enter words in different languages into the same
personal dictionary. Having them encoded differently just adds insult
to injury.
* Re: ispell.el and multilanguage lines.
From: Nikolay Kudryavtsev @ 2016-03-07 22:03 UTC (permalink / raw)
To: help-gnu-emacs
> Who said anything about cmd? I meant to use utf-8 when invoking
> Hunspell from ispell.el. That doesn't require anything from cmd.exe.
Emacs uses UTF-8 by default, and I didn't change it. It works in
general, just not for the Russian personal dictionary.
> I think you shouldn't enter words in different languages into the same
> personal dictionary.
I agree, but that's exactly what hunspell does when run with multiple
-d dictionaries. All words get inserted into the personal dictionary
for the first language specified.
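If the words do end up mixed, one way to untangle them afterwards is to
partition the list by script. The Python helper below is hypothetical
and not a hunspell feature; it works on a plain list of words and
assumes nothing about where hunspell keeps its personal dictionary or
how that file is laid out.

```python
import re

# Any Cyrillic letter marks the word as belonging to the Russian list.
CYRILLIC = re.compile(r"[А-Яа-яЁё]")


def split_by_script(words):
    """Partition WORDS into (cyrillic, other) lists, preserving order."""
    cyrillic, other = [], []
    for word in words:
        (cyrillic if CYRILLIC.search(word) else other).append(word)
    return cyrillic, other


russian, english = split_by_script(["testword", "тестовоеслово", "Emacs"])
print(russian)
print(english)
```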
--
Best Regards,
Nikolay Kudryavtsev