ispell.el and multilanguage lines.

unofficial mirror of help-gnu-emacs@gnu.org

* ispell.el and multilanguage lines.
From: Nikolay Kudryavtsev @ 2015-12-15 23:14 UTC
  To: help-gnu-emacs@gnu.org

Here's the problem I've found with ispell.el. As a bilingual person and
an avid org-mode user, I often have lines in multiple languages. The
fact that the Russian language has a different alphabet is convenient
for spell checking, since I can check English first, then switch
dictionaries and check Russian. The spell checker is supposed to ignore
words not in the dictionary's alphabet. ispell-dictionary-alist has all
the facilities needed for this, and it should work in theory.
Here's where it fails in practice: ispell.el uses a function called
ispell-region, which sends the buffer line by line. Whether a line gets
sent is decided by ispell-get-line, where the criterion is
(re-search-forward ispell-casechars end t); that is, if anything in the
line matches the alphabet of the current dictionary, the whole line gets
sent. And of course words in that line that are in a different language
are considered misspelled by the spell checker.
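
To make the difference between a line-level and a word-level check
concrete, here is a minimal Emacs Lisp sketch. The helper name
my/words-in-alphabet is hypothetical and is not part of ispell.el; it
only illustrates the kind of filtering described above, under the
assumption that CASECHARS is a one-character regexp such as "[A-Za-z]".

```elisp
(require 'seq)

;; Hypothetical helper for illustration; it is not part of ispell.el.
(defun my/words-in-alphabet (line casechars)
  "Return the words of LINE that consist only of CASECHARS characters.
CASECHARS is a regexp matching one character, e.g. \"[A-Za-z]\"."
  (let ((case-fold-search nil)
        (whole-word (concat "\\`\\(?:" casechars "\\)+\\'")))
    (seq-filter (lambda (word) (string-match-p whole-word word))
                ;; Split on anything that is not a letter in any script.
                (split-string line "[^[:alpha:]]+" t))))

;; Line-level test, analogous to the re-search-forward check:
;; a single Latin letter is enough for the whole line to qualify.
(string-match-p "[A-Za-z]" "Emacs это editor")

;; Word-level test: only the Latin-script words remain.
(my/words-in-alphabet "Emacs это editor" "[A-Za-z]")
```

The first form returns non-nil for the mixed line, so the whole line
would be sent; the second keeps only the words written in the current
dictionary's alphabet.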

I tried checking in bug reports, but this does not seem to have been 
reported yet. I'll submit it as a bug later.

There's a possible workaround if you're using hunspell, by using 
multiple dictionaries at the same time, like this:
("ru_RU,en_US" "[[:alpha:]]" "[^[:alpha:]]" "['0-9]" t ("-d" 
"ru_RU,en_US") nil koi8-r)
But this creates a mess in user dictionaries.

-- 
Best Regards,
Nikolay Kudryavtsev

* Re: ispell.el and multilanguage lines.
From: Eli Zaretskii @ 2015-12-16  3:37 UTC
  To: help-gnu-emacs

> From: Nikolay Kudryavtsev <nikolay.kudryavtsev@gmail.com>
> Date: Wed, 16 Dec 2015 02:14:58 +0300
> 
> Here's the problem I've found with ispell.el. As a bilingual person and
> an avid org-mode user, I often have lines in multiple languages. The
> fact that the Russian language has a different alphabet is convenient
> for spell checking, since I can check English first, then switch
> dictionaries and check Russian. The spell checker is supposed to ignore
> words not in the dictionary's alphabet. ispell-dictionary-alist has all
> the facilities needed for this, and it should work in theory.
> Here's where it fails in practice: ispell.el uses a function called
> ispell-region, which sends the buffer line by line. Whether a line gets
> sent is decided by ispell-get-line, where the criterion is
> (re-search-forward ispell-casechars end t); that is, if anything in the
> line matches the alphabet of the current dictionary, the whole line gets
> sent. And of course words in that line that are in a different language
> are considered misspelled by the spell checker.

Emacs doesn't yet have a concept of the language of the text,
certainly not when several languages are mixed in a buffer.

> There's a possible workaround if you're using hunspell, by using 
> multiple dictionaries at the same time, like this:
> ("ru_RU,en_US" "[[:alpha:]]" "[^[:alpha:]]" "['0-9]" t ("-d" 
> "ru_RU,en_US") nil koi8-r)

Yes, and the latest code base (of what will become Emacs 25.1)
supports this feature of Hunspell.

> But this creates a mess in user dictionaries.

I believe that "mess" is fixed in the current code.  So I suggest you
try the latest emacs-25 branch.


* Re: ispell.el and multilanguage lines.
From: Nikolay Kudryavtsev @ 2016-03-07  1:18 UTC
  To: help-gnu-emacs

Sorry for the very late reply; I was somewhat busy, plus this stuff
required fairly extensive extra checking.
> Emacs doesn't yet have a concept of the language of the text,
> certainly not when several languages are mixed in a buffer.
ispell.el has the CASECHARS and NOT-CASECHARS regular expressions in
ispell-dictionary-alist. These should be enough to differentiate
Russian from English, if only they worked. In practice, those regexps
are not used correctly, since ispell-get-line currently sends the whole
line when re-search-forward finds CASECHARS anywhere within it. While
I'd agree that the number of use cases where CASECHARS and
NOT-CASECHARS are useful is rather limited, fixing them seems easier
than removing them.

> Yes, and the latest code base (of what will become Emacs 25.1)
> supports this feature of Hunspell.
I've tried that version. Sure, it fixes this bug
<https://debbugs.gnu.org/cgi/bugreport.cgi?bug=20495>, but I never
actually had a problem with it in the first place, since you can avoid
it in 24.5 by setting ispell-dictionary-alist manually after
ispell-set-spellchecker-params has run.
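
A minimal sketch of that ordering, assuming hunspell is installed and
on the path and reusing the combined dictionary entry quoted earlier in
the thread. Where this belongs in an init file, and whether a later
call to ispell-set-spellchecker-params would rebuild the list, are not
covered here.

```elisp
(require 'ispell)

;; Assumption: hunspell is the spell checker in use.
(setq ispell-program-name "hunspell")

;; Let ispell.el build its own dictionary list first ...
(ispell-set-spellchecker-params)

;; ... and only then add the combined entry on top of it.
(add-to-list 'ispell-dictionary-alist
             '("ru_RU,en_US" "[[:alpha:]]" "[^[:alpha:]]" "['0-9]" t
               ("-d" "ru_RU,en_US") nil koi8-r))
```

The entry can then be selected with M-x ispell-change-dictionary.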

> I believe that "mess" is fixed in the current code.  So I suggest you
> try the latest emacs-25 branch.
There's a really confusing bug in the way hunspell behaves with
Russian codepages on Windows, which is not really Emacs-related. FYI:
I'm using the current hunspell from ezwinports. Let's start hunspell in
Windows cmd:
chcp 1251
hunspell -a "" -d ru_RU -i cp1251
тестовоеслово
testword
Both of the above would result in spellchecking failures. Let's add them 
to the personal dictionary:
*тестовоеслово
*testword
#
Now, if you try exiting and starting hunspell again, both words would be 
considered correct, since they are in your personal dictionary.

Here's where the bug comes into play. If you start some kind of bash, be
it Cygwin bash or MSYS, and try using hunspell there:
hunspell -a "" -d ru_RU -i utf-8
testword would check OK, but тестовоеслово would fail. The same would
happen within Emacs. My best guess is that this happens because of some
locale-related environment variable that bash (and Emacs) sets.

So, in the end, I'm stuck with aspell and running one spellcheck per 
dictionary.
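
For reference, running one spellcheck per dictionary can be wrapped in
a small command. This is only a sketch: the command and variable names
are hypothetical, and the dictionary names "english" and "russian" are
assumptions that must match entries known to the local aspell setup.

```elisp
(require 'ispell)

(defvar my/ispell-dictionaries '("english" "russian")
  "Hypothetical list of dictionary names to check in turn.
The names are assumptions; adjust them to the installed dictionaries.")

(defun my/ispell-buffer-all-languages ()
  "Run `ispell-buffer' once for each entry in `my/ispell-dictionaries'."
  (interactive)
  (let ((original ispell-local-dictionary))
    (unwind-protect
        (dolist (dict my/ispell-dictionaries)
          ;; Switch the buffer-local dictionary, then check the buffer.
          (ispell-change-dictionary dict)
          (ispell-buffer))
      ;; Restore the dictionary that was active before the run.
      (ispell-change-dictionary (or original "default")))))
```

Each pass is still subject to the line-level behaviour discussed in
this thread, so words in the other language may still be flagged.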

-- 
Best Regards,
Nikolay Kudryavtsev


* Re: ispell.el and multilanguage lines.
From: Eli Zaretskii @ 2016-03-07 16:38 UTC
  To: help-gnu-emacs

> From: Nikolay Kudryavtsev <nikolay.kudryavtsev@gmail.com>
> Date: Mon, 7 Mar 2016 04:18:20 +0300
> 
> There's a really confusing bug in the way hunspell behaves with
> Russian codepages on Windows, which is not really Emacs-related. FYI:
> I'm using the current hunspell from ezwinports. Let's start hunspell in
> Windows cmd:
> chcp 1251
> hunspell -a "" -d ru_RU -i cp1251
> тестовоеслово
> testword
> Both of the above would result in spellchecking failures. Let's add them 
> to the personal dictionary:
> *тестовоеслово
> *testword
> #
> Now, if you try exiting and starting hunspell again, both words would be 
> considered correct, since they are in your personal dictionary.
> 
> Here's where the bug comes into play. If you start some kind of bash, be
> it Cygwin bash or MSYS, and try using hunspell there:
> hunspell -a "" -d ru_RU -i utf-8
> testword would check OK, but тестовоеслово would fail. The same would
> happen within Emacs. My best guess is that this happens because of some
> locale-related environment variable that bash (and Emacs) sets.

Why don't you use utf-8 instead of cp1251?


* Re: ispell.el and multilanguage lines.
From: Nikolay Kudryavtsev @ 2016-03-07 21:10 UTC
  To: help-gnu-emacs

UTF-8 does not work for Cyrillic in cmd.exe by default. Hunspell first
signals a conversion error, then exits the moment you enter any Cyrillic
input, while a Latin word is still checked:
C:\>chcp 65001
Active code page: 65001

C:\>hunspell -a "" -d ru_RU -i utf-8
error - iconv_open: KOI8-R -> CP65001
error - iconv_open: KOI8-R -> CP65001
@(#) International Ispell Version 3.2.06 (but really Hunspell 1.3.2)
тестовоеслово

C:\>hunspell -a "" -d ru_RU -i utf-8
error - iconv_open: KOI8-R -> CP65001
error - iconv_open: KOI8-R -> CP65001
@(#) International Ispell Version 3.2.06 (but really Hunspell 1.3.2)
testword
*

But the problem is not with UTF-8 itself; it works in general from bash
and Emacs. Only the Russian words that were added to the personal
dictionary fail to be read correctly.

You can start hunspell with UTF-8 in bash and add a Russian and an
English word to the personal dictionary. Exit hunspell and start it
again in bash: the English word passes, the Russian one fails. Start it
in a cp1251 cmd session and both words pass.
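
The thread does not establish the cause. One assumption consistent with
these symptoms is that the personal dictionary is written in one
encoding and read back in another. The sketch below only shows that the
same Russian word is a different byte sequence in UTF-8, cp1251 and
KOI8-R (the encoding named in the iconv_open errors above), while an
ASCII word is identical in all three. The helper name is hypothetical.

```elisp
;; Hypothetical helper for illustration only.
(defun my/encoded-bytes (word coding)
  "Return WORD encoded with coding system CODING as a list of byte values."
  (append (encode-coding-string word coding) nil))

;; The Russian test word differs between the two single-byte encodings,
;; and its UTF-8 form uses two bytes per letter.
(equal (my/encoded-bytes "тестовоеслово" 'cp1251)
       (my/encoded-bytes "тестовоеслово" 'koi8-r))

;; The ASCII test word is byte-for-byte the same in all three.
(equal (my/encoded-bytes "testword" 'utf-8)
       (my/encoded-bytes "testword" 'cp1251))
```

That would explain why testword survives a round trip through the
personal dictionary regardless of the session, but it remains a guess.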

-- 
Best Regards,
Nikolay Kudryavtsev

* Re: ispell.el and multilanguage lines.
From: Eli Zaretskii @ 2016-03-07 21:17 UTC
  To: help-gnu-emacs

> From: Nikolay Kudryavtsev <nikolay.kudryavtsev@gmail.com>
> Date: Tue, 8 Mar 2016 00:10:38 +0300
> 
> UTF-8 does not work for Cyrillic in cmd.exe by default.

Who said anything about cmd?  I meant to use utf-8 when invoking
Hunspell from ispell.el.  That doesn't require anything from cmd.exe.

> But the problem is not with UTF-8 itself; it works in general from bash
> and Emacs. Only the Russian words that were added to the personal
> dictionary fail to be read correctly.

I think you shouldn't enter words in different languages into the same
personal dictionary.  Having them encoded differently just adds insult
to injury.


* Re: ispell.el and multilanguage lines.
From: Nikolay Kudryavtsev @ 2016-03-07 22:03 UTC
  To: help-gnu-emacs

> Who said anything about cmd?  I meant to use utf-8 when invoking
> Hunspell from ispell.el.  That doesn't require anything from cmd.exe.
Emacs uses UTF-8 by default, and I didn't change it. It works in
general, just not for the Russian personal dictionary.

> I think you shouldn't enter words in different languages into the same
> personal dictionary.
I agree, but that's exactly what hunspell does when run with multiple
dictionaries passed to -d. All words get inserted into the personal
dictionary for the first language specified.

-- 
Best Regards,
Nikolay Kudryavtsev

