Hunspell for Japanese

unofficial mirror of help-gnu-emacs@gnu.org

* Hunspell for Japanese
@ 2018-02-17 13:53 Tak Kunihiro
  2018-02-17 15:18 ` Eli Zaretskii
  2018-02-18  5:31 ` Tak Kunihiro
  0 siblings, 2 replies; 5+ messages in thread
From: Tak Kunihiro @ 2018-02-17 13:53 UTC (permalink / raw)
  To: help-gnu-emacs; +Cc: 国広卓也

I want to spellcheck English phrases that are mixed into Japanese
text with `hunspell'.  When I call M-x ispell-word, the responses from `aspell' and
`hunspell' differ, and the difference affects which words are underlined in
flyspell-mode: `hunspell' draws many unnecessary underlines under Japanese phrases.
For now I have added the following to my ~/.emacs.d/inits.el.

  (defun flyspell-ignore-non-ascii (beg end info)
    "Tell flyspell to ignore non ascii characters.
  Call this on `flyspell-incorrect-hook'."
    (string-match "[^!-~]" (buffer-substring beg end)))
  (add-hook 'flyspell-incorrect-hook 'flyspell-ignore-non-ascii)

Is it possible to make `hunspell' behave like `aspell'?

GNU Emacs 25.3.1 (x86_64-apple-darwin13.4.0, NS appkit-1265.21 Version 10.9.5 (Build 13F1911))
 of 2017-09-19

##
## Aspell
##

$ which aspell
/opt/local/bin/aspell
$ Emacs -Q
M-: (insert "Emacsは日本ではイーマックスと呼ばれる")
C-a
M-: (setq ispell-program-name "aspell")
M-x ispell-word
C-x b *Messages*

> Starting new Ispell process /opt/local/bin/aspell with default dictionary...
> Checking spelling of EMACSは日本語ではイーマックスと呼ばれる...
> EMACSは日本語ではイーマックスと呼ばれる is correct
> You can run the command ‘ispell-word’ with M-$

##
## Hunspell
##

$ which hunspell
/opt/local/bin/hunspell
$ hunspell -D
...
/opt/local/share/hunspell/en_US
LOADED DICTIONARY:
/opt/local/share/hunspell/en_US.aff
/opt/local/share/hunspell/en_US.dic
Hunspell 1.6.2
$ Emacs -Q
M-: (insert "Emacsは日本ではイーマックスと呼ばれる")
C-a
M-: (setq ispell-program-name "hunspell")
M-x ispell-word
C-x b *Messages*

> Starting new Ispell process hunspell with default dictionary...
> Checking spelling of EMACSは日本語ではイーマックスと呼ばれる...
> ispell-word: Ispell and its process have different character maps


* Re: Hunspell for Japanese
  2018-02-17 13:53 Hunspell for Japanese Tak Kunihiro
@ 2018-02-17 15:18 ` Eli Zaretskii
  2018-02-18  5:31 ` Tak Kunihiro
  1 sibling, 0 replies; 5+ messages in thread
From: Eli Zaretskii @ 2018-02-17 15:18 UTC (permalink / raw)
  To: help-gnu-emacs

> From: Tak Kunihiro <tkk@misasa.okayama-u.ac.jp>
> Date: Sat, 17 Feb 2018 22:53:50 +0900
> Cc: 国広卓也 <tkk@misasa.okayama-u.ac.jp>
> 
> I want to spellcheck English phrases that are mixed in Japanese
> phrases by `hunspell'.  When I call M-x ispell-word, responses from `aspell' and
> `hunspell' differ.  The difference results in how underlines are drawn in
> flyspell-mode.  The `hunspell' gives many unnecessary underlines on Japanese phrases.

If your dictionary is for English, why do you expect flyspell-mode to
work correctly with words in another language?  It can't do anything
sensible with such foreign words.  The underlines flyspell-mode shows
in Japanese words when the dictionary is for English could be
anything; you should simply disregard any such underlines in
non-English words.

Can you tell why you pay attention to underlines in non-English words
in this situation?

> Is it possible to make `hunspell' behave like `aspell'?

They are very different programs, so they cannot behave the same.

> $ which hunspell
> /opt/local/bin/hunspell
> $ hunspell -D
> ...
> /opt/local/share/hunspell/en_US
> LOADED DICTIONARY:
> /opt/local/share/hunspell/en_US.aff
> /opt/local/share/hunspell/en_US.dic
> Hunspell 1.6.2
> $ Emacs -Q
> M-: (insert "Emacsは日本ではイーマックスと呼ばれる")
> C-a
> M-: (setq ispell-program-name "hunspell")
> M-x ispell-word
> C-x b *Messages*
> 
> > Starting new Ispell process hunspell with default dictionary...
> > Checking spelling of EMACSは日本語ではイーマックスと呼ばれる...
> > ispell-word: Ispell and its process have different character maps

I see the same message.  It is caused by Hunspell somehow considering
the string "は日本語ではイーマックスと呼ばれる" as more than one word,
and it therefore returns 3 misspellings, which then trigger the above
cryptic error message.
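
One way to see how Hunspell splits such a line is to feed it the same
text in pipe mode, outside of ispell.el.  A minimal sketch, assuming
`hunspell' is on exec-path and an en_US dictionary is installed; no
particular output is implied here:

```elisp
;; Sketch: pipe the sample sentence to `hunspell -a' and return the raw reply.
;; Assumes hunspell is on exec-path and the en_US dictionary is installed.
(with-temp-buffer
  (insert "Emacsは日本ではイーマックスと呼ばれる\n")
  (let ((coding-system-for-write 'utf-8)
        (coding-system-for-read 'utf-8))
    ;; DELETE = t and BUFFER = t: replace the buffer text with the output.
    (call-process-region (point-min) (point-max) "hunspell"
                         t t nil
                         "-a" "-d" "en_US" "-i" "utf-8"))
  (buffer-string))
```

After the version banner, each reply line should correspond to one
token that Hunspell checked, so the number of lines indicates how many
words it saw in the input.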

But once again, you've set up flyspell-mode to work in English, so you
shouldn't pay attention to what it does with Japanese.  For starters,
I believe the encoding Emacs uses is incorrect in that case, because
the en_US.aff file probably states that it wants a Latin-1 encoding,
not UTF-8.  But even using UTF-8 will not help here, AFAIU.
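
The encoding a Hunspell dictionary declares is given on the SET line of
its .aff file.  A minimal sketch for reading it, using the en_US.aff
path from the `hunspell -D' output quoted above (adjust the path for
other installations):

```elisp
;; Sketch: return the encoding named on the SET line of the .aff file,
;; or nil if no such line is found.
(with-temp-buffer
  (insert-file-contents "/opt/local/share/hunspell/en_US.aff")
  (goto-char (point-min))
  (when (re-search-forward "^SET[ \t]+\\(\\S-+\\)" nil t)
    (match-string 1)))
```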


* Re: Hunspell for Japanese
  2018-02-17 13:53 Hunspell for Japanese Tak Kunihiro
  2018-02-17 15:18 ` Eli Zaretskii
@ 2018-02-18  5:31 ` Tak Kunihiro
  2018-02-18 15:59   ` Eli Zaretskii
  2018-02-24  1:41   ` Tak Kunihiro
  1 sibling, 2 replies; 5+ messages in thread
From: Tak Kunihiro @ 2018-02-18  5:31 UTC (permalink / raw)
  To: help-gnu-emacs, eliz; +Cc: tkk

Thank you for the reply.

I see.  It is true that I should not expect either Aspell or Hunspell
to handle Japanese correctly when their task is to check English.  It
was just luck that flyspell-mode with Aspell ignores Japanese
words and shows no underlines.

> Can you tell why you pay attention to underlines in non-English
> words in this situation?

When I write Japanese, English words such as `Emacs' are very often
mixed in.  Thus I (and, I think, most Japanese users) run flyspell-mode
with an English dictionary all the time.  I expect flyspell-mode to
ignore all Japanese words and check only English words, as LibreOffice
does.

With flyspell-mode and Hunspell, underlines are shown under many Japanese
phrases (not all Japanese phrases) and I cannot tell which underlines
correspond to misspelled English words.  As noted already, Aspell
underlines only misspelled English words.

> But once again, you've set up flyspell-mode to work in English, so you
> shouldn't pay attention to what it does with Japanese.

I agree.  I also see the problem with M-x ispell-buffer, and found a
solution.

  (defvar ispell-regexp-non-ascii "[^\000-\377]+"
    "Regular expression to match a non-ascii word.")
  (add-to-list 'ispell-skip-region-alist (list ispell-regexp-non-ascii))

Once I accept this solution for M-x ispell-buffer, I would accept the
corresponding solution for flyspell-mode shown below.

  (defun flyspell-skip-non-ascii (beg end info)
    "Tell flyspell to skip a non-ascii word.
  Call this on `flyspell-incorrect-hook'."
    (string-match ispell-regexp-non-ascii (buffer-substring beg end)))
  (add-hook 'flyspell-incorrect-hook 'flyspell-skip-non-ascii)

It took me a while to figure this out.  I think that M-x
ispell-buffer and flyspell-mode provide fundamental functionality,
and it would be good to document this somewhere in Emacs, such as in (info
"(emacs) Spelling").  Can you give a suggestion?


* Re: Hunspell for Japanese
  2018-02-18  5:31 ` Tak Kunihiro
@ 2018-02-18 15:59   ` Eli Zaretskii
  2018-02-24  1:41   ` Tak Kunihiro
  1 sibling, 0 replies; 5+ messages in thread
From: Eli Zaretskii @ 2018-02-18 15:59 UTC (permalink / raw)
  To: help-gnu-emacs

> Date: Sun, 18 Feb 2018 14:31:56 +0900 (JST)
> Cc: tkk@misasa.okayama-u.ac.jp
> From: Tak Kunihiro <tkk@misasa.okayama-u.ac.jp>
> 
>   (defvar ispell-regexp-non-ascii "[^\000-\377]+"
>     "Regular expression to match a non-ascii word.")
>   (add-to-list 'ispell-skip-region-alist (list ispell-regexp-non-ascii))
> 
> Once I accept this solution for M-x ispell-buffer, I would accept a
> solution for flyspell-mode as shown below.
> 
>   (defun flyspell-skip-non-ascii (beg end info)
>     "Tell flyspell to skip a non-ascii word.
>   Call this on `flyspell-incorrect-hook'."
>     (string-match ispell-regexp-non-ascii (buffer-substring beg end)))
>   (add-hook 'flyspell-incorrect-hook 'flyspell-skip-non-ascii)
> 
> It took me a while to figure this out.  I think that what M-x
> ispell-buffer and flyspell-mode provide is fundamental functionalities
> and it is good to be documented in somewhere in Emacs such for (info
> "(emacs) Spelling").  Can you give suggestion?

On the Wiki?

You see, the solution you propose has one significant disadvantage: it
will skip words used in English prose which are written using
non-ASCII characters.  It's true that there aren't many of those, but
they do exist.
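
A sketch, not from the thread, of one way to narrow the skip rule so
that only Japanese text is skipped and words such as "naïve" are still
checked.  The `my-' names are hypothetical, and the character ranges
(CJK punctuation, kana, common kanji) are an assumption about what
should count as Japanese:

```elisp
(require 'ispell)

;; Hypothetical name.  Ranges: U+3000-U+30FF (CJK punctuation, hiragana,
;; katakana) and U+4E00-U+9FFF (CJK unified ideographs).
(defvar my-ispell-regexp-japanese
  "[\u3000-\u30ff\u4e00-\u9fff]+"
  "Regexp matching a run of CJK punctuation, kana, or common kanji.")

;; Let `ispell-buffer' and friends skip runs of Japanese text.
(add-to-list 'ispell-skip-region-alist (list my-ispell-regexp-japanese))

(defun my-flyspell-skip-japanese (beg end _info)
  "Return non-nil if the word between BEG and END contains Japanese text.
Meant for `flyspell-incorrect-hook'; non-nil suppresses the underline."
  (string-match-p my-ispell-regexp-japanese
                  (buffer-substring-no-properties beg end)))

(add-hook 'flyspell-incorrect-hook #'my-flyspell-skip-japanese)
```

With this regexp, (string-match-p my-ispell-regexp-japanese "naïve")
returns nil, while a word containing kana or kanji matches and is
skipped.  Full-width Latin letters and rarer kanji fall outside these
ranges and would need to be added if they matter.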

You could instead try using two dictionaries at the same time, one for
English, the other for Japanese.  This will only work with Hunspell,
and only in Emacs 26 or later.  Caveat: I never tried it with these
two languages, so I don't know whether this combination has some
subtle problems with that feature.
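
A minimal sketch of such a two-dictionary setup, assuming Emacs 26 or
later and Hunspell.  The name "ja_JP" is a placeholder: the thread does
not establish that a suitable Japanese Hunspell dictionary is installed
or what it would be called, so substitute whatever `hunspell -D' lists.

```elisp
(with-eval-after-load 'ispell
  (setq ispell-program-name "hunspell")
  ;; "ja_JP" is hypothetical; use a dictionary name shown by `hunspell -D'.
  (setq ispell-dictionary "en_US,ja_JP")
  ;; Make sure the dictionary tables are initialized before registering
  ;; the combined entry.
  (ispell-set-spellchecker-params)
  (ispell-hunspell-add-multi-dic "en_US,ja_JP"))
```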




* Re: Hunspell for Japanese
  2018-02-18  5:31 ` Tak Kunihiro
  2018-02-18 15:59   ` Eli Zaretskii
@ 2018-02-24  1:41   ` Tak Kunihiro
  1 sibling, 0 replies; 5+ messages in thread
From: Tak Kunihiro @ 2018-02-24  1:41 UTC (permalink / raw)
  To: help-gnu-emacs, eliz; +Cc: tkk

> On the Wiki?

OK. I put the solution on EmacsWiki.

https://www.emacswiki.org/emacs/FlySpell#toc14



