bug#42602: Wrong (not-)casechars value for "polish" in ispell-dictionary-base-alist

* bug#42602: Wrong (not-)casechars value for "polish" in ispell-dictionary-base-alist
@ 2020-07-29 16:12 Sebastian Urban
  2020-07-29 18:43 ` Eli Zaretskii
  0 siblings, 1 reply; 6+ messages in thread
From: Sebastian Urban @ 2020-07-29 16:12 UTC (permalink / raw)
  To: 42602

Hello,

for words like:
    męski
    miód
    klątwa
    ślad
    łuk
    żaba
    źrebak
    grzać
    bańka
ispell.el sends to Aspell only part of the word, e.g. "lad" instead of
"ślad", or "kl"/"twa" (depending on the cursor position) instead of
"klątwa".

I think this is because of the wrong value of (NOT-)CASECHARS, which is
ASCII A-z letters and a few chars, of which only ó/Ó is valid for Polish.

Although, for some reason, it doesn't recognize "ó" in the word "miód",
sending "mi" or "d". It is on the list of CASECHARS as \363, so it
should work.  Moreover, if I type "[\363\323]" in regexp-builder, it
won't recognize ó/Ó, but it doesn't have a problem with other Polish
chars, like "ł" ("[\502]") or "ż" ("[\574]").

If I put in my init.el:
--8<---------------cut here---------------start------------->8---
(setq ispell-program-name "C:/cygwin64/bin/aspell")
(add-hook 'ispell-initialize-spellchecker-hook
          (lambda ()
            (add-to-list 'ispell-local-dictionary-alist
                         '("pl"
                           ;; "[[:alpha:]]"
                           ;; "[^[:alpha:]]"
                           ;; ęóąśłżźćńĘÓĄŚŁŻŹĆŃ
                           "[A-Za-z\431\363\405\533\502\574\572\407\504\430\323\404\532\501\573\571\406\503]"
                           "[^A-Za-z\431\363\405\533\502\574\572\407\504\430\323\404\532\501\573\571\406\503]"
                           "[.]" nil nil nil iso-8859-2))))
(setq ispell-dictionary "pl")
--8<---------------cut here---------------end--------------->8---

everything seems to work, even ó/Ó are recognised. "[[:alpha:]]" works
as well, so I left it as an alternative. Changing from iso-8859-2 to
utf-8 doesn't break anything.

Tested on:
- GNU Emacs 26.3 (build 1, x86_64-w64-mingw32) of 2019-08-29,
- GNU Emacs 28.0.50 (build 1, x86_64-w64-mingw32) of 2020-07-05,
with Aspell from Cygwin installation.


S. U.






* bug#42602: Wrong (not-)casechars value for "polish" in ispell-dictionary-base-alist
  2020-07-29 16:12 bug#42602: Wrong (not-)casechars value for "polish" in ispell-dictionary-base-alist Sebastian Urban
@ 2020-07-29 18:43 ` Eli Zaretskii
  2020-07-30 11:39   ` Sebastian Urban
  0 siblings, 1 reply; 6+ messages in thread
From: Eli Zaretskii @ 2020-07-29 18:43 UTC (permalink / raw)
  To: Sebastian Urban; +Cc: 42602

> From: Sebastian Urban <mrsebastianurban@gmail.com>
> Date: Wed, 29 Jul 2020 18:12:02 +0200
> 
> for words like:
>     męski
>     miód
>     klątwa
>     ślad
>     łuk
>     żaba
>     źrebak
>     grzać
>     bańka
> ispell.el sends to Aspell only part of the word, e.g. "lad" instead of
> "ślad", or "kl"/"twa" (depending on the cursor position) instead of
> "klątwa".
> 
> I think this is because wrong value of (NOT-)CASECHARS, which is ASCII
> A-z letters and a few chars of which only ó/Ó is valid for Polish.
> 
> Although, for some reason, it doesn't recognize "ó" in word "miód",
> sending "mi" or "d". It is on the list of CASECHARS under \363, so it
> should work.  Moreover, if I type in regexp-builder "[\363\323]" it
> won't recognize ó/Ó, but it doesn't have a problem with other Polish
> chars, like "ł" ("[\502]") or "ż" ("[\574]").
> 
> If I put in my init.el:
> --8<---------------cut here---------------start------------->8---
> (setq ispell-program-name "C:/cygwin64/bin/aspell")
> (add-hook 'ispell-initialize-spellchecker-hook
>            (lambda ()
>            (add-to-list 'ispell-local-dictionary-alist
>                         '("pl"
>                           ;; "[[:alpha:]]"
>                           ;; "[^[:alpha:]]"
>                           ;; ęóąśłżźćńĘÓĄŚŁŻŹĆŃ
> "[A-Za-z\431\363\405\533\502\574\572\407\504\430\323\404\532\501\573\571\406\503]"
> "[^A-Za-z\431\363\405\533\502\574\572\407\504\430\323\404\532\501\573\571\406\503]"
>                           "[.]" nil nil nil iso-8859-2))))
> (setq ispell-dictionary "pl")
> --8<---------------cut here---------------end--------------->8---
> 
> everything seems to work, even ó/Ó are recognised.

I don't understand this change.  Values above octal 377 cannot be
right in the above regexps, because they are supposed to be in Latin-2
encoding, which is a single-byte encoding, and so can only handle
values below octal 400.  How did you come up with those values?

Anyway, I'm quite sure some other factor is at work here.

> Tested on:
> - GNU Emacs 26.3 (build 1, x86_64-w64-mingw32) of 2019-08-29,
> - GNU Emacs 28.0.50 (build 1, x86_64-w64-mingw32) of 2020-07-05,
> with Aspell from Cygwin installation.

Your Emacs is a native MinGW build, whereas Aspell seems to be a
Cygwin build?  If so, you could have incompatibility in character
encoding.  What is your Windows locale?  And what does

  M-: (getenv "LANG") RET

yield inside Emacs?






* bug#42602: Wrong (not-)casechars value for "polish" in ispell-dictionary-base-alist
  2020-07-29 18:43 ` Eli Zaretskii
@ 2020-07-30 11:39   ` Sebastian Urban
  2020-07-30 13:26     ` Eli Zaretskii
  0 siblings, 1 reply; 6+ messages in thread
From: Sebastian Urban @ 2020-07-30 11:39 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 42602

> I don't understand this change.  Values above octal 377 cannot be
> right in the above regexps, because they are supposed to be in
> Latin-2 encoding, which is a single-byte encoding, and so can only
> handle values below octal 400.  How did you come up with those
> values?

Basically, C-x = on a char, which gave me octal values.  I thought it
was recognising only A-z + ó/Ó and some other chars that I'm not
interested in, so I swapped those values for the ones corresponding to
the Polish chars.  That's the whole story.

> Anyway, I'm quite sure some other factor is at work here.

Well, I did some tests, e.g. switched back to the original value of
"polish" in my "pl" dictionary, and... it works.  And if I change from
iso-8859-2 to utf-8 in my "pl" (with original value from "polish") it
doesn't work.  So, as you later wrote - wrong character encoding,
I guess.

Looking for a cause (in default settings), I think I found it in
ispell-dictionary-base-alist and ispell-dictionary-alist.  During
"transfer" from *-base-* to ispell-dictionary-alist, the value of
CHARACTER-SET is changed in all cases from iso-* or cp1255 to utf-8,
then ispell uses these (from ispell-dictionary-alist) when it "talks"
with Aspell.
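
One way to check the observation above is to compare the CHARACTER-SET
slot of the "polish" entry in both alists.  This is only a minimal sketch:
`my/ispell-entry-charset` is a hypothetical helper, not part of ispell.el,
and it assumes the entry layout used in the init.el snippet earlier in this
thread, where CHARACTER-SET is the eighth element (index 7).

```elisp
(require 'ispell)

;; Hypothetical helper (not part of ispell.el).  Assumes CHARACTER-SET
;; is the eighth element of a dictionary entry, i.e. index 7.
(defun my/ispell-entry-charset (name alist)
  "Return the CHARACTER-SET slot of dictionary NAME in ALIST, or nil."
  (nth 7 (assoc name alist)))

;; Value shipped with ispell.el:
(my/ispell-entry-charset "polish" ispell-dictionary-base-alist)

;; Value in effect after ispell has initialized itself for the current
;; spellchecker (nil if there is no "polish" entry yet):
(my/ispell-entry-charset "polish" ispell-dictionary-alist)
```

Evaluating the two calls with M-: or in *scratch* should show whether the
coding system differs between the two lists on a given setup.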

On the other hand, if I use Emacs 26.3 from Cygwin, everything works
out of the box, I don't even have to set "polish" as default
dictionary. But there, in Cygwin command line, "env | grep LANG" gives
"LANG=pl_PL.UTF-8".

> Your Emacs is a native MinGW build, whereas Aspell seems to be
> a Cygwin build?

Both Emacses are official Win builds, and Aspell is installed through
Cygwin.

> If so, you could have incompatibility in character encoding.  What
> is your Windows locale?

"Polish" everywhere in "Control Panel" -> "Regional and Language".

> And what does M-: (getenv "LANG") RET yield inside Emacs?

"PLK"

S. U.

P.S.
> Moreover, if I type in regexp-builder "[\363\323]" it won't
> recognize ó/Ó, but it doesn't have a problem with other Polish
> chars, like "ł" ("[\502]") or "ż" ("[\574]").

In the "Character List" buffer for unicode-bmp, regexp-builder
(numbers are octal values):
- 0-177 and 400-777 - highlights chars
- 240-377 - doesn't highlight chars (it highlights them if I use hex
   value, or insert them directly)
I didn't check "80h-9Fh" chars.  Chars like C-a were checked by
inserting them with quoted-insert in another buffer.
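
A possible explanation for the 240-377 range, offered as an assumption and
not established anywhere in this thread: in a string literal, octal escapes
in the range 200-377 denote raw bytes, so the Lisp reader produces a unibyte
string, whereas an escape above 377 can only be a character and makes the
string multibyte.  The sketch below uses only built-in functions and shows
that difference, plus what the raw byte becomes once it is decoded as
Latin-2.

```elisp
;; Octal escapes in the range 200-377 are read as raw bytes, so this
;; string literal is unibyte.
(multibyte-string-p "[\363\323]")       ; => nil

;; An escape above 377 cannot be a byte, so this string is multibyte ...
(multibyte-string-p "[\502]")           ; => t
;; ... and \502 is the character with code point octal 502, i.e. ł.
(eq (aref "[\502]" 1) ?ł)               ; => t

;; The raw byte only becomes ó after it is decoded as Latin-2.
(string= (decode-coding-string "\363" 'iso-8859-2) "ó")   ; => t
```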


* bug#42602: Wrong (not-)casechars value for "polish" in ispell-dictionary-base-alist
  2020-07-30 11:39   ` Sebastian Urban
@ 2020-07-30 13:26     ` Eli Zaretskii
  2020-07-31 10:52       ` Sebastian Urban
  0 siblings, 1 reply; 6+ messages in thread
From: Eli Zaretskii @ 2020-07-30 13:26 UTC (permalink / raw)
  To: Sebastian Urban; +Cc: 42602

> From: Sebastian Urban <mrsebastianurban@gmail.com>
> Cc: 42602@debbugs.gnu.org
> Date: Thu, 30 Jul 2020 13:39:55 +0200
> 
> > I don't understand this change.  Values above octal 377 cannot be
> > right in the above regexps, because they are supposed to be in
> > Latin-2 encoding, which is a single-byte encoding, and so can only
> > handle values below octal 400.  How did you come up with those
> > values?
> 
> Basically, C-x = on a char, which gave me octal values.

This gives you the Unicode codepoint, not its Latin-2 encoding.  They
are different.  The database in ispell.el uses Latin-2 encodings of
Polish characters.
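
The difference can be seen by printing both numbers side by side.  This is
a minimal sketch: `my/codepoint-vs-latin2` is a hypothetical helper written
for illustration, not something from ispell.el; it relies only on the
built-in `encode-coding-char` and `format`.

```elisp
;; Hypothetical helper (not part of ispell.el): compare a character's
;; Unicode code point with its single-byte ISO-8859-2 (Latin-2) encoding.
(defun my/codepoint-vs-latin2 (char)
  "Return (CODEPOINT . LATIN2-BYTE) for CHAR, both as octal strings.
LATIN2-BYTE is nil when CHAR cannot be encoded in ISO-8859-2."
  (let ((bytes (encode-coding-char char 'iso-8859-2)))
    (cons (format "%o" char)
          (and bytes (format "%o" (aref bytes 0))))))

(my/codepoint-vs-latin2 ?ś)   ; => ("533" . "266")
(my/codepoint-vs-latin2 ?ł)   ; => ("502" . "263")
(my/codepoint-vs-latin2 ?ó)   ; => ("363" . "363")
```

Only ó has the same value in both columns, which fits the report that \363
was the one Polish letter already present in the stock "polish" entry.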

> Well, I did some tests, e.g. switched back to the original value of
> "polish" in my "pl" dictionary, and... it works.  And if I change from
> iso-8859-2 to utf-8 in my "pl" (with original value from "polish") it
> doesn't work.  So, as you later wrote - wrong character encoding,
> I guess.
> 
> Looking for a cause (in default settings), I think I found it in
> ispell-dictionary-base-alist and ispell-dictionary-alist.  During
> "transfer" from *-base-* to ispell-dictionary-alist, the value of
> CHARACTER-SET is changed in all cases from iso-* or cp1255 to utf-8,
> then ispell uses these (from ispell-dictionary-alist) when it "talks"
> with Aspell.
> 
> On the other hand, if I use Emacs 26.3 from Cygwin, everything works
> out of the box, I don't even have to set "polish" as default
> dictionary. But there, in Cygwin command line, "env | grep LANG" gives
> "LANG=pl_PL.UTF-8".

Native MinGW builds cannot use the UTF-8 encoding.

So, do we have a problem to solve, or can this issue be closed?






* bug#42602: Wrong (not-)casechars value for "polish" in ispell-dictionary-base-alist
  2020-07-30 13:26     ` Eli Zaretskii
@ 2020-07-31 10:52       ` Sebastian Urban
  2020-08-13  0:07         ` Stefan Kangas
  0 siblings, 1 reply; 6+ messages in thread
From: Sebastian Urban @ 2020-07-31 10:52 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 42602

>>> I don't understand this change.  Values above octal 377 cannot be
>>> right in the above regexps, because they are supposed to be in
>>> Latin-2 encoding, which is a single-byte encoding, and so can only
>>> handle values below octal 400.  How did you come up with those
>>> values?
>>
>> Basically, C-x = on a char, which gave me octal values.
>
> This gives you the Unicode codepoint, not its Latin-2 encoding.
> They are different.

So, it would work even if I added "\999999999", because Emacs
would not recognize it and would simply ignore it, which means the only
reason it worked was the explicitly set encoding (iso-8859-2)?

> The database in ispell.el uses Latin-2 encodings of Polish
> characters.

As a base, but before ispell.el sends the string to Aspell, it
translates it to utf-8, right?  Because that's the only difference
between my custom "pl" dictionary and the value of "polish" in
ispell-dictionary-alist.

> Native MinGW builds cannot use the UTF-8 encoding.

So, with my setup (not saying that it's the best one, it's just the
current one; if there is a better one, I can change), for the Polish
language I have to define a local dictionary with iso-8859-2 coding?

> So, do we have a problem to solve, or can this issue be closed?

If it's a problem of MinGW, and my setup, then I guess it's not an
Emacs problem, so yes, it can be closed.

S. U.


* bug#42602: Wrong (not-)casechars value for "polish" in ispell-dictionary-base-alist
  2020-07-31 10:52       ` Sebastian Urban
@ 2020-08-13  0:07         ` Stefan Kangas
  0 siblings, 0 replies; 6+ messages in thread
From: Stefan Kangas @ 2020-08-13  0:07 UTC (permalink / raw)
  To: Sebastian Urban; +Cc: 42602-done

Sebastian Urban <mrsebastianurban@gmail.com> writes:

>> So, do we have a problem to solve, or can this issue be closed?
>
> If it's a problem of MinGW, and my setup, then I guess it's not an
> Emacs problem, so yes, it can be closed.

I'm therefore closing this bug report.

Best regards,
Stefan Kangas





