From: Eli Zaretskii <eliz@gnu.org>
To: "Николай Сущенко" <sckol@yandex.ru>
Cc: 7781@debbugs.gnu.org
Subject: bug#7781: [PATCH] Fix ispell problem with hunspell and UTF-8 file
Date: Sun, 14 Apr 2013 10:08:42 +0300
Message-ID: <83ppxx7fnp.fsf@gnu.org>
In-Reply-To: <516A4DC3.90205@yandex.ru>
> Date: Sun, 14 Apr 2013 10:33:39 +0400
> From: Николай Сущенко <sckol@yandex.ru>
> CC: 7781@debbugs.gnu.org
>
> Please send me this patch, I'll ask the hunspell developers to include it.
Attached. This is a small part of a much larger patch, most of it for
Windows-specific problems. If you have problems compiling the patched
hunspell, let me know: it could be that I omitted some hunk that is
needed for this part.
> Could you also recall which concrete problems this workaround produces?
> For me it works fine, but I haven't tested it with different languages and
> encodings.
One problem is that you assume the encoding of the communications with
hunspell is UTF-8, and thus matches the internal representation of
text in Emacs buffers and strings (only then will byte-to-position
give correct results). But that assumption is false: hunspell
supports any encoding that it can convert to/from UTF-8 (it uses
libiconv internally). The "usual" choice of the encoding is the one
used by the dictionary. Not every dictionary out there is in UTF-8.
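To make the mismatch concrete, here is a minimal standalone sketch (not part of the patch; the names `kLatin1Line`, `kUtf8Line`, and `byte_offset_of` are made up for illustration). It encodes the same line, "café bar", in ISO-8859-1 and in UTF-8, and finds the byte offset of the second word in each, which is the kind of number the unpatched hunspell reports:

```cpp
#include <cstddef>
#include <string>

// "café bar" in two encodings. The é is one byte (0xE9) in ISO-8859-1
// and two bytes (0xC3 0xA9) in UTF-8. The literals are split so the hex
// escapes cannot absorb the characters that follow them.
const std::string kLatin1Line = std::string("caf\xE9") + " bar";
const std::string kUtf8Line   = std::string("caf\xC3\xA9") + " bar";

// Byte offset at which `word` starts in `line` (hypothetical helper).
// Returns std::string::npos if the word does not occur.
std::size_t byte_offset_of(const std::string& line, const std::string& word) {
    return line.find(word);
}
```

The second word starts at byte 5 in the ISO-8859-1 line and at byte 6 in the UTF-8 line, while its character offset is 5 in both. A reported byte offset can therefore be mapped back onto the text only if the encoding used on the pipe is known.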
> If there are problems, I could try to fix them
I don't think you can fix this on the Emacs side, because Emacs cannot
easily and/or quickly convert between bytes and characters in an
arbitrary multibyte encoding.

When I discovered this problem, I also tried fixing it on the Emacs
side first, but then I realized that this kind of solution has too
many problems, and instead fixed it in hunspell.
--- src/tools/hunspell.cxx~0 2011-01-21 19:01:29.000000000 +0200
+++ src/tools/hunspell.cxx 2013-02-07 10:11:54.443610900 +0200
@@ -710,13 +748,22 @@ if (pos >= 0) {
fflush(stdout);
} else {
char ** wlst = NULL;
- int ns = pMS[d]->suggest(&wlst, token);
+ int byte_offset = parser->get_tokenpos() + pos;
+ int char_offset = 0;
+ if (strcmp(io_enc, "UTF-8") == 0) {
+ for (int i = 0; i < byte_offset; i++) {
+ if ((buf[i] & 0xc0) != 0x80)
+ char_offset++;
+ }
+ } else {
+ char_offset = byte_offset;
+ }
+ int ns = pMS[d]->suggest(&wlst, chenc(token, io_enc, dic_enc[d]));
if (ns == 0) {
- fprintf(stdout,"# %s %d", token,
- parser->get_tokenpos() + pos);
+ fprintf(stdout,"# %s %d", token, char_offset);
} else {
fprintf(stdout,"& %s %d %d: ", token, ns,
- parser->get_tokenpos() + pos);
+ char_offset);
fprintf(stdout,"%s", chenc(wlst[0], dic_enc[d], io_enc));
}
for (int j = 1; j < ns; j++) {
@@ -745,13 +792,23 @@ if (pos >= 0) {
if (root) free(root);
} else {
char ** wlst = NULL;
+ int byte_offset = parser->get_tokenpos() + pos;
+ int char_offset = 0;
+ if (strcmp(io_enc, "UTF-8") == 0) {
+ for (int i = 0; i < byte_offset; i++) {
+ if ((buf[i] & 0xc0) != 0x80)
+ char_offset++;
+ }
+ } else {
+ char_offset = byte_offset;
+ }
int ns = pMS[d]->suggest(&wlst, chenc(token, io_enc, dic_enc[d]));
if (ns == 0) {
fprintf(stdout,"# %s %d", chenc(token, io_enc, ui_enc),
- parser->get_tokenpos() + pos);
+ char_offset);
} else {
fprintf(stdout,"& %s %d %d: ", chenc(token, io_enc, ui_enc), ns,
- parser->get_tokenpos() + pos);
+ char_offset);
fprintf(stdout,"%s", chenc(wlst[0], dic_enc[d], ui_enc));
}
for (int j = 1; j < ns; j++) {
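Both hunks add the same loop. As a rough standalone sketch of that conversion, here it is lifted out of hunspell.cxx into a helper so it can be tried in isolation. The function name `char_offset_from_bytes` is hypothetical; `buf`, `io_enc`, `byte_offset`, and `char_offset` follow the variable names in the patch. The cast to `unsigned char` is an addition for clarity and is not in the patch.

```cpp
#include <cstring>

// Convert a byte offset into `buf` to a character offset, assuming the
// same rule as the patch: only when the I/O encoding is exactly "UTF-8"
// are bytes and characters told apart; otherwise the byte offset is
// returned unchanged.
int char_offset_from_bytes(const char* buf, int byte_offset, const char* io_enc) {
    if (std::strcmp(io_enc, "UTF-8") != 0)
        return byte_offset;

    int char_offset = 0;
    for (int i = 0; i < byte_offset; i++) {
        // UTF-8 continuation bytes look like 10xxxxxx; every other byte
        // starts a new character, so count only those.
        if ((static_cast<unsigned char>(buf[i]) & 0xc0) != 0x80)
            char_offset++;
    }
    return char_offset;
}
```

For the UTF-8 bytes of "café bar" and a byte offset of 6, this returns 5 when `io_enc` is "UTF-8" and 6 for any other encoding name. As in the patch, multibyte encodings other than UTF-8 still get the byte offset unchanged, and the comparison against "UTF-8" is exact, so a differently spelled encoding name would take the fallback branch.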