From: Eli Zaretskii <eliz@gnu.org>
To: "Николай Сущенко" <sckol@yandex.ru>
Cc: 7781@debbugs.gnu.org
Subject: bug#7781: [PATCH] Fix ispell problem with hunspell and UTF-8 file
Date: Sun, 14 Apr 2013 10:08:42 +0300
Message-ID: <83ppxx7fnp.fsf@gnu.org>
In-Reply-To: <516A4DC3.90205@yandex.ru>

> Date: Sun, 14 Apr 2013 10:33:39 +0400
> From: Николай Сущенко
>  <sckol@yandex.ru>
> CC: 7781@debbugs.gnu.org
> 
> Please send me this patch, I'll ask the hunspell developers to include it.

Attached.  This is a small part of a much larger patch, most of it for
Windows-specific problems.  If you have problems compiling the patched
hunspell, let me know: it could be that I omitted some hunk that is
needed for this part.

> Could you also recall which concrete problems this workaround causes?
> For me it works fine, but I haven't tested it with different languages
> and encodings.

One problem is that you assume the encoding of the communications with
hunspell is UTF-8, and thus matches the internal representation of
text in Emacs buffers and strings (only then will byte-to-position
give correct results).  But that assumption is false: hunspell
supports any encoding that it can convert to/from UTF-8 (it uses
libiconv internally).  The "usual" choice of the encoding is the one
used by the dictionary.  Not every dictionary out there is in UTF-8.

> If there are some problems, I could try to fix them

I don't think you can fix this on the Emacs side, because Emacs cannot
easily and/or quickly convert between bytes and characters in an
arbitrary multibyte encoding.

When I discovered this problem, I also tried fixing it on the Emacs
side first, but then I realized that this kind of solution has too
many problems, and instead fixed it in hunspell.
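
To illustrate the conversion in isolation: in UTF-8, a byte offset can
be turned into a character offset by counting the bytes before it that
are not continuation bytes (those of the form 10xxxxxx).  The following
is only a minimal standalone sketch of that idea, not the patch itself;
the function name utf8_byte_to_char_offset and the use of std::string
are assumptions made for the illustration.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, for illustration only: return the number of
// UTF-8 characters that begin before byte_offset in buf.
// Assumes buf holds valid UTF-8; the offset is clamped to buf.size().
static std::size_t utf8_byte_to_char_offset(const std::string &buf,
                                            std::size_t byte_offset)
{
    std::size_t char_offset = 0;
    for (std::size_t i = 0; i < byte_offset && i < buf.size(); i++) {
        // Continuation bytes look like 10xxxxxx; every other byte
        // starts a new character.
        if ((static_cast<unsigned char>(buf[i]) & 0xc0) != 0x80)
            char_offset++;
    }
    return char_offset;
}
```

For example, in the UTF-8 string "привет мир" the second word starts at
byte offset 13 but at character offset 7.  This counting rule is
specific to UTF-8 and does not carry over to other multibyte encodings.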

--- src/tools/hunspell.cxx~0	2011-01-21 19:01:29.000000000 +0200
+++ src/tools/hunspell.cxx	2013-02-07 10:11:54.443610900 +0200
@@ -710,13 +748,22 @@ if (pos >= 0) {
 			fflush(stdout);
 		} else {
 			char ** wlst = NULL;
-			int ns = pMS[d]->suggest(&wlst, token);
+			int byte_offset = parser->get_tokenpos() + pos;
+			int char_offset = 0;
+			if (strcmp(io_enc, "UTF-8") == 0) {
+				for (int i = 0; i < byte_offset; i++) {
+					if ((buf[i] & 0xc0) != 0x80)
+						char_offset++;
+				}
+			} else {
+				char_offset = byte_offset;
+			}
+			int ns = pMS[d]->suggest(&wlst, chenc(token, io_enc, dic_enc[d]));
 			if (ns == 0) {
-		    		fprintf(stdout,"# %s %d", token,
-		    		    parser->get_tokenpos() + pos);
+		    		fprintf(stdout,"# %s %d", token, char_offset);
 			} else {
 				fprintf(stdout,"& %s %d %d: ", token, ns,
-				    parser->get_tokenpos() + pos);
+					char_offset);
 				fprintf(stdout,"%s", chenc(wlst[0], dic_enc[d], io_enc));
 			}
 			for (int j = 1; j < ns; j++) {
@@ -745,13 +792,23 @@ if (pos >= 0) {
 			if (root) free(root);
 		} else {
 			char ** wlst = NULL;
+			int byte_offset = parser->get_tokenpos() + pos;
+			int char_offset = 0;
+			if (strcmp(io_enc, "UTF-8") == 0) {
+				for (int i = 0; i < byte_offset; i++) {
+					if ((buf[i] & 0xc0) != 0x80)
+						char_offset++;
+				}
+			} else {
+				char_offset = byte_offset;
+			}
 			int ns = pMS[d]->suggest(&wlst, chenc(token, io_enc, dic_enc[d]));
 			if (ns == 0) {
 		    		fprintf(stdout,"# %s %d", chenc(token, io_enc, ui_enc),
-		    		    parser->get_tokenpos() + pos);
+		    		    char_offset);
 			} else {
 				fprintf(stdout,"& %s %d %d: ", chenc(token, io_enc, ui_enc), ns,
-				    parser->get_tokenpos() + pos);
+				    char_offset);
 				fprintf(stdout,"%s", chenc(wlst[0], dic_enc[d], ui_enc));
 			}
 			for (int j = 1; j < ns; j++) {






