From: Eli Zaretskii
Newsgroups: gmane.emacs.bugs
Subject: bug#7781: [PATCH] Fix ispell problem with hunspell and UTF-8 file
Date: Sun, 14 Apr 2013 10:08:42 +0300
Message-ID: <83ppxx7fnp.fsf@gnu.org>
References: <87sjx9fula.fsf@sc3d.org> <5169AE26.1000403@yandex.ru> <83wqs57jnw.fsf@gnu.org> <516A4DC3.90205@yandex.ru>
In-reply-to: <516A4DC3.90205@yandex.ru>
To: Николай Сущенко
Cc: 7781@debbugs.gnu.org

> Date: Sun, 14 Apr 2013 10:33:39 +0400
> From: Николай Сущенко
> CC: 7781@debbugs.gnu.org
>
> Please send me this patch, I'll ask the hunspell developers to include it.

Attached.  This is a small part of a much larger patch, most of it for
Windows-specific problems.  If you have problems compiling the patched
hunspell, let me know: it could be that I omitted some hunk that is
needed for this part.

> Could you also recall which concrete problems produces this workaround?
> For me it works fine, but I haven't tested it in different languages and
> encodings.

One problem is that you assume the encoding of the communications with
hunspell is UTF-8, and thus matches the internal representation of
text in Emacs buffers and strings (only then will byte-to-position
give correct results).  But that assumption is false: hunspell
supports any encoding that it can convert to/from UTF-8 (it uses
libiconv internally).  The "usual" choice of the encoding is the one
used by the dictionary.  Not every dictionary out there is in UTF-8.

> If it is some problems, I could try to fix it

I don't think you can fix this on the Emacs side, because Emacs cannot
easily and/or quickly convert between bytes and characters in an
arbitrary multibyte encoding.  When I discovered this problem, I also
tried fixing it on the Emacs side first, but then I realized that this
kind of solution has too many problems, and instead fixed it in
hunspell.

--- src/tools/hunspell.cxx~0	2011-01-21 19:01:29.000000000 +0200
+++ src/tools/hunspell.cxx	2013-02-07 10:11:54.443610900 +0200
@@ -710,13 +748,22 @@ if (pos >= 0) {
 			fflush(stdout);
 		} else {
 			char ** wlst = NULL;
-			int ns = pMS[d]->suggest(&wlst, token);
+			int byte_offset = parser->get_tokenpos() + pos;
+			int char_offset = 0;
+			if (strcmp(io_enc, "UTF-8") == 0) {
+				for (int i = 0; i < byte_offset; i++) {
+					if ((buf[i] & 0xc0) != 0x80)
+						char_offset++;
+				}
+			} else {
+				char_offset = byte_offset;
+			}
+			int ns = pMS[d]->suggest(&wlst, chenc(token, io_enc, dic_enc[d]));
 			if (ns == 0) {
-		 		fprintf(stdout,"# %s %d", token,
-		 		 parser->get_tokenpos() + pos);
+		 		fprintf(stdout,"# %s %d", token, char_offset);
 			} else {
 				fprintf(stdout,"& %s %d %d: ", token, ns,
-				 parser->get_tokenpos() + pos);
+					char_offset);
 				fprintf(stdout,"%s", chenc(wlst[0], dic_enc[d], io_enc));
 			}
 			for (int j = 1; j < ns; j++) {
@@ -745,13 +792,23 @@ if (pos >= 0) {
 			if (root) free(root);
 		} else {
 			char ** wlst = NULL;
+			int byte_offset = parser->get_tokenpos() + pos;
+			int char_offset = 0;
+			if (strcmp(io_enc, "UTF-8") == 0) {
+				for (int i = 0; i < byte_offset; i++) {
+					if ((buf[i] & 0xc0) != 0x80)
+						char_offset++;
+				}
+			} else {
+				char_offset = byte_offset;
+			}
 			int ns = pMS[d]->suggest(&wlst, chenc(token, io_enc, dic_enc[d]));
 			if (ns == 0) {
 		 		fprintf(stdout,"# %s %d", chenc(token, io_enc, ui_enc),
-		 		 parser->get_tokenpos() + pos);
+		 		 char_offset);
 			} else {
 				fprintf(stdout,"& %s %d %d: ", chenc(token, io_enc, ui_enc), ns,
-				 parser->get_tokenpos() + pos);
+				 char_offset);
 				fprintf(stdout,"%s", chenc(wlst[0], dic_enc[d], ui_enc));
 			}
 			for (int j = 1; j < ns; j++) {
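
For readers who want to try the offset conversion outside hunspell, here
is a minimal standalone sketch of the same idea: count only the bytes
that are not UTF-8 continuation bytes (those of the form 10xxxxxx).  The
names utf8_char_offset and check_examples are hypothetical and are not
part of hunspell; the bounds check against the buffer size is an
addition of this sketch, and the input is assumed to be valid UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper, not part of hunspell: return the number of UTF-8
// characters that start before byte_offset in buf.  Continuation bytes
// (bit pattern 10xxxxxx) are skipped, so each character is counted once,
// at its lead byte.
std::size_t utf8_char_offset(const std::string& buf, std::size_t byte_offset) {
    std::size_t chars = 0;
    for (std::size_t i = 0; i < byte_offset && i < buf.size(); ++i) {
        // Cast to unsigned char so the mask behaves the same whether
        // plain char is signed or unsigned on this platform.
        if ((static_cast<unsigned char>(buf[i]) & 0xC0) != 0x80)
            ++chars;
    }
    return chars;
}

// A few hand-checked cases.
void check_examples() {
    // Pure ASCII: byte offsets and character offsets coincide.
    assert(utf8_char_offset("abc def", 4) == 4);
    // Two 2-byte Cyrillic letters, a space, then "x": "x" starts at
    // byte 5 but is the character at offset 3.
    assert(utf8_char_offset("\xD0\x9D\xD0\xB8 x", 5) == 3);
    // An offset past the end of the buffer is clamped to the buffer.
    assert(utf8_char_offset("ab", 10) == 2);
}
```

As in the patch, this only makes sense when the I/O encoding really is
UTF-8; for any other encoding the patch falls back to reporting the byte
offset unchanged.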