From mboxrd@z Thu Jan 1 00:00:00 1970
From: Agustin Martin
Newsgroups: gmane.emacs.bugs
Subject: bug#7781: 23.2.91; ispell problem with hunspell and UTF-8 file
Date: Fri, 7 Jan 2011 14:14:03 +0100
References: <87sjx9fula.fsf@sc3d.org>
In-Reply-To: <87sjx9fula.fsf@sc3d.org>
To: Reuben Thomas
Cc: 7781@debbugs.gnu.org
X-GNU-PR-Message: followup 7781
X-GNU-PR-Package: emacs
List-Id: "Bug reports for GNU Emacs, the Swiss army knife of text editors"
Xref: news.gmane.org gmane.emacs.bugs:43179

2011/1/4 Reuben Thomas:
> With the following text, and using emacs -Q, I get the errors you can
> see in the messages log below when using hunspell to spell-check a UTF-8
> buffer with some extended characters in it.
>
> I did test this with emacs -Q, but the current session, in which I
> reproduced the problem and am now composing this bug report, was not
> started with -Q (this is so submitting the bug report works properly!).
>
> I am running a freshly bzr-pulled build of the emacs-23 branch.

Hi, Reuben,

I can also reproduce this with emacs23.2. After splitting the original
lines, I could narrow the problems down to two lines,

-- Cut here -- 8< ----- minimal.txt: utf-8
of out-of-copyright works. The Kindle may be a loss leader, but at £109
it’s still not cheap. Feedbooks, rather than integrating easily into
-- Cut here -- 8< ----- End of minimal.txt

In the first line, the currency symbol seems to cause conversion errors
when iso-8859-1 is used, although hunspell should have ignored it. I get
tons of

UTF-8 encoding error. Missing continuation byte in 0.
character position:

for that line when using

$ cat minimal.txt | hunspell -d en_US -a -i iso-8859-1

In the second line, the unusual apostrophe seems to confuse hunspell
when utf8 is used. Comparing what aspell and hunspell give for the same
text, I get

$ cat minimal.txt | aspell --encoding=utf-8 -d en_US -a
& Feedbooks 6 22: Feed books, Feed-books, Feedback's, Feedbags, ...

$ cat minimal.txt | hunspell -d en_US -i utf-8 -a
& Feedbooks 8 24: Feed books, Feed-books, Feedback, Feedbags, ...

Do not worry about the first number; it is the number of suggestions.
However, the position given by the second number differs. It seems that
hunspell is not counting that apostrophe as a single (multibyte)
character, but as three separate bytes.

This looks to me like a hunspell bug. I found no reference to this
problem on the hunspell sf site, but noticed that Hunspell 1.2.14 was
released yesterday. I need to check whether it has any related news.

-- 
Agustin
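
Note on the two offsets (22 vs 24): the difference can be reproduced
outside either spell checker. The sketch below is only an illustration,
not taken from hunspell or aspell; it assumes that the second line of
minimal.txt starts at "it’s" and that the apostrophe is U+2019 (the
windows-1252 byte 0x92). The names `line`, `char_offset` and
`byte_offset` are made up for this example.

```python
# Second line of minimal.txt as assumed above; \u2019 is the curly apostrophe.
line = "it\u2019s still not cheap. Feedbooks, rather than integrating easily into"

# Offset of the misspelled word counted in characters.
char_offset = line.index("Feedbooks")

# Offset of the same word counted in UTF-8 bytes.
byte_offset = len(line[:char_offset].encode("utf-8"))

print(char_offset)                     # 22
print(byte_offset)                     # 24
print(len("\u2019".encode("utf-8")))   # 3 bytes for one character
```

The character count matches the aspell position and the byte count
matches the hunspell position, which is consistent with hunspell
reporting byte offsets for this line.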