From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: Agustin Martin <agustin.martin@hispalinux.es>
Newsgroups: gmane.emacs.bugs
Subject: bug#7668: ispell and dictionary encodings
Date: Mon, 20 Dec 2010 12:31:48 +0100
Message-ID: <20101220113148.GA12469@agmartin.aq.upm.es>
References: <AANLkTin26ZJupakNsWxgteT9A4TOGCZQtAc=H6OTGG7-@mail.gmail.com>
NNTP-Posting-Host: lo.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Trace: dough.gmane.org 1292846015 30023 80.91.229.12 (20 Dec 2010 11:53:35 GMT)
X-Complaints-To: usenet@dough.gmane.org
NNTP-Posting-Date: Mon, 20 Dec 2010 11:53:35 +0000 (UTC)
To: Reuben Thomas <rrt@sc3d.org>, 7668@debbugs.gnu.org
Original-X-From: bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane.org@gnu.org Mon Dec 20 12:53:31 2010
Return-path: <bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane.org@gnu.org>
Envelope-to: geb-bug-gnu-emacs@m.gmane.org
Original-Received: from lists.gnu.org ([199.232.76.165])
	by lo.gmane.org with esmtp (Exim 4.69)
	(envelope-from <bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane.org@gnu.org>)
	id 1PUeIz-00007V-BI
	for geb-bug-gnu-emacs@m.gmane.org; Mon, 20 Dec 2010 12:53:30 +0100
Original-Received: from localhost ([127.0.0.1]:59646 helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43)
	id 1PUeIo-00036Q-FM
	for geb-bug-gnu-emacs@m.gmane.org; Mon, 20 Dec 2010 06:53:18 -0500
Original-Received: from [140.186.70.92] (port=34241 helo=eggs.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43) id 1PUeIX-00035d-JU
	for bug-gnu-emacs@gnu.org; Mon, 20 Dec 2010 06:53:15 -0500
Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <Debian-debbugs@debbugs.gnu.org>) id 1PUeIO-0001ZK-K9
	for bug-gnu-emacs@gnu.org; Mon, 20 Dec 2010 06:53:01 -0500
Original-Received: from debbugs.gnu.org ([140.186.70.43]:36363)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <Debian-debbugs@debbugs.gnu.org>) id 1PUeIO-0001Z4-IG
	for bug-gnu-emacs@gnu.org; Mon, 20 Dec 2010 06:52:52 -0500
Original-Received: from Debian-debbugs by debbugs.gnu.org with local (Exim 4.69)
	(envelope-from <Debian-debbugs@debbugs.gnu.org>)
	id 1PUdsQ-0000iC-3I; Mon, 20 Dec 2010 06:26:02 -0500
X-Loop: help-debbugs@gnu.org
Resent-From: Agustin Martin <agustin.martin@hispalinux.es>
Original-Sender: debbugs-submit-bounces@debbugs.gnu.org
Resent-To: owner@debbugs.gnu.org
Resent-CC: bug-gnu-emacs@gnu.org
Resent-Date: Mon, 20 Dec 2010 11:26:02 +0000
Resent-Message-ID: <handler.7668.B7668.12928443252688@debbugs.gnu.org>
Resent-Sender: help-debbugs@gnu.org
X-GNU-PR-Message: followup 7668
X-GNU-PR-Package: emacs
X-GNU-PR-Keywords: 
Original-Received: via spool by 7668-submit@debbugs.gnu.org id=B7668.12928443252688
	(code B ref 7668); Mon, 20 Dec 2010 11:26:02 +0000
Original-Received: (at 7668) by debbugs.gnu.org; 20 Dec 2010 11:25:25 +0000
Original-Received: from localhost ([127.0.0.1] helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.69)
	(envelope-from <debbugs-submit-bounces@debbugs.gnu.org>)
	id 1PUdro-0000hI-Ea
	for submit@debbugs.gnu.org; Mon, 20 Dec 2010 06:25:24 -0500
Original-Received: from edison.ccupm.upm.es ([138.100.198.71] helo=smtp.upm.es)
	by debbugs.gnu.org with esmtp (Exim 4.69)
	(envelope-from <agustin.martin@upm.es>) id 1PUdrl-0000h4-Il
	for 7668@debbugs.gnu.org; Mon, 20 Dec 2010 06:25:22 -0500
Original-Received: from agmartin.aq.upm.es (Agmartin.aq.upm.es [138.100.41.131])
	by smtp.upm.es (8.14.3/8.14.3/edison-001) with ESMTP id oBKBVmUE016148; 
	Mon, 20 Dec 2010 12:31:48 +0100
Original-Received: by agmartin.aq.upm.es (Postfix, from userid 1000)
	id B887682365; Mon, 20 Dec 2010 12:31:48 +0100 (CET)
Content-Disposition: inline
In-Reply-To: <AANLkTin26ZJupakNsWxgteT9A4TOGCZQtAc=H6OTGG7-@mail.gmail.com>
User-Agent: Mutt/1.5.20 (2009-06-14)
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.11
Precedence: list
Resent-Date: Mon, 20 Dec 2010 06:26:02 -0500
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6 (newer, 3)
X-BeenThere: bug-gnu-emacs@gnu.org
List-Id: "Bug reports for GNU Emacs,
	the Swiss army knife of text editors" <bug-gnu-emacs.gnu.org>
List-Unsubscribe: <http://lists.gnu.org/mailman/listinfo/bug-gnu-emacs>,
	<mailto:bug-gnu-emacs-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/archive/html/bug-gnu-emacs>
List-Post: <mailto:bug-gnu-emacs@gnu.org>
List-Help: <mailto:bug-gnu-emacs-request@gnu.org?subject=help>
List-Subscribe: <http://lists.gnu.org/mailman/listinfo/bug-gnu-emacs>,
	<mailto:bug-gnu-emacs-request@gnu.org?subject=subscribe>
Original-Sender: bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane.org@gnu.org
Errors-To: bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane.org@gnu.org
Xref: news.gmane.org gmane.emacs.bugs:42660
Archived-At: <http://permalink.gmane.org/gmane.emacs.bugs/42660>

On Fri, Dec 17, 2010 at 06:30:14PM +0000, Reuben Thomas wrote:
> I've just been puzzling my way through ispell.gz's dictionary encoding
> code, after switching from aspell to hunspell in order to be able to
> treat Unicode curly single quotes as normal intraword punctuation
> (which it seems aspell cannot be persuaded to do, but that's another
> story).
> 
> I noticed a feature of ispell-dictionary-base-alist, which I don't
> understand: the last (7th) element of each dictionary definition is
> called "Coding System", which seems to be the coding system of the
> case character and non-case-character strings, but it is also passed
> to the spelling program as the input encoding, which is wrong, since
> the input encoding depends on the file to be checked.

That element is the coding system that will be used for communication
with the dictionary. The case-character and non-case-character strings
should be in that same encoding.

> I currently use the classic workaround of making up my own dictionary
> definition which includes accented characters that I want to be able
> to use in words (which is necessary anyway), and which specifies utf-8
> as the coding system. This only works because I use utf-8 for all my
> text files.

If you are not going to use XEmacs, but only FSF Emacs, just use [:alpha:]
for the case-character and non-case-character strings along with utf-8. That
is already done automatically for aspell dictionaries, where it is easy to get
a list of installed dictionaries and additional info.
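
For illustration only, a hunspell entry along those lines might look like
the sketch below. The dictionary name "en_GB-utf8" and the "-d" "en_GB"
arguments are assumptions made up for the example, so adjust them to
whatever is installed on your system. The field order follows the
docstring of `ispell-local-dictionary-alist'.

```elisp
(require 'ispell)

;; Hypothetical entry: the name and the hunspell arguments are assumptions.
(add-to-list 'ispell-local-dictionary-alist
             '("en_GB-utf8"        ; DICTIONARY-NAME (made up for this example)
               "[[:alpha:]]"       ; CASECHARS: any alphabetic character
               "[^[:alpha:]]"      ; NOT-CASECHARS: anything else
               "[']"               ; OTHERCHARS: allowed inside words
               nil                 ; MANY-OTHERCHARS-P
               ("-d" "en_GB")      ; ISPELL-ARGS passed to the spellchecker
               nil                 ; EXTENDED-CHARACTER-MODE
               utf-8))             ; CHARACTER-SET used to talk to the spellchecker
```

With something like that in place, the entry should be selectable by name
through `ispell-change-dictionary'.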

> It seems, therefore, that the argument to follow
> ispell-encoding8-command (which itself is mis-documented:
> 
> Command line option prefix to select UTF-8 if supported, nil otherwise.
> If UTF-8 if supported by spellchecker and is selectable from the command line
> this variable will contain \"--encoding=\" for aspell and \"-i \" for hunspell,
> so UTF-8 or other mime charsets can be selected.  That will be set for hunspell
> >=1.1.6 or aspell >= 0.60 in `ispell-check-version'.
> 
> It is not just for selecting UTF-8; indeed, that's the irony: in the
> default configuration it's used mostly to select 8-bit character sets!
> And there are one or two other typos. How about (suitably rewrapped):
> 
> Command line option prefix to select coding system if supported, nil otherwise.
> If the coding system is selectable from the command line
> this variable will contain \"--encoding=\" for aspell and \"-i \" for hunspell,
> so that the input encoding can be selected.  That will be set for hunspell
> >= 1.1.6 or aspell >= 0.60 in `ispell-check-version'.

Agreed, thanks

> Then, the following code in ispell-start-process:
> 
>     ;; If we are using recent aspell or hunspell, make sure we use the
>     ;; right encoding for communication. ispell or older aspell/hunspell
>     ;; does not support this
>     (if ispell-encoding8-command
> 	(setq args
> 	      (append args
> 		      (list
> 		       (concat ispell-encoding8-command
> 			       (symbol-name (ispell-get-coding-system)))))))
> 
> needs fixing: rather than using ispell-get-coding-system, it should
> use a prefix of buffer-file-coding-system (without the suffix that
> specifies the line ending).

No, the current code is correct. It tells the spellchecker that
communication with the dictionary will be done in the
(ispell-get-coding-system) coding system. ispell.el does the internal
conversions needed for that in a different place, so everything is
transparent to the user.

> I'm sure I'm missing things here, but if what I've said above makes
> any sense, I'd like to help refine it into a sensible proposal to
> improve ispell.el.

Thanks for looking into this. I will prepare a change with the
`ispell-encoding8-command' documentation fix.

Regards,

-- 
Agustin