From mboxrd@z Thu Jan 1 00:00:00 1970
From: Agustin Martin
Newsgroups: gmane.emacs.bugs
Subject: bug#7668: ispell and dictionary encodings
Date: Mon, 20 Dec 2010 12:31:48 +0100
Message-ID: <20101220113148.GA12469@agmartin.aq.upm.es>
To: Reuben Thomas, 7668@debbugs.gnu.org

On Fri, Dec 17, 2010 at 06:30:14PM +0000, Reuben Thomas wrote:
> I've just been puzzling my way through ispell.el's dictionary encoding
> code, after
> switching from aspell to hunspell in order to be able to
> treat Unicode curly single quotes as normal intraword punctuation
> (which it seems aspell cannot be persuaded to do, but that's another
> story).
>
> I noticed a feature of ispell-dictionary-base-alist, which I don't
> understand: the last (7th) element of each dictionary definition is
> called "Coding System", which seems to be the coding system of the
> case character and non-case-character strings, but it is also passed
> to the spelling program as the input encoding, which is wrong, since
> the input encoding depends on the file to be checked.

That element represents the encoding that will be used for communication
with the dictionary. The case-character and non-case-character strings
should be in that same encoding.

> I currently use the classic workaround of making up my own dictionary
> definition which includes accented characters that I want to be able
> to use in words (which is necessary anyway), and which specifies utf-8
> as the coding system. This only works because I use utf-8 for all my
> text files.

If you are not going to use XEmacs, but only FSF Emacs, just use
[:alpha:] for the case-character and non-case-character strings along
with utf-8. That is already done automatically for aspell dictionaries,
where it is easy to get a list of installed dictionaries and additional
info.

> It seems, therefore, that the argument to follow
> ispell-encoding8-command (which itself is mis-documented:
>
> Command line option prefix to select UTF-8 if supported, nil otherwise.
> If UTF-8 if supported by spellchecker and is selectable from the command line
> this variable will contain \"--encoding=\" for aspell and \"-i \" for hunspell,
> so UTF-8 or other mime charsets can be selected. That will be set for hunspell
> >=1.1.6 or aspell >= 0.60 in `ispell-check-version'.
>
> It is not just for selecting UTF-8; indeed, that's the irony: in the
> default configuration it's used mostly to select 8-bit character sets!
> And there are one or two other typos. How about (suitably rewrapped):
>
> Command line option prefix to select coding system if supported, nil otherwise.
> If the coding system is selectable from the command line
> this variable will contain \"--encoding=\" for aspell and \"-i \" for hunspell,
> so that the input encoding can be selected. That will be set for hunspell
> >= 1.1.6 or aspell >= 0.60 in `ispell-check-version'.

Agreed, thanks.

> Then, the following code in ispell-start-process:
>
> ;; If we are using recent aspell or hunspell, make sure we use the right encoding
> ;; for communication. ispell or older aspell/hunspell does not support this
> (if ispell-encoding8-command
>     (setq args
>           (append args
>                   (list
>                    (concat ispell-encoding8-command
>                            (symbol-name (ispell-get-coding-system)))))))
>
> needs fixing: rather than using ispell-get-coding-system, it should
> use a prefix of buffer-file-coding-system (without the suffix that
> specifies the line ending).

No, the current code is correct. It tells the spellchecker that
communication with the dictionary will be done in the
(ispell-get-coding-system) coding system. ispell.el does the internal
conversions needed for that in a different place, so everything is
transparent to the user.

> I'm sure I'm missing things here, but if what I've said above makes
> any sense, I'd like to help refine it into a sensible proposal to
> improve ispell.el.

Thanks for looking into this. I will prepare a change with the
`ispell-encoding8-command' documentation fix.

Regards,

-- 
Agustin
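
For reference, the [:alpha:] suggestion above could look roughly like the
following entry. This is only a sketch under assumptions: the dictionary
name "english-utf8", the "-d" "en_GB" arguments and the choice of
OTHERCHARS are made up for illustration and are not taken from this
thread. The eight fields follow the layout documented for
`ispell-dictionary-alist'.

```elisp
;; Hypothetical dictionary entry for FSF Emacs with hunspell.
;; The name and the "-d" argument are assumptions; adjust to the
;; dictionaries actually installed.
(add-to-list 'ispell-local-dictionary-alist
             '("english-utf8"   ; DICTIONARY-NAME (made up for this sketch)
               "[[:alpha:]]"    ; CASECHARS: any alphabetic character
               "[^[:alpha:]]"   ; NOT-CASECHARS: everything else
               "['’]"           ; OTHERCHARS: straight and curly apostrophe
               nil              ; MANY-OTHERCHARS-P
               ("-d" "en_GB")   ; ISPELL-ARGS passed to the spellchecker
               nil              ; EXTENDED-CHARACTER-MODE
               utf-8))          ; CHARACTER-SET used to talk to the spellchecker
```

With an entry like this, the last field is what (ispell-get-coding-system)
would return for that dictionary, independently of the coding system of
the buffer being checked.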