From: Agustin Martin
To: emacs-devel@gnu.org
Newsgroups: gmane.emacs.devel
Subject: Re: Ispell and unibyte characters
Date: Fri, 13 Apr 2012 18:38:23 +0200
Message-ID: <20120413163823.GA26947@agmartin.aq.upm.es>
In-Reply-To: <83ehrr39wq.fsf@gnu.org>

On Fri, Apr 13, 2012 at 06:53:57PM +0300, Eli Zaretskii wrote:
> > Date: Fri, 13 Apr 2012 17:25:25 +0200
> > From: Agustin Martin
> >
> > > I don't understand what are you trying to accomplish by encoding
> > > OTHERCHARS in UTF-8.  What exactly is the problem with them being
> > > encoded in some 8-bit encoding?  Please explain.
> >
> > Imagine a fake entry in the general list, either in ispell.el or provided
> > through `ispell-base-dicts-override-alist' (no accented chars for simplicity)
> >
> > ("catala8"
> >  "[A-Za-z]" "[^A-Za-z]" "['\267-]" nil ("-B" "-d" "catalan") nil iso-8859-1)
> >
> > Unless emacs knows the encoding for \267 (middledot "·") it cannot decode it
> > properly. I prefer to not use UTF-8 here, because I want the entry to also be
> > useful for ispell (and also be XEmacs compatible).
> > The best approach here seems to decode the otherchars regexp according
> > to provided coding-system.
> >
> > I have noticed that there seems to be no need to encode the resulting string
> > in UTF-8, Emacs will know what to do with the decoded string.
> >
> > I tested something like
> >
> > (dolist (adict ispell-dictionary-alist)
> >   (add-to-list 'tmp-dicts-alist
> >     (list
> >       (nth 0 adict)   ; dict name
> >       "[[:alpha:]]"   ; casechars
> >       "[^[:alpha:]]"  ; not-casechars
> >       (if ispell-encoding8-command
> >           ;; Decode 8bit otherchars if needed
> >           (decode-coding-string (nth 3 adict) (nth 7 adict))
> >         (nth 3 adict))  ; otherchars
> >       (nth 4 adict)   ; many-otherchars-p
> >       (nth 5 adict)   ; ispell-args
> >       (nth 6 adict)   ; extended-character-mode
> >       (if ispell-encoding8-command
> >           'utf-8
> >         (nth 7 adict)))))
> >
> > and seems to work well.
>
> So you are taking the Catalan dictionary spec written for Ispell and
> convert it to a spec that could be used to support more characters by
> using UTF-8, is that right?  If so, I find this a bit kludgey.

I see it differently. I like the above approach because I find it much
more versatile for general definitions. This is not a matter of blindly
reusing ispell entries. In particular, I noticed this problem in Debian
with the Catalan spec written for aspell (automatically generated from
info provided by the aspell-ca package). That info is written that way so
that it is also useful for XEmacs, but with the above post-processing it
can work much better in Emacs.

> How
> about having a completely separate spec instead?  More generally, why
> not separate ispell-dictionary-alist into 2 alists, one to be used
> with Ispell, the other to be used with aspell and hunspell?  I think
> this would be cleaner, don't you agree?

As a matter of fact, that is what we do in Debian, using info provided by
the ispell, aspell and hunspell dictionary maintainers.
The difference is that the provided info is supposed to be valid for both
Emacs and XEmacs, so I find post-processing as above very useful, because
it helps to get the best out of that info for Emacs.

The global dicts alist is built from

  (dolist (dict (append found-dicts-alist
                        ispell-base-dicts-override-alist
                        ispell-dictionary-base-alist))

where the first entry found wins. `found-dicts-alist' holds the result of
the automatic search (currently used only for aspell) and has the highest
priority; `ispell-dictionary-base-alist' is the fallback alist, with the
lowest priority. Depending on the spellchecker,
`ispell-base-dicts-override-alist' is set to an alist corresponding to
ispell, aspell or hunspell dictionaries (they are handled independently).

I do not think that maintaining separate hardcoded dict lists in ispell.el
for ispell, aspell and hunspell is worth it.

For hunspell, in the future I'd go for some sort of parsing mechanism like
the current one for aspell.

-- 
Agustin
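
For illustration, the two mechanisms discussed in this message (decoding
OTHERCHARS with the entry's own coding system, and merging alists so that
the first entry found wins) could be sketched roughly as below. This is
only a sketch under assumptions: `my-decode-otherchars' and
`my-first-wins' are hypothetical helper names, not functions in ispell.el,
and the entry is the fake "catala8" one from the quoted example.

```elisp
;; Sketch only.  `my-decode-otherchars' and `my-first-wins' are
;; hypothetical names used for illustration; they are not part of ispell.el.

(defun my-decode-otherchars (entry)
  "Return OTHERCHARS of dictionary ENTRY, decoded with ENTRY's coding system.
OTHERCHARS is element 3 of ENTRY and the coding system is element 7.
If ENTRY has no coding system, return OTHERCHARS unchanged."
  (let ((otherchars (nth 3 entry))
        (coding (nth 7 entry)))
    (if coding
        (decode-coding-string otherchars coding)
      otherchars)))

(defun my-first-wins (&rest alists)
  "Merge dictionary ALISTS in order; the first entry for each name wins."
  (let (result)
    (dolist (dict (apply #'append alists))
      ;; Keep DICT only if no earlier alist already defined that name.
      (unless (assoc (car dict) result)
        (push dict result)))
    (nreverse result)))

;; The fake entry from the discussion: \267 is a raw 8-bit byte that only
;; becomes a middle dot once it is decoded as iso-8859-1.
(defvar my-catala8-entry
  '("catala8" "[A-Za-z]" "[^A-Za-z]" "['\267-]" nil
    ("-B" "-d" "catalan") nil iso-8859-1))

(my-decode-otherchars my-catala8-entry)
```

Evaluating the last form should give a multibyte string whose third
character is the middle dot, which Emacs can then use directly without a
further re-encoding to UTF-8. Passing the alists to `my-first-wins' in the
order found, override, base mirrors the priorities described above.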