From: Agustin Martin
Newsgroups: gmane.emacs.devel
Subject: Re: Ispell and unibyte characters
Date: Fri, 13 Apr 2012 20:44:01 +0200
Message-ID: <20120413184401.GA17992@agmartin.aq.upm.es>
To: emacs-devel@gnu.org

On Fri, Apr 13, 2012 at 01:51:15PM -0400, Stefan Monnier wrote:
> > ("catala8"
> >  "[A-Za-z]" "[^A-Za-z]" "['\267-]" nil ("-B" "-d" "catalan") nil iso-8859-1)
>
> > Unless emacs knows the encoding for \267 (middledot "·") it cannot decode it
> > properly. I prefer to not use UTF-8 here, because I want the entry to also be
> > useful for ispell (and also be XEmacs compatible). The best approach here
> > seems to decode the otherchars regexp according to provided coding-system.
>
> There's something I don't understand here:
>
> If you want a middle dot, why don't you put a middle dot?
> I mean why write "['\267-]" rather than ['·-]?

The problem is that a dictionary alist can contain dictionaries with
different unibyte encodings. If two such chars happen to appear in
different encodings in the same file, I'd expect problems.

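To make the clash concrete, here is a minimal sketch, assuming a
multibyte Emacs (23 or later). The byte \244 and the two coding systems
are chosen purely for illustration and are not taken from any real
dictionary entry. The same byte decodes to different chars depending on
which coding system is assumed:

```elisp
;; Illustration only: byte 0xA4 is the currency sign in iso-8859-1
;; but the euro sign in iso-8859-15.
(let ((byte "\244"))                          ; unibyte string, one raw byte
  (list (multibyte-string-p byte)             ; nil: a byte, not yet a char
        (decode-coding-string byte 'iso-8859-1)     ; U+00A4
        (decode-coding-string byte 'iso-8859-15)))  ; U+20AC
```

A file mixing entries like these has no single encoding under which
every byte comes out as the intended char.
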
I really should have gone into more detail about the system where I
noticed this, even if it is a bit Debian specific.

I noticed this problem in the aspell Catalan entry provided by the Debian
aspell-ca package. In Debian, the alists for the different aspell (and
ispell and hunspell) dictionaries are created on dictionary installation
and stored in a file (for the curious,
/var/cache/dictionaries-common/emacsen-ispell-dicts.el).
Some maintainers provide \xxx, some provide explicit chars in different
encodings, and all that info is put together in dict alist form in that
file. So it cannot be loaded with a single encoding, only as 'raw-text,
and that implies loading it as bytes rather than as chars.

> I think this is related to your saying "I prefer to not use UTF-8 here",
> but again I don't know what you mean by "use UTF-8", because using
> a middle dot character in the source file does not imply using UTF-8
> anywhere (the file can be saved in any encoding that includes the
> middle dot).
>
> For me notations like \267 should be used exclusively to talk about
> *bytes*, not about *chars*. So it might make sense to use those for
> things like matching particular bytes in [ia]spell's output, but it
> makes no sense to match chars in the buffer being spell-checked since
> the buffer does not contain bytes but chars.

That is why I want to decode those bytes into actual chars to be used in
spellchecking, and make sure that they are decoded from the correct
coding-system. Otherwise, if the process coding-system is changed to UTF-8
and the otherchars stay as bytes matching the wrong encoding, things may
not work well.

If there is a consensus that I should not go the decode- way for
otherchars, I will not commit that part. For Debian I can simply keep
loading emacsen-ispell-dicts.el as raw-text and do the decode- processing
on its contents before they are passed to ispell.el through
`ispell-base-dicts-override-alist', so that the latter contains chars
rather than bytes.

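The decode- processing described above could look roughly like the
following. This is only a sketch under stated assumptions, not the code
proposed for ispell.el: the helper name `my-decode-otherchars' is
hypothetical, and entries are assumed to follow the usual
`ispell-dictionary-alist' layout, with OTHERCHARS in the fourth slot and
the coding system in the eighth.

```elisp
;; Hypothetical helper: decode the OTHERCHARS slot of each entry
;; using that entry's own coding system.
(defun my-decode-otherchars (dict-alist)
  "Return a copy of DICT-ALIST with unibyte OTHERCHARS decoded to chars.
Each entry is assumed to have the form
\(NAME CASECHARS NOT-CASECHARS OTHERCHARS MANY-OTHERCHARS-P
 ISPELL-ARGS EXTENDED-CHARACTER-MODE CHARACTER-SET)."
  (mapcar
   (lambda (entry)
     (let ((otherchars (nth 3 entry))
           (coding (nth 7 entry))
           (new (copy-sequence entry)))   ; leave the original entry alone
       ;; Only decode unibyte strings that come with a known coding
       ;; system; decoding a pure ASCII string changes nothing.
       (when (and (stringp otherchars)
                  (not (multibyte-string-p otherchars))
                  coding
                  (coding-system-p coding))
         (setcar (nthcdr 3 new)
                 (decode-coding-string otherchars coding)))
       new))
   dict-alist))

;; Usage with the Catalan entry quoted earlier in this message:
(my-decode-otherchars
 '(("catala8" "[A-Za-z]" "[^A-Za-z]" "['\267-]" nil
    ("-B" "-d" "catalan") nil iso-8859-1)))
```

Because each entry is decoded with its own coding system, entries in
different unibyte encodings can sit in the same raw-text file without
interfering with each other.
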
However, I think it is better to keep the decode- stuff for more general
use. I will wait at least a couple of days before committing, so that it
is clear what to do.

Thanks all for your comments,

-- 
Agustin