From: Agustin Martin
Newsgroups: gmane.emacs.devel
Subject: Re: Ispell and unibyte characters
Date: Sun, 15 Apr 2012 02:02:11 +0200
To: emacs-devel@gnu.org
Xref: news.gmane.org gmane.emacs.devel:149669

On 14 April 2012 at 03:57, Stefan Monnier wrote:
>>> If you want a middle dot, why don't you put a middle dot?
>>> I mean why write "['\267-]" rather than ['·-]?
>> The problem is that in a dictionary alist you can have dictionaries with
>> different unibyte encodings, if you happen to have two of that chars in
>> different encodings I'd expect problems.
>
> I still don't understand.  Can you be more specific?
Imagine a Catalan dictionary with iso-8859-1 "·" in otherchars, and another
dictionary with a different upper char in otherchars but in a different
encoding (e.g., koi8r). (I am guessing at the general case here; I do not
actually have a real example other than our Debian file with all the info
put together.) The only way to have both coexist as chars in the same file
is to use multibyte UTF-8 chars instead of mixed unibyte iso-8859-1 and
koi8r, so that Emacs gets the chars right when reading the file (provided it
guesses the file coding-system properly). XEmacs seems to be a bit more
tricky regarding UTF-8, but I'd expect things to work once proper decoding
is done. The traditional approach is to use octal codes to represent the
bytes that match the char in the dict's declared charset.

Using UTF-8 is actually what eliz proposed at the beginning of this thread.
While one of the ispell.el docstrings claims that only unibyte chars can be
put here, I'd expect this to work, at least for Emacs. As a matter of fact,
when I was first trying the (encode- (decode ..)) way I actually got UTF-8
(which was decoded again by `ispell-get-otherchars' according to the new
'utf-8 coding-system), and it seemed to work for Emacs (apart from the
psgml/sgml-lexical-context problem). At that time I had not noticed that
once Emacs loads something as a char, encodings only matter when writing it.
(Yes, I am really learning all this encode-* decode-* stuff in more depth in
this thread.) I'd however use this only in personal ~/.emacs files, and only
if needed.

>>> For me notations like \267 should be used exclusively to talk about
>>> *bytes*, not about *chars*.  So it might make sense to use those for
>>> things like matching particular bytes in [ia]spell's output, but it
>>> makes no sense to match chars in the buffer being spell-checked since
>>> the buffer does not contain bytes but chars.
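To make the bytes-versus-chars point concrete, here is a minimal sketch of why the octal escape \267 is ambiguous until a charset is attached to it. It is written in Python only as a neutral illustration, not as Emacs code, and the variable names are made up for this example.

```python
# The single byte written as \267 in octal (0xB7 in hex).
raw = b"\267"

# Interpreted with the Catalan dictionary's declared charset it is the
# middle dot; interpreted as koi8-r it is some other character.
as_latin1 = raw.decode("iso-8859-1")
as_koi8r = raw.decode("koi8-r")

# As a decoded character the middle dot is unambiguous. Encoded as
# UTF-8 it takes two bytes, so it is no longer a "unibyte" char.
middle_dot = "\u00b7"
utf8_bytes = middle_dot.encode("utf-8")

print(repr(as_latin1), repr(as_koi8r), utf8_bytes)
```

The byte alone does not say which char is meant; the byte plus the dictionary's declared charset does.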
>> That is why I want to decode those bytes into actual chars to be used in
>
> If I understand correctly what you mean by "those bytes", then using "·"
> instead of "\267" gives you the decoded form right away without having
> to do extra work.

That is true for files with a single encoding. However, the problem appears
when a file has mixed encodings, as in the Debian example I mentioned. I
know this will not happen in real manually edited files, but it can happen,
and does happen, in aggregates like the one I mentioned. If the file is
loaded with a given coding-system-for-read, chars in that coding-system will
be properly interpreted by Emacs when reading, but the others will not.
Something like that happened with iso-8859-1/iso-8859-15 chars in
http://bugs.debian.org/337214, and the simple way to avoid the mess was to
read the file as 'raw-text. That indeed reads upper chars as pure bytes,
even though they were originally written as chars (I mean, not through octal
codes), with no implicit on-the-fly "decoding/interpretation" at all.

This is not a big problem: we know the encoding for every single dict, so
things can be properly decoded (\xxx + coding-system gives a char). If we
later change the default encoding used for communication for entries in that
file, we need to decode the bytes obtained from the 'raw-text read into
actual chars, so that each is handled internally as the desired char.
Changing them to UTF-8 as well (and expecting ispell-get-otherchars to
decode again to char) seems to work in Emacs, but also seems absolutely
unnecessary.

I am getting more and more convinced that this is a Debian-only problem
caused by the way we create that file, so I should handle this special case
as Debian-only and do the needed decoding there, not in ispell.el.

-- 
Agustin
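The "\xxx + coding-system gives a char" step can be sketched as follows. This is only an illustration in Python under stated assumptions, not the Debian or ispell.el code: the entry layout, the dictionary names, the koi8-r otherchars value and the function `decode_otherchars` are all hypothetical.

```python
# Hypothetical aggregate read as raw bytes: every entry carries its
# otherchars untouched, together with the charset declared for it.
entries = [
    ("catalan", "iso-8859-1", b"'\xb7-"),
    ("some-koi8r-dict", "koi8-r", b"\xc1"),  # made-up otherchars
]


def decode_otherchars(entries):
    """Decode each entry's raw bytes with that entry's own charset."""
    return {name: raw.decode(charset) for name, charset, raw in entries}


decoded = decode_otherchars(entries)

# For contrast: forcing one charset on the whole aggregate, which is
# what a single coding-system-for-read amounts to.
forced_latin1 = {name: raw.decode("iso-8859-1") for name, _, raw in entries}

print(decoded)
print(forced_latin1)
```

With per-entry decoding both dictionaries end up with the chars they declared; with one forced charset only the entries that happen to use it come out right.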