From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: Agustin Martin <agustin.martin@hispalinux.es>
Newsgroups: gmane.emacs.devel
Subject: Re: Ispell and unibyte characters
Date: Sun, 15 Apr 2012 02:02:11 +0200
Message-ID: <CAKy3oZr6KYDfz6Vk-VA8xHbLvbLfGU5HcJiRtaGMy0tq82fgEw@mail.gmail.com>
References: <83aa3f2hgh.fsf@gnu.org>
	<20120326173912.GA22306@agmartin.aq.upm.es>
	<E1SCGD0-0001Dm-Tu@fencepost.gnu.org>
	<20120328191821.GA6266@agmartin.aq.upm.es>
	<20120410190803.GA13517@agmartin.aq.upm.es>
	<83ty0r5rmd.fsf@gnu.org>
	<20120412143657.GA18352@agmartin.aq.upm.es>
	<83d37c4vw5.fsf@gnu.org>
	<20120413152525.GA14949@agmartin.aq.upm.es>
	<jwvd37b34og.fsf-monnier+emacs@gnu.org>
	<20120413184401.GA17992@agmartin.aq.upm.es>
	<jwvehrr13kp.fsf-monnier+emacs@gnu.org>
NNTP-Posting-Host: plane.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
X-Trace: dough.gmane.org 1334448146 23552 80.91.229.3 (15 Apr 2012 00:02:26 GMT)
X-Complaints-To: usenet@dough.gmane.org
NNTP-Posting-Date: Sun, 15 Apr 2012 00:02:26 +0000 (UTC)
To: emacs-devel@gnu.org
Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Sun Apr 15 02:02:25 2012
Return-path: <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>
Envelope-to: ged-emacs-devel@m.gmane.org
Original-Received: from lists.gnu.org ([208.118.235.17])
	by plane.gmane.org with esmtp (Exim 4.69)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>)
	id 1SJCv8-0004FN-Np
	for ged-emacs-devel@m.gmane.org; Sun, 15 Apr 2012 02:02:22 +0200
Original-Received: from localhost ([::1]:49356 helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>)
	id 1SJCv7-0003SV-NU
	for ged-emacs-devel@m.gmane.org; Sat, 14 Apr 2012 20:02:21 -0400
Original-Received: from eggs.gnu.org ([208.118.235.92]:56297)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <agustimartin@gmail.com>) id 1SJCv3-0003SC-PE
	for emacs-devel@gnu.org; Sat, 14 Apr 2012 20:02:19 -0400
Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <agustimartin@gmail.com>) id 1SJCv1-0007vY-JT
	for emacs-devel@gnu.org; Sat, 14 Apr 2012 20:02:17 -0400
Original-Received: from mail-pb0-f41.google.com ([209.85.160.41]:48253)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <agustimartin@gmail.com>) id 1SJCv1-0007vL-7L
	for emacs-devel@gnu.org; Sat, 14 Apr 2012 20:02:15 -0400
Original-Received: by pbcup15 with SMTP id up15so5148841pbc.0
	for <emacs-devel@gnu.org>; Sat, 14 Apr 2012 17:02:11 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
	h=mime-version:sender:in-reply-to:references:date
	:x-google-sender-auth:message-id:subject:from:to:content-type
	:content-transfer-encoding;
	bh=1xoRItUbTAAi3i23SILR+SIB2lKKq9MX9uHa3X7tz+0=;
	b=lE9919dUTK+o6JZm5IwLhkibacmnh1tKTPxG+vZ6NuFkMLd8XSm9LDz+wdvUWXW9Nx
	qz2BzMTYX0CwMqjKWSdicBnX8YYi3GYt07Dxh0NX4WlY5i1XWBaQYEr7kOvNlfswdakU
	lZaBIUp0/d9P/VVUpcT2cB1dN4OsevPHdQst061wWhpeUAEGgGdc+h9w9EZTNSYTtzV3
	XA/RH9oF7Psm9ZjPRCqbAPDHmwVpL5VAogoNhLohvSW4RzStmZlklvyg74dMIl0RKQFU
	Y/KMP4ztsFvbbEXxY/zUccMruyD06I+D6djbDGVVfD/AFNUqY+lq+tYv/4DNELm1v9cZ
	x4ug==
Original-Received: by 10.68.213.104 with SMTP id nr8mr10765363pbc.91.1334448131771;
	Sat, 14 Apr 2012 17:02:11 -0700 (PDT)
Original-Received: by 10.68.6.164 with HTTP; Sat, 14 Apr 2012 17:02:11 -0700 (PDT)
In-Reply-To: <jwvehrr13kp.fsf-monnier+emacs@gnu.org>
X-Google-Sender-Auth: vwqFk4ijos4z_F6ecJ4HyXFzYNI
X-detected-operating-system: by eggs.gnu.org: Genre and OS details not
	recognized.
X-Received-From: 209.85.160.41
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: "Emacs development discussions." <emacs-devel.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/archive/html/emacs-devel>
List-Post: <mailto:emacs-devel@gnu.org>
List-Help: <mailto:emacs-devel-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=subscribe>
Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Xref: news.gmane.org gmane.emacs.devel:149669
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/149669>

On 14 April 2012 03:57, Stefan Monnier
<monnier@iro.umontreal.ca> wrote:
>>> If you want a middle dot, why don't you put a middle dot?
>>> I mean why write "['\267-]" rather than ['·-]?
>> The problem is that in a dictionary alist you can have dictionaries with
>> different unibyte encodings, if you happen to have two of that chars in
>> different encodings I'd expect problems.
>
> I still don't understand.  Can you be more specific?

Imagine a Catalan dictionary with an iso-8859-1 "·" in otherchars, and
another dictionary with a different upper char in otherchars, but in a
different encoding (e.g., koi8r). I am guessing here so as to be more
general; I do not actually have a real example other than our Debian
file with all the info put together.

The only way to have both coexist as chars in the same file is to use
multibyte UTF-8 chars instead of mixed unibyte iso-8859-1 and koi8r,
so that Emacs gets the chars right when reading the file (provided it
guesses the file coding-system correctly). XEmacs seems to be a bit
more tricky regarding UTF-8, but I'd expect things to work once proper
decoding is done. The traditional alternative is to use octal codes to
represent the bytes that match the char in the charset declared for
the dict.
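
As a rough illustration of the bytes-versus-chars point outside Emacs,
here is a minimal Python sketch. Python is used only for illustration;
nothing here is part of ispell.el, and the variable names are made up.
It shows that the single byte written as \267 becomes a different char
depending on the charset used to decode it, while UTF-8 can hold both
resulting chars in one file.

```python
# Hypothetical illustration; nothing here comes from ispell.el.
raw = b"\267"  # one byte, written with an octal escape (0xB7)

as_latin1 = raw.decode("iso-8859-1")  # MIDDLE DOT, U+00B7
as_koi8r = raw.decode("koi8_r")       # a different char under KOI8-R

print(repr(as_latin1), repr(as_koi8r))

# The same byte gives two different chars, so an aggregate file that
# holds both kinds of entries cannot be decoded with a single charset.
assert as_latin1 != as_koi8r

# Stored as UTF-8, both chars can coexist in one file and round-trip.
both = as_latin1 + as_koi8r
assert both.encode("utf-8").decode("utf-8") == both
```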

Using UTF-8 is actually what eliz proposed at the beginning of this
thread. While one of the ispell.el docstrings claims that only unibyte
chars can be put here, I'd expect this to work, at least for Emacs. As
a matter of fact, when I first tried the (encode- (decode ..)) way I
actually got UTF-8 (which was decoded again by `ispell-get-otherchars'
according to the new 'utf-8 coding-system), and it seemed to work
(apart from the psgml/sgml-lexical-context problem) for Emacs. At that
time I had not noticed that once Emacs loads something as a char,
encodings only matter when writing it (yes, I am really learning all
this encode-* decode-* stuff in more depth in this thread).

However, I'd use this only in personal ~/.emacs files, and only if needed.

>>> For me notations like \267 should be used exclusively to talk about
>>> *bytes*, not about *chars*.  So it might make sense to use those for
>>> things like matching particular bytes in [ia]spell's output, but it
>>> makes no sense to match chars in the buffer being spell-checked since
>>> the buffer does not contain bytes but chars.
>> That is why I want to decode those bytes into actual chars to be used in
>
> If I understand correctly what you mean by "those bytes", then using "·"
> instead of "\267" gives you the decoded form right away without having
> to do extra work.

That is true for files with a single encoding. However, the problem
arises when a file has mixed encodings, as in the Debian example I
mentioned. I know this will not happen in real manually edited files,
but it can and does happen in aggregates like the one I mentioned.

If a file is loaded with a given coding-system-for-read, chars in that
coding-system will be properly interpreted by Emacs when reading, but
the others will not. Something like that happened with
iso-8859-1/iso-8859-15 chars in

http://bugs.debian.org/337214

and the simple way to avoid the mess was to read the file as 'raw-text.
That indeed reads upper chars as pure bytes even though they were
originally written as chars (I mean, not through octal codes), with no
implicit on-the-fly "decoding/interpretation" at all. This is not a big
problem: we know the encoding of every single dict, so things can be
properly decoded (\xxx + coding-system gives a char).
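
The "\xxx + coding-system gives a char" step can be sketched in Python
as follows. This is only an illustration under assumptions: the table
layout, the entry names and the `decode_otherchars` helper are
hypothetical and do not reflect the real Debian file or ispell.el. Each
entry keeps its otherchars as raw bytes, as a 'raw-text read would
leave them, together with the charset declared for that dictionary.

```python
# Hypothetical aggregate: raw otherchars bytes plus the charset
# declared for each dictionary. Entry names are made up.
entries = {
    "catalan": (b"['\267-]", "iso-8859-1"),
    "russian-example": (b"\243", "koi8_r"),
}

def decode_otherchars(raw: bytes, charset: str) -> str:
    """Turn raw bytes into chars using the dictionary's own charset."""
    return raw.decode(charset)

# Decode every entry with its own charset, never with a global one.
decoded = {
    name: decode_otherchars(raw, charset)
    for name, (raw, charset) in entries.items()
}

for name, chars in decoded.items():
    print(name, repr(chars))
```

Because each entry carries its own charset, no guess about the
encoding of the file as a whole is needed.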

If we later change the default encoding used for communication for
entries in that file, we need to decode the bytes obtained from the
'raw-text read into actual chars, so that they are handled internally
as the desired chars. Re-encoding them as UTF-8 as well (and expecting
ispell-get-otherchars to decode them into chars again) seems to work in
Emacs, but also seems completely unnecessary.

I am getting more and more convinced that this is a Debian-only
problem caused by the way we create that file, so I should handle
this special case as Debian-only and do the needed decoding there, not
in ispell.el.

-- 
Agustin