From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: Agustin Martin <agustin.martin@hispalinux.es>
Newsgroups: gmane.emacs.devel
Subject: Re: Ispell and unibyte characters
Date: Fri, 13 Apr 2012 20:44:01 +0200
Message-ID: <20120413184401.GA17992@agmartin.aq.upm.es>
References: <83aa3f2hgh.fsf@gnu.org>
	<20120326173912.GA22306@agmartin.aq.upm.es>
	<E1SCGD0-0001Dm-Tu@fencepost.gnu.org>
	<20120328191821.GA6266@agmartin.aq.upm.es>
	<20120410190803.GA13517@agmartin.aq.upm.es>
	<83ty0r5rmd.fsf@gnu.org>
	<20120412143657.GA18352@agmartin.aq.upm.es>
	<83d37c4vw5.fsf@gnu.org>
	<20120413152525.GA14949@agmartin.aq.upm.es>
	<jwvd37b34og.fsf-monnier+emacs@gnu.org>
NNTP-Posting-Host: plane.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: quoted-printable
X-Trace: dough.gmane.org 1334342654 19288 80.91.229.3 (13 Apr 2012 18:44:14 GMT)
X-Complaints-To: usenet@dough.gmane.org
NNTP-Posting-Date: Fri, 13 Apr 2012 18:44:14 +0000 (UTC)
To: emacs-devel@gnu.org
Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Fri Apr 13 20:44:13 2012
Return-path: <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>
Envelope-to: ged-emacs-devel@m.gmane.org
Original-Received: from lists.gnu.org ([208.118.235.17])
	by plane.gmane.org with esmtp (Exim 4.69)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>)
	id 1SIlTg-0000Nz-D2
	for ged-emacs-devel@m.gmane.org; Fri, 13 Apr 2012 20:44:12 +0200
Original-Received: from localhost ([::1]:59982 helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>)
	id 1SIlTf-0003HS-OH
	for ged-emacs-devel@m.gmane.org; Fri, 13 Apr 2012 14:44:11 -0400
Original-Received: from eggs.gnu.org ([208.118.235.92]:48440)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <agustin.martin@upm.es>) id 1SIlTc-0003HG-E6
	for emacs-devel@gnu.org; Fri, 13 Apr 2012 14:44:09 -0400
Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <agustin.martin@upm.es>) id 1SIlTY-0001XR-8r
	for emacs-devel@gnu.org; Fri, 13 Apr 2012 14:44:07 -0400
Original-Received: from fibonacci.ccupm.upm.es ([138.100.198.70]:34859 helo=smtp.upm.es)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <agustin.martin@upm.es>) id 1SIlTX-0001Up-VY
	for emacs-devel@gnu.org; Fri, 13 Apr 2012 14:44:04 -0400
Original-Received: from agmartin.aq.upm.es (Agmartin.aq.upm.es [138.100.41.131])
	by smtp.upm.es (8.14.3/8.14.3/fibonacci-001) with ESMTP id
	q3DIi1WN028099; Fri, 13 Apr 2012 20:44:01 +0200
Original-Received: by agmartin.aq.upm.es (Postfix, from userid 1000)
	id 52060E52; Fri, 13 Apr 2012 20:44:01 +0200 (CEST)
Mail-Followup-To: emacs-devel@gnu.org
Content-Disposition: inline
In-Reply-To: <jwvd37b34og.fsf-monnier+emacs@gnu.org>
User-Agent: Mutt/1.5.21 (2010-09-15)
X-MIME-Autoconverted: from 8bit to quoted-printable by smtp.upm.es id
	q3DIi1WN028099
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6 (newer, 3)
X-Received-From: 138.100.198.70
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: "Emacs development discussions." <emacs-devel.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/archive/html/emacs-devel>
List-Post: <mailto:emacs-devel@gnu.org>
List-Help: <mailto:emacs-devel-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=subscribe>
Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Xref: news.gmane.org gmane.emacs.devel:149642
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/149642>

On Fri, Apr 13, 2012 at 01:51:15PM -0400, Stefan Monnier wrote:
> > ("catala8"
> >      "[A-Za-z]" "[^A-Za-z]" "['\267-]" nil ("-B" "-d" "catalan") nil iso-8859-1)
>
> > Unless emacs knows the encoding for \267 (middledot "·") it cannot decode it
> > properly. I prefer to not use UTF-8 here, because I want the entry to also be
> > useful for ispell (and also be XEmacs incompatible). The best approach here
> > seems to decode the otherchars regexp according to provided coding-system.
>
> There's something I don't understand here:
>=20
> If you want a middle dot, why don't you put a middle dot?
> I mean why write "['\267-]" rather than ['·-]?

The problem is that a dictionary alist can contain dictionaries with
different unibyte encodings. If you happen to have two such chars in
different encodings, I'd expect problems.
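
As a rough illustration of the point above (written in Python rather than
Emacs Lisp, and not part of the Debian setup), the sketch below decodes the
single byte \267 under a few unibyte encodings. Apart from iso-8859-1, the
encodings are arbitrary examples chosen for the demonstration.

```python
# Illustration only: the same byte value maps to different characters
# depending on which unibyte encoding is assumed.
RAW = b"\xb7"  # octal \267, as in the "catala8" otherchars entry

# Apart from iso-8859-1, these encodings are arbitrary examples and are
# not named anywhere in this thread.
CODINGS = ["iso-8859-1", "iso-8859-2", "iso-8859-5"]

decoded = {coding: RAW.decode(coding) for coding in CODINGS}

for coding, char in decoded.items():
    # Print the code point each encoding assigns to byte 0xB7.
    print(f"{coding}: U+{ord(char):04X}")
```

Only the iso-8859-1 reading gives the middle dot; the other two give
unrelated characters, which is why a bare byte is ambiguous without its
coding-system.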

I really should have gone into more detail about the system where I noticed
this, even if it is a bit Debian-specific.

I noticed this problem in the aspell Catalan entry provided by the Debian
aspell-ca package. In Debian, the alists for the different aspell (and ispell
and hunspell) dictionaries are created on dictionary installation and stored
in a file (for the curious, /var/cache/dictionaries-common/emacsen-ispell-dicts.el).
Some maintainers provide \xxx, some provide explicit chars in different
encodings, and all that info is put together in dict alist form in that file,
so it cannot be loaded with a single encoding but only as 'raw-text, and
that implies loading as bytes rather than as chars.

> I think this is related to your saying "I prefer to not use UTF-8 here",
> but again I don't know what you mean by "use UTF-8", because using
> a middle dot character in the source file does not imply using UTF-8
> anywhere (the file can be saved in any encoding that includes the
> middle dot).
>
> For me notations like \267 should be used exclusively to talk about
> *bytes*, not about *chars*.  So it might make sense to use those for
> things like matching particular bytes in [ia]spell's output, but it
> makes no sense to match chars in the buffer being spell-checked since
> the buffer does not contain bytes but chars.

That is why I want to decode those bytes into actual chars to be used in
spellchecking, and make sure that they are decoded from the correct
coding-system. Otherwise, if the process coding-system is changed to UTF-8 and
otherchars stays as bytes matching the wrong encoding, things may not work well.
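
A minimal sketch of that idea, again in Python purely for illustration: each
entry carries its otherchars as raw bytes together with its declared
coding-system, and is decoded with that coding only. The tuple layout, the
function name `decode_otherchars`, and the "example-l2" entry are hypothetical;
this is not the code proposed for ispell.el.

```python
# Hypothetical, simplified entries: (name, otherchars as raw bytes, coding).
# Only "catala8" comes from this thread; "example-l2" is invented.
ENTRIES = [
    ("catala8", b"['\xb7-]", "iso-8859-1"),
    ("example-l2", b"['\xb7-]", "iso-8859-2"),
]


def decode_otherchars(entries):
    """Decode each entry's otherchars with that entry's own coding."""
    return {name: raw.decode(coding) for name, raw, coding in entries}


decoded = decode_otherchars(ENTRIES)

# Once decoded, the values are character strings, so they can be
# re-encoded correctly if the checker process is switched to UTF-8.
print(decoded["catala8"].encode("utf-8"))
```

The same bytes yield different character strings for the two entries, so
decoding has to happen per entry rather than once for the whole file.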

If there is a consensus that I should not go the decode- way for otherchars,
I will not commit that part. For Debian I can simply keep loading
emacsen-ispell-dicts.el as raw-text and do the decode- processing on its
contents before they are passed to ispell.el through
`ispell-base-dicts-override-alist', so that the latter contains chars rather
than bytes. However, I think it is better to keep the decode- stuff for more
general use.

I will wait at least a couple of days before committing, so it is clear what
to do.

Thanks all for your comments,

-- 
Agustin