From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: Agustin Martin <agustin.martin@hispalinux.es>
Newsgroups: gmane.emacs.devel
Subject: Re: Ispell and unibyte characters
Date: Fri, 20 Apr 2012 17:25:32 +0200
Message-ID: <20120420152532.GA3733@agmartin.aq.upm.es>
References: <20120410190803.GA13517@agmartin.aq.upm.es>
	<83ty0r5rmd.fsf@gnu.org>
	<20120412143657.GA18352@agmartin.aq.upm.es>
	<83d37c4vw5.fsf@gnu.org>
	<20120413152525.GA14949@agmartin.aq.upm.es>
	<jwvd37b34og.fsf-monnier+emacs@gnu.org>
	<20120413184401.GA17992@agmartin.aq.upm.es>
	<jwvehrr13kp.fsf-monnier+emacs@gnu.org>
	<CAKy3oZr6KYDfz6Vk-VA8xHbLvbLfGU5HcJiRtaGMy0tq82fgEw@mail.gmail.com>
	<jwvehroxvam.fsf-monnier+emacs@gnu.org>
NNTP-Posting-Host: plane.gmane.org
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="dDRMvlgZJXvWKvBx"
X-Trace: dough.gmane.org 1334935571 31007 80.91.229.3 (20 Apr 2012 15:26:11 GMT)
X-Complaints-To: usenet@dough.gmane.org
NNTP-Posting-Date: Fri, 20 Apr 2012 15:26:11 +0000 (UTC)
To: emacs-devel@gnu.org
Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Fri Apr 20 17:26:10 2012
Return-path: <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>
Envelope-to: ged-emacs-devel@m.gmane.org
Original-Received: from lists.gnu.org ([208.118.235.17])
	by plane.gmane.org with esmtp (Exim 4.69)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>)
	id 1SLFiq-0006DX-Tq
	for ged-emacs-devel@m.gmane.org; Fri, 20 Apr 2012 17:26:09 +0200
Original-Received: from localhost ([::1]:36745 helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>)
	id 1SLFiq-00018t-99
	for ged-emacs-devel@m.gmane.org; Fri, 20 Apr 2012 11:26:08 -0400
Original-Received: from eggs.gnu.org ([208.118.235.92]:38215)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <agustin.martin@upm.es>) id 1SLFic-00010G-Cb
	for emacs-devel@gnu.org; Fri, 20 Apr 2012 11:26:07 -0400
Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <agustin.martin@upm.es>) id 1SLFiR-00040p-H2
	for emacs-devel@gnu.org; Fri, 20 Apr 2012 11:25:53 -0400
Original-Received: from edison.ccupm.upm.es ([138.100.198.71]:40035 helo=smtp.upm.es)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <agustin.martin@upm.es>) id 1SLFiR-0003v3-34
	for emacs-devel@gnu.org; Fri, 20 Apr 2012 11:25:43 -0400
Original-Received: from agmartin.aq.upm.es (Agmartin.aq.upm.es [138.100.41.131])
	by smtp.upm.es (8.14.3/8.14.3/edison-001) with ESMTP id q3KFPX95002971; 
	Fri, 20 Apr 2012 17:25:33 +0200
Original-Received: by agmartin.aq.upm.es (Postfix, from userid 1000)
	id CFA1B405; Fri, 20 Apr 2012 17:25:32 +0200 (CEST)
Mail-Followup-To: emacs-devel@gnu.org
Content-Disposition: inline
In-Reply-To: <jwvehroxvam.fsf-monnier+emacs@gnu.org>
User-Agent: Mutt/1.5.21 (2010-09-15)
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6 (newer, 2)
X-Received-From: 138.100.198.71
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: "Emacs development discussions." <emacs-devel.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/archive/html/emacs-devel>
List-Post: <mailto:emacs-devel@gnu.org>
List-Help: <mailto:emacs-devel-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=subscribe>
Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Xref: news.gmane.org gmane.emacs.devel:149864
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/149864>


--dDRMvlgZJXvWKvBx
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable
X-MIME-Autoconverted: from 8bit to quoted-printable by smtp.upm.es id q3KFPX95002971

On Sun, Apr 15, 2012 at 10:40:29PM -0400, Stefan Monnier wrote:
> > Imagine Catalan dictionary with iso-8859-1 "·" in otherchars and other
> > dictionary (I am guessing the possibility to be more general, do not
> > actually have a real example of something different from our Debian
> > file with all info put together) with another upper char in
> > otherchars, but in a different encoding (e.g., koi8r).
>
> You're still living in Emacs-21/22: since Emacs-23, basically chars aren't
> associated with their encoding (actually charset) any more.

Not once inside Emacs, but when reading a file the encoding matters, and in
some corner cases with mixed charsets Emacs may get the wrong chars
(with no sane way to make it automatically get the right ones); see the
attached file.

BTW, I do not even have Emacs-21/22 installed. I am testing this in
Emacs23/24 (together with XEmacs, to check that I do not introduce
additional incompatibilities).

> > The only possibility to have both coexist as chars in the same file is
> > to use multibyte UTF-8 chars instead of mixed unibyte iso-8859-1 and
> > koi8r, so Emacs properly gets chars when reading the file (if properly
> > guessing file coding-system).
>
> Not at all, there are many encodings which cover the superset of
> iso-8859-* and koi8-*.  UTF-8 is the more fashionable one nowadays, but
> not anywhere close to the only one. e.g. there's also iso-2022,
> emacs-mule, and then some.

Sorry, I should have written something like supersets; I put UTF-8 as an
example.

> > I'd however use this only in personal ~/.emacs files and if needed.
>
> Why?  It would make the code more clear and simpler.

To make my Debian changes minimal I prefer to keep compatibility with
XEmacs when possible. That makes my life easier when adapting changes in the
FSF Emacs repo to Debian. It seems that XEmacs has very recently added
support for automatic on-the-fly UTF-8 parsing, so my POV may change, but I
admit I am currently biased towards the 7bit \xxx strings.

Since Emacs should now (for some days) use [:alpha:] in "Casechars" and
"Not-Casechars" for global dicts, I think we should not worry very much
about this from the Emacs side, only for Otherchars in the very few cases
where it contains an upper char (none in current ispell.el). And for that I
still personally prefer to keep using the 7bit string "\xxx" for now.

> > That is true for files with a single encoding.  However, the problem
> > happens when a file has mixed encodings like in the Debian example I
> > mentioned.  I know, this will not happen in real manually edited files,
> > but can happen and happens in aggregates like the one I mentioned.
>
> That's an old solved problem.

Maybe we are speaking about different things, but as I understand this, it
does not seem to be solved. And I do not think this can be solved in a
robust enough way for all files. See the attached file and comments below.

> > If file is loaded with a given coding-system-for-read chars in that
> > coding-system will be properly interpreted by Emacs when reading, but
> > not the others. Something like that happened with
> > iso-8859-1/iso-8859-15 chars in
>
> That was then.  Not any more.

I think you mean that iso-8859-* chars are currently unified. I am aware of
that, but I am speaking about something different: mixed encodings, also
discussed in that thread together with the iso-8859-1/iso-8859-15 problems.

See the attached file. It contains a middledot in two encodings, UTF-8 in
the first line and latin1 in the second, together with something that was
originally written as an iso-8859-7 lowercase Greek zeta. On my iso-8859-1
box, emacs24 (emacs-snapshot_20120410) reads it as

--
Â·
·
æ
--

so it gets the wrong char both for the UTF-8 middledot and for the Greek
lowercase zeta. In a different environment Emacs may have guessed that the
first line is UTF-8, but I do not see a robust enough way to properly guess
all the mixed encodings for a small file like this.
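
To make the point concrete, here is a minimal sketch in Python (not Emacs
code, just an illustration) that decodes the bytes of the attached test.txt
line by line under three candidate encodings. The helper name decode_lines
and the candidate list are assumptions made up for this example; the payload
is the base64 body of the attachment in this message.

```python
import base64

# Payload of the attached test.txt, copied from the base64 part of this mail.
RAW = base64.b64decode("wrcNCrcNCuYNCg==")

# Hypothetical candidate list, chosen only for this illustration.
CANDIDATES = ("utf-8", "iso-8859-1", "iso-8859-7")


def decode_lines(data, encoding):
    """Decode every line of data with one encoding.

    A line that is not valid in that encoding is reported as None.
    """
    decoded = []
    for line in data.splitlines():
        try:
            decoded.append(line.decode(encoding))
        except UnicodeDecodeError:
            decoded.append(None)
    return decoded


# What the file was meant to contain: middledot, middledot, Greek zeta.
INTENDED = ["\u00b7", "\u00b7", "\u03b6"]

RESULTS = {enc: decode_lines(RAW, enc) for enc in CANDIDATES}

for enc, lines in RESULTS.items():
    verdict = "matches" if lines == INTENDED else "does not match"
    print(enc, lines, verdict, "the intended text")
```

With these candidates, no single encoding reproduces the intended three
lines: UTF-8 only decodes the first line, iso-8859-1 gives the result quoted
above, and iso-8859-7 gets the zeta but turns the first line into two chars.
Each line would need its own encoding, which a per-file guess cannot give.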

That is the kind of thing I am now dealing with for Debian. It is currently
not a big problem for Emacs after the [:alpha:] changes for
casechars/not-casechars (the chance that a new dict adds otherchars in
incompatible charsets is small). However, this can still happen to us with
XEmacs in that aggregated file.

Sorry if I did not make that clear enough and helped make this thread this
long.

Regards,

-- 
Agustin

--dDRMvlgZJXvWKvBx
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: attachment; filename="test.txt"
Content-Transfer-Encoding: base64
X-MIME-Autoconverted: from 8bit to base64 by smtp.upm.es id q3KFPX95002971

wrcNCrcNCuYNCg==
--dDRMvlgZJXvWKvBx--