From mboxrd@z Thu Jan 1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: Agustin Martin
Newsgroups: gmane.emacs.devel
Subject: Re: Ispell and unibyte characters
Date: Fri, 20 Apr 2012 17:25:32 +0200
Message-ID: <20120420152532.GA3733@agmartin.aq.upm.es>
References: <20120410190803.GA13517@agmartin.aq.upm.es> <83ty0r5rmd.fsf@gnu.org> <20120412143657.GA18352@agmartin.aq.upm.es> <83d37c4vw5.fsf@gnu.org> <20120413152525.GA14949@agmartin.aq.upm.es> <20120413184401.GA17992@agmartin.aq.upm.es>
To: emacs-devel@gnu.org
Mail-Followup-To: emacs-devel@gnu.org
User-Agent: Mutt/1.5.21 (2010-09-15)
Xref: news.gmane.org gmane.emacs.devel:149864
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="dDRMvlgZJXvWKvBx"

--dDRMvlgZJXvWKvBx
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline

On Sun, Apr 15, 2012 at 10:40:29PM -0400, Stefan Monnier wrote:
> > Imagine Catalan dictionary with iso-8859-1 "·" in otherchars and other
> > dictionary (I am guessing the possibility to be more general, do not
> > actually have a real example of something different from our Debian
> > file with all info put together) with another upper char in
> > otherchars, but in a different encoding (e.g., koi8r).
>
> You're still living in Emacs-21/22: since Emacs-23, basically chars aren't
> associated with their encoding (actually charset) any more.

Not once inside Emacs, but when reading a file, encoding matters and in
some corner cases with mixed charsets Emacs may get wrong chars (with no
sane way to make it automatically get the right ones), see attached file.
BTW, I do not even have Emacs-21/22 installed. I am testing this in
Emacs 23/24 (together with XEmacs, to check that I do not introduce
additional incompatibilities).

> > The only possibility to have both coexist as chars in the same file is
> > to use multibyte UTF-8 chars instead of mixed unibyte iso-8859-1 and
> > koi8r, so Emacs properly gets chars when reading the file (if properly
> > guessing file coding-system).
>
> Not at all, there are many encodings which cover the superset of
> iso-8859-* and koi8-*. UTF-8 is the more fashionable one nowadays, but
> not anywhere close to the only one. e.g. there's also iso-2022,
> emacs-mule, and then some.

Sorry, I should have written something like "supersets", with UTF-8 as an
example.

> > I'd however use this only in personal ~/.emacs files and if needed.
>
> Why? It would make the code more clear and simpler.

To keep my Debian changes minimal I prefer to keep compatibility with
XEmacs when possible. That makes my life easier when adapting changes in
the FSF Emacs repo to Debian. It seems that XEmacs has very recently added
support for automatic on-the-fly UTF-8 parsing, so my POV may change, but I
admit I am currently biased towards the 7bit \xxx strings.

Since Emacs should now (for some days) use [:alpha:] in "Casechars" and
"Not-Casechars" for global dicts, I think we should not worry very much
about this from the Emacs side, just for Otherchars in the very few cases
where it contains an upper char (none in current ispell.el). And for that I
still personally prefer to keep using the 7bit string "\xxx" for now.

> > That is true for files with a single encoding. However, the problem
> > happens when a file has mixed encodings like in the Debian example I
> > mentioned. I know, this will not happen in real manually edited files,
> > but can happen and happens in aggregates like the one I mentioned.
>
> That's an old solved problem.
Maybe we are speaking about different things, but as I understand this, it
does not seem so. And I do not think this can be solved in a robust enough
way for all files. See the attached file and the comments below.

> > If file is loaded with a given coding-system-for-read chars in that
> > coding-system will be properly interpreted by Emacs when reading, but
> > not the others. Something like that happened with
> > iso-8859-1/iso-8859-15 chars in
>
> That was then. Not any more.

I think you mean that iso-8859-* chars are currently unified. I am aware of
that, but I am speaking about something different, mixed encodings, also
discussed in that thread together with the iso-8859-1/iso-8859-15 problems.

See the attached file. It contains middledot in two encodings, UTF-8 in the
first line and latin1 in the second, together with something that was
originally written as iso-8859-7 lowercase greek zeta. In my iso-8859-1 box
emacs24 (emacs-snapshot_20120410) reads it as

--
Â·
·
æ
--

so it gets the wrong char both for UTF-8 and for greek lowercase zeta. In a
different environment Emacs may have guessed that the first line is UTF-8,
but I do not see a robust enough way to properly guess all the mixed
encodings for a small file like this.

That is the kind of thing I am now dealing with for Debian. Currently it is
not a big problem for Emacs after the [:alpha:] changes for casechars/not
casechars (the chance that a new dict adds otherchars in incompatible
charsets is small), however this can still happen to us for XEmacs in that
aggregated file.

Sorry if I did not make that clear enough and helped make this thread this
long.

Regards,

-- 
Agustin

--dDRMvlgZJXvWKvBx
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: attachment; filename="test.txt"
Content-Transfer-Encoding: base64

wrcNCrcNCuYNCg==

--dDRMvlgZJXvWKvBx--
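
The byte-level situation described for test.txt can be reproduced outside Emacs. The Python snippet below is only an illustrative sketch and does not come from the thread: it decodes the base64 attachment shown in the message and compares what a single coding system yields against a per-line decode. The helper names `decode_lines` and `decodes_cleanly` are made up for this sketch, and the per-line codec list is an assumption taken from the description in the message (UTF-8, then latin1, then iso-8859-7).

```python
import base64

# Attachment body copied from the message: base64 of the raw test.txt bytes.
ATTACHMENT_B64 = "wrcNCrcNCuYNCg=="
RAW = base64.b64decode(ATTACHMENT_B64)

# The file uses CRLF line ends; drop the empty piece after the final CRLF.
LINES = RAW.split(b"\r\n")[:-1]


def decode_lines(codec):
    """Decode every line with one codec, as a single coding system would."""
    return [line.decode(codec, errors="replace") for line in LINES]


def decodes_cleanly(codec):
    """Return True if the whole file is valid in the given codec."""
    try:
        RAW.decode(codec)
        return True
    except UnicodeDecodeError:
        return False


# Assumption taken from the message text: one intended codec per line.
INTENDED_CODECS = ["utf-8", "latin-1", "iso8859_7"]
intended = [line.decode(c) for line, c in zip(LINES, INTENDED_CODECS)]

if __name__ == "__main__":
    print("raw bytes      :", RAW)
    print("all as latin-1 :", decode_lines("latin-1"))
    print("all as greek   :", decode_lines("iso8859_7"))
    print("valid UTF-8?   :", decodes_cleanly("utf-8"))
    print("per-line decode:", intended)
```

Every byte sequence is valid latin1, so a latin1 read never fails and gives no hint that two of the three lines were meant differently. The file as a whole is not valid UTF-8, and reading it all as iso-8859-7 recovers the zeta but still mangles the first line. Only the per-line decode, which needs knowledge that is not in the file, returns the three intended characters.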