Re: Ispell and unibyte characters


From: Agustin Martin <agustin.martin@hispalinux.es>
To: emacs-devel@gnu.org
Subject: Re: Ispell and unibyte characters
Date: Fri, 20 Apr 2012 17:25:32 +0200
Message-ID: <20120420152532.GA3733@agmartin.aq.upm.es>
In-Reply-To: <jwvehroxvam.fsf-monnier+emacs@gnu.org>

On Sun, Apr 15, 2012 at 10:40:29PM -0400, Stefan Monnier wrote:
> > Imagine Catalan dictionary with iso-8859-1 "·" in otherchars and other
> > dictionary (I am guessing the possibility to be more general, do not
> > actually have a real example of something different from our Debian
> > file with all info put together) with another upper char in
> > otherchars, but in a different encoding (e.g., koi8r).
> 
> You're still living in Emacs-21/22: since Emacs-23, basically chars aren't
> associated with their encoding (actually charset) any more.

Not once they are inside Emacs, but when reading a file the encoding
matters, and in some corner cases with mixed charsets Emacs may get the
wrong chars (with no sane way to make it automatically get the right ones);
see the attached file.

BTW, I do not even have Emacs-21/22 installed; I am testing this in
Emacs 23/24 (together with XEmacs, to check that I do not introduce
additional incompatibilities).

> > The only possibility to have both coexist as chars in the same file is
> > to use multibyte UTF-8 chars instead of mixed unibyte iso-8859-1 and
> > koi8r, so Emacs properly gets chars when reading the file (if properly
> > guessing file coding-system).
> 
> Not at all, there are many encodings which cover the superset of
> iso-8859-* and koi8-*.  UTF-8 is the more fashionable one nowadays, but
> not anywhere close to the only one. e.g. there's also iso-2022,
> emacs-mule, and then some.

Sorry, I should have written something like "supersets", with UTF-8 as an
example.

> > I'd however use this only in personal ~/.emacs files and if needed.
> 
> Why?  It would make the code more clear and simpler.

To keep my Debian changes minimal I prefer to stay compatible with XEmacs
when possible. That makes my life easier when adapting changes from the FSF
Emacs repo to Debian. It seems that XEmacs has very recently added support
for automatic on-the-fly UTF-8 parsing, so my POV may change, but I admit I
am currently biased towards the 7bit \xxx strings.

Since Emacs should now (as of a few days ago) use [:alpha:] in "Casechars"
and "Not-Casechars" for global dicts, I think we need not worry much about
this on the Emacs side, only about Otherchars in the very few cases where it
contains an upper char (none in current ispell.el). And for that I still
personally prefer to keep using the 7bit string "\xxx" for now.
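
A minimal sketch of why a 7bit escape is the safer spelling in an aggregated
file, written in Python rather than Emacs Lisp purely for illustration. The
`otherchars` entries and the list of codecs are hypothetical; the only point
is that an ASCII-only spelling decodes identically under any ASCII-compatible
coding system, while a raw upper byte does not.

```python
# Hypothetical one-line dictionary entries; names and values are illustrative only.
RAW_ENTRY = b'otherchars = "\xb7"'        # raw upper byte (middledot in iso-8859-1)
ESCAPED_ENTRY = b'otherchars = "\\267"'   # same char spelled as a 7bit octal escape

# Coding systems a reader might pick for the whole file (an assumed sample).
CODECS = ("iso-8859-1", "koi8_r")


def decodings(entry):
    """Return the set of distinct texts obtained by decoding entry with each codec."""
    return {entry.decode(codec) for codec in CODECS}


raw_variants = decodings(RAW_ENTRY)
escaped_variants = decodings(ESCAPED_ENTRY)

print(len(raw_variants))      # several readings: the result depends on the codec
print(len(escaped_variants))  # one reading: the entry is pure ASCII
```

The escaped entry still has to be turned into the right char by whoever
interprets the escape, so this only removes the file-decoding ambiguity, not
the need to know which charset the escape refers to.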

> > That is true for files with a single encoding.  However, the problem
> > happens when a file has mixed encodings like in the Debian example I
> > mentioned.  I know, this will not happen in real manually edited files,
> > but can happen and happens in aggregates like the one I mentioned.
> 
> That's an old solved problem.

Maybe we are speaking about different things, but as I understand it, this
does not seem to be solved. And I do not think it can be solved in a robust
enough way for all files. See the attached file and the comments below.

> > If file is loaded with a given coding-system-for-read chars in that
> > coding-system will be properly interpreted by Emacs when reading, but
> > not the others. Something like that happened with
> > iso-8859-1/iso-8859-15 chars in
> 
> That was then.  Not any more.

I think you mean that iso-8859-* chars are currently unified. I am aware of
that, but I am speaking about something different, mixed encodings, also
discussed in that thread together with the iso-8859-1/iso-8859-15 problems.

See the attached file. It contains a middledot in two encodings, UTF-8 in
the first line and latin1 in the second, together with something that was
originally written as an iso-8859-7 lowercase Greek zeta. On my iso-8859-1
box, emacs24 (emacs-snapshot_20120410) reads it as

--
Â·
·
æ
--

so it gets the wrong char both for the UTF-8 middledot and for the Greek
lowercase zeta. In a different environment Emacs might have guessed that the
first line is UTF-8, but I do not see a robust enough way to properly guess
all the mixed encodings in a small file like this.
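
The effect can be sketched outside Emacs. The snippet below is only an
illustration in Python, and the byte values are an assumed reconstruction of
the three lines of test.txt (line terminators omitted), not a dump of the
real attachment. It decodes every line with one codec at a time, which is
what reading the whole file under a single coding system amounts to.

```python
# Assumed bytes for the three lines: UTF-8 middledot, latin1 middledot,
# iso-8859-7 lowercase zeta. This is a reconstruction, not the real file.
LINES = [b"\xc2\xb7", b"\xb7", b"\xe6"]


def decode_all(lines, encoding):
    """Decode every line with one codec; undecodable bytes become U+FFFD."""
    return [line.decode(encoding, errors="replace") for line in lines]


as_latin1 = decode_all(LINES, "iso-8859-1")  # matches the reading quoted above
as_utf8 = decode_all(LINES, "utf-8")         # only the first line comes out right
as_greek = decode_all(LINES, "iso-8859-7")   # only here is the third line a zeta

for name, result in (("latin1", as_latin1), ("utf-8", as_utf8), ("greek", as_greek)):
    print(name, result)
```

No single choice of codec yields the intended char on all three lines, and
with one or two bytes per line there is very little for any guessing
heuristic to work with.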

That is the kind of thing I am now dealing with for Debian. It is currently
not a big problem for Emacs after the [:alpha:] changes for
casechars/not-casechars (the chance that a new dict adds otherchars in
incompatible charsets is small); however, it can still happen to us for
XEmacs in that aggregated file.

Sorry if I did not make that clear enough and helped make this thread this
long.

Regards,

-- 
Agustin

[-- Attachment #2: test.txt --]
[-- Type: text/plain, Size: 10 bytes --]

Â·
·
æ
