Re: Ispell and unibyte characters

From: Agustin Martin <agustin.martin@hispalinux.es>
To: emacs-devel@gnu.org
Subject: Re: Ispell and unibyte characters
Date: Sun, 15 Apr 2012 02:02:11 +0200
Message-ID: <CAKy3oZr6KYDfz6Vk-VA8xHbLvbLfGU5HcJiRtaGMy0tq82fgEw@mail.gmail.com>
In-Reply-To: <jwvehrr13kp.fsf-monnier+emacs@gnu.org>

On 14 April 2012 03:57, Stefan Monnier
<monnier@iro.umontreal.ca> wrote:
>>> If you want a middle dot, why don't you put a middle dot?
>>> I mean why write "['\267-]" rather than ['·-]?
>> The problem is that in a dictionary alist you can have dictionaries with
>> different unibyte encodings, if you happen to have two of that chars in
>> different encodings I'd expect problems.
>
> I still don't understand.  Can you be more specific?

Imagine a Catalan dictionary with an iso-8859-1 "·" in otherchars, and
another dictionary with a different upper (non-ASCII) char in
otherchars, but in a different encoding (e.g., koi8-r). I am guessing
here to make the case more general; I do not actually have a real
example other than our Debian file, which puts all the info together.

The only way to have both coexist as chars in the same file is to use
multibyte UTF-8 chars instead of mixed unibyte iso-8859-1 and koi8-r,
so that Emacs gets the chars right when reading the file (provided it
guesses the file's coding-system correctly). XEmacs seems to be a bit
more tricky regarding UTF-8, but I'd expect things to work once proper
decoding is done. The traditional alternative is to use octal codes to
represent the bytes that match the char in the dict's declared charset.
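
To make the ambiguity concrete, here is a minimal sketch in Python
(used only as a neutral illustration, it is not part of ispell.el).
It shows that the octal byte \267 becomes a middle dot only when it is
decoded as iso-8859-1; the same byte decoded as koi8-r gives a
different char.

```python
# The single byte written as \267 in octal (0xB7 in hex).
raw = b"\267"

as_latin1 = raw.decode("iso-8859-1")  # dictionary declared as iso-8859-1
as_koi8r = raw.decode("koi8-r")       # dictionary declared as koi8-r

print(repr(as_latin1))
print(repr(as_koi8r))
print(as_latin1 == as_koi8r)  # the two decodings disagree
```

So a bare byte is only meaningful together with the charset declared
for its dictionary.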

Using UTF-8 is actually what eliz proposed at the beginning of this
thread. While one of the ispell.el docstrings claims that only unibyte
chars can be put here, I'd expect this to work, at least for Emacs. As
a matter of fact, when I was first trying the (encode- (decode ..))
way I actually got UTF-8 (which was decoded again by
`ispell-get-otherchars' according to the new 'utf-8 coding-system) and
it seemed to work for Emacs (apart from the
psgml/sgml-lexical-context problem). At that time I had not noticed
that once Emacs loads something as a char, encodings only matter when
writing it (yes, I am really learning all this encode-*/decode-* stuff
in more depth in this thread).
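
The "encodings only matter when writing" point can be sketched outside
Emacs too. The following Python snippet is only an analogy, under the
assumption that Python's str plays the role of an Emacs multibyte
string: one char, two different byte sequences depending on the
encoding chosen at write time.

```python
middle_dot = "\u00b7"  # the char "·", independent of any encoding

utf8_bytes = middle_dot.encode("utf-8")         # two bytes
latin1_bytes = middle_dot.encode("iso-8859-1")  # one byte

# Decoding each byte sequence with its own encoding gives back the
# very same char.
same_char = (utf8_bytes.decode("utf-8")
             == latin1_bytes.decode("iso-8859-1")
             == middle_dot)
print(utf8_bytes, latin1_bytes, same_char)
```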

However, I'd use this only in personal ~/.emacs files, and only if needed.

>>> For me notations like \267 should be used exclusively to talk about
>>> *bytes*, not about *chars*.  So it might make sense to use those for
>>> things like matching particular bytes in [ia]spell's output, but it
>>> makes no sense to match chars in the buffer being spell-checked since
>>> the buffer does not contain bytes but chars.
>> That is why I want to decode those bytes into actual chars to be used in
>
> If I understand correctly what you mean by "those bytes", then using "·"
> instead of "\267" gives you the decoded form right away without having
> to do extra work.

That is true for files with a single encoding. The problem appears
when a file has mixed encodings, as in the Debian example I mentioned.
I know this will not happen in real, manually edited files, but it can
happen, and does happen, in aggregates like the one I mentioned.

If a file is loaded with a given coding-system-for-read, chars in that
coding-system will be properly interpreted by Emacs when reading, but
not the others. Something like that happened with
iso-8859-1/iso-8859-15 chars in

http://bugs.debian.org/337214

and the simple way to avoid the mess was to read the file as
'raw-text. That indeed reads upper chars as pure bytes, even though
they were originally written as chars (I mean, not through octal
codes), with no implicit on-the-fly decoding/interpretation at all.
This is not a big problem: we know the encoding of every single dict,
so things can be properly decoded (\xxx + coding-system gives a char).
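
The "\xxx + coding-system gives a char" step could be sketched as
follows. This is a Python illustration, not the Debian or ispell.el
code; the dictionary names and the koi8-r entry are hypothetical and
exist only to show per-dictionary decoding of bytes read as raw text.

```python
# Hypothetical aggregate: each entry keeps otherchars as raw bytes,
# together with the encoding declared for that dictionary.
entries = {
    "catala": (b"'\267-", "iso-8859-1"),
    "example-koi8r": (b"\301", "koi8-r"),
}

def decode_otherchars(entries):
    """Decode each raw otherchars value with its own dict's encoding."""
    return {name: raw.decode(enc) for name, (raw, enc) in entries.items()}

decoded = decode_otherchars(entries)

# For comparison: decoding the whole aggregate with one encoding is
# right for the iso-8859-1 dict but wrong for the other one.
single = {name: raw.decode("iso-8859-1")
          for name, (raw, _enc) in entries.items()}

print(decoded)
print(single)
```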

If we later change the default encoding used for communication for
entries in that file, we need to decode the bytes obtained from the
'raw-text read into actual chars, so that each one is handled
internally as the desired char. Changing them to UTF-8 as well (and
expecting ispell-get-otherchars to decode them again to chars) seems
to work in Emacs, but also seems absolutely unnecessary.

I am getting more and more convinced that this is a Debian-only
problem caused by the way we create that file, so I should handle
this special case as Debian-only and do the needed decoding there, not
in ispell.el.

-- 
Agustin
