unofficial mirror of emacs-devel@gnu.org 
* Ispell and unibyte characters
@ 2012-03-17 18:46 Eli Zaretskii
  2012-03-26 17:39 ` Agustin Martin
  0 siblings, 1 reply; 25+ messages in thread
From: Eli Zaretskii @ 2012-03-17 18:46 UTC (permalink / raw)
  To: emacs-devel

The doc string of ispell-dictionary-alist says, inter alia:

  Each element of this list is also a list:

  (DICTIONARY-NAME CASECHARS NOT-CASECHARS OTHERCHARS MANY-OTHERCHARS-P
	  ISPELL-ARGS EXTENDED-CHARACTER-MODE CHARACTER-SET)
  ...
  CASECHARS, NOT-CASECHARS, and OTHERCHARS must be unibyte strings
  containing bytes of CHARACTER-SET.  In addition, if they contain
  a non-ASCII byte, the regular expression must be a single
  `character set' construct that doesn't specify a character range
  for non-ASCII bytes.

Why the restriction to unibyte character sets?  This is quite a
serious limitation, given that the modern spellers (aspell and
hunspell) use UTF-8 as their default encoding.

The only reason for this limitation I could find is in
ispell-process-line, which assumes that the byte offsets returned by
the speller can be used to compute character position of the
misspelled word in the buffer.  Are there any other places in
ispell.el that assume unibyte characters?

If ispell-process-line is the only place, then it should be easy to
extend it so it correctly handles UTF-8 in addition to unibyte
character sets.

In any case, I see no reason to specify CASECHARS, NOT-CASECHARS, and
OTHERCHARS as ugly unibyte escapes, since their usage is entirely
consistent with multibyte characters: they are used to construct
regular expressions and match buffer text against those regexps.  Did
I miss something important?
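
For concreteness, here is the kind of thing I mean by "ugly unibyte
escapes" (an illustrative example, not copied verbatim from the
defaults): the same CASECHARS regexp written with octal escapes and with
the characters themselves.

  ;; Illustrative only; the actual default entries may differ.
  ;; CASECHARS spelled with unibyte octal escapes (Latin-1 bytes):
  "[a-zA-Z\304\326\334\344\366\337\374]"
  ;; The same set written with multibyte characters:
  "[a-zA-ZÄÖÜäößü]"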

Any comments and pointers to my blunders are welcome.




* Re: Ispell and unibyte characters
  2012-03-17 18:46 Ispell and unibyte characters Eli Zaretskii
@ 2012-03-26 17:39 ` Agustin Martin
  2012-03-26 20:08   ` Eli Zaretskii
  0 siblings, 1 reply; 25+ messages in thread
From: Agustin Martin @ 2012-03-26 17:39 UTC (permalink / raw)
  To: emacs-devel

On Sat, Mar 17, 2012 at 08:46:54PM +0200, Eli Zaretskii wrote:
> The doc string of ispell-dictionary-alist says, inter alia:
> 
>   Each element of this list is also a list:
> 
>   (DICTIONARY-NAME CASECHARS NOT-CASECHARS OTHERCHARS MANY-OTHERCHARS-P
> 	  ISPELL-ARGS EXTENDED-CHARACTER-MODE CHARACTER-SET)
>   ...
>   CASECHARS, NOT-CASECHARS, and OTHERCHARS must be unibyte strings
>   containing bytes of CHARACTER-SET.  In addition, if they contain
>   a non-ASCII byte, the regular expression must be a single
>   `character set' construct that doesn't specify a character range
>   for non-ASCII bytes.
> 
> Why the restriction to unibyte character sets?  This is quite a
> serious limitation, given that the modern spellers (aspell and
> hunspell) use UTF-8 as their default encoding.

Hi Eli,

At least for aspell, ispell.el already uses UTF-8 as the default
communication encoding and [:alpha:] as CASECHARS (and ^[:alpha:] as
NOT-CASECHARS).  OTHERCHARS is guessed from the aspell .dat file for the
given dictionary.

Since it is currently not possible to ask hunspell for its installed
dictionaries (hunspell -D does not return control to the console),
no one has tried something similar for hunspell.
 
> The only reason for this limitation I could find is in
> ispell-process-line, which assumes that the byte offsets returned by
> the speller can be used to compute character position of the
> misspelled word in the buffer.  Are there any other places in
> ispell.el that assume unibyte characters?

I am not sure whether using UTF-8 and [:alpha:] has caused any problems
for aspell; I do not remember any reports about this.

> If ispell-process-line is the only place, then it should be easy to
> extend it so it handles correctly UTF-8 in addition to unibyte
> character sets.
> 
> In any case, I see no reason to specify CASECHARS, NOT-CASECHARS, and
> OTHERCHARS as ugly unibyte escapes, since their usage is entirely
> consistent with multibyte characters: they are used to construct
> regular expressions and match buffer text against those regexps.  

IIRC, the reason to use octal escapes is mostly that they are
encoding-independent.  Otherwise a .emacs file may end up with mixed
unibyte/multibyte encodings.

The current limitation in the doc string may just be a leftover from old
times.  I will try to check with a recent ispell American dictionary,
which can be invoked in UTF-8, and will let you know.

Regards,

-- 
Agustin




* Re: Ispell and unibyte characters
  2012-03-26 17:39 ` Agustin Martin
@ 2012-03-26 20:08   ` Eli Zaretskii
  2012-03-26 22:07     ` Lennart Borgman
  2012-03-28 19:18     ` Agustin Martin
  0 siblings, 2 replies; 25+ messages in thread
From: Eli Zaretskii @ 2012-03-26 20:08 UTC (permalink / raw)
  To: Agustin Martin; +Cc: emacs-devel

> Date: Mon, 26 Mar 2012 19:39:12 +0200
> From: Agustin Martin <agustin.martin@hispalinux.es>
> 
> Hi Eli,

Thanks for responding, I was beginning to think that no one was
interested.  In general, I find that ispell.el is in sore need of
modernization; at least that's my conclusion so far from playing with
hunspell (with which I want to replace my aging collection of Ispell
and its dictionaries that I have been using for many years).

> At least for aspell ispell.el already uses utf8 as default communication
> encoding and [:alpha:] as CASECHARS (and ^[:alpha:] as NOT-CASECHARS). 
> OTHERCHARS is guessed from aspell .dat file for given dictionary.

The question is, why isn't this done for every modern speller?  The only
one I know of that cannot handle UTF-8 is Ispell.

OTHERCHARS are not very important anyway, at least for languages I'm
interested in.

> Since currently it is not possible to ask hunspell for installed
> dictionaries (hunspell -D does not return control to the console)
> no one tried something similar for hunspell.

In what version do you have problems with -D?

In any case, hunspell supports multiple dictionaries in the same
session.  One can invoke it with, e.g., "-d en_US,de_DE,ru_RU,he_IL"
and have it spell-check mixed text that uses all these languages in
the same buffer (at least in theory; I didn't yet try that in my
experiments).  Clearly, this can only be done with UTF-8 or some
similar encoding.

So I think we should deprecate usage of the unibyte characters in the
ispell.el defaults, and simply use [:alpha:] for all languages.  As a
bonus, we can then get rid of the ridiculously long and hard-to-maintain
customization of each new dictionary you add to your repertoire.  Just
one entry will serve almost any language, or at least supply an
excellent default.
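
As a sketch of what I mean (the entry name, OTHERCHARS value and hunspell
arguments below are illustrative assumptions, not an existing default):

  (add-to-list 'ispell-dictionary-alist
               '("multi"                          ; dictionary name
                 "[[:alpha:]]" "[^[:alpha:]]"     ; casechars / not-casechars
                 "['-]" nil                       ; otherchars, many-otherchars-p
                 ("-d" "en_US,de_DE,ru_RU,he_IL") ; ispell-args for hunspell
                 nil utf-8))                      ; extended-character-mode, coding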

> > The only reason for this limitation I could find is in
> > ispell-process-line, which assumes that the byte offsets returned by
> > the speller can be used to compute character position of the
> > misspelled word in the buffer.  Are there any other places in
> > ispell.el that assume unibyte characters?
> 
> Not sure if using utf8 and [:alpha:] has caused some problem for aspell,
> I do not remember reports about this. 

Since I wrote that, I found that the problem was due to a bug in
hunspell (which I fixed in my copy): it reported byte offsets of the
misspelled words, rather than character offsets.  After fixing that
bug, there's no issue here anymore and nothing to fix in ispell.el.
There's a bug report with a patch about that in the hunspell bug
tracker, so there's reason to believe this bug will be fixed in a
future release.

> IIRC, the reason to use octal escapes is mostly that they are encoding
> independent.

They aren't; their encoding is guessed by Emacs based on the locale.
Using them is asking for trouble, IMO.  We specifically discourage the
use of unibyte text in the Emacs manuals, and yet we use it ourselves in
a package that is part of Emacs!

> Otherwise a .emacs file may have mixed unibyte/multibyte encodings.

I was talking about ispell.el, first and foremost.  There's no problem
with having ispell.el encoded in UTF-8, if needed (but I don't think
there's a need, see above).




* Re: Ispell and unibyte characters
  2012-03-26 20:08   ` Eli Zaretskii
@ 2012-03-26 22:07     ` Lennart Borgman
  2012-03-28 19:18     ` Agustin Martin
  1 sibling, 0 replies; 25+ messages in thread
From: Lennart Borgman @ 2012-03-26 22:07 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Agustin Martin, emacs-devel

On Mon, Mar 26, 2012 at 22:08, Eli Zaretskii <eliz@gnu.org> wrote:
>
> > Date: Mon, 26 Mar 2012 19:39:12 +0200
> > From: Agustin Martin <agustin.martin@hispalinux.es>
> >
> > Hi Eli,
>
> Thanks for responding, I was beginning to think that no one is
> interested.  In general, I find that ispell.el is in sore need of

I am interested, but I have just given up on this since I found I did
not have time to fix it. On w32 it was all a mess.




* Re: Ispell and unibyte characters
  2012-03-26 20:08   ` Eli Zaretskii
  2012-03-26 22:07     ` Lennart Borgman
@ 2012-03-28 19:18     ` Agustin Martin
  2012-03-29 18:06       ` Eli Zaretskii
  2012-04-10 19:08       ` Agustin Martin
  1 sibling, 2 replies; 25+ messages in thread
From: Agustin Martin @ 2012-03-28 19:18 UTC (permalink / raw)
  To: emacs-devel

On Mon, Mar 26, 2012 at 04:08:06PM -0400, Eli Zaretskii wrote:
> > Date: Mon, 26 Mar 2012 19:39:12 +0200
> > From: Agustin Martin <agustin.martin@hispalinux.es>
> > 
> > Hi Eli,
> 
> Thanks for responding, I was beginning to think that no one is
> interested.  In general, I find that ispell.el is in sore need of
> modernization; at least that's my conclusion so far from playing with
> hunspell (with which I want to replace my aging collection of Ispell
> and its dictionaries that I use for many years).
> 
> > At least for aspell ispell.el already uses utf8 as default communication
> > encoding and [:alpha:] as CASECHARS (and ^[:alpha:] as NOT-CASECHARS). 
> > OTHERCHARS is guessed from aspell .dat file for given dictionary.
> 
> The question is, why isn't this done for any modern speller.  The only
> one I know of that cannot handle UTF-8 is Ispell.

I think the only real remaining reason is for XEmacs compatibility. AFAIK 
XEmacs does not support [:alpha:].

I thought about filtering ispell-dictionary-base-alist when used from FSF
Emacs, so that it uses [:alpha:] there while still keeping compatibility.
I am currently a bit busy, but at some point I may try this for Debian
and see what happens.

For XEmacs in Debian GNU/*, even changing to [:alpha:] should have
limited impact, since strings provided by dictionary maintainers take
precedence; still, it is better if I can easily do the above anyway, so
that [:alpha:] is used when available.

Once the release happens, I'd like to commit some other changes to reduce
XEmacs incompatibilities in ispell.el and flyspell.el, so that my changes
for Debian GNU/* become smaller.

> OTHERCHARS are not very important anyway, at least for languages I'm
> interested in.
> 
> > Since currently it is not possible to ask hunspell for installed
> > dictionaries (hunspell -D does not return control to the console)
> > no one tried something similar for hunspell.
> 
> In what version do you have problems with -D?

Hunspell 1.3.2.  It does not return control until I press ^C.  Having -D
return would be useful if someone wanted to find out which hunspell
dictionaries are installed and build something on that info, in a way
similar to what is currently done for aspell in ispell.el.

> In any case, hunspell supports multiple dictionaries in the same
> session.  One can invoke it with, e.g., "-d en_US,de_DE,ru_RU,he_IL"
> and have it spell-check mixed text that uses all these languages in
> the same buffer (at least in theory; I didn't yet try that in my
> experiments).  Clearly, this can only be done with UTF-8 or some such
> as the encoding.

Right.

> So I think we should deprecate usage of the unibyte characters in the
> ispell.el defaults, and simply use [:alpha:] for all languages.  As a
> bonus, we can then get rid of the ridiculously long and hard to
> maintain customization of each new dictionary you add to your
> repertory.  Just one entry will serve almost any language, or at least
> supply an excellent default.
> 
> > > The only reason for this limitation I could find is in
> > > ispell-process-line, which assumes that the byte offsets returned by
> > > the speller can be used to compute character position of the
> > > misspelled word in the buffer.  Are there any other places in
> > > ispell.el that assume unibyte characters?
> > 
> > Not sure if using utf8 and [:alpha:] has caused some problem for aspell,
> > I do not remember reports about this. 
> 
> Since I wrote that, I found that the problem was due to a bug in
> hunspell (which I fixed in my copy): it reported byte offsets of the
> misspelled words, rather than character offsets.  After fixing that
> bug, there's no issue here anymore and nothing to fix in ispell.el.
> There's a bug report with a patch about that in the hunspell bug
> tracker, so there's reason to believe this bug will be fixed in a
> future release.

You mean

http://sourceforge.net/tracker/?func=detail&aid=3178449&group_id=143754&atid=756395

I filed that bug one year ago and received no reply from the hunspell
maintainers.  This year I received a follow-up with a proposed change,
but there is still no reply to it.

There is another problem that mostly hits reusing ispell default entries
under hunspell:

http://sourceforge.net/tracker/?func=detail&aid=2617130&group_id=143754&atid=756395

[~ prefixed strings are treated as words in pipe mode]

which has now been open for three years.  I have waited in the hope that
it would be fixed, but I think I will soon commit to Emacs the same
change I use for Debian, making sure extended-character-mode is nil for
hunspell.  I do not think the extended-character-mode pseudo-charsets
will ever be implemented in hunspell.

-- 
Agustin




* Re: Ispell and unibyte characters
  2012-03-28 19:18     ` Agustin Martin
@ 2012-03-29 18:06       ` Eli Zaretskii
  2012-03-29 21:13         ` Andreas Schwab
  2012-04-26  9:54         ` Eli Zaretskii
  2012-04-10 19:08       ` Agustin Martin
  1 sibling, 2 replies; 25+ messages in thread
From: Eli Zaretskii @ 2012-03-29 18:06 UTC (permalink / raw)
  To: Agustin Martin; +Cc: emacs-devel

> Date: Wed, 28 Mar 2012 21:18:21 +0200
> From: Agustin Martin <agustin.martin@hispalinux.es>
> 
> > OTHERCHARS are not very important anyway, at least for languages I'm
> > interested in.
> > 
> > > Since currently it is not possible to ask hunspell for installed
> > > dictionaries (hunspell -D does not return control to the console)
> > > no one tried something similar for hunspell.
> > 
> > In what version do you have problems with -D?
> 
> Hunspell 1.3.2. Does not return control until I press ^C. This may be useful
> if someone wants to know about installed hunspell dictionaries and prepare
> something to play with that info, in a way similar to what is currently done
> for aspell in ispell.el.

Well, to be fair to the Hunspell developers, the documentation doesn't
say that -D should exit after displaying the available dictionaries.
And the code really doesn't do that.  However, with a simple 2-liner
(below) I can make it do what you want.

> > Since I wrote that, I found that the problem was due to a bug in
> > hunspell (which I fixed in my copy): it reported byte offsets of the
> > misspelled words, rather than character offsets.  After fixing that
> > bug, there's no issue here anymore and nothing to fix in ispell.el.
> > There's a bug report with a patch about that in the hunspell bug
> > tracker, so there's reason to believe this bug will be fixed in a
> > future release.
> 
> You mean
> 
> http://sourceforge.net/tracker/?func=detail&aid=3178449&group_id=143754&atid=756395

Yes.

> I filed that bug one year ago and received no reply from hunspell
> maintainers. This year I received a followup with a proposed change, but
> there is still no reply to it.

I simply fixed this.  This _is_ Free Software, isn't it?

> There is other problem that mostly hits re-using ispell default entries
> under hunspell
> 
> http://sourceforge.net/tracker/?func=detail&aid=2617130&group_id=143754&atid=756395
> 
> [~ prefixed strings are treated as words in pipe mode]

Another easy fix (the feature is not implemented, so the code should
simply ignore such lines).

> that now stands for three years. I have waited in the hope this is fixed,

It's true that development seems to be slow, but then aspell
development is not exactly vibrant, either: neither speller has had a
release in many months.

Anyway, to me, Hunspell is a better tool, because of its support for
multiple dictionaries, which fixes the most annoying inconvenience in
Emacs spell-checking: the need to switch dictionaries according to the
language -- this is really a bad thing when you use Flyspell.

With multiple dictionaries, with very rare exceptions, one needs only a
single entry in ispell-dictionary-alist, listing all of the dictionaries
for the languages one normally uses, [[:alpha:]] as CASECHARS, and UTF-8
as the encoding.

> but I think I will soon commit to Emacs the same change I use for Debian, 
> making sure extended-character-mode is nil for hunspell.

Probably a good idea.


--- src/tools/hunspell.cxx~0	2011-01-21 19:01:29.000000000 +0200
+++ src/tools/hunspell.cxx	2012-03-21 16:40:31.255690500 +0200
@@ -1756,6 +1763,7 @@ int main(int argc, char** argv)
 		fprintf(stderr, gettext("SEARCH PATH:\n%s\n"), path);
 		fprintf(stderr, gettext("AVAILABLE DICTIONARIES (path is not mandatory for -d option):\n"));
 		search(path, NULL, NULL);
+		if (arg_files==-1) exit(0);
 	}
 
 	if (!privdicname) privdicname = mystrdup(getenv("WORDLIST"));




* Re: Ispell and unibyte characters
  2012-03-29 18:06       ` Eli Zaretskii
@ 2012-03-29 21:13         ` Andreas Schwab
  2012-03-30  6:28           ` Eli Zaretskii
  2012-04-26  9:54         ` Eli Zaretskii
  1 sibling, 1 reply; 25+ messages in thread
From: Andreas Schwab @ 2012-03-29 21:13 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Agustin Martin, emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

>> Date: Wed, 28 Mar 2012 21:18:21 +0200
>> From: Agustin Martin <agustin.martin@hispalinux.es>
>> 
>> > OTHERCHARS are not very important anyway, at least for languages I'm
>> > interested in.
>> > 
>> > > Since currently it is not possible to ask hunspell for installed
>> > > dictionaries (hunspell -D does not return control to the console)
>> > > no one tried something similar for hunspell.
>> > 
>> > In what version do you have problems with -D?
>> 
>> Hunspell 1.3.2. Does not return control until I press ^C. This may be useful
>> if someone wants to know about installed hunspell dictionaries and prepare
>> something to play with that info, in a way similar to what is currently done
>> for aspell in ispell.el.
>
> Well, to be fair to the Hunspell developers, the documentation doesn't
> say that -D should exit after displaying the available dictionaries.
> And the code really doesn't do that.  However, with a simple 2-liner
> (below) I can make it do what you want.

You can just redirect from /dev/null instead.
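
For example, a minimal sketch (hypothetical helper name, not something
already in ispell.el) of querying the installed dictionaries that way
from Emacs Lisp:

  (defun my-hunspell-list-dictionaries ()
    "Return the output of running \"hunspell -D\" as a string."
    (with-temp-buffer
      ;; stdin from /dev/null makes hunspell exit instead of waiting
      ;; for words; stdout and stderr both land in the buffer.
      (call-process "hunspell" "/dev/null" t nil "-D")
      (buffer-string)))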

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."




* Re: Ispell and unibyte characters
  2012-03-29 21:13         ` Andreas Schwab
@ 2012-03-30  6:28           ` Eli Zaretskii
  0 siblings, 0 replies; 25+ messages in thread
From: Eli Zaretskii @ 2012-03-30  6:28 UTC (permalink / raw)
  To: Andreas Schwab; +Cc: agustin.martin, emacs-devel

> From: Andreas Schwab <schwab@linux-m68k.org>
> Cc: Agustin Martin <agustin.martin@hispalinux.es>,  emacs-devel@gnu.org
> Date: Thu, 29 Mar 2012 23:13:19 +0200
> 
> > Well, to be fair to the Hunspell developers, the documentation doesn't
> > say that -D should exit after displaying the available dictionaries.
> > And the code really doesn't do that.  However, with a simple 2-liner
> > (below) I can make it do what you want.
> 
> You can just redirect from /dev/null instead.

Right, thanks.




* Re: Ispell and unibyte characters
  2012-03-28 19:18     ` Agustin Martin
  2012-03-29 18:06       ` Eli Zaretskii
@ 2012-04-10 19:08       ` Agustin Martin
  2012-04-10 19:11         ` Eli Zaretskii
  1 sibling, 1 reply; 25+ messages in thread
From: Agustin Martin @ 2012-04-10 19:08 UTC (permalink / raw)
  To: emacs-devel

[-- Attachment #1: Type: text/plain, Size: 1655 bytes --]

On Wed, Mar 28, 2012 at 09:18:21PM +0200, Agustin Martin wrote:
> On Mon, Mar 26, 2012 at 04:08:06PM -0400, Eli Zaretskii wrote:
> > > Date: Mon, 26 Mar 2012 19:39:12 +0200
> > > From: Agustin Martin <agustin.martin@hispalinux.es>
> > > 
> > > Hi Eli,
> > 
> > Thanks for responding, I was beginning to think that no one is
> > interested.  In general, I find that ispell.el is in sore need of
> > modernization; at least that's my conclusion so far from playing with
> > hunspell (with which I want to replace my aging collection of Ispell
> > and its dictionaries that I use for many years).
> > 
> > > At least for aspell ispell.el already uses utf8 as default communication
> > > encoding and [:alpha:] as CASECHARS (and ^[:alpha:] as NOT-CASECHARS). 
> > > OTHERCHARS is guessed from aspell .dat file for given dictionary.
> > 
> > The question is, why isn't this done for any modern speller.  The only
> > one I know of that cannot handle UTF-8 is Ispell.
> 
> I think the only real remaining reason is for XEmacs compatibility. AFAIK 
> XEmacs does not support [:alpha:].
> 
> I thought about filtering ispell-dictionary-base-alist when used from FSF
> Emacs, so it uses [:alpha:] and still keeps compatibility. I am currently a
> bit busy, but at some time I may try this for Debian and see what happens.

For the record, I am attaching what I am currently trying:
post-processing the global dictionary list while leaving local
definitions in ~/.emacs unmodified.  This should also deal with [#11200:
ispell.el sets incorrect encoding for the default dictionary].  I would
like to test this a bit more and commit it if there are no problems.

-- 
Agustin

[-- Attachment #2: ispell.el_alpha-regexp.2.diff --]
[-- Type: text/x-diff, Size: 2073 bytes --]

--- ispell.el.orig	2012-04-10 20:02:51.422092761 +0200
+++ ispell.el	2012-04-10 20:18:27.464680054 +0200
@@ -783,6 +783,12 @@
 (make-obsolete-variable 'ispell-aspell-supports-utf8
                         'ispell-encoding8-command "23.1")
 
+(defvar ispell-emacs-alpha-regexp
+  (if (string-match "^[[:alpha:]]+$" "abcde")
+      "[[:alpha:]]"
+    nil)
+  "[[:alpha:]] if Emacs supports [:alpha:] regexp, nil
+otherwise (current XEmacs does not support it).")
 
 ;;; **********************************************************************
 ;;; The following are used by ispell, and should not be changed.
@@ -1179,8 +1185,7 @@
 	       (error nil))
 	     ispell-really-aspell
 	     ispell-encoding8-command
-	     ;; XEmacs does not like [:alpha:] regexps.
-	     (string-match "^[[:alpha:]]+$" "abcde"))
+	     ispell-emacs-alpha-regexp)
 	(unless ispell-aspell-dictionary-alist
 	  (ispell-find-aspell-dictionaries)))
 
@@ -1204,8 +1209,27 @@
 			    ispell-dictionary-base-alist))
 	(unless (assoc (car dict) all-dicts-alist)
 	  (add-to-list 'all-dicts-alist dict)))
-      (setq ispell-dictionary-alist all-dicts-alist))))
+      (setq ispell-dictionary-alist all-dicts-alist))
 
+    ;; If Emacs flavor supports [:alpha:] use it for global dicts.  If
+    ;; spellchecker also supports UTF-8 via command-line option use it
+    ;; in communication.  This does not affect definitions in ~/.emacs.
+    (if ispell-emacs-alpha-regexp
+     	(let (tmp-dicts-alist)
+    	  (dolist (adict ispell-dictionary-alist)
+  	    (add-to-list 'tmp-dicts-alist
+   			 (list
+   			  (nth 0 adict)  ; dict name
+    			  "[[:alpha:]]"  ; casechars
+    			  "[^[:alpha:]]" ; not-casechars
+   			  (nth 3 adict)  ; otherchars
+    			  (nth 4 adict)  ; many-otherchars-p
+   			  (nth 5 adict)  ; ispell-args
+   			  (nth 6 adict)  ; extended-character-mode
+			  (if ispell-encoding8-command
+			      'utf-8
+			    (nth 7 adict)))))
+    	  (setq ispell-dictionary-alist tmp-dicts-alist)))))
 
 (defun ispell-valid-dictionary-list ()
   "Return a list of valid dictionaries.


* Re: Ispell and unibyte characters
  2012-04-10 19:08       ` Agustin Martin
@ 2012-04-10 19:11         ` Eli Zaretskii
  2012-04-12 14:36           ` Agustin Martin
  0 siblings, 1 reply; 25+ messages in thread
From: Eli Zaretskii @ 2012-04-10 19:11 UTC (permalink / raw)
  To: Agustin Martin; +Cc: emacs-devel

> Date: Tue, 10 Apr 2012 21:08:03 +0200
> From: Agustin Martin <agustin.martin@hispalinux.es>
> 
> For the records, I am attaching what I am currently trying, post-processing
> global dictionary list while leaving local definitions at ~/.emacs
> unmodified. This should also deal with [#11200: ispell.el sets incorrect
> encoding for the default dictionary]. I would like to test this a bit more
> and commit if there are no problems.

Thanks, looks good to me.




* Re: Ispell and unibyte characters
  2012-04-10 19:11         ` Eli Zaretskii
@ 2012-04-12 14:36           ` Agustin Martin
  2012-04-12 19:01             ` Eli Zaretskii
  0 siblings, 1 reply; 25+ messages in thread
From: Agustin Martin @ 2012-04-12 14:36 UTC (permalink / raw)
  To: emacs-devel

On Tue, Apr 10, 2012 at 10:11:38PM +0300, Eli Zaretskii wrote:
> > Date: Tue, 10 Apr 2012 21:08:03 +0200
> > From: Agustin Martin <agustin.martin@hispalinux.es>
> > 
> > For the records, I am attaching what I am currently trying, post-processing
> > global dictionary list while leaving local definitions at ~/.emacs
> > unmodified. This should also deal with [#11200: ispell.el sets incorrect
> > encoding for the default dictionary]. I would like to test this a bit more
> > and commit if there are no problems.
> 
> Thanks, looks good to me.

Just some info: this is taking longer than expected.

I am still dealing with an open issue here.  Some languages have
non-7-bit wordchars, like the Catalan middle dot, and these should be
converted to UTF-8 if the default communication encoding is changed to
UTF-8.

I have looked at the encoding stuff and I am currently trying something
like

(if ispell-encoding8-command
    ;; Convert non 7bit otherchars to utf-8 if needed
    (encode-coding-string
     (decode-coding-string (nth 3 adict) (nth 7 adict))
     'utf-8)
  (nth 3 adict)) ; otherchars

to get a new UTF-8 string, where

(nth 7 adict) -> dict-coding-system
(nth 3 adict) -> Original otherchars

but I get an sgml-lexical-context error.  I need to look more carefully,
so this will take longer.  I am far from an expert in handling encodings,
so comments are welcome.
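
For reference, here is a minimal worked example of the conversion I am
after, outside any dictionary entry (values assumed):

  (decode-coding-string "\267" 'iso-8859-1)
  ;; => "·", a multibyte string containing the middle dot character
  (encode-coding-string
   (decode-coding-string "\267" 'iso-8859-1) 'utf-8)
  ;; => "\302\267", the same character as UTF-8 bytes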

-- 
Agustin




* Re: Ispell and unibyte characters
  2012-04-12 14:36           ` Agustin Martin
@ 2012-04-12 19:01             ` Eli Zaretskii
  2012-04-13 15:25               ` Agustin Martin
  0 siblings, 1 reply; 25+ messages in thread
From: Eli Zaretskii @ 2012-04-12 19:01 UTC (permalink / raw)
  To: Agustin Martin; +Cc: emacs-devel

> Date: Thu, 12 Apr 2012 16:36:57 +0200
> From: Agustin Martin <agustin.martin@hispalinux.es>
> 
> I am still dealing with an open issue here. Some languages have non 7bit
> wordchars, like Catalan middledot, and it should be converted to UTF-8 if
> default communication language is changed to UTF-8.

Sorry, I don't understand: do you mean "non 8-bit wordchars"?  I don't
think 7 bits is assumed anywhere.

Assuming you did mean 8-bit, then why not use UTF-8 for Catalan from
the get-go?  Only some languages can use single-byte encodings, and
evidently Catalan is not one of them.  For that matter, why shouldn't
aspell and hunspell use UTF-8 by default (something I already asked)?

> I have looked at the encoding stuff and I am currently trying something
> like
> 
> (if ispell-encoding8-command
>     ;; Convert non 7bit otherchars to utf-8 if needed
>     (encode-coding-string
>      (decode-coding-string (nth 3 adict) (nth 7 adict))
>      'utf-8)
>   (nth 3 adict)) ; otherchars
> 
> to get new UTF-8 string where
> 
> (nth 7 adict) -> dict-coding-system
> (nth 3 adict) -> Original otherchars
> 
> but get a sgml-lexical-context error. Need to look more carefuly, so this
> will take longer. I am far from expert in handling encodings, so comments
> are welcome.

I don't understand what you are trying to accomplish by encoding
OTHERCHARS in UTF-8.  What exactly is the problem with them being
encoded in some 8-bit encoding?  Please explain.




* Re: Ispell and unibyte characters
  2012-04-12 19:01             ` Eli Zaretskii
@ 2012-04-13 15:25               ` Agustin Martin
  2012-04-13 15:53                 ` Eli Zaretskii
  2012-04-13 17:51                 ` Stefan Monnier
  0 siblings, 2 replies; 25+ messages in thread
From: Agustin Martin @ 2012-04-13 15:25 UTC (permalink / raw)
  To: emacs-devel

On Thu, Apr 12, 2012 at 10:01:30PM +0300, Eli Zaretskii wrote:
> I wrote:
> > I am still dealing with an open issue here. Some languages have non 7bit
> > wordchars, like Catalan middledot, and it should be converted to UTF-8 if
> > default communication language is changed to UTF-8.
> 
> Sorry, I don't understand: do you mean "non 8-bit wordchars"?  I don't
> think 7 bits is assumed anywhere.

I mean wordchars that cannot be represented in a 7-bit encoding, like the
Catalan middle dot (which is available in 8-bit Latin-1).

> Assuming you did mean 8-bit, then why not use UTF-8 for Catalan from
> the get-go?  Only some languages can use single-byte encodings, and
> evidently Catalan is not one of them.  For that matter, why shouldn't
> aspell and hunspell use UTF-8 by default (something I already asked)?

[...]

> I don't understand what are you trying to accomplish by encoding
> OTHERCHARS in UTF-8.  What exactly is the problem with them being
> encoded in some 8-bit encoding?  Please explain.

Imagine a fake entry in the general list, either in ispell.el or provided
through `ispell-base-dicts-override-alist' (no accented chars, for
simplicity):

("catala8"
     "[A-Za-z]" "[^A-Za-z]" "['\267-]" nil ("-B" "-d" "catalan") nil iso-8859-1)

Unless Emacs knows the encoding of \267 (the middle dot "·") it cannot
decode it properly.  I prefer not to use UTF-8 here, because I want the
entry to also be useful for ispell (and to stay XEmacs-compatible).  The
best approach here seems to be to decode the otherchars regexp according
to the provided coding system.

I have noticed that there seems to be no need to encode the resulting
string in UTF-8; Emacs will know what to do with the decoded string.

I tested something like

 (dolist (adict ispell-dictionary-alist)
  	    (add-to-list 'tmp-dicts-alist
   			 (list
   			  (nth 0 adict)  ; dict name
    			  "[[:alpha:]]"  ; casechars
    			  "[^[:alpha:]]" ; not-casechars
			  (if ispell-encoding8-command
			      ;; Decode 8bit otherchars if needed
			      (decode-coding-string (nth 3 adict) (nth 7 adict))
			    (nth 3 adict)) ; otherchars
    			  (nth 4 adict)  ; many-otherchars-p
   			  (nth 5 adict)  ; ispell-args
   			  (nth 6 adict)  ; extended-character-mode
			  (if ispell-encoding8-command
			      'utf-8
			    (nth 7 adict)))))

and it seems to work well.

> I wrote:
> > but get a sgml-lexical-context error. Need to look more carefuly, so this
> > will take longer.

I have tested further and this seems to be an unrelated problem.  Some
time ago I already noticed some problems with flyspell.el and sgml mode
(in particular psgml) regarding the sgml-lexical-context error
sgml-lexical-context: Wrong type argument: stringp, nil

sometimes when running flyspell-buffer after enabling flyspell-mode.  I
am also seeing something like

Error in post-command-hook (flyspell-post-command-hook):
(wrong-type-argument stringp nil)

when enabling flyspell-mode at the beginning of my sgml buffer.  I cannot
reproduce this with emacs -Q, and I am still trying to find where it
comes from.  Both problems were tested with emacs-snapshot_20120410.

For Debian I do not use sgml-lexical-context, but an improved version of
the old regexp, to try to keep things compatible with XEmacs.  This seems
to work well and has some advantages over sgml-lexical-context:

1) It is compatible with XEmacs.
2) It is about twice as fast as sgml-lexical-context when using
   flyspell-buffer.
3) It does not trigger the above error.

I am considering using this improved regexp instead of
sgml-lexical-context for the above reasons, but that is another issue.

-- 
Agustin





* Re: Ispell and unibyte characters
  2012-04-13 15:25               ` Agustin Martin
@ 2012-04-13 15:53                 ` Eli Zaretskii
  2012-04-13 16:38                   ` Agustin Martin
  2012-04-13 17:51                 ` Stefan Monnier
  1 sibling, 1 reply; 25+ messages in thread
From: Eli Zaretskii @ 2012-04-13 15:53 UTC (permalink / raw)
  To: Agustin Martin; +Cc: emacs-devel

> Date: Fri, 13 Apr 2012 17:25:25 +0200
> From: Agustin Martin <agustin.martin@hispalinux.es>
> 
> > I don't understand what are you trying to accomplish by encoding
> > OTHERCHARS in UTF-8.  What exactly is the problem with them being
> > encoded in some 8-bit encoding?  Please explain.
> 
> Imagine a fake entry in the general list, either in ispell.el or provided
> through `ispell-base-dicts-override-alist' (no accented chars for simplicity)
> 
> ("catala8"
>      "[A-Za-z]" "[^A-Za-z]" "['\267-]" nil ("-B" "-d" "catalan") nil iso-8859-1)
> 
> Unless emacs knows the encoding for \267 (middledot "·") it cannot decode it
> properly. I prefer to not use UTF-8 here, because I want the entry to also be
> useful for ispell (and also be XEmacs incompatible). The best approach here
> seems to decode the otherchars regexp according to provided coding-system.
> 
> I have noticed that there seems to be no need to encode the resulting string
> in UTF-8, Emacs will know what to do with the decoded string.
> 
> I tested something like
> 
>  (dolist (adict ispell-dictionary-alist)
>   	    (add-to-list 'tmp-dicts-alist
>    			 (list
>    			  (nth 0 adict)  ; dict name
>     			  "[[:alpha:]]"  ; casechars
>     			  "[^[:alpha:]]" ; not-casechars
> 			  (if ispell-encoding8-command
> 			      ;; Decode 8bit otherchars if needed
> 			      (decode-coding-string (nth 3 adict) (nth 7 adict))
> 			    (nth 3 adict)) ; otherchars
>     			  (nth 4 adict)  ; many-otherchars-p
>    			  (nth 5 adict)  ; ispell-args
>    			  (nth 6 adict)  ; extended-character-mode
> 			  (if ispell-encoding8-command
> 			      'utf-8
> 			    (nth 7 adict)))))
> 
> and seems to work well.

So you are taking the Catalan dictionary spec written for Ispell and
converting it to a spec that could be used to support more characters by
using UTF-8, is that right?  If so, I find this a bit kludgey.  How
about having a completely separate spec instead?  More generally, why
not separate ispell-dictionary-alist into 2 alists, one to be used
with Ispell, the other to be used with aspell and hunspell?  I think
this would be cleaner, don't you agree?





* Re: Ispell and unibyte characters
  2012-04-13 15:53                 ` Eli Zaretskii
@ 2012-04-13 16:38                   ` Agustin Martin
  0 siblings, 0 replies; 25+ messages in thread
From: Agustin Martin @ 2012-04-13 16:38 UTC (permalink / raw)
  To: emacs-devel

On Fri, Apr 13, 2012 at 06:53:57PM +0300, Eli Zaretskii wrote:
> > Date: Fri, 13 Apr 2012 17:25:25 +0200
> > From: Agustin Martin <agustin.martin@hispalinux.es>
> > 
> > > I don't understand what are you trying to accomplish by encoding
> > > OTHERCHARS in UTF-8.  What exactly is the problem with them being
> > > encoded in some 8-bit encoding?  Please explain.
> > 
> > Imagine a fake entry in the general list, either in ispell.el or provided
> > through `ispell-base-dicts-override-alist' (no accented chars for simplicity)
> > 
> > ("catala8"
> >      "[A-Za-z]" "[^A-Za-z]" "['\267-]" nil ("-B" "-d" "catalan") nil iso-8859-1)
> > 
> > Unless emacs knows the encoding for \267 (middledot "·") it cannot decode it
> > properly. I prefer to not use UTF-8 here, because I want the entry to also be
> > useful for ispell (and also be XEmacs incompatible). The best approach here
> > seems to decode the otherchars regexp according to provided coding-system.
> > 
> > I have noticed that there seems to be no need to encode the resulting string
> > in UTF-8, Emacs will know what to do with the decoded string.
> > 
> > I tested something like
> > 
> >  (dolist (adict ispell-dictionary-alist)
> >   	    (add-to-list 'tmp-dicts-alist
> >    			 (list
> >    			  (nth 0 adict)  ; dict name
> >     			  "[[:alpha:]]"  ; casechars
> >     			  "[^[:alpha:]]" ; not-casechars
> > 			  (if ispell-encoding8-command
> > 			      ;; Decode 8bit otherchars if needed
> > 			      (decode-coding-string (nth 3 adict) (nth 7 adict))
> > 			    (nth 3 adict)) ; otherchars
> >     			  (nth 4 adict)  ; many-otherchars-p
> >    			  (nth 5 adict)  ; ispell-args
> >    			  (nth 6 adict)  ; extended-character-mode
> > 			  (if ispell-encoding8-command
> > 			      'utf-8
> > 			    (nth 7 adict)))))
> > 
> > and seems to work well.
> 
> So you are taking the Catalan dictionary spec written for Ispell and
> convert it to a spec that could be used to support more characters by
> using UTF-8, is that right?  If so, I find this a bit kludgey.  

I think differently and like the above approach, because I find it much
more versatile for general definitions.  This is not a matter of blindly
reusing the ispell entries.  In particular, I noticed this problem in
Debian with the Catalan spec written for aspell (automatically created
from info provided by the aspell-ca package).  That info is written that
way so it is also useful for XEmacs, but with the above post-processing
it can work much better for Emacs.

> How
> about having a completely separate spec instead?  More generally, why
> not separate ispell-dictionary-alist into 2 alists, one to be used
> with Ispell, the other to be used with aspell and hunspell?  I think
> this would be cleaner, don't you agree?

As a matter of fact, that is what we do in Debian with the info provided
by the ispell, aspell and hunspell dictionary maintainers.  The
difference is that the provided info is supposed to be valid for both
Emacs and XEmacs, so I find post-processing as above very useful, because
it helps get the best behavior for Emacs.  The global dicts alist is
built from

(dolist (dict (append found-dicts-alist
  	    ispell-base-dicts-override-alist
	    ispell-dictionary-base-alist))

where the first entry found wins.  `found-dicts-alist' has the result of
the automatic search (currently used only for aspell) and has the highest
priority, while `ispell-dictionary-base-alist' is the fallback alist with
the lowest priority.  Depending on the spellchecker,
`ispell-base-dicts-override-alist' is set to an alist corresponding to
the ispell, aspell or hunspell dictionaries (they are handled
independently).

I do not think that maintaining separate hardcoded dict lists in
ispell.el for ispell, aspell and hunspell is worth it.

For hunspell, in the future I'd go for some sort of parsing mechanism
like the current one for aspell.

-- 
Agustin




* Re: Ispell and unibyte characters
  2012-04-13 15:25               ` Agustin Martin
  2012-04-13 15:53                 ` Eli Zaretskii
@ 2012-04-13 17:51                 ` Stefan Monnier
  2012-04-13 18:44                   ` Agustin Martin
  1 sibling, 1 reply; 25+ messages in thread
From: Stefan Monnier @ 2012-04-13 17:51 UTC (permalink / raw)
  To: emacs-devel

> ("catala8"
>      "[A-Za-z]" "[^A-Za-z]" "['\267-]" nil ("-B" "-d" "catalan") nil iso-8859-1)

> Unless emacs knows the encoding for \267 (middledot "·") it cannot decode it
> properly. I prefer to not use UTF-8 here, because I want the entry to also be
> useful for ispell (and also be XEmacs incompatible). The best approach here
> seems to decode the otherchars regexp according to provided coding-system.

There's something I don't understand here:

If you want a middle dot, why don't you put a middle dot?
I mean why write "['\267-]" rather than ['·-]?

I think this is related to your saying "I prefer to not use UTF-8 here",
but again I don't know what you mean by "use UTF-8", because using
a middle dot character in the source file does not imply using UTF-8
anywhere (the file can be saved in any encoding that includes the
middle dot).

For me notations like \267 should be used exclusively to talk about
*bytes*, not about *chars*.  So it might make sense to use those for
things like matching particular bytes in [ia]spell's output, but it
makes no sense to match chars in the buffer being spell-checked since
the buffer does not contain bytes but chars.
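
A small illustration of the distinction (assuming the forms below are
read from a multibyte source, e.g. a UTF-8 encoded .el file):

  (multibyte-string-p "\267")  ;; => nil: a unibyte string of one raw byte
  (multibyte-string-p "·")     ;; => t: a multibyte string of one character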


        Stefan




* Re: Ispell and unibyte characters
  2012-04-13 17:51                 ` Stefan Monnier
@ 2012-04-13 18:44                   ` Agustin Martin
  2012-04-14  1:57                     ` Stefan Monnier
  0 siblings, 1 reply; 25+ messages in thread
From: Agustin Martin @ 2012-04-13 18:44 UTC (permalink / raw)
  To: emacs-devel

On Fri, Apr 13, 2012 at 01:51:15PM -0400, Stefan Monnier wrote:
> > ("catala8"
> >      "[A-Za-z]" "[^A-Za-z]" "['\267-]" nil ("-B" "-d" "catalan") nil iso-8859-1)
> 
> > Unless emacs knows the encoding for \267 (middledot "·") it cannot decode it
> > properly. I prefer to not use UTF-8 here, because I want the entry to also be
> > useful for ispell (and also be XEmacs incompatible). The best approach here
> > seems to decode the otherchars regexp according to provided coding-system.
> 
> There's something I don't understand here:
> 
> If you want a middle dot, why don't you put a middle dot?
> I mean why write "['\267-]" rather than ['·-]?

The problem is that in a dictionary alist you can have dictionaries with
different unibyte encodings; if you happen to have two such chars in
different encodings, I'd expect problems.

I really should have gone into more detail about the system where I
noticed this, even if it is a bit Debian-specific.

I noticed this problem in the aspell Catalan entry provided by the Debian
aspell-ca package.  In Debian, the alists for the different aspell (and
ispell and hunspell) dictionaries are created on dictionary installation
and stored in a file (for the curious,
/var/cache/dictionaries-common/emacsen-ispell-dicts.el).  Some
maintainers provide \xxx escapes, some provide explicit chars in
different encodings, and all that info is put together in dict alist form
in that file, so it cannot be loaded with a single given encoding but
only as 'raw-text, and that implies loading as bytes rather than as
chars.

> I think this is related to your saying "I prefer to not use UTF-8 here",
> but again I don't know what you mean by "use UTF-8", because using
> a middle dot character in the source file does not imply using UTF-8
> anywhere (the file can be saved in any encoding that includes the
> middle dot).
> 
> For me notations like \267 should be used exclusively to talk about
> *bytes*, not about *chars*.  So it might make sense to use those for
> things like matching particular bytes in [ia]spell's output, but it
> makes no sense to match chars in the buffer being spell-checked since
> the buffer does not contain bytes but chars.

That is why I want to decode those bytes into actual chars to be used in
spellchecking, and make sure that they are decoded from the correct
coding system.  Otherwise, if the process coding system is changed to
UTF-8 and those stay as bytes matching the wrong encoding, things may not
work well.

If there is a consensus that I should not go the decode- way for
otherchars, I will not commit that part.  For Debian I can simply keep
loading emacsen-ispell-dicts.el as raw-text and do the decode- processing
on its contents before they are passed to ispell.el through
`ispell-base-dicts-override-alist', so that the latter contains chars
rather than bytes.  I however think it is better to keep the decode-
stuff for more general use.

I will wait at least a couple of days before committing, so it is clear
what to do.

Thanks all for your comments,

-- 
Agustin




* Re: Ispell and unibyte characters
  2012-04-13 18:44                   ` Agustin Martin
@ 2012-04-14  1:57                     ` Stefan Monnier
  2012-04-15  0:02                       ` Agustin Martin
  0 siblings, 1 reply; 25+ messages in thread
From: Stefan Monnier @ 2012-04-14  1:57 UTC (permalink / raw)
  To: emacs-devel

>> If you want a middle dot, why don't you put a middle dot?
>> I mean why write "['\267-]" rather than ['·-]?
> The problem is that in a dictionary alist you can have dictionaries with
> different unibyte encodings, if you happen to have two of that chars in
> different encodings I'd expect problems.

I still don't understand.  Can you be more specific?

>> For me notations like \267 should be used exclusively to talk about
>> *bytes*, not about *chars*.  So it might make sense to use those for
>> things like matching particular bytes in [ia]spell's output, but it
>> makes no sense to match chars in the buffer being spell-checked since
>> the buffer does not contain bytes but chars.
> That is why I want to decode those bytes into actual chars to be used in

If I understand correctly what you mean by "those bytes", then using "·"
instead of "\267" gives you the decoded form right away without having
to do extra work.

> spellchecking, and make sure that they are decoded from correct
> coding-system.  Otherwise if process coding-system is changed to UTF-8 and
> that stays as bytes matching the wrong encoding things may not work well.

I lost you here.  I agree that "if it stays as bytes" you're going to
suffer, which is why I propose to use chars instead.


        Stefan




* Re: Ispell and unibyte characters
  2012-04-14  1:57                     ` Stefan Monnier
@ 2012-04-15  0:02                       ` Agustin Martin
  2012-04-16  2:40                         ` Stefan Monnier
  0 siblings, 1 reply; 25+ messages in thread
From: Agustin Martin @ 2012-04-15  0:02 UTC (permalink / raw)
  To: emacs-devel

On 14 April 2012 at 03:57, Stefan Monnier
<monnier@iro.umontreal.ca> wrote:
>>> If you want a middle dot, why don't you put a middle dot?
>>> I mean why write "['\267-]" rather than ['·-]?
>> The problem is that in a dictionary alist you can have dictionaries with
>> different unibyte encodings, if you happen to have two of that chars in
>> different encodings I'd expect problems.
>
> I still don't understand.  Can you be more specific?

Imagine a Catalan dictionary with an iso-8859-1 "·" in otherchars and
another dictionary (I am guessing at the possibility to be more general;
I do not actually have a real example beyond our Debian file with all the
info put together) with another non-ASCII char in otherchars, but in a
different encoding (e.g., koi8-r).

The only way to have both coexist as chars in the same file is to use
multibyte UTF-8 chars instead of mixed unibyte iso-8859-1 and koi8-r, so
that Emacs properly gets chars when reading the file (if it properly
guesses the file coding system).  XEmacs seems to be a bit trickier
regarding UTF-8, but I'd expect things to work once proper decoding is
done.  The traditional alternative is to use octal codes to represent the
bytes matching the char in the dictionary's declared charset.

Using UTF-8 is actually what eliz proposed at the beginning of this
thread.  While one of the ispell.el doc strings claims that only unibyte
chars can be put here, I'd expect this to work, at least for Emacs.  As a
matter of fact, when I was first trying the (encode- (decode- ..)) way I
actually got UTF-8 (which was decoded again by `ispell-get-otherchars'
according to the new 'utf-8 coding system), and it seemed to work (apart
from the psgml/sgml-lexical-context problem) for Emacs.  At that time I
did not notice that once Emacs loads something as chars, encodings only
matter when writing it out (yes, I am really learning all this
encode-*/decode-* stuff in more depth in this thread).

I would, however, use this only in personal ~/.emacs files, and only if
needed.

>>> For me notations like \267 should be used exclusively to talk about
>>> *bytes*, not about *chars*.  So it might make sense to use those for
>>> things like matching particular bytes in [ia]spell's output, but it
>>> makes no sense to match chars in the buffer being spell-checked since
>>> the buffer does not contain bytes but chars.
>> That is why I want to decode those bytes into actual chars to be used in
>
> If I understand correctly what you mean by "those bytes", then using "·"
> instead of "\267" gives you the decoded form right away without having
> to do extra work.

That is true for files with a single encoding.  However, the problem
happens when a file has mixed encodings, like the Debian example I
mentioned.  I know this will not happen in real manually edited files,
but it can happen, and does happen, in aggregates like the one I
mentioned.

If the file is loaded with a given coding-system-for-read, chars in that
coding system will be properly interpreted by Emacs when reading, but not
the others.  Something like that happened with iso-8859-1/iso-8859-15
chars in

http://bugs.debian.org/337214

and the simple way to avoid the mess was to read as 'raw-text, which
indeed reads non-ASCII chars as pure bytes even though they were
originally written as chars (I mean not through octal codes), with no
implicit on-the-fly decoding or interpretation at all.  Not a big
problem: we know the encoding for every single dict, so things can be
properly decoded (\xxx + coding system gives a char).

If we later change the default communication encoding for entries in
that file, we need to decode the bytes obtained from the 'raw-text read
into actual chars, so they are internally handled as the desired chars.
Changing them to UTF-8 as well (and expecting ispell-get-otherchars to
decode again to chars) seems to work in Emacs, but it also seems
completely unnecessary.

I am getting more and more convinced that this is a Debian-only problem
because of the way we create that file, so I should handle this special
case as Debian-only and do the needed decoding there, not in ispell.el.

-- 
Agustin




* Re: Ispell and unibyte characters
  2012-04-15  0:02                       ` Agustin Martin
@ 2012-04-16  2:40                         ` Stefan Monnier
  2012-04-20 15:25                           ` Agustin Martin
  0 siblings, 1 reply; 25+ messages in thread
From: Stefan Monnier @ 2012-04-16  2:40 UTC (permalink / raw)
  To: Agustin Martin; +Cc: emacs-devel

> Imagine Catalan dictionary with iso-8859-1 "·" in otherchars and other
> dictionary (I am guessing the possibility to be more general, do not
> actually have a real example of something different from our Debian
> file with all info put together) with another upper char in
> otherchars, but in a different encoding (e.g., koi8r).

You're still living in Emacs-21/22: since Emacs-23, basically chars aren't
associated with their encoding (actually charset) any more.

> The only possibility to have both coexist as chars in the same file is
> to use multibyte UTF-8 chars instead of mixed unibyte iso-8859-1 and
> koi8r, so Emacs properly gets chars when reading the file (if properly
> guessing file coding-system).

Not at all; there are many encodings which cover a superset of
iso-8859-* and koi8-*.  UTF-8 is the most fashionable one nowadays, but
not anywhere close to the only one, e.g. there's also iso-2022,
emacs-mule, and then some.

> I'd however use this only in personal ~/.emacs files and if needed.

Why?  It would make the code more clear and simpler.

> That is true for files with a single encoding.  However, the problem
> happens when a file has mixed encodings like in the Debian example I
> mentioned.  I know, this will not happen in real manually edited files,
> but can happen and happens in aggregates like the one I mentioned.

That's an old solved problem.

> If file is loaded with a given coding-system-for-read chars in that
> coding-system will be properly interpreted by Emacs when reading, but
> not the others. Something like that happened with
> iso-8859-1/iso-8859-15 chars in

That was then.  Not any more.


        Stefan




* Re: Ispell and unibyte characters
  2012-04-16  2:40                         ` Stefan Monnier
@ 2012-04-20 15:25                           ` Agustin Martin
  2012-04-20 15:36                             ` Eli Zaretskii
  0 siblings, 1 reply; 25+ messages in thread
From: Agustin Martin @ 2012-04-20 15:25 UTC (permalink / raw)
  To: emacs-devel

[-- Attachment #1: Type: text/plain, Size: 4404 bytes --]

On Sun, Apr 15, 2012 at 10:40:29PM -0400, Stefan Monnier wrote:
> > Imagine Catalan dictionary with iso-8859-1 "·" in otherchars and other
> > dictionary (I am guessing the possibility to be more general, do not
> > actually have a real example of something different from our Debian
> > file with all info put together) with another upper char in
> > otherchars, but in a different encoding (e.g., koi8r).
> 
> You're still living in Emacs-21/22: since Emacs-23, basically chars aren't
> associated with their encoding (actually charset) any more.

Not once it is inside Emacs, but when reading a file the encoding
matters, and in some corner cases with mixed charsets Emacs may get the
wrong chars (with no sane way to make it automatically get the right
ones); see the attached file.

BTW, I do not even have Emacs 21/22 installed; I am testing this in
Emacs 23/24 (together with XEmacs, to check that I do not introduce
additional incompatibilities).

> > The only possibility to have both coexist as chars in the same file is
> > to use multibyte UTF-8 chars instead of mixed unibyte iso-8859-1 and
> > koi8r, so Emacs properly gets chars when reading the file (if properly
> > guessing file coding-system).
> 
> Not at all, there are many encodings which cover the superset of
> iso-8859-* and koi8-*.  UTF-8 is the more fashionable one nowadays, but
> not anywhere close to the only one. e.g. there's also iso-2022,
> emacs-mule, and then some.

Sorry, I should have written something like "supersets"; I gave UTF-8 as
an example.
 
> > I'd however use this only in personal ~/.emacs files and if needed.
> 
> Why?  It would make the code more clear and simpler.

To keep my Debian changes minimal I prefer to stay compatible with XEmacs
when possible.  That makes my life easier when adapting changes in the
FSF Emacs repo to Debian.  It seems that XEmacs has very recently added
support for automatic on-the-fly UTF-8 parsing, so my point of view may
change, but I admit I am currently biased towards the 7-bit \xxx strings.

Since Emacs should now (as of a few days ago) use [:alpha:] in CASECHARS
and NOT-CASECHARS for the global dicts, I think we should not worry very
much about this on the Emacs side, just about OTHERCHARS in the very few
cases where it contains a non-ASCII char (none in the current ispell.el).
And for that I still personally prefer to keep using the 7-bit "\xxx"
strings for now.

> > That is true for files with a single encoding.  However, the problem
> > happens when a file has mixed encodings like in the Debian example I
> > mentioned.  I know, this will not happen in real manually edited files,
> > but can happen and happens in aggregates like the one I mentioned.
> 
> That's an old solved problem.

Maybe we are speaking about different things, but as I understand this,
it does not seem so.  And I do not think this can be solved in a robust
enough way for all files.  See the attached file and the comments below.

> > If file is loaded with a given coding-system-for-read chars in that
> > coding-system will be properly interpreted by Emacs when reading, but
> > not the others. Something like that happened with
> > iso-8859-1/iso-8859-15 chars in
> 
> That was then.  Not any more.

I think you mean that iso-8859-* chars are currently unified.  I am aware
of that, but I am speaking about something different: mixed encodings,
also discussed in that thread together with the iso-8859-1/iso-8859-15
problems.

See attached file. It contains middledot in two encodings, UTF-8 in first
line and latin1 in the second, together with something that was originally
written as iso-8859-7 lowercase greek zeta.  On my iso-8859-1 box,
emacs24 (emacs-snapshot_20120410) reads it as

--
·
·
æ
--

so it gets the wrong char both for the UTF-8 line and for the greek
lowercase zeta.  In a different environment Emacs might have guessed
that the first line is UTF-8, but I do not see a robust enough way to
properly guess all the mixed encodings in a small file like this.
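
Just to illustrate the experiment (a sketch; the file name is the one
from the attachment): whatever single coding system one forces when
visiting it, at least one of the three lines is decoded to a wrong
character.

  ;; Force a single decoding for the whole file; try 'utf-8,
  ;; 'iso-8859-1 or 'iso-8859-7 -- each choice gets some line wrong.
  (let ((coding-system-for-read 'utf-8))
    (find-file-noselect "test.txt"))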

That is the kind of thing I am now dealing with for Debian.  It is
currently not a big problem for Emacs after the [:alpha:] changes for
casechars/not-casechars (the chance that a new dict adds otherchars in
incompatible charsets is small), but it can still happen to us for
XEmacs in that aggregated file.

Sorry if I did not make that clear enough and helped make this thread
this long.

Regards,

-- 
Agustin

[-- Attachment #2: test.txt --]
[-- Type: text/plain, Size: 10 bytes --]

·
·
æ

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Ispell and unibyte characters
  2012-04-20 15:25                           ` Agustin Martin
@ 2012-04-20 15:36                             ` Eli Zaretskii
  2012-04-20 16:17                               ` Agustin Martin
  0 siblings, 1 reply; 25+ messages in thread
From: Eli Zaretskii @ 2012-04-20 15:36 UTC (permalink / raw)
  To: Agustin Martin; +Cc: emacs-devel

> Date: Fri, 20 Apr 2012 17:25:32 +0200
> From: Agustin Martin <agustin.martin@hispalinux.es>
> 
> > That was then.  Not any more.
> 
> I think you mean that iso-8859-* chars are currently unified. I am aware of
> that, but I am speaking about something different, mixed encodings, also
> discussed in that thread together with the iso-8859-1/iso-8859-15 problems.
> 
> See attached file. It contains middledot in two encodings, UTF-8 in first
> line and latin1 in the second, together with something that was originally
> written as iso-8859-7 lowercase greek zeta.

Why should we care about files that mix encodings?  We were talking
about dictionary definitions in ispell.el, and that file will surely
NOT mix encodings.



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Ispell and unibyte characters
  2012-04-20 15:36                             ` Eli Zaretskii
@ 2012-04-20 16:17                               ` Agustin Martin
  2012-04-21  2:17                                 ` Stefan Monnier
  0 siblings, 1 reply; 25+ messages in thread
From: Agustin Martin @ 2012-04-20 16:17 UTC (permalink / raw)
  To: emacs-devel

On Fri, Apr 20, 2012 at 06:36:40PM +0300, Eli Zaretskii wrote:
> > Date: Fri, 20 Apr 2012 17:25:32 +0200
> > From: Agustin Martin <agustin.martin@hispalinux.es>
> > 
> > > That was then.  Not any more.
> > 
> > I think you mean that iso-8859-* chars are currently unified. I am aware of
> > that, but I am speaking about something different, mixed encodings, also
> > discussed in that thread together with the iso-8859-1/iso-8859-15 problems.
> > 
> > See attached file. It contains middledot in two encodings, UTF-8 in first
> > line and latin1 in the second, together with something that was originally
> > written as iso-8859-7 lowercase greek zeta.
> 
> Why should we care about files that mix encodings?  We were talking
> about dictionary definitions in ispell.el, and that file will surely
> NOT mix encodings.

Was just trying to explain why I was dealing with this for Debian. Changes
committed to Emacs bzr repo did not try to deal with this.

-- 
Agustin



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Ispell and unibyte characters
  2012-04-20 16:17                               ` Agustin Martin
@ 2012-04-21  2:17                                 ` Stefan Monnier
  0 siblings, 0 replies; 25+ messages in thread
From: Stefan Monnier @ 2012-04-21  2:17 UTC (permalink / raw)
  To: emacs-devel

> Was just trying to explain why I was dealing with this for Debian.
> Changes committed to Emacs bzr repo did not try to deal with this.

I still really have no clue what problem you're talking about.
ispell.el operates on buffers, so file encodings do not affect it, and
the only file involved is ispell.el itself, where we can choose the
encoding to be "non mixed".
And all of that should apply to Debian as well as to any other
environment where ispell.el might be used/distributed.
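
For instance (just one possible way of doing it, not necessarily what
ispell.el actually does), a coding cookie on the file's first line
pins a single explicit encoding for it:

  ;; -*- coding: utf-8 -*-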


        Stefan



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Ispell and unibyte characters
  2012-03-29 18:06       ` Eli Zaretskii
  2012-03-29 21:13         ` Andreas Schwab
@ 2012-04-26  9:54         ` Eli Zaretskii
  1 sibling, 0 replies; 25+ messages in thread
From: Eli Zaretskii @ 2012-04-26  9:54 UTC (permalink / raw)
  To: agustin.martin; +Cc: emacs-devel

> Date: Thu, 29 Mar 2012 20:06:17 +0200
> From: Eli Zaretskii <eliz@gnu.org>
> CC: emacs-devel@gnu.org
> 
> Anyway, to me, Hunspell is a better tool, because of its support for
> multiple dictionaries, which fixes the most annoying inconvenience in
> Emacs spell-checking: the need to switch dictionaries according to the
> language -- this is really a bad thing when you use Flyspell.
> 
> With multiple dictionaries, with very rare exceptions, one needs a
> single entry in ispell-dictionary-alist, having all of the
> dictionaries for languages one normally uses, [[:alpha:]] as
> CASECHARS, and UTF-8 as the encoding.

Unfortunately, I have to take that back.  Hunspell _does_ support
multiple dictionaries, but only if they can use the same .aff file.
When you invoke Hunspell with several dictionaries, as in

  hunspell -d "foo,bar,baz"

only the first dictionary is loaded with its .aff file; the rest use
that same .aff file.  Therefore, it is practically impossible to use
Hunspell to spell-check multi-lingual buffers without switching
dictionaries.  This feature _is_ useful when you want to add
specialized dictionaries (e.g., for terminology in some specific field
of knowledge or discipline) to the general dictionary of the same
language, though.
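
As a hedged sketch of that last case (the dictionary names below are
made up, and it only works because the specialized dictionary would
share en_US's .aff file), such a combined entry could look like:

  ;; Illustrative only: a single entry that asks Hunspell to load a
  ;; general dictionary plus a same-language specialized one.
  (add-to-list 'ispell-local-dictionary-alist
               '("american-medical"
                 "[[:alpha:]]" "[^[:alpha:]]" "[']" t
                 ("-d" "en_US,en_US-medical") nil utf-8))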

Sorry for posting misleading information.



^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2012-04-26  9:54 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-03-17 18:46 Ispell and unibyte characters Eli Zaretskii
2012-03-26 17:39 ` Agustin Martin
2012-03-26 20:08   ` Eli Zaretskii
2012-03-26 22:07     ` Lennart Borgman
2012-03-28 19:18     ` Agustin Martin
2012-03-29 18:06       ` Eli Zaretskii
2012-03-29 21:13         ` Andreas Schwab
2012-03-30  6:28           ` Eli Zaretskii
2012-04-26  9:54         ` Eli Zaretskii
2012-04-10 19:08       ` Agustin Martin
2012-04-10 19:11         ` Eli Zaretskii
2012-04-12 14:36           ` Agustin Martin
2012-04-12 19:01             ` Eli Zaretskii
2012-04-13 15:25               ` Agustin Martin
2012-04-13 15:53                 ` Eli Zaretskii
2012-04-13 16:38                   ` Agustin Martin
2012-04-13 17:51                 ` Stefan Monnier
2012-04-13 18:44                   ` Agustin Martin
2012-04-14  1:57                     ` Stefan Monnier
2012-04-15  0:02                       ` Agustin Martin
2012-04-16  2:40                         ` Stefan Monnier
2012-04-20 15:25                           ` Agustin Martin
2012-04-20 15:36                             ` Eli Zaretskii
2012-04-20 16:17                               ` Agustin Martin
2012-04-21  2:17                                 ` Stefan Monnier
