unofficial mirror of emacs-devel@gnu.org 
* Re: Bug 130397 (Was: Emacs - Ispell problem with i[no]german dictionary)
       [not found]       ` <E1BQ5z5-0000f4-5u@fencepost.gnu.org>
@ 2004-05-19 11:44         ` Agustin Martin
  2004-05-21  8:01           ` Agustin Martin
  0 siblings, 1 reply; 50+ messages in thread
From: Agustin Martin @ 2004-05-19 11:44 UTC (permalink / raw)
  Cc: emacs-devel

On Tue, May 18, 2004 at 10:54:23AM -0400, Richard Stallman wrote:
> If this is an issue of encodings, could you talk about it with
> handa@etl.go.jp, and cc emacs-devel?  Handa is the expert on
> this part of Emacs.
> 

Hi, 

As Richard suggested, I am writing to you about this problem. I have recently
been looking at some of the Debian spell-related emacs21 bug reports and was
trying to reproduce this one, as well as gathering additional information.
You can find the full thread at

http://bugs.debian.org/130397

and this is what I saw when trying to reproduce the problem (same contents
as in my mail to Richard with a minor addition)

Having french selected as ispell default I do:

a) Start emacs with fr_FR@euro locale and manually type the (misspelled)
   french word dèplorable. Run ispell-word on it. Bug reproduced: the
   high-bit character is not treated as part of the word.
b) After the previous step, I save the file containing that word and run emacs
   on it again. ispell-word now works as expected: it detects the complete
   misspelled word and proposes the right fix.

In both cases, emacs is called as

$ LC_ALL=fr_FR@euro emacs fr-test &

c) If I now type the misspelled word again after the previous one, the previous
   word is properly handled by ispell mode, but the bug is reproduced for
   the one just typed. Also, the two 'è' characters (previous and last one)
   clearly look different when using LC_ALL=fr_FR@euro. However, if I
   type it with a latin1 LC_ALL they look similar.

d) If I save the file and re-edit, ispell-word now works well on both words.

I have tested this with 'sid' Debian emacs21 (version 21.3+1-5)

This seems to be a problem with the way emacs internally handles different
encodings. My guess is that emacs handles the 'è' character (in case of
encoding problems in the mail, it is the grave-accented lowercase e)
differently when it is typed in the fr_FR@euro locale than when the file is
read from disk or the character is typed in the fr_FR locale.

This also reminds me of something I read somewhere about an 'emacs utf-8
support is not yet complete' problem, but I cannot remember now where I found
that.

I do not know if this is already fixed in CVS emacs, but it is better to be
sure that you are aware of this.

Thanks,

-- 
Agustin


* Re: Bug 130397 (Was: Emacs - Ispell problem with i[no]german dictionary)
  2004-05-19 11:44         ` Bug 130397 (Was: Emacs - Ispell problem with i[no]german dictionary) Agustin Martin
@ 2004-05-21  8:01           ` Agustin Martin
  0 siblings, 0 replies; 50+ messages in thread
From: Agustin Martin @ 2004-05-21  8:01 UTC (permalink / raw)


On Wed, May 19, 2004 at 01:44:04PM +0200, Agustin Martin wrote:

>  You can find the full thread at
> 
> http://bugs.debian.org/130397
> 
> and this is what I saw when trying to reproduce the problem (same contents
> as in my mail to Richard with a minor addition)
> 
> Having french selected as ispell default I do:
> ... 

Damn, what has been obfuscated in the lines below was not an email address
but a locale, it should read fr_FR-at-euro (deobfuscate at your convenience)
instead of the obfuscated fr_FR@euro.

By the way, I am not subscribed to emacs-devel, please cc me on replies

Cheers,

-- 
Agustin


* Re: Bug 130397 (Was: Emacs - Ispell problem with i[no]german dictionary)
       [not found]     ` <20040517120658.GA6919@agmartin.aq.upm.es>
       [not found]       ` <E1BQ5z5-0000f4-5u@fencepost.gnu.org>
@ 2004-12-17 12:15       ` Agustin Martin
  2004-12-22 12:37         ` Kenichi Handa
  1 sibling, 1 reply; 50+ messages in thread
From: Agustin Martin @ 2004-12-17 12:15 UTC (permalink / raw)
  Cc: Lionel Elie Mamane, emacs-devel

On Mon, May 17, 2004 at 02:06:58PM +0200, Agustin Martin wrote:

> My guess is that emacs is handling differently the 'è' character (In case of
> encoding problems in the mail, it is the grave lowercase e `e) when typed in
> the fr_FR@euro locale than when file is read or typed in the fr_FR locale.
> 

No news from upstream about this.

It seems that this problem is still present with sid emacs. Since sid
dictionaries-common has ispell.el patched to allow any coding-system
supported by emacs (including iso-8859-15 for {x}emacs21), I am considering
a new ispell.el patch to work around this latin0-latin1 unification problem.

I am playing with redefining the ispell-get-coding-system function in ispell.el
so that the dict coding-system is changed to iso-8859-15 if it was originally
iso-8859-1 and emacs has iso-8859-15 as buffer-file-coding-system, something
like

----------------------------------------
(defun ispell-get-coding-system ()
  (let (ispell-coding-system emacs-coding-system)
    (setq ispell-coding-system
	  (nth 7 (assoc ispell-dictionary ispell-dictionary-alist)))
    (setq emacs-coding-system
	  (coding-system-get buffer-file-coding-system 'mime-charset))
    (if (and (string-equal emacs-coding-system "iso-8859-15")
	     (string-equal ispell-coding-system "iso-8859-1"))
	emacs-coding-system
      ispell-coding-system)))
----------------------------------------

It seems to work for emacs21, but not for xemacs21 (this seems to be a bug in
the latter when giving the value of buffer-file-coding-system, just reported
as #285990).

This has the advantage that no special entries are needed for latin0 in the
ispell-dictionary-alist.

I will test this a bit more before uploading. If everything seems O.K. and
nobody opposes I will proceed this way.

Suggestions are welcome. I am cc'ing emacs-devel for their info.

Cheers,

-- 
Agustin


* Re: Bug 130397 (Was: Emacs - Ispell problem with i[no]german dictionary)
  2004-12-17 12:15       ` Agustin Martin
@ 2004-12-22 12:37         ` Kenichi Handa
  2004-12-22 17:13           ` Agustin Martin
  0 siblings, 1 reply; 50+ messages in thread
From: Kenichi Handa @ 2004-12-22 12:37 UTC (permalink / raw)
  Cc: lionel, emacs-devel, 130397

In article <20041217121515.GA2270@agmartin.aq.upm.es>, Agustin Martin <agustin.martin@hispalinux.es> writes:

> On Mon, May 17, 2004 at 02:06:58PM +0200, Agustin Martin wrote:
>>  My guess is that emacs is handling differently the 'è' character (In case of
>>  encoding problems in the mail, it is the grave lowercase e `e) when typed in
>>  the fr_FR@euro locale than when file is read or typed in the fr_FR locale.

> No news from upstream about this.

Sorry for the late response.  I overlooked your original
mail.  Your guess above is correct.  Emacs has multiple
different characters for e-grave.

> Having french selected as ispell default I do:
> 
> a) Start emacs with fr_FR@euro locale and manually type
>    the (misspelled) french word dèplorable. Try ispell-word
>    it. Bug reproduced, high bit is not considered a word
>    element.
>
> b) After previous step, I save the file containing that word and run emacs
>    on it again. ispell-word now works as expected and detects the complete
>    misspelled word proposing the right fix.
> 
> In both cases, emacs is called as
> 
> $ LC_ALL=fr_FR@euro emacs fr-test &
> 
> c) If I now type again the misspelled word after the previous one, previous
>    word is properly handled by ispell mode, but the bug is reproduced for
>    the just typed one. Also, both 'è' (previous and last one) clearly seem
>    to have a different look when using LC_ALL=fr_FR@euro. However, if I
>    type it with a latin1 LC_ALL they look similar.
> 
> d) If I save the file and re-edit, ispell-word now works well on both words.
> 
> I have tested this with 'sid' Debian emacs21 (version 21.3+1-5)

Please try the same thing with the latest CVS code.  With
that, when you type e-grave in the fr_FR@euro locale, the e-grave
of latin-iso8859-15 should be inserted in the buffer.  So, as long
as you are using a dictionary that uses iso-8859-15 encoding
(or in general, a dictionary that uses the same
encoding as your locale), you should not face the above
problem.

> Seems that this problem is still present with sid emacs. Since sid
> dictionaries-common has ispell.el patched to allow any coding-system
> supported by emacs (including iso-8859-15 for {x}emacs21) I am considering
> a new ispell.el patch to workaround this latin0-latin1 unification problem.

> I am playing with redefining ispell-get-coding-system function in ispell.el
> so dict coding-system is changed to iso-8859-15 if was originally
> iso-8859-1 and emacs has iso-8859-15 as buffer-file-coding-system, something
> like

> ----------------------------------------
> (defun ispell-get-coding-system ()
>   (let (ispell-coding-system emacs-coding-system)
>     (setq ispell-coding-system
> 	  (nth 7 (assoc ispell-dictionary ispell-dictionary-alist)))
>     (setq emacs-coding-system
> 	  (coding-system-get buffer-file-coding-system 'mime-charset))
>     (if (and (string-equal emacs-coding-system "iso-8859-15")
> 	     (string-equal ispell-coding-system "iso-8859-1"))
> 	emacs-coding-system
>       ispell-coding-system)))
> ----------------------------------------
>
> It seems to work for emacs21, but not for xemacs21 (seems a bug of this
> latter when giving the value of buffer-file-coding-system, just reported as
> #285990).
>
> This has the advantage that no special entries are needed for latin0 in the
> ispell-dictionary-alist.

At least you should check whether buffer-file-coding-system is
nil before calling coding-system-get.  But, anyway,
I think the above function is too ad-hoc.  As iso-8859-1 and
iso-8859-15 contain different sets of characters (even if
the differences are few), it's not good to treat them as the
same thing.

For instance, if a dictionary uses iso-8859-1 encoding, it
doesn't contain "\264" in its CASECHARS entry.  But, if a
dictionary uses iso-8859-15 encoding, it should contain
"\264" (Z-WITH-CARON) in its CASECHARS entry.

So, if you check the spelling of some word containing
Z-WITH-CARON with an iso-8859-1 dictionary, something
goes wrong.
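
As a rough illustration of the point above, here is a minimal sketch that
decodes the single byte "\264" under both coding systems.  It assumes a
current Emacs with Unicode-based internals rather than the Emacs 21 under
discussion, and the helper name `my-decode-byte-264' is hypothetical.

```elisp
;; Sketch only: `my-decode-byte-264' is a hypothetical helper name.
(defun my-decode-byte-264 (coding-system)
  "Return the character that byte \\264 decodes to under CODING-SYSTEM."
  ;; "\264" is a unibyte string holding the single byte 0xB4.
  (aref (decode-coding-string "\264" coding-system) 0))

(my-decode-byte-264 'iso-8859-1)   ; U+00B4 ACUTE ACCENT
(my-decode-byte-264 'iso-8859-15)  ; U+017D LATIN CAPITAL LETTER Z WITH CARON
```

The same byte therefore names a punctuation mark in one encoding and a
letter in the other, which is why a shared CASECHARS entry cannot be right
for both.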

---
Ken'ichi HANDA
handa@m17n.org


* Re: Bug 130397 (Was: Emacs - Ispell problem with i[no]german dictionary)
  2004-12-22 12:37         ` Kenichi Handa
@ 2004-12-22 17:13           ` Agustin Martin
  2005-01-04 12:50             ` Kenichi Handa
  2005-01-10 13:06             ` Lionel Elie Mamane
  0 siblings, 2 replies; 50+ messages in thread
From: Agustin Martin @ 2004-12-22 17:13 UTC (permalink / raw)
  Cc: lionel, emacs-devel

On Wed, Dec 22, 2004 at 09:37:32PM +0900, Kenichi Handa wrote:

> Please try the same thing with the latest CVS code.  With
> that, when you type e-grave in fr_FR@euro locale, e-grave of
> latin-iso8859-15 should be inserted in a buffer.  So, as far
> as you are using a dictionary that uses iso-8859-15 encoding
> (or in general, using a dictionary that uses the same
> encoding as your locale), you should not face the above
> problem.
> 

Thanks for the tip. I am not maintaining emacs, but a package for the common
dictionaries setup (dictionaries-common) that provides a recent and patched
ispell.el for all the different emacsen flavours ({x}emacs) to integrate the
different dicts and spellchecking engines in some way. I will be happy
to test this once it is included in sid emacs.

> > I am playing with redefining ispell-get-coding-system function in ispell.el
> > so dict coding-system is changed to iso-8859-15 if was originally
> > iso-8859-1 and emacs has iso-8859-15 as buffer-file-coding-system, something
> > like
> 
> At least you should check if buffer-file-coding-system is
> nil or not before calling coding-system-get.

Thanks for pointing this out, change added.
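
For readers following along, a guarded variant could look roughly like the
sketch below.  This is an illustration under stated assumptions, not the
code that went into dictionaries-common: the helper name
`my-ispell-pick-coding-system' is hypothetical, and it takes both coding
systems as arguments so that the nil case is easy to see.

```elisp
;; Sketch only; `my-ispell-pick-coding-system' is a hypothetical helper,
;; not part of ispell.el.
(defun my-ispell-pick-coding-system (dict-cs buffer-cs)
  "Return iso-8859-15 when DICT-CS is iso-8859-1 and BUFFER-CS is iso-8859-15.
Otherwise return DICT-CS unchanged.  BUFFER-CS may be nil."
  (let ((buffer-charset (and buffer-cs
                             (coding-system-get buffer-cs 'mime-charset))))
    (if (and buffer-charset
             (string-equal buffer-charset "iso-8859-15")
             (string-equal dict-cs "iso-8859-1"))
        buffer-charset
      dict-cs)))

;; Intended use inside a redefined `ispell-get-coding-system':
;; (my-ispell-pick-coding-system
;;  (nth 7 (assoc ispell-dictionary ispell-dictionary-alist))
;;  buffer-file-coding-system)
```

The `and' guard means coding-system-get is never called with nil, which is
the check requested above.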

> But, anyway,
> I think the above function is too ad-hoc.  As iso-8859-1 and
> iso-8859-15 contains different set of characters (even if
> they are few), it's not good to treat them as the same
> thing.
> 
> For instance, if a dictionary uses iso-8859-1 encoding, it
> doesn't contain "\264" in CASECHARS entry.  But, if a
> dictionary uses iso-8859-15 encoding, it should contain
> "\264" (Z-WITH-CARON) in CASECHARS entry.
> 
> So, if you are going to check the spell of some word
> containing Z-WITH-CARON by iso-8859-1 dictionary, something
> goes wrong.
> 

I was aware of this, but thanks anyway for the reminder. The code is probably
too ad-hoc, but the latin{0,1} thing is also a somewhat ad-hoc scenario, where
latin0 should really have been named something like iso-8859-1v2, that is,
a revision. I cannot imagine somebody using an iso-8859-2 dict and trying to
write in an iso-8859-1 buffer, but with iso-8859-1 and iso-8859-15 that is
happening too frequently.

So we have a lot of people who blindly select the @euro locale variant
without realizing its implications, and that iso-8859-1 and iso-8859-15
are different, but very close encodings (from a practical point of view,
they are fully equivalent for most languages except, IIRC, french (oe,"Y) and
finnish ({sSzZ}^, where ^ stands for caron); the euro symbol seems not
significant to spellchecking).

Furthermore (this is probably fixed by the CVS code you mentioned above),
in current sid emacs utf-8 files can be checked with a latin1 dict (of
course if they do not use chars outside latin1) using the ispell.el
internal reencodings, but this fails for a dict declared as iso-8859-15.

The current state of ispell dicts in Debian is that ifrench is iso-8859-15
by default (although it has a real latin1 entry), while finnish does not set
the {s,z}-caron chars at all, so it is a fully latin1 entry. aspell-fr and
aspell-fi are set to plain latin1.

So the only language that might currently require extra work is french, and
for it I find it reasonable to use the iso-8859-15 entry as the emacs default
(tagged as iso-8859-1 for the above system to work). For this I would like
to hear Lionel's point of view, since he has put a lot of effort into making
iso-8859-15 available for spellchecking (Hi, Lionel).

I personally do not like having separate iso-8859-15 entries unless they are
really required. For the above dicts, that would be for french, and I am not
at all sure that it is really required.

Thanks a lot for your feedback, Handa.

Cheers,

-- 
Agustin


* Re: Bug 130397 (Was: Emacs - Ispell problem with i[no]german dictionary)
  2004-12-22 17:13           ` Agustin Martin
@ 2005-01-04 12:50             ` Kenichi Handa
  2005-01-04 14:55               ` Bug 130397 Stefan
  2005-01-07 15:34               ` Bug 130397 (Was: Emacs - Ispell problem with i[no]german dictionary) Agustin Martin
  2005-01-10 13:06             ` Lionel Elie Mamane
  1 sibling, 2 replies; 50+ messages in thread
From: Kenichi Handa @ 2005-01-04 12:50 UTC (permalink / raw)
  Cc: lionel, emacs-devel, 130397

In article <20041222171306.GA4462@agmartin.aq.upm.es>, Agustin Martin <agustin.martin@hispalinux.es> writes:

> I was aware of this, but anyway thanks for reminding. Code is probably too
> ad-hoc, but latin{0,1} thing is also a somewhat ad-hoc scenario, where
> latin0 should have really be named as something like iso-8859-1v2, that is,
> a revision. I cannot imagine somebody using a iso-8859-2 dict and trying to
> write in a iso8859-1 buffer, but with iso-8859-1 and iso-8859-15 that is
> happening too frequently. 

> So we have a lot of people that blindly select the locale @euro variant
> without realizing its implications, and that iso-8859-1 and iso-8859-15
> are different, but very close encodings (from a practical point of view,
> they are fully equivalent for most languages but IIRC french (oe,"Y) and
> finnish {sSzZ}^, ^ stands for caron; the euro symbol seems not significant
> to spellchecking). 

> Furthermore (this is probably fixed by the CVS code you mentioned above),
> in current sid emacs utf-8 files can be checked with a latin1 dict (of
> course if they do not use chars outside latin1) using the ispell.el
> internal reencodings, but fails for iso-8859-15 declared dict.

No, this is not yet fixed.

> The current state of ispell dicts in Debian is that ifrench is iso-8859-15
> as default (although has a real latin1 entry), while finnish do not set at
> all the {s,z}-caron chars, so it is a fully latin1 entry. aspell-fr and
> aspell-fi are set to plain latin1.

> So the only language that might currently require extra work is french, and
> for it I find reasonable to use for emacs as default the iso-8859-15 entry
> (tagged as iso-8859-1 for the above system to work). For this I would like
> to hear Lionel's point of view, since he has put a lot of effort to make
> iso-8859-15 available for spellchecking (Hi, Lionel). 

> I personally do not like having separate iso-8859-15 entries unless they are
> really required. For the above dicts, that would be for french, and I am not
> at all sure that it is really required.

Hmmm, then how about the attached patch to the latest CVS
emacs?  With that, all equivalent characters (e.g. a-grave in
all latin-X) should be handled well.  This patch should also
be applicable to Emacs 21.3 but has not yet been tested in
that version.

---
Ken'ichi HANDA
handa@m17n.org


*** ispell.el	25 Dec 2004 11:43:11 +0900	1.151
--- ispell.el	03 Jan 2005 16:05:48 +0900	
***************
*** 1074,1088 ****
        (decode-coding-string str (ispell-get-coding-system))
      str))
  
  (defun ispell-get-casechars ()
!   (ispell-decode-string
!    (nth 1 (assoc ispell-dictionary ispell-dictionary-alist))))
  (defun ispell-get-not-casechars ()
!   (ispell-decode-string
!    (nth 2 (assoc ispell-dictionary ispell-dictionary-alist))))
  (defun ispell-get-otherchars ()
!   (ispell-decode-string
!    (nth 3 (assoc ispell-dictionary ispell-dictionary-alist))))
  (defun ispell-get-many-otherchars-p ()
    (nth 4 (assoc ispell-dictionary ispell-dictionary-alist)))
  (defun ispell-get-ispell-args ()
--- 1074,1127 ----
        (decode-coding-string str (ispell-get-coding-system))
      str))
  
+ (put 'ispell-unified-chars-table 'char-table-extra-slots 0)
+ 
+ ;; Char-table that maps an Unicode character (charset:
+ ;; latin-iso8859-1, mule-unicode-0100-24ff) to
+ ;; a string in which all equivalent characters are listed.
+ 
+ (defconst ispell-unified-chars-table
+   (let ((table (make-char-table 'ispell-unified-chars-table)))
+     (map-char-table
+      #'(lambda (c v)
+ 	 (if (and v (/= c v))
+ 	     (let ((unified (or (aref table v) (string v))))
+ 	       (aset table v (concat unified (string c))))))
+      ucs-mule-8859-to-mule-unicode)
+     table))
+ 
+ ;; Return a string decoded from Nth element of the current dictionary
+ ;; while splicing equivalent characters into the string.  This splicing
+ ;; is done only if the string is a regular expression of the form
+ ;; "[...]" because, otherwise, splicing will result in incorrect
+ ;; regular expression matching.
+ 
+ (defun ispell-get-decoded-string (n)
+   (let* ((slot (assoc ispell-dictionary ispell-dictionary-alist))
+ 	 (str (nth n slot)))
+     (when (and (> (length str) 0)
+ 	       (not (multibyte-string-p str)))
+       (setq str (ispell-decode-string str))
+       (if (and (= (aref str 0) ?\[)
+ 	       (eq (string-match "\\]" str) (1- (length str))))
+ 	  (setq str
+ 		(string-as-multibyte
+ 		 (mapconcat
+ 		  #'(lambda (c)
+ 		      (let ((unichar (aref ucs-mule-8859-to-mule-unicode c)))
+ 			(if unichar
+ 			    (aref ispell-unified-chars-table unichar)
+ 			  (string c))))
+ 		  str ""))))
+       (setcar (nthcdr n slot) str))
+     str))
+ 
  (defun ispell-get-casechars ()
!   (ispell-get-decoded-string 1))
  (defun ispell-get-not-casechars ()
!   (ispell-get-decoded-string 2))
  (defun ispell-get-otherchars ()
!   (ispell-get-decoded-string 3))
  (defun ispell-get-many-otherchars-p ()
    (nth 4 (assoc ispell-dictionary ispell-dictionary-alist)))
  (defun ispell-get-ispell-args ()


* Re: Bug 130397
  2005-01-04 12:50             ` Kenichi Handa
@ 2005-01-04 14:55               ` Stefan
  2005-01-05  2:00                 ` Kenichi Handa
  2005-01-07 15:34               ` Bug 130397 (Was: Emacs - Ispell problem with i[no]german dictionary) Agustin Martin
  1 sibling, 1 reply; 50+ messages in thread
From: Stefan @ 2005-01-04 14:55 UTC (permalink / raw)
  Cc: 130397, Agustin Martin, lionel, Ken Stevens, emacs-devel

> Hmmm, then how about the attached patch to the latest CVS
> emacs?  With that, all equivalent characters (e.g a-grave in
> all latin-X) should be handled well.  This patch will be
> applicable also to Emacs 21.3 but not yet tested in that
> version.

Can someone explain to me why ispell.el needs those kinds of things?

My vague understanding is that ispell.el needs to know which chars are part
of a word and that in the past (pre-MULE), this had to be redefined for each
and every language since the codes 128-255 could mean completely
different things.

Why can't ispell.el just use the `w' syntax to decide what is a word and
then rely on the decoding/encoding to do the rest of the work?

That would fix the problem where a word like "expérience" is checked as two
words if the dictionary is "american".

> + ;; Char-table that maps an Unicode character (charset:
> + ;; latin-iso8859-1, mule-unicode-0100-24ff) to
> + ;; a string in which all equivalent characters are listed.
> + 
> + (defconst ispell-unified-chars-table
> +   (let ((table (make-char-table 'ispell-unified-chars-table)))
> +     (map-char-table
> +      #'(lambda (c v)
> + 	 (if (and v (/= c v))
> + 	     (let ((unified (or (aref table v) (string v))))
> + 	       (aset table v (concat unified (string c))))))
> +      ucs-mule-8859-to-mule-unicode)
> +     table))

All the elements of this table should be multibyte strings.
For this, we may need to wrap the (string X) into
(string-to-multibyte (string X))

> + 		(string-as-multibyte
> + 		 (mapconcat
> + 		  #'(lambda (c)
> + 		      (let ((unichar (aref ucs-mule-8859-to-mule-unicode c)))
> + 			(if unichar
> + 			    (aref ispell-unified-chars-table unichar)
> + 			  (string c))))
> + 		  str ""))))

Do you expect the output of mapconcat to be unibyte and to contain
emacs-mule encoding of multibyte chars?  I don't.
So I'd recommend string-to-multibyte rather than string-as-multibyte.
If I'm wrong, could you explain where the emacs-mule encoding
got introduced?


        Stefan


* Re: Bug 130397
  2005-01-04 14:55               ` Bug 130397 Stefan
@ 2005-01-05  2:00                 ` Kenichi Handa
  2005-01-05  4:42                   ` Stefan Monnier
  0 siblings, 1 reply; 50+ messages in thread
From: Kenichi Handa @ 2005-01-05  2:00 UTC (permalink / raw)
  Cc: agustin.martin, lionel, emacs-devel, k.stevens, 130397

In article <m1llb9p887.fsf-monnier+emacs@gnu.org>, Stefan <monnier@iro.umontreal.ca> writes:

>>  Hmmm, then how about the attached patch to the latest CVS
>>  emacs?  With that, all equivalent characters (e.g a-grave in
>>  all latin-X) should be handled well.  This patch will be
>>  applicable also to Emacs 21.3 but not yet tested in that
>>  version.

> Can someone explain to me why ispell.el needs those kinds of things?

> My vague understanding is that ispell.el needs to know which chars are part
> of a word and that in the past (pre-MULE), this had to be redefined for each
> and every language since the codes 128-255 could mean completely
> different things.

> Why can't ispell.el just use the `w' syntax to decide what is a word and
> then rely on the decoding/encoding to do the rest of the work?

> That would fix the problem where a word like "expérience" is checked as two
> words if the dictionary is "american".

That will cause another problem.  For instance, when we have
"español" in a buffer and the ispell dictionary is czech
(latin-2), as "español" is encoded into "espa?ol" by
latin-2, it causes the error "Ispell and its process have
different character maps" because ispell returns the result
for two words, "espa" and "ol".

>>  + ;; Char-table that maps an Unicode character (charset:
>>  + ;; latin-iso8859-1, mule-unicode-0100-24ff) to
>>  + ;; a string in which all equivalent characters are listed.
>>  + 
>>  + (defconst ispell-unified-chars-table
>>  +   (let ((table (make-char-table 'ispell-unified-chars-table)))
>>  +     (map-char-table
>>  +      #'(lambda (c v)
>>  + 	 (if (and v (/= c v))
>>  + 	     (let ((unified (or (aref table v) (string v))))
>>  + 	       (aset table v (concat unified (string c))))))
>>  +      ucs-mule-8859-to-mule-unicode)
>>  +     table))

> All the elements of this table should be multibyte strings.
> For this, we may need to wrap the (string X) into
> (string-to-multibyte (string X))

As `c' and `v' are always multibyte characters, (string X)
always return a multibyte string.

>>  + 		(string-as-multibyte
>>  + 		 (mapconcat
>>  + 		  #'(lambda (c)
>>  + 		      (let ((unichar (aref ucs-mule-8859-to-mule-unicode c)))
>>  + 			(if unichar
>>  + 			    (aref ispell-unified-chars-table unichar)
>>  + 			  (string c))))
>>  + 		  str ""))))

> Do you expect the output of mapconcat to be unibyte and to contain
> emacs-mule encoding of multibyte chars?

No.  STR may be an ASCII-only string, in which case the
result of mapconcat is a unibyte ASCII-only string.  I'd
like to change it to a multibyte ASCII-only string to avoid
converting STR again and again in such a case.

---
Ken'ichi HANDA
handa@m17n.org


* Re: Bug 130397
  2005-01-05  2:00                 ` Kenichi Handa
@ 2005-01-05  4:42                   ` Stefan Monnier
  2005-01-05  5:50                     ` Kenichi Handa
  0 siblings, 1 reply; 50+ messages in thread
From: Stefan Monnier @ 2005-01-05  4:42 UTC (permalink / raw)
  Cc: agustin.martin, lionel, emacs-devel, k.stevens, 130397

>> Why can't ispell.el just use the `w' syntax to decide what is a word and
>> then rely on the decoding/encoding to do the rest of the work?

>> That would fix the problem where a word like "expérience" is checked as two
>> words if the dictionary is "american".

> That will cause another problem.  For instance, when we have
> "español" in a buffer and the ispell dictionary is czech
> (latin-2), as "español" is encoded into "espa?ol" by
> latin-2, it causes the error "Ispell and its process have
> different character maps" because ispell returns the result
> of two words "espa" and "ol".

But ispell.el should be able to automatically check whether the chars can be
safely encoded with the coding-system and if not (as in your example),
ispell.el will know that the word can't be checked by ispell and should
just be skipped (and maybe marked as "uncheckable").
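
A minimal sketch of such an encodability check, assuming the built-in
`unencodable-char-position' is used for it; the helper name
`my-word-encodable-p' is hypothetical and nothing like it is claimed to
exist in ispell.el:

```elisp
;; Sketch only; `my-word-encodable-p' is a hypothetical helper.
(defun my-word-encodable-p (word coding-system)
  "Return non-nil if every character of WORD can be encoded by CODING-SYSTEM."
  ;; With the STRING argument given, START and END are string indices.
  (null (unencodable-char-position 0 (length word) coding-system nil word)))

(my-word-encodable-p "espa\u00f1ol" 'iso-8859-2)  ; nil: latin-2 has no n-tilde
(my-word-encodable-p "espa\u00f1ol" 'iso-8859-1)  ; t
```

A word for which this returns nil would be skipped, or marked as
uncheckable, instead of being sent to the ispell process.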

>>> + 		(string-as-multibyte
>>> + 		 (mapconcat
>>> + 		  #'(lambda (c)
>>> + 		      (let ((unichar (aref ucs-mule-8859-to-mule-unicode c)))
>>> + 			(if unichar
>>> + 			    (aref ispell-unified-chars-table unichar)
>>> + 			  (string c))))
>>> + 		  str ""))))

>> Do you expect the output of mapconcat to be unibyte and to contain
>> emacs-mule encoding of multibyte chars?

> No.  STR may be an ASCII-only string, in which case, the
> result of mapconcat is a unibyte ASCII-only string.  I'd
> like to change it to a multibyte ASCII-only string to avoid
> converting STR again and again in such a case.

Then string-to-multibyte sounds like a safer choice.
`string-as-multibyte' has very strange semantics, I recommend we avoid it as
much as possible.


        Stefan


* Re: Bug 130397
  2005-01-05  4:42                   ` Stefan Monnier
@ 2005-01-05  5:50                     ` Kenichi Handa
  2005-01-05 14:02                       ` Stefan Monnier
  2005-01-07 15:36                       ` Agustin Martin
  0 siblings, 2 replies; 50+ messages in thread
From: Kenichi Handa @ 2005-01-05  5:50 UTC (permalink / raw)
  Cc: 130397, agustin.martin, lionel, k.stevens, emacs-devel

Stefan Monnier <monnier@iro.umontreal.ca> writes:
> But ispell.el should be able to automatically check whether the chars can be
> safely encoded with the coding-system and if not (as in your example),
> ispell.el will know that the word can't be checked by ispell and should
> just be skipped (and maybe marked as "uncheckable").

That seems to be a good approach.  But just checking
whether the chars are encodable with the coding-system is not
enough.  For instance, the entry for the "francais" dict doesn't
contain "ñ" in CASECHARS, but "español" is safely encodable
by iso-8859-1.  So, the same error happens.  For ispell.el
to know that "español" is uncheckable, we anyway need the
current database ispell-dictionary-alist.
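
To make the distinction concrete, here is a small sketch of a
CASECHARS-style test.  The value of `my-french-casechars' is an assumed,
simplified character class for illustration only, not the real "francais"
entry from ispell-dictionary-alist, and `my-word-in-casechars-p' is a
hypothetical helper.

```elisp
;; Sketch only.  `my-french-casechars' is an assumed, simplified CASECHARS
;; value (a-z, A-Z and a few accented French letters), not the real entry.
(defvar my-french-casechars
  "[A-Za-z\u00e0\u00e2\u00e7\u00e8\u00e9\u00ea\u00ee\u00f4\u00fb]")

(defun my-word-in-casechars-p (word casechars)
  "Return t if WORD consists only of characters matching CASECHARS.
CASECHARS is a regexp that matches a single word character."
  (let ((case-fold-search nil))
    (and (string-match-p (concat "\\`\\(?:" casechars "\\)+\\'") word)
         t)))

(my-word-in-casechars-p "d\u00e8plorable" my-french-casechars) ; t
(my-word-in-casechars-p "espa\u00f1ol" my-french-casechars)    ; nil
```

Both words are encodable in iso-8859-1, yet only the first passes, which is
the extra information the dictionary entry carries.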

By the way, isn't it possible to make that database
automatically from *.aff?

>>  No.  STR may be an ASCII-only string, in which case, the
>>  result of mapconcat is a unibyte ASCII-only string.  I'd
>>  like to change it to a multibyte ASCII-only string to avoid
>>  converting STR again and again in such a case.

> Then string-to-multibyte sounds like a safer choice.
> `string-as-multibyte' has very strange semantics, I recommend we avoid it as
> much as possible.

Ok, I agree with using string-to-multibyte here.

---
Ken'ichi HANDA
handa@m17n.org


* Re: Bug 130397
  2005-01-05  5:50                     ` Kenichi Handa
@ 2005-01-05 14:02                       ` Stefan Monnier
  2005-01-06  0:44                         ` Kenichi Handa
  2005-01-07 15:36                       ` Agustin Martin
  1 sibling, 1 reply; 50+ messages in thread
From: Stefan Monnier @ 2005-01-05 14:02 UTC (permalink / raw)
  Cc: 130397, agustin.martin, lionel, k.stevens, emacs-devel

>> But ispell.el should be able to automatically check whether the chars can be
>> safely encoded with the coding-system and if not (as in your example),
>> ispell.el will know that the word can't be checked by ispell and should
>> just be skipped (and maybe marked as "uncheckable").

> That seems to be a good approach.  But, just checking
> whether the chars is encodable with the coding-system is not
> enough.  For instance, entry for "francais" dict doesn't
> contain "ñ" in CASECHARS, but "español" is safely encodable
> by iso-8859-1.  So, the same error happens.  For ispell.el
> to know that "español" is uncheckable, we anyway need the
> current database ispell-dictionary-alist.

Aaaahhhh.... I'm beginning to understand, thank you.
But I still think ispell.el should not try to check "espa" and "ol".
So I now agree that the CASECHARS table is needed, but it should be used
after encoding the word (rather than when determining what is a word), and
if some char is not in CASECHARS the word should be flagged as uncheckable.

> By the way, isn't it possible to make that database
> automatically from *.aff?

I wouldn't know.


        Stefan


* Re: Bug 130397
  2005-01-05 14:02                       ` Stefan Monnier
@ 2005-01-06  0:44                         ` Kenichi Handa
  2005-01-06 16:30                           ` Ken Stevens
  0 siblings, 1 reply; 50+ messages in thread
From: Kenichi Handa @ 2005-01-06  0:44 UTC (permalink / raw)
  Cc: agustin.martin, lionel, emacs-devel, k.stevens, 130397

In article <87llb8htbf.fsf-monnier+emacs@gnu.org>, Stefan Monnier <monnier@iro.umontreal.ca> writes:

>>>  But ispell.el should be able to automatically check whether the chars can be
>>>  safely encoded with the coding-system and if not (as in your example),
>>>  ispell.el will know that the word can't be checked by ispell and should
>>>  just be skipped (and maybe marked as "uncheckable").

>>  That seems to be a good approach.  But, just checking
>>  whether the chars is encodable with the coding-system is not
>>  enough.  For instance, entry for "francais" dict doesn't
>>  contain "ñ" in CASECHARS, but "español" is safely encodable
>>  by iso-8859-1.  So, the same error happens.  For ispell.el
>>  to know that "español" is uncheckable, we anyway need the
>>  current database ispell-dictionary-alist.

> Aaaahhhh.... I'm beginning to understand, thank you.
> But I still think ispell.el should not try to check "espa" and "ol".
> So I now agree that the CASECHARS table is needed, but it should be used
> after encoding the word (rather than when determining what is a word), and
> if some char is not in CASECHARS the word should be flagged as uncheckable.

Although I have not yet understood the detail, "if some char
is not in CASECHARS" is not enough.  First of all, CASECHARS
is a regular expression.  And NOT-CASECHARS, OTHERCHARS, and
MANY-OTHERCHARS-P should also be checked somehow.  If that
is the way we are going to take, I'd like to ask maintainers
of ispell.el to do such a change.

---
Ken'ichi HANDA
handa@m17n.org

* Re: Bug 130397
  2005-01-06  0:44                         ` Kenichi Handa
@ 2005-01-06 16:30                           ` Ken Stevens
  2005-01-06 17:33                             ` Stefan Monnier
                                               ` (2 more replies)
  0 siblings, 3 replies; 50+ messages in thread
From: Ken Stevens @ 2005-01-06 16:30 UTC (permalink / raw)
  Cc: k.stevens, 130397, agustin.martin, lionel, emacs-devel,
	Stefan Monnier

Kenichi Handa writes:


> In article <87llb8htbf.fsf-monnier+emacs@gnu.org>, Stefan Monnier <monnier@iro.umontreal.ca> writes:
>
>>>>  But ispell.el should be able to automatically check whether the
>>>>  chars can be safely encoded with the coding-system and if not (as
>>>>  in your example), ispell.el will know that the word can't be
>>>>  checked by ispell and should just be skipped (and maybe marked as
>>>>  "uncheckable").
>
>>>  That seems to be a good approach.  But, just checking
>>>  whether the chars is encodable with the coding-system is not
>>>  enough.  For instance, entry for "francais" dict doesn't
>>>  contain "ñ" in CASECHARS, but "español" is safely encodable
>>>  by iso-8859-1.  So, the same error happens.  For ispell.el
>>>  to know that "español" is uncheckable, we anyway need the
>>>  current database ispell-dictionary-alist.
>
>> Aaaahhhh.... I'm beginning to understand, thank you.  But I still
>> think ispell.el should not try to check "espa" and "ol".  So I now
>> agree that the CASECHARS table is needed, but it should be used after
>> encoding the word (rather than when determining what is a word), and
>> if some char is not in CASECHARS the word should be flagged as
>> uncheckable.
>
> Although I have not yet understood the detail, "if some char
> is not in CASECHARS" is not enough.  First of all, CASECHARS
> is a regular expression.  And NOT-CASECHARS, OTHERCHARS,
> MANU-OTHERCHARS-P should also be checked somehow.  If that
> is the way we are going to take, I'd like to ask maintainers
> of ispell.el to do such a change.

Remember that the internationalization of ispell was done long before the
MULE code was added to emacs.  The encoding of the character sets and
the interaction between ispell and emacs was embodied in the ispell code
and interactions.  In ispell.el, this has been controlled by the
CASECHARS, NOT-CASECHARS, OTHERCHARS, MANY-OTHERCHARS-P,
EXTENDED-CHARACTER-MODE, and CHARACTER-SET fields.

The problem is more complicated than simply parsing what are word
characters.  There are differences in encoding when one uses latex as
the source, with its encoding of latin characters as escape sequences,
versus a raw ISO character set.  For instance, the dictionary stores
information regarding compound words, possessives, etc. in the spell
checking routines.  Knowing that the "'" character is used as a
possessive, for instance, ispell knows that "Ken's" is a correct
spelling based on the root "Ken".

Most of this complication can be hidden inside ispell.  The
problems mainly arise in two circumstances.

1. when spell checking a single word.
2. when an error occurs and the error is highlighted.

For instance, one of the major issues when MULE was implemented was the
fact that multiple bytes passed to ispell may only count as a single
byte or character on the display.

Here is where most of the hassles with libraries occur.  There may well
be a much better way of encoding the character sets and interactions
right now.  Perhaps we should investigate simplifying and possibly
removing the character set issues.  We would still minimally need to
communicate mode information to ispell.

Geoff has a much better understanding of the underlying spell search
engine.  Perhaps he can shed additional light on this topic.

regards		 -Ken

* Re: Bug 130397
  2005-01-06 16:30                           ` Ken Stevens
@ 2005-01-06 17:33                             ` Stefan Monnier
  2005-01-07  0:39                               ` Kenichi Handa
  2005-01-07 15:48                             ` Agustin Martin
  2005-01-08 12:31                             ` Geoff Kuenning
  2 siblings, 1 reply; 50+ messages in thread
From: Stefan Monnier @ 2005-01-06 17:33 UTC (permalink / raw)
  Cc: Kenichi Handa, k.stevens, 130397, agustin.martin, lionel,
	emacs-devel, ispell-bugs

> Remember that the internationalization of ispell was done long before the
> MULE code was added to emacs.

Actually, it's this understanding that leads me to think that 
CASECHARS, NOT-CASECHARS, OTHERCHARS, MANY-OTHERCHARS-P,
EXTENDED-CHARACTER-MODE, and CHARACTER-SET should be used after encoding
the word.

Before MULE, Emacs only worked with single-byte coding systems (things like
latin-1, but not iso-2022 or utf-8) and the exact same coding-system was
used by ispell, so ispell.el's CASECHARS, NOT-CASECHARS, OTHERCHARS,
MANY-OTHERCHARS-P, EXTENDED-CHARACTER-MODE, and CHARACTER-SET applied to
*encoded* text (i.e. text in latin-1 encoding, not in the internal encoding
used in Emacs MULE).

So it would seem to make sense (in order to simulate the pre-MULE behavior),
to first encode the text (into latin-1 or some such
single-byte coding system) and then use CASECHARS, NOT-CASECHARS, OTHERCHARS,
MANY-OTHERCHARS-P, EXTENDED-CHARACTER-MODE, and CHARACTER-SET.

Now encoding the whole text can't be realistically done, so we need to first
recognize words, then encode them, then use those vars.
I.e. the word-recognition code shouldn't use CASECHARS, NOT-CASECHARS,
OTHERCHARS, MANY-OTHERCHARS-P, EXTENDED-CHARACTER-MODE, and CHARACTER-SET.

> For instance, one of the major issues when MULE was implemented was the
> fact that multiple bytes passed to ispell may only count as a single
> byte or character on the display.

How/when can that happen?  Can you give an example?


        Stefan

* Re: Bug 130397
  2005-01-06 17:33                             ` Stefan Monnier
@ 2005-01-07  0:39                               ` Kenichi Handa
  0 siblings, 0 replies; 50+ messages in thread
From: Kenichi Handa @ 2005-01-07  0:39 UTC (permalink / raw)
  Cc: k.stevens, 130397, agustin.martin, lionel, emacs-devel, kstevens,
	ispell-bugs

In article <jwvbrc25v1k.fsf-monnier+emacs@gnu.org>, Stefan Monnier <monnier@iro.umontreal.ca> writes:

> Now encoding the whole text can't be realistically done, so we need to first
> recognize words, then encode them, then use those vars.
> I.e. the word-recognition code shouldn't use CASECHARS, NOT-CASECHARS,
> OTHERCHARS, MANY-OTHERCHARS-P, EXTENDED-CHARACTER-MODE, and CHARACTER-SET.

It seems that this doesn't work.  The documentation of
ispell-dictionary-alist says this:

OTHERCHARS is a regexp of characters in the NOT-CASECHARS set but which can be
used to construct words in some special way.  If OTHERCHARS characters follow
and precede characters from CASECHARS, they are parsed as part of a word,
otherwise they become word-breaks.  As an example in English, assume the
regular expression "[']" for OTHERCHARS.  Then "they're" and
"Steven's" are parsed as single words including the "'" character, but
"Stevens'" does not include the quote character as part of the word.
If you want OTHERCHARS to be empty, use the empty string.
Hint: regexp syntax requires the hyphen to be declared first here.

MANY-OTHERCHARS-P is non-nil when multiple OTHERCHARS are allowed in a word.
Otherwise only a single OTHERCHARS character is allowed to be part of any
single word.
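
The OTHERCHARS rule quoted above can be sketched as a single pattern. This is an illustration in Python, not the ispell.el implementation: `word_regexp` and `words` are hypothetical helpers, and CASECHARS and OTHERCHARS are assumed to be simple bracket expressions matching one character each.

```python
import re


def word_regexp(casechars, otherchars, many_otherchars_p):
    """Build a word pattern from the dictionary fields.

    An OTHERCHARS character is part of a word only when CASECHARS
    characters both precede and follow it.
    """
    core = f"(?:{casechars})+"
    if not otherchars:
        # Empty OTHERCHARS: words are plain runs of CASECHARS.
        return re.compile(core)
    joined = f"(?:(?:{otherchars})(?:{casechars})+)"
    # MANY-OTHERCHARS-P non-nil: any number of joiners; otherwise at most one.
    return re.compile(core + joined + ("*" if many_otherchars_p else "?"))


def words(text, regexp):
    """Return the words found in TEXT according to REGEXP."""
    return [m.group(0) for m in regexp.finditer(text)]


ENGLISH = word_regexp("[A-Za-z]", "[']", many_otherchars_p=False)
```

With `ENGLISH`, "they're" and "Steven's" come out as single words, while the trailing quote in "Stevens'" is left outside the word, matching the documentation's example.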

---
Ken'ichi HANDA
handa@m17n.org

* Re: Bug 130397 (Was: Emacs - Ispell problem with i[no]german dictionary)
  2005-01-04 12:50             ` Kenichi Handa
  2005-01-04 14:55               ` Bug 130397 Stefan
@ 2005-01-07 15:34               ` Agustin Martin
  1 sibling, 0 replies; 50+ messages in thread
From: Agustin Martin @ 2005-01-07 15:34 UTC (permalink / raw)


On Tue, Jan 04, 2005 at 09:50:33PM +0900, Kenichi Handa wrote:

> Hmmm, then how about the attached patch to the latest CVS
> emacs?  With that, all equivalent characters (e.g. a-grave in
> all latin-X) should be handled well.  This patch will be
> applicable also to Emacs 21.3 but not yet tested in that
> version.
> 

Hi,

Thanks for the patch, and sorry for not being very responsive these days. I
can hardly keep up to date with my mail.

Your patch applies cleanly (with just a minor offset) to the ispell.el shipped
with dictionaries-common, since that is version 3.6, as in CVS emacs.

First, note that I still get ispell misalignment errors when spellchecking
a utf8 file with an iso-8859-15 dict, for chars not in the iso-8859-1 set
(oe-char).

I also had some problems making it work with the different emacsen
flavours. With sid emacs21 (21.3+1-8) it works with some limitations that seem
somewhat similar to those of my dirty hack. However, in older versions
(e.g., woody emacs21) ucs-mule-8859-to-mule-unicode seems not to be available,
nor in emacs20, so I get

Symbol's value as variable is void: ucs-mule-8859-to-mule-unicode

For xemacs, besides the known problems with buffer-file-coding-system I also
get the error message

Unrecognized char table type: ispell-unified-chars-table

Note that the ispell.el in the dictionaries-common package must work with all
the shipped emacsen flavours (unless the problem is due to a bug in the emacsen
flavour), currently emacs21 and xemacs21-{mule,nomule...}, and preferably
should not make things worse for previous flavours (emacs20; no need to go back
further, even if I sometimes test emacs19), so we need to be very conservative.
Some checks are possible, however, but I am rather undecided about that.

Thanks a lot for your feedback,

Cheers,

-- 
Agustin

* Re: Bug 130397
  2005-01-05  5:50                     ` Kenichi Handa
  2005-01-05 14:02                       ` Stefan Monnier
@ 2005-01-07 15:36                       ` Agustin Martin
  2005-01-07 20:29                         ` Ken Stevens
  2005-01-07 21:27                         ` Juri Linkov
  1 sibling, 2 replies; 50+ messages in thread
From: Agustin Martin @ 2005-01-07 15:36 UTC (permalink / raw)


On Wed, Jan 05, 2005 at 02:50:09PM +0900, Kenichi Handa wrote:
> Stefan Monnier <monnier@iro.umontreal.ca> writes:
> > But ispell.el should be able to automatically check whether the chars can be
> > safely encoded with the coding-system and if not (as in your example),
> > ispell.el will know that the word can't be checked by ispell and should
> > just be skipped (and maybe marked as "uncheckable").
> 
> That seems to be a good approach.  But, just checking
> whether the chars is encodable with the coding-system is not
> enough.  For instance, entry for "francais" dict doesn't
> contain "ñ" in CASECHARS, but "español" is safely encodable
> by iso-8859-1.  So, the same error happens.  For ispell.el
> to know that "español" is uncheckable, we anyway need the
> current database ispell-dictionary-alist.

Otherwise, expect something like

 ispell and its process have different character maps

during ispell-word, as well as some other possible errors. That is for
single-byte chars. When there is a char that cannot be encoded in the dict
encoding, the 'ispell misalignment' error appears.

*Ken*, since you are being cc'ed: I vaguely remembered some info I had read
somewhere about these misalignments. I finally found it,

 http://lists.gnu.org/archive/html/emacs-devel/2002-09/msg01007.html

The suggestion there is essentially that ispell-word (as well as flyspell)
does not show the misalignment problems because of the way words are passed
to ispell, while ispell-region (and so ispell-buffer) does. I have tested
this with an ad-hoc file: ispell-buffer gives the misalignment error while
flyspell-buffer does not. The suggestion is that this could be fixed by
making ispell-region iterate over words instead of over lines. Do you think
this would help to get rid of the misalignments, or are there other
drawbacks I am not aware of? I did not see any reply to that mail.

> 
> By the way, isn't it possible to make that database
> automatically from *.aff?
> 

Remember that there is also aspell, so we should use .aff when using ispell
and some other way when using aspell.

The way we do this is to trust dict maintainers to provide a file with all the
relevant info, kept up to date with the dict's current values.
ispell-dictionary-alist is rebuilt from that data, which is parsed at
dictionary installation. This way we try to make sure that all values really
match, that errors can be fixed more quickly by the dict maintainer without
needing a centralized maintainer to keep that alist up to date, and that
everything is based on the dicts actually installed.

By the way, in emacs CVS the esperanto entry claims to use iso-8859-1 encoding,
while it should be iso-8859-3, and iso-8859-3 should be added to the possible
coding-system values.
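
Why the declared coding system matters can be shown with a quick encodability check. This is an illustrative Python sketch, not ispell.el code; `encodable` and `usable_coding_systems` are hypothetical helpers, and Python's codec names stand in for the Emacs coding systems of the same name.

```python
def encodable(word, coding_system):
    """True if every character of WORD exists in CODING_SYSTEM."""
    try:
        word.encode(coding_system)
    except UnicodeEncodeError:
        return False
    return True


def usable_coding_systems(word, candidates):
    """Return the candidate coding systems that can represent WORD."""
    return [cs for cs in candidates if encodable(word, cs)]
```

An Esperanto word such as "ĉirkaŭ" has no representation in iso-8859-1, so a dictionary entry declaring that coding system cannot pass the word to the speller unchanged.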

Regarding this, we added a patch by Joao Cachopo to allow coding-system to be
any coding system supported by emacs (http://bugs.debian.org/208518),
using

(coding-system :tag "Coding System")

instead of

(choice :tag "Coding system"
      (const iso-8859-1)
      (const iso-8859-2)
      (const koi8-r))

in both ispell-local-dictionary-alist and ispell-dictionary-alist
defcustoms.

Cheers,

-- 
Agustin

* Re: Bug 130397
  2005-01-06 16:30                           ` Ken Stevens
  2005-01-06 17:33                             ` Stefan Monnier
@ 2005-01-07 15:48                             ` Agustin Martin
  2005-01-08 12:31                             ` Geoff Kuenning
  2 siblings, 0 replies; 50+ messages in thread
From: Agustin Martin @ 2005-01-07 15:48 UTC (permalink / raw)


On Thu, Jan 06, 2005 at 08:30:10AM -0800, Ken Stevens wrote:

> Here is where most of the hassles with libraries occur.  There may well
> be a much better way of encoding the character sets and interactions
> right now.  Perhaps we should investigate simplifying and possibly
> removing the character set issues.  We would still minimally need to
> communicate mode information to ispell.
> 

There are still xemacs21-nomule packages flying around (at least in Debian).
But it is probably enough to keep them from failing outright, that is, to
have ispell.el work for them with no reencoding at all.

-- 
Agustin

* Re: Bug 130397
  2005-01-07 15:36                       ` Agustin Martin
@ 2005-01-07 20:29                         ` Ken Stevens
  2005-01-07 21:27                         ` Juri Linkov
  1 sibling, 0 replies; 50+ messages in thread
From: Ken Stevens @ 2005-01-07 20:29 UTC (permalink / raw)
  Cc: 130397, emacs-devel, Stefan Monnier, k.stevens, Kenichi Handa

Agustin Martin writes:

> *Ken*, since you are being cc'ed I vaguely remembered some info I somewhere
> read about this misalignments. I finally found it,
> 
>  http://lists.gnu.org/archive/html/emacs-devel/2002-09/msg01007.html
> 
> Essentially seems to be suggested that ispell-word (as well as flyspell)
> does not show the misalignment problems because of the way words are passed
> to ispell, while ispell-region (and so ispell-buffer) does. I have tested
> that in an ad-hoc file, ispell-buffer gives the misalignment error
> while flyspell-buffer not. The suggestion is that making ispell-region iterate
> over words instead of over lines this could be fixed. Do you think this would
> help to get rid of the misalignments, or there are other drawbacks I am not
> aware of? I did not see any reply to that mail.

Yes.  When the dictionary and the database match, iterating over valid
character sequences that make up potential words rather than regions
(such as a line) should eliminate these errors.
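
A rough sketch of the word-by-word iteration, in Python for illustration only. `misspellings` is a hypothetical helper, `is_correct` stands for one round trip to the speller process, and the default `\w+` pattern is an assumption standing in for the dictionary's own word syntax. The point is that positions are taken from the editor's own character indices, so the speller's byte offsets never enter the picture.

```python
import re


def misspellings(text, is_correct, word_regexp=re.compile(r"\w+")):
    """Return (start, end, word) for each word the speller rejects.

    START and END are character positions in TEXT, computed here rather
    than read back from the speller.
    """
    found = []
    for m in word_regexp.finditer(text):
        word = m.group(0)
        # One speller query per word: this is the extra process
        # communication that may cost performance.
        if not is_correct(word):
            found.append((m.start(), m.end(), word))
    return found
```

The per-word query is where the performance concern mentioned below comes from: every word costs one exchange with the external process instead of one exchange per line.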

There are two drawbacks - I suspect performance would be degraded by the
increased process communication.  I could run some tests to see how much
that would be the case on systems I use, though I assume it could vary
substantially over different OSes.  The second is that this is a fairly
fundamental change that would require some considerable coding effort -
which is probably the biggest issue.  Someone needs to kick me to make
it happen.  I apologize that I have been very busy lately and this has
fallen off my list.

regards		-Ken

* Re: Bug 130397
  2005-01-07 15:36                       ` Agustin Martin
  2005-01-07 20:29                         ` Ken Stevens
@ 2005-01-07 21:27                         ` Juri Linkov
  2005-01-13  5:59                           ` Kenichi Handa
  1 sibling, 1 reply; 50+ messages in thread
From: Juri Linkov @ 2005-01-07 21:27 UTC (permalink / raw)
  Cc: 130397, k.stevens, ispell-el-bugs, emacs-devel

Agustin Martin <agustin.martin@hispalinux.es> writes:
> *Ken*, since you are being cc'ed I vaguely remembered some info I somewhere
> read about this misalignments. I finally found it,
>
>  http://lists.gnu.org/archive/html/emacs-devel/2002-09/msg01007.html

The bug reported on this URL occurs only in Emacs 21.3, not in Emacs CVS.
It seems something was fixed already.

However, by a strange coincidence I got the same error in Emacs CVS just
today for the first time.  So I can describe how this bug can be reproduced
in Emacs CVS: the first part of a word is copied from an external
application and gets encoded in the buffer in mule-unicode-0100-24ff,
and the second part of the word is typed with an input method and gets encoded
in cyrillic-iso8859-5.  Calling ispell-buffer with the `russian' dictionary
on a buffer containing such a mixed-encoding word then signals the
error "Ispell misalignment".

And while on this topic, I want to remind everyone that many Emacs users suffer
from the inability of ispell.el to simultaneously check mixed multi-language
texts.  So, whoever fixes ispell.el, please take that into account.
Such combining is quite easily doable for any disjoint alphabets, as well
as for alphabets where one alphabet is a superset of another, like e.g.
English and some other Latin-based alphabets.  Even for overlapping
alphabets it would be possible by using the `w' syntax to get a word
and feeding it to a different ispell instance for each dictionary.

-- 
Juri Linkov
http://www.jurta.org/emacs/

* Re: Bug 130397
  2005-01-06 16:30                           ` Ken Stevens
  2005-01-06 17:33                             ` Stefan Monnier
  2005-01-07 15:48                             ` Agustin Martin
@ 2005-01-08 12:31                             ` Geoff Kuenning
  2005-01-08 12:47                               ` David Kastrup
  2005-01-08 22:39                               ` Peter Heslin
  2 siblings, 2 replies; 50+ messages in thread
From: Geoff Kuenning @ 2005-01-08 12:31 UTC (permalink / raw)
  Cc: Kenichi Handa, 130397, agustin.martin, lionel, emacs-devel, juri,
	Stefan Monnier

Ken writes:

> Geoff has a much better understanding of the underlying spell search
> engine.  Perhaps he can shed additional light on this topic.

I just looked at the code to be sure my memory is correct.  Here's the
short rundown: in the '-a' interface, ispell interfaces with the
outside world purely in a byte-indexed mode.  It is perfectly capable
of handling UTF-8 and similar multi-byte encodings, but when it
reports the offsets of incorrect words, it does so as a byte offset,
not a character offset.

Does emacs provide an underlying byte-indexed interface to the buffer?
If so, life should be easy: just have ispell.el use that interface.
If not, I think life is going to be very, very difficult.  It's
possible that I could modify ispell to provide a display-width index
rather than a byte index, but it's not trivial and there may be
pitfalls.  There's also the problem that--even if I get off my butt
and produce a new release reasonably soon--there are lots of old
copies of ispell out there that wouldn't support the new interface.

Juri writes:

> And while on this topic, I want to remind that many Emacs users suffer
> from the inability of ispell.el to simultaneously check mixed multi-language
> texts.  So, whoever fixes ispell.el, please take that into account.
> Such combining is quite easily doable for any disjoint alphabets, as well
> as for alphabets where one alphabet is a superset of another, like e.g.
> English and some other Latin-based alphabets.  Even for overlapping
> alphabets it would be possible with using the `w' syntax to get a word
> and to feed it to different ispell instances for each dictionary.

I'm not entirely sure what you mean here.  For disjoint alphabets,
it's certainly relatively easy to figure out which word should go to
which ispell instance.  For identical, superset, or overlapping
alphabets, the problem is basically insoluble.  For example, "fra" is
a misspelling in English but legal in Italian.  If it appears in a
mixed passage, which dictionary should it be fed to?  The only
solution would seem to be to require the user to mark passages in some
way, as is done in HTML.
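
The disjoint-alphabet case can be sketched as follows, in Python for illustration only. `script_of` is a hypothetical helper that reads the script from the first word of each character's Unicode name, which is an assumption that holds for Latin and Cyrillic letters but is not a general script database.

```python
import unicodedata


def script_of(word):
    """Return the single script WORD is written in, or None.

    None means the word is empty, has no letters, or mixes scripts.
    """
    scripts = set()
    for ch in word:
        if ch.isalpha():
            # e.g. "CYRILLIC SMALL LETTER DE" -> "CYRILLIC"
            scripts.add(unicodedata.name(ch).split()[0])
    return scripts.pop() if len(scripts) == 1 else None
```

This separates a Russian word from an English one, but it returns the same answer, "LATIN", for "fra" whether the surrounding passage is English or Italian, which is exactly the ambiguity described above.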
-- 
    Geoff Kuenning   geoff@cs.hmc.edu   http://www.cs.hmc.edu/~geoff/

One could not be a successful scientist without realizing that, in contrast to
the popular conception supported by newspapers and mothers of scientists, a
goodly number of scientists are not only narrow-minded and dull, but also just
stupid. -- James Watson

* Re: Bug 130397
  2005-01-08 12:31                             ` Geoff Kuenning
@ 2005-01-08 12:47                               ` David Kastrup
  2005-01-08 13:29                                 ` Miles Bader
  2005-01-08 22:39                               ` Peter Heslin
  1 sibling, 1 reply; 50+ messages in thread
From: David Kastrup @ 2005-01-08 12:47 UTC (permalink / raw)
  Cc: Kenichi Handa, 130397, agustin.martin, lionel, emacs-devel, juri,
	Ken Stevens, Stefan Monnier

Geoff Kuenning <geoff@cs.hmc.edu> writes:

> Ken writes:
>
>> Geoff has a much better understanding of the underlying spell search
>> engine.  Perhaps he can shed additional light on this topic.
>
> I just looked at the code to be sure my memory is correct.  Here's
> the short rundown: in the '-a' interface, ispell interfaces with the
> outside world purely in a byte-indexed mode.  It is perfectly
> capable of handling UTF-8 and similar multi-byte encodings, but when
> it reports the offsets of incorrect words, it does so as a byte
> offset, not a character offset.
>
> Does emacs provide an underlying byte-indexed interface to the
> buffer?  If so, life should be easy: just have ispell.el use that
> interface.

You are wrongly assuming that the buffer is maintained in UTF-8.  It
isn't.  Byte indexing is not going to be fun with regard to
efficiency, unless we get some interface that will, while writing out
a file in UTF-8, store an array of byte/character correspondences for
the UTF-8 (or whatever other) character conversion somewhere.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum

* Re: Bug 130397
  2005-01-08 12:47                               ` David Kastrup
@ 2005-01-08 13:29                                 ` Miles Bader
  2005-01-08 17:15                                   ` Geoff Kuenning
  2005-01-10  4:45                                   ` Eli Zaretskii
  0 siblings, 2 replies; 50+ messages in thread
From: Miles Bader @ 2005-01-08 13:29 UTC (permalink / raw)
  Cc: Geoff Kuenning, 130397, agustin.martin, lionel, Kenichi Handa,
	emacs-devel, juri, Ken Stevens, Stefan Monnier

> You are wrongly assuming that the buffer is maintained in UTF-8.  It
> isn't.  Byte indexing is not going to be fun with regard to
> efficiency, unless we get some interface that will, while writing out
> a file in UTF-8, store an array of byte/character correspondences for
> the UTF-8 (or whatever other) character conversion somewhere.

Er, does efficiency matter all that much when parsing output from
ispell -a?  After all, it's feeding the input line-by-line to ispell
(judging from the man page), and you only have to actually deal with
offsets for lines with misspellings -- which are the minority, and
result in user interaction anyway, which will tend to hide any sort of
slight inefficiency.

If ispell wants utf-8, it's easy enough to convert each input line to
utf-8 and deal with offsets into that in the event of a misspelling;
even if emacs has to process the line character by character to do it,
it seems like it would be fast enough.  For the great bulk of the
buffer without any misspellings, you won't incur the inefficiency,
which is what really matters.

-Miles

* Re: Bug 130397
  2005-01-08 13:29                                 ` Miles Bader
@ 2005-01-08 17:15                                   ` Geoff Kuenning
  2005-01-10  4:45                                   ` Eli Zaretskii
  1 sibling, 0 replies; 50+ messages in thread
From: Geoff Kuenning @ 2005-01-08 17:15 UTC (permalink / raw)
  Cc: 130397, agustin.martin, lionel, Kenichi Handa, emacs-devel, juri,
	Ken Stevens, Stefan Monnier, miles

> If ispell wants utf-8, it's easy enough to convert each input line to
> utf-8 and deal with offsets into that in the event of a mispelling;

Just to clarify: ispell can theoretically handle almost any
instantaneously decodable code that is restricted to byte boundaries.
At the moment, I don't know of a language that has a UTF-8 affix file,
though I imagine that problem will be corrected in the near future.

In particular, if emacs stores stuff internally using a constant-width
encoding, for most languages you could convert that to latin-1, feed
that to ispell, and then generate line offsets with a simple multiply.
-- 
    Geoff Kuenning   geoff@cs.hmc.edu   http://www.cs.hmc.edu/~geoff/

McDonald's, which does not wait on your table, does not cook your food
to order, and does not clear your table, came up with the slogan ``We
Do It All For You.''
	-- Dave Barry

* Re: Bug 130397
  2005-01-08 12:31                             ` Geoff Kuenning
  2005-01-08 12:47                               ` David Kastrup
@ 2005-01-08 22:39                               ` Peter Heslin
  1 sibling, 0 replies; 50+ messages in thread
From: Peter Heslin @ 2005-01-08 22:39 UTC (permalink / raw)


On 2005-01-08, Geoff Kuenning <geoff@cs.hmc.edu> wrote:
>  For identical, superset, or overlapping alphabets, the problem is
>  basically insoluble.  For example, "fra" is a misspelling in
>  English but legal in Italian.  If it appears in a mixed passage,
>  which dictionary should it be fed to?  The only solution would seem
>  to be to require the user to mark passages in some way, as is done
>  in HTML.

I have some code, which works with flyspell, that parses the buffer
around point, and sets a text-property to indicate the current
language.  In LaTeX buffers this is done by examining Babel commands
that declare the language; in XML buffers it is done by examining the
xml:lang attributes.

This only works with flyspell, not ispell.el, because flyspell
conveniently provides a hook for a function that gets called whenever
a word is spell-checked.  My function looks at the relevant
text-property and the value of ispell-local-dictionary, and if they
don't match, it starts a new ispell/aspell process with the correct
dictionary for the current text.

It would be great if ispell.el itself checked a text-property like
this to indicate the language, so that code like mine could work with
both flyspell and ispell.  Even better would be if ispell.el could be
configured to keep multiple ispell/aspell processes running: one for
each language.  The speed bottleneck in my code is that when you move
point from a part of the buffer in one language to a part in another
language, I have to kill the old ispell/aspell process and start up a
new one.  This causes a noticeable delay when moving the cursor.

It would be awesome if, for a buffer with text in two languages, you
could keep an ispell/aspell process running for each.  I suppose it's
mainly in flyspell that you would see the speed benefit, though.
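
The "one process per language" idea amounts to a small cache keyed by dictionary name. A minimal Python sketch, for illustration only: `SpellerPool` is hypothetical, and `start_speller` stands for whatever actually launches an ispell/aspell process for a given dictionary.

```python
class SpellerPool:
    """Keep one running speller per language instead of restarting."""

    def __init__(self, start_speller):
        # start_speller(language) is assumed to launch and return a
        # speller process (or handle) for that dictionary.
        self._start = start_speller
        self._running = {}

    def get(self, language):
        """Return the speller for LANGUAGE, starting it only once."""
        if language not in self._running:
            self._running[language] = self._start(language)
        return self._running[language]
```

Moving point between two languages then costs a dictionary lookup rather than a process kill and restart; the price is one idle process per language seen so far.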

Peter

* Re: Bug 130397
  2005-01-08 13:29                                 ` Miles Bader
  2005-01-08 17:15                                   ` Geoff Kuenning
@ 2005-01-10  4:45                                   ` Eli Zaretskii
  2005-01-10  9:09                                     ` David Kastrup
  1 sibling, 1 reply; 50+ messages in thread
From: Eli Zaretskii @ 2005-01-10  4:45 UTC (permalink / raw)
  Cc: geoff, 130397, agustin.martin, lionel, emacs-devel, kstevens

> Date: Sat, 8 Jan 2005 22:29:21 +0900
> From: Miles Bader <snogglethorpe@gmail.com>
> Cc: Geoff Kuenning <geoff@cs.hmc.edu>, 130397@bugs.debian.org,
> 	agustin.martin@hispalinux.es, lionel@mamane.lu,
> 	Kenichi Handa <handa@m17n.org>, emacs-devel@gnu.org,
> 	juri@jurta.org, Ken Stevens <kstevens@ichips.intel.com>,
> 	Stefan Monnier <monnier@iro.umontreal.ca>
> 
> If ispell wants utf-8, it's easy enough to convert each input line to
> utf-8 and deal with offsets into that in the event of a mispelling;

Or account for byte offsets by the (variable) multibyte length of each
character, which Emacs knows.  I don't remember for the moment whether
the multibyte length of the UTF-8 encoding can be gotten at by a Lisp
program, but if not, we could add some primitive to do that.
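
A sketch of that accounting, in Python for illustration only: walk the line and add up each character's encoded length until the reported byte offset is reached. `char_offset_by_widths` is a hypothetical helper, offsets are assumed to be 0-based, and the encoding is assumed to be stateless (as UTF-8 is), so that encoding one character at a time gives its true width.

```python
def char_offset_by_widths(line, byte_offset, coding_system="utf-8"):
    """Map a byte offset in the encoded LINE to a character offset."""
    consumed = 0
    for index, ch in enumerate(line):
        if consumed == byte_offset:
            return index
        if consumed > byte_offset:
            raise ValueError("byte offset falls inside a character")
        # Width of this one character in the external encoding.
        consumed += len(ch.encode(coding_system))
    if consumed == byte_offset:
        return len(line)
    raise ValueError("byte offset is not a character boundary of this line")
```

For "très dèplorable" in UTF-8, byte offset 6 maps to character offset 5, because "è" occupies two bytes.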

* Re: Bug 130397
  2005-01-10  4:45                                   ` Eli Zaretskii
@ 2005-01-10  9:09                                     ` David Kastrup
  2005-01-10 20:16                                       ` Eli Zaretskii
  2005-01-13  7:50                                       ` Kenichi Handa
  0 siblings, 2 replies; 50+ messages in thread
From: David Kastrup @ 2005-01-10  9:09 UTC (permalink / raw)
  Cc: geoff, 130397, agustin.martin, lionel, emacs-devel, kstevens,
	snogglethorpe, miles

"Eli Zaretskii" <eliz@gnu.org> writes:

>> Date: Sat, 8 Jan 2005 22:29:21 +0900
>> From: Miles Bader <snogglethorpe@gmail.com>
>> Cc: Geoff Kuenning <geoff@cs.hmc.edu>, 130397@bugs.debian.org,
>> 	agustin.martin@hispalinux.es, lionel@mamane.lu,
>> 	Kenichi Handa <handa@m17n.org>, emacs-devel@gnu.org,
>> 	juri@jurta.org, Ken Stevens <kstevens@ichips.intel.com>,
>> 	Stefan Monnier <monnier@iro.umontreal.ca>
>> 
>> If ispell wants utf-8, it's easy enough to convert each input line to
>> utf-8 and deal with offsets into that in the event of a misspelling;
>
> Or account for byte offsets by the (variable) multibyte length of each
> character, which Emacs knows.  I don't remember at the moment whether
> the multibyte length of the UTF-8 encoding can be obtained by a Lisp
> program, but if not, we could add a primitive to do that.

Just encode the line to utf-8, find the correct point in the byte
string, cut off the line there, convert back and check the length of
the string.  This works unless you are in the middle of a character.
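
The round trip described above can be sketched as follows. This is Python,
not Emacs Lisp, and the function name is an assumption made for the
example:

```python
def char_index_via_roundtrip(line: str, byte_offset: int) -> int:
    """Encode to UTF-8, cut the byte string at byte_offset, decode the
    prefix again and return its length in characters."""
    encoded = line.encode("utf-8")
    prefix = encoded[:byte_offset]
    # Strict decoding raises UnicodeDecodeError when the cut lands in the
    # middle of a multibyte character, which is the caveat noted above.
    return len(prefix.decode("utf-8"))
```

With "un dèplorable cas" and byte offset 6 this yields character index 5;
with byte offset 5 (inside 'è') the decode step fails instead of returning
a wrong position.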

But it would be much saner if our conversion facilities would preserve
markers (which they don't do right now): encode to utf-8, place a
marker at the right byte offset, undo the conversion.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: Bug 130397 (Was: Emacs - Ispell problem with i[no]german dictionary)
  2004-12-22 17:13           ` Agustin Martin
  2005-01-04 12:50             ` Kenichi Handa
@ 2005-01-10 13:06             ` Lionel Elie Mamane
  2005-01-10 17:16               ` Agustin Martin
  1 sibling, 1 reply; 50+ messages in thread
From: Lionel Elie Mamane @ 2005-01-10 13:06 UTC (permalink / raw)
  Cc: 130397, emacs-devel, Kenichi Handa


On Wed, Dec 22, 2004 at 06:13:06PM +0100, Agustin Martin wrote:
> On Wed, Dec 22, 2004 at 09:37:32PM +0900, Kenichi Handa wrote:

> Thanks for the tip. I am not maintaining emacs, but a package for
> the common dictionaries setup (dictionaries-common) that provides a
> recent and patched ispell.el for all the different emacsen flavours
> ({x}emacs) to integrate the different dicts and spellchecking
> engines in some way. I will be happy to test this once it is included
> in sid emacs.

>>> I am playing with redefining the ispell-get-coding-system function in
>>> ispell.el so the dict coding-system is changed to iso-8859-15 if it was
>>> originally iso-8859-1 and emacs has iso-8859-15 as
>>> buffer-file-coding-system, something like

>> But, anyway, I think the above function is too ad-hoc.  As
>> iso-8859-1 and iso-8859-15 contain different sets of characters
>> (even if the differences are few), it's not good to treat them as
>> the same thing.

> I was aware of this, but thanks anyway for the reminder. The code is
> probably too ad-hoc, but the latin{0,1} thing is also a somewhat ad-hoc
> scenario, where latin0 should really have been named something like
> iso-8859-1v2, that is, a revision. I cannot imagine somebody using an
> iso-8859-2 dict and trying to write in an iso-8859-1 buffer, but with
> iso-8859-1 and iso-8859-15 that is happening too frequently.

> So we have a lot of people that blindly select the locale @euro
> variant without realizing its implications, and that iso-8859-1 and
> iso-8859-15 are different, but very close encodings (from a
> practical point of view, they are fully equivalent for most
> languages but IIRC french and finnish).

> The current state of ispell dicts in Debian is that ifrench is iso-8859-15
> as default (although has a real latin1 entry).

> So the only language that might currently require extra work is
> french, and for it I find reasonable to use for emacs as default the
> iso-8859-15 entry (tagged as iso-8859-1 for the above system to
> work). For this I would like to hear Lionel's point of view, since
> he has put a lot of effort to make iso-8859-15 available for
> spellchecking (Hi, Lionel).

I think that if we do that, then latin1 text won't be spell-checked
correctly: Ispell will try to insert "one half" and "one quarter"
characters (the characters occupying the same place as OE and oe in
latin9), won't it?
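
The overlap in question can be checked mechanically. The sketch below (in
Python, for illustration only; the function name is hypothetical) lists
every byte in the upper half of the table that decodes differently under
iso-8859-1 and iso-8859-15:

```python
def latin1_latin9_differences():
    """Return {byte: (latin1_char, latin9_char)} for bytes 0xA0..0xFF
    that decode to different characters in the two encodings."""
    differences = {}
    for byte in range(0xA0, 0x100):
        raw = bytes([byte])
        as_latin1 = raw.decode("iso-8859-1")
        as_latin9 = raw.decode("iso-8859-15")
        if as_latin1 != as_latin9:
            differences[byte] = (as_latin1, as_latin9)
    return differences


if __name__ == "__main__":
    for byte, (l1, l9) in sorted(latin1_latin9_differences().items()):
        print(f"0x{byte:02X}  latin1={l1!r}  latin9={l9!r}")
```

Bytes 0xBC and 0xBD are the ones at issue here: they are "one quarter" and
"one half" in latin1, and OE and oe in latin9, so a latin9 "cœur" read as
latin1 shows up as "c½ur".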

> I personally do not like having separate iso-8859-15 entries unless
> they are really required. For the above dicts, that would be for
> french, and I am not at all sure that it is really required.

Having separate entries that the user has to select manually is bad,
but it is the best we can have with the current system if we want to
keep correctness of the spell-checking, as far as I understand. Having
the system (the combination of emacs + dicts-common + the dicts)
select the right dictionary + options combination automatically based
on the (language, encoding) pair (like "-d francais -T ~latin1" for
french and latin1) would be cool from the user's POV.

We have special entries for "(La)TeX", which can be seen as another
encoding, so why not special entries for iso8859-15 (when necessary)?
What is so fundamentally different about iso8859-15?

-- 
Lionel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: Bug 130397 (Was: Emacs - Ispell problem with i[no]german dictionary)
  2005-01-10 13:06             ` Lionel Elie Mamane
@ 2005-01-10 17:16               ` Agustin Martin
  2005-01-11  5:16                 ` Kenichi Handa
  2005-01-11 14:29                 ` Richard Stallman
  0 siblings, 2 replies; 50+ messages in thread
From: Agustin Martin @ 2005-01-10 17:16 UTC (permalink / raw)


(Handa, your patch worked better than I thought, read below)

On Mon, Jan 10, 2005 at 02:06:41PM +0100, Lionel Elie Mamane wrote:
> On Wed, Dec 22, 2004 at 06:13:06PM +0100, Agustin Martin wrote:
> > So the only language that might currently require extra work is
> > french, and for it I find reasonable to use for emacs as default the
> > iso-8859-15 entry (tagged as iso-8859-1 for the above system to
> > work). For this I would like to hear Lionel's point of view, since
> > he has put a lot of effort to make iso-8859-15 available for
> > spellchecking (Hi, Lionel).
> 
> I think that if we do that, then latin1 text won't be spell-checked
> correctly: Ispell will try to insert "one half" and "one quarter"
> characters (the characters occupying the same place as OE and oe in
> latin9), won't it?

Yes, things will be that way. I am considering an "exclusion" list, that is,
languages for which this hack should not be done, so they can have really
different iso-8859-1 and iso-8859-15 entries for {x}emacs. That will
currently be only french, since it seems that finnish dicts do not include
the iso-8859-15 chars. For all other languages, considering them as
internally equivalent seems not unreasonable.

It is then up to the french dicts maintainers to decide which one is to be
considered as "default", that is, to be called "francais". Also coordination
with the aspell french dict maintainer is needed, so they both share the
same ispell.el entry in the least conflicting way.

> 
> > I personally do not like having separate iso-8859-15 entries unless
> > they are really required. For the above dicts, that would be for
> > french, and I am not at all sure that it is really required.
> 
> Having separate entries that the user has to select manually is bad,
> but it is the best we can have with the current system if we want to
> keep correctness of the spell-checking, as far as I understand. Having
> the system (the combination of emacs + dicts-common + the dicts)
> select the right dictionary + options combination automatically based
> on the (language, encoding) pair (like "-d francais -T ~latin1" for
> french and latin1) would be cool from the user's POV.
> 
> We have special entries for "(La)TeX", which can be seen as another
> encoding, so why not special entries for iso8859-15 (when necessary)?
> What is so fundamentally different about iso8859-15?
> 

The problem was that when editing a utf-8 buffer and using an iso-8859-15
ispell dict entry for emacs there were some problems, notably that some
misalignment errors appeared. I vaguely remember that word boundaries were
not found correctly, but I am not sure, and if that problem existed, it
seems to be gone in current sid emacs21.

Also Kenichi Handa provided us with a patch to ensure that all equivalent
accented chars are mapped to the same char, if available under different
encodings, so that they are not considered word boundaries if
spell-checkable, but I still got misalignment errors with it. It would,
however, fix the word boundaries problem for an iso-8859-15 buffer using
an iso-8859-1 dict.

But I have just noticed that if I add coeur (with oe-1char) to
the french dict (ifrench, which contained only the oe-2char version) the
misalignment errors disappear (I only tested with coeur; I do not know which
other words have the same char, although I guess most words with 'oeu' do).

So, patch from Kenichi Handa seems to work well for sid emacs21, much better
than I thought. However it uses code that has only been recently added to
emacs21, and things that are not available for xemacs or emacs20.

Cheers,

-- 
Agustin

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: Bug 130397
  2005-01-10  9:09                                     ` David Kastrup
@ 2005-01-10 20:16                                       ` Eli Zaretskii
  2005-01-13  7:50                                       ` Kenichi Handa
  1 sibling, 0 replies; 50+ messages in thread
From: Eli Zaretskii @ 2005-01-10 20:16 UTC (permalink / raw)
  Cc: geoff, 130397, agustin.martin, lionel, emacs-devel, kstevens

> Cc: snogglethorpe@gmail.com,  miles@gnu.org,  geoff@cs.hmc.edu,
> 	  130397@bugs.debian.org,  agustin.martin@hispalinux.es,
> 	  lionel@mamane.lu,  emacs-devel@gnu.org,  kstevens@ichips.intel.com
> From: David Kastrup <dak@gnu.org>
> Date: Mon, 10 Jan 2005 10:09:41 +0100
> 
> Just encode the line to utf-8, find the correct point in the byte
> string, cut off the line there, convert back and check the length of
> the string.

This isn't needed, I think: the number of bytes in the UTF-8 encoding
of a certain Unicode codepoint is a very simple function of the
codepoint.
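
For reference, that function can be written down directly. A sketch in
Python (the name is made up for the example; Emacs itself would need the
equivalent in Lisp or C):

```python
def utf8_length(codepoint: int) -> int:
    """Number of bytes UTF-8 uses to encode a Unicode code point."""
    if codepoint < 0 or codepoint > 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if codepoint < 0x80:        # ASCII
        return 1
    if codepoint < 0x800:       # e.g. U+00E8 (e grave), U+0153 (oe ligature)
        return 2
    if codepoint < 0x10000:     # rest of the Basic Multilingual Plane
        return 3
    return 4                    # supplementary planes
```

Summing `utf8_length(ord(ch))` over the characters before a given position
gives that position's byte offset without performing any conversion.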

> This works unless you are in the middle of a character.

I think this cannot happen, since spelling always works on character
boundaries.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: Bug 130397 (Was: Emacs - Ispell problem with i[no]german dictionary)
  2005-01-10 17:16               ` Agustin Martin
@ 2005-01-11  5:16                 ` Kenichi Handa
  2005-01-11 19:56                   ` Agustin Martin
  2005-01-11 14:29                 ` Richard Stallman
  1 sibling, 1 reply; 50+ messages in thread
From: Kenichi Handa @ 2005-01-11  5:16 UTC (permalink / raw)
  Cc: lionel, emacs-devel, 130397

In article <20050110171611.GA10357@agmartin.aq.upm.es>, Agustin Martin <agustin.martin@hispalinux.es> writes:

> (Handa, your patch worked better than I thought, read below)

Thank you, that's good news.

> Also Kenichi Handa provided us with a patch to ensure that all equivalent
> accented chars are mapped to the same char, if available under different
> encodings, so that they are not considered word boundaries if
> spell-checkable, but I still got misalignment errors with it. It would,
> however, fix the word boundaries problem for an iso-8859-15 buffer using
> an iso-8859-1 dict.

> But I have just noticed that if I add coeur (with oe-1char) to
> the french dict (ifrench, it contained only the oe-2char version) the
> misalignment errors disappear (I only tested with coeur, do not know which
> other words have the same char, although I guess most words with 'oeu' do)

Sorry, I don't understand.  What is oe-1char?  U+0153 or
U+0276?  Isn't it the case that neither of them is included in
iso-8859-1/iso-8859-15?  And I have no idea why adding
coeur (with oe-1char) to the dictionary solves the
misalignment error.  Is it because of a bug in ispell?

> So, patch from Kenichi Handa seems to work well for sid emacs21, much better
> than I thought. However it uses code that has only been recently added to
> emacs21, and things that are not available for xemacs or emacs20.

Then, how about using your workaround for them, and enable
my patch for an emacs that has
ucs-mule-8859-to-mule-unicode?

---
Ken'ichi HANDA
handa@m17n.org

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: Bug 130397 (Was: Emacs - Ispell problem with i[no]german dictionary)
  2005-01-10 17:16               ` Agustin Martin
  2005-01-11  5:16                 ` Kenichi Handa
@ 2005-01-11 14:29                 ` Richard Stallman
  2005-01-12  7:45                   ` Kenichi Handa
  1 sibling, 1 reply; 50+ messages in thread
From: Richard Stallman @ 2005-01-11 14:29 UTC (permalink / raw)
  Cc: 130397, lionel, emacs-devel, handa

People have been discussing this issue for a while now,
and due to the volume of mail, I could not read it all.

Handa, is it clear what we should do now for the coming release?

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: Bug 130397 (Was: Emacs - Ispell problem with i[no]german dictionary)
  2005-01-11  5:16                 ` Kenichi Handa
@ 2005-01-11 19:56                   ` Agustin Martin
  2005-01-11 21:39                     ` Lionel Elie Mamane
  2005-01-12  7:37                     ` Kenichi Handa
  0 siblings, 2 replies; 50+ messages in thread
From: Agustin Martin @ 2005-01-11 19:56 UTC (permalink / raw)


On Tue, Jan 11, 2005 at 02:16:33PM +0900, Kenichi Handa wrote:
> In article <20050110171611.GA10357@agmartin.aq.upm.es>, Agustin Martin <agustin.martin@hispalinux.es> writes:
> 
> Sorry, I don't understand.  What is oe-1char?  U+0153 or
> U+0276?  Isn't it the case that neither of them is included in
> iso-8859-1/iso-8859-15?  And I have no idea why adding
> coeur (with oe-1char) to the dictionary solves the
> misalignment error.  Is it because of a bug in ispell?
> 

Sorry, I should have explained myself better.

I meant with oe-1char oe as a single char (U+0153), available
in iso-8859-15 (octal \275 here), but not in iso-8859-1 (you
have one half instead), and with oe-2char the two 7bit chars sequence
'oe', available anywhere, and that is the trick usually used in
iso-8859-1 to represent that char. The french dict is originally
latin1 and use the two chars sequence 'oe', although the Debian dict
defines single-chars 'oe' and 'OE' at the default stringchars section,
but did not use inside. 

Why that addition to the dict made the misalignment disappear is something
that completely puzzles me. Since the ispell shipped with Debian is somewhat
old, I cannot rule out that it is responsible, but I should try with an
ad-hoc more recent ispell to be sure. It was not the usual misalignment
error with an unrecognisable string, but the word itself with the error.
Anyway, the french dict should be consistent, and that seems to make the
problem disappear.

> > So, patch from Kenichi Handa seems to work well for sid emacs21, much better
> > than I thought. However it uses code that has only been recently added to
> > emacs21, and things that are not available for xemacs or emacs20.
> 
> Then, how about using your workaround for them, and enable
> my patch for an emacs that has
> ucs-mule-8859-to-mule-unicode?
> 

I like the idea, and at a first glance it should not be difficult to
implement, even for a person like me, whose lisp skills are limited. It will
help for old emacs21, for emacs20 my workaround will do nothing since it has
no iso-8859-15, and for Debian xemacs21 my workaround is also doing nothing
since xemacs21 seems to return some extra (IMHO wrong) stuff in
buffer-file-coding-system.

I will first retest everything with a 'good' (built to my taste) french
dict, pen and paper, to know in detail the differences in the results for
both systems. The other day I noticed that the misalignment error
disappeared and that things worked better, but not much more. I also need
to test with aspell and other languages.

Even in the case both systems give mostly similar results I will try
integrating them, since I guess your patch will be more appropriate for
future emacs21 development, and looks somewhat more general. 

Cheers,

-- 
Agustin

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: Bug 130397 (Was: Emacs - Ispell problem with i[no]german dictionary)
  2005-01-11 19:56                   ` Agustin Martin
@ 2005-01-11 21:39                     ` Lionel Elie Mamane
  2005-01-12  7:37                     ` Kenichi Handa
  1 sibling, 0 replies; 50+ messages in thread
From: Lionel Elie Mamane @ 2005-01-11 21:39 UTC (permalink / raw)
  Cc: 130397, emacs-devel, Kenichi Handa

On Tue, Jan 11, 2005 at 08:56:23PM +0100, Agustin Martin wrote:

> The french dict is originally latin1 and use the two chars sequence
> 'oe', although the Debian dict defines single-chars 'oe' and 'OE' at
> the default stringchars section, but did not use inside.

I'm not sure what you mean with "but did not use inside". oe-1char
_is_ used in the debian ifrench-gut dictionary.

-- 
Lionel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: Bug 130397 (Was: Emacs - Ispell problem with i[no]german dictionary)
  2005-01-11 19:56                   ` Agustin Martin
  2005-01-11 21:39                     ` Lionel Elie Mamane
@ 2005-01-12  7:37                     ` Kenichi Handa
  2005-01-12 19:17                       ` Agustin Martin
  1 sibling, 1 reply; 50+ messages in thread
From: Kenichi Handa @ 2005-01-12  7:37 UTC (permalink / raw)
  Cc: lionel, emacs-devel, 130397

In article <20050111195623.GA4031@agmartin.aq.upm.es>, Agustin Martin <agustin.martin@hispalinux.es> writes:
> I meant with oe-1char oe as a single char (U+0153), available
> in iso-8859-15 (octal \275 here), but not in iso-8859-1 (you
> have one half instead), and with oe-2char the two 7bit chars sequence
> 'oe', available anywhere, and that is the trick usually used in
> iso-8859-1 to represent that char.

Ah, I see.

> I like the idea, and at a first glance it should not be difficult to
> implement, even for a person like me, whose lisp skills are limited. It will
> help for old emacs21, for emacs20 my workaround will do nothing since it has
> no iso-8859-15, and for Debian xemacs21 my workaround is also doing nothing
> since xemacs21 seems to return some extra (IMHO wrong) stuff in
> buffer-file-coding-system.

> I will first retest everything with a 'good' (built to my taste) french
> dict, pen and paper, to know in detail the differences in the results for
> both systems. Last day I could notice that the misalignment error
> disappeared and that things worked better, but not much more. Need to test
> also with aspell and other languages.

> Even in the case both systems give mostly similar results I will try
> integrating them, since I guess your patch will be more appropriate for
> future emacs21 development, and looks somewhat more general. 

How are the ispell.el of Emacs and that of dictionaries-common
maintained?  Are they synched somehow?  Should I install my
patch in CVS Emacs?  Or is it better to wait for you or
some other maintainer to work on it?

---
Ken'ichi HANDA
handa@m17n.org

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: Bug 130397 (Was: Emacs - Ispell problem with i[no]german dictionary)
  2005-01-11 14:29                 ` Richard Stallman
@ 2005-01-12  7:45                   ` Kenichi Handa
  0 siblings, 0 replies; 50+ messages in thread
From: Kenichi Handa @ 2005-01-12  7:45 UTC (permalink / raw)
  Cc: agustin.martin, lionel, emacs-devel, 130397

In article <E1CoN1i-0000fX-5f@fencepost.gnu.org>, Richard Stallman <rms@gnu.org> writes:

> People have been discussing this issue for a while now,
> and due to the volume of mail, I could not read it all.

I'm reading it, and I think I understand what the
problem is.

> Handa, is it clear what we should do now for the coming release?

As far as I understand, there are two problems, and my patch
solves one of them for the CVS Emacs.  I asked what to do
with it in my previous mail.

---
Ken'ichi HANDA
handa@m17n.org

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: Bug 130397 (Was: Emacs - Ispell problem with i[no]german dictionary)
  2005-01-12  7:37                     ` Kenichi Handa
@ 2005-01-12 19:17                       ` Agustin Martin
  2005-01-13  5:53                         ` Kenichi Handa
  0 siblings, 1 reply; 50+ messages in thread
From: Agustin Martin @ 2005-01-12 19:17 UTC (permalink / raw)


On Wed, Jan 12, 2005 at 04:37:50PM +0900, Kenichi Handa wrote:

> How are ispell.el of Emacs and that of dictionaries-common
> maintained?  Are they synched somehow?  

They are independent. The dictionaries-common ispell.el should be
first in the load-path unless specifically disabled, so the
emacs 21.3 ispell.el will rarely be used in normal setups. As a matter
of fact, emacs 21.3 still seems to ship with ispell.el 3.4, while
the dictionaries-common one is 3.6 with some patches to help integration of
dicts and some bug fixes and improvements taken from both emacs and
xemacs CVS.

> Should I install my
> patch in CVS Emacs?  Or is it better to wait for you or
> some other maintainer to work on it?
> 

I have retested your patch and my workaround with a good ifrench dict
(it was indeed buggy), and both give reasonable results for
iso-8859-{1,15} dict/buffer combinations, with some of the expected
misalignments due to iso-8859-15 chars. However, your patch does the
right thing for a utf-8 buffer with iso-8859-15-only chars and an
iso-8859-15 dict emacs entry, while my workaround does nothing there.

I suggest you install your patch in CVS Emacs, so it can get wider
testing.

I will try to adapt both systems for use in dictionaries-common ispell.el,
with your patch as primary choice and the workaround as a fallback.

Again, thanks a lot for your feedback (and to emacs-devel people and Rob,
debian emacs21 maintainer, for their patience with this rather long thread)

Cheers,

-- 
Agustin

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: Bug 130397 (Was: Emacs - Ispell problem with i[no]german dictionary)
  2005-01-12 19:17                       ` Agustin Martin
@ 2005-01-13  5:53                         ` Kenichi Handa
  0 siblings, 0 replies; 50+ messages in thread
From: Kenichi Handa @ 2005-01-13  5:53 UTC (permalink / raw)
  Cc: emacs-devel

In article <20050112191716.GA19198@agmartin.aq.upm.es>, Agustin Martin <agustin.martin@hispalinux.es> writes:
>>  Should I install my
>>  patch in CVS Emacs?  Or is it better to wait for you or
>>  some other maintainer to work on it?

> I have retested your patch and my workaround with a good ifrench dict
> (it was indeed buggy), and both give reasonable results for
> iso-8859-{1,15} dict/buffer combinations, with some of the expected
> misalignments due to iso-8859-15 chars. However, your patch does the
> right thing for a utf-8 buffer with iso-8859-15-only chars and an
> iso-8859-15 dict emacs entry, while my workaround does nothing there.

> I suggest you install your patch in CVS Emacs, so it can get wider
> testing.

Ok, I've just installed it.

> I will try to adapt both systems for use in dictionaries-common ispell.el,
> with your patch as primary choice and the workaround as a fallback.

I see.  Thank you.

The remaining problem is that a character sequence
recognized as a single word by Emacs won't be recognized as
a word by ispell-get-word if it contains a character that
can't be encoded by the coding system specified for the
current ispell dictionary.  For instance, if we check a
buffer containing "español" with the american dictionary,
"espa" and "ol" are checked separately.

This problem is not as serious as the misalignment error,
but it would be better to fix it somehow.  Stephen suggested that
such a word should be skipped or marked as "uncheckable".  I
think the latter is better because 'ñ' in the above case may
be a typo.

This kind of situation can be detected by checking whether
the first and last characters of a word detected by
ispell-get-word really are at word boundaries.  But I don't
know where to check it, nor how to mark something as
"uncheckable" or "unknown word".  It requires work by
someone who knows ispell.el well.
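
A rough sketch of the "uncheckable" idea, in Python rather than in
ispell.el. The function name and the notion of a single dictionary coding
passed as a string are assumptions made for the example. Words are found
using the editor's notion of a word (any run of letters), and each is then
tested against the dictionary's coding system instead of being split at
the first unencodable character:

```python
import re

# A run of letters in any script, approximating what Emacs treats as a word.
WORD_RE = re.compile(r"[^\W\d_]+")


def classify_words(text: str, dictionary_coding: str):
    """Return (word, checkable) pairs for every word in `text`.

    A word is checkable only if all of its characters can be encoded
    in the dictionary's coding system.
    """
    result = []
    for match in WORD_RE.finditer(text):
        word = match.group()
        try:
            word.encode(dictionary_coding)
            checkable = True
        except UnicodeEncodeError:
            checkable = False
        result.append((word, checkable))
    return result
```

With an ASCII-only dictionary, "español" comes back as one word flagged
uncheckable, rather than as the two fragments "espa" and "ol".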

---
Ken'ichi HANDA
handa@m17n.org

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: Bug 130397
  2005-01-07 21:27                         ` Juri Linkov
@ 2005-01-13  5:59                           ` Kenichi Handa
  2005-01-18 10:44                             ` Juri Linkov
  0 siblings, 1 reply; 50+ messages in thread
From: Kenichi Handa @ 2005-01-13  5:59 UTC (permalink / raw)
  Cc: agustin.martin, emacs-devel, k.stevens, ispell-el-bugs, 130397

In article <878y7553qd.fsf@jurta.org>, Juri Linkov <juri@jurta.org> writes:

> Agustin Martin <agustin.martin@hispalinux.es> writes:
>>  *Ken*, since you are being cc'ed: I vaguely remembered some info I read
>>  somewhere about these misalignments. I finally found it,
>> 
>>   http://lists.gnu.org/archive/html/emacs-devel/2002-09/msg01007.html

> The bug reported on this URL occurs only in Emacs 21.3, not in Emacs CVS.
> It seems something was fixed already.

> However, with a strange coincidence I got the same error in Emacs CVS just
> today for the first time.  So I can describe how this bug can be reproduced
> in Emacs CVS: when the first part of a word was copied from an external
> application and got encoded in the buffer in mule-unicode-0100-24ff,
> and the second part of the word typed with an input method and gets encoded
> in cyrillic-iso8859-5, then calling ispell-buffer on a buffer with the word
> composed with different encodings with `russian' dictionary signals the
> error "Ispell misalignment".

Please try the latest ispell.el.  I think at least this
misalignment error is fixed now.

> And while on this topic, I want to remind that many Emacs users suffer
> from the inability of ispell.el to simultaneously check mixed multi-language
> texts.  So, whoever fixes ispell.el, please take that into account.
> Such combining is quite easily doable for any disjoint alphabets, as well
> as for alphabets where one alphabet is a superset of another, like e.g.
> English and some other Latin-based alphabets.  Even for overlapping
> alphabets it would be possible with using the `w' syntax to get a word
> and to feed it to different ispell instances for each dictionary.

As for this, I agree with the following statement.

Geoff Kuenning <geoff@cs.hmc.edu> writes:
> I'm not entirely sure what you mean here.  For disjoint alphabets,
> it's certainly relatively easy to figure out which word should go to
> which ispell instance.  For identical, superset, or overlapping
> alphabets, the problem is basically insoluble.  For example, "fra" is
> a misspelling in English but legal in Italian.  If it appears in a
> mixed passage, which dictionary should it be fed to?  The only
> solution would seem to be to require the user to mark passages in some
> way, as is done in HTML.

---
Ken'ichi HANDA
handa@m17n.org

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: Bug 130397
  2005-01-10  9:09                                     ` David Kastrup
  2005-01-10 20:16                                       ` Eli Zaretskii
@ 2005-01-13  7:50                                       ` Kenichi Handa
  1 sibling, 0 replies; 50+ messages in thread
From: Kenichi Handa @ 2005-01-13  7:50 UTC (permalink / raw)
  Cc: geoff, 130397, agustin.martin, lionel, emacs-devel, kstevens,
	eliz, snogglethorpe, miles

In article <x5zmzhr6p6.fsf@lola.goethe.zz>, David Kastrup <dak@gnu.org> writes:
>>>  If ispell wants utf-8, it's easy enough to convert each input line to
>>>  utf-8 and deal with offsets into that in the event of a misspelling;
>> 
>>  Or account for byte offsets by (variable) multibyte length of each
>>  character, which Emacs knows.  I don't remember for the moment whether
>>  the multibyte length of the UTF-8 encoding can be gotten at by a Lisp
>>  program, but if not, we could add some primitive to do that.

> Just encode the line to utf-8, find the correct point in the byte
> string, cut off the line there, convert back and check the length of
> the string.  This works unless you are in the middle of a character.

> But it would be much saner if our conversion facilities would preserve
> markers (which they don't do right now): encode to utf-8, place a
> marker at the right byte offset, undo the conversion.

You can encode a text to utf-8, place several markers, and encode
regions between markers one by one.

---
Ken'ichi HANDA
handa@m17n.org

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: Bug 130397
  2005-01-13  5:59                           ` Kenichi Handa
@ 2005-01-18 10:44                             ` Juri Linkov
  2005-01-18 13:57                               ` Geoff Kuenning
  2005-01-18 23:24                               ` Kenichi Handa
  0 siblings, 2 replies; 50+ messages in thread
From: Juri Linkov @ 2005-01-18 10:44 UTC (permalink / raw)
  Cc: agustin.martin, emacs-devel, k.stevens, ispell-el-bugs, 130397

Kenichi Handa <handa@m17n.org> writes:
> Please try the latest ispell.el.  I think at least this
> misalignment error is fixed now.

I tried the latest ispell.el and I see that your change is a definite
improvement, since it now allows checking words in mule-unicode charsets.
But it still doesn't fix the misalignment error.  It even makes this
error more frequent, because it now occurs in all UTF-8 texts checked
with ispell-region (which were simply skipped before your change).

The cause of the error is the following: a line sent by ispell.el
to the ispell process is converted from mule-unicode charset to the
process charset, and the accepted output gets converted from process
coding to the internal Emacs charset iso8859.  So `search-forward' in
`ispell-process-line' fails to find a string in iso8859 charset
in the buffer with the same string in mule-unicode charset.

> As for this, I agree with the following statement.
>
> Geoff Kuenning <geoff@cs.hmc.edu> writes:
>> I'm not entirely sure what you mean here.  For disjoint alphabets,
>> it's certainly relatively easy to figure out which word should go to
>> which ispell instance.  For identical, superset, or overlapping
>> alphabets, the problem is basically insoluble.  For example, "fra" is
>> a misspelling in English but legal in Italian.  If it appears in a
>> mixed passage, which dictionary should it be fed to?  The only
>> solution would seem to be to require the user to mark passages in some
>> way, as is done in HTML.

I agree that marking would help ispell.el to decide which dictionary
to use on a word.  However, even without marking users might still prefer
to check words simultaneously with multiple dictionaries and to accept
a word when it's found in one dictionary, because such cases where a word
appears in both dictionaries might be too rare for two chosen languages.

A similar problem exists even within one language where a misspelled word
is still a valid word according to ispell (e.g. misspelled "male" instead
of "mail").
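
The "accept if any dictionary knows it" policy is easy to state in code.
The sketch below is Python with toy word sets standing in for real
dictionaries; nothing in it reflects how ispell.el or ispell is actually
structured:

```python
def misspelled(words, dictionaries):
    """Return the words that appear in none of the given dictionaries.

    `dictionaries` is a list of sets of lowercase words (toy stand-ins
    for real ispell dictionaries).
    """
    return [
        word
        for word in words
        if not any(word.lower() in dictionary for dictionary in dictionaries)
    ]


if __name__ == "__main__":
    english = {"the", "mail", "male", "luck"}
    italian = {"fra", "la", "casa"}
    print(misspelled(["the", "fra", "lueck", "male"], [english, italian]))
```

Here "fra" is accepted because the Italian set contains it, and "male" is
accepted even where "mail" was meant, which is the single-language
limitation noted above.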

-- 
Juri Linkov
http://www.jurta.org/emacs/

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: Bug 130397
  2005-01-18 10:44                             ` Juri Linkov
@ 2005-01-18 13:57                               ` Geoff Kuenning
  2005-01-19  7:34                                 ` Juri Linkov
  2005-01-18 23:24                               ` Kenichi Handa
  1 sibling, 1 reply; 50+ messages in thread
From: Geoff Kuenning @ 2005-01-18 13:57 UTC (permalink / raw)
  Cc: agustin.martin, 130397, emacs-devel, k.stevens, Kenichi Handa

> I agree that marking would help ispell.el to decide which dictionary
> to use on a word.  However, even without marking users might still prefer
> to check words simultaneously with multiple dictionaries

That's a good point.  Unfortunately, the current implementation of
ispell makes it impossible to use two dictionaries simultaneously.

I do recall hearing from one user who used a pipe something like this:

        ispell -d language-1 -l | ispell -d language-2 -l ... | sort -uf

but of course that only provides a list of misspelled words, and loses
all the correction capabilities.
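
To make the chaining concrete without needing ispell or any
dictionaries installed, here is a minimal sketch in which plain word
lists stand in for "ispell -d DICT -l".  The unknown_words function
and the two tiny word lists are made up for illustration; only the
shape of the pipe comes from the command above.

```shell
#!/bin/sh
# Stand-in for "ispell -d DICT -l": print the words from stdin (one per
# line) that are NOT listed in the word-list file given as $1.
# Assumption: a plain word list, one word per line, plays the dictionary.
unknown_words() {
    grep -v -x -i -F -f "$1"
}

# Two tiny hypothetical word lists.
tmpdir=$(mktemp -d)
printf '%s\n' from the house > "$tmpdir/english"
printf '%s\n' fra la casa > "$tmpdir/italian"

# Only words rejected by every stage survive the pipe.  "the" is dropped
# by the first stage, "fra" and "casa" by the second.
result=$(printf '%s\n' fra hause casa the |
    unknown_words "$tmpdir/english" |
    unknown_words "$tmpdir/italian" |
    sort -uf)
rm -rf "$tmpdir"
echo "$result"
```

As with the real pipe, the result is only a list of rejected words;
nothing about possible corrections survives.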

Just brainstorming, it probably wouldn't be too hard to write a
postprocessing script for -a mode that turned the output of ispell -a
into something suitable for another ispell.  The idea would be that
you feed:

        I do not want to acept my bda lueck


and turn the output lines:

        @(#) International Ispell Version 3.2.06 08/01/01
        *
        *
        *
        *
        *
        & acept 2 17: accept, adept
        *
        & bda 9 26: Ada, baa, bad, bea, bida, boa, bra, FDA, Ida
        & lueck 1 30: luck

into a line of blanks and misspelled words:

                         acept    bda lueck

which can then be fed into another ispell -a instance.  The final
output would be returned to emacs.  The pipe would look like:

        ispell -a -d language-1 | fixispell-a | ispell -a -d language-2

In fact, this script is so easy I think I'll whip it up right now.
It's called fixispell-a, and it will be in the next ispell release.
Here it is:

#!/bin/sh
#
# $Id: fixispell-a,v 1.1 2005/01/18 13:48:52 geoff Exp geoff $
#
# Copyright 2005, Geoff Kuenning, Claremont, CA.
# All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
#
# 1. Redistributions of source code must retain the above copyright
#    notice, this list of conditions and the following disclaimer.
# 2. Redistributions in binary form must reproduce the above copyright
#    notice, this list of conditions and the following disclaimer in the
#    documentation and/or other materials provided with the distribution.
# 3. All modifications to the source code must be clearly marked as
#    such.  Binary redistributions based on modified source code
#    must be clearly marked as modified versions in the documentation
#    and/or other materials provided with the distribution.
# 4. The code that causes the 'ispell -v' command to display a prominent
#    link to the official ispell Web site may not be removed.
# 5. The name of Geoff Kuenning may not be used to endorse or promote
#    products derived from this software without specific prior
#    written permission.
#
# THIS SOFTWARE IS PROVIDED BY GEOFF KUENNING AND CONTRIBUTORS ``AS IS'' AND
# ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
# ARE DISCLAIMED.  IN NO EVENT SHALL GEOFF KUENNING OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
# OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
# HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
# LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
# OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
# SUCH DAMAGE.
#
# Take the output of "ispell -a" and turn it into a line that can be
# fed into another "ispell -a" instance.
#
# Usage:
#
USAGE='Usage: ispell -a <ispell-switches> | fixispell-a | ispell -a ...'
#
# BUGS:
#
# This script is probably not portable to older systems.
#
# $Log: fixispell-a,v $
# Revision 1.1  2005/01/18 13:48:52  geoff
# Initial revision
#

case "$#" in
    0)
	;;
    *)
	echo "$USAGE" 1>&2
	exit 2
	;;
esac

awk 'NR == 1 \
	{
	next
	}
    NF == 0 \
	{
	print line
	line = ""
	next
	}
    $1 == "*"  ||  $1 == "+"  ||  $1 == "-" \
	{
	next
	}
    $1 == "&"  ||  $1 == "?"  ||  $1 == "#" \
	{
	if ($1 == "#")
	    offset = $3 + 0
	else
	    offset = substr($4, 1, length($4) - 1) + 0
	if (length(line) < offset)
	    line = sprintf("%s%*s", line, offset - length(line), "")
	line = line $2
	next
	}
	{
	print "fixispell-a: unrecognized ispell input line: " $0 > "/dev/stderr"
	exit(2)
	}'
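
For anyone who wants to try the transformation without ispell
installed, the following is a condensed sketch of the same awk logic
wrapped in a shell function and fed the sample output from earlier in
this message.  The function name fix_a is made up for the example, and
the padding is done with a loop rather than sprintf("%*s"), on the
assumption that some awk implementations may not accept the "*" width.

```shell
#!/bin/sh
# Condensed variant of the awk program above, usable without ispell.
fix_a() {
    awk 'NR == 1 { next }                         # skip the version banner
         NF == 0 { print line; line = ""; next }  # blank line ends one input line
         $1 == "*" || $1 == "+" || $1 == "-" { next }   # accepted words
         $1 == "&" || $1 == "?" || $1 == "#" {
             # "#" lines carry the offset in field 3; "&" and "?" lines
             # carry it in field 4, followed by a colon.
             offset = ($1 == "#") ? ($3 + 0) : (substr($4, 1, length($4) - 1) + 0)
             # Pad with blanks up to the offset, then append the word.
             while (length(line) < offset) line = line " "
             line = line $2
             next
         }'
}

# The sample "ispell -a" output, plus the blank line that terminates
# the results for one input line.
out=$(printf '%s\n' \
    '@(#) International Ispell Version 3.2.06 08/01/01' \
    '*' '*' '*' '*' '*' \
    '& acept 2 17: accept, adept' \
    '*' \
    '& bda 9 26: Ada, baa, bad, bea, bida, boa, bra, FDA, Ida' \
    '& lueck 1 30: luck' \
    '' | fix_a)
printf '[%s]\n' "$out"
```

The brackets in the final printf are only there to make the leading
blanks visible.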
-- 
    Geoff Kuenning   geoff@cs.hmc.edu   http://www.cs.hmc.edu/~geoff/

If a person is obviously mentally disabled, such as having Down's
syndrome or Alzheimer's, decent people exercise sympathy and
understanding in their interactions.  So why, if someone merely has a
low IQ, is he treated with ridicule and contempt?

* Re: Bug 130397
  2005-01-18 10:44                             ` Juri Linkov
  2005-01-18 13:57                               ` Geoff Kuenning
@ 2005-01-18 23:24                               ` Kenichi Handa
  2005-01-19  7:43                                 ` Juri Linkov
  1 sibling, 1 reply; 50+ messages in thread
From: Kenichi Handa @ 2005-01-18 23:24 UTC (permalink / raw)
  Cc: agustin.martin, 130397, k.stevens, ispell-el-bugs, emacs-devel

In article <878y6rnhd3.fsf@jurta.org>, Juri Linkov <juri@jurta.org> writes:

> Kenichi Handa <handa@m17n.org> writes:
>>  Please try the latest ispell.el.  I think at least this
>>  misalignment error is fixed now.

> I tried the latest ispell.el and I see that your change is a definite
> improvement since it now allows checking words in mule-unicode charsets.
> But it still doesn't fix the misalignment error.  It even makes this
> error more frequent because it now occurs in all UTF-8 texts checked
> with ispell-region (which earlier were simply skipped before your change).

> The cause of the error is the following: a line sent by ispell.el
> to the ispell process is converted from mule-unicode charset to the
> process charset, and the accepted output gets converted from process
> coding to the internal Emacs charset iso8859.  So `search-forward' in
> `ispell-process-line' fails to find a string in iso8859 charset
> in the buffer with the same string in mule-unicode charset.

Ah! I see.  I've just installed the attached change which
should fix that misalignment error.  ispell-looking-at is
not that tuned yet, and there will be a better way to
implement it.

---
Ken'ichi HANDA
handa@m17n.org

Index: ispell.el
===================================================================
RCS file: /cvsroot/emacs/emacs/lisp/textmodes/ispell.el,v
retrieving revision 1.152
retrieving revision 1.153
diff -u -c -r1.152 -r1.153
cvs diff: conflicting specifications of output style
*** ispell.el	13 Jan 2005 04:33:05 -0000	1.152
--- ispell.el	18 Jan 2005 23:16:27 -0000	1.153
***************
*** 2794,2799 ****
--- 2794,2808 ----
      string))
  
  
+ (defun ispell-looking-at (string)
+   (let ((coding (ispell-get-coding-system))
+ 	(len (length string)))
+     (and (<= (+ (point) len) (point-max))
+ 	 (equal (encode-coding-string string coding)
+ 		(encode-coding-string (buffer-substring-no-properties
+ 				       (point) (+ (point) len))
+ 				      coding)))))
+ 
  ;;; Avoid error messages when compiling for these dynamic variables.
  (eval-when-compile
    (defvar start)
***************
*** 2842,2853 ****
  
  	    ;; Alignment cannot be tracked and this error will occur when
  	    ;; `query-replace' makes multiple corrections on the starting line.
! 	    (if (/= (+ word-len (point))
! 		    (progn
! 		      ;; NB: Search can fail with Mule coding systems that don't
! 		      ;;  display properly.  Ignore the error in this case?
! 		      (search-forward (car poss) (+ word-len (point)) t)
! 		      (point)))
  		;; This occurs due to filter pipe problems
  		(error (concat "Ispell misalignment: word "
  			       "`%s' point %d; probably incompatible versions")
--- 2851,2857 ----
  
  	    ;; Alignment cannot be tracked and this error will occur when
  	    ;; `query-replace' makes multiple corrections on the starting line.
! 	    (or (ispell-looking-at (car poss))
  		;; This occurs due to filter pipe problems
  		(error (concat "Ispell misalignment: word "
  			       "`%s' point %d; probably incompatible versions")

* Re: Bug 130397
  2005-01-18 13:57                               ` Geoff Kuenning
@ 2005-01-19  7:34                                 ` Juri Linkov
  2005-01-19 12:22                                   ` Geoff Kuenning
  2005-04-29  0:29                                   ` Geoff Kuenning
  0 siblings, 2 replies; 50+ messages in thread
From: Juri Linkov @ 2005-01-19  7:34 UTC (permalink / raw)
  Cc: agustin.martin, emacs-devel, k.stevens, 130397

Geoff Kuenning <geoff@cs.hmc.edu> writes:
> Just brainstorming, it probably wouldn't be too hard to write a
> postprocessing script for -a mode that turned the output of ispell -a
> into something suitable for another ispell.

This approach is quite promising, but it doesn't work sufficiently well
for non-English languages.  It loses all characters that don't belong
to the alphabet specified in .aff file.  For example, it turns the line:

        I do not want to acept my bda español

into:

                         acept    bda espa ol

One solution is to add the -w flag to specify additional characters:

        ispell -a -w ñ -d american | fixispell-a | ispell -a -d spanish

Perhaps ispell.el could compute this set of additional characters
automatically, as the difference between the two alphabets.
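
A rough sketch of that set difference, done outside Emacs for
simplicity.  It assumes each alphabet is already available as a file
with one character per line; extracting the alphabets from real .aff
files is not attempted here, and the function name extra_chars and the
abbreviated alphabets are made up for the example.

```shell
#!/bin/sh
# Print the characters that appear in the second alphabet ($2) but not
# in the first ($1).  Both files hold one character per line.
extra_chars() {
    awk 'NR == FNR { seen[$0] = 1; next }        # remember the first alphabet
         !($0 in seen) { printf "%s", $0 }        # emit what only the second has
         END { print "" }' "$1" "$2"
}

tmpdir=$(mktemp -d)
printf '%s\n' a b c n o > "$tmpdir/first"        # hypothetical, abbreviated
printf '%s\n' a b c n ñ o ó > "$tmpdir/second"   # hypothetical, abbreviated
extra=$(extra_chars "$tmpdir/first" "$tmpdir/second")
rm -rf "$tmpdir"

# The result could then be handed to the first stage of the pipe, e.g.
#   ispell -a -w "$extra" -d american | fixispell-a | ispell -a -d spanish
echo "$extra"
```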

But there is another problem.  fixispell-a returns a list of near misses
only for the last language in the pipe.  It would be better if it
accumulated a list of near misses from all ispell commands in the pipe.

-- 
Juri Linkov
http://www.jurta.org/emacs/

* Re: Bug 130397
  2005-01-18 23:24                               ` Kenichi Handa
@ 2005-01-19  7:43                                 ` Juri Linkov
  2005-01-19 12:52                                   ` Kenichi Handa
  0 siblings, 1 reply; 50+ messages in thread
From: Juri Linkov @ 2005-01-19  7:43 UTC (permalink / raw)
  Cc: agustin.martin, 130397, k.stevens, ispell-el-bugs, emacs-devel

Kenichi Handa <handa@m17n.org> writes:
> In article <878y6rnhd3.fsf@jurta.org>, Juri Linkov <juri@jurta.org> writes:
>> The cause of the error is the following: a line sent by ispell.el
>> to the ispell process is converted from mule-unicode charset to the
>> process charset, and the accepted output gets converted from process
>> coding to the internal Emacs charset iso8859.  So `search-forward' in
>> `ispell-process-line' fails to find a string in iso8859 charset
>> in the buffer with the same string in mule-unicode charset.
>
> Ah! I see.  I've just installed the attached change which
> should fix that misalignment error.  ispell-looking-at is
> not that tuned yet, and there will be a better way to
> implement it.

I tried your fix, and the misalignment error doesn't occur anymore.
Thanks.

Now a new problem was uncovered: after selecting a correct word from
a list of near misses returned from ispell, ispell.el replaces the
misspelled word with a selected word, and inserts it into the buffer
not in its original mule-unicode charset, but in iso8859.

-- 
Juri Linkov
http://www.jurta.org/emacs/

* Re: Bug 130397
  2005-01-19  7:34                                 ` Juri Linkov
@ 2005-01-19 12:22                                   ` Geoff Kuenning
  2005-04-29  0:29                                   ` Geoff Kuenning
  1 sibling, 0 replies; 50+ messages in thread
From: Geoff Kuenning @ 2005-01-19 12:22 UTC (permalink / raw)
  Cc: agustin.martin, emacs-devel, k.stevens, 130397

> This approach is quite promising, but it doesn't work sufficiently well
> for non-English languages.  It loses all characters that don't belong
> to the alphabet specified in .aff file.  For example, it turns the line:

For related reasons, the english.aff file in the next ispell release
will include a much expanded character set.  That allows words adopted
from other languages (such as "naïve") to be included in the
dictionary.  If every language did the same, part of the problem would
go away.

> But there is another problem.  fixispell-a returns a list of near misses
> only for the last language in the pipe.  It would be better if it
> accumulated a list of near misses from all ispell commands in the pipe.

Yeah, I just realized that drawback.  I think I can come up with a way
to fix it, though the invocation mechanism would be different.  The
revision would be a command called something like multispell, invoked
like this:

    multispell [ispell-switches] -d language-1 -d language-2

and behaving like ispell -a.  For convenience, it could also
automatically supply a catch-all "-w" switch.
-- 
    Geoff Kuenning   geoff@cs.hmc.edu   http://www.cs.hmc.edu/~geoff/

Statistics don't bore people, people bore people.

* Re: Bug 130397
  2005-01-19  7:43                                 ` Juri Linkov
@ 2005-01-19 12:52                                   ` Kenichi Handa
  2005-01-19 13:08                                     ` David Kastrup
  0 siblings, 1 reply; 50+ messages in thread
From: Kenichi Handa @ 2005-01-19 12:52 UTC (permalink / raw)
  Cc: agustin.martin, 130397, k.stevens, ispell-el-bugs, emacs-devel

In article <87r7kit0nz.fsf@jurta.org>, Juri Linkov <juri@jurta.org> writes:
> Now a new problem was uncovered: after selecting a correct word from
> a list of near misses returned from ispell, ispell.el replaces the
> misspelled word with a selected word, and inserts it into the buffer
> not in its original mule-unicode charset, but in iso8859.

Perhaps the following function can be utilized somewhere in
ispell to do that, but, as I still don't understand ispell
code that much, I'd like to ask someone else to modify
ispell to use it.

;; Destructively modify WORD by converting each character in it to the
;; equivalent character of CHARSET.

(defun ispell-adjust-charset (word charset)
  (let ((len (length word)))
    (if (< len (string-bytes word))
	(dotimes (i len)
	  (let ((c (aref word i))
		this-charset equiv-chars)
	    (if (and (>= c 128)
		     (not (eq (setq this-charset (char-charset c)) charset))
		     (or (memq this-charset '(mule-unicode-0100-24ff 
					      mule-unicode-2500-34ff))
			 (setq c (aref ucs-mule-8859-to-mule-unicode c)))
		     (setq equiv-chars (aref ispell-unified-chars-table c)))
		(catch 'tag
		  (dotimes (j (length equiv-chars))
		    (when (eq (char-charset (aref equiv-chars j)) charset)
		      (aset word i (aref equiv-chars j))
		      (throw 'tag nil))))))))))

---
Ken'ichi HANDA
handa@m17n.org

PS. I personally feel it's a waste of time to struggle with
charset matters in ispell that much because emacs-unicode
should not have such a problem.

* Re: Bug 130397
  2005-01-19 12:52                                   ` Kenichi Handa
@ 2005-01-19 13:08                                     ` David Kastrup
  0 siblings, 0 replies; 50+ messages in thread
From: David Kastrup @ 2005-01-19 13:08 UTC (permalink / raw)
  Cc: k.stevens, ispell-el-bugs, 130397, agustin.martin, emacs-devel,
	Juri Linkov

Kenichi Handa <handa@m17n.org> writes:

> PS. I personally feel it's a waste of time to struggle with
> charset matters in ispell that much because emacs-unicode
> should not have such a problem.

Oh, but in the 5+ years until it gets released, people might still be
glad to have a fix.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum

* Re: Bug 130397
  2005-01-19  7:34                                 ` Juri Linkov
  2005-01-19 12:22                                   ` Geoff Kuenning
@ 2005-04-29  0:29                                   ` Geoff Kuenning
  2005-04-29  8:45                                     ` Thien-Thi Nguyen
  1 sibling, 1 reply; 50+ messages in thread
From: Geoff Kuenning @ 2005-04-29  0:29 UTC (permalink / raw)
  Cc: agustin.martin, emacs-devel, k.stevens, 130397

For those of you who don't know, I've released ispell 3.3.00.  Having
gotten that off my plate, I'm busily working on some improvements that
will go into 3.3.01.  Number one on that list is to redo the
fixispell-a script that I whipped up a few months ago.

Juri points out:

> This approach is quite promising, but it doesn't work sufficiently well
> for non-English languages.  It loses all characters that don't belong
> to the alphabet specified in .aff file.

and:

> But there is another problem.  fixispell-a returns a list of near misses
> only for the last language in the pipe.  It would be better if it
> accumulated a list of near misses from all ispell commands in the pipe.

The former problem is best addressed using Juri's suggestion of
passing the "-w" switch to specify a superset.  In addition, in the
new release, the english.aff file includes all of Latin-1 (since
English sometimes adopts accented words and names from other
languages).  The -w switch is still needed, though, to handle things
like the apostrophe, which isn't in all non-English affix files.  I
welcome further suggestions.

The latter problem motivated me to write an entirely new program,
multispell, which does a better job of what fixispell-a attempted.
It's invoked as:

        multispell [ispell-switches] dict1 dict2 dict3

For example:

        multispell -m english deutsch francais

Multispell behaves like ispell -a, but accepts any word that any of
the mentioned dictionaries accept.  If a word is rejected, it combines
suggestions from all dictionaries.  So, for example, sending "wuld" to
the above line produces:

        & wuld 0 7 weld, wild, wold, would, Wald, wild, wund

This brings me to a question and a discussion point.  The question is
highlighted in the above line: the word "wild" appears as a
suggestion twice, because the English and German dictionaries both
produce it.  Do people think that's a Bad Thing?  I can certainly
write code to suppress the duplicates; I'm just feeling lazy at the
moment. *grin*
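
For reference, suppressing the duplicates needs very little code.  The
following is only a sketch of one way to do it, not what multispell
does; the function name dedup_suggestions is made up, and it assumes
the suggestions arrive as a single list separated by comma and blank.

```shell
#!/bin/sh
# Drop repeated suggestions from a comma-separated list, keeping the
# first occurrence of each and preserving the original order.
dedup_suggestions() {
    awk -F', ' '{
        out = ""
        split("", seen)              # reset the set for each input line
        for (i = 1; i <= NF; i++)
            if (!($i in seen)) {
                seen[$i] = 1
                out = out (out == "" ? "" : ", ") $i
            }
        print out
    }'
}

echo 'weld, wild, wold, would, Wald, wild, wund' | dedup_suggestions
```

The comparison is case-sensitive, so "Wald" and a hypothetical "wald"
would both be kept.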

The discussion point is a bit more complex.  If you invoke multispell
with:

        multispell -T latin1 -m english deutsch francais

it will fail because the English dictionary doesn't recognize "latin1"
as a valid encoding.  How do people think I should handle these
variations among affix files?  One obvious option would be to make the
-T switch be dictionary-specific in multispell, so you'd write:

        multispell -m -T list english -T latin1 deutsch -T latin1 francais
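
To illustrate what the per-dictionary form would mean for a caller,
here is a hypothetical sketch of how such a command line could be
split into one "ispell -a" invocation per dictionary.  It is not how
multispell is implemented; the function name plan_commands is made up,
and it assumes that -T is the only switch taking an argument.

```shell
#!/bin/sh
# Hypothetical argument walk: a "-T type" applies only to the next
# dictionary name; every other switch is treated as shared.  Prints one
# "ispell -a" command line per dictionary.
plan_commands() {
    shared=""
    pending=""
    while [ "$#" -gt 0 ]; do
        case "$1" in
            -T) [ "$#" -ge 2 ] || { echo "missing type after -T" >&2; return 2; }
                pending=" -T $2"
                shift 2 ;;
            -*) shared="$shared $1"
                shift ;;
            *)  echo "ispell -a$shared$pending -d $1"
                pending=""            # the type does not carry over
                shift ;;
        esac
    done
}

plan_commands -m -T list english -T latin1 deutsch -T latin1 francais
```

A shared switch that appears after a dictionary name would only affect
the dictionaries that follow it, which is one more thing a caller
would have to keep in mind.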

Another option would be to insist that all affix files follow a common
naming scheme, so that everybody would be willing to accept "latin1"
as an encoding name, and so forth.

From my point of view, both options are bad.  The first requires too
much intelligence on the part of ispell.el.  The second is going to be
hard to enforce.

Opinions are welcomed.
-- 
    Geoff Kuenning   geoff@cs.hmc.edu   http://www.cs.hmc.edu/~geoff/

Windows XP is the "most reliable Windows ever," which is like saying
that asparagus is "the most articulate vegetable ever."
	-- Dave Barry

* Re: Bug 130397
  2005-04-29  0:29                                   ` Geoff Kuenning
@ 2005-04-29  8:45                                     ` Thien-Thi Nguyen
  0 siblings, 0 replies; 50+ messages in thread
From: Thien-Thi Nguyen @ 2005-04-29  8:45 UTC (permalink / raw)
  Cc: Juri Linkov, agustin.martin, 130397, k.stevens, emacs-devel

Geoff Kuenning <geoff@cs.hmc.edu> writes:

> From my point of view, both options are bad.  The first requires too
> much intelligence on the part of ispell.el.  The second is going to be
> hard to enforce.
> 
> Opinions are welcomed.

IMHO, intelligence should properly reside in ispell.el since emacs has
all the infrastructure to (dis)ambiguate associations between codings
and other aspects, and since en/decoding occurs on i/o.

if something is hard to enforce that just means you need to have
intelligence in the heuristics that comprise the workarounds in the
error handling (or, omitting this, suffer a buggy ispell experience).
that doesn't seem like much fun to program.  better to make clients
sweat the specifications than the non-specifications.

thi

Thread overview: 50+ messages
     [not found] <Pine.LNX.4.43.0305140821370.30166-100000@wr-linux02.rki.ivbb.bund.de>
     [not found] ` <m3addpd2ur.fsf@dionysos.nib>
     [not found]   ` <E19HNCh-0000tv-00@fencepost.gnu.org>
     [not found]     ` <20040517120658.GA6919@agmartin.aq.upm.es>
     [not found]       ` <E1BQ5z5-0000f4-5u@fencepost.gnu.org>
2004-05-19 11:44         ` Bug 130397 (Was: Emacs - Ispell problem with i[no]german dictionary) Agustin Martin
2004-05-21  8:01           ` Agustin Martin
2004-12-17 12:15       ` Agustin Martin
2004-12-22 12:37         ` Kenichi Handa
2004-12-22 17:13           ` Agustin Martin
2005-01-04 12:50             ` Kenichi Handa
2005-01-04 14:55               ` Bug 130397 Stefan
2005-01-05  2:00                 ` Kenichi Handa
2005-01-05  4:42                   ` Stefan Monnier
2005-01-05  5:50                     ` Kenichi Handa
2005-01-05 14:02                       ` Stefan Monnier
2005-01-06  0:44                         ` Kenichi Handa
2005-01-06 16:30                           ` Ken Stevens
2005-01-06 17:33                             ` Stefan Monnier
2005-01-07  0:39                               ` Kenichi Handa
2005-01-07 15:48                             ` Agustin Martin
2005-01-08 12:31                             ` Geoff Kuenning
2005-01-08 12:47                               ` David Kastrup
2005-01-08 13:29                                 ` Miles Bader
2005-01-08 17:15                                   ` Geoff Kuenning
2005-01-10  4:45                                   ` Eli Zaretskii
2005-01-10  9:09                                     ` David Kastrup
2005-01-10 20:16                                       ` Eli Zaretskii
2005-01-13  7:50                                       ` Kenichi Handa
2005-01-08 22:39                               ` Peter Heslin
2005-01-07 15:36                       ` Agustin Martin
2005-01-07 20:29                         ` Ken Stevens
2005-01-07 21:27                         ` Juri Linkov
2005-01-13  5:59                           ` Kenichi Handa
2005-01-18 10:44                             ` Juri Linkov
2005-01-18 13:57                               ` Geoff Kuenning
2005-01-19  7:34                                 ` Juri Linkov
2005-01-19 12:22                                   ` Geoff Kuenning
2005-04-29  0:29                                   ` Geoff Kuenning
2005-04-29  8:45                                     ` Thien-Thi Nguyen
2005-01-18 23:24                               ` Kenichi Handa
2005-01-19  7:43                                 ` Juri Linkov
2005-01-19 12:52                                   ` Kenichi Handa
2005-01-19 13:08                                     ` David Kastrup
2005-01-07 15:34               ` Bug 130397 (Was: Emacs - Ispell problem with i[no]german dictionary) Agustin Martin
2005-01-10 13:06             ` Lionel Elie Mamane
2005-01-10 17:16               ` Agustin Martin
2005-01-11  5:16                 ` Kenichi Handa
2005-01-11 19:56                   ` Agustin Martin
2005-01-11 21:39                     ` Lionel Elie Mamane
2005-01-12  7:37                     ` Kenichi Handa
2005-01-12 19:17                       ` Agustin Martin
2005-01-13  5:53                         ` Kenichi Handa
2005-01-11 14:29                 ` Richard Stallman
2005-01-12  7:45                   ` Kenichi Handa
