iso-8859-1 and non-latin-1 chars

unofficial mirror of emacs-devel@gnu.org 

* iso-8859-1 and non-latin-1 chars
@ 2002-11-07 14:57 Stefan Monnier
  2002-11-07 15:25 ` Eli Zaretskii
  0 siblings, 1 reply; 38+ messages in thread
From: Stefan Monnier @ 2002-11-07 14:57 UTC (permalink / raw)



When encoding text containing non-latin-1 chars with the latin-1
coding-system, they get output as some kind of escape sequence.

Would it be possible to output a `?' instead ?
If so, how ?


	Stefan


PS: This manifests itself in the "ispell misalignment" problem because
    ispell doesn't like these escape sequences.
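
To see what the encoder emits for such text, one can look at the raw
bytes it returns.  The following is a minimal sketch, not something from
the original message: `my/encoded-bytes' is a hypothetical helper name,
and the exact bytes produced for the non-Latin-1 character depend on the
Emacs version in use.

```elisp
;; Sketch: inspect the bytes produced by a coding system.
;; `my/encoded-bytes' is a hypothetical helper, not part of Emacs.
(defun my/encoded-bytes (text coding-system)
  "Return TEXT encoded with CODING-SYSTEM as a list of byte values."
  (append (encode-coding-string text coding-system) nil))

;; Pure Latin-1 text: one byte per character.
(my/encoded-bytes "caf\u00e9" 'iso-latin-1)

;; U+03BB (Greek lambda) is outside Latin-1.  An ESC byte (27) in the
;; result would indicate an ISO 2022 escape sequence.
(my/encoded-bytes "\u03bb" 'iso-latin-1)
```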

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: iso-8859-1 and non-latin-1 chars
  2002-11-07 14:57 iso-8859-1 and non-latin-1 chars Stefan Monnier
@ 2002-11-07 15:25 ` Eli Zaretskii
  2002-11-07 17:06   ` Stefan Monnier
  0 siblings, 1 reply; 38+ messages in thread
From: Eli Zaretskii @ 2002-11-07 15:25 UTC (permalink / raw)
  Cc: emacs-devel

> From: "Stefan Monnier" <monnier+gnu/emacs@rum.cs.yale.edu>
> Date: Thu, 07 Nov 2002 09:57:50 -0500
> 
> When encoding text containing non-latin-1 chars with the latin-1
> coding-system, they get output as some kind of escape sequence.

Yes.  IIRC, this is hard-coded in the encoder's C code: it works as if
latin-1 was actually iso-latin-1-with-esc.

> PS: This manifests itself in the "ispell misalignment" problem because
>     ispell doesn't like these escape sequences.

Indeed; a known problem.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: iso-8859-1 and non-latin-1 chars
  2002-11-07 15:25 ` Eli Zaretskii
@ 2002-11-07 17:06   ` Stefan Monnier
  2002-11-07 23:42     ` Kenichi Handa
  0 siblings, 1 reply; 38+ messages in thread
From: Stefan Monnier @ 2002-11-07 17:06 UTC (permalink / raw)
  Cc: monnier+gnu/emacs, emacs-devel

> > When encoding text containing non-latin-1 chars with the latin-1
> > coding-system, they get output as some kind of escape sequence.
> 
> Yes.  IIRC, this is hard-coded in the encoder's C code: it works as if
> latin-1 was actually iso-latin-1-with-esc.

How can we change that ?


	Stefan

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: iso-8859-1 and non-latin-1 chars
  2002-11-07 17:06   ` Stefan Monnier
@ 2002-11-07 23:42     ` Kenichi Handa
  2002-11-07 23:58       ` Stefan Monnier
  2002-11-09 11:54       ` Richard Stallman
  0 siblings, 2 replies; 38+ messages in thread
From: Kenichi Handa @ 2002-11-07 23:42 UTC (permalink / raw)
  Cc: eliz, monnier+gnu/emacs, emacs-devel

In article <200211071706.gA7H6hW09141@rum.cs.yale.edu>, "Stefan Monnier" <monnier+gnu/emacs@rum.cs.yale.edu> writes:

>>  > When encoding text containing non-latin-1 chars with the latin-1
>>  > coding-system, they get output as some kind of escape sequence.
>>  
>>  Yes.  IIRC, this is hard-coded in the encoder's C code: it works as if
>>  latin-1 was actually iso-latin-1-with-esc.

> How can we change that ?

This change will do.

*** european.el.~1.75.~	Wed Nov  6 09:13:16 2002
--- european.el	Fri Nov  8 08:32:12 2002
***************
*** 36,42 ****
   'iso-latin-1 2 ?1
   "ISO 2022 based 8-bit encoding for Latin-1 (MIME:ISO-8859-1)."
   '(ascii latin-iso8859-1 nil nil
!    nil nil nil nil nil nil nil nil nil nil nil nil t)
   '((safe-charsets ascii latin-iso8859-1)
     (mime-charset . iso-8859-1)))
  
--- 36,42 ----
   'iso-latin-1 2 ?1
   "ISO 2022 based 8-bit encoding for Latin-1 (MIME:ISO-8859-1)."
   '(ascii latin-iso8859-1 nil nil
!    nil nil nil nil nil nil nil nil nil nil nil t t)
   '((safe-charsets ascii latin-iso8859-1)
     (mime-charset . iso-8859-1)))
  
Or, if this is a problem only for ispell, we can make a series
of "safe" coding-systems for ispell.

Or, we can add a global flag, say
`inhibit-unsafe-iso-escape', to tell the encoding routine not to
produce those escape sequences.  Then, ispell can let-bind
that variable to t on encoding.

I think the last one is the best solution.

What do you think?

---
Ken'ichi HANDA
handa@m17n.org
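
The let-binding approach proposed above could look roughly like this on
the ispell side.  This is only a sketch under stated assumptions: the
variable `inhibit-unsafe-iso-escape' is the name proposed in this
message and is defined here purely so the example is self-contained (the
stock encoder does not consult it), and `my/encode-for-ispell' is a
hypothetical helper.

```elisp
;; Hypothetical: this flag was only proposed in the thread.  It is
;; defined here so that the sketch loads on its own; the real encoder
;; does not look at it.
(defvar inhibit-unsafe-iso-escape nil
  "Proposed flag: non-nil would suppress escape sequences when encoding.")

;; Hypothetical helper showing where ispell.el would bind the flag.
(defun my/encode-for-ispell (string coding-system)
  "Encode STRING with CODING-SYSTEM while the proposed flag is bound to t."
  (let ((inhibit-unsafe-iso-escape t))
    ;; The binding is dynamic, so it is visible to everything called
    ;; from here and is undone automatically on exit.
    (encode-coding-string string coding-system)))
```

The point of the design is that the flag is only in effect for the
duration of the call, so other users of the coding system are not
affected.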

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: iso-8859-1 and non-latin-1 chars
  2002-11-07 23:42     ` Kenichi Handa
@ 2002-11-07 23:58       ` Stefan Monnier
  2002-11-09 11:54       ` Richard Stallman
  1 sibling, 0 replies; 38+ messages in thread
From: Stefan Monnier @ 2002-11-07 23:58 UTC (permalink / raw)
  Cc: monnier+gnu/emacs, eliz, emacs-devel

> In article <200211071706.gA7H6hW09141@rum.cs.yale.edu>, "Stefan Monnier" <monnier+gnu/emacs@rum.cs.yale.edu> writes:
> 
> >>  > When encoding text containing non-latin-1 chars with the latin-1
> >>  > coding-system, they get output as some kind of escape sequence.
> >>  
> >>  Yes.  IIRC, this is hard-coded in the encoder's C code: it works as if
> >>  latin-1 was actually iso-latin-1-with-esc.
> 
> > How can we change that ?
> 
> This change will do.
> 
> *** european.el.~1.75.~	Wed Nov  6 09:13:16 2002
> --- european.el	Fri Nov  8 08:32:12 2002
> ***************
> *** 36,42 ****
>    'iso-latin-1 2 ?1
>    "ISO 2022 based 8-bit encoding for Latin-1 (MIME:ISO-8859-1)."
>    '(ascii latin-iso8859-1 nil nil
> !    nil nil nil nil nil nil nil nil nil nil nil nil t)
>    '((safe-charsets ascii latin-iso8859-1)
>      (mime-charset . iso-8859-1)))
>   
> --- 36,42 ----
>    'iso-latin-1 2 ?1
>    "ISO 2022 based 8-bit encoding for Latin-1 (MIME:ISO-8859-1)."
>    '(ascii latin-iso8859-1 nil nil
> !    nil nil nil nil nil nil nil nil nil nil nil t t)
>    '((safe-charsets ascii latin-iso8859-1)
>      (mime-charset . iso-8859-1)))
>   
> Or, if this is a problem only for ispell, we can make a series
> of "safe" coding-systems for ispell.

I think the problem is only known to bite ispell, but I doubt there
are many other applications that need to (or try to) encode a piece
of text with unsafe chars, so the above patch should be safe.

I also think the patch is correct since it otherwise outputs codes
that are not part of latin-1, strictly speaking.  If you want such
a behavior, you should use iso-latin-1-with-esc.

The same patch should also be applied for other iso8859-N charsets
I suppose.

> Or, we can add a global flag, say
> `inhibit-unsafe-iso-escape', to tell the encoding routine not to
> produce those escape sequences.  Then, ispell can let-bind
> that variable to t on encoding.

That seems overkill since you can use iso-latin-1-with-esc instead.
But it would save us from changing all the coding-systems.


	Stefan
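
To compare what the plain and the "-with-esc" variants emit for the same
text, something like the following can be used.  It is a sketch with a
hypothetical helper name (`my/compare-encodings'); since not every Emacs
version defines every coding system named here, undefined ones are
simply skipped.

```elisp
;; Hypothetical helper: encode TEXT with several coding systems and
;; collect the resulting bytes, ignoring names that are not defined.
(defun my/compare-encodings (text coding-systems)
  "Return an alist (CODING-SYSTEM . BYTES) for each defined coding system."
  (let (result)
    (dolist (cs coding-systems (nreverse result))
      (when (coding-system-p cs)
        (push (cons cs (append (encode-coding-string text cs) nil))
              result)))))

;; U+03BB is not a Latin-1 character; compare the two variants.
(my/compare-encodings "\u03bb" '(iso-latin-1 iso-latin-1-with-esc))
```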

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: iso-8859-1 and non-latin-1 chars
  2002-11-07 23:42     ` Kenichi Handa
  2002-11-07 23:58       ` Stefan Monnier
@ 2002-11-09 11:54       ` Richard Stallman
  2002-11-09 20:32         ` Stefan Monnier
  2002-11-11  4:00         ` Kenichi Handa
  1 sibling, 2 replies; 38+ messages in thread
From: Richard Stallman @ 2002-11-09 11:54 UTC (permalink / raw)
  Cc: monnier+gnu/emacs, eliz, monnier+gnu/emacs, emacs-devel

    > >>  Yes.  IIRC, this is hard-coded in the encoder's C code: it works as if
    > >>  latin-1 was actually iso-latin-1-with-esc.
    > 
    > > How can we change that ?
    > 
    > This change will do.

This user-visible change would affect much more than ispell.
Is it the right thing in general?

    Or, if this is a problem only for ispell, we can make a series
    of "safe" coding-systems for ispell.

Making a full range of alternate coding systems would be a nuisance
and inconvenient for ispell.el to use.

    Or, we can add a global flag, say
    `inhibit-unsafe-iso-escape', to tell the encoding routine not to
    produce those escape sequences.  Then, ispell can let-bind
    that variable to t on encoding.

    I think the last one is the best solution.

I agree, that would be much easier to implement and to use.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: iso-8859-1 and non-latin-1 chars
  2002-11-09 11:54       ` Richard Stallman
@ 2002-11-09 20:32         ` Stefan Monnier
  2002-11-11 10:19           ` Richard Stallman
  2002-11-11  4:00         ` Kenichi Handa
  1 sibling, 1 reply; 38+ messages in thread
From: Stefan Monnier @ 2002-11-09 20:32 UTC (permalink / raw)
  Cc: handa, monnier+gnu/emacs, eliz, emacs-devel

>     > >>  Yes.  IIRC, this is hard-coded in the encoder's C code: it works as if
>     > >>  latin-1 was actually iso-latin-1-with-esc.
>     > > How can we change that ?
>     > This change will do.
> This user-visible change would affect much more than ispell.

Huh?  Such as what?  It's very unusual to apply a coding-system to a piece of
text that is not within the safe-chars of that coding-system.
I'd say it's *not* user-visible.

> Is it the right thing in general?

I think so.  If you want to allow and deal with non-latin-N chars,
you should use the iso-latin-N-with-esc version of the coding-system.


	Stefan

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: iso-8859-1 and non-latin-1 chars
  2002-11-09 11:54       ` Richard Stallman
  2002-11-09 20:32         ` Stefan Monnier
@ 2002-11-11  4:00         ` Kenichi Handa
  2002-11-12  5:47           ` Richard Stallman
  1 sibling, 1 reply; 38+ messages in thread
From: Kenichi Handa @ 2002-11-11  4:00 UTC (permalink / raw)
  Cc: monnier+gnu/emacs, eliz, emacs-devel

In article <E18AUC2-00012T-00@fencepost.gnu.org>, Richard Stallman <rms@gnu.org> writes:
>>  >>  Yes.  IIRC, this is hard-coded in the encoder's C code: it works as if
>>  >>  latin-1 was actually iso-latin-1-with-esc.
>>  
>>  > How can we change that ?
>>  
>>  This change will do.

> This user-visible change would affect much more than ispell.
> Is it the right thing in general?

After reading Stefan's previous mail, I tend to agree with
him.  At least I think this change makes nothing worse.

---
Ken'ichi HANDA
handa@m17n.org

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: iso-8859-1 and non-latin-1 chars
  2002-11-09 20:32         ` Stefan Monnier
@ 2002-11-11 10:19           ` Richard Stallman
  0 siblings, 0 replies; 38+ messages in thread
From: Richard Stallman @ 2002-11-11 10:19 UTC (permalink / raw)
  Cc: handa, monnier+gnu/emacs, eliz, emacs-devel

    > This user-visible change would affect much more than ispell.

    Huh?  Such as what?  It's very unusual to apply a coding-system to a piece of
    text that is not within the safe-chars of that coding-system.

Are Latin-2 characters within the safe-chars of iso-latin-1 now?  I
don't know how to tell, but I would expect that they are, since it can
encode them.  With this change, it won't be able to encode them,
and it will be unsafe in cases where now it is safe.

That is a user-visible change.  Perhaps it is a good change,
but I want to hear what people think of it.
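
One way to tell whether a coding system is considered safe for a given
piece of text is to ask Emacs which coding systems it regards as safe
for that text.  The sketch below uses a hypothetical helper name
(`my/coding-safe-for-string-p') and reflects how
`find-coding-systems-string' behaves in current Emacs releases, which
may differ from the state of the code at the time of this thread.

```elisp
;; Hypothetical helper built on `find-coding-systems-string'.
(defun my/coding-safe-for-string-p (string coding-system)
  "Return t if CODING-SYSTEM is listed as safe for encoding STRING."
  (let ((safe (find-coding-systems-string string)))
    ;; For text without multibyte characters the result is the
    ;; one-element list (undecided), meaning any coding system will do.
    (or (equal safe '(undecided))
        (and (memq (coding-system-base coding-system) safe) t))))

;; U+0142 (l with stroke) is a Latin-2 character.
(my/coding-safe-for-string-p "\u0142" 'iso-latin-1)
(my/coding-safe-for-string-p "\u0142" 'iso-latin-2)
```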

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: iso-8859-1 and non-latin-1 chars
  2002-11-11  4:00         ` Kenichi Handa
@ 2002-11-12  5:47           ` Richard Stallman
  2002-11-18  0:08             ` Kenichi Handa
  0 siblings, 1 reply; 38+ messages in thread
From: Richard Stallman @ 2002-11-12  5:47 UTC (permalink / raw)
  Cc: monnier+gnu/emacs, eliz, emacs-devel

    After reading Stefan's previous mail, I tend to agree with
    him.  At least I think this change makes nothing worse.

In that case, would you please make the change?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: iso-8859-1 and non-latin-1 chars
  2002-11-12  5:47           ` Richard Stallman
@ 2002-11-18  0:08             ` Kenichi Handa
  2002-11-18 19:09               ` Richard Stallman
  0 siblings, 1 reply; 38+ messages in thread
From: Kenichi Handa @ 2002-11-18  0:08 UTC (permalink / raw)
  Cc: monnier+gnu/emacs, eliz, emacs-devel

In article <E18BTtE-0006Ka-00@fencepost.gnu.org>, Richard Stallman <rms@gnu.org> writes:
>     After reading Stefan's previous mail, I tend to agree with
>     him.  At least I think this change makes nothing worse.

> In that case, would you please make the change?

Ok, I've just installed this change.

2002-11-18  Kenichi Handa  <handa@m17n.org>

	* language/cyrillic.el (cyrillic-iso-8bit): Make it safe.

	* language/european.el (iso-latin-1): Make it safe.
	(iso-latin-2, iso-latin-3, iso-latin-4, iso-latin-5, iso-latin-8) 
	(iso-latin-9): Likewise.

	* language/greek.el (greek-iso-8bit): Make it safe.

	* language/hebrew.el (hebrew-iso-8bit): Make it safe.

	* language/lao.el (lao): Make it safe.

	* language/thai.el (thai-tis620): Make it safe.



---
Ken'ichi HANDA
handa@m17n.org

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: iso-8859-1 and non-latin-1 chars
  2002-11-18  0:08             ` Kenichi Handa
@ 2002-11-18 19:09               ` Richard Stallman
  0 siblings, 0 replies; 38+ messages in thread
From: Richard Stallman @ 2002-11-18 19:09 UTC (permalink / raw)
  Cc: monnier+gnu/emacs, eliz, emacs-devel

    > In that case, would you please make the change?

    Ok, I've just installed this change.

    2002-11-18  Kenichi Handa  <handa@m17n.org>

	    * language/cyrillic.el (cyrillic-iso-8bit): Make it safe.

	    * language/european.el (iso-latin-1): Make it safe.
	    (iso-latin-2, iso-latin-3, iso-latin-4, iso-latin-5, iso-latin-8) 
	    (iso-latin-9): Likewise.

I think this merits a NEWS entry.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: iso-8859-1 and non-latin-1 chars
@ 2002-11-28 17:01 Dave Love
  2002-12-02 15:47 ` Richard Stallman
  0 siblings, 1 reply; 38+ messages in thread
From: Dave Love @ 2002-11-28 17:01 UTC (permalink / raw)


On querying a change, handa referred me to a thread which led to it.
The archive loses threading info, sorry.

> Ok, I've just installed this change.
> 
> 2002-11-18  Kenichi Handa  <handa@m17n.org>
> 
>         * language/cyrillic.el (cyrillic-iso-8bit): Make it safe.
> 
>         * language/european.el (iso-latin-1): Make it safe.
>         (iso-latin-2, iso-latin-3, iso-latin-4, iso-latin-5, iso-latin-8)
>         (iso-latin-9): Likewise.
> 
>         * language/greek.el (greek-iso-8bit): Make it safe.
> 
>         * language/hebrew.el (hebrew-iso-8bit): Make it safe.
> 
>         * language/lao.el (lao): Make it safe.
> 
>         * language/thai.el (thai-tis620): Make it safe.       

I think this is the wrong thing to do.  As rms said, it's a
user-visible change that affects more than Ispell.  It breaks the
principle that Emacs tries to avoid losing information in coding
conversions (whereas XEmacs readily trashes data).  If something uses
one of these encodings incorrectly, you can no longer recover the data
as you used to be able to.

Anyhow, if the issue is just Ispell, it's definitely the wrong fix.
ispell.el and similar stuff shouldn't send un-encodable text in the
first place.  (If `?' happens to be one of the extra word characters
for Ispell, you'll lose anyway.)

With the development source, you can easily fix what Stefan complained
about as follows.  There's more to it than that, though -- see the
comment in the diff.  I haven't had time to sort that out, though I
did make changes to Flyspell along similar lines.  That's easier,
since Flyspell already works word-wise (roughly), but of course you
likely run into problems displaying the choices without multilingual
menus.

[If you really wanted to fix such a thing just with a coding system
change, you could set up a scratch coding system for the job or
temporarily set a coding system property around the process setup.]

By the way, ispell.el in CVS isn't up-to-date with Stevens' version.

*** ispell.el.~1.133.~	Tue Nov 19 14:49:21 2002
--- ispell.el	Mon Nov 25 15:41:02 2002
***************
*** 1347,1364 ****
  	(or quietly
  	    (message "Checking spelling of %s..."
  		     (funcall ispell-format-word word)))
! 	(ispell-send-string "%\n")	; put in verbose mode
! 	(ispell-send-string (concat "^" word "\n"))
! 	;; wait until ispell has processed word
! 	(while (progn
! 		 (ispell-accept-output)
! 		 (not (string= "" (car ispell-filter)))))
! 	;;(ispell-send-string "!\n") ;back to terse mode.
! 	(setq ispell-filter (cdr ispell-filter)) ; remove extra \n
! 	(if (and ispell-filter (listp ispell-filter))
! 	    (if (> (length ispell-filter) 1)
! 		(error "Ispell and its process have different character maps")
! 	      (setq poss (ispell-parse-output (car ispell-filter)))))
  	(cond ((eq poss t)
  	       (or quietly
  		   (message "%s is correct"
--- 1347,1369 ----
  	(or quietly
  	    (message "Checking spelling of %s..."
  		     (funcall ispell-format-word word)))
! 	(if (and enable-multibyte-characters
! 		 (unencodable-char-position
! 		  0 (length word) (process-coding-system ispell-process)
! 		  nil word))
! 	    (setq poss (list word 1 nil nil))
! 	  (ispell-send-string "%\n")	; put in verbose mode
! 	  (ispell-send-string (concat "^" word "\n"))
! 	  ;; wait until ispell has processed word
! 	  (while (progn
! 		   (ispell-accept-output)
! 		   (not (string= "" (car ispell-filter)))))
! 	  ;;(ispell-send-string "!\n") ;back to terse mode.
! 	  (setq ispell-filter (cdr ispell-filter)) ; remove extra \n
! 	  (if (and ispell-filter (listp ispell-filter))
! 	      (if (> (length ispell-filter) 1)
! 		  (error "Ispell and its process have different character maps")
! 		(setq poss (ispell-parse-output (car ispell-filter))))))
  	(cond ((eq poss t)
  	       (or quietly
  		   (message "%s is correct"
***************
*** 2604,2609 ****
--- 2609,2628 ----
    (let (poss accept-list)
      (if (not (numberp shift))
  	(setq shift 0))
+     (if (and enable-multibyte-characters
+ 	     (fboundp 'unencodable-char-position))
+ 	;; Avoid sending un-encodable input to the process, which can
+ 	;; specifically confuse the current implementation.  Fixme: Do
+ 	;; it for 21.2 too.  Fixme: The implementation here needs
+ 	;; changing to check word-by-word (according to syntax tables,
+ 	;; not a fixed list of characters) from known positions in the
+ 	;; buffer, not not looking for matches of ispell output (which
+ 	;; may be inappropriately encoded, for instance) in the
+ 	;; original buffer.
+ 	(dolist (i (unencodable-char-position
+ 		    0 (length string) (process-coding-system ispell-process)
+ 		    (length string) string))
+ 	  (aset string i ?\ )))
      ;; send string to spell process and get input.
      (ispell-send-string string)
      (while (progn
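
The second hunk above blanks out un-encodable characters before the
string is sent to the process.  As a standalone sketch of that idea
(with a hypothetical helper name, `my/blank-unencodable', and working on
a copy rather than modifying the argument in place as the patch does):

```elisp
;; Hypothetical helper illustrating the technique used in the patch.
(defun my/blank-unencodable (string coding-system)
  "Return a copy of STRING with characters CODING-SYSTEM can't encode
replaced by spaces."
  (let ((copy (copy-sequence string)))
    ;; With a COUNT argument, `unencodable-char-position' returns a
    ;; list of positions instead of just the first one.
    (dolist (i (unencodable-char-position
                0 (length copy) coding-system (length copy) copy))
      (aset copy i ?\s))
    copy))

(my/blank-unencodable "x\u03bby" 'iso-latin-1)
```

Replacing each character by exactly one space keeps the character
offsets of the remaining words unchanged, which matters for the
misalignment problem discussed at the start of the thread.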

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: iso-8859-1 and non-latin-1 chars
  2002-11-28 17:01 iso-8859-1 and non-latin-1 chars Dave Love
@ 2002-12-02 15:47 ` Richard Stallman
  2002-12-06 16:38   ` Dave Love
  0 siblings, 1 reply; 38+ messages in thread
From: Richard Stallman @ 2002-12-02 15:47 UTC (permalink / raw)
  Cc: emacs-devel

    > 2002-11-18  Kenichi Handa  <handa@m17n.org>
    > 
    >         * language/cyrillic.el (cyrillic-iso-8bit): Make it safe.
    	      ...

    I think this is the wrong thing to do.  As rms said, it's a
    user-visible change that affects more than Ispell.  It breaks the
    principle that Emacs tries to avoid losing information in coding
    conversions (whereas XEmacs readily trashes data).

It would do that if users explicitly specify these coding systems by
name and if Emacs fails, when they subsequently save the file, to warn
that it can't handle all the characters.

Emacs normally does warn when the coding system doesn't handle all the
characters in the file.  Is there a common scenario where that warning
is bypassed?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: iso-8859-1 and non-latin-1 chars
  2002-12-02 15:47 ` Richard Stallman
@ 2002-12-06 16:38   ` Dave Love
  2002-12-09  6:08     ` Kenichi Handa
       [not found]     ` <E18LCz8-0004It-00@fencepost.gnu.org>
  0 siblings, 2 replies; 38+ messages in thread
From: Dave Love @ 2002-12-06 16:38 UTC (permalink / raw)
  Cc: handa, emacs-devel

Richard Stallman <rms@gnu.org> writes:

> Emacs normally does warn when the coding system doesn't handle all the
> characters in the file.  Is there a common scenario where that warning
> is bypassed?

I don't know how common, but for example: broken code (Gnus at times,
I'd bet) or customizations (select Latin-1 for your BBDB database) and
C-x RET c.  Basically whenever the coding system is actually specified
rather than selected from the set of somehow-preferred coding systems.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: iso-8859-1 and non-latin-1 chars
  2002-12-06 16:38   ` Dave Love
@ 2002-12-09  6:08     ` Kenichi Handa
  2002-12-15 16:24       ` Dave Love
       [not found]       ` <E18LZqb-0007si-00@fencepost.gnu.org>
       [not found]     ` <E18LCz8-0004It-00@fencepost.gnu.org>
  1 sibling, 2 replies; 38+ messages in thread
From: Kenichi Handa @ 2002-12-09  6:08 UTC (permalink / raw)
  Cc: emacs-devel

In article <rzqhedrp0f5.fsf@albion.dl.ac.uk>, Dave Love <d.love@dl.ac.uk> writes:
> Richard Stallman <rms@gnu.org> writes:
>>  Emacs normally does warn when the coding system doesn't handle all the
>>  characters in the file.  Is there a common scenario where that warning
>>  is bypassed?

> I don't know how common, but for example: broken code (Gnus at times,
> I'd bet) or customizations (select Latin-1 for your BBDB database) and
> C-x RET c.  Basically whenever the coding system is actually specified
> rather than selected from the set of somehow-preferred coding systems.

Even if we write out a proper escape sequence for
unencodable characters in a file, people who are not
familiar with handling coding systems can't read the file
correctly.  If such a user reads it without C-x RET c, it will be
read as iso-2022-7bit or something like that, not as
iso-latin-1-with-esc.  If he doesn't notice that, he will be
badly confused.  If he reads it with C-x RET c
iso-latin-1, the escape sequence is not decoded, so he
will see raw ESC codes.

So, my conclusion was that writing out those escape
sequences not only violates the commonly accepted concept
about a coding system, but also doesn't help people that
much.

Richard Stallman <rms@gnu.org> writes:
> If you specify a coding system with C-x RET c, and it doesn't
> handle all the text, should Emacs warn about that?

I think that is better, but can we distinguish these cases
in Fwrite_region?
  (1) coding system is specified by C-x RET c.
  (2) coding system is specified by code:
	 (let ((coding-system-for-write ...)) ...)

Or, do you think we should warn in both cases?

---
Ken'ichi HANDA
handa@m17n.org

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: iso-8859-1 and non-latin-1 chars
       [not found]     ` <E18LCz8-0004It-00@fencepost.gnu.org>
@ 2002-12-10 23:47       ` Dave Love
  2002-12-11 20:39         ` Richard Stallman
  0 siblings, 1 reply; 38+ messages in thread
From: Dave Love @ 2002-12-10 23:47 UTC (permalink / raw)
  Cc: emacs-devel

Richard Stallman <rms@gnu.org> writes:

> If you specify a coding system with C-x RET c, and it doesn't
> handle all the text, should Emacs warn about that?

I don't think so.  That would amount to a significant overhead on all
i/o, since C-x RET c just amounts to binding
coding-system-for-{read,write} around the invocation of the command.
Perhaps you could special-case it somehow in interactive use, but the
issue is most relevant to the other case -- when data are written by a
program with an inappropriate coding system.  You presumably can't do
anything about it for process output anyhow.
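
For the programmatic case, a caller that insists on a particular coding
system can check the text itself before writing.  The following is a
sketch of that idea, not anything proposed in the thread;
`my/write-region-checked' is a hypothetical name.

```elisp
;; Hypothetical helper: refuse to write text the coding system
;; cannot represent, instead of writing it silently.
(defun my/write-region-checked (start end file coding-system)
  "Write the region START..END to FILE using CODING-SYSTEM.
Signal an error if the region contains a character that
CODING-SYSTEM cannot encode."
  (let ((pos (unencodable-char-position start end coding-system)))
    (when pos
      (error "Character at position %d cannot be encoded in %s"
             pos coding-system))
    ;; Binding `coding-system-for-write' is what C-x RET c amounts to,
    ;; as described above.
    (let ((coding-system-for-write coding-system))
      (write-region start end file nil 'silent))))
```

The check scans the region once before the write, which is the kind of
overhead mentioned above, so it only makes sense where the caller
actually cares.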

Is anyone actually taking care of the issues in Ispell/Flyspell?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: iso-8859-1 and non-latin-1 chars
  2002-12-10 23:47       ` Dave Love
@ 2002-12-11 20:39         ` Richard Stallman
  2002-12-13  2:58           ` Kenichi Handa
  0 siblings, 1 reply; 38+ messages in thread
From: Richard Stallman @ 2002-12-11 20:39 UTC (permalink / raw)
  Cc: emacs-devel

    > If you specify a coding system with C-x RET c, and it doesn't
    > handle all the text, should Emacs warn about that?

    I don't think so.  That would amount to a significant overhead on all
    i/o, since C-x RET c just amounts to binding
    coding-system-for-{read,write} around the invocation of the command.

It could also bind a flag saying "do warn".

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: iso-8859-1 and non-latin-1 chars
  2002-12-11 20:39         ` Richard Stallman
@ 2002-12-13  2:58           ` Kenichi Handa
  2002-12-14 18:31             ` Richard Stallman
  0 siblings, 1 reply; 38+ messages in thread
From: Kenichi Handa @ 2002-12-13  2:58 UTC (permalink / raw)
  Cc: emacs-devel

In article <E18MDdz-0005Mz-00@fencepost.gnu.org>, Richard Stallman <rms@gnu.org> writes:
>>  If you specify a coding system with C-x RET c, and it doesn't
>>  handle all the text, should Emacs warn about that?

>     I don't think so.  That would amount to a significant overhead on all
>     i/o, since C-x RET c just amounts to binding
>     coding-system-for-{read,write} around the invocation of the command.

> It could also bind a flag saying "do warn".

How about the attached change?  Shall I install it?

Notes on the change:

(1) I made a new variable coding-system-require-warning, and
universal-coding-system-argument binds it to t.

(2) If the car of the arg DEFAULT-CODING-SYSTEM is t, it
indicates that select-safe-coding-system should not include
buffer-file-coding-system and the most preferred coding system
in the list of coding systems tried by default.
Fwrite_region calls select-safe-coding-system in this way if
coding-system-require-warning is non-nil.

(3) Now a user can specify any coding system in
select-safe-coding-system at his own risk.  At least, this is
necessary when an unsafe coding system is specified by C-x
RET c.
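
The convention in note (2) can be illustrated in isolation.  This is a
sketch with a hypothetical function name
(`my/split-default-coding-systems'); it only mirrors how a leading t in
DEFAULT-CODING-SYSTEM is interpreted and does not call
select-safe-coding-system itself.

```elisp
;; Hypothetical helper mirroring the interpretation of a leading t.
(defun my/split-default-coding-systems (default-coding-system)
  "Return (NO-OTHER-DEFAULTS . CODING-SYSTEMS) for DEFAULT-CODING-SYSTEM.
NO-OTHER-DEFAULTS is t when the argument is a list starting with t,
meaning that no further defaults should be appended."
  (let ((systems (if (listp default-coding-system)
                     default-coding-system
                   (list default-coding-system))))
    (if (eq (car systems) t)
        (cons t (cdr systems))
      (cons nil systems))))

(my/split-default-coding-systems '(t iso-latin-1))
(my/split-default-coding-systems 'iso-latin-1)
```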

---
Ken'ichi HANDA
handa@m17n.org


Index: lisp/ChangeLog
===================================================================
RCS file: /cvs/emacs/lisp/ChangeLog,v
retrieving revision 1.4625
diff -u -c -b -r1.4625 ChangeLog
*** lisp/ChangeLog	13 Dec 2002 00:40:52 -0000	1.4625
--- lisp/ChangeLog	13 Dec 2002 02:51:25 -0000
***************
*** 1,3 ****
--- 1,12 ----
+ 2002-12-13  Kenichi Handa  <handa@m17n.org>
+ 
+ 	* international/mule-cmds.el (universal-coding-system-argument):
+ 	Bind coding-system-require-warning to t.
+ 	(select-safe-coding-system): Handle t in the arg
+ 	DEFAULT-CODING-SYSTEM specially.  Use read-coding-system to read a
+ 	coding-system to allow users to specify unsafe coding system on
+ 	their risk.
+ 
  2002-12-13  Nick Roberts  <nick@nick.uklinux.net>
  
  	* gdb-ui.el: Improve documentation strings.
Index: lisp/international/mule-cmds.el
===================================================================
RCS file: /cvs/emacs/lisp/international/mule-cmds.el,v
retrieving revision 1.211
diff -u -c -b -r1.211 mule-cmds.el
*** lisp/international/mule-cmds.el	12 Dec 2002 00:51:31 -0000	1.211
--- lisp/international/mule-cmds.el	13 Dec 2002 02:51:30 -0000
***************
*** 305,310 ****
--- 305,311 ----
  
      (let ((coding-system-for-read coding-system)
  	  (coding-system-for-write coding-system)
+ 	  (coding-system-require-warning t)
  	  (current-prefix-arg prefix))
        (message "")
        (call-interactively cmd))))
***************
*** 604,610 ****
  
  Optional 3rd arg DEFAULT-CODING-SYSTEM specifies a coding system or a
  list of coding systems to be prepended to the default coding system
! list.
  
  Optional 4th arg ACCEPT-DEFAULT-P, if non-nil, is a function to
  determine the acceptability of the silently selected coding system.
--- 605,614 ----
  
  Optional 3rd arg DEFAULT-CODING-SYSTEM specifies a coding system or a
  list of coding systems to be prepended to the default coding system
! list.  However, if DEFAULT-CODING-SYSTEM is a list and the first
! element is t, the cdr part is used as the defualt coding system list,
! i.e. `buffer-file-coding-system' and the most prepended coding system
! is not used.
  
  Optional 4th arg ACCEPT-DEFAULT-P, if non-nil, is a function to
  determine the acceptability of the silently selected coding system.
***************
*** 624,634 ****
--- 628,644 ----
  	   (not (listp default-coding-system)))
        (setq default-coding-system (list default-coding-system)))
  
+   (let ((no-other-defaults nil))
+     (if (eq (car default-coding-system) t)
+ 	(setq no-other-defaults t
+ 	      default-coding-system (cdr default-coding-system)))
+ 
      ;; Change elements of the list to (coding . base-coding).
      (setq default-coding-system
  	  (mapcar (function (lambda (x) (cons x (coding-system-base x))))
  		  default-coding-system))
  
+     (unless no-other-defaults
        ;; If buffer-file-coding-system is not nil nor undecided, append it
        ;; to the defaults.
        (if buffer-file-coding-system
***************
*** 653,659 ****
  	 (not (assq preferred default-coding-system))
  	 (not (rassq base default-coding-system))
  	 (setq default-coding-system
! 	       (append default-coding-system (list (cons preferred base))))))
  
    (if select-safe-coding-system-accept-default-p
        (setq accept-default-p select-safe-coding-system-accept-default-p))
--- 663,670 ----
  	     (not (assq preferred default-coding-system))
  	     (not (rassq base default-coding-system))
  	     (setq default-coding-system
! 		   (append default-coding-system
! 			   (list (cons preferred base))))))))
  
    (if select-safe-coding-system-accept-default-p
        (setq accept-default-p select-safe-coding-system-accept-default-p))
***************
*** 821,840 ****
  		(mapcar (function (lambda (x) (princ "  ") (princ x)))
  			codings)
  		(insert "\n")
! 		(fill-region-as-paragraph pos (point)))))
  
  	  ;; Read a coding system.
! 	  (if safe
! 	      (setq codings (append safe codings)))
! 	  (let* ((safe-names (mapcar (lambda (x) (list (symbol-name x)))
! 				     codings))
! 		 (name (completing-read
  			(format "Select coding system (default %s): "
! 				(car codings))
! 			safe-names nil t nil nil
! 			(car (car safe-names)))))
! 	    (setq last-coding-system-specified (intern name)
! 		  coding-system last-coding-system-specified)))
  	(kill-buffer "*Warning*")
  	(set-window-configuration window-configuration)))
  
--- 832,850 ----
  		(mapcar (function (lambda (x) (princ "  ") (princ x)))
  			codings)
  		(insert "\n")
! 		(fill-region-as-paragraph pos (point)))
! 	      (insert "Or specify any other coding system
! on your risk of loosing the problematic characters.\n")))
  
  	  ;; Read a coding system.
! 	  (setq default-coding-system (or (car safe) (car codings)))
! 	  (setq coding-system
! 		(read-coding-system 
  		 (format "Select coding system (default %s): "
! 			 default-coding-system)
! 		 default-coding-system))
! 	  (setq last-coding-system-specified coding-system))
! 
  	(kill-buffer "*Warning*")
  	(set-window-configuration window-configuration)))
  
Index: src/ChangeLog
===================================================================
RCS file: /cvs/emacs/src/ChangeLog,v
retrieving revision 1.2994
diff -u -c -b -r1.2994 ChangeLog
*** src/ChangeLog	13 Dec 2002 02:35:34 -0000	1.2994
--- src/ChangeLog	13 Dec 2002 02:51:31 -0000
***************
*** 1,5 ****
--- 1,17 ----
  2002-12-13  Kenichi Handa  <handa@m17n.org>
  
+ 	* coding.c (coding_system_require_warning): New variable.
+ 	(syms_of_coding): DEFVAR it.
+ 
+ 	* coding.h (coding_system_require_warning): Extern it.
+ 
+ 	* fileio.c (choose_write_coding_system): Even if
+ 	Vcoding_system_for_write is non-nil, if
+ 	coding_system_require_warning is nonzero, call
+ 	Vselect_safe_coding_system_function.
+ 
+ 2002-12-13  Kenichi Handa  <handa@m17n.org>
+ 
  	* coding.c (Funencodable_char_position): Set pend correctly.
  
  2002-12-12  Jason Rumney  <jasonr@gnu.org>
Index: src/coding.c
===================================================================
RCS file: /cvs/emacs/src/coding.c,v
retrieving revision 1.265
diff -u -c -b -r1.265 coding.c
*** src/coding.c	13 Dec 2002 02:35:51 -0000	1.265
--- src/coding.c	13 Dec 2002 02:51:33 -0000
***************
*** 367,372 ****
--- 367,374 ----
  
  Lisp_Object Vselect_safe_coding_system_function;
  
+ int coding_system_require_warning;
+ 
  /* Mnemonic string for each format of end-of-line.  */
  Lisp_Object eol_mnemonic_unix, eol_mnemonic_dos, eol_mnemonic_mac;
  /* Mnemonic string to indicate format of end-of-line is not yet
***************
*** 7530,7535 ****
--- 7532,7546 ----
  
  The default value is `select-safe-coding-system' (which see).  */);
    Vselect_safe_coding_system_function = Qnil;
+ 
+   DEFVAR_BOOL ("coding-system-require-warning",
+ 	       &coding_system_require_warning,
+ 	       doc: /* Internal use only.
+ If non-nil, on writing a file, select-safe-coding-system-function is
+ called even if coding-system-for-write is non-nil.  The command
+ universal-coding-system-argument binds this variable to t temporarily.  */);
+   coding_system_require_warning = 0;
+ 
  
    DEFVAR_LISP ("char-coding-system-table", &Vchar_coding_system_table,
  	       doc: /* Char-table containing safe coding systems of each characters.
Index: src/coding.h
===================================================================
RCS file: /cvs/emacs/src/coding.h,v
retrieving revision 1.64
diff -u -c -b -r1.64 coding.h
*** src/coding.h	19 Jul 2002 14:27:01 -0000	1.64
--- src/coding.h	13 Dec 2002 02:51:33 -0000
***************
*** 705,710 ****
--- 705,714 ----
     system.  */
  extern Lisp_Object Vselect_safe_coding_system_function;
  
+ /* If nonzero, on writing a file, Vselect_safe_coding_system_function
+    is called even if Vcoding_system_for_write is non-nil.  */
+ extern int coding_system_require_warning;
+ 
  /* Coding system for file names, or nil if none.  */
  extern Lisp_Object Vfile_name_coding_system;
  
Index: src/fileio.c
===================================================================
RCS file: /cvs/emacs/src/fileio.c,v
retrieving revision 1.467
diff -u -c -b -r1.467 fileio.c
*** src/fileio.c	7 Dec 2002 21:39:50 -0000	1.467
--- src/fileio.c	13 Dec 2002 02:51:34 -0000
***************
*** 4624,4630 ****
--- 4624,4638 ----
    if (auto_saving)
      val = Qnil;
    else if (!NILP (Vcoding_system_for_write))
+     {
        val = Vcoding_system_for_write;
+       if (coding_system_require_warning
+ 	  && !NILP (Ffboundp (Vselect_safe_coding_system_function)))
+ 	/* Confirm that VAL can surely encode the current region.  */
+ 	val = call5 (Vselect_safe_coding_system_function,
+ 		     start, end, Fcons (Qt, Fcons (val, Qnil)),
+ 		     Qnil, filename);
+     }
    else
      {
        /* If the variable `buffer-file-coding-system' is set locally,

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: iso-8859-1 and non-latin-1 chars
  2002-12-13  2:58           ` Kenichi Handa
@ 2002-12-14 18:31             ` Richard Stallman
  2002-12-17 11:41               ` None Kenichi Handa
  0 siblings, 1 reply; 38+ messages in thread
From: Richard Stallman @ 2002-12-14 18:31 UTC (permalink / raw)
  Cc: emacs-devel

    Notes on the change:

    (1) I made a new variable coding-system-require-warning, and
    universal-coding-system-argument binds it to t.

    (2) If car of the arg DEFAULT-CODING-SYSTEM is t, it
    indicates that select-safe-coding-system should not include
    buffer-file-coding-system and most preferred coding system
    in a list of coding systems tried by default.
    Fwrite_region calls select-safe-coding-system in this way if
    coding-system-require-warning is non-nil.

    (3) Now a user can specify any coding system in
    select-safe-coding-system at his own risk.  At least, this is
    necessary when an unsafe coding system is specified by C-x
    RET c.

I didn't have time to read the code (sorry), but that sounds ok to me.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: iso-8859-1 and non-latin-1 chars
  2002-12-09  6:08     ` Kenichi Handa
@ 2002-12-15 16:24       ` Dave Love
  2002-12-16  0:42         ` Kenichi Handa
                           ` (2 more replies)
       [not found]       ` <E18LZqb-0007si-00@fencepost.gnu.org>
  1 sibling, 3 replies; 38+ messages in thread
From: Dave Love @ 2002-12-15 16:24 UTC (permalink / raw)
  Cc: emacs-devel

Kenichi Handa <handa@m17n.org> writes:

> Even if we write out a proper escape sequence for
> unencodable characters in a file, those people who are not
> familiar with handling coding systems can't read the file
> correctly.  If he reads it without C-x RET c, it will be
> read as iso-2022-7bit or something like that, not as
> iso-latin-1-with-esc.  If he doesn't notice it, he will be
> in a big confusion.  If he reads it with C-x RET c
> iso-latin-1, the escape sequence is not decoded, thus he
> will see raw ESC codes.

Sure, but the cases I'm particularly thinking of are actually
programs, not interactive use.  I think it's better to get escape
codes, which can be reconstructed, than to get `?', which can't.  I
realize this won't work generally, and will be different for Emacs 22.
In Emacs 22, it would be reasonable to do what yudit does (as far as I
remember) and write representations of the unicodes involved as \uxxxx
or similar.
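
A minimal C sketch of the reversible-representation idea, for illustration
only.  It is not Emacs's encoder and does not claim to match yudit's exact
format; the function name encode_latin1_with_escapes and the \uXXXX
convention (four uppercase hex digits, BMP only) are assumptions made for
this example.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper, not part of Emacs: encode N Unicode code points
   as Latin-1 bytes.  Code points above 0xFF are written as the six
   ASCII bytes \uXXXX so that a reader who knows the convention can
   reconstruct them.  A complete scheme would also have to escape
   literal backslashes; this sketch does not.
   Returns the number of bytes written (excluding the terminating NUL),
   or -1 if OUT is too small or a code point is above 0xFFFF.  */
long
encode_latin1_with_escapes (const unsigned long *cps, size_t n,
                            unsigned char *out, size_t outsize)
{
  size_t used = 0;

  if (outsize == 0)
    return -1;

  for (size_t i = 0; i < n; i++)
    {
      if (cps[i] <= 0xFF)
        {
          /* One byte plus room for the final NUL.  */
          if (used + 1 >= outsize)
            return -1;
          out[used++] = (unsigned char) cps[i];
        }
      else if (cps[i] <= 0xFFFF)
        {
          /* Six bytes plus room for the final NUL.  */
          if (used + 6 >= outsize)
            return -1;
          snprintf ((char *) out + used, 7, "\\u%04lX", cps[i]);
          used += 6;
        }
      else
        return -1;
    }

  out[used] = '\0';
  return (long) used;
}
```

Unlike `?', the output keeps enough information to recover the original
character, at the cost of no longer being plain Latin-1 text.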

> So, my conclusion was that writing out those escape
> sequences not only violates the commonly accepted concept
> about a coding system,

What concept do you mean, exactly?

> but also doesn't help people that much.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: iso-8859-1 and non-latin-1 chars
       [not found]       ` <E18LZqb-0007si-00@fencepost.gnu.org>
@ 2002-12-15 16:25         ` Dave Love
  2002-12-16 16:42           ` Richard Stallman
  0 siblings, 1 reply; 38+ messages in thread
From: Dave Love @ 2002-12-15 16:25 UTC (permalink / raw)
  Cc: handa

Richard Stallman <rms@gnu.org> writes:

> I suspect that in case 2 we sometimes want it to warn,
> but I am not certain.  I think that making it warn in case 2
> is a good idea to start with.

No.  That would significantly slow down the output operations.  The
warning messages would typically be obscure to users anyhow, since
they'd come from the guts of packages.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: iso-8859-1 and non-latin-1 chars
  2002-12-15 16:24       ` Dave Love
@ 2002-12-16  0:42         ` Kenichi Handa
  2002-12-19 22:35           ` Dave Love
  2002-12-16 14:06         ` Stefan Monnier
  2002-12-16 16:42         ` Richard Stallman
  2 siblings, 1 reply; 38+ messages in thread
From: Kenichi Handa @ 2002-12-16  0:42 UTC (permalink / raw)
  Cc: emacs-devel

In article <rzqlm2rfdx0.fsf@albion.dl.ac.uk>, Dave Love <d.love@dl.ac.uk> writes:
> Sure, but the cases I'm particularly thinking of are actually
> programs, not interactive use.  I think it's better to get escape
> codes, which can be reconstructed, than to get `?', which can't.

Yes, it's better for Emacs, but not for the other programs
(e.g. ispell).  And Emacs Lisp applications that use the
wrong coding system should be fixed anyway.

> I realize this won't work generally, and will be different
> for Emacs 22.  In Emacs 22, it would be reasonable to do
> what yudit does (as far as I remember) and write
> representations of the unicodes involved as \uxxxx or
> similar.

I was also thinking about that for emacs-unicode.  Perhaps
using the same format as yudit is good.   And, we'll provide
a command `recover-unencoded-characters'.

>>  So, my conclusion was that writing out those escape
>>  sequences not only violates the commonly accepted concept
>>  about a coding system,

> What concept do you mean, exactly?

For instance, the MIME charset iso-8859-1 can encode Latin-1
chars only, and it's stateless, no escape sequences.

---
Ken'ichi HANDA
handa@m17n.org

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: iso-8859-1 and non-latin-1 chars
  2002-12-15 16:24       ` Dave Love
  2002-12-16  0:42         ` Kenichi Handa
@ 2002-12-16 14:06         ` Stefan Monnier
  2002-12-19 22:33           ` Dave Love
  2002-12-16 16:42         ` Richard Stallman
  2 siblings, 1 reply; 38+ messages in thread
From: Stefan Monnier @ 2002-12-16 14:06 UTC (permalink / raw)
  Cc: Kenichi Handa

> > So, my conclusion was that writing out those escape
> > sequences not only violates the commonly accepted concept
> > about a coding system,
> What concept do you mean, exactly?

One of the problems is that before the recent change, iso-latin-1 was
sometimes outputting non-latin-1 characters (i.e. bytes between
128 and 160).  That breaks ispell and can break other programs as well.


	Stefan

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: iso-8859-1 and non-latin-1 chars
  2002-12-15 16:25         ` Dave Love
@ 2002-12-16 16:42           ` Richard Stallman
  0 siblings, 0 replies; 38+ messages in thread
From: Richard Stallman @ 2002-12-16 16:42 UTC (permalink / raw)
  Cc: handa

    > I suspect that in case 2 we sometimes want it to warn,
    > but I am not certain.  I think that making it warn in case 2
    > is a good idea to start with.

    No.  That would significantly slow down the output operations.

Would you explain why you think so?  Given that in many cases Emacs
does check for such a problem and warn, in the usual case, why should
doing so in other cases cause any concern about speed?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: iso-8859-1 and non-latin-1 chars
  2002-12-15 16:24       ` Dave Love
  2002-12-16  0:42         ` Kenichi Handa
  2002-12-16 14:06         ` Stefan Monnier
@ 2002-12-16 16:42         ` Richard Stallman
  2 siblings, 0 replies; 38+ messages in thread
From: Richard Stallman @ 2002-12-16 16:42 UTC (permalink / raw)
  Cc: handa

    Sure, but the cases I'm particularly thinking of are actually
    programs, not interactive use.  I think it's better to get escape
    codes, which can be reconstructed, than to get `?', which can't.

Please describe one of these cases *precisely*, so we can see if we
agree with you.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* None
  2002-12-14 18:31             ` Richard Stallman
@ 2002-12-17 11:41               ` Kenichi Handa
  0 siblings, 0 replies; 38+ messages in thread
From: Kenichi Handa @ 2002-12-17 11:41 UTC (permalink / raw)
  Cc: emacs-devel

In article <E18NH4s-0003g1-00@fencepost.gnu.org>, Richard Stallman <rms@gnu.org> writes:
>     Notes on the change:
>     (1) I made a new variable coding-system-require-warning, and
>     universal-coding-system-argument binds it to t.

>     (2) If car of the arg DEFAULT-CODING-SYSTEM is t, it
>     indicates that select-safe-coding-system should not include
>     buffer-file-coding-system and most preferred coding system
>     in a list of coding systems tried by default.
>     Fwrite_region calls select-safe-coding-system in this way if
>     coding-system-require-warning is non-nil.

>     (3) Now a user can specify any coding system in
>     select-safe-coding-system at his own risk.  At least, this is
>     necessary when an unsafe coding system is specified by C-x
>     RET c.

> I didn't have time to read the code (sorry), but that sounds ok to me.

I've just installed it in HEAD.

---
Ken'ichi HANDA
handa@m17n.org

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: iso-8859-1 and non-latin-1 chars
  2002-12-16 14:06         ` Stefan Monnier
@ 2002-12-19 22:33           ` Dave Love
  0 siblings, 0 replies; 38+ messages in thread
From: Dave Love @ 2002-12-19 22:33 UTC (permalink / raw)
  Cc: Kenichi Handa

"Stefan Monnier" <monnier+gnu/emacs@rum.cs.yale.edu> writes:

> One of the problems is that before the recent change, iso-latin-1 was
> sometimes outputting non-latin-1 characters (i.e. bytes between
> 128 and 160).  That breaks ispell and can break other programs as well.

As I said, any such problems need fixing in ispell.el and friends.

However, I don't think the above causes ispell misalignment errors --
the original complaint (due to iso 2022 escape sequences being
produced).  Anyway, the current code does nothing about C1 characters,
and there's no parameter to change that behaviour:

(memq 'iso-latin-1
      (find-coding-systems-string (string (make-char 'latin-iso8859-1 #xa3)
					  128
					  (make-char 'latin-iso8859-2 #xa3))))
  => nil

(encode-coding-string (string (make-char 'latin-iso8859-1 #xa3)
			      128
			      (make-char 'latin-iso8859-2 #xa3))
		      'iso-latin-1)
  => "\243\200?"
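
The encoded result above still contains the byte \200 (0x80), which is a C1
control code.  A minimal C sketch of how a caller could detect such bytes
before handing text to an external program; has_c1_bytes is a hypothetical
name, not an Emacs function, and the range 0x80..0x9F is the usual C1
range (byte 160, 0xA0, is the Latin-1 no-break space and is not flagged).

```c
#include <stddef.h>

/* Hypothetical helper: return 1 if the N bytes at S contain any C1
   control byte (0x80..0x9F), otherwise 0.  */
int
has_c1_bytes (const unsigned char *s, size_t n)
{
  for (size_t i = 0; i < n; i++)
    if (s[i] >= 0x80 && s[i] <= 0x9F)
      return 1;
  return 0;
}
```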

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: iso-8859-1 and non-latin-1 chars
  2002-12-16  0:42         ` Kenichi Handa
@ 2002-12-19 22:35           ` Dave Love
  2002-12-23  6:40             ` Kenichi Handa
  0 siblings, 1 reply; 38+ messages in thread
From: Dave Love @ 2002-12-19 22:35 UTC (permalink / raw)
  Cc: emacs-devel

Kenichi Handa <handa@m17n.org> writes:

> Yes, it's better for Emacs, but not for the other programs
> (e.g. ispell).  And such Emacs Lisp applications that use
> wrong coding system should be fixed anyway.

Yes, my point is that ispell should be fixed, but people are avoiding
the issue.  Other programs allow users to set a coding system for some
file and they sometimes set it wrongly; I think I mentioned BBDB
initially as a real example.

> > What concept do you mean, exactly?
> 
> For instance, the MIME charset iso-8859-1 can encode Latin-1
> chars only, and it's stateless, no escape sequences.

But the issue is what happens when you tell them to encode something
they can't (safely) encode.  They either have to produce some
more-or-less arbitrary result or to arrange to signal an error.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: iso-8859-1 and non-latin-1 chars
  2002-12-19 22:35           ` Dave Love
@ 2002-12-23  6:40             ` Kenichi Handa
  2002-12-23 12:27               ` Dave Love
  0 siblings, 1 reply; 38+ messages in thread
From: Kenichi Handa @ 2002-12-23  6:40 UTC (permalink / raw)
  Cc: emacs-devel

In article <rzqbs3h63i5.fsf@albion.dl.ac.uk>, Dave Love <d.love@dl.ac.uk> writes:
> Kenichi Handa <handa@m17n.org> writes:
>>  Yes, it's better for Emacs, but not for the other programs
>>  (e.g. ispell).  And such Emacs Lisp applications that use
>>  wrong coding system should be fixed anyway.

> Yes, my point is that ispell should be fixed, but people are avoiding
> the issue.

I don't think the program `ispell' itself should be fixed.
When it requires a latin-1 input, sending an ESC sequence is
not good.  We can't blame ispell for not interpreting that
ESC sequence properly.

But, ispell.el should be made more robust.  When it finds an
unencodable character in a word, perhaps, it should show the
word as a misspelled word to a user instead of sending it to
the ispell program.

> Other programs allow users to set a coding system for some
> file and they sometimes set it wrongly; I think I mentioned BBDB
> initially as a real example.

Then BBDB should call select-safe-coding-system before
silently using a specified coding system.

>>  > What concept do you mean, exactly?
>>  
>>  For instance, the MIME charset iso-8859-1 can encode Latin-1
>>  chars only, and it's stateless, no escape sequences.

> But the issue is what happens when you tell them to encode something
> they can't (safely) encode.  They either have to produce some
> more-or-less arbitrary result or to arrange to signal an error.

Yes, of course.  But, I think producing "?" is less
surprising than producing an ESC sequence.
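
A minimal C sketch of the `?' behaviour, for illustration only; it is not
the code in coding.c, and encode_latin1_lossy is a hypothetical name.  The
encoder is stateless and writes exactly one byte per character; the count
it returns is an addition for this example, so that a caller could decide
to warn.

```c
#include <stddef.h>

/* Hypothetical helper: encode N Unicode code points as Latin-1, one
   output byte per code point.  Code points above 0xFF cannot be
   represented and are replaced by '?'.  OUT must have room for N
   bytes.  Returns the number of characters that were replaced.  */
size_t
encode_latin1_lossy (const unsigned long *cps, size_t n,
                     unsigned char *out)
{
  size_t replaced = 0;

  for (size_t i = 0; i < n; i++)
    {
      if (cps[i] <= 0xFF)
        out[i] = (unsigned char) cps[i];
      else
        {
          out[i] = '?';
          replaced++;
        }
    }
  return replaced;
}
```

The replacement cannot be undone, but the output stays within what a
Latin-1 consumer expects: no escape sequences and no change in length.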

---
Ken'ichi HANDA
handa@m17n.org

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: iso-8859-1 and non-latin-1 chars
  2002-12-23  6:40             ` Kenichi Handa
@ 2002-12-23 12:27               ` Dave Love
  2002-12-25 13:05                 ` Kenichi Handa
  0 siblings, 1 reply; 38+ messages in thread
From: Dave Love @ 2002-12-23 12:27 UTC (permalink / raw)
  Cc: emacs-devel

Kenichi Handa <handa@m17n.org> writes:

> I don't think the program `ispell' itself should be fixed.

I meant ispell.el, though it is unfortunate that we don't have a
spelling program that deals with multibyte encodings as far as I know.

> But, ispell.el should be made more robust.  When it finds an
> unencodable character in a word, perhaps, it should show the
> word as a misspelled word to a user instead of sending it to
> the ispell program.

Yes, that's what I've implemented for flyspell.  The problem is that
ispell.el sends a whole line to the subprocess (unlike flyspell).
Also it doesn't use the Emacs syntax table to decide what's a word.

> Then BBDB should call select-safe-coding-system before
> silently using a specified coding system.

No, it should use a general coding system to store all text.  I
implemented that, but people have used inappropriate values (either by
setting the relevant variable or via `file-coding-system-alist') and
there could be problems in the transition to the new version.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: iso-8859-1 and non-latin-1 chars
  2002-12-23 12:27               ` Dave Love
@ 2002-12-25 13:05                 ` Kenichi Handa
  2002-12-31 17:14                   ` Ken Stevens
                                     ` (2 more replies)
  0 siblings, 3 replies; 38+ messages in thread
From: Kenichi Handa @ 2002-12-25 13:05 UTC (permalink / raw)
  Cc: emacs-devel

In article <rzqr8c89ay9.fsf@albion.dl.ac.uk>, Dave Love <d.love@dl.ac.uk> writes:
> Kenichi Handa <handa@m17n.org> writes:
>>  I don't think the program `ispell' itself should be fixed.

> I meant ispell.el, though it is unfortunate that we don't have a
> spelling program that deals with multibyte encodings as far as I know.

Sure.  Though, I see this mail in linux-utf8 mailing list.

On Wed Feb  6 16:18:23 2002 +0300 Maxim N. Bychkov wrote:
>>Good day.
>>
>>Could anybody explain me how to use unicode symbols in Ispell?
>>
>>Thank you in advance.
>>
>
>AFAIK you can't.  But some ispell dictionaries, like
>esperanto, have a hack for command line -Tutf8 option to
>spellcheck UTF-8 text.

>>  But, ispell.el should be made more robust.  When it finds an
>>  unencodable character in a word, perhaps, it should show the
>>  word as a misspelled word to a user instead of sending it to
>>  the ispell program.

> Yes, that's what I've implemented for flyspell.  The problem is that
> ispell.el sends a whole line to the subprocess (unlike flyspell).
> Also it doesn't use the Emacs syntax table to decide what's a word.

But, I think it's not that difficult to fix this behaviour
so that it breaks a line at an unencodable word.
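
A minimal C sketch of locating the word at which a line would have to be
broken.  It assumes that words are separated by ASCII spaces only and that
"unencodable" means a code point above 0xFF; ispell.el's real word rules
are more involved, and find_unencodable_word is a hypothetical name.

```c
#include <stddef.h>

/* Hypothetical helper: find the first space-delimited word in
   CPS[0..N) that contains a code point Latin-1 cannot represent.
   On success store its start index and length in *START and *LEN and
   return 1; return 0 if every word is encodable.  */
int
find_unencodable_word (const unsigned long *cps, size_t n,
                       size_t *start, size_t *len)
{
  size_t i = 0;

  while (i < n)
    {
      /* Skip separators.  */
      while (i < n && cps[i] == ' ')
        i++;

      size_t word_start = i;
      int bad = 0;

      /* Scan one word, remembering whether it is unencodable.  */
      while (i < n && cps[i] != ' ')
        {
          if (cps[i] > 0xFF)
            bad = 1;
          i++;
        }

      if (bad)
        {
          *start = word_start;
          *len = i - word_start;
          return 1;
        }
    }
  return 0;
}
```

A caller could send the text before *START to the subprocess, present the
flagged word to the user as misspelled, and continue after it.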

>>  Then BBDB should call select-safe-coding-system before
>>  silently using a specified coding system.

> No, it should use a general coding system to store all
> text.

If that is possible (i.e. users allow that), yes.   But,
why is calling select-safe-coding-system not good?

> I implemented that, but people have used inappropriate
> values (either by setting the relevant variable or via
> `file-coding-system-alist') and there could be problems in
> the transition to the new version.

Those people should have already encountered a problem as I
wrote before because they can't decode a text back with the
same coding system.

By the way, the function choose_write_coding_system ()
checks a coding system specified in file-coding-system-alist
by select-safe-coding-system.

---
Ken'ichi HANDA
handa@m17n.org

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: iso-8859-1 and non-latin-1 chars
  2002-12-25 13:05                 ` Kenichi Handa
@ 2002-12-31 17:14                   ` Ken Stevens
  2003-01-06 19:28                     ` Dave Love
  2003-01-06 19:18                   ` Dave Love
  2003-01-06 19:19                   ` Dave Love
  2 siblings, 1 reply; 38+ messages in thread
From: Ken Stevens @ 2002-12-31 17:14 UTC (permalink / raw)
  Cc: emacs-devel

Kenichi Handa writes: 

> >>  I don't think the program `ispell' itself should be fixed.
> 
> > I meant ispell.el, though it is unfortunate that we don't have a
> > spelling program that deals with multibyte encodings as far as I know.
> 
> Sure.  Though, I see this mail in linux-utf8 mailing list.

Ispell _does_ support multibyte characters.  This was one of the
historical reasons ispell.el did not use emacs syntax tables to
determine word boundaries.  (It supported latex words that included
escape sequences such as \'{o}, etc.)

I am not sure what it would take to support all the internal emacs
encodings, or if this would be the best approach.  New libraries would
need to be built with the new syntax.

I have cc'ed Geoff Kuenning, since he is the ispell author.

regards		   -Ken
__________________________________________________________________________
Ken Stevens					e-mail: k.stevens@ieee.org
http://www.kdstevens.com/~stevens/ispell-page.html

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: iso-8859-1 and non-latin-1 chars
  2002-12-25 13:05                 ` Kenichi Handa
  2002-12-31 17:14                   ` Ken Stevens
@ 2003-01-06 19:18                   ` Dave Love
  2003-01-07 13:01                     ` Kenichi Handa
  2003-01-06 19:19                   ` Dave Love
  2 siblings, 1 reply; 38+ messages in thread
From: Dave Love @ 2003-01-06 19:18 UTC (permalink / raw)
  Cc: emacs-devel

Kenichi Handa <handa@m17n.org> writes:

> > No, it should use a general coding system to store all
> > text.
> 
> If that is possible (i.e. users allow that), yes.

They should not change it, but...  (BBDB originally had no encoding
support, and some people took part of my advice to add .bbdb to
`file-coding-system-alist', but chose to ignore the actual coding
system :-(.)

> But, why is calling select-safe-coding-system not good?

It's not relevant.  You need to be able to store all text in the file,
not just Latin-1 names, for instance.  It's the same issue as for
auto-save files &c.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: iso-8859-1 and non-latin-1 chars
  2002-12-25 13:05                 ` Kenichi Handa
  2002-12-31 17:14                   ` Ken Stevens
  2003-01-06 19:18                   ` Dave Love
@ 2003-01-06 19:19                   ` Dave Love
  2 siblings, 0 replies; 38+ messages in thread
From: Dave Love @ 2003-01-06 19:19 UTC (permalink / raw)
  Cc: emacs-devel

Kenichi Handa <handa@m17n.org> writes:

> > I meant ispell.el, though it is unfortunate that we don't have a
> > spelling program that deals with multibyte encodings as far as I know.
> 
> Sure.  Though, I see this mail in linux-utf8 mailing list.

I assumed ispell could be made to work with utf-8, though I've never
tried.  The most unfortunate thing is that GNU Aspell seems to have
rejected multibyte when it had the opportunity to DTRT.

> But, I think it's not that difficult to fix this behaviour
> so that it breaks a line at an unencodable word.

Not that hard, but what is a word is somewhat complicated.  I don't
know if there's any real advantage to presenting a whole line rather
than a word at a time, which seems to work OK for flyspell.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: iso-8859-1 and non-latin-1 chars
  2002-12-31 17:14                   ` Ken Stevens
@ 2003-01-06 19:28                     ` Dave Love
  0 siblings, 0 replies; 38+ messages in thread
From: Dave Love @ 2003-01-06 19:28 UTC (permalink / raw)
  Cc: Kenichi Handa

Ken Stevens <kstevens@ichips.intel.com> writes:

> Ispell _does_ support multibyte characters.  This was one of the
> historical reasons ispell.el did not use emacs syntax tables to
> determine word boundaries.  (It supported latex words that included
> escape sequences such as \'{o}, etc.)

I didn't think that's really the same thing, but it's a long time
since I hacked on ispell.  Also, I don't see why Emacs couldn't match
such words the same as ispell.

Anyhow, as I don't know, what does one have to do to create and use a
dictionary for utf-8 text?  I could probably add support for that.

> I am not sure what it would take to support all the internal emacs
> encodings, or if this would be the best approach.

It is only _external_ encodings that are relevant, and perhaps only
utf-8.  [I don't know if spell-checking actually makes sense in the
Oriental languages which typically use the multibyte iso-2022
encodings.]

Note that Emacs could cope now with checking utf-8-encoded text
against a dictionary for an 8-bit character set as long as the text
concerned can be encoded in that set.  The text will be appropriately
encoded when it is sent to the subprocess.  I think that really
requires using Emacs's (multibyte) syntax tables, though.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: iso-8859-1 and non-latin-1 chars
  2003-01-06 19:18                   ` Dave Love
@ 2003-01-07 13:01                     ` Kenichi Handa
  2003-01-10 10:59                       ` Dave Love
  0 siblings, 1 reply; 38+ messages in thread
From: Kenichi Handa @ 2003-01-07 13:01 UTC (permalink / raw)
  Cc: emacs-devel

In article <rzq3co6dr3u.fsf@albion.dl.ac.uk>, Dave Love <d.love@dl.ac.uk> writes:

> Kenichi Handa <handa@m17n.org> writes:
>>  > No, it should use a general coding system to store all
>>  > text.
>>  
>>  If that is possible (i.e. users allow that), yes.

> They should not change it, but...  (BBDB originally had no encoding
> support, and some people took part of my advice to add .bbdb to
> `file-coding-system-alist', but chose to ignore the actual coding
> system :-(.)

>>  But, why is calling select-safe-coding-system not good?

> It's not relevant.  You need to be able to store all text in the file,
> not just Latin-1 names, for instance.  It's the same issue as for
> auto-save files &c.

In that case, using latin-1 is not appropriate anyway.
Instead of enabling latin-1 to save all characters, bbdb
should be modified to use a better coding system.

---
Ken'ichi HANDA
handa@m17n.org

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: iso-8859-1 and non-latin-1 chars
  2003-01-07 13:01                     ` Kenichi Handa
@ 2003-01-10 10:59                       ` Dave Love
  0 siblings, 0 replies; 38+ messages in thread
From: Dave Love @ 2003-01-10 10:59 UTC (permalink / raw)
  Cc: emacs-devel

Kenichi Handa <handa@m17n.org> writes:

> In that case, using latin-1 is not appropriate anyway.

I know, but other people seem to know better.

> Instead of enabling latin-1 to save all characters, bbdb
> should be modified to use a better coding system.

Of course.  I thought I said I made the modification, but I can't stop
people breaking it.

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2003-01-10 10:59 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2002-11-28 17:01 iso-8859-1 and non-latin-1 chars Dave Love
2002-12-02 15:47 ` Richard Stallman
2002-12-06 16:38   ` Dave Love
2002-12-09  6:08     ` Kenichi Handa
2002-12-15 16:24       ` Dave Love
2002-12-16  0:42         ` Kenichi Handa
2002-12-19 22:35           ` Dave Love
2002-12-23  6:40             ` Kenichi Handa
2002-12-23 12:27               ` Dave Love
2002-12-25 13:05                 ` Kenichi Handa
2002-12-31 17:14                   ` Ken Stevens
2003-01-06 19:28                     ` Dave Love
2003-01-06 19:18                   ` Dave Love
2003-01-07 13:01                     ` Kenichi Handa
2003-01-10 10:59                       ` Dave Love
2003-01-06 19:19                   ` Dave Love
2002-12-16 14:06         ` Stefan Monnier
2002-12-19 22:33           ` Dave Love
2002-12-16 16:42         ` Richard Stallman
     [not found]       ` <E18LZqb-0007si-00@fencepost.gnu.org>
2002-12-15 16:25         ` Dave Love
2002-12-16 16:42           ` Richard Stallman
     [not found]     ` <E18LCz8-0004It-00@fencepost.gnu.org>
2002-12-10 23:47       ` Dave Love
2002-12-11 20:39         ` Richard Stallman
2002-12-13  2:58           ` Kenichi Handa
2002-12-14 18:31             ` Richard Stallman
2002-12-17 11:41               ` None Kenichi Handa
  -- strict thread matches above, loose matches on Subject: below --
2002-11-07 14:57 iso-8859-1 and non-latin-1 chars Stefan Monnier
2002-11-07 15:25 ` Eli Zaretskii
2002-11-07 17:06   ` Stefan Monnier
2002-11-07 23:42     ` Kenichi Handa
2002-11-07 23:58       ` Stefan Monnier
2002-11-09 11:54       ` Richard Stallman
2002-11-09 20:32         ` Stefan Monnier
2002-11-11 10:19           ` Richard Stallman
2002-11-11  4:00         ` Kenichi Handa
2002-11-12  5:47           ` Richard Stallman
2002-11-18  0:08             ` Kenichi Handa
2002-11-18 19:09               ` Richard Stallman
