Coding system robustness?

unofficial mirror of emacs-devel@gnu.org 

* Coding system robustness?
@ 2005-03-18 17:45 David Kastrup
  2005-03-18 18:11 ` Stefan Monnier
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: David Kastrup @ 2005-03-18 17:45 UTC



Hi,

I'd like to know whether coding systems in general are supposed to be
robust, meaning that decoding some random byte string into the coding
system and reencoding it is guaranteed to deliver the same byte string
again?

Background for that question: I do error association in preview-latex
(via AUCTeX) with the original source text, and generally unrobust
transformations of the input may happen, such as splitting a
multibyte char in the middle, or transliterating some of those chars
but not others.  I currently work around this by having the process
use a raw-text encoding, replacing potentially questionable stuff and
reencoding when it turns out that the contexts do not match the source
file.  This has the disadvantages that

a) I need to go through the works even in case TeX is set up nicely
enough to deliver mostly working characters, since the raw encoding
will match much less often than a properly decoded stream.

b) The displayed output looks like junk unnecessarily.  If we are
talking about multi-file documents written in different encodings,
this problem cannot be avoided with tolerable effort, but in the case
where the encodings in one document match, it would be nice for
AUCTeX to have a cleaner output buffer.

So what encodings are expected to be "transparent" for what versions
of Emacs (we are only interested in 21.3 and newer)?

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum

* Re: Coding system robustness?
  2005-03-18 17:45 Coding system robustness? David Kastrup
@ 2005-03-18 18:11 ` Stefan Monnier
  2005-03-18 18:33   ` David Kastrup
  2005-03-19  1:08   ` Kenichi Handa
  2005-03-19  0:52 ` Kenichi Handa
  2005-03-19  3:09 ` Richard Stallman
  2 siblings, 2 replies; 8+ messages in thread
From: Stefan Monnier @ 2005-03-18 18:11 UTC
  Cc: emacs-devel

> I'd like to know whether coding systems in general are supposed to be
> robust, meaning that decoding some random byte string into the coding
> system and reencoding it is guaranteed to deliver the same byte string
> again?

AFAIK, (encode-coding-string (decode-coding-string STR 'foo) 'foo)
should always return STR, otherwise it's a bug.
With the introduction of eight-bit-*, this should be true of "all"
coding-systems in Emacs-21, although IIRC some bugs were found in this area
and fixed in Emacs-CVS (maybe for utf-8), but my memory is fuzzy.
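
A minimal sketch of how the claim above could be checked for a given
byte string and coding system.  The helper name
`my-coding-round-trip-p' is made up for illustration and is not part
of Emacs; only `decode-coding-string', `encode-coding-string' and
`string=' are real functions.

```elisp
;; Hypothetical helper (the name is an assumption, not part of Emacs).
(defun my-coding-round-trip-p (bytes coding-system)
  "Return non-nil if BYTES survives a decode/encode cycle in CODING-SYSTEM.
BYTES is expected to be a unibyte string."
  (let* ((decoded (decode-coding-string bytes coding-system))
         (encoded (encode-coding-string decoded coding-system)))
    ;; Compare the original bytes with what comes back out.
    (string= bytes encoded)))

;; Example call: three Latin-1 bytes, with EOL conversion pinned to Unix
;; so that only the character encoding is exercised.
(my-coding-round-trip-p "\351t\351" 'iso-latin-1-unix)
```

A nil result for some STR would be a counterexample to the claim for
that particular coding system.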

        Stefan

* Re: Coding system robustness?
  2005-03-18 18:11 ` Stefan Monnier
@ 2005-03-18 18:33   ` David Kastrup
  2005-03-20  0:22     ` Richard Stallman
  2005-03-19  1:08   ` Kenichi Handa
  1 sibling, 1 reply; 8+ messages in thread
From: David Kastrup @ 2005-03-18 18:33 UTC
  Cc: emacs-devel

Stefan Monnier <monnier@iro.umontreal.ca> writes:

>> I'd like to know whether coding systems in general are supposed to be
>> robust, meaning that decoding some random byte string into the coding
>> system and reencoding it is guaranteed to deliver the same byte string
>> again?
>
> AFAIK, (encode-coding-string (decode-coding-string STR 'foo) 'foo)
> should always return STR, otherwise it's a bug.  With the
> introduction of eight-bit-*, this should be true of "all"
> coding-systems in Emacs-21, although IIRC some bugs were found in
> this area and fixed in Emacs-CVS (maybe for utf-8), but my memory is
> fuzzy.

Ok, that means I'll go for it.  I _know_ that XEmacs 21.4 is not
robust in that manner, and maybe I'll provide some customizable
workaround variable you can set to "raw-text" if necessary.  But if
you ask the XEmacs developers' opinion, they'll tell you that you are
mad to use mule-ucs in XEmacs 21.4, anyway.  Not that they could offer
any alternatives right now.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum

* Re: Coding system robustness?
  2005-03-18 17:45 Coding system robustness? David Kastrup
  2005-03-18 18:11 ` Stefan Monnier
@ 2005-03-19  0:52 ` Kenichi Handa
  2005-03-19  3:09 ` Richard Stallman
  2 siblings, 0 replies; 8+ messages in thread
From: Kenichi Handa @ 2005-03-19  0:52 UTC
  Cc: emacs-devel

In article <x54qf8df09.fsf@lola.goethe.zz>, David Kastrup <dak@gnu.org> writes:
> I'd like to know whether coding systems in general are supposed to be
> robust, meaning that decoding some random byte string into the coding
> system and reencoding it is guaranteed to deliver the same byte string
> again?

In general, no.

> Background for that question: I do error association in preview-latex
> (via AUCTeX) with the original source text, and generally unrobust
> transformations of the input may happen, such as splitting a
> multibyte char in the middle, or transliterating some of those chars
> but not others.  I currently work around this by having the process
> use a raw-text encoding, replacing potentially questionable stuff and
> reencoding when it turns out that the contexts do not match the source
> file.  This has the disadvantages that

> a) I need to go through the works even in case TeX is set up nicely
> enough to deliver mostly working characters, since the raw encoding
> will match much less often than a properly decoded stream.

> b) The displayed output looks like junk unnecessarily.  If we are
> talking about multi-file documents written in different encodings,
> this problem cannot be avoided with tolerable effort, but in the case
> where the encodings in one document match, it would be nice for
> AUCTeX to have a cleaner output buffer.

> So what encodings are expected to be "transparent" for what versions
> of Emacs (we are only interested in 21.3 and newer)?

With the latest code, the attached code automatically detects
these as transparent:

chinese-big5 chinese-iso-8bit cyrillic-iso-8bit emacs-mule
greek-iso-8bit hebrew-iso-8bit iso-latin-1 iso-latin-2
iso-latin-3 iso-latin-4 iso-latin-5 iso-latin-8 iso-latin-9
iso-safe japanese-iso-8bit japanese-shift-jis
korean-iso-8bit raw-text

I expect more CCL-based coding systems (lots of CPXXX) are
also transparent (at least utf-XX are so), but they can't be
detected automatically.

---
Ken'ichi HANDA
handa@m17n.org

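;; Collect the coding systems whose definition suggests that a
;; decode/encode cycle returns the original bytes.
(let ((round-trip-safe nil))
  (dolist (elt (coding-system-list t))
    ;; Skip anything with pre-write or post-read conversion functions,
    ;; since those may rewrite the text arbitrarily.
    (and (not (coding-system-pre-write-conversion elt))
         (not (coding-system-post-read-conversion elt))
         (let ((type (coding-system-type elt)))
           ;; Types 0, 1, 3 and 5 (emacs-mule, Shift-JIS, Big5 and
           ;; raw text in the Emacs 21 numbering) are accepted as is.
           (if (memq type '(0 1 3 5))
               (push elt round-trip-safe)
             ;; Type 2 (the ISO-2022 family) is accepted only if none
             ;; of flags 0..3 is a list and flag 8 is unset.
             (if (eq type 2)
                 (let ((flags (coding-system-flags elt)))
                   (if (and (not (consp (aref flags 0)))
                            (not (consp (aref flags 1)))
                            (not (consp (aref flags 2)))
                            (not (consp (aref flags 3)))
                            (not (aref flags 8)))
                       (push elt round-trip-safe))))))))
  ;; Print the collected list; return nil so it is not echoed twice.
  (pp round-trip-safe)
  nil)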

* Re: Coding system robustness?
  2005-03-18 18:11 ` Stefan Monnier
  2005-03-18 18:33   ` David Kastrup
@ 2005-03-19  1:08   ` Kenichi Handa
  2005-03-19  9:10     ` David Kastrup
  1 sibling, 1 reply; 8+ messages in thread
From: Kenichi Handa @ 2005-03-19  1:08 UTC
  Cc: emacs-devel

In article <87wts43jxx.fsf-monnier+emacs@gnu.org>, Stefan Monnier <monnier@iro.umontreal.ca> writes:

>>  I'd like to know whether coding systems in general are supposed to be
>>  robust, meaning that decoding some random byte string into the coding
>>  system and reencoding it is guaranteed to deliver the same byte string
>>  again?

> AFAIK, (encode-coding-string (decode-coding-string STR 'foo) 'foo)
> should always return STR, otherwise it's a bug.
> With the introduction of eight-bit-*, this should be true of "all"
> coding-systems in Emacs-21,

No.  Redundant escape sequences in iso-2022 based coding
systems are just ignored.  For instance,

  (decode-coding-string "\e(J" 'iso-2022-jp) => ""

And we can't recover "\e(J" on encoding.

---
Ken'ichi HANDA
handa@m17n.org

* Re: Coding system robustness?
  2005-03-18 17:45 Coding system robustness? David Kastrup
  2005-03-18 18:11 ` Stefan Monnier
  2005-03-19  0:52 ` Kenichi Handa
@ 2005-03-19  3:09 ` Richard Stallman
  2 siblings, 0 replies; 8+ messages in thread
From: Richard Stallman @ 2005-03-19  3:09 UTC
  Cc: emacs-devel

    I'd like to know whether coding systems in general are supposed to be
    robust, meaning that decoding some random byte string into the coding
    system and reencoding it is guaranteed to deliver the same byte string
    again?

Handa is the expert, but I am sure this is not generally the case.  In
some coding systems there is more than one representation for the same
series of Emacs characters.  I think one of the coding systems involves
shift-sequences that switch between sub-coding-systems.  Re-encoding
cannot preserve the shift-sequences.
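
A small sketch of the "more than one representation" point, assuming
that a redundant designation of ASCII in iso-2022-jp is dropped on
decoding in the same way as the "\e(J" example elsewhere in this
thread.  The helper name `my-same-decoded-text-p' is invented for
illustration.

```elisp
;; Hypothetical helper (the name is an assumption, not part of Emacs):
;; decode two byte strings with the same coding system and report
;; whether the resulting text is identical.
(defun my-same-decoded-text-p (bytes-a bytes-b coding-system)
  "Return non-nil if BYTES-A and BYTES-B decode to the same text."
  (string= (decode-coding-string bytes-a coding-system)
           (decode-coding-string bytes-b coding-system)))

;; "\e(B" designates ASCII, which iso-2022-jp assumes at the start
;; anyway, so both inputs are expected to decode to the same text.
(my-same-decoded-text-p "abc" "\e(Babc" 'iso-2022-jp)
```

When two different byte strings decode to the same text, re-encoding
can return at most one of them, so the round trip must fail for the
other.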

There may be some coding systems that are 1-1 encodings for the
Emacs characters they can handle.

* Re: Coding system robustness?
  2005-03-19  1:08   ` Kenichi Handa
@ 2005-03-19  9:10     ` David Kastrup
  0 siblings, 0 replies; 8+ messages in thread
From: David Kastrup @ 2005-03-19  9:10 UTC
  Cc: Stefan Monnier, emacs-devel

Kenichi Handa <handa@m17n.org> writes:

> In article <87wts43jxx.fsf-monnier+emacs@gnu.org>, Stefan Monnier <monnier@iro.umontreal.ca> writes:
>
>>>  I'd like to know whether coding systems in general are supposed to be
>>>  robust, meaning that decoding some random byte string into the coding
>>>  system and reencoding it is guaranteed to deliver the same byte string
>>>  again?
>
>> AFAIK, (encode-coding-string (decode-coding-string STR 'foo) 'foo)
>> should always return STR, otherwise it's a bug.
>> With the introduction of eight-bit-*, this should be true of "all"
>> coding-systems in Emacs-21,
>
> No.  Redundant escape sequences in iso-2022 based coding
> systems are just ignored.  For instance,
>
>   (decode-coding-string "\e(J" 'iso-2022-jp) => ""
>
> And we can't recover "\e(J" on encoding.

Ok, making the problem somewhat more confined: if I have a file that
is written _by_ _Emacs_ in some coding system, and then externally I
chop parts of it into pieces (not dropping material) without taking
multibyte boundaries into account, decode these pieces (with
interspersed ASCII) using the original coding system, encode the
result again to a unibyte string, properly replace the ASCII-fied
pieces with the original material and decode with the original coding
system again (phew), I am pretty sure that I have round-trip
behavior, right?
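
A rough sketch of the chopping scenario, using utf-8 as the example
coding system.  It assumes that bytes which cannot be decoded on their
own are kept as raw eight-bit characters and written back unchanged on
encoding; that holds in recent Emacs versions, and I have not checked
it against 21.3.  The helper name `my-split-round-trip' is invented
for illustration.

```elisp
;; Hypothetical helper (the name is an assumption, not part of Emacs):
;; split BYTES at byte offset POS, push each piece through
;; CODING-SYSTEM separately, and join the re-encoded pieces.
(defun my-split-round-trip (bytes pos coding-system)
  "Decode and re-encode both halves of BYTES split at POS; return the bytes."
  (let ((head (substring bytes 0 pos))
        (tail (substring bytes pos)))
    (concat
     (encode-coding-string (decode-coding-string head coding-system)
                           coding-system)
     (encode-coding-string (decode-coding-string tail coding-system)
                           coding-system))))

;; "\303\251" is the UTF-8 encoding of a single character; splitting at
;; offset 1 cuts it in the middle.
(string= (my-split-round-trip "\303\251" 1 'utf-8-unix) "\303\251")
```

A non-nil result only says that this particular split survived; it is
not a guarantee for stateful coding systems.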

Well, almost.  On escape-based coding systems I don't see how one can
encode/decode string parts in isolation in the first place, so I am
afraid that it is not really feasible to promise anything.  Do the
escapes at least start fresh every line?  I am just curious here;
there is no actual chance that I am going to support such a coding
system, and I don't see how I sensibly could.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum

* Re: Coding system robustness?
  2005-03-18 18:33   ` David Kastrup
@ 2005-03-20  0:22     ` Richard Stallman
  0 siblings, 0 replies; 8+ messages in thread
From: Richard Stallman @ 2005-03-20  0:22 UTC
  Cc: monnier, emacs-devel

    Stefan Monnier <monnier@iro.umontreal.ca> writes:

    > AFAIK, (encode-coding-string (decode-coding-string STR 'foo) 'foo)
    > should always return STR, otherwise it's a bug.  

As Handa has explained, that is definitely NOT true.
