unofficial mirror of emacs-devel@gnu.org 
* decode-coding-string gone awry?
@ 2005-02-13  3:50 David Kastrup
  2005-02-14  1:50 ` Kenichi Handa
  2005-02-14 13:37 ` Stefan Monnier
  0 siblings, 2 replies; 32+ messages in thread
From: David Kastrup @ 2005-02-13  3:50 UTC (permalink / raw)



Hi,

I have the problem that within preview-latex there is a function that
assembles UTF-8 strings from single characters.  This function, when
used manually, mostly works.  It is called within a process sentinel
and fails rather consistently there with a current CVS Emacs.  I
include the code here since I don't know what might be involved:
regexp-quote, substring, char-to-string, etc.  The starting string is
taken from a buffer containing only ASCII (inserted by a process with
coding-system 'raw-text).

The output looks as shown below.


(defun preview-error-quote (string)
  "Turn STRING with potential ^^ sequences into a regexp.
To preserve sanity, additional ^ prefixes are matched literally,
so the character represented by ^^^ preceding extended characters
will not get matched, usually."
  (let (output case-fold-search)
    (while (string-match "\\^\\{2,\\}\\(\\([@-_?]\\)\\|[8-9a-f][0-9a-f]\\)"
			 string)
      (setq output
	    (concat output
		    (regexp-quote (substring string
					     0
					     (- (match-beginning 1) 2)))
		    (if (match-beginning 2)
			(concat
			 "\\(?:" (regexp-quote
				  (substring string
					     (- (match-beginning 1) 2)
					     (match-end 0)))
			 "\\|"
			 (char-to-string
			  (logxor (aref string (match-beginning 2)) 64))
			 "\\)")
		      (char-to-string
		       (string-to-number (match-string 1 string) 16))))
	    string (substring string (match-end 0))))
    (setq output (concat output (regexp-quote string)))
    (if (featurep 'mule)
	(prog2
	    (message "%S %S " output buffer-file-coding-system)
	    (setq output (decode-coding-string output buffer-file-coding-system))
	  (message "%S\n" output))
      output)))

The prog2 is just for the sake of debugging.  What we get here is
something akin to

"r Weise \\$f\\$ um~\\$1\\$ erhöht und \\$e\\$" mule-utf-8-unix 
#("r Weise \\$f\\$ um~\\$1\\$ erh\xc2\x81Á\xc2\xb6ht und \\$e\\$" 0 26 nil 26 28 (display "\\201" help-echo utf-8-help-echo untranslated-utf-8 129) 28 29 nil 29 31 (display "\\266" help-echo utf-8-help-echo untranslated-utf-8 182) 31 43 nil)

when this is called in a mule-utf-8-unix buffer with
(preview-error-quote "r Weise $f$ um~$1$ erh^^c3^^b6ht und $e$")

Namely, the decoding from utf-8 does not work.  The original strings
are multibyte before the conversion and look reasonable, with the
bytes produced by char-to-string.

Unfortunately, when I call this stuff by hand instead of from the
process sentinel, it mostly works, so it would appear to depend
on some uninitialized state or similar that is different in the
process sentinel.

Anybody have a clue what might go wrong here?

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum


* Re: decode-coding-string gone awry?
  2005-02-13  3:50 decode-coding-string gone awry? David Kastrup
@ 2005-02-14  1:50 ` Kenichi Handa
  2005-02-14  2:28   ` David Kastrup
  2005-02-15  6:15   ` Richard Stallman
  2005-02-14 13:37 ` Stefan Monnier
  1 sibling, 2 replies; 32+ messages in thread
From: Kenichi Handa @ 2005-02-14  1:50 UTC (permalink / raw)
  Cc: emacs-devel

In article <x5d5v52k4m.fsf@lola.goethe.zz>, David Kastrup <dak@gnu.org> writes:
> I have the problem that within preview-latex there is a function that
> assembles UTF-8 strings from single characters.  This function, when
> used manually, mostly works.  It is called within a process sentinel
> and fails rather consistently there with a current CVS Emacs.  I
> include the code here since I don't know what might be involved here:
> regexp-quote, substring, char-to-string etc.  The starting string is
> taken from a buffer containing only ASCII (inserted by a process with
> coding-system 'raw-text).

It seems that you are caught in a trap of automatic
unibyte->multibyte conversion.

> (defun preview-error-quote (string)
>   "Turn STRING with potential ^^ sequences into a regexp.
> To preserve sanity, additional ^ prefixes are matched literally,
> so the character represented by ^^^ preceding extended characters
> will not get matched, usually."
>   (let (output case-fold-search)
>     (while (string-match "\\^\\{2,\\}\\(\\([@-_?]\\)\\|[8-9a-f][0-9a-f]\\)"
> 			 string)
>       (setq output
> 	    (concat output
> 		    (regexp-quote (substring string
> 					     0
> 					     (- (match-beginning 1) 2)))

If STRING is taken from a multibyte buffer, it is a
multibyte string.  Thus, the above substring also returns a
multibyte string.

> 		    (if (match-beginning 2)
> 			(concat
> 			 "\\(?:" (regexp-quote
> 				  (substring string
> 					     (- (match-beginning 1) 2)
> 					     (match-end 0)))
> 			 "\\|"
> 			 (char-to-string
> 			  (logxor (aref string (match-beginning 2)) 64))
> 			 "\\)")
> 		      (char-to-string
> 		       (string-to-number (match-string 1 string) 16))))

But this char-to-string produces a unibyte string.  So, on
concatenating them, the unibyte string is automatically
converted to multibyte by the string-make-multibyte function,
which usually produces a multibyte string containing latin-1
chars.
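For illustration, a minimal sketch of that trap (not from the original
report; `string-to-multibyte' just stands in for a string taken from a
multibyte buffer, and char-to-string behaves as described above):

(let ((head (string-to-multibyte "erh"))    ; multibyte, as if from the buffer
      (byte (char-to-string #xc3)))         ; unibyte string holding one raw byte
  (multibyte-string-p (concat head byte)))  ; => t, the byte is re-read as latin-1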

> 	    string (substring string (match-end 0))))
>     (setq output (concat output (regexp-quote string)))
>     (if (featurep 'mule)
> 	(prog2
> 	    (message "%S %S " output buffer-file-coding-system)
> 	    (setq output (decode-coding-string output buffer-file-coding-system))

And this decode-coding-string treats the internal byte
sequence of a multibyte string OUTPUT as utf-8, thus you get
some garbage.

> Unfortunately, when I call this stuff by hand instead of from the
> process sentinel, it mostly works

That is because the string you give to preview-error-quote
is a unibyte string in that case.  The Lisp reader generates
a unibyte string when it sees an ASCII-only string.

Ex: (multibyte-string-p "abc") => nil

This will also return an incorrect string.

(preview-error-quote
  (string-to-multibyte "r Weise $f$ um~$1$ erh^^c3^^b6ht und $e$"))

So, the easiest fix will be to do:
  (setq string (string-as-unibyte string))
in the head of preview-error-quote.
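An illustrative check of what that buys (the input string is made up;
the real STRING comes from the process buffer):

(let ((string (string-to-multibyte "erh^^c3^^b6ht")))
  (setq string (string-as-unibyte string))
  (multibyte-string-p string))   ; => nil, so later concats and the final
                                 ; decode-coding-string see plain bytes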

---
Ken'ichi HANDA
handa@m17n.org


* Re: decode-coding-string gone awry?
  2005-02-14  1:50 ` Kenichi Handa
@ 2005-02-14  2:28   ` David Kastrup
  2005-02-15  6:15   ` Richard Stallman
  1 sibling, 0 replies; 32+ messages in thread
From: David Kastrup @ 2005-02-14  2:28 UTC (permalink / raw)
  Cc: emacs-devel

Kenichi Handa <handa@m17n.org> writes:

> In article <x5d5v52k4m.fsf@lola.goethe.zz>, David Kastrup <dak@gnu.org> writes:
>> I have the problem that within preview-latex there is a function
>> that assembles UTF-8 strings from single characters.  This
>> function, when used manually, mostly works.
>
> It seems that you are caught in a trap of automatic
> unibyte->multibyte conversion.
>
>> (defun preview-error-quote (string)
>>   "Turn STRING with potential ^^ sequences into a regexp.
>> To preserve sanity, additional ^ prefixes are matched literally,
>> so the character represented by ^^^ preceding extended characters
>> will not get matched, usually."
>>   (let (output case-fold-search)
>>     (while (string-match "\\^\\{2,\\}\\(\\([@-_?]\\)\\|[8-9a-f][0-9a-f]\\)"
>> 			 string)
>>       (setq output
>> 	    (concat output
>> 		    (regexp-quote (substring string
>> 					     0
>> 					     (- (match-beginning 1) 2)))
>
> If STRING is taken from a multibyte buffer, it is a
> multibyte string.  Thus, the above substring also returns a
> multibyte string.
>
>> 		      (char-to-string
>> 		       (string-to-number (match-string 1 string) 16))))
>
> But this char-to-string produces a unibyte string.  So, on
> concatenating them, the unibyte string is automatically converted
> to multibyte by the string-make-multibyte function, which usually
> produces a multibyte string containing latin-1 chars.

Oh.  Latin-1 chars.  Can't I tell char-to-string to produce the same
sort of raw-marked chars that raw-text (as process-coding system)
appears to produce?

>>   (setq output (decode-coding-string output buffer-file-coding-system))
>
> And this decode-coding-string treats the internal byte
> sequence of a multibyte string OUTPUT as utf-8, thus you get
> some garbage.
>
>> Unfortunately, when I call this stuff by hand instead from the
>> process-sentinel, it mostly works
>
> That is because the string you give to preview-error-quote
> is a unibyte string in that case.  The Lisp reader generates
> a unibyte string when it sees an ASCII-only string.
>
> Ex: (multibyte-string-p "abc") => nil
>
> This will also return an incorrect string.
>
> (preview-error-quote
>   (string-to-multibyte "r Weise $f$ um~$1$ erh^^c3^^b6ht und $e$"))
>
> So, the easiest fix will be to do:
>   (setq string (string-as-unibyte string))
> in the head of preview-error-quote.

Sigh.  XEmacs-21.4-mule does not seem to have string-as-unibyte.  I'll
have to see whether it happens to work without it on XEmacs.  If not,
I'll have to come up with something else.

Thanks for the analysis!

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum


* Re: decode-coding-string gone awry?
  2005-02-13  3:50 decode-coding-string gone awry? David Kastrup
  2005-02-14  1:50 ` Kenichi Handa
@ 2005-02-14 13:37 ` Stefan Monnier
  2005-02-14 13:50   ` David Kastrup
  1 sibling, 1 reply; 32+ messages in thread
From: Stefan Monnier @ 2005-02-14 13:37 UTC (permalink / raw)
  Cc: emacs-devel

>     (while (string-match "\\^\\{2,\\}\\(\\([@-_?]\\)\\|[8-9a-f][0-9a-f]\\)"
>                        string)
>       (setq output
>           (concat output
>                   (regexp-quote (substring string
>                                            0
>                                            (- (match-beginning 1) 2)))
>                   (if (match-beginning 2)
>                       (concat
>                        "\\(?:" (regexp-quote
>                                 (substring string
>                                            (- (match-beginning 1) 2)
>                                            (match-end 0)))
>                        "\\|"
>                        (char-to-string
>                         (logxor (aref string (match-beginning 2)) 64))
>                        "\\)")
>                     (char-to-string
>                      (string-to-number (match-string 1 string) 16))))
>           string (substring string (match-end 0))))
>     (setq output (concat output (regexp-quote string)))
>     (if (featurep 'mule)
>       (prog2
>           (message "%S %S " output buffer-file-coding-system)
>           (setq output (decode-coding-string output buffer-file-coding-system))
>         (message "%S\n" output))
>       output)))

The problem is that by passing `output' to decode-coding-string you clearly
consider `output' to be a sequence of bytes.  But to construct `output' you
use pieces of `string' so you have to make sure that `string' is also
a sequence of bytes.  Assuming `string' comes from the TeX process, you can
do that by making sure that that process's output coding system is `binary'
(or `raw-text' if you want EOL-conversion).
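A minimal sketch of that setup (the buffer name and the coding system
for input sent to the process are hypothetical):

;; Make the TeX process hand raw bytes to its filter/sentinel.
(let ((proc (get-buffer-process "*tex-output*")))
  (when proc
    (set-process-coding-system proc 'raw-text-unix 'utf-8)))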


        Stefan


* Re: decode-coding-string gone awry?
  2005-02-14 13:37 ` Stefan Monnier
@ 2005-02-14 13:50   ` David Kastrup
  2005-02-14 16:57     ` Stefan Monnier
  0 siblings, 1 reply; 32+ messages in thread
From: David Kastrup @ 2005-02-14 13:50 UTC (permalink / raw)
  Cc: emacs-devel

Stefan Monnier <monnier@iro.umontreal.ca> writes:

>>     (while (string-match "\\^\\{2,\\}\\(\\([@-_?]\\)\\|[8-9a-f][0-9a-f]\\)"
>>                        string)
>>       (setq output
>>           (concat output
>>                   (regexp-quote (substring string
>>                                            0
>>                                            (- (match-beginning 1) 2)))
>>                   (if (match-beginning 2)
>>                       (concat
>>                        "\\(?:" (regexp-quote
>>                                 (substring string
>>                                            (- (match-beginning 1) 2)
>>                                            (match-end 0)))
>>                        "\\|"
>>                        (char-to-string
>>                         (logxor (aref string (match-beginning 2)) 64))
>>                        "\\)")
>>                     (char-to-string
>>                      (string-to-number (match-string 1 string) 16))))
>>           string (substring string (match-end 0))))
>>     (setq output (concat output (regexp-quote string)))
>>     (if (featurep 'mule)
>>       (prog2
>>           (message "%S %S " output buffer-file-coding-system)
>>           (setq output (decode-coding-string output buffer-file-coding-system))
>>         (message "%S\n" output))
>>       output)))
>
> The problem is that by passing `output' to decode-coding-string you
> clearly consider `output' to be a sequence of bytes.  But to
> construct `output' you use pieces of `string' so you have to make
> sure that `string' is also a sequence of bytes.  Assuming `string'
> comes from the TeX process, you can do that by making sure that that
> process's output coding system is `binary' (or `raw-text' if you
> want EOL-conversion).

I already mentioned that this _is_ exactly what we do already: the
problem is that some TeX systems are set up to quote _some_ bytes from
utf-8 in the ^^xx hexadecimal notation, and let some bytes through
unchanged.  It is completely braindead.  The funny thing is that with
the _mixed_ representation, the hard case, this code worked.  But with
the _complete_ ASCII transcription, it doesn't.  I have to experiment
a bit with things like string-as-multibyte and stuff to find out what
combination will be right all of the time.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum


* Re: decode-coding-string gone awry?
  2005-02-14 13:50   ` David Kastrup
@ 2005-02-14 16:57     ` Stefan Monnier
  2005-02-14 17:24       ` David Kastrup
  0 siblings, 1 reply; 32+ messages in thread
From: Stefan Monnier @ 2005-02-14 16:57 UTC (permalink / raw)
  Cc: emacs-devel

>> The problem is that by passing `output' to decode-coding-string you
>> clearly consider `output' to be a sequence of bytes.  But to
>> construct `output' you use pieces of `string' so you have to make
>> sure that `string' is also a sequence of bytes.  Assuming `string'
>> comes from the TeX process, you can do that by making sure that that
>> process's output coding system is `binary' (or `raw-text' if you
>> want EOL-conversion).

> I already mentioned that this _is_ exactly what we do already: the
> problem is that some TeX systems are set up to quote _some_ bytes from
> utf-8 in the ^^xx hexadecimal notation, and let some bytes through
> unchanged.

I'm not sure I understand.  What I meant above is not "make sure the TeX
process only outputs binary", but really set the `process-coding-system'
of the TeX process such that its output coding-system is `raw-text' or
`binary'.  This *should* (modulo bugs) ensure that the strings passed
to the process filter are unibyte.

If the string goes through a buffer instead of being processed directly from
the process filter, then you should also ensure that this buffer is unibyte.
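For example (buffer name hypothetical), something along these lines
before starting the run:

(with-current-buffer "*tex-output*"   ; hypothetical process buffer
  (set-buffer-multibyte nil))         ; keep the raw-text output as plain bytes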


        Stefan


* Re: decode-coding-string gone awry?
  2005-02-14 16:57     ` Stefan Monnier
@ 2005-02-14 17:24       ` David Kastrup
  2005-02-14 18:12         ` Stefan Monnier
  2005-02-15 17:28         ` Richard Stallman
  0 siblings, 2 replies; 32+ messages in thread
From: David Kastrup @ 2005-02-14 17:24 UTC (permalink / raw)
  Cc: emacs-devel

Stefan Monnier <monnier@iro.umontreal.ca> writes:

>>> The problem is that by passing `output' to decode-coding-string you
>>> clearly consider `output' to be a sequence of bytes.  But to
>>> construct `output' you use pieces of `string' so you have to make
>>> sure that `string' is also a sequence of bytes.  Assuming `string'
>>> comes from the TeX process, you can do that by making sure that that
>>> process's output coding system is `binary' (or `raw-text' if you
>>> want EOL-conversion).
>
>> I already mentioned that this _is_ exactly what we do already: the
>> problem is that some TeX systems are set up to quote _some_ bytes from
>> utf-8 in the ^^xx hexadecimal notation, and let some bytes through
>> unchanged.
>
> I'm not sure I understand.  What I meant above is not "make sure the
> TeX process only outputs binary", but really set the
> `process-coding-system' of the TeX process such that its output
> coding-system is `raw-text' or `binary'.  This *should* (modulo
> bugs) ensure that the strings passed to the process filter are
> unibyte.
>
> If the string goes through a buffer

Yes.

> instead of being processed directly from the process filter, then
> you should also ensure that this buffer is unibyte.

Yuk.  The problem is that this buffer is not only processed by
preview-latex, but also by AUCTeX, and the versions that get combined
may be different.  AUCTeX uses the source code buffer's file encoding
by default, which is fine for basically unibyte based coding systems.

If a buffer is unibyte, how will its characters get displayed?  In
particular, on a system that has all its language-environment set to
accommodate utf-8?  At what time does the decision whether a buffer is
unibyte or multibyte get made?

I guess that in the long run we will have to install something
directly at filter level, with some CCL program processing the TeX
output.  But at the moment I am trying to stumble along in the context
we have now.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum


* Re: decode-coding-string gone awry?
  2005-02-14 17:24       ` David Kastrup
@ 2005-02-14 18:12         ` Stefan Monnier
  2005-02-14 18:41           ` David Kastrup
  2005-02-15 17:28         ` Richard Stallman
  1 sibling, 1 reply; 32+ messages in thread
From: Stefan Monnier @ 2005-02-14 18:12 UTC (permalink / raw)
  Cc: emacs-devel

>> instead of being processed directly from the process filter, then
>> you should also ensure that this buffer is unibyte.

> Yuk.  The problem is that this buffer is not only processed by
> preview-latex, but also by AUCTeX, and the versions that get combined
> may be different.  AUCTeX uses the source code buffer's file encoding
> by default, which is fine for basically unibyte based coding systems.

If you can't change this part, then your best bet might be to do something
like:

(defun preview-error-quote (string)
  "Turn STRING with potential ^^ sequences into a regexp.
To preserve sanity, additional ^ prefixes are matched literally,
so the character represented by ^^^ preceding extended characters
will not get matched, usually."
  (let (output case-fold-search)
    (while (string-match "\\^*\\(\\^\\^\\(\\([@-_?]\\)\\|[8-9a-f][0-9a-f]\\)\\)+"
                         string)
      (setq output
            (concat output
                    (regexp-quote (substring string 0 (match-beginning 1)))
                    (decode-coding-string
                     (preview-dequote-thingies
                      (substring string (match-beginning 1) (match-end 0)))
                     buffer-file-coding-system))
            string (substring string (match-end 0))))
    (setq output (concat output (regexp-quote string)))
    output))

BTW, you can use the 3rd arg to string-match to avoid consing strings for
`string'.

This way you only apply decode-coding-string to the part of the string which
is still undecoded but not to the rest.
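A rough sketch of the START-argument idea (dummy data, and the actual
dequoting step is elided):

(let ((string "a^^c3b^^b6c")
      (start 0)
      output)
  (while (string-match "\\^\\^[0-9a-f][0-9a-f]" string start)
    (setq output (concat output (substring string start (match-beginning 0)))
          start  (match-end 0)))
  (concat output (substring string start)))   ; => "abc"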


        Stefan


* Re: decode-coding-string gone awry?
  2005-02-14 18:12         ` Stefan Monnier
@ 2005-02-14 18:41           ` David Kastrup
  2005-02-14 19:30             ` Stefan Monnier
  0 siblings, 1 reply; 32+ messages in thread
From: David Kastrup @ 2005-02-14 18:41 UTC (permalink / raw)
  Cc: emacs-devel

Stefan Monnier <monnier@iro.umontreal.ca> writes:

>>> instead of being processed directly from the process filter, then
>>> you should also ensure that this buffer is unibyte.
>
>> Yuk.  The problem is that this buffer is not only processed by
>> preview-latex, but also by AUCTeX, and the versions that get combined
>> may be different.  AUCTeX uses the source code buffer's file encoding
>> by default, which is fine for basically unibyte based coding systems.
>
> If you can't change this part, then your best bet might be to do something
> like:
>
> (defun preview-error-quote (string)
>   "Turn STRING with potential ^^ sequences into a regexp.
> To preserve sanity, additional ^ prefixes are matched literally,
> so the character represented by ^^^ preceding extended characters
> will not get matched, usually."
>   (let (output case-fold-search)
>     (while (string-match "\\^*\\(\\^\\^\\(\\([@-_?]\\)\\|[8-9a-f][0-9a-f]\\)\\)+"
>                          string)
>       (setq output
>             (concat output
>                     (regexp-quote (substring string 0 (match-beginning 1)))
>                     (decode-coding-string
>                      (preview-dequote-thingies
>                       (substring string (match-beginning 1) (match-end 0)))
>                      buffer-file-coding-system))
>             string (substring string (match-end 0))))
>     (setq output (concat output (regexp-quote string)))
>     output))
>
> BTW, you can use the 3rd arg to string-match to avoid consing strings for
> `string'.
>
> This way you only apply decode-coding-string to the part of the
> string which is still undecoded but not to the rest.

No use.  The gag is precisely that TeX may decide to split a _single_
Unicode character into some bytes that it will let go through
unchanged, and some bytes that it will transcribe into ^^ba notation.
If decode-coding-string is supposed to have a chance of reassembling
this junk, it must only be run at the end of reconstructing the byte
stream.  Yes, this is completely insane.  No, I can't avoid having to
deal with it somehow.

Give me a clue: what happens if a process inserts stuff with 'raw-text
encoding into a multibyte buffer?  'raw-text is a reconstructible
encoding, isn't it, so the stuff will get converted into some prefix
byte indicating "isolated single-byte entity instead of utf-8 char"
and the byte itself or something, right?  And decode-coding-string
does not want to work on something like that?

I have to admit to total cluelessness.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum


* Re: decode-coding-string gone awry?
  2005-02-14 18:41           ` David Kastrup
@ 2005-02-14 19:30             ` Stefan Monnier
  2005-02-14 20:09               ` David Kastrup
  0 siblings, 1 reply; 32+ messages in thread
From: Stefan Monnier @ 2005-02-14 19:30 UTC (permalink / raw)
  Cc: emacs-devel

> Give me a clue: what happens if a process inserts stuff with 'raw-text
> encoding into a multibyte buffer?  'raw-text is a reconstructible
> encoding, isn't it, so the stuff will get converted into some prefix
> byte indicating "isolated single-byte entity instead of utf-8 char"
> and the byte itself or something, right?  And decode-coding-string
> does not want to work on something like that?

If you want accented chars to appear as accented chars in the (process)
buffer (i.e. you don't want to change the AUCTeX part), then raw-text is
not an option anyway.  If you don't mind about accented chars appearing as
\NNN, then you can make the buffer unibyte and use `raw-text' as the
process's output coding-system.  That's the more robust approach.

If that option is out (i.e. you have to use a multibyte buffer), you'll have
to basically recover the original byte-sequence by replacing the

   (regexp-quote (substring string 0 (match-beginning 1)))

with

   (regexp-quote (encode-coding-string
                  (substring string 0 (match-beginning 1))
                  buffer-file-coding-system))

[assuming buffer-file-coding-system is the process's output coding-system] or

   (regexp-quote (string-make-unibyte
                  (substring string 0 (match-beginning 1))))

which is basically equivalent except that you lose control over which
coding-system is used.


        Stefan


* Re: decode-coding-string gone awry?
  2005-02-14 19:30             ` Stefan Monnier
@ 2005-02-14 20:09               ` David Kastrup
  2005-02-14 20:56                 ` Stefan Monnier
  0 siblings, 1 reply; 32+ messages in thread
From: David Kastrup @ 2005-02-14 20:09 UTC (permalink / raw)
  Cc: emacs-devel

Stefan Monnier <monnier@iro.umontreal.ca> writes:

>> Give me a clue: what happens if a process inserts stuff with
>> 'raw-text encoding into a multibyte buffer?  'raw-text is a
>> reconstructible encoding, isn't it, so the stuff will get converted
>> into some prefix byte indicating "isolated single-byte entity
>> instead of utf-8 char" and the byte itself or something, right?
>> And decode-coding-string does not want to work on something like
>> that?
>
> If you want accented chars to appear as accented chars in the
> (process) buffer (i.e. you don't want to change the AUCTeX part),
> then raw-text is not an option anyway.

Yes, I figured as much.  Perhaps I had better explain what I am doing
in the first place.  AUCTeX does the basic management of the buffer,
creating it, associating processes with it, making a filter routine
for it that inserts the strings after some scanning for keyphrases and
so on.

preview-latex uses all of this folderol, but turns the process output
encoding of its own processes to raw text.  This is something that
AUCTeX does _not_ yet do for its own processes.  AUCTeX's own
process output is more likely to be viewed by the user, anyway.  We
can't hope to get a really readable UTF-8 display for AUCTeX's own
processes at the moment, but AUCTeX's behavior right now leads to
user-readable output in all current cases _except_ when TeX thinks it
is in some Latin-1 locale while working on utf-8 input.

Now with the AUCTeX processes, user readability is the most important
thing.  If AUCTeX can't locate the buffer position exactly, it will at
least locate the line, and that's tolerable for all practical
purposes.

With preview-latex, it is not tolerable.  On the other hand, the
output from preview-latex processes is usually not shown to the user
at all: having an unreadable output buffer due to raw-text encoding is
quite ok.

So that is basically the background of why we can easily make the process
raw-text, but much less easily make the buffer unibyte: AUCTeX will
use the same buffer for its next run, just erasing it, and if it has
turned unibyte, we get into trouble.

> If you don't mind about accented chars appearing as \NNN, then you
> can make the buffer unibyte and use `raw-text' as the process's
> output coding-system.  That's the more robust approach.

If the accented chars (in fact, the whole upper 8bit page) appeared as
\NNN, this would actually mostly be a _win_ over the current situation
where we not too rarely get a mixture of raw bytes and nonsense
characters.  However, I am afraid that this is not quite possible
right now.

We are now in the process of preparing the last major standalone
release of preview-latex.  After that, it will get folded into AUCTeX,
and we will streamline the whole junk.  But in the next weeks, I still
want to get out a preview-latex that works with the current AUCTeX
releases and vice versa.

After that, we will probably make the process encoding raw-text for
the _whole_ of AUCTeX and use a CCL program for preprocessing the ^^
sequences back into bytes, essentially creating an efficient
artificial illusion of a TeX outputting sane error messages in all
surroundings.

> If that option is out (i.e. you have to use a multibyte buffer),
> you'll have to basically recover the original byte-sequence by
> replacing the
>
>    (regexp-quote (substring string 0 (match-beginning 1)))
>
> with
>
>    (regexp-quote (encode-coding-string
>                   (substring string 0 (match-beginning 1))
>                   buffer-file-coding-system))
>
> [assuming buffer-file-coding-system is the process's output
> coding-system]

The process output coding system being raw-text.  Do I really need to
actually encode raw-text?

>    (regexp-quote (string-make-unibyte
>                   (substring string 0 (match-beginning 1))))
>
> which is basically equivalent except that you lose control over
> which coding-system is used.

I have to admit to being befuddled.  I'll probably have to experiment
until I find something that works and cross my fingers.  I don't think I
have much of a chance to actually understand all of the involved
intricacies.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum


* Re: decode-coding-string gone awry?
  2005-02-14 20:09               ` David Kastrup
@ 2005-02-14 20:56                 ` Stefan Monnier
  2005-02-14 21:07                   ` David Kastrup
  2005-02-14 21:26                   ` David Kastrup
  0 siblings, 2 replies; 32+ messages in thread
From: Stefan Monnier @ 2005-02-14 20:56 UTC (permalink / raw)
  Cc: emacs-devel

> So that is basically the background of why we can easily make the process
> raw-text, but much less easily make the buffer unibyte: AUCTeX will
> use the same buffer for its next run, just erasing it, and if it has
> turned unibyte, we get into trouble.

OK.  raw-text is good.

> The process output coding system being raw-text.  Do I really need to
> actually encode raw-text?

If the string comes straight from raw-text (via a multibyte buffer), that
means it only has ascii and eight-bit-* chars, so all you need is to turn it
from multibyte to unibyte, which can be done with (encode-coding-string foo
'raw-text-unix) or (string-make-unibyte foo) or (string-as-unibyte foo).
The three options are basically equivalent in this case.

string-as-unibyte is +/- (encode-coding-string foo 'emacs-mule-unix)
string-make-unibyte is +/- (encode-coding-string foo locale-coding-system)
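A quick check of that equivalence for the raw-text case (illustrative;
the \303\266 escapes just stand in for bytes that came through the
process):

(let ((s (string-to-multibyte "erh\303\266ht")))   ; ASCII + eight-bit-* only
  (equal (encode-coding-string s 'raw-text-unix)
         (string-as-unibyte s)))                   ; => t in this case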

I personally prefer the use of encode-coding-string because it makes things
more explicit: you can mention that you're encoding with `raw-text' because
you're undoing the raw-text decoding done by the process's coding-system.
That makes it more obviously correct.


        Stefan


* Re: decode-coding-string gone awry?
  2005-02-14 20:56                 ` Stefan Monnier
@ 2005-02-14 21:07                   ` David Kastrup
  2005-02-14 21:29                     ` Stefan Monnier
  2005-02-14 21:26                   ` David Kastrup
  1 sibling, 1 reply; 32+ messages in thread
From: David Kastrup @ 2005-02-14 21:07 UTC (permalink / raw)
  Cc: emacs-devel

Stefan Monnier <monnier@iro.umontreal.ca> writes:

>> So that is basically the background of why we can easily make the
>> process raw-text, but much less easily make the buffer unibyte:
>> AUCTeX will use the same buffer for its next run, just erasing it,
>> and if it has turned unibyte, we get into trouble.
>
> OK.  raw-text is good.
>
>> The process output coding system being raw-text.  Do I really need
>> to actually encode raw-text?
>
> If the string comes straight from raw-text (via a multibyte buffer),
> that means it only has ascii and eight-bit-* chars, so all you need
> is to turn it from multibyte to unibyte, which can be done with
> (encode-coding-string foo 'raw-text-unix) or (string-make-unibyte
> foo) or (string-as-unibyte foo).  The three options are basically
> equivalent in this case.
>
> string-as-unibyte is +/- (encode-coding-string foo 'emacs-mule-unix)
> string-make-unibyte is +/- (encode-coding-string foo locale-coding-system)
>
> I personally prefer the use of encode-coding-string because it makes
> things more explicit: you can mention that you're encoding with
> `raw-text' because you're undoing the raw-text decoding done by the
> process's coding-system.  That makes it more obviously correct.

Phooey.  Ok, this sounds like it makes sense.  It also sounds like it
should also work under XEmacs without having to engage my brain in
particular.

Now I am venturing into the realm of pure luxury: is there a way to
have the eight-bit-* chars display as octal escapes always even when
real latin1 characters (inserted by a process with process-coding
latin1) get displayed transparently?  I seem to remember that in those
"crazy" utf-8 buffers I had, those that were created by decoding
raw-text, there appeared latin-1 characters like the infamous Ã
character.  But maybe I am mistaken about that.  I'll just experiment
with the stuff a bit and probably use C-x = a lot.

Thanks,

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum


* Re: decode-coding-string gone awry?
  2005-02-14 20:56                 ` Stefan Monnier
  2005-02-14 21:07                   ` David Kastrup
@ 2005-02-14 21:26                   ` David Kastrup
  1 sibling, 0 replies; 32+ messages in thread
From: David Kastrup @ 2005-02-14 21:26 UTC (permalink / raw)
  Cc: emacs-devel

Stefan Monnier <monnier@iro.umontreal.ca> writes:

>> So that is basically the background of why we can easily make the process
>> raw-text, but much less easily make the buffer unibyte: AUCTeX will
>> use the same buffer for its next run, just erasing it, and if it has
>> turned unibyte, we get into trouble.
>
> OK.  raw-text is good.

Just to follow up on my last mail: the raw-text inserted
8bit-characters _are_ displayed with octal escape sequences, just like
I wished they were.  While I am pretty sure that I _have_ seen
something else as well, it would appear that I confused this with what
an AUCTeX run (probably with an intentionally wrong process encoding
of latin-1) produced, or some earlier iteration of the software.

Thanks for all the hand-holding.  All this is a bit crazy: first the
process "converts" to raw-text, then I take this, "encode" it as
raw-text, interpret the escapes, and decode to utf-8 or whatever
else...

But that's a secondary issue.  We can try making something more
sensible once we have the merger behind us.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum


* Re: decode-coding-string gone awry?
  2005-02-14 21:07                   ` David Kastrup
@ 2005-02-14 21:29                     ` Stefan Monnier
  2005-02-14 21:57                       ` David Kastrup
  0 siblings, 1 reply; 32+ messages in thread
From: Stefan Monnier @ 2005-02-14 21:29 UTC (permalink / raw)
  Cc: emacs-devel

> Now I am venturing into the realm of pure luxury: is there a way to
> have the eight-bit-* chars display as octal escapes always even when
> real latin1 characters (inserted by a process with process-coding
> latin1) get displayed transparently?  I seem to remember that in those
> "crazy" utf-8 buffers I had, those that were created by decoding
> raw-text, there appeared latin-1 characters like the infamous Ã
> character.  But maybe I am mistaken about that.  I'll just experiment
> with the stuff a bit and probably use C-x = a lot.

The eight-bit-* chars are different characters than the latin1 ones, so they
can indeed be displayed differently.  The eight-bit-* chars have internal
codes 128-255, so you can use slots 128-255 of char tables to control how
they're displayed.  If the display-table says "nil" for one of them it'll be
displayed as \NNN.  IIRC in many normal startup situations, those slots are
set so as to display latin-1 chars.
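For instance, via disp-table.el (illustrative, for the internal
128-255 codes described above):

(require 'disp-table)
(standard-display-default 128 255)   ; slots back to nil => chars show as \NNN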


        Stefan


* Re: decode-coding-string gone awry?
  2005-02-14 21:29                     ` Stefan Monnier
@ 2005-02-14 21:57                       ` David Kastrup
  0 siblings, 0 replies; 32+ messages in thread
From: David Kastrup @ 2005-02-14 21:57 UTC (permalink / raw)
  Cc: emacs-devel

Stefan Monnier <monnier@iro.umontreal.ca> writes:

>> Now I am venturing into the realm of pure luxury: is there a way to
>> have the eight-bit-* chars display as octal escapes always even when
>> real latin1 characters (inserted by a process with process-coding
>> latin1) get displayed transparently?  I seem to remember that in those
>> "crazy" utf-8 buffers I had, those that were created by decoding
>> raw-text, there appeared latin-1 characters like the infamous Ã
>> character.  But maybe I am mistaken about that.  I'll just experiment
>> with the stuff a bit and probably use C-x = a lot.
>
> The eight-bit-* chars are different characters than the latin1 ones, so they
> can indeed be displayed differently.  The eight-bit-* chars have internal
> codes 128-255, so you can use slots 128-255 of char tables to control how
> they're displayed.  If the display-table says "nil" for one of them it'll be
> displayed as \NNN.  IIRC in many normal startup situations, those slots are
> set so as to display latin-1 chars.

That explains why I remembered seeing Latin-1 (which is my normal
setup).  That I was just now seeing the _expected_ \xxx sequences is
quite likely entirely the fault of my X11 environment which for some
completely unfathomable reason has LC_CTYPE=C set.  I suspect a recent
change to fluxbox, but have yet to find the culprit.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum


* Re: decode-coding-string gone awry?
  2005-02-14  1:50 ` Kenichi Handa
  2005-02-14  2:28   ` David Kastrup
@ 2005-02-15  6:15   ` Richard Stallman
  2005-02-15  9:31     ` David Kastrup
  2005-02-15 16:17     ` Stefan Monnier
  1 sibling, 2 replies; 32+ messages in thread
From: Richard Stallman @ 2005-02-15  6:15 UTC (permalink / raw)
  Cc: emacs-devel

    > 	    (setq output (decode-coding-string output buffer-file-coding-system))

    And this decode-coding-string treats the internal byte
    sequence of a multibyte string OUTPUT as utf-8, thus you get
    some garbage.

Is it reasonable to operate with decode-coding-string on a multibyte
string?  If that is nonsense, maybe we should make it get an error,
to help people debug such problems.

If there are some few cases where decode-coding-string makes sense on
a multibyte string, maybe we can make it get an error except in those
few cases.


* Re: decode-coding-string gone awry?
  2005-02-15  6:15   ` Richard Stallman
@ 2005-02-15  9:31     ` David Kastrup
  2005-02-15 16:17     ` Stefan Monnier
  1 sibling, 0 replies; 32+ messages in thread
From: David Kastrup @ 2005-02-15  9:31 UTC (permalink / raw)
  Cc: emacs-devel, Kenichi Handa

Richard Stallman <rms@gnu.org> writes:

>     > 	    (setq output (decode-coding-string output
>     > 	    buffer-file-coding-system))
>
>     And this decode-coding-string treats the internal byte sequence
>     of a multibyte string OUTPUT as utf-8, thus you get some
>     garbage.
>
> Is it reasonable to operate with decode-coding-string on a multibyte
> string?  If that is nonsense, maybe we should make it get an error,
> to help people debug such problems.

In my case, this might have helped.  I would not have been able to
make heads or tails of the error at first, but without the error I
would have looked elsewhere for the problem.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum


* Re: decode-coding-string gone awry?
  2005-02-15  6:15   ` Richard Stallman
  2005-02-15  9:31     ` David Kastrup
@ 2005-02-15 16:17     ` Stefan Monnier
  2005-02-17 10:35       ` Richard Stallman
  2005-02-17 12:08       ` Kenichi Handa
  1 sibling, 2 replies; 32+ messages in thread
From: Stefan Monnier @ 2005-02-15 16:17 UTC (permalink / raw)
  Cc: emacs-devel, Kenichi Handa

> Is it reasonable to operate with decode-coding-string on a multibyte
> string?  If that is nonsense, maybe we should make it get an error,
> to help people debug such problems.

I think it would indeed make sense to signal errors when decoding
a multibyte string or when encoding a unibyte string.

> If there are some few cases where decode-coding-string makes sense on
> a multibyte string, maybe we can make it get an error except in those
> few cases.

The problem I suspect is that it's pretty common for ASCII-only strings to
be arbitrarily marked unibyte or multibyte depending on the circumstance.
So we would have to check for the case where the string is ASCII-only before
signalling an error.

I'm actually running right now with an Emacs that does signal such errors.
I've changed the notion of "multibyte/unibyte" string by saying:
- [same as now] if size_byte < 0, it's UNIBYTE.
- [same as now] if size_byte > size, it's MULTIBYTE.
- [changed]     if size_byte == size, it's neither/both (ASCII-only).

Then I've changed several parts of the C code to try and set size_byte==size
whenever possible (instead of marking the string as unibyte).


        Stefan


PS: As of now, the only place where Emacs has signalled a bad
    encoding/decoding with the proposed error is in Gnus, though I haven't
    checked any further whether this error really is a bug in Gnus.


* Re: decode-coding-string gone awry?
  2005-02-14 17:24       ` David Kastrup
  2005-02-14 18:12         ` Stefan Monnier
@ 2005-02-15 17:28         ` Richard Stallman
  2005-02-15 21:42           ` David Kastrup
  1 sibling, 1 reply; 32+ messages in thread
From: Richard Stallman @ 2005-02-15 17:28 UTC (permalink / raw)
  Cc: monnier, emacs-devel

    Yuk.  The problem is that this buffer is not only processed by
    preview-latex, but also by AUCTeX, and the versions that get combined
    may be different.  AUCTeX uses the source code buffer's file encoding
    by default, which is fine for basically unibyte based coding systems.

It sounds like the safest thing is to convert the string to what you
want, just before you use it.

    If decode-coding-string is supposed to have a chance of reassembling
    this junk, it must only be run at the end of reconstructing the byte
    stream.  Yes, this is completely insane.  No, I can't avoid having to
    deal with it somehow.

If you reconstruct the correct byte stream, it should work to apply
decode-coding-string to it.


* Re: decode-coding-string gone awry?
  2005-02-15 17:28         ` Richard Stallman
@ 2005-02-15 21:42           ` David Kastrup
  0 siblings, 0 replies; 32+ messages in thread
From: David Kastrup @ 2005-02-15 21:42 UTC (permalink / raw)
  Cc: monnier, emacs-devel

Richard Stallman <rms@gnu.org> writes:

>     Yuk.  The problem is that this buffer is not only processed by
>     preview-latex, but also by AUCTeX, and the versions that get combined
>     may be different.  AUCTeX uses the source code buffer's file encoding
>     by default, which is fine for basically unibyte based coding systems.
>
> It sounds like the safest thing is to convert the string to what you
> want, just before you use it.
>
>     If decode-coding-string is supposed to have a chance of reassembling
>     this junk, it must only be run at the end of reconstructing the byte
>     stream.  Yes, this is completely insane.  No, I can't avoid having to
>     deal with it somehow.
>
> If you reconstruct the correct byte stream, it should work to apply
> decode-coding-string to it.

Yes, it now works.  The process has a process-encoding of raw-text and
inserts into the (multibyte) error message buffer.  The buffer contents
(which are error messages with error contexts) are first matched
against the source buffer from the compilation directly.  If this
fails, the error message buffer contents are taken,
(encode-coding-string 'raw-text)ed again, the ^^xx hexadecimal bytes
are converted to their equivalent bytes, and then the stuff gets
(decode-coding-string 'buffer-coding-system)ed with the encoding of
the source buffer with which those error messages are compared.  At
this time the error contexts should really match (or I'll start
weeping).
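In (simplified) code, the reconstruction step looks roughly like this
(the function name is made up, and `unibyte-string' is assumed to be
available):

(defun my-reconstruct-and-decode (string coding)
  "Recover bytes from raw-text STRING with ^^xx escapes, then decode as CODING."
  (let ((bytes (encode-coding-string string 'raw-text-unix)))
    (while (string-match "\\^\\^\\([0-9a-f][0-9a-f]\\)" bytes)
      (setq bytes (replace-match
                   (unibyte-string (string-to-number (match-string 1 bytes) 16))
                   t t bytes)))
    (decode-coding-string bytes coding)))

;; (my-reconstruct-and-decode "erh^^c3^^b6ht" 'utf-8) => "erhöht"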

It currently appears to work with all sane and insane combinations of
TeX quoting schemes, system language environments, and Emacs language
settings.

Thanks for all the help here.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum


* Re: decode-coding-string gone awry?
  2005-02-15 16:17     ` Stefan Monnier
@ 2005-02-17 10:35       ` Richard Stallman
  2005-02-17 12:08       ` Kenichi Handa
  1 sibling, 0 replies; 32+ messages in thread
From: Richard Stallman @ 2005-02-17 10:35 UTC (permalink / raw)
  Cc: emacs-devel, handa

    I'm actually running right now with an Emacs that does signal such errors.
    I've changed the notion of "multibyte/unibyte" string by saying:
    - [same as now] if size_byte < 0, it's UNIBYTE.
    - [same as now] if size_byte > size, it's MULTIBYTE.
    - [changed]     if size_byte == size, it's neither/both (ASCII-only).

That is a far-reaching change.  It would have to be thought about
theoretically, not just tried.


* Re: decode-coding-string gone awry?
  2005-02-15 16:17     ` Stefan Monnier
  2005-02-17 10:35       ` Richard Stallman
@ 2005-02-17 12:08       ` Kenichi Handa
  2005-02-17 13:20         ` Stefan Monnier
  2005-02-18 14:12         ` Richard Stallman
  1 sibling, 2 replies; 32+ messages in thread
From: Kenichi Handa @ 2005-02-17 12:08 UTC (permalink / raw)
  Cc: rms, emacs-devel

In article <jwvacq56cyt.fsf-monnier+emacs@gnu.org>, Stefan Monnier <monnier@iro.umontreal.ca> writes:

>>  Is it reasonable to operate with decode-coding-string on a multibyte
>>  string?  If that is nonsense, maybe we should make it get an error,
>>  to help people debug such problems.

> I think it would indeed make sense to signal errors when decoding
> a multibyte string or when encoding a unibyte string.

>>  If there are some few cases where decode-coding-string makes sense on
>>  a multibyte string, maybe we can make it get an error except in those
>>  few cases.

> The problem I suspect is that it's pretty common for ASCII-only strings to
> be arbitrarily marked unibyte or multibyte depending on the circumstance.
> So we would have to check for the case where the string is ASCII-only before
> signalling an error.

> I'm actually running right now with an Emacs that does signal such errors.
> I've changed the notion of "multibyte/unibyte" string by saying:
> - [same as now] if size_byte < 0, it's UNIBYTE.
> - [same as now] if size_byte > size, it's MULTIBYTE.
> - [changed]     if size_byte == size, it's neither/both (ASCII-only).

> Then I've changed several parts of the C code to try and set size_byte==size
> whenever possible (instead of marking the string as unibyte).

Even if size_byte == size, it may contain eight-bit-graphic
characters, and decoding such a string is a valid operation.
And even if size_byte > size, it may contain only ASCII,
eight-bit-graphic, and eight-bit-control characters.  It's
also a valid operation to decode it.

It's not trivial work to change the current code (in
coding.c) to signal an error safely while doing a code
conversion.  So, to check if decoding is valid or not, we
have to check all characters in a string in advance, which,
I think, slows down the operation considerably.

---
Ken'ichi HANDA
handa@m17n.org


* Re: decode-coding-string gone awry?
  2005-02-17 12:08       ` Kenichi Handa
@ 2005-02-17 13:20         ` Stefan Monnier
  2005-02-18  8:30           ` Kenichi Handa
  2005-02-18 14:12           ` Richard Stallman
  2005-02-18 14:12         ` Richard Stallman
  1 sibling, 2 replies; 32+ messages in thread
From: Stefan Monnier @ 2005-02-17 13:20 UTC (permalink / raw)
  Cc: rms, emacs-devel

> Even if size_byte == size, it may contain eight-bit-graphic
> characters, and decoding such a string is a valid operation.
> And even if size_byte > size, it may contain only ASCII,
> eight-bit-graphic, and eight-bit-control charactes.  It's
> also a valid operation to decode it.

I think it should not be considered valid to decode a multibyte string,
whether the string happens to contain only ASCII (or ASCII+eight-bit-*)
or not.

> It's not trivial work to change the current code (in coding.c) to signal
> an error safely while doing a code conversion.

If by "safely" you mean "which will not break currently working code",
I agree.  If by "safely" you mean "which will not break properly written
code", I disagree.


        Stefan


* Re: decode-coding-string gone awry?
  2005-02-17 13:20         ` Stefan Monnier
@ 2005-02-18  8:30           ` Kenichi Handa
  2005-02-18 12:56             ` Stefan Monnier
  2005-02-19  9:44             ` Richard Stallman
  2005-02-18 14:12           ` Richard Stallman
  1 sibling, 2 replies; 32+ messages in thread
From: Kenichi Handa @ 2005-02-18  8:30 UTC (permalink / raw)
  Cc: rms, emacs-devel

In article <878y5n9vh9.fsf-monnier+emacs@gnu.org>, Stefan Monnier <monnier@iro.umontreal.ca> writes:

>>  Even if size_byte == size, it may contain eight-bit-graphic
>>  characters, and decoding such a string is a valid operation.
>>  And even if size_byte > size, it may contain only ASCII,
>>  eight-bit-graphic, and eight-bit-control charactes.  It's
>>  also a valid operation to decode it.

> I think it should not be considered valid to decode a multibyte string,
> whether the string happens to contain only ASCII (or ASCII+eight-bit-*)
> or not.

But, we allow decode-coding-region in a multibyte buffer.
Then, it's strange not to allow something like this:
  (decode-coding-string (buffer-substring FROM TO) CODING)

>>  It's not trivial work to change the current code (in coding.c) to signal
>>  an error safely while doing a code conversion.

> If by "safely" you mean "which will not break currently working code",
> I agree.  If by "safely" you mean "which will not break properly written
> code", I disagree.

I mean by "safely" to signal an error only at a safe place,
i.e., the place where we can do a global exit.  For
instance, we can't signal an error in decode_coding_iso2022
because it may be modifying buffer contents directly.

By the way, what do you mean by "properly written code"?

---
Ken'ichi HANDA
handa@m17n.org


* Re: decode-coding-string gone awry?
  2005-02-18  8:30           ` Kenichi Handa
@ 2005-02-18 12:56             ` Stefan Monnier
  2005-02-19  9:44             ` Richard Stallman
  1 sibling, 0 replies; 32+ messages in thread
From: Stefan Monnier @ 2005-02-18 12:56 UTC (permalink / raw)
  Cc: rms, emacs-devel

>> I think it should not be considered valid to decode a multibyte string,
>> whether the string happens to contain only ASCII (or ASCII+eight-bit-*)
>> or not.

> But, we allow decode-coding-region in a multibyte buffer.
> Then, it's strange not to allow something like this:
>   (decode-coding-string (buffer-substring FROM TO) CODING)

Maybe it's strange, but it would catch some bugs without restricting what
the user can do (since she can always pass the multibyte string through
some encode-coding-string or string-*-unibyte before).

>>> It's not trivial work to change the current code (in coding.c) to signal
>>> an error safely while doing a code conversion.

>> If by "safely" you mean "which will not break currently working code",
>> I agree.  If by "safely" you mean "which will not break properly written
>> code", I disagree.

> I mean by "safely" to signal an error only at a safe place,
> i.e., the place where we can do a global exit.  For
> instance, we can't signal an error in decode_coding_iso2022
> because it may be modifying buffer contents directly.

Oh, sorry, I misunderstood.  In my code, I signal the error at the very
beginning (in code_convert_string1), which I believe is safe.

> By the way, what do you mean by "properly written code"?

I mean code which is written carefully with a good understanding of the
notion of encoding and decoding of coding-systems.  This basically boils
down to clearly distinguishing byte-sequences (aka not yet decoded strings),
typically stored in unibyte strings and buffers, and char-sequences (aka
already decoded strings), typically stored in multibyte strings and buffers.

Admittedly, in buffers the situation is less clear cut than in strings since
the (en|de)coding operations on buffers don't always operate on the whole
buffer at a time (contrary to string (en|de)coding), so we need to allow
decoding byte-sequences in multibyte buffers.


        Stefan


* Re: decode-coding-string gone awry?
  2005-02-17 12:08       ` Kenichi Handa
  2005-02-17 13:20         ` Stefan Monnier
@ 2005-02-18 14:12         ` Richard Stallman
  1 sibling, 0 replies; 32+ messages in thread
From: Richard Stallman @ 2005-02-18 14:12 UTC (permalink / raw)
  Cc: monnier, emacs-devel

    It's not trivial work to change the current code (in
    coding.c) to signal an error safely while doing a code
    conversion.  So, to check if decoding is valid or not, we
    have to check all characters in a string in advance, which,
    I think, slows down the operation considerably.

Does the speed of decoding for strings really matter?
Maybe not.  We could try checking the characters in advance,
but only for strings.

We could also add an arg to decode-coding-string saying "don't check",
which people could use in cases where the speed really matters.


* Re: decode-coding-string gone awry?
  2005-02-17 13:20         ` Stefan Monnier
  2005-02-18  8:30           ` Kenichi Handa
@ 2005-02-18 14:12           ` Richard Stallman
  2005-02-19 20:55             ` Richard Stallman
  1 sibling, 1 reply; 32+ messages in thread
From: Richard Stallman @ 2005-02-18 14:12 UTC (permalink / raw)
  Cc: emacs-devel, handa

    I think it should not be considered valid to decode a multibyte string,
    whether the string happens to contain only ASCII (or ASCII+eight-bit-*)
    or not.

But what would it mean, in the other cases?


* Re: decode-coding-string gone awry?
  2005-02-18  8:30           ` Kenichi Handa
  2005-02-18 12:56             ` Stefan Monnier
@ 2005-02-19  9:44             ` Richard Stallman
  1 sibling, 0 replies; 32+ messages in thread
From: Richard Stallman @ 2005-02-19  9:44 UTC (permalink / raw)
  Cc: monnier, emacs-devel

    But, we allow decode-coding-region in a multibyte buffer.
    Then, it's strange not to allow something like this:
      (decode-coding-string (buffer-substring FROM TO) CODING)

Catching errors may be worth the strangeness.


* Re: decode-coding-string gone awry?
  2005-02-18 14:12           ` Richard Stallman
@ 2005-02-19 20:55             ` Richard Stallman
  2005-02-21  1:19               ` Kenichi Handa
  0 siblings, 1 reply; 32+ messages in thread
From: Richard Stallman @ 2005-02-19 20:55 UTC (permalink / raw)
  Cc: handa, monnier, emacs-devel

	I think it should not be considered valid to decode a multibyte string,
	whether the string happens to contain only ASCII (or ASCII+eight-bit-*)
	or not.

    But what would it mean, in the other cases?

I see I misread the message the first time--I didn't see the "not".
Now that I see it, I think maybe I agree.

If you have a multibyte string that makes sense to decode, and you
want to decode it, you could call string-as-unibyte first.  That would
be a way of overriding the error-check.  It would not be hard to do,
and it would prevent people from falling into problems that are
mysterious because they don't know that the program decodes multibyte
strings.


* Re: decode-coding-string gone awry?
  2005-02-19 20:55             ` Richard Stallman
@ 2005-02-21  1:19               ` Kenichi Handa
  2005-02-22  8:41                 ` Richard Stallman
  0 siblings, 1 reply; 32+ messages in thread
From: Kenichi Handa @ 2005-02-21  1:19 UTC (permalink / raw)
  Cc: monnier, emacs-devel

In article <E1D2bdh-0007tb-VD@fencepost.gnu.org>, Richard Stallman <rms@gnu.org> writes:

> 	I think it should not be considered valid to decode a multibyte string,
> 	whether the string happens to contain only ASCII (or ASCII+eight-bit-*)
> 	or not.

>     But what would it mean, in the other cases?

> I see I misread the message the first time--I didn't see the "not".
> Now that I see it, I think maybe I agree.

> If you have a multibyte string that makes sense to decode, and you
> want to decode it, you could call string-as-unibyte first.  That would
> be a way of overriding the error-check.  It would not be hard to do,
> and it would prevent people from falling into problems that are
> mysterious because they don't know that the program decodes multibyte
> strings.

The source of the current problem is not that the code was
going to decode a multibyte string, but that the code generated
an unexpected multibyte string (because of the mysterious
unibyte->multibyte automatic conversion).

As it has been a valid operation to decode an ASCII- and
eight-bit-*-only multibyte string, I believe signalling an
error on it would cause lots of problems.  On the other hand,
signalling an error only if the string contains a non-ASCII,
non-eight-bit-* character would be good.

As you wrote, the slowdown by checking it in advance will be
acceptable in the case of using decode-coding-string.
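A sketch of such a pre-check (hypothetical helper, written against the
charset names of that era):

(defun my-decodable-string-p (string)
  "Non-nil if STRING may safely be passed to `decode-coding-string'.
Unibyte strings always qualify; multibyte ones only when every
character is ASCII or one of the eight-bit-* charsets."
  (or (not (multibyte-string-p string))
      (let ((ok t))
        (dotimes (i (length string))
          (unless (memq (char-charset (aref string i))
                        '(ascii eight-bit-control eight-bit-graphic))
            (setq ok nil)))
        ok)))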

---
Ken'ichi HANDA
handa@m17n.org


* Re: decode-coding-string gone awry?
  2005-02-21  1:19               ` Kenichi Handa
@ 2005-02-22  8:41                 ` Richard Stallman
  0 siblings, 0 replies; 32+ messages in thread
From: Richard Stallman @ 2005-02-22  8:41 UTC (permalink / raw)
  Cc: monnier, emacs-devel

    As it has been a valid operation to decode an ASCII- and
    eight-bit-*-only multibyte string, I believe signalling an
    error on it would cause lots of problems.  On the other hand,
    signalling an error only if the string contains a non-ASCII,
    non-eight-bit-* character would be good.

It would be good to try one or the other.

