* decode-coding-string gone awry?
@ 2005-02-13  3:50 David Kastrup
  2005-02-14  1:50 ` Kenichi Handa
  2005-02-14 13:37 ` Stefan Monnier
  0 siblings, 2 replies; 32+ messages in thread
From: David Kastrup @ 2005-02-13  3:50 UTC (permalink / raw)

Hi,

I have the problem that within preview-latex there is a function that
assembles UTF-8 strings from single characters.  This function, when
used manually, mostly works.  It is called within a process sentinel
and fails rather consistently there with a current CVS Emacs.  I
include the code here since I don't know what might be involved here:
regexp-quote, substring, char-to-string etc.  The starting string is
taken from a buffer containing only ASCII (inserted by a process with
coding-system 'raw-text).  Output looks like shown below.

(defun preview-error-quote (string)
  "Turn STRING with potential ^^ sequences into a regexp.
To preserve sanity, additional ^ prefixes are matched literally,
so the character represented by ^^^ preceding extended characters
will not get matched, usually."
  (let (output case-fold-search)
    (while (string-match "\\^\\{2,\\}\\(\\([@-_?]\\)\\|[8-9a-f][0-9a-f]\\)"
                         string)
      (setq output
            (concat output
                    (regexp-quote (substring string
                                             0
                                             (- (match-beginning 1) 2)))
                    (if (match-beginning 2)
                        (concat
                         "\\(?:" (regexp-quote
                                  (substring string
                                             (- (match-beginning 1) 2)
                                             (match-end 0)))
                         "\\|"
                         (char-to-string
                          (logxor (aref string (match-beginning 2)) 64))
                         "\\)")
                      (char-to-string
                       (string-to-number (match-string 1 string) 16))))
            string (substring string (match-end 0))))
    (setq output (concat output (regexp-quote string)))
    (if (featurep 'mule)
        (prog2
            (message "%S %S " output buffer-file-coding-system)
            (setq output (decode-coding-string output buffer-file-coding-system))
          (message "%S\n" output))
      output)))

The prog2 is just for the sake of debugging.
What we get here is something akin to

"r Weise \\$f\\$ um~\\$1\\$ erhöht und \\$e\\$" mule-utf-8-unix
#("r Weise \\$f\\$ um~\\$1\\$ erh\xc2\x81Á\xc2\xb6ht und \\$e\\$"
  0 26 nil
  26 28 (display "\\201" help-echo utf-8-help-echo untranslated-utf-8 129)
  28 29 nil
  29 31 (display "\\266" help-echo utf-8-help-echo untranslated-utf-8 182)
  31 43 nil)

when this is called in a mule-utf-8-unix buffer with

(preview-error-quote "r Weise $f$ um~$1$ erh^^c3^^b6ht und $e$")

Namely, the decoding from utf-8 does not work.  The original strings
are multibyte before the conversion and look reasonable, with the
bytes produced by char-to-string.

Unfortunately, when I call this stuff by hand instead of from the
process sentinel, it mostly works, so it would appear to be dependent
on some uninitialized stuff or similar that is different in the
process sentinel.

Anybody have a clue what might go wrong here?

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum

^ permalink raw reply	[flat|nested] 32+ messages in thread
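[As the replies below establish, the behaviour hinges on whether STRING
is unibyte or multibyte when it reaches the function.  A minimal
Emacs Lisp sketch (assuming a mule-enabled Emacs of this era) of how to
check which case applies:]

```elisp
;; An ASCII-only literal read by the Lisp reader is unibyte:
(multibyte-string-p "r Weise $f$ um~$1$ erh^^c3^^b6ht und $e$")  ; => nil

;; A string taken from a multibyte buffer, or forced multibyte, is not:
(multibyte-string-p
 (string-to-multibyte "r Weise $f$ um~$1$ erh^^c3^^b6ht und $e$"))  ; => t
```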
* Re: decode-coding-string gone awry?
  2005-02-13  3:50 decode-coding-string gone awry? David Kastrup
@ 2005-02-14  1:50 ` Kenichi Handa
  2005-02-14  2:28   ` David Kastrup
  2005-02-15  6:15   ` Richard Stallman
  2005-02-14 13:37 ` Stefan Monnier
  1 sibling, 2 replies; 32+ messages in thread
From: Kenichi Handa @ 2005-02-14  1:50 UTC (permalink / raw)
  Cc: emacs-devel

In article <x5d5v52k4m.fsf@lola.goethe.zz>, David Kastrup <dak@gnu.org> writes:

> I have the problem that within preview-latex there is a function that
> assembles UTF-8 strings from single characters.  This function, when
> used manually, mostly works.  It is called within a process sentinel
> and fails rather consistently there with a current CVS Emacs.  I
> include the code here since I don't know what might be involved here:
> regexp-quote, substring, char-to-string etc.  The starting string is
> taken from a buffer containing only ASCII (inserted by a process with
> coding-system 'raw-text).

It seems that you are caught in a trap of automatic unibyte->multibyte
conversion.

> (defun preview-error-quote (string)
>   "Turn STRING with potential ^^ sequences into a regexp.
> To preserve sanity, additional ^ prefixes are matched literally,
> so the character represented by ^^^ preceding extended characters
> will not get matched, usually."
>   (let (output case-fold-search)
>     (while (string-match "\\^\\{2,\\}\\(\\([@-_?]\\)\\|[8-9a-f][0-9a-f]\\)"
>                          string)
>       (setq output
>             (concat output
>                     (regexp-quote (substring string
>                                              0
>                                              (- (match-beginning 1) 2)))

If STRING is taken from a multibyte buffer, it is a multibyte string.
Thus, the above substring also returns a multibyte string.

>                     (if (match-beginning 2)
>                         (concat
>                          "\\(?:" (regexp-quote
>                                   (substring string
>                                              (- (match-beginning 1) 2)
>                                              (match-end 0)))
>                          "\\|"
>                          (char-to-string
>                           (logxor (aref string (match-beginning 2)) 64))
>                          "\\)")
>                       (char-to-string
>                        (string-to-number (match-string 1 string) 16))))

But, this char-to-string produces a unibyte string.
So, on concatenating them, this unibyte string is automatically
converted to multibyte by the string-make-multibyte function, which
usually produces a multibyte string containing latin-1 chars.

>             string (substring string (match-end 0))))
>     (setq output (concat output (regexp-quote string)))
>     (if (featurep 'mule)
>         (prog2
>             (message "%S %S " output buffer-file-coding-system)
>             (setq output (decode-coding-string output buffer-file-coding-system))

And this decode-coding-string treats the internal byte sequence of a
multibyte string OUTPUT as utf-8, thus you get some garbage.

> Unfortunately, when I call this stuff by hand instead from the
> process-sentinel, it mostly works

That is because the string you give to preview-error-quote is a
unibyte string in that case.  The Lisp reader generates a unibyte
string when it sees an ASCII-only string.

Ex: (multibyte-string-p "abc") => nil

This will also return an incorrect string:

(preview-error-quote
 (string-to-multibyte "r Weise $f$ um~$1$ erh^^c3^^b6ht und $e$"))

So, the easiest fix will be to do:
  (setq string (string-as-unibyte string))
in the head of preview-error-quote.

---
Ken'ichi HANDA
handa@m17n.org

^ permalink raw reply	[flat|nested] 32+ messages in thread
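[A minimal sketch of the coercion described above, assuming a
mule-enabled Emacs of this era, where concat of a multibyte and a
unibyte string promotes the unibyte part as if it were latin-1:]

```elisp
(let ((mb (string-to-multibyte "erh"))  ; multibyte, ASCII-only
      (ub (char-to-string #xc3)))       ; unibyte: one raw byte
  ;; The result is multibyte; the raw byte #xc3 has been promoted to a
  ;; latin-1 character, so a later decode-coding-string sees Emacs's
  ;; internal representation instead of the original byte.
  (multibyte-string-p (concat mb ub)))  ; => t
```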
* Re: decode-coding-string gone awry?
  2005-02-14  1:50 ` Kenichi Handa
@ 2005-02-14  2:28   ` David Kastrup
  2005-02-15  6:15   ` Richard Stallman
  1 sibling, 0 replies; 32+ messages in thread
From: David Kastrup @ 2005-02-14  2:28 UTC (permalink / raw)
  Cc: emacs-devel

Kenichi Handa <handa@m17n.org> writes:

> In article <x5d5v52k4m.fsf@lola.goethe.zz>, David Kastrup <dak@gnu.org> writes:
>> I have the problem that within preview-latex there is a function
>> that assembles UTF-8 strings from single characters.  This
>> function, when used manually, mostly works.
>
> It seems that you are caught in a trap of automatic
> unibyte->multibyte conversion.
>
>> (defun preview-error-quote (string)
>>   "Turn STRING with potential ^^ sequences into a regexp.
>> To preserve sanity, additional ^ prefixes are matched literally,
>> so the character represented by ^^^ preceding extended characters
>> will not get matched, usually."
>>   (let (output case-fold-search)
>>     (while (string-match "\\^\\{2,\\}\\(\\([@-_?]\\)\\|[8-9a-f][0-9a-f]\\)"
>>                          string)
>>       (setq output
>>             (concat output
>>                     (regexp-quote (substring string
>>                                              0
>>                                              (- (match-beginning 1) 2)))
>
> If STRING is taken from a multibyte buffer, it is a
> multibyte string.  Thus, the above substring also returns a
> multibyte string.
>
>>                     (char-to-string
>>                      (string-to-number (match-string 1 string) 16))))
>
> But, this char-to-string produces a unibyte string.  So, on
> concatenating them, this unibyte string is automatically converted
> to multibyte by the string-make-multibyte function which usually
> produces a multibyte string containing latin-1 chars.

Oh.  Latin-1 chars.  Can't I tell char-to-string to produce the same
sort of raw-marked chars that raw-text (as process-coding system)
appears to produce?

>> (setq output (decode-coding-string output buffer-file-coding-system))
>
> And this decode-coding-string treats the internal byte
> sequence of a multibyte string OUTPUT as utf-8, thus you get
> some garbage.
>> Unfortunately, when I call this stuff by hand instead from the
>> process-sentinel, it mostly works
>
> That is because the string you give to preview-error-quote
> is a unibyte string in that case.  The Lisp reader generates
> a unibyte string when it sees an ASCII-only string.
>
> Ex: (multibyte-string-p "abc") => nil
>
> This will also return an incorrect string:
>
> (preview-error-quote
>  (string-to-multibyte "r Weise $f$ um~$1$ erh^^c3^^b6ht und $e$"))
>
> So, the easiest fix will be to do:
>   (setq string (string-as-unibyte string))
> in the head of preview-error-quote.

Sigh.  XEmacs-21.4-mule does not seem to have string-as-unibyte.  I'll
have to see whether it happens to work without it on XEmacs.  If not,
I'll have to come up with something else.

Thanks for the analysis!

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum

^ permalink raw reply	[flat|nested] 32+ messages in thread
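[One conventional workaround for the missing function is an fboundp
guard; a sketch, with an illustrative wrapper name:]

```elisp
(defun preview-string-as-unibyte (string)
  "Return STRING as a unibyte string where possible.
On Emacsen lacking `string-as-unibyte' (e.g. XEmacs 21.4-mule),
return STRING unchanged."
  (if (fboundp 'string-as-unibyte)
      (string-as-unibyte string)
    string))
```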
* Re: decode-coding-string gone awry? 2005-02-14 1:50 ` Kenichi Handa 2005-02-14 2:28 ` David Kastrup @ 2005-02-15 6:15 ` Richard Stallman 2005-02-15 9:31 ` David Kastrup 2005-02-15 16:17 ` Stefan Monnier 1 sibling, 2 replies; 32+ messages in thread From: Richard Stallman @ 2005-02-15 6:15 UTC (permalink / raw) Cc: emacs-devel > (setq output (decode-coding-string output buffer-file-coding-system)) And this decode-coding-string treats the internal byte sequence of a multibyte string OUTPUT as utf-8, thus you get some garbage. Is it reasonable to operate with decode-coding-string on a multibyte string? If that is nonsense, maybe we should make it get an error, to help people debug such problems. If there are some few cases where decode-coding-string makes sense on a multibyte string, maybe we can make it get an error except in those few cases. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: decode-coding-string gone awry? 2005-02-15 6:15 ` Richard Stallman @ 2005-02-15 9:31 ` David Kastrup 2005-02-15 16:17 ` Stefan Monnier 1 sibling, 0 replies; 32+ messages in thread From: David Kastrup @ 2005-02-15 9:31 UTC (permalink / raw) Cc: emacs-devel, Kenichi Handa Richard Stallman <rms@gnu.org> writes: > > (setq output (decode-coding-string output > > buffer-file-coding-system)) > > And this decode-coding-string treats the internal byte sequence > of a multibyte string OUTPUT as utf-8, thus you get some > garbage. > > Is it reasonable to operate with decode-coding-string on a multibyte > string? If that is nonsense, maybe we should make it get an error, > to help people debug such problems. In my case, this might have helped. I would not have been able to make head or tails of the error at first, but without the error I would have looked elsewhere for the problem. -- David Kastrup, Kriemhildstr. 15, 44793 Bochum ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: decode-coding-string gone awry? 2005-02-15 6:15 ` Richard Stallman 2005-02-15 9:31 ` David Kastrup @ 2005-02-15 16:17 ` Stefan Monnier 2005-02-17 10:35 ` Richard Stallman 2005-02-17 12:08 ` Kenichi Handa 1 sibling, 2 replies; 32+ messages in thread From: Stefan Monnier @ 2005-02-15 16:17 UTC (permalink / raw) Cc: emacs-devel, Kenichi Handa > Is it reasonable to operate with decode-coding-string on a multibyte > string? If that is nonsense, maybe we should make it get an error, > to help people debug such problems. I think it would indeed make sense to signal errors when decoding a multibyte string or when encoding a unibyte string. > If there are some few cases where decode-coding-string makes sense on > a multibyte string, maybe we can make it get an error except in those > few cases. The problem I suspect is that it's pretty common for ASCII-only strings to be arbitrarily marked unibyte or multibyte depending on the circumstance. So we would have to check for the case where the string is ASCII-only before signalling an error. I'm actually running right now with an Emacs that does signal such errors. I've changed the notion of "multibyte/unibyte" string by saying: - [same as now] if size_byte < 0, it's UNIBYTE. - [same as now] if size_byte > size, it's MULTIBYTE. - [changed] if size_byte == size, it's neither/both (ASCII-only). Then I've changed several parts of the C code to try and set size_byte==size whenever possible (instead of marking the string as unibyte). Stefan PS: As of now, the only place where Emacs has signalled a bad encoding/decoding with the proposed error is in Gnus, though I haven't checked any further whether this error really is a bug in Gnus. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: decode-coding-string gone awry? 2005-02-15 16:17 ` Stefan Monnier @ 2005-02-17 10:35 ` Richard Stallman 2005-02-17 12:08 ` Kenichi Handa 1 sibling, 0 replies; 32+ messages in thread From: Richard Stallman @ 2005-02-17 10:35 UTC (permalink / raw) Cc: emacs-devel, handa I'm actually running right now with an Emacs that does signal such errors. I've changed the notion of "multibyte/unibyte" string by saying: - [same as now] if size_byte < 0, it's UNIBYTE. - [same as now] if size_byte > size, it's MULTIBYTE. - [changed] if size_byte == size, it's neither/both (ASCII-only). That is a far-reaching change. It would have to be thought about theoretically, not just tried. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: decode-coding-string gone awry?
  2005-02-15 16:17 ` Stefan Monnier
  2005-02-17 10:35   ` Richard Stallman
@ 2005-02-17 12:08   ` Kenichi Handa
  2005-02-17 13:20     ` Stefan Monnier
  2005-02-18 14:12     ` Richard Stallman
  1 sibling, 2 replies; 32+ messages in thread
From: Kenichi Handa @ 2005-02-17 12:08 UTC (permalink / raw)
  Cc: rms, emacs-devel

In article <jwvacq56cyt.fsf-monnier+emacs@gnu.org>, Stefan Monnier <monnier@iro.umontreal.ca> writes:

>> Is it reasonable to operate with decode-coding-string on a multibyte
>> string?  If that is nonsense, maybe we should make it get an error,
>> to help people debug such problems.

> I think it would indeed make sense to signal errors when decoding
> a multibyte string or when encoding a unibyte string.

>> If there are some few cases where decode-coding-string makes sense on
>> a multibyte string, maybe we can make it get an error except in those
>> few cases.

> The problem I suspect is that it's pretty common for ASCII-only strings to
> be arbitrarily marked unibyte or multibyte depending on the circumstance.
> So we would have to check for the case where the string is ASCII-only before
> signalling an error.

> I'm actually running right now with an Emacs that does signal such errors.
> I've changed the notion of "multibyte/unibyte" string by saying:
> - [same as now] if size_byte < 0, it's UNIBYTE.
> - [same as now] if size_byte > size, it's MULTIBYTE.
> - [changed] if size_byte == size, it's neither/both (ASCII-only).
> Then I've changed several parts of the C code to try and set size_byte==size
> whenever possible (instead of marking the string as unibyte).

Even if size_byte == size, it may contain eight-bit-graphic
characters, and decoding such a string is a valid operation.  And even
if size_byte > size, it may contain only ASCII, eight-bit-graphic, and
eight-bit-control characters.  It's also a valid operation to decode
it.
It's not trivial to change the current code (in coding.c) to signal an
error safely while doing a code conversion.  So, to check whether
decoding is valid or not, we have to check all characters in a string
in advance, which, I think, slows down the operation considerably.

---
Ken'ichi HANDA
handa@m17n.org

^ permalink raw reply	[flat|nested] 32+ messages in thread
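[The advance check described above could be sketched like this;
the function name is illustrative, and it assumes a mule-enabled Emacs
of this era where char-charset classifies eight-bit-* characters:]

```elisp
(defun preview-safe-to-decode-p (string)
  "Non-nil if every char of STRING is ASCII or an eight-bit-* char.
A sketch of the pre-decoding validity check discussed above."
  (let ((i 0) (ok t))
    (while (and ok (< i (length string)))
      (let ((ch (aref string i)))
        (setq ok (or (< ch 128)          ; ASCII
                     (memq (char-charset ch)
                           '(eight-bit-control eight-bit-graphic)))))
      (setq i (1+ i)))
    ok))
```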
* Re: decode-coding-string gone awry?
  2005-02-17 12:08 ` Kenichi Handa
@ 2005-02-17 13:20   ` Stefan Monnier
  2005-02-18  8:30     ` Kenichi Handa
  2005-02-18 14:12     ` Richard Stallman
  2005-02-18 14:12   ` Richard Stallman
  1 sibling, 2 replies; 32+ messages in thread
From: Stefan Monnier @ 2005-02-17 13:20 UTC (permalink / raw)
  Cc: rms, emacs-devel

> Even if size_byte == size, it may contain eight-bit-graphic
> characters, and decoding such a string is a valid operation.
> And even if size_byte > size, it may contain only ASCII,
> eight-bit-graphic, and eight-bit-control characters.  It's
> also a valid operation to decode it.

I think it should not be considered valid to decode a multibyte
string, whether the string happens to contain only ASCII (or
ASCII+eight-bit-*) or not.

> It's not a trivial work to change the current code (in coding.c) to signal
> an error safely while doing a code conversion.

If by "safely" you mean "which will not break currently working code",
I agree.  If by "safely" you mean "which will not break properly
written code", I disagree.


        Stefan

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: decode-coding-string gone awry? 2005-02-17 13:20 ` Stefan Monnier @ 2005-02-18 8:30 ` Kenichi Handa 2005-02-18 12:56 ` Stefan Monnier 2005-02-19 9:44 ` Richard Stallman 2005-02-18 14:12 ` Richard Stallman 1 sibling, 2 replies; 32+ messages in thread From: Kenichi Handa @ 2005-02-18 8:30 UTC (permalink / raw) Cc: rms, emacs-devel In article <878y5n9vh9.fsf-monnier+emacs@gnu.org>, Stefan Monnier <monnier@iro.umontreal.ca> writes: >> Even if size_byte == size, it may contain eight-bit-graphic >> characters, and decoding such a string is a valid operation. >> And even if size_byte > size, it may contain only ASCII, >> eight-bit-graphic, and eight-bit-control charactes. It's >> also a valid operation to decode it. > I think it should not be considered valid to decode a multibyte string, > whether the string happens to only contains ASCII (or ASCII+eight-bit-*) > or not. But, we allow decode-coding-region in a multibyte buffer. Then, it's strange not to allow something like this: (decode-coding-string (buffer-substring FROM TO) CODING) >> It's not a trivial work to change the current code (in coding.c) to signal >> an error safely while doing a code conversion. > If by "safely" you mean "which will not break currently working code", > I agree. If by "safely" you mean "which will not break properly written > code", I disagree. I mean by "safely" to signal an error only at a safe place, i.e., the place where we can do a global exit. For instance, we can't signal an error in decode_coding_iso2022 because it may be modifying buffer contents directly. By the way, what do you mean by "properly written code"? --- Ken'ichi HANDA handa@m17n.org ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: decode-coding-string gone awry? 2005-02-18 8:30 ` Kenichi Handa @ 2005-02-18 12:56 ` Stefan Monnier 2005-02-19 9:44 ` Richard Stallman 1 sibling, 0 replies; 32+ messages in thread From: Stefan Monnier @ 2005-02-18 12:56 UTC (permalink / raw) Cc: rms, emacs-devel >> I think it should not be considered valid to decode a multibyte string, >> whether the string happens to only contains ASCII (or ASCII+eight-bit-*) >> or not. > But, we allow decode-coding-region in a multibyte buffer. > Then, it's strange not to allow something like this: > (decode-coding-string (buffer-substring FROM TO) CODING) Maybe it's strange, but it would catch some bugs without restricting what the user can do (since she can always pass the multibyte string through some encode-coding-string or string-*-unibyte before). >>> It's not a trivial work to change the current code (in coding.c) to signal >>> an error safely while doing a code conversion. >> If by "safely" you mean "which will not break currently working code", >> I agree. If by "safely" you mean "which will not break properly written >> code", I disagree. > I mean by "safely" to signal an error only at a safe place, > i.e., the place where we can do a global exit. For > instance, we can't signal an error in decode_coding_iso2022 > because it may be modifying buffer contents directly. Oh, sorry, I misunderstood. In my code, I signal the error at the very beginning (in code_convert_string1), which I believe is safe. > By the way, what do you mean by "properly written code"? I mean code which is written carefully with a good understanding of the notion of encoding and decoding of coding-systems. This basically boils down to clearly distinguishing byte-sequences (aka not yet decoded strings), typically stored in unibyte strings and buffers, and char-sequences (aka already decoded strings), typically stored in multibyte strings and buffers. 
Admittedly, in buffers the situation is less clear cut than in strings since the (en|de)coding operations on buffers don't always operate on the whole buffer at a time (contrary to string (en|de)coding), so we need to allow decoding byte-sequences in multibyte buffers. Stefan ^ permalink raw reply [flat|nested] 32+ messages in thread
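[The byte-sequence vs. char-sequence discipline described above can be
seen directly; a sketch, assuming a utf-8-capable Emacs:]

```elisp
(let* ((bytes (encode-coding-string "erhöht" 'utf-8)) ; byte-sequence
       (chars (decode-coding-string bytes 'utf-8)))   ; char-sequence
  (list (multibyte-string-p bytes)    ; nil: encoding yields unibyte
        (multibyte-string-p chars)))  ; t: decoding yields multibyte
```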
* Re: decode-coding-string gone awry? 2005-02-18 8:30 ` Kenichi Handa 2005-02-18 12:56 ` Stefan Monnier @ 2005-02-19 9:44 ` Richard Stallman 1 sibling, 0 replies; 32+ messages in thread From: Richard Stallman @ 2005-02-19 9:44 UTC (permalink / raw) Cc: monnier, emacs-devel But, we allow decode-coding-region in a multibyte buffer. Then, it's strange not to allow something like this: (decode-coding-string (buffer-substring FROM TO) CODING) Catching errors may be worth the strangeness. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: decode-coding-string gone awry? 2005-02-17 13:20 ` Stefan Monnier 2005-02-18 8:30 ` Kenichi Handa @ 2005-02-18 14:12 ` Richard Stallman 2005-02-19 20:55 ` Richard Stallman 1 sibling, 1 reply; 32+ messages in thread From: Richard Stallman @ 2005-02-18 14:12 UTC (permalink / raw) Cc: emacs-devel, handa I think it should not be considered valid to decode a multibyte string, whether the string happens to only contains ASCII (or ASCII+eight-bit-*) or not. But what would it mean, in the other cases? ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: decode-coding-string gone awry? 2005-02-18 14:12 ` Richard Stallman @ 2005-02-19 20:55 ` Richard Stallman 2005-02-21 1:19 ` Kenichi Handa 0 siblings, 1 reply; 32+ messages in thread From: Richard Stallman @ 2005-02-19 20:55 UTC (permalink / raw) Cc: handa, monnier, emacs-devel I think it should not be considered valid to decode a multibyte string, whether the string happens to only contains ASCII (or ASCII+eight-bit-*) or not. But what would it mean, in the other cases? I see I misread the message the first time--I didn't see the "not". Now that I see it, I think maybe I agree. If you have a multibyte string that makes sense to decode, and you want to decode it, you could call string-as-unibyte first. That would be a way of overriding the error-check. It would not be hard to do, and it would prevent people from falling into problems that are mysterious because they don't know that the program decodes multibyte strings. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: decode-coding-string gone awry? 2005-02-19 20:55 ` Richard Stallman @ 2005-02-21 1:19 ` Kenichi Handa 2005-02-22 8:41 ` Richard Stallman 0 siblings, 1 reply; 32+ messages in thread From: Kenichi Handa @ 2005-02-21 1:19 UTC (permalink / raw) Cc: monnier, emacs-devel In article <E1D2bdh-0007tb-VD@fencepost.gnu.org>, Richard Stallman <rms@gnu.org> writes: > I think it should not be considered valid to decode a multibyte string, > whether the string happens to only contains ASCII (or ASCII+eight-bit-*) > or not. > But what would it mean, in the other cases? > I see I misread the message the first time--I didn't see the "not". > Now that I see it, I think maybe I agree. > If you have a multibyte string that makes sense to decode, and you > want to decode it, you could call string-as-unibyte first. That would > be a way of overriding the error-check. It would not be hard to do, > and it would prevent people from falling into problems that are > mysterious because they don't know that the program decodes multibyte > strings. The source of the current problem is not that the code was going to decode a multibyte string, but the code generated an unexpected multibyte string (because of the mysterious unibyte->multibyte automatic conversion). As it has been a valid operation to decode an ascii and eight-bit-* only multibyte string, I believe signalling an error on it causes lots of problems. On the other hand, signalling an error only if the string contains a non-ASCII non-eight-bit-* character will be good. As you wrote, the slowdown by checking it in advance will be acceptable in the case of using decode-coding-string. --- Ken'ichi HANDA handa@m17n.org ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: decode-coding-string gone awry? 2005-02-21 1:19 ` Kenichi Handa @ 2005-02-22 8:41 ` Richard Stallman 0 siblings, 0 replies; 32+ messages in thread From: Richard Stallman @ 2005-02-22 8:41 UTC (permalink / raw) Cc: monnier, emacs-devel As it has been a valid operation to decode an ascii and eight-bit-* only multibyte string, I believe signalling an error on it causes lots of problems. On the other hand, signalling an error only if the string contains a non-ASCII non-eight-bit-* character will be good. It would be good to try one or the other. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: decode-coding-string gone awry? 2005-02-17 12:08 ` Kenichi Handa 2005-02-17 13:20 ` Stefan Monnier @ 2005-02-18 14:12 ` Richard Stallman 1 sibling, 0 replies; 32+ messages in thread From: Richard Stallman @ 2005-02-18 14:12 UTC (permalink / raw) Cc: monnier, emacs-devel It's not a trivial work to change the current code (in coding.c) to signal an error safely while doing a code conversion. So, to check if decoding is valid or not, we have to check all characters in a string in advance, which, I think, slows down the operation considerably. Does the speed of decoding for strings really matter? Maybe not. We could try checking the characters in advance, but only for strings. We could also add an arg to decode-coding-string saying "don't check", which people could use in cases where the speed really matters. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: decode-coding-string gone awry?
  2005-02-13  3:50 decode-coding-string gone awry? David Kastrup
  2005-02-14  1:50 ` Kenichi Handa
@ 2005-02-14 13:37 ` Stefan Monnier
  2005-02-14 13:50   ` David Kastrup
  1 sibling, 1 reply; 32+ messages in thread
From: Stefan Monnier @ 2005-02-14 13:37 UTC (permalink / raw)
  Cc: emacs-devel

> (while (string-match "\\^\\{2,\\}\\(\\([@-_?]\\)\\|[8-9a-f][0-9a-f]\\)"
>                      string)
>   (setq output
>         (concat output
>                 (regexp-quote (substring string
>                                          0
>                                          (- (match-beginning 1) 2)))
>                 (if (match-beginning 2)
>                     (concat
>                      "\\(?:" (regexp-quote
>                               (substring string
>                                          (- (match-beginning 1) 2)
>                                          (match-end 0)))
>                      "\\|"
>                      (char-to-string
>                       (logxor (aref string (match-beginning 2)) 64))
>                      "\\)")
>                   (char-to-string
>                    (string-to-number (match-string 1 string) 16))))
>         string (substring string (match-end 0))))
> (setq output (concat output (regexp-quote string)))
> (if (featurep 'mule)
>     (prog2
>         (message "%S %S " output buffer-file-coding-system)
>         (setq output (decode-coding-string output buffer-file-coding-system))
>       (message "%S\n" output))
>   output)))

The problem is that by passing `output' to decode-coding-string you
clearly consider `output' to be a sequence of bytes.  But to construct
`output' you use pieces of `string' so you have to make sure that
`string' is also a sequence of bytes.  Assuming `string' comes from
the TeX process, you can do that by making sure that that process's
output coding system is `binary' (or `raw-text' if you want
EOL-conversion).


        Stefan

^ permalink raw reply	[flat|nested] 32+ messages in thread
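[In sketch form, the advice above amounts to something like the
following; the process name, buffer name, and command line are
illustrative:]

```elisp
(let ((proc (start-process "tex" "*tex-output*" "latex" "file.tex")))
  ;; Deliver raw bytes to the filter/sentinel so that the strings it
  ;; receives are unibyte; decode later, explicitly.
  (set-process-coding-system proc 'raw-text 'raw-text))
```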
* Re: decode-coding-string gone awry? 2005-02-14 13:37 ` Stefan Monnier @ 2005-02-14 13:50 ` David Kastrup 2005-02-14 16:57 ` Stefan Monnier 0 siblings, 1 reply; 32+ messages in thread From: David Kastrup @ 2005-02-14 13:50 UTC (permalink / raw) Cc: emacs-devel Stefan Monnier <monnier@iro.umontreal.ca> writes: >> (while (string-match "\\^\\{2,\\}\\(\\([@-_?]\\)\\|[8-9a-f][0-9a-f]\\)" >> string) >> (setq output >> (concat output >> (regexp-quote (substring string >> 0 >> (- (match-beginning 1) 2))) >> (if (match-beginning 2) >> (concat >> "\\(?:" (regexp-quote >> (substring string >> (- (match-beginning 1) 2) >> (match-end 0))) >> "\\|" >> (char-to-string >> (logxor (aref string (match-beginning 2)) 64)) >> "\\)") >> (char-to-string >> (string-to-number (match-string 1 string) 16)))) >> string (substring string (match-end 0)))) >> (setq output (concat output (regexp-quote string))) >> (if (featurep 'mule) >> (prog2 >> (message "%S %S " output buffer-file-coding-system) >> (setq output (decode-coding-string output buffer-file-coding-system)) >> (message "%S\n" output)) >> output))) > > The problem is that by passing `output' to decode-coding-string you > clearly consider `output' to be a sequence of bytes. But to > construct `output' you use pieces of `string' so you have to make > sure that `string' is also a sequence of bytes. Assuming `string' > comes from the TeX process, you can do that by making sure that that > process's output coding system is `binary' (or `raw-text' if you > want EOL-conversion). I already mentioned that this _is_ exactly what we do already: the problem is that some TeX systems are set up to quote _some_ bytes from utf-8 in the ^^xx hexadecimal notation, and let some bytes through unchanged. It is completely braindead. The funny thing is that with the _mixed_ representation, the hard case, this code worked. But with the _complete_ ASCII transcription, it doesn't. 
I have to experiment a bit with things like string-as-multibyte and stuff to find out what combination will be right all of the time. -- David Kastrup, Kriemhildstr. 15, 44793 Bochum ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: decode-coding-string gone awry?
  2005-02-14 13:50 ` David Kastrup
@ 2005-02-14 16:57   ` Stefan Monnier
  2005-02-14 17:24     ` David Kastrup
  0 siblings, 1 reply; 32+ messages in thread
From: Stefan Monnier @ 2005-02-14 16:57 UTC (permalink / raw)
  Cc: emacs-devel

>> The problem is that by passing `output' to decode-coding-string you
>> clearly consider `output' to be a sequence of bytes.  But to
>> construct `output' you use pieces of `string' so you have to make
>> sure that `string' is also a sequence of bytes.  Assuming `string'
>> comes from the TeX process, you can do that by making sure that that
>> process's output coding system is `binary' (or `raw-text' if you
>> want EOL-conversion).

> I already mentioned that this _is_ exactly what we do already: the
> problem is that some TeX systems are set up to quote _some_ bytes from
> utf-8 in the ^^xx hexadecimal notation, and let some bytes through
> unchanged.

I'm not sure I understand.  What I meant above is not "make sure the
TeX process only outputs binary", but really set the
`process-coding-system' of the TeX process such that its output
coding-system is `raw-text' or `binary'.  This *should* (aka "modulo
bugs") ensure that the strings passed to the process filter are
unibyte.

If the string goes through a buffer instead of being processed
directly from the process filter, then you should also ensure that
this buffer is unibyte.


        Stefan

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: decode-coding-string gone awry? 2005-02-14 16:57 ` Stefan Monnier @ 2005-02-14 17:24 ` David Kastrup 2005-02-14 18:12 ` Stefan Monnier 2005-02-15 17:28 ` Richard Stallman 0 siblings, 2 replies; 32+ messages in thread From: David Kastrup @ 2005-02-14 17:24 UTC (permalink / raw) Cc: emacs-devel Stefan Monnier <monnier@iro.umontreal.ca> writes: >>> The problem is that by passing `output' to decode-coding-string you >>> clearly consider `output' to be a sequence of bytes. But to >>> construct `output' you use pieces of `string' so you have to make >>> sure that `string' is also a sequence of bytes. Assuming `string' >>> comes from the TeX process, you can do that by making sure that that >>> process's output coding system is `binary' (or `raw-text' if you >>> want EOL-conversion). > >> I already mentioned that this _is_ exactly what we do already: the >> problem is that some TeX systems are set up to quote _some_ bytes from >> utf-8 in the ^^xx hexadecimal notation, and let some bytes through >> unchanged. > > I'm not sure I understand. What I meant above is not "make sure the > TeX process only outputs binary", but really set the > `process-coding-system' of the TeX process such that its output > coding-system is `raw-text' or `binary'. This *should* (aka "module > bugs") encusre that the strings passed to the process filter are > unibyte. > > If the string goes through a buffer Yes. > instead of being processed directly from the process filter, then > you should also ensure that this buffer is unibyte. Yuk. The problem is that this buffer is not only processed by preview-latex, but also by AUCTeX, and the versions that get combined may be different. AUCTeX uses the source code buffer's file encoding by default, which is fine for basically unibyte based coding systems. If a buffer is unibyte, how will its characters get displayed? In particular, on a system that has all its language-environment set to accommodate utf-8? 
At what time does the decision whether a buffer is unibyte or multibyte get made? I guess that in the long run we will have to install something directly at filter level, with some CCL program processing the TeX output. But at the moment I am trying to stumble along in the context we have now. -- David Kastrup, Kriemhildstr. 15, 44793 Bochum ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: decode-coding-string gone awry? 2005-02-14 17:24 ` David Kastrup @ 2005-02-14 18:12 ` Stefan Monnier 2005-02-14 18:41 ` David Kastrup 2005-02-15 17:28 ` Richard Stallman 1 sibling, 1 reply; 32+ messages in thread From: Stefan Monnier @ 2005-02-14 18:12 UTC (permalink / raw) Cc: emacs-devel >> instead of being processed directly from the process filter, then >> you should also ensure that this buffer is unibyte. > Yuk. The problem is that this buffer is not only processed by > preview-latex, but also by AUCTeX, and the versions that get combined > may be different. AUCTeX uses the source code buffer's file encoding > by default, which is fine for basically unibyte based coding systems. If you can't change this part, then your best bet might be to do something like: (defun preview-error-quote (string) "Turn STRING with potential ^^ sequences into a regexp. To preserve sanity, additional ^ prefixes are matched literally, so the character represented by ^^^ preceding extended characters will not get matched, usually." (let (output case-fold-search) (while (string-match "\\^*\\(\\^\\^\\(\\([@-_?]\\)\\|[8-9a-f][0-9a-f]\\)\\)+" string) (setq output (concat output (regexp-quote (substring string 0 (match-beginning 1))) (decode-coding-string (preview-dequote-thingies (substring string (match-beginning 1) (match-end 0))) buffer-file-coding-system)) string (substring string (match-end 0)))) (setq output (concat output (regexp-quote string))) output)) BTW, you can use the 3rd arg to string-match to avoid consing strings for `string'. This way you only apply decode-coding-string to the part of the string which is still undecoded but not to the rest. Stefan ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: decode-coding-string gone awry? 2005-02-14 18:12 ` Stefan Monnier @ 2005-02-14 18:41 ` David Kastrup 2005-02-14 19:30 ` Stefan Monnier 0 siblings, 1 reply; 32+ messages in thread From: David Kastrup @ 2005-02-14 18:41 UTC (permalink / raw) Cc: emacs-devel Stefan Monnier <monnier@iro.umontreal.ca> writes: >>> instead of being processed directly from the process filter, then >>> you should also ensure that this buffer is unibyte. > >> Yuk. The problem is that this buffer is not only processed by >> preview-latex, but also by AUCTeX, and the versions that get combined >> may be different. AUCTeX uses the source code buffer's file encoding >> by default, which is fine for basically unibyte based coding systems. > > If you can't change this part, then your best bet might be to do something > like: > > (defun preview-error-quote (string) > "Turn STRING with potential ^^ sequences into a regexp. > To preserve sanity, additional ^ prefixes are matched literally, > so the character represented by ^^^ preceding extended characters > will not get matched, usually." > (let (output case-fold-search) > (while (string-match "\\^*\\(\\^\\^\\(\\([@-_?]\\)\\|[8-9a-f][0-9a-f]\\)\\)+" > string) > (setq output > (concat output > (regexp-quote (substring string 0 (match-beginning 1))) > (decode-coding-string > (preview-dequote-thingies (substring string > (match-beginning 1) > (match-end 0))) > buffer-file-coding-system)) > string (substring string (match-end 0)))) > (setq output (concat output (regexp-quote string))) > output)) > > BTW, you can use the 3rd arg to string-match to avoid consing strings for > `string'. > > This way you only apply decode-coding-string to the part of the > string which is still undecoded but not to the rest. No use. The gag precisely is that TeX may decide to split a _single_ Unicode character into some bytes that it will let go through unchanged, and some bytes that it will transcribe into ^^ba notation. 
If decode-coding-string is supposed to have a chance of reassembling this junk, it must only be run at the end of reconstructing the byte stream. Yes, this is completely insane. No, I can't avoid having to deal with it somehow. Give me a clue: what happens if a process inserts stuff with 'raw-text encoding into a multibyte buffer? 'raw-text is a reconstructible encoding, isn't it, so the stuff will get converted into some prefix byte indicating "isolated single-byte entity instead of utf-8 char" and the byte itself or something, right? And decode-coding-string does not want to work on something like that? I have to admit to total cluelessness. -- David Kastrup, Kriemhildstr. 15, 44793 Bochum ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: decode-coding-string gone awry? 2005-02-14 18:41 ` David Kastrup @ 2005-02-14 19:30 ` Stefan Monnier 2005-02-14 20:09 ` David Kastrup 0 siblings, 1 reply; 32+ messages in thread From: Stefan Monnier @ 2005-02-14 19:30 UTC (permalink / raw) Cc: emacs-devel > Give me a clue: what happens if a process inserts stuff with 'raw-text > encoding into a multibyte buffer? 'raw-text is a reconstructible > encoding, isn't it, so the stuff will get converted into some prefix > byte indicating "isolated single-byte entity instead of utf-8 char" > and the byte itself or something, right? And decode-coding-string > does not want to work on something like that? If you want accented chars to appear as accented chars in the (process) buffer (i.e. you don't want to change the AUCTeX part), then raw-text is not an option anyway. If you don't mind about accented chars appearing as \NNN, then you can make the buffer unibyte and use `raw-text' as the process's output coding-system. That's the more robust approach. If that option is out (i.e. you have to use a multibyte buffer), you'll have to basically recover the original byte-sequence by replacing the (regexp-quote (substring string 0 (match-beginning 1))) with (regexp-quote (encode-coding-string (substring string 0 (match-beginning 1)) buffer-file-coding-system)) [assuming buffer-file-coding-system is the process's output coding-system] or (regexp-quote (string-make-unibyte (substring string 0 (match-beginning 1)))) which is basically equivalent except that you lose control over which coding-system is used. Stefan ^ permalink raw reply [flat|nested] 32+ messages in thread
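[Editor's note: the recovery Stefan describes can be sketched as a round trip. This is a minimal illustration written against a modern Emacs; the 2005 internals represented eight-bit characters differently, but the idea is the same.]

```elisp
;; Raw UTF-8 bytes landing in a multibyte buffer become eight-bit
;; characters; re-encoding with raw-text recovers the original byte
;; sequence, after which a real UTF-8 decode succeeds.
(let* ((bytes (unibyte-string #xC3 #xB6))            ; UTF-8 bytes for "ö"
       (as-seen (string-to-multibyte bytes))         ; two eight-bit chars
       (recovered (encode-coding-string as-seen 'raw-text-unix))
       (text (decode-coding-string recovered 'utf-8)))
  text)  ; => "ö"
```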
* Re: decode-coding-string gone awry? 2005-02-14 19:30 ` Stefan Monnier @ 2005-02-14 20:09 ` David Kastrup 2005-02-14 20:56 ` Stefan Monnier 0 siblings, 1 reply; 32+ messages in thread From: David Kastrup @ 2005-02-14 20:09 UTC (permalink / raw) Cc: emacs-devel Stefan Monnier <monnier@iro.umontreal.ca> writes: >> Give me a clue: what happens if a process inserts stuff with >> 'raw-text encoding into a multibyte buffer? 'raw-text is a >> reconstructible encoding, isn't it, so the stuff will get converted >> into some prefix byte indicating "isolated single-byte entity >> instead of utf-8 char" and the byte itself or something, right? >> And decode-coding-string does not want to work on something like >> that? > > If you want accented chars to appear as accented chars in the > (process) buffer (i.e. you don't want to change the AUCTeX part), > then raw-text is not an option anyway. Yes, I figured as much. I should better explain what I am doing in the first place. AUCTeX does the basic management of the buffer, creating it, associating processes with it, making a filter routine for it that inserts the strings after some scanning for keyphrases and so on. preview-latex uses all of this folderol, but turns the process output encoding of its own processes to raw text. This is something that AUCTeX does _not_ yet do for its own processes. AUCTeX's own process output is more likely to be viewed by the user, anyway. We can't hope to get a really readable UTF-8 display for AUCTeX's own processes at the moment, but AUCTeX's behavior right now leads to user-readable output in all current cases _except_ when TeX thinks it is in some Latin-1 locale while working on utf-8 input. Now with the AUCTeX processes, user readability is the most important thing. If AUCTeX can't locate the buffer position exactly, it will at least locate the line, and that's tolerable for all practical purposes. With preview-latex, it is not tolerable. 
On the other hand, the output from preview-latex processes is usually not shown to the user at all: having an unreadable output buffer due to raw-text encoding is quite ok. So that is basically the background why we can easily make the process raw-text, but quite less easily make the buffer unibyte: AUCTeX will use the same buffer for its next run, just erasing it, and if it has turned unibyte, we get into trouble. > If you don't mind about accented chars appearing as \NNN, then you > can make the buffer unibyte and use `raw-text' as the process's > output coding-system. That's the more robust approach. If the accented chars (in fact, the whole upper 8bit page) appeared as \NNN, this would actually mostly be a _win_ over the current situation where we not too rarely get a mixture of raw bytes and nonsense characters. However, I am afraid that this is not quite possible right now. We are now in the process of preparing the last major standalone release of preview-latex. After that, it will get folded into AUCTeX, and we will streamline the whole junk. But in the next weeks, I still want to get out a preview-latex that works with the current AUCTeX releases and vice versa. After that, we will probably make the process encoding raw-text for the _whole_ of AUCTeX and use a CCL-Program for preprocessing the ^^ sequences into bytecodes again, essentially creating an efficient artificial illusion of a TeX outputting sane error messages in all surroundings. > If that option is out (i.e. you have to use a multibyte buffer), > you'll have to basically recover the original byte-sequence by > replacing the > > (regexp-quote (substring string 0 (match-beginning 1))) > > with > > (regexp-quote (encode-coding-string > (substring string 0 (match-beginning 1)) > buffer-file-coding-system)) > > [assuming buffer-file-coding-system is the process's output > coding-system] The process output coding system being raw-text. Do I really need to actually encode raw-text? 
> (regexp-quote (string-make-unibyte > (substring string 0 (match-beginning 1)))) > > which is basically equivalent except that you lose control over > which coding-system is used. I have to admit to being befuddled. I'll probably have to experiment until I find something that works and cross fingers. I don't think I have much of a chance to actually understand all of the involved intricacies. -- David Kastrup, Kriemhildstr. 15, 44793 Bochum ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: decode-coding-string gone awry? 2005-02-14 20:09 ` David Kastrup @ 2005-02-14 20:56 ` Stefan Monnier 2005-02-14 21:07 ` David Kastrup 2005-02-14 21:26 ` David Kastrup 0 siblings, 2 replies; 32+ messages in thread From: Stefan Monnier @ 2005-02-14 20:56 UTC (permalink / raw) Cc: emacs-devel > So that is basically the background why we can easily make the process > raw-text, but quite less easily make the buffer unibyte: AUCTeX will > use the same buffer for its next run, just erasing it, and if it has > turned unibyte, we get into trouble. OK. raw-text is good. > The process output coding system being raw-text. Do I really need to > actually encode raw-text? If the string comes straight from raw-text (via a multibyte buffer), that means it only has ascii and eight-bit-* chars, so all you need is to turn it from multibyte to unibyte, which can be done with (encode-coding-string foo 'raw-text-unix) or (string-make-unibyte foo) or (string-as-unibyte foo). The three options are basically equivalent in this case. string-as-unibyte is +/- (encode-coding-string foo 'emacs-mule-unix) string-make-unibyte is +/- (encode-coding-string foo locale-coding-system) I personally prefer the use of encode-coding-string because it makes things more explicit: you can mention that you're encoding with `raw-text' because you're undoing the raw-text decoding done by the process's coding-system. That makes it more obviously correct. Stefan ^ permalink raw reply [flat|nested] 32+ messages in thread
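[Editor's note: the three equivalences Stefan lists can be checked directly. A sketch, assuming the string contains only ASCII and eight-bit characters, as raw-text process output landing in a multibyte buffer would produce.]

```elisp
;; All three conversions should yield the same unibyte string for this
;; kind of input.  (`string-make-unibyte' is deprecated in current
;; Emacs, but it is the function under discussion here.)
(let ((s (string-to-multibyte (unibyte-string ?a #xC3 #xB6))))
  (list (encode-coding-string s 'raw-text-unix)
        (string-make-unibyte s)
        (string-as-unibyte s)))
```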
* Re: decode-coding-string gone awry? 2005-02-14 20:56 ` Stefan Monnier @ 2005-02-14 21:07 ` David Kastrup 2005-02-14 21:29 ` Stefan Monnier 2005-02-14 21:26 ` David Kastrup 1 sibling, 1 reply; 32+ messages in thread From: David Kastrup @ 2005-02-14 21:07 UTC (permalink / raw) Cc: emacs-devel Stefan Monnier <monnier@iro.umontreal.ca> writes: >> So that is basically the background why we can easily make the >> process raw-text, but quite less easily make the buffer unibyte: >> AUCTeX will use the same buffer for its next run, just erasing it, >> and if it has turned unibyte, we get into trouble. > > OK. raw-text is good. > >> The process output coding system being raw-text. Do I really need >> to actually encode raw-text? > > If the string comes straight from raw-text (via a multibyte buffer), > that means it only has ascii and eight-bit-* chars, so all you need > is to turn it from multibyte to unibyte, which can be done with > (encode-coding-string foo 'raw-text-unix) or (string-make-unibyte > foo) or (string-as-unibyte foo). The three options are basically > equivalent in this case. > > string-as-unibyte is +/- (encode-coding-string foo 'emacs-mule-unix) > string-make-unibyte is +/- (encode-coding-string foo locale-coding-system) > > I personally prefer the use of encode-coding-string because it makes > things more explicit: you can mention that you're encoding with > `raw-text' because you're undoing the raw-text decoding done by the > process's coding-system. That makes it more obviously correct. Phooey. Ok, this sounds like it makes sense. It also sounds like it should work also under XEmacs without having to engage my brain in particular. Now I am venturing into the realm of pure luxury: is there a way to have the eight-bit-* chars display as octal escapes always even when real latin1 characters (inserted by a process with process-coding latin1) get displayed transparently? 
I seem to remember that in those "crazy" utf-8 buffers I had, those that were created by decoding raw-text, there appeared latin-1 characters like the infamous à character. But maybe I am mistaken about that. I'll just experiment with the stuff a bit and probably use C-x = a lot. Thanks, -- David Kastrup, Kriemhildstr. 15, 44793 Bochum ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: decode-coding-string gone awry? 2005-02-14 21:07 ` David Kastrup @ 2005-02-14 21:29 ` Stefan Monnier 2005-02-14 21:57 ` David Kastrup 0 siblings, 1 reply; 32+ messages in thread From: Stefan Monnier @ 2005-02-14 21:29 UTC (permalink / raw) Cc: emacs-devel > Now I am venturing into the realm of pure luxury: is there a way to > have the eight-bit-* chars display as octal escapes always even when > real latin1 characters (inserted by a process with process-coding > latin1) get displayed transparently? I seem to remember that in those > "crazy" utf-8 buffers I had, those that were created by decoding > raw-text, there appeared latin-1 characters like the infamous à > character. But maybe I am mistaken about that. I'll just experiment > with the stuff a bit and probably use C-x = a lot. The eight-bit-* chars are different characters than the latin1 ones, so they can indeed be displayed differently. The eight-bit-* chars have internal codes 128-255, so you can use slots 128-255 of char tables to control how they're displayed. If the display-table says "nil" for one of them it'll be displayed as \NNN. IIRC in many normal startup situations, those slots are set so as to display latin-1 chars. Stefan ^ permalink raw reply [flat|nested] 32+ messages in thread
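[Editor's note: Stefan's display-table suggestion might be sketched like this. Slot numbers follow his description of the Emacs of that era, where eight-bit chars had internal codes 128-255; current Emacs represents raw bytes differently internally, so treat this as illustrative.]

```elisp
;; Clear display-table slots 128..255 so those characters fall back to
;; the default \NNN octal display.  Buffer-local sketch, not tested
;; against the 2005 codebase.
(let ((dt (or buffer-display-table (make-display-table))))
  (dotimes (i 128)
    (aset dt (+ 128 i) nil))  ; nil => default display, i.e. \NNN
  (setq buffer-display-table dt))
```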
* Re: decode-coding-string gone awry? 2005-02-14 21:29 ` Stefan Monnier @ 2005-02-14 21:57 ` David Kastrup 0 siblings, 0 replies; 32+ messages in thread From: David Kastrup @ 2005-02-14 21:57 UTC (permalink / raw) Cc: emacs-devel Stefan Monnier <monnier@iro.umontreal.ca> writes: >> Now I am venturing into the realm of pure luxury: is there a way to >> have the eight-bit-* chars display as octal escapes always even when >> real latin1 characters (inserted by a process with process-coding >> latin1) get displayed transparently? I seem to remember that in those >> "crazy" utf-8 buffers I had, those that were created by decoding >> raw-text, there appeared latin-1 characters like the infamous à >> character. But maybe I am mistaken about that. I'll just experiment >> with the stuff a bit and probably use C-x = a lot. > > The eight-bit-* chars are different characters than the latin1 ones, so they > can indeed be displayed differently. The eight-bit-* chars have internal > codes 128-255, so you can use slots 128-255 of char tables to control how > they're displayed. If the display-table says "nil" for one of them it'll be > displayed as \NNN. IIRC in many normal startup situations, those slots are > set so as to display latin-1 chars. That explains that I remembered seeing Latin-1 (which is my normal setup). That I was right now seeing the _expected_ \xxx sequences is quite likely entirely the fault of my X11 environment which for some completely unfathomable reason has LC_CTYPE=C set. I suspect a recent change to fluxbox, but have yet to find the culprit. -- David Kastrup, Kriemhildstr. 15, 44793 Bochum ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: decode-coding-string gone awry? 2005-02-14 20:56 ` Stefan Monnier 2005-02-14 21:07 ` David Kastrup @ 2005-02-14 21:26 ` David Kastrup 1 sibling, 0 replies; 32+ messages in thread From: David Kastrup @ 2005-02-14 21:26 UTC (permalink / raw) Cc: emacs-devel Stefan Monnier <monnier@iro.umontreal.ca> writes: >> So that is basically the background why we can easily make the process >> raw-text, but quite less easily make the buffer unibyte: AUCTeX will >> use the same buffer for its next run, just erasing it, and if it has >> turned unibyte, we get into trouble. > > OK. raw-text is good. Just to follow up on my last mail: the raw-text inserted 8bit-characters _are_ displayed with octal escape sequences, just like I wished they were. While I am pretty sure that I _have_ seen something else as well, it would appear that I confused this with what an AUCTeX run (probably with an intentionally wrong process encoding of latin-1) produced, or some earlier iteration of the software. Thanks for all the hand-holding. All this is a bit crazy: first the process "converting" to raw-text, then I taking this, "encoding" raw-text, interpreting the escapes, and decoding to utf-8 or whatever else... But that's a secondary issue. We can try making something more sensible once we have the merger behind us. -- David Kastrup, Kriemhildstr. 15, 44793 Bochum ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: decode-coding-string gone awry? 2005-02-14 17:24 ` David Kastrup 2005-02-14 18:12 ` Stefan Monnier @ 2005-02-15 17:28 ` Richard Stallman 2005-02-15 21:42 ` David Kastrup 1 sibling, 1 reply; 32+ messages in thread From: Richard Stallman @ 2005-02-15 17:28 UTC (permalink / raw) Cc: monnier, emacs-devel Yuk. The problem is that this buffer is not only processed by preview-latex, but also by AUCTeX, and the versions that get combined may be different. AUCTeX uses the source code buffer's file encoding by default, which is fine for basically unibyte based coding systems. It sounds like the safest thing is to convert the string to what you want, just before you use it. If decode-coding-string is supposed to have a chance of reassembling this junk, it must only be run at the end of reconstructing the byte stream. Yes, this is completely insane. No, I can't avoid having to deal with it somehow. If you reconstruct the correct byte stream, it should work to apply decode-coding-string to it. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: decode-coding-string gone awry? 2005-02-15 17:28 ` Richard Stallman @ 2005-02-15 21:42 ` David Kastrup 0 siblings, 0 replies; 32+ messages in thread From: David Kastrup @ 2005-02-15 21:42 UTC (permalink / raw) Cc: monnier, emacs-devel Richard Stallman <rms@gnu.org> writes: > Yuk. The problem is that this buffer is not only processed by > preview-latex, but also by AUCTeX, and the versions that get combined > may be different. AUCTeX uses the source code buffer's file encoding > by default, which is fine for basically unibyte based coding systems. > > It sounds like the safest thing is to convert the string to what you > want, just before you use it. > > If decode-coding-string is supposed to have a chance of reassembling > this junk, it must only be run at the end of reconstructing the byte > stream. Yes, this is completely insane. No, I can't avoid having to > deal with it somehow. > > If you reconstruct the correct byte stream, it should work to apply > decode-coding-string to it. Yes, it now works. The process has a process-encoding of raw-text, inserts in the (multibyte) error message buffer. The buffer contents (which are error messages with error contexts) are then tried to be matched with the source buffer from the compilation directly. If this fails, the error message buffer contents are taken, (encode-coding-string 'raw-text)ed again, the ^^xx hexadecimal bytes are converted to their equivalent bytes, and then the stuff gets (decode-coding-string 'buffer-coding-system)ed with the encoding of the source buffer with which those error messages are compared. At this time the error contexts should really match (or I'll start weeping). It currently appears to work with all sane and insane combination of TeX quoting schemes, system language environments and Emacs language settings. Thanks for all the help here. -- David Kastrup, Kriemhildstr. 15, 44793 Bochum ^ permalink raw reply [flat|nested] 32+ messages in thread
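[Editor's note: the pipeline David describes might be condensed into a sketch like the following. Both function names are hypothetical; `expand-hat-escapes' is not a real preview-latex function, it just stands for the step that turns ^^xx hex escapes back into bytes.]

```elisp
(defun expand-hat-escapes (s)
  "Replace ^^xx hex escapes in unibyte string S by the bytes they denote."
  (let ((case-fold-search nil))
    (replace-regexp-in-string
     "\\^\\^\\([0-9a-f][0-9a-f]\\)"
     (lambda (m)
       (unibyte-string (string-to-number (match-string 1 m) 16)))
     s t t)))

(defun preview-reconstruct-context (error-context source-coding)
  "Sketch: turn raw-text-decoded ERROR-CONTEXT into SOURCE-CODING text.
Re-encode with raw-text to recover the byte stream, expand the ^^xx
escapes, then decode with the source buffer's coding system."
  (decode-coding-string
   (expand-hat-escapes
    (encode-coding-string error-context 'raw-text-unix))
   source-coding))
```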
end of thread, other threads:[~2005-02-22 8:41 UTC | newest] Thread overview: 32+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2005-02-13 3:50 decode-coding-string gone awry? David Kastrup 2005-02-14 1:50 ` Kenichi Handa 2005-02-14 2:28 ` David Kastrup 2005-02-15 6:15 ` Richard Stallman 2005-02-15 9:31 ` David Kastrup 2005-02-15 16:17 ` Stefan Monnier 2005-02-17 10:35 ` Richard Stallman 2005-02-17 12:08 ` Kenichi Handa 2005-02-17 13:20 ` Stefan Monnier 2005-02-18 8:30 ` Kenichi Handa 2005-02-18 12:56 ` Stefan Monnier 2005-02-19 9:44 ` Richard Stallman 2005-02-18 14:12 ` Richard Stallman 2005-02-19 20:55 ` Richard Stallman 2005-02-21 1:19 ` Kenichi Handa 2005-02-22 8:41 ` Richard Stallman 2005-02-18 14:12 ` Richard Stallman 2005-02-14 13:37 ` Stefan Monnier 2005-02-14 13:50 ` David Kastrup 2005-02-14 16:57 ` Stefan Monnier 2005-02-14 17:24 ` David Kastrup 2005-02-14 18:12 ` Stefan Monnier 2005-02-14 18:41 ` David Kastrup 2005-02-14 19:30 ` Stefan Monnier 2005-02-14 20:09 ` David Kastrup 2005-02-14 20:56 ` Stefan Monnier 2005-02-14 21:07 ` David Kastrup 2005-02-14 21:29 ` Stefan Monnier 2005-02-14 21:57 ` David Kastrup 2005-02-14 21:26 ` David Kastrup 2005-02-15 17:28 ` Richard Stallman 2005-02-15 21:42 ` David Kastrup