* How to create a derived encoding?
From: David Kastrup @ 2004-10-12 0:10 UTC

After considerable thinking about the problem, I have arrived at the
conclusion that for efficiency's sake I'd like to have an encoding
like tex-utf-8 which is derived from the normal utf-8, except that
sequences like ^^8a and similar are converted into the corresponding
byte before combining Unicode characters. It would be a bonus if such
sequences stayed unchanged in case this sort of composition does not
lead to a valid Unicode character, but that's just a bonus.

The problem is that TeX has no clue about _characters_, but works on
byte streams, and it has the habit of transliterating some byte codes
in the above manner. Treating the output of TeX sensibly means
converting those transliterations back into bytes _before_ assembling
Unicode characters. The same problem occurs with unibyte non-ASCII
encodings like Latin-1. I already have one (rather inefficient) hack
to deal with that in preview-latex, but it does not extend easily to
multibyte.

So if there was a tolerably working way to derive a special encoding
(which will be used as a process output encoding) that reconverts
control sequences like the above before composing Unicode characters
from the resulting utf-8 stream, this would appear to be by far the
fastest and most convenient way to go about this problem.

Any hints how to derive a suitably augmented encoding from an existing
one?

--
David Kastrup, Kriemhildstr. 15, 44793 Bochum
* Re: How to create a derived encoding?
From: Stefan Monnier @ 2004-10-12 15:09 UTC
Cc: emacs-devel

> So if there was a tolerably working way to derive a special encoding
> (which will be used as a process output encoding) that reconverts
> control sequences like the above before composing Unicode characters
> from the resulting utf-8 stream, this would appear to be by far the
> fastest and most convenient way to go about this problem.

I'm not sure what you've tried and what the constraints are under
which you're coding, but I'd have assumed that you can do:

1 - assume the raw TeX output with its funny quoted bytes is in the
    current temp buffer. The buffer is in unibyte mode.
2 - do a search&replace of ^^NN to the corresponding byte.
3 - call decode-coding-region with the appropriate coding system.
4 - set the buffer to multibyte.

If step number 2 is too slow, you can most likely implement a CCL
program that does it faster.

        Stefan
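The four steps listed above can be sketched in Emacs Lisp on a string instead of a temp buffer. This is only a minimal sketch under stated assumptions, not code from the thread: the names `my-tex-unquote-bytes` and `my-tex-decode-output` are hypothetical, the input is assumed to be a unibyte string, and only ^^ followed by two lowercase hex digits is treated as a quoted byte.

```elisp
;; Hypothetical helpers illustrating steps 2 and 3; not part of
;; preview-latex or Emacs itself.

(defun my-tex-unquote-bytes (string)
  "Replace each ^^xx (two lowercase hex digits) in STRING by that byte.
STRING is assumed to be unibyte; anything else is left untouched."
  (let ((case-fold-search nil))
    (replace-regexp-in-string
     "\\^\\^\\([0-9a-f][0-9a-f]\\)"
     (lambda (m)
       ;; M is the matched text; the match data refers to M here.
       (unibyte-string (string-to-number (match-string 1 m) 16)))
     string t t)))

(defun my-tex-decode-output (string &optional coding-system)
  "Unquote ^^xx bytes in STRING, then decode with CODING-SYSTEM.
CODING-SYSTEM defaults to `utf-8'."
  (decode-coding-string (my-tex-unquote-bytes string)
                        (or coding-system 'utf-8)))

;; Example: the quoted bytes C3 A4 form U+00E4 once decoded as utf-8.
;; (my-tex-decode-output "a^^c3^^a4b")
```

The order matters: the bytes are restored first and only then handed to the decoder, which is the point made in the original question. The sketch does not address speed or incremental process output.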
* Re: How to create a derived encoding?
From: David Kastrup @ 2004-10-12 15:27 UTC
Cc: emacs-devel

Stefan Monnier <monnier@iro.umontreal.ca> writes:

>> So if there was a tolerably working way to derive a special encoding
>> (which will be used as a process output encoding) that reconverts
>> control sequences like the above before composing Unicode characters
>> from the resulting utf-8 stream, this would appear to be by far the
>> fastest and most convenient way to go about this problem.
>
> I'm not sure what you've tried and what the constraints are under
> which you're coding, but I'd have assumed that you can do:
>
> 1 - assume the raw TeX output with its funny quoted bytes is in the
>     current temp buffer. The buffer is in unibyte mode.

No good. We are talking about process output that is accumulating in
a buffer. We can't just let everything trickle in in raw mode since
the buffer may be interactive and so we need to have more or less
accurate stuff at each point of time.

> 2 - do a search&replace of ^^NN to the corresponding byte.

Dead slow if we have to do this with search-and-replace in the filter
routine of the process.

> 3 - call decode-coding-region with the appropriate coding system.
> 4 - set the buffer to multibyte.

The buffer comes into being incrementally.

> If step number 2 is too slow, you can most likely implement a
> CCL program that does it faster.

Well, that was what I was asking about. And how to let this CCL
program run prefixed to the normal process output decoding program.

--
David Kastrup, Kriemhildstr. 15, 44793 Bochum
* Re: How to create a derived encoding?
From: Stefan Monnier @ 2004-10-12 16:23 UTC
Cc: emacs-devel

>> 1 - assume the raw TeX output with its funny quoted bytes is in the
>>     current temp buffer. The buffer is in unibyte mode.

> No good. We are talking about process output that is accumulating in
> a buffer. We can't just let everything trickle in in raw mode since
> the buffer may be interactive and so we need to have more or less
> accurate stuff at each point of time.

That's OK. This assumption is not important. You can do the
decoding in the process filter, or anywhere else.

>> 3 - call decode-coding-region with the appropriate coding system.
>> 4 - set the buffer to multibyte.

> The buffer comes into being incrementally.

There can be several buffers. Remember in point 1 I said "temp buffer".
And I'm sure it can all be done within a multibyte buffer if necessary.

>> If step number 2 is too slow, you can most likely implement a
>> CCL program that does it faster.

> Well, that was what I was asking about. And how to let this CCL
> program run prefixed to the normal process output decoding program.

You can run a CCL program independently from any coding system.

        Stefan
* Re: How to create a derived encoding?
From: David Kastrup @ 2004-10-12 21:02 UTC
Cc: emacs-devel

Stefan Monnier <monnier@iro.umontreal.ca> writes:

>>> 1 - assume the raw TeX output with its funny quoted bytes is in the
>>>     current temp buffer. The buffer is in unibyte mode.
>
>> No good. We are talking about process output that is accumulating in
>> a buffer. We can't just let everything trickle in in raw mode since
>> the buffer may be interactive and so we need to have more or less
>> accurate stuff at each point of time.
>
> That's OK. This assumption is not important. You can do the
> decoding in the process filter, or anywhere else.
>
>>> 3 - call decode-coding-region with the appropriate coding system.
>>> 4 - set the buffer to multibyte.
>
>> The buffer comes into being incrementally.
>
> There can be several buffers. Remember in point 1 I said "temp buffer".
> And I'm sure it can all be done within a multibyte buffer if necessary.
>
>>> If step number 2 is too slow, you can most likely implement a
>>> CCL program that does it faster.
>
>> Well, that was what I was asking about. And how to let this CCL
>> program run prefixed to the normal process output decoding program.
>
> You can run a CCL program independently from any coding system.

Well, I can hardly run it manually _before_ the process decoding
stuff. And if I run it in the filter function, it has to deal with
partial characters at the end of the string. And the utf-8 decoding
after it also has to deal with partial characters at the end of the
string, which is normally done by the process filter.

And of course the most challenging bit is that I have no clue
whatsoever about CCL programs. Not to mention that I hope that XEmacs
Mule will work just the same, but that's a different distraction.
If it doesn't, I'll whine on the respective lists until it does.

--
David Kastrup, Kriemhildstr. 15, 44793 Bochum
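The two "partial characters at the end of the string" problems raised above can be made concrete with a small Emacs Lisp sketch. It is an illustration under assumptions, not a solution proposed in the thread: every `my-tex-` name is hypothetical, the chunks are assumed to arrive as unibyte strings (for instance because the process coding system was set to `binary`), and only ^^ followed by two lowercase hex digits counts as a quoted byte. State is carried between calls in a cons cell (RAW-TAIL . BYTE-TAIL).

```elisp
;; Sketch only: all my-tex-* names are hypothetical.

(defun my-tex-utf8-tail-length (bytes)
  "Return the length of an unfinished UTF-8 sequence ending unibyte BYTES."
  (let* ((len (length bytes))
         (i (1- len))
         (back 0))
    ;; Step back over at most three continuation bytes (10xxxxxx).
    (while (and (>= i 0) (< back 3)
                (= (logand (aref bytes i) #xC0) #x80))
      (setq i (1- i)
            back (1+ back)))
    (if (< i 0)
        0
      (let* ((lead (aref bytes i))
             (need (cond ((>= lead #xF0) 4)
                         ((>= lead #xE0) 3)
                         ((>= lead #xC0) 2)
                         (t 1))))
        ;; Hold bytes back only if the sequence is still too short.
        (if (> need (- len i)) (- len i) 0)))))

(defun my-tex-decode-chunk (state chunk)
  "Decode unibyte CHUNK as utf-8 after unquoting ^^xx bytes.
STATE is a cons (RAW-TAIL . BYTE-TAIL) of unibyte strings that is
updated in place; start with (cons \"\" \"\")."
  (let* ((case-fold-search nil)
         (raw (concat (car state) chunk))
         ;; A trailing "^", "^^" or "^^x" may be an unfinished quote.
         (cut (or (string-match "\\^\\(?:\\^[0-9a-f]?\\)?\\'" raw)
                  (length raw)))
         (unquoted (replace-regexp-in-string
                    "\\^\\^\\([0-9a-f][0-9a-f]\\)"
                    (lambda (m)
                      (unibyte-string
                       (string-to-number (match-string 1 m) 16)))
                    (substring raw 0 cut) t t))
         (bytes (concat (cdr state) unquoted))
         (split (- (length bytes) (my-tex-utf8-tail-length bytes))))
    (setcar state (substring raw cut))     ; unfinished ^^xx quote
    (setcdr state (substring bytes split)) ; unfinished UTF-8 sequence
    (decode-coding-string (substring bytes 0 split) 'utf-8)))
```

A process filter could call `my-tex-decode-chunk` on each chunk and insert the result. Note that a genuine trailing caret is held back until more output arrives, so whatever is left in the state would have to be flushed when the process exits. The sketch says nothing about whether this is fast enough compared with a real coding system.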
* Re: How to create a derived encoding?
From: Oliver Scholz @ 2004-10-14 11:12 UTC
Cc: emacs-devel

Very interesting question. There are, of course, people on this list
who know more about coding systems than I; yet I might as well give it
a try, I thought.

David Kastrup <dak@gnu.org> writes:

> Stefan Monnier <monnier@iro.umontreal.ca> writes:
>
>>>> 1 - assume the raw TeX output with its funny quoted bytes is in the
>>>>     current temp buffer. The buffer is in unibyte mode.
>>
>>> No good. We are talking about process output that is accumulating in
>>> a buffer. We can't just let everything trickle in in raw mode since
>>> the buffer may be interactive and so we need to have more or less
>>> accurate stuff at each point of time.
>>
>> That's OK. This assumption is not important. You can do the
>> decoding in the process filter, or anywhere else.
>>
>>>> 3 - call decode-coding-region with the appropriate coding system.
>>>> 4 - set the buffer to multibyte.
>>
>>> The buffer comes into being incrementally.
>>
>> There can be several buffers. Remember in point 1 I said "temp buffer".
>> And I'm sure it can all be done within a multibyte buffer if necessary.
>>
>>>> If step number 2 is too slow, you can most likely implement a
>>>> CCL program that does it faster.
>>
>>> Well, that was what I was asking about. And how to let this CCL
>>> program run prefixed to the normal process output decoding program.
>>
>> You can run a CCL program independently from any coding system.

Is there any other way to do this than `ccl-execute-on-string'? Using
the latter would imply string allocation (two times, if I read the
code correctly). This is not the case for coding systems, AFAICS. It
would be nice to have a `ccl-execute-on-region'.

> Well, I can hardly run it manually _before_ the process decoding
> stuff. And if I run it in the filter function, it has to deal with
> partial characters at the end of the string. And the utf-8 decoding
> after it also has to deal with partial characters at the end of the
> string, which is normally done by the process filter.

The best I can think of without changing the C code is to write a CCL
program that returns the number of octets at the end that are
suspected to be incomplete control words. Run that in a filter and
frob the process mark or whatever. (I am largely ignorant of process
issues, so please bear with me if that happens to be nonsense.)

Like with the example below:

(defun example-ccl-test-hex-to-byte (reg)
  "Return CCL code to convert hex char to byte.
REG is the CCL register where the character is stored.
This only deals with lowercase hex chars."
  `(if (,reg < ?0)
       ((write ,reg)
        (,reg = -1))
     (if (,reg <= ?9)
         (,reg -= 48)
       (if (,reg < ?a)
           ((write ,reg)
            (,reg = -1))
         (if (,reg > ?f)
             ((write ,reg)
              (,reg = -1))
           (,reg -= 87))))))

(define-ccl-program example-ccl-program
  `(1
    (loop
     (r1 = -1)
     (r2 = -1)
     (r0 = 0)
     (read r1)
     (if (r1 != ?^)
         (write r1)
       ;; We update r0 accordingly whenever we read some character in.
       ((r0 += 1)
        (read r1)
        (if (r1 != ?^)
            ((write ?^)
             (write r1))
          ;; We have found the sequence ^^ so far. Let's for now just
          ;; /assume/ that the following two chars are a valid hex
          ;; number.
          ((r0 += 1)
           (read r1)
           (r0 += 1)
           (read r2)
           ,(example-ccl-test-hex-to-byte 'r1)
           ,(example-ccl-test-hex-to-byte 'r2)
           (r1 *= 16)
           (r1 += r2)
           (write r1)))))
     (repeat))
    (if (r1 != -1)
        ((write ?^)
         (write ?^)
         (write r1)))))

(defun example-decode-region (from to)
  ;; This returns the number of characters at the end that are
  ;; suspected to be part of a yet incomplete control.
  (let ((str (buffer-substring-no-properties from to))
        (vect (make-vector 9 nil)))
    (delete-region from to)
    (insert (ccl-execute-on-string 'example-ccl-program vect str))
    (aref vect 0)))

    Oliver
--
Oliver Scholz               23 Vendémiaire an 213 de la Révolution
Ostendstr. 61               Liberté, Egalité, Fraternité!
60314 Frankfurt a. M.