How to create a derived encoding?

unofficial mirror of emacs-devel@gnu.org 

* How to create a derived encoding?
@ 2004-10-12  0:10 David Kastrup
  2004-10-12 15:09 ` Stefan Monnier
  0 siblings, 1 reply; 6+ messages in thread
From: David Kastrup @ 2004-10-12  0:10 UTC (permalink / raw)



After considerable thinking about the problem, I have arrived at the
conclusion that for efficiency's sake I'd like to have an encoding
like tex-utf-8 which is derived from the normal utf-8 except that
sequences like ^^8a and similar are converted into a corresponding
byte before combining Unicode characters.  It would be a bonus if such
sequences stayed unchanged when this sort of composition does
not lead to a valid Unicode character, but that's just a bonus.

The problem is that TeX has no clue about _characters_, but works on
byte streams, and it has the habit of transliterating some byte codes
in the above manner.  Treating the output of TeX sensibly means
converting those transliterations back into bytes _before_ assembling
Unicode characters.
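
To illustrate why the order matters, here is a minimal sketch.  It assumes
a reasonably recent Emacs that provides `unibyte-string', and the function
name `my-tex-order-demo' is made up for this example.  TeX would print the
UTF-8 encoding of "ä" (the bytes #xc3 #xa4) as "^^c3^^a4"; the two bytes
only form one character if they are put back together before decoding.

```elisp
;; Hypothetical demo function, not part of any package.
(defun my-tex-order-demo ()
  "Return a list of two decodings of the bytes #xc3 #xa4."
  (let ((bytes (unibyte-string #xc3 #xa4)))
    (list
     ;; Bytes reassembled first, then decoded as UTF-8: one character.
     (decode-coding-string bytes 'utf-8)
     ;; Each byte treated as a character on its own: two Latin-1 characters.
     (decode-coding-string bytes 'latin-1))))

(my-tex-order-demo)   ; => ("ä" "Ã¤")
```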

The same problem occurs with unibyte non-ASCII encodings like Latin-1.
I already have one (rather inefficient) hack to deal with that in
preview-latex, but it does not extend easily to multibyte.

So if there was a tolerably working way to derive a special encoding
(which will be used as a process output encoding) that reconverts
control sequences like the above before composing unicode characters
from the resulting utf-8 stream, this would appear to be by far the
fastest and most convenient way to go about this problem.

Any hints how to derive a suitably augmented encoding from an existing
one?

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum


* Re: How to create a derived encoding?
  2004-10-12  0:10 How to create a derived encoding? David Kastrup
@ 2004-10-12 15:09 ` Stefan Monnier
  2004-10-12 15:27   ` David Kastrup
  0 siblings, 1 reply; 6+ messages in thread
From: Stefan Monnier @ 2004-10-12 15:09 UTC (permalink / raw)
  Cc: emacs-devel

> So if there was a tolerably working way to derive a special encoding
> (which will be used as a process output encoding) that reconverts
> control sequences like the above before composing unicode characters
> from the resulting utf-8 stream, this would appear to be by far the
> fastest and most convenient way to go about this problem.

I'm not sure what you've tried and what are the constraints under which
you're coding, but I'd have assumed that you can do:

1 - assume the raw TeX output with its funny quoted bytes is in the
    current temp buffer.   The buffer is in unibyte mode.
2 - do a search&replace of ^^NN to the corresponding byte.
3 - call decode-coding-region with the appropriate coding system.
4 - set the buffer to multibyte.

If the step number 2 is too slow, you can most likely implement a CCL
program that does it faster.
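
The steps above can be sketched roughly as follows.  This is only a sketch
under some assumptions: the function name `my-tex-decode-via-temp-buffer'
is made up, the output is assumed to be UTF-8, `unibyte-string' requires a
reasonably recent Emacs, and steps 3 and 4 are folded into a single call to
`decode-coding-string' on the buffer contents instead of decoding the
region in place.

```elisp
;; Hypothetical helper, not part of any package.
(defun my-tex-decode-via-temp-buffer (raw)
  "Return RAW, a unibyte string of TeX output, decoded as UTF-8.
Sequences ^^xx (two lowercase hex digits) become raw bytes first."
  (with-temp-buffer
    ;; Step 1: work on the raw bytes in a unibyte temp buffer.
    (set-buffer-multibyte nil)
    (insert raw)
    (goto-char (point-min))
    ;; Step 2: replace every ^^xx with the byte it stands for.
    (while (re-search-forward "\\^\\^\\([0-9a-f][0-9a-f]\\)" nil t)
      (replace-match
       (unibyte-string (string-to-number (match-string 1) 16))
       t t))
    ;; Steps 3 and 4, combined: decode the bytes into a multibyte string.
    (decode-coding-string (buffer-string) 'utf-8)))

(my-tex-decode-via-temp-buffer "B^^c3^^a4r")   ; => "Bär"
```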


        Stefan


* Re: How to create a derived encoding?
  2004-10-12 15:09 ` Stefan Monnier
@ 2004-10-12 15:27   ` David Kastrup
  2004-10-12 16:23     ` Stefan Monnier
  0 siblings, 1 reply; 6+ messages in thread
From: David Kastrup @ 2004-10-12 15:27 UTC (permalink / raw)
  Cc: emacs-devel

Stefan Monnier <monnier@iro.umontreal.ca> writes:

>> So if there was a tolerably working way to derive a special encoding
>> (which will be used as a process output encoding) that reconverts
>> control sequences like the above before composing unicode characters
>> from the resulting utf-8 stream, this would appear to be by far the
>> fastest and most convenient way to go about this problem.
>
> I'm not sure what you've tried and what are the constraints under which
> you're coding, but I'd have assumed that you can do:
>
> 1 - assume the raw TeX output with its funny quoted bytes is in the
>     current temp buffer.   The buffer is in unibyte mode.

No good.  We are talking about process output that is accumulating in
a buffer.  We can't just let everything trickle in in raw mode since
the buffer may be interactive and so we need to have more or less
accurate stuff at each point of time.

> 2 - do a search&replace of ^^NN to the corresponding byte.

Dead slow if we have to do this with search-and-replace in the filter
routine of the process.

> 3 - call decode-coding-region with the appropriate coding system.
> 4 - set the buffer to multibyte.

The buffer comes into being incrementally.

> If the step number 2 is too slow, you can most likely implement a
> CCL program that does it faster.

Well, that was what I was asking about.  And how to let this CCL
program run prefixed to the normal process output decoding program.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum


* Re: How to create a derived encoding?
  2004-10-12 15:27   ` David Kastrup
@ 2004-10-12 16:23     ` Stefan Monnier
  2004-10-12 21:02       ` David Kastrup
  0 siblings, 1 reply; 6+ messages in thread
From: Stefan Monnier @ 2004-10-12 16:23 UTC (permalink / raw)
  Cc: emacs-devel

>> 1 - assume the raw TeX output with its funny quoted bytes is in the
>> current temp buffer.   The buffer is in unibyte mode.

> No good.  We are talking about process output that is accumulating in
> a buffer.  We can't just let everything trickle in in raw mode since
> the buffer may be interactive and so we need to have more or less
> accurate stuff at each point of time.

That's OK.  This assumption is not important.  You can do the decoding in
the process filter, or anywhere else.

>> 3 - call decode-coding-region with the appropriate coding system.
>> 4 - set the buffer to multibyte.

> The buffer comes into being incrementally.

There can be several buffers.  Remember in point 1 I said "temp buffer".
And I'm sure it can be all done within a multibyte buffer if necessary.

>> If the step number 2 is too slow, you can most likely implement a
>> CCL program that does it faster.

> Well, that was what I was asking about.  And how to let this CCL
> program run prefixed to the normal process output decoding program.

You can run a CCL program independently from any coding system.


        Stefan


* Re: How to create a derived encoding?
  2004-10-12 16:23     ` Stefan Monnier
@ 2004-10-12 21:02       ` David Kastrup
  2004-10-14 11:12         ` Oliver Scholz
  0 siblings, 1 reply; 6+ messages in thread
From: David Kastrup @ 2004-10-12 21:02 UTC (permalink / raw)
  Cc: emacs-devel

Stefan Monnier <monnier@iro.umontreal.ca> writes:

>>> 1 - assume the raw TeX output with its funny quoted bytes is in the
>>> current temp buffer.   The buffer is in unibyte mode.
>
>> No good.  We are talking about process output that is accumulating in
>> a buffer.  We can't just let everything trickle in in raw mode since
>> the buffer may be interactive and so we need to have more or less
>> accurate stuff at each point of time.
>
> That's OK.  This assumption is not important.  You can do the
> decoding in the process filter, or anywhere else.
>
>>> 3 - call decode-coding-region with the appropriate coding system.
>>> 4 - set the buffer to multibyte.
>
>> The buffer comes into being incrementally.
>
> There can be several buffers.  Remember in point 1 I said "temp buffer".
> And I'm sure it can be all done within a multibyte buffer if necessary.
>
>>> If the step number 2 is too slow, you can most likely implement a
>>> CCL program that does it faster.
>
>> Well, that was what I was asking about.  And how to let this CCL
>> program run prefixed to the normal process output decoding program.
>
> You can run a CCL program independently from any coding system.

Well, I can hardly run it manually _before_ the process decoding
stuff.  And if I run it in the filter function, it has to deal with
partial characters at the end of the string.  And the utf-8 decoding
after it also has to deal with partial characters at the end of the
string, which is normally done by the process filter.

And of course the most challenging bit is that I have no clue
whatsoever about CCL programs.  Not to mention that I hope that XEmacs
Mule will work just the same, but that's a different distraction.  If
it doesn't, I'll whine on the respective lists until it does.
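
To make the partial-sequence problem concrete, here is a minimal sketch of
the bookkeeping a filter function would need for the ^^xx part alone.  The
name `my-tex-split-pending' is made up for this example, only the two-digit
lowercase hex form is considered, and incomplete UTF-8 sequences at the end
of the unquoted bytes would need the same kind of treatment, which this
sketch does not attempt.

```elisp
;; Hypothetical helper, not part of any package.
(defun my-tex-split-pending (chunk)
  "Split CHUNK into a cons cell (COMPLETE . PENDING).
PENDING is a trailing \"^\", \"^^\" or \"^^x\" that may be the start of
an unfinished ^^xx sequence.  The caller is expected to keep it and
prepend it to the next chunk of process output."
  (if (string-match "\\^\\(?:\\^[0-9a-f]?\\)?\\'" chunk)
      (cons (substring chunk 0 (match-beginning 0))
            (substring chunk (match-beginning 0)))
    ;; Nothing suspicious at the end: the whole chunk can be processed.
    (cons chunk "")))

(my-tex-split-pending "ab^^c")   ; => ("ab" . "^^c")
(my-tex-split-pending "a^^c3")   ; => ("a^^c3" . "")
```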

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum


* Re: How to create a derived encoding?
  2004-10-12 21:02       ` David Kastrup
@ 2004-10-14 11:12         ` Oliver Scholz
  0 siblings, 0 replies; 6+ messages in thread
From: Oliver Scholz @ 2004-10-14 11:12 UTC (permalink / raw)
  Cc: emacs-devel

Very interesting question.  There are, of course, people on this list
who know more about coding systems than I; yet I might as well give it
a try, I thought.

David Kastrup <dak@gnu.org> writes:

> Stefan Monnier <monnier@iro.umontreal.ca> writes:
>
>>>> 1 - assume the raw TeX output with its funny quoted bytes is in the
>>>> current temp buffer.   The buffer is in unibyte mode.
>>
>>> No good.  We are talking about process output that is accumulating in
>>> a buffer.  We can't just let everything trickle in in raw mode since
>>> the buffer may be interactive and so we need to have more or less
>>> accurate stuff at each point of time.
>>
>> That's OK.  This assumption is not important.  You can do the
>> decoding in the process filter, or anywhere else.
>>
>>>> 3 - call decode-coding-region with the appropriate coding system.
>>>> 4 - set the buffer to multibyte.
>>
>>> The buffer comes into being incrementally.
>>
>> There can be several buffers.  Remember in point 1 I said "temp buffer".
>> And I'm sure it can be all done within a multibyte buffer if necessary.
>>
>>>> If the step number 2 is too slow, you can most likely implement a
>>>> CCL program that does it faster.
>>
>>> Well, that was what I was asking about.  And how to let this CCL
>>> program run prefixed to the normal process output decoding program.
>>
>> You can run a CCL program independently from any coding system.

Is there any other way to do this than `ccl-execute-on-string'? Using
the latter would imply string allocation (two times, if I read the
code correctly). This is not the case for coding systems, AFAICS.  It
would be nice to have a `ccl-execute-on-region'.

> Well, I can hardly run it manually _before_ the process decoding
> stuff.  And if I run it in the filter function, it has to deal with
> partial characters at the end of the string.  And the utf-8 decoding
> after it also has to deal with partial characters at the end of the
> string, which is normally done by the process filter.

The best I can think of without changing the C code is to write a CCL
program that returns the number of octets at the end that are
suspected to be incomplete control words. Run that in a filter and
frob the process mark or whatever (I am largely ignorant of process
issues, so please bear with me, if that happens to be nonsense.)  Like
with the example below:

(defun example-ccl-test-hex-to-byte (reg)
  "Return CCL code to convert hex char to byte.
REG is the CCL register where the character is stored.  This only
deals with lowercase hex chars."
  `(if (,reg < ?0)
       ((write ,reg)
        (,reg = -1))
     (if (,reg <= ?9)
         (,reg -= 48)
       (if (,reg < ?a)
           ((write ,reg)
            (,reg = -1))
         (if (,reg > ?f)
             ((write ,reg)
              (,reg = -1))
           (,reg -= 87))))))


(define-ccl-program example-ccl-program
  `(1
    (loop
     (r1 = -1)
     (r2 = -1)
     (r0 = 0)
     (read r1)
     (if (r1 != ?^)
         (write r1)
       ;; We update r0 accordingly whenever we read some character in.
       ((r0 += 1)
        (read r1)
        (if (r1 != ?^)
            ((write ?^)
             (write r1))
          ;; We have found the sequence ^^ so far. Let's for now just
          ;; /assume/ that the following two chars are a valid hex
          ;; number.
          ((r0 += 1)
           (read r1)
           (r0 += 1)
           (read r2)
           ,(example-ccl-test-hex-to-byte 'r1)
           ,(example-ccl-test-hex-to-byte 'r2)
           (r1 *= 16)
           (r1 += r2)
           (write r1)))))
     (repeat))
    (if (r1 != -1)
        ((write ?^)
         (write ?^)
         (write r1)))))

(defun example-decode-region (from to)
  "Run `example-ccl-program' over the region between FROM and TO.
Return the number of characters at the end that are suspected to
be part of a yet incomplete control sequence."
  (let ((str (buffer-substring-no-properties from to))
        (vect (make-vector 9 nil)))
    (delete-region from to)
    ;; Insert the result where the region was, not wherever point is.
    (goto-char from)
    (insert (ccl-execute-on-string 'example-ccl-program
                                   vect
                                   str))
    (aref vect 0)))


    Oliver
-- 
Oliver Scholz               23 Vendémiaire an 213 de la Révolution
Ostendstr. 61               Liberté, Egalité, Fraternité!
60314 Frankfurt a. M.       
