Need some help with Rmail/mbox

* Need some help with Rmail/mbox
@ 2008-09-18 16:02 Paul Michael Reilly
  2008-09-19  3:28 ` Stephen J. Turnbull
                   ` (3 more replies)
  0 siblings, 4 replies; 33+ messages in thread
From: Paul Michael Reilly @ 2008-09-18 16:02 UTC (permalink / raw)
  To: emacs-devel

The basic problem I need to solve now is how to map the values of the
content-type and content-transfer-encoding headers (either of which
could legally be absent) to an Emacs coding system.  I am slogging
through this task and if anyone has already done it and has either a
short "how-to" or even better some code, that would be much
appreciated.

As Eli helpfully pointed out, rmail-convert-to-babyl-format provides
some help.

As near as I can tell the task is to decode the message body in two
steps: first to decode according to the character encoding
(e.g. quoted-printable or base64) and then to decode that result to
some coding system.  Something along the lines of:

     (let (body)
       (setq body (apply qp or base64 to body of message)
       (decode-coding-string body (detect-coding-string body t))

Am I even in the ballpark?

-pmr

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Need some help with Rmail/mbox
  2008-09-18 16:02 Need some help with Rmail/mbox Paul Michael Reilly
@ 2008-09-19  3:28 ` Stephen J. Turnbull
  2008-09-19  5:35   ` Paul Michael Reilly
  2008-09-19  4:30 ` Richard M. Stallman
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 33+ messages in thread
From: Stephen J. Turnbull @ 2008-09-19  3:28 UTC (permalink / raw)
  To: pmr; +Cc: emacs-devel

Paul Michael Reilly writes:

 > As near as I can tell the task is to decode the message body in two
 > steps:

But why not just use the existing code to do this?  AIUI, the Babyl
format was designed for one-buffer operation on a pseudo-RFC-822
message, so most functions used to wash and display probably assume
that the message is in the current buffer, which is narrowed so that
the presentation header plus the body form an RFC 2822 message.

All you should need to do for a first cut is to copy the message to a
new buffer, which doesn't need to be narrowed, but might need to have
some Babyl sentinels added.

If I'm missing something, feel free to ignore me, but I don't really
understand what all you think is different about presenting a
free-standing RFC 2822 message as opposed to presenting one that is
part of a Babyl-format buffer.  I don't think they should be that
different.  The main thing is that the Babyl format caches the set of
presentation headers in the Babyl-format file, but mbox won't.  So
you'll need to hide (or remove) the non-presentation headers
one-by-one rather than by just narrowing the buffer.

 > first to decode according to the character encoding (e.g. quoted-
 > printable or base64) and then to decode that result to some coding
 > system.

That's basically it.  You should do the processing on buffers, not
strings, though, and

 >        (decode-coding-string body (detect-coding-string body t))

you want to parse the coding from the *header*, not guess on the body.
If you want you can add guessing and/or user-specified MIME charsets
as a user option, but (a) almost all genuine mail today will contain
an appropriate Content-Type charset parameter, and (b) lack of such
(unless all text is US-ASCII) is an extremely strong indicator of
spam.  Few users will need to be able to read messages that have bogus
charset parameters: this feature is not immediately necessary.

The general algorithm should be something like

Identify message in mbox buffer
Copy message to presentation buffer
Identify header and body, add Babyl sentinels if desired
Parse headers (specifically content type)
Dispatch on content type and subtype:
    Case type is text and subtype is plain
        Identify charset parameter:
            (or charset-from-content-type "us-ascii")
        Map charset to Emacs coding-system
        (decode-coding-region (body-begin) (body-end) coding-system)
        Wash header for presentation, eg:
            Hide non-displayed header
            Decode RFC 2047-encoded headers
        Wash body for presentation, eg:
            Highlight and activate url-like substrings
            Highlight quoted material
Display buffer in window
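
The "parse headers" and "identify charset parameter" steps of the
outline above might look roughly like the following.  This is only a
sketch: `my-message-charset' is a made-up name, the charset regexp is
deliberately naive (it ignores RFC 2231 parameter continuations and
comments), and it assumes the current buffer holds exactly one
message.  `mail-fetch-field' comes from mail-utils.el.

```elisp
(require 'mail-utils)

;; Hypothetical helper; not part of Rmail.
(defun my-message-charset ()
  "Return the charset named in the current buffer's Content-Type header.
The buffer is assumed to hold a single RFC 2822 message.  The result
is a lowercase string, defaulting to \"us-ascii\"."
  (save-excursion
    (save-restriction
      (goto-char (point-min))
      ;; The header ends at the first blank line; narrow to it so
      ;; that `mail-fetch-field' cannot match text in the body.
      (narrow-to-region (point-min)
                        (if (search-forward "\n\n" nil t)
                            (1- (point))
                          (point-max)))
      (let ((content-type (mail-fetch-field "content-type"))
            (case-fold-search t))
        (if (and content-type
                 (string-match "charset=\"?\\([^\"; \t\n]+\\)"
                               content-type))
            (downcase (match-string 1 content-type))
          "us-ascii")))))
```

Narrowing to the header first keeps a stray "Content-Type:" line in
the body from being picked up by mistake.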

* Re: Need some help with Rmail/mbox
  2008-09-18 16:02 Need some help with Rmail/mbox Paul Michael Reilly
  2008-09-19  3:28 ` Stephen J. Turnbull
@ 2008-09-19  4:30 ` Richard M. Stallman
  2008-09-19  4:30 ` Richard M. Stallman
  2008-09-19  9:12 ` Eli Zaretskii
  3 siblings, 0 replies; 33+ messages in thread
From: Richard M. Stallman @ 2008-09-19  4:30 UTC (permalink / raw)
  To: pmr; +Cc: emacs-devel

Congratulations on the new release.




* Re: Need some help with Rmail/mbox
  2008-09-18 16:02 Need some help with Rmail/mbox Paul Michael Reilly
  2008-09-19  3:28 ` Stephen J. Turnbull
  2008-09-19  4:30 ` Richard M. Stallman
@ 2008-09-19  4:30 ` Richard M. Stallman
  2008-09-19  9:12 ` Eli Zaretskii
  3 siblings, 0 replies; 33+ messages in thread
From: Richard M. Stallman @ 2008-09-19  4:30 UTC (permalink / raw)
  To: pmr; +Cc: emacs-devel

    As near as I can tell the task is to decode the message body in two
    steps: first to decode according to the character encoding
    (e.g. quoted-printable or base64) and then to decode that result to
    some coding system.

That is correct.

			 Something along the lines of:

	 (let (body)
	   (setq body (apply qp or base64 to body of message)

You call `mail-unquote-printable-region' or  `base64-decode-region'.
They operate on the buffer.

	   (decode-coding-string body (detect-coding-string body t))

Use `decode-coding-region'.  It operates on the buffer.

When operating on a large amount of text, don't do it in strings.
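
A minimal sketch of the two steps done on the buffer, for the base64
case only.  The name `my-decode-base64-body' is hypothetical, and the
sketch assumes a multibyte current buffer (the default for temporary
buffers), where `base64-decode-region' leaves raw bytes that
`decode-coding-region' then turns into characters.

```elisp
;; Hypothetical helper; not part of Rmail.
(defun my-decode-base64-body (beg end coding-system)
  "Decode the base64 text between BEG and END in place.
The resulting bytes are then decoded as CODING-SYSTEM."
  (save-restriction
    ;; Narrow so that the region bounds follow the text as it shrinks.
    (narrow-to-region beg end)
    (base64-decode-region (point-min) (point-max))
    (decode-coding-region (point-min) (point-max) coding-system)))

(with-temp-buffer
  (insert "dm9pbMOg")        ; base64 of the UTF-8 bytes for "voilà"
  (my-decode-base64-body (point-min) (point-max) 'utf-8)
  (buffer-string))
```

No intermediate Lisp string is created for the body; both steps
rewrite the buffer text directly.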




* Re: Need some help with Rmail/mbox
  2008-09-19  3:28 ` Stephen J. Turnbull
@ 2008-09-19  5:35   ` Paul Michael Reilly
  2008-09-19  9:32     ` Eli Zaretskii
  2008-09-20  7:12     ` Stephen J. Turnbull
  0 siblings, 2 replies; 33+ messages in thread
From: Paul Michael Reilly @ 2008-09-19  5:35 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: emacs-devel

Stephen J. Turnbull wrote:

Thanks for stepping up to this.  Your help is very much appreciated!

> All you should need to do for a first cut is to copy the message to a
> new buffer, which doesn't need to be narrowed, but might need to have
> some Babyl sentinels added.

I first copy the relevant headers to the view buffer by collecting
them from the PMAIL buffer into a string and insert the string into
the view buffer.  I used the rmail.el code pretty much as is but
instead of copying and hiding I do selective copy and insert (ignoring
the case of showing all headers which is trivial).

Then I basically copy the message body into a string and insert it
into the view buffer.  But when I started to work on the decoding it
seemed that decoding the string before inserting it seemed like a good
idea. (Pardon my Elisp rustiness ... is it better to use buffer to
buffer copying than insert string?) I copied the logic for this first
part of decoding from rmail-convert-to-babyl-format.

> That's basically it.  You should do the processing on buffers, not
> strings, though, and

Are you essentially answering my question above and saying that
copying buffer to buffer is faster/better than operating on strings?

> 
>  >        (decode-coding-string body (detect-coding-string body t))
> 
> you want to parse the coding from the *header*, not guess on the body.

I do parse out quoted-printable and base64 and apply these to the body
before doing the coding system based decoding.

> If you want you can add guessing and/or user-specified MIME charsets
> as a user option, but (a) almost all genuine mail today will contain
> an appropriate Content-Type charset parameter, and (b) lack of such
> (unless all text is US-ASCII) is an extremely strong indicator of
> spam.  Few users will need to be able to read messages that have bogus
> charset parameters: this feature is not immediately necessary.

OK, makes sense.

> The general algorithm should be something like
> 
> Identify message in mbox buffer

yup

> Copy message to presentation buffer

yup

> Identify header and body, add Babyl sentinels if desired

babyl sentinels?  I'm not sure what you mean by this.

> Parse headers (specifically content type)

If you had said content type and content encoding I would have said
"yup" and that is what led to my request for help.  Except for the
case of quoted-printable and base64 I'm not sure how to parse those
two headers (Content-Type and Content-Transfer-Encoding) into a coding
system so that I can then do the decoding.  I'm assuming the coding
system guesswork becomes relevant for combinations of the two headers
that Rmail does not grok.  And I now see that there is a strong
relationship between charset and coding system.

> Dispatch on content type and subtype:
>     Case type is text and subtype is plain
>         Identify charset parameter:
>             (or charset-from-content-type "us-ascii")
>         Map charset to Emacs coding-system
>         (decode-coding-region (body-begin) (body-end) coding-system)

OK, this is helpful.  I assume that for all other type/subtype cases
we punt for now and use guessing or just raw text?  But certainly
there are some that we want to process/decode in some fashion,
e.g. text/html or text/xml.  Is there another Emacs package/library
that you are aware of that provides a good model for where we want to
take Rmail so that it handles more type/subtype cases seamlessly in
the view buffer? Even perhaps audio and video (not pure MIME,
i.e. multipart ... yet).

>         Wash header for presentation, eg:
>             Hide non-displayed header
>             Decode RFC 2047-encoded headers

OK, this is helpful but I would add that non-displayed headers do not
need to be in the view buffer at all.  It contains all the headers or
just the displayed headers, depending on the User's current desire.

>         Wash body for presentation, eg:
>             Highlight and activate url-like substrings
>             Highlight quoted material

I don't believe Rmail does either of these operations now.  Is that
your understanding?  If I'm right and this washing is not done, then
it is very high on my priority list to add asap.  If I'm wrong then
please point me where it gets done or how to enable it.

> Display buffer in window

yup

Thanks again,

-pmr

* Re: Need some help with Rmail/mbox
  2008-09-18 16:02 Need some help with Rmail/mbox Paul Michael Reilly
                   ` (2 preceding siblings ...)
  2008-09-19  4:30 ` Richard M. Stallman
@ 2008-09-19  9:12 ` Eli Zaretskii
  3 siblings, 0 replies; 33+ messages in thread
From: Eli Zaretskii @ 2008-09-19  9:12 UTC (permalink / raw)
  To: pmr; +Cc: emacs-devel

> Date: Thu, 18 Sep 2008 12:02:19 -0400
> From: Paul Michael Reilly <pmr@pajato.com>
> 
> The basic problem I need to solve now is how to map the values of the
> content-type and content-transfer-encoding headers (either of which
> could legally be absent) to an Emacs coding system.  I am slogging
> through this task and if anyone has already done it and has either a
> short "how-to" or even better some code, that would be much
> appreciated.
> 
> As Eli helpfully pointed out, rmail-convert-to-babyl-format provides
> some help.

Yes, and it already maps the values of content-transfer-encoding into
Emacs coding-systems (the mapping is trivial, btw; see
rmail-decode-region and its callers).  If you still have problems with
this after reading the Rmail code, please ask more specific questions.

> As near as I can tell the task is to decode the message body in two
> steps: first to decode according to the character encoding
> (e.g. quoted-printable or base64) and then to decode that result to
> some coding system.  Something along the lines of:
> 
>      (let (body)
>        (setq body (apply qp or base64 to body of message)
>        (decode-coding-string body (detect-coding-string body t))
> 
> Am I even in the ballpark?

Yes, this is exactly what rmail-convert-to-babyl-format does.  It just
assumes that there's only one part in the message, so it does the
above only once.  You want to do that for every part of a multi-part
message.




* Re: Need some help with Rmail/mbox
  2008-09-19  5:35   ` Paul Michael Reilly
@ 2008-09-19  9:32     ` Eli Zaretskii
  2008-09-20  7:12     ` Stephen J. Turnbull
  1 sibling, 0 replies; 33+ messages in thread
From: Eli Zaretskii @ 2008-09-19  9:32 UTC (permalink / raw)
  To: Paul Michael Reilly; +Cc: stephen, emacs-devel

> Date: Fri, 19 Sep 2008 01:35:12 -0400
> From: Paul Michael Reilly <pmr@pajato.com>
> Cc: emacs-devel@gnu.org
> 
> I first copy the relevant headers to the view buffer by collecting
> them from the PMAIL buffer into a string and insert the string into
> the view buffer.

I hope you use insert-buffer-substring instead of actually making a
string and inserting it.  Consing large strings is not a good idea if
all you need is to copy text from one buffer to another.

> Then I basically copy the message body into a string and insert it
> into the view buffer.

Same comment as above.

> But when I started to work on the decoding it seemed that decoding
> the string before inserting it seemed like a good idea.

Actually, it isn't: in Emacs, whenever you can work on a buffer
instead of a string, you should generally prefer a buffer.
Specifically, decoding of strings uses scratch buffers behind your
back, and you don't gain anything in efficiency.

So just copy the text to the view buffer, then decode it in-place.
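
As a rough illustration of copying without an intermediate string,
assuming two live buffers (the function name and its arguments are
invented for this sketch):

```elisp
;; Hypothetical helper; not part of Rmail.
(defun my-copy-message-to-view (src beg end view)
  "Replace the contents of buffer VIEW with the text BEG..END of buffer SRC.
The text goes straight from buffer to buffer; no Lisp string is consed."
  (with-current-buffer view
    (erase-buffer)
    (insert-buffer-substring src beg end)))
```

Any decoding would then be applied to the copied text in VIEW, leaving
SRC untouched.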

> (Pardon my Elisp rustiness ... is it better to use buffer to
> buffer copying than insert string?)

Yes.

> If you had said content type and content encoding I would have said
> "yup" and that is what led to my request for help.  Except for the
> case of quoted-printable and base64 I'm not sure how to parse those
> two headers (Content-Type and Content-Transfer-Encoding) into a coding
> system so that I can then do the decoding.

You should parse them separately, and use them separately, just like
Rmail/Babyl does: first decode qp or b64 into 8-bit encoded bytes,
then decode the rest using the charset gleaned from the Content-Type
header.

> I'm assuming the coding system guesswork becomes relevant for
> combinations of the two headers that Rmail does not grok.

This should not happen, in general; but for more robust code, you
could try `undecided' if all else fails; this is what Rmail/Babyl
does.  See rmail-decode-region.

> And I now see that there is a strong relationship between charset
> and coding system.

Yes; they are mostly the same.  Emacs defines an alias coding-system
for every MIME charset, IIRC.
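
A sketch of that mapping with the `undecided' fallback mentioned
above.  The function name is hypothetical, and the sketch relies on
the charset name, once downcased, being a coding-system name or alias.

```elisp
;; Hypothetical helper; not part of Rmail.
(defun my-coding-system-for-charset (charset)
  "Map the MIME CHARSET (a string, or nil) to an Emacs coding system.
A missing charset means us-ascii; an unknown one gives `undecided'."
  (let ((sym (intern (downcase (or charset "us-ascii")))))
    ;; `coding-system-p' also returns t for nil, so check SYM itself
    ;; in case the charset string happens to be "nil".
    (if (and sym (coding-system-p sym))
        sym
      'undecided)))
```

For example, (my-coding-system-for-charset "ISO-8859-1") yields the
symbol `iso-8859-1'.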

> OK, this is helpful.  I assume that for all other type/subtype cases
> we punt for now and use guessing or just raw text?

It's not raw text; it should be plain ASCII (before you qp- or
b64-decode it; I suggest not decoding the original qp or b64
encoding until you support those additional types).  Rmail/Babyl uses
`undecided' for those, and so can you.

> But certainly
> there are some that we want to process/decode in some fashion,
> e.g. text/html or text/xml.

Eventually, yes.

> Is there another Emacs package/library
> that you are aware of that provides a good model for where we want to
> take Rmail so that it handles more type/subtype cases seamlessly in
> the view buffer? Even perhaps audio and video (not pure MIME,
> i.e. multipart ... yet).

Gnus, of course.  But again, I suggest not bothering with these
extensions for now: just make Rmail/mbox no worse than Rmail/Babyl,
so that people can start using it.  Extensions can come later.

> >         Wash body for presentation, eg:
> >             Highlight and activate url-like substrings
> >             Highlight quoted material
> 
> I don't believe Rmail does either of these operations now.

Right, it doesn't.  We have ffap and similar features to do that
without highlighting, although highlighting would be nice (again, as
an extension of what Rmail does now).

* Re: Need some help with Rmail/mbox
  2008-09-19  5:35   ` Paul Michael Reilly
  2008-09-19  9:32     ` Eli Zaretskii
@ 2008-09-20  7:12     ` Stephen J. Turnbull
  2008-09-20 10:04       ` Daiki Ueno
  1 sibling, 1 reply; 33+ messages in thread
From: Stephen J. Turnbull @ 2008-09-20  7:12 UTC (permalink / raw)
  To: Paul Michael Reilly; +Cc: emacs-devel

Paul Michael Reilly writes:

 > Thanks for stepping up to this.  Your help is very much appreciated!

You're welcome.  Eli and Richard have already responded with some
existing Rmail features, but maybe some background (somewhat
duplicating their comments) would be helpful, too.

 > I first copy the relevant headers to the view buffer by collecting
 > them from the PMAIL buffer into a string and insert the string into
 > the view buffer.

Copying to a string uses memory.  The amount of memory is not a huge
consideration these days, even with a multimegabyte buffer.  But
allocating and deallocating strings is time-consuming: malloc may need
a system call to get more memory, and when a string is deallocated its
data either gets compacted or, for large strings, may need another
system call to free (I forget if Emacs uses direct allocation for
large strings instead of expanding the string data pool).

Also, strings are read-only.  So to "edit" a string, you actually have
to copy the relevant parts to a new string; if you substitute in the
middle of a string, you have to create a bunch of new strings one for
each fragment, then a final string.  Lotsa consing.

 > I used the rmail.el code pretty much as is but instead of copying
 > and hiding I do selective copy and insert (ignoring the case of
 > showing all headers which is trivial).

That's reasonable, I think.

 > Then I basically copy the message body into a string and insert it
 > into the view buffer.

`insert-buffer-substring' is much more efficient.

 > But when I started to work on the decoding it seemed that decoding
 > the string before inserting it seemed like a good idea.

In XEmacs, string decoding is implemented by copying to a temporary
buffer and doing decode-coding-region there.  Emacs is likely the
same.  :-)

 > Are you essentially answering my question above and saying that
 > copying buffer to buffer is faster/better than operating on strings?

Yes.  It's faster and better.  Buffers are designed for editing.
Strings are designed for read-only text to save all the editor
overhead that buffers carry around.  Here's just one reason.  Emacs
strings are *not* arrays of characters, they are arrays of bytes,
which (from Lisp) can only be read at character boundaries.  An ASCII
character takes up 1 byte, a Latin-1 character 2 bytes, a Japanese
character 3 bytes, and (IIRC) certain user-defined characters may take
4 bytes!  This means that if you decide to substitute a Latin-1 LATIN
SMALL LETTER A WITH GRAVE for ASCII LATIN SMALL LETTER A (thus
turning voila into voilà) you can't do it in a string without
allocating a new string.
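
The difference between characters and bytes can be checked with
`length' and `string-bytes'.  The counts below assume a current Emacs,
whose internal representation is UTF-8 based; the byte count for the
accented string may differ in other versions.

```elisp
(let ((plain "voila")
      (accented "voil\u00e0"))      ; "voilà"
  ;; Same number of characters, different number of bytes.
  (list (length plain) (string-bytes plain)
        (length accented) (string-bytes accented)))
```

Both strings have five characters, but the second one needs an extra
byte for the accented letter.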

 > I do parse out quoted-printable and base64 and apply these to the body
 > before doing the coding system based decoding.

OK.

 > > Identify header and body, add Babyl sentinels if desired
 > 
 > babyl sentinels?  I'm not sure what you mean by this.

Babyl messages are delimited with "^_" IIRC, and the original headers
are set off from the displayed ones by a "*** EOOH ***" line or
something like that.  I don't remember whether any code that presents
a message uses those after narrowing (in your implementation,
copying), though.  If it's not used, you don't need them.

 > "yup" and that is what led to my request for help.  Except for the
 > case of quoted-printable and base64 I'm not sure how to parse those
 > two headers (Content-Type and Content-Transfer-Encoding) into a coding
 > system so that I can then do the decoding.

Content-Transfer-Encoding is about how bytes, *not characters*, are
represented.  For practical purposes there are four possibilities:
text is all ASCII (the default, aka 7bit), text is raw unibyte (8bit),
text is encoded as quoted-printable, and text is encoded as BASE64.
So you are done with that.

This is entirely independent of Content-Type or its charset parameter.
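
A sketch of dispatching on those four cases, with a hypothetical
function name.  It uses `quoted-printable-decode-region' from qp.el
for the quoted-printable case, which is my choice for the sketch
rather than anything prescribed here.

```elisp
(require 'qp)

;; Hypothetical helper; not part of Rmail.
(defun my-undo-transfer-encoding (beg end encoding)
  "Undo the Content-Transfer-Encoding ENCODING on BEG..END in place.
ENCODING is a string or nil.  7bit, 8bit, a missing header and
unknown values all leave the text as it is."
  (let ((enc (downcase (or encoding "7bit"))))
    (cond ((string= enc "base64")
           (base64-decode-region beg end))
          ((string= enc "quoted-printable")
           (quoted-printable-decode-region beg end)))))
```

Charset decoding is a separate step that runs afterwards on the
resulting bytes, whatever the transfer encoding was.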

 > I'm assuming the coding system guesswork becomes relevant for
 > combinations of the two headers that Rmail does not grok.

No.  If there is no Content-Type header, you "should" assume the RFC
2045 defaults (text/plain; charset=US-ASCII).  Providing commands for
the user to change those on a per message basis would be nice, but not
needed for a first release as the vast majority of non-spam mail is
MIME-conformant these days.

 > And I now see that there is a strong relationship between charset
 > and coding system.

Technically, the *MIME charset* concept is broken, or at least a very
poor name.  A "character set" is an abstract idea that is (AFAIK)
basically unstandardized.  A *coded character set* is an invertible
mapping from a set of non-negative integers to characters.  You can
think of Unicode as a universe of characters, although that's not
quite good enough for some esoteric purposes.  What Emacs calls a
"charset" is basically a coded character set.  An "encoding" is again
an abstract idea which is not really standardized, but it's pretty
close to what Emacs calls a "coding system", which is a pair of
algorithms for decoding an external text into an Emacs buffer, and for
doing the reverse, plus some auxiliary parameters and functions for
specialized purposes (eg, for detecting the encoding of an unknown
text).  As you recognized, this is basically the same thing as a "MIME
charset".

You should not need to deal with Emacs charsets, by the way.  Just
remember that "MIME charset == Emacs coding-system" and you'll do
fine.

 > OK, this is helpful.  I assume that for all other type/subtype cases
 > we punt for now and use guessing or just raw text?

For text/* types, just use the raw text (there should be a charset
parameter if it is not ASCII).

 > But certainly there are some that we want to process/decode in some
 > fashion, e.g. text/html or text/xml.  Is there another Emacs
 > package/library that you are aware of that provides a good model
 > for where we want to take Rmail so that it handles more
 > type/subtype cases seamlessly in the view buffer?

Gnus, VM, tm (aka "Tiny MIME", obsoleted by SEMI and unsupported),
SEMI (obsolete and unsupported I believe), WEMI (IIRC a C library to
link into Emacs, based on SEMI, obsolete and unsupported I guess),
MH-E, MEW, Wanderlust (these last three I don't know about the
implementations, they may borrow from Gnus).

Both VM and Gnus use the model I suggested of dispatching on type and
subtype.  Some naming convention like `mime-handler-TYPE/SUBTYPE'
could be used.

    (let ((handlers (list (intern (format "mime-handler-%s/%s" type subtype))
                          (intern (format "mime-handler-%s/*" type))
                          'mime-handler-*/*))
          handler done)
      ;; Try the handlers from most to least specific and stop at the
      ;; first one that is defined.
      (while (and handlers (not done))
        (setq handler (car handlers)
              handlers (cdr handlers))
        (when (functionp handler)
          (funcall handler body-start body-end)
          (setq done t)))
      ;; `warn' may be an XEmacs-ism, sorry
      (unless done
        (warn "no handler defined for %s/%s" type subtype)))

 > Even perhaps audio and video (not pure MIME, i.e. multipart
 > ... yet).

You *need* multipart as quickly as possible.  Too much mail is sent
as multipart.  It's not that hard, you just parse the MIME bodies
recursively, and throw away the bodies you don't know how to handle.
I'm sure Rmail already knows how to do this.
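
To give an idea of the size of the job, here is a flat (non-recursive)
sketch that finds the parts of one multipart body.  The function name
is made up, the boundary string is assumed to have been taken from the
Content-Type header already, and details such as CRLF line endings are
ignored.  Recursion would mean calling it again inside any part whose
own Content-Type is multipart.

```elisp
;; Hypothetical helper; not part of Rmail.
(defun my-multipart-regions (boundary)
  "Return a list of (BEG . END) pairs, one per part of the current buffer.
Parts are delimited by lines of the form \"--BOUNDARY\"; the line
\"--BOUNDARY--\" closes the last part."
  (let ((delim (concat "^--" (regexp-quote boundary) "\\(--\\)?[ \t]*$"))
        (case-fold-search nil)          ; boundaries are case-sensitive
        (regions nil)
        (start nil))
    (save-excursion
      (goto-char (point-min))
      (while (re-search-forward delim nil t)
        (when start
          (push (cons start (match-beginning 0)) regions))
        ;; After the closing delimiter no further part begins.
        (setq start (if (match-beginning 1)
                        nil
                      (min (point-max) (1+ (match-end 0)))))))
    (nreverse regions)))
```

Text before the first delimiter (the preamble) and after the closing
one (the epilogue) is skipped.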

You should also provide a way of listing MIME bodies found and saving
their raw bytes to a file.  (That's just a matter of applying the
relevant Content-Transfer-Encoding to the MIME body, and then
write-region.)
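
A sketch of the saving step, under some assumptions: the function name
is invented, the body region contains only the encoded text (no part
headers), and only base64 and quoted-printable are undone.  The work
is done in a unibyte temporary buffer so that the file receives the
bytes unchanged.

```elisp
(require 'qp)

;; Hypothetical helper; not part of Rmail.
(defun my-save-mime-body (beg end encoding file)
  "Write the MIME body BEG..END of the current buffer to FILE as raw bytes.
ENCODING is the Content-Transfer-Encoding value, a string or nil."
  (let ((src (current-buffer))
        (enc (downcase (or encoding "7bit"))))
    (with-temp-buffer
      (set-buffer-multibyte nil)
      (insert-buffer-substring src beg end)
      (cond ((string= enc "base64")
             (base64-decode-region (point-min) (point-max)))
            ((string= enc "quoted-printable")
             (quoted-printable-decode-region (point-min) (point-max))))
      ;; Write the bytes as they are, with no character encoding.
      (let ((coding-system-for-write 'binary))
        (write-region (point-min) (point-max) file nil 'silent)))))
```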

 > >         Wash header for presentation, eg:
 > >             Hide non-displayed header
 > >             Decode RFC 2047-encoded headers
 > 
 > OK, this is helpful but I would add that non-displayed headers do not
 > need to be in the view buffer at all.  It contains all the headers or
 > just the displayed headers, depending on the User's current desire.

I find being able to toggle display of the full set of headers useful,
and I use it several times every day.  I would find this easier to
implement if the headers are there but hidden.  YMMV, of course.

 > >         Wash body for presentation, eg:
 > >             Highlight and activate url-like substrings
 > >             Highlight quoted material
 > 
 > I don't believe Rmail does either of these operations now.  Is that
 > your understanding?

I count the interval that I've not used Rmail by decades. :-)  My
contribution is as a standards geek and having gotten my hands dirty
on several MUAs.

URLs are easy, of course:

    (while (re-search-forward url-re nil t)
      (let ((o (make-overlay (match-beginning 0) (match-end 0))))
        (overlay-put o 'face 'url-active-face)
        ;; sorry, this may also be an XEmacs-ism
        (overlay-put o APPROPRIATE-ARGS-TO-ADD-FOLLOW-URL-TO-KEYMAP)))
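
For GNU Emacs, one possible way to fill in the keymap part is the
overlay `keymap' property together with `browse-url'.  Everything
named my-* below is invented for this sketch, and the URL regexp is
deliberately simple; it is an illustration, not what Rmail does.

```elisp
(require 'browse-url)

(defvar my-url-regexp "\\bhttps?://[^ \t\n<>\"]+"
  "Deliberately simple URL pattern, for illustration only.")

(defun my-follow-url ()
  "Visit the URL recorded on the overlay at point, if any."
  (interactive)
  (let ((url (get-char-property (point) 'my-url)))
    (if url
        (browse-url url)
      (message "No URL at point"))))

(defvar my-url-keymap
  (let ((map (make-sparse-keymap)))
    (define-key map (kbd "RET") #'my-follow-url)
    map)
  "Keymap active on text highlighted by `my-highlight-urls'.")

(defun my-highlight-urls ()
  "Highlight each URL in the current buffer and make RET follow it.
Return the number of URLs found."
  (let ((count 0))
    (save-excursion
      (goto-char (point-min))
      (while (re-search-forward my-url-regexp nil t)
        (let ((o (make-overlay (match-beginning 0) (match-end 0))))
          (overlay-put o 'face 'link)
          (overlay-put o 'keymap my-url-keymap)
          (overlay-put o 'my-url (match-string-no-properties 0))
          (setq count (1+ count)))))
    count))
```

Storing the URL on the overlay itself means the follow command does
not have to re-parse the text around point.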

Quoting is harder because of the variety of quoting styles.  You might
want to make this easy for users to configure.  Kyle Jones's filladapt
package is quite good at detecting quoting styles and is configurable.
As you know, Kyle is a curmudgeon about assignment, but reading the
docs for ideas about UI is probably OK (but check with FSF legal or
Richard; IANAL nor an FSF spokesperson).

* Re: Need some help with Rmail/mbox
  2008-09-20  7:12     ` Stephen J. Turnbull
@ 2008-09-20 10:04       ` Daiki Ueno
  2008-09-20 10:19         ` Eli Zaretskii
  2008-09-20 13:48         ` Stephen J. Turnbull
  0 siblings, 2 replies; 33+ messages in thread
From: Daiki Ueno @ 2008-09-20 10:04 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: Paul Michael Reilly, emacs-devel

>>>>> In <871vzfi93y.fsf@xemacs.org> 
>>>>>	"Stephen J. Turnbull" <stephen@xemacs.org> wrote:
> Paul Michael Reilly writes:
>  > But when I started to work on the decoding it seemed that decoding
>  > the string before inserting it seemed like a good idea.

> In XEmacs, string decoding is implemented by copying to a temporary
> buffer and doing decode-coding-region there.  Emacs is likely the
> same.  :-)

Nope, XEmacs does not have the concept of buffer multibyteness.

- If buffer multibyteness is on, both input and output of
`decode-coding-region' are treated as multibyte.  I think the input
should be unibyte since it is byte stream.

- If buffer multibyteness is off, both input and output of
`decode-coding-region' are treated as unibyte.  So, you have to convert
the output to multibyte manually.

I'd recommend to use `decode-coding-string' and `insert' instead of
`decode-coding-region', if unsure.  I heard that the reason why FLIM
does not use `{de|en}code-coding-region' is to avoid this confusion.

Regards,
-- 
Daiki Ueno




* Re: Need some help with Rmail/mbox
  2008-09-20 10:04       ` Daiki Ueno
@ 2008-09-20 10:19         ` Eli Zaretskii
  2008-09-20 10:46           ` Daiki Ueno
  2008-09-20 13:48         ` Stephen J. Turnbull
  1 sibling, 1 reply; 33+ messages in thread
From: Eli Zaretskii @ 2008-09-20 10:19 UTC (permalink / raw)
  To: Daiki Ueno; +Cc: pmr, stephen, emacs-devel

> From: Daiki Ueno <ueno@unixuser.org>
> Date: Sat, 20 Sep 2008 19:04:33 +0900
> Cc: Paul Michael Reilly <pmr@pajato.com>, emacs-devel@gnu.org
> 
> > In XEmacs, string decoding is implemented by copying to a temporary
> > buffer and doing decode-coding-region there.  Emacs is likely the
> > same.  :-)
> 
> Nope, XEmacs does not have the concept of buffer multibyteness.
> 
> - If buffer multibyteness is on, both input and output of
> `decode-coding-region' are treated as multibyte.  I think the input
> should be unibyte since it is byte stream.
> 
> - If buffer multibyteness is off, both input and output of
> `decode-coding-region' are treated as unibyte.  So, you have to convert
> the output to multibyte manually.
> 
> I'd recommend to use `decode-coding-string' and `insert' instead of
> `decode-coding-region', if unsure.

Why? you can always set-buffer-multibyte to the right mode, at least
in Emacs.

Again, operations on buffers are much more efficient in Emacs than on
strings.




* Re: Need some help with Rmail/mbox
  2008-09-20 10:19         ` Eli Zaretskii
@ 2008-09-20 10:46           ` Daiki Ueno
  2008-09-20 11:30             ` Eli Zaretskii
  0 siblings, 1 reply; 33+ messages in thread
From: Daiki Ueno @ 2008-09-20 10:46 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: pmr, stephen, emacs-devel

>>>>> In <utzcbp1b4.fsf@gnu.org> 
>>>>>	Eli Zaretskii <eliz@gnu.org> wrote:
> > I'd recommend to use `decode-coding-string' and `insert' instead of
> > `decode-coding-region', if unsure.

> Why? you can always set-buffer-multibyte to the right mode, at least
> in Emacs.

Yes, we can.  But I'm anxious of that those who are not familiar with
Mule cannot decide which is the right mode.

For example, Gnus' *Original Article* buffer has been multibyte for a
decade, even though its content is a byte stream.  Stefan has proposed
changing it to unibyte, though.

http://article.gmane.org/gmane.emacs.devel/90761

> Again, operations on buffers are much more efficient in Emacs than on
> strings.

Oh, I'll try to do profile later.

Regards,
-- 
Daiki Ueno




* Re: Need some help with Rmail/mbox
  2008-09-20 10:46           ` Daiki Ueno
@ 2008-09-20 11:30             ` Eli Zaretskii
  2008-09-20 23:33               ` Richard M. Stallman
  2008-09-21 13:34               ` Stefan Monnier
  0 siblings, 2 replies; 33+ messages in thread
From: Eli Zaretskii @ 2008-09-20 11:30 UTC (permalink / raw)
  To: Daiki Ueno; +Cc: pmr, stephen, emacs-devel

> From: Daiki Ueno <ueno@unixuser.org>
> Cc: pmr@pajato.com,  stephen@xemacs.org,  emacs-devel@gnu.org
> Date: Sat, 20 Sep 2008 19:46:42 +0900
> 
> >>>>> In <utzcbp1b4.fsf@gnu.org> 
> >>>>>	Eli Zaretskii <eliz@gnu.org> wrote:
> > > I'd recommend to use `decode-coding-string' and `insert' instead of
> > > `decode-coding-region', if unsure.
> 
> > Why? you can always set-buffer-multibyte to the right mode, at least
> > in Emacs.
> 
> Yes, we can.  But I'm anxious of that those who are not familiar with
> Mule cannot decide which is the right mode.

We were discussing Rmail/mbox.  In that context, whenever you need to
decode a message body, the buffer needs to be in unibyte mode.  That's
it.  I think this is simple enough for anyone who writes related code.

> > Again, operations on buffers are much more efficient in Emacs than on
> > strings.
> 
> Oh, I'll try to do profile later.

Please do, but please profile memory, not only CPU.




* Re: Need some help with Rmail/mbox
  2008-09-20 10:04       ` Daiki Ueno
  2008-09-20 10:19         ` Eli Zaretskii
@ 2008-09-20 13:48         ` Stephen J. Turnbull
  2008-09-21  0:57           ` Daiki Ueno
  1 sibling, 1 reply; 33+ messages in thread
From: Stephen J. Turnbull @ 2008-09-20 13:48 UTC (permalink / raw)
  To: Daiki Ueno; +Cc: Paul Michael Reilly, emacs-devel

Daiki Ueno writes:

 > Nope, XEmacs does not have the concept of buffer multibyteness.

True.  It never will, at the Lisp level.

 > I'd recommend using `decode-coding-string' and `insert' instead of
 > `decode-coding-region', if unsure.

How does that help if the target buffer is unibyte?

 > I heard that the reason why FLIM does not use
 > `{de|en}code-coding-region' is to avoid this confusion.

A better strategy would be to force reading the mbox file as multibyte
binary.  It's a little bit inefficient, but not as inefficient as the
human brain, so who cares?




* Re: Need some help with Rmail/mbox
  2008-09-20 11:30             ` Eli Zaretskii
@ 2008-09-20 23:33               ` Richard M. Stallman
  2008-09-21  3:18                 ` Eli Zaretskii
  2008-09-21 13:34               ` Stefan Monnier
  1 sibling, 1 reply; 33+ messages in thread
From: Richard M. Stallman @ 2008-09-20 23:33 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: pmr, stephen, ueno, emacs-devel

    We were discussing Rmail/mbox.  In that context, whenever you need to
    decode a message body, the buffer needs to be in unibyte mode.

How can that be right?  The decoded non-ASCII characters cannot even
exist in a unibyte buffer.

I just did an experiment, and decode-coding-region worked fine in
a multibyte buffer.




* Re: Need some help with Rmail/mbox
  2008-09-20 13:48         ` Stephen J. Turnbull
@ 2008-09-21  0:57           ` Daiki Ueno
  2008-09-22  9:14             ` Stephen J. Turnbull
  0 siblings, 1 reply; 33+ messages in thread
From: Daiki Ueno @ 2008-09-21  0:57 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: Paul Michael Reilly, emacs-devel

>>>>> In <87skrvgc8f.fsf@xemacs.org> 
>>>>>	"Stephen J. Turnbull" <stephen@xemacs.org> wrote:
>  > I'd recommend using `decode-coding-string' and `insert' instead of
>  > `decode-coding-region', if unsure.

> How does that help if the target buffer is unibyte?

In this context, we can assume the target buffer is multibyte.

Pmail uses separate buffers, unlike Rmail, as Paul indicates in
<48D33A10.4040102@pajato.com>.  Let us call the buffer holding the raw
contents of the mbox file A, and the one displaying a message B.

I think the most straightforward way is to:

1. make buffer A unibyte
2. make buffer B multibyte
3. extract a message body from A into a string
4. decode the string
5. insert it into B

The only drawback is inefficiency (if it is measurable).
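
The five steps above might look roughly like this in Emacs Lisp.  This
is only a sketch: every `my-' name is hypothetical, and it assumes the
caller has already worked out the coding system from the message
headers.

```elisp
;; Sketch of steps 1-5; all `my-' names are made up for illustration.
(defun my-show-body-via-string (raw-buffer beg end coding-system view-buffer)
  "Decode bytes BEG..END of unibyte RAW-BUFFER into multibyte VIEW-BUFFER."
  (let ((bytes (with-current-buffer raw-buffer
                 ;; Step 3: in a unibyte buffer this yields a unibyte string.
                 (buffer-substring-no-properties beg end))))
    (with-current-buffer view-buffer
      ;; Steps 4 and 5: decode the bytes, then insert the characters.
      (insert (decode-coding-string bytes coding-system)))))

(defun my-show-body-demo ()
  "Return the text displayed for a tiny UTF-8 message body."
  (with-temp-buffer                       ; plays the role of buffer B
    (set-buffer-multibyte t)              ; step 2
    (let ((view (current-buffer)))
      (with-temp-buffer                   ; plays the role of buffer A
        (set-buffer-multibyte nil)        ; step 1
        (insert "caf\303\251")            ; raw UTF-8 bytes of "café"
        (my-show-body-via-string (current-buffer) (point-min) (point-max)
                                 'utf-8 view)))
    (buffer-string)))
```

The intermediate string is where the extra cost mentioned above would
come from, since the body is copied once more than in a
buffer-to-buffer decode.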

>  > I heard that the reason why FLIM does not use
>  > `{de|en}code-coding-region' is to avoid this confusion.

> A better strategy would be to force reading the mbox file as multibyte
> binary.  It's a little bit inefficient, but not as inefficient as the
> human brain, so who cares?

The term "multibyte binary" looks like an oxymoron to me ;-)

Regards,
-- 
Daiki Ueno

* Re: Need some help with Rmail/mbox
  2008-09-20 23:33               ` Richard M. Stallman
@ 2008-09-21  3:18                 ` Eli Zaretskii
  0 siblings, 0 replies; 33+ messages in thread
From: Eli Zaretskii @ 2008-09-21  3:18 UTC (permalink / raw)
  To: rms; +Cc: pmr, stephen, ueno, emacs-devel

> From: "Richard M. Stallman" <rms@gnu.org>
> CC: ueno@unixuser.org, pmr@pajato.com, stephen@xemacs.org,
> 	emacs-devel@gnu.org
> Date: Sat, 20 Sep 2008 19:33:46 -0400
> 
> I just did an experiment, and decode-coding-region worked fine in
> a multibyte buffer.

It mostly works (because of special treatment of raw bytes in
multibyte buffers), but sometimes it backfires.  It's safer to use
unibyte buffers, in my experience.




* Re: Need some help with Rmail/mbox
  2008-09-20 11:30             ` Eli Zaretskii
  2008-09-20 23:33               ` Richard M. Stallman
@ 2008-09-21 13:34               ` Stefan Monnier
  2008-09-21 17:59                 ` Eli Zaretskii
  1 sibling, 1 reply; 33+ messages in thread
From: Stefan Monnier @ 2008-09-21 13:34 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: pmr, stephen, Daiki Ueno, emacs-devel

> We were discussing Rmail/mbox.  In that context, whenever you need to
> decode a message body, the buffer needs to be in unibyte mode.  That's
> it.  I think this is simple enough for anyone who writes related code.

The only reliable way to do decoding in buffers is by using
the `destination' argument to decode-coding-region so that you can
decode from a unibyte buffer into a multibyte buffer.

Dealing with "bytes in a multibyte buffer" or with "non-ascii chars in
a unibyte buffer" (as is necessarily the case either as source or as
destination if you do the decoding in-place) is just too delicate in my
experience (and of course, it's also somewhat inefficient).
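
A minimal sketch of that approach, assuming the bytes and the coding
system are already known.  `my-decode-bytes' is a hypothetical helper,
not an existing Emacs function; the only real API it relies on is the
optional fourth (DESTINATION) argument of `decode-coding-region'.

```elisp
;; Sketch: decode from a unibyte source into a separate multibyte buffer.
;; `my-decode-bytes' is a hypothetical helper name.
(defun my-decode-bytes (bytes coding-system)
  "Return the text obtained by decoding the unibyte string BYTES."
  (with-temp-buffer                       ; destination buffer
    (set-buffer-multibyte t)
    (let ((dest (current-buffer)))
      (with-temp-buffer                   ; source buffer
        (set-buffer-multibyte nil)
        (insert bytes)
        ;; The 4th argument sends the decoded text to DEST and leaves
        ;; the bytes in the source buffer untouched.
        (decode-coding-region (point-min) (point-max) coding-system dest)))
    (buffer-string)))

;; Example calls: the same word encoded as UTF-8 and as Latin-1.
(my-decode-bytes "caf\303\251" 'utf-8)
(my-decode-bytes "caf\351" 'iso-8859-1)
```

Neither buffer ever holds a mixture of raw bytes and decoded
characters, which is the property being argued for here.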

        Stefan

* Re: Need some help with Rmail/mbox
  2008-09-21 13:34               ` Stefan Monnier
@ 2008-09-21 17:59                 ` Eli Zaretskii
  2008-09-21 19:26                   ` Stefan Monnier
  0 siblings, 1 reply; 33+ messages in thread
From: Eli Zaretskii @ 2008-09-21 17:59 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: pmr, stephen, ueno, emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: Daiki Ueno <ueno@unixuser.org>,  pmr@pajato.com,  stephen@xemacs.org,  emacs-devel@gnu.org
> Date: Sun, 21 Sep 2008 09:34:38 -0400
> 
> The only reliable way to do decoding in buffers is by using
> the `destination' argument to decode-coding-region so that you can
> decode from a unibyte buffer into a multibyte buffer.

Why is that the only reliable method, and what do you suggest as the
value of `destination' argument for it to DTRT?




* Re: Need some help with Rmail/mbox
  2008-09-21 17:59                 ` Eli Zaretskii
@ 2008-09-21 19:26                   ` Stefan Monnier
  2008-09-21 20:56                     ` Eli Zaretskii
  0 siblings, 1 reply; 33+ messages in thread
From: Stefan Monnier @ 2008-09-21 19:26 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: pmr, stephen, ueno, emacs-devel

>> The only reliable way to do decoding in buffers is by using
>> the `destination' argument to decode-coding-region so that you can
>> decode from a unibyte buffer into a multibyte buffer.

> Why is that the only reliable method, and what do you suggest as the
> value of `destination' argument for it to DTRT?

As I said in my message: use the dest arg so as to "decode from
a unibyte buffer into a multibyte buffer", so `destination' should be
... a multibyte buffer.

As for why it's the only reliable method, it's because:

>> Dealing with "bytes in a multibyte buffer" or with "non-ascii chars in
>> a unibyte buffer" (as is necessarily the case either as source or as
>> destination if you do the decoding in-place) is just too delicate in my
>> experience (and of course, it's also somewhat inefficient).

I'm not sure which part of the above paragraph is unclear.


        Stefan




* Re: Need some help with Rmail/mbox
  2008-09-21 19:26                   ` Stefan Monnier
@ 2008-09-21 20:56                     ` Eli Zaretskii
  2008-09-21 22:07                       ` Stefan Monnier
  0 siblings, 1 reply; 33+ messages in thread
From: Eli Zaretskii @ 2008-09-21 20:56 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: pmr, stephen, ueno, emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: ueno@unixuser.org,  pmr@pajato.com,  stephen@xemacs.org,  emacs-devel@gnu.org
> Date: Sun, 21 Sep 2008 15:26:38 -0400
> 
> >> The only reliable way to do decoding in buffers is by using
> >> the `destination' argument to decode-coding-region so that you can
> >> decode from a unibyte buffer into a multibyte buffer.
> 
> > Why is that the only reliable method, and what do you suggest as the
> > value of `destination' argument for it to DTRT?
> 
> As I said in my message: use the dest arg so as to "decode from
> a unibyte buffer into a multibyte buffer", so `destination' should be
> ... a multibyte buffer.

And the source a unibyte one?

> As for why it's the only reliable method, it's because:
> 
> >> Dealing with "bytes in a multibyte buffer" or with "non-ascii chars in
> >> a unibyte buffer" (as is necessarily the case either as source or as
> >> destination if you do the decoding in-place) is just too delicate in my
> >> experience (and of course, it's also somewhat inefficient).
> 
> I'm not sure which part of the above paragraph is unclear.

The fact that other methods are not 100% reliable does not yet mean
that this one is.  I thought you had a more specific explanation why
this method is reliable.




* Re: Need some help with Rmail/mbox
  2008-09-21 20:56                     ` Eli Zaretskii
@ 2008-09-21 22:07                       ` Stefan Monnier
  2008-09-22  3:07                         ` Eli Zaretskii
  2008-09-22  4:31                         ` Kenichi Handa
  0 siblings, 2 replies; 33+ messages in thread
From: Stefan Monnier @ 2008-09-21 22:07 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: pmr, stephen, ueno, emacs-devel

>> >> The only reliable way to do decoding in buffers is by using
>> >> the `destination' argument to decode-coding-region so that you can
>> >> decode from a unibyte buffer into a multibyte buffer.
>> > Why is that the only reliable method, and what do you suggest as the
>> > value of `destination' argument for it to DTRT?
>> As I said in my message: use the dest arg so as to "decode from
>> a unibyte buffer into a multibyte buffer", so `destination' should be
>> ... a multibyte buffer.
> And the source a unibyte one?

Yes, of course.

>> As for why it's the only reliable method, it's because:
>> >> Dealing with "bytes in a multibyte buffer" or with "non-ascii chars in
>> >> a unibyte buffer" (as is necessarily the case either as source or as
>> >> destination if you do the decoding in-place) is just too delicate in my
>> >> experience (and of course, it's also somewhat inefficient).
>> I'm not sure which part of the above paragraph is unclear.
> The fact that other methods are not 100% reliable does not yet mean
> that this one is.  I thought you had a more specific explanation why
> this method is reliable.

No, I don't have such an explanation, except that the most natural input
for decoding is a unibyte (string|buffer) and the most natural output is
a multibyte (string|buffer).  I'd expect that to be pretty obvious.


        Stefan





* Re: Need some help with Rmail/mbox
  2008-09-21 22:07                       ` Stefan Monnier
@ 2008-09-22  3:07                         ` Eli Zaretskii
  2008-09-22  3:36                           ` Stefan Monnier
  2008-09-22  3:41                           ` Daiki Ueno
  2008-09-22  4:31                         ` Kenichi Handa
  1 sibling, 2 replies; 33+ messages in thread
From: Eli Zaretskii @ 2008-09-22  3:07 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: pmr, stephen, ueno, emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: ueno@unixuser.org,  pmr@pajato.com,  stephen@xemacs.org,  emacs-devel@gnu.org
> Date: Sun, 21 Sep 2008 18:07:10 -0400
> 
> >> >> The only reliable way to do decoding in buffers is by using
> >> >> the `destination' argument to decode-coding-region so that you can
> >> >> decode from a unibyte buffer into a multibyte buffer.
> >> > Why is that the only reliable method, and what do you suggest as the
> >> > value of `destination' argument for it to DTRT?
> >> As I said in my message: use the dest arg so as to "decode from
> >> a unibyte buffer into a multibyte buffer", so `destination' should be
> >> ... a multibyte buffer.
> > And the source a unibyte one?
> 
> Yes, of course.
> 
> >> As for why it's the only reliable method, it's because:
> >> >> Dealing with "bytes in a multibyte buffer" or with "non-ascii chars in
> >> >> a unibyte buffer" (as is necessarily the case either as source or as
> >> >> destination if you do the decoding in-place) is just too delicate in my
> >> >> experience (and of course, it's also somewhat inefficient).
> >> I'm not sure which part of the above paragraph is unclear.
> > The fact that other methods are not 100% reliable does not yet mean
> > that this one is.  I thought you had a more specific explanation why
> > this method is reliable.
> 
> No, I don't have such an explanation, except that the most natural input
> for decoding is a unibyte (string|buffer) and the most natural output is
> a multibyte (string|buffer).  I'd expect that to be pretty obvious.

That would mean Rmail/mbox will need to use another unibyte scratch
buffer for decoding MIME-encoded text: first qp- or b64-decode it into
another unibyte buffer, then decode-coding-region from there to the
(multibyte) display buffer.




* Re: Need some help with Rmail/mbox
  2008-09-22  3:07                         ` Eli Zaretskii
@ 2008-09-22  3:36                           ` Stefan Monnier
  2008-09-22  3:41                           ` Daiki Ueno
  1 sibling, 0 replies; 33+ messages in thread
From: Stefan Monnier @ 2008-09-22  3:36 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: pmr, stephen, ueno, emacs-devel

>> No, I don't have such an explanation, except that the most natural input
>> for decoding is a unibyte (string|buffer) and the most natural output is
>> a multibyte (string|buffer).  I'd expect that to be pretty obvious.

> That would mean Rmail/mbox will need to use another unibyte scratch
> buffer for decoding MIME-encoded text: first qp- or b64-decode it into
> another unibyte buffer, then decode-coding-region from there to the
> (multibyte) display buffer.

Sure,


        Stefan




* Re: Need some help with Rmail/mbox
  2008-09-22  3:07                         ` Eli Zaretskii
  2008-09-22  3:36                           ` Stefan Monnier
@ 2008-09-22  3:41                           ` Daiki Ueno
  2008-09-22  3:58                             ` Stefan Monnier
  1 sibling, 1 reply; 33+ messages in thread
From: Daiki Ueno @ 2008-09-22  3:41 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: pmr, stephen, Stefan Monnier, emacs-devel

>>>>> In <u63ooq3nt.fsf@gnu.org> 
>>>>>	Eli Zaretskii <eliz@gnu.org> wrote:
> That would mean Rmail/mbox will need to use another unibyte scratch
> buffer for decoding MIME-encoded text: first qp- or b64-decode it into
> another unibyte buffer, then decode-coding-region from there to the
> (multibyte) display buffer.

Since the output of qp- or b64-decode is unibyte (unlike
decode-coding-region), we can reuse the same source buffer.
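
To illustrate the point that transfer decoding yields bytes rather
than characters, here is a small sketch using the string variant,
`base64-decode-string'.  The helper name `my-base64-bytes' is
hypothetical.

```elisp
;; Sketch: base64 decoding produces a unibyte string of raw byte values.
;; `my-base64-bytes' is a hypothetical helper name.
(defun my-base64-bytes (encoded)
  "Return (MULTIBYTE-P . BYTE-LIST) for the base64 string ENCODED."
  (let ((decoded (base64-decode-string encoded)))
    (cons (multibyte-string-p decoded)    ; nil for a unibyte string
          (append decoded nil))))         ; string -> list of byte values

;; "Y2Fmw6k=" is the base64 form of the five UTF-8 bytes of "café".
(my-base64-bytes "Y2Fmw6k=")
```

The result still has to go through `decode-coding-string' or
`decode-coding-region' before it is readable text.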

Regards,
-- 
Daiki Ueno




* Re: Need some help with Rmail/mbox
  2008-09-22  3:41                           ` Daiki Ueno
@ 2008-09-22  3:58                             ` Stefan Monnier
  2008-09-22 18:48                               ` Eli Zaretskii
  0 siblings, 1 reply; 33+ messages in thread
From: Stefan Monnier @ 2008-09-22  3:58 UTC (permalink / raw)
  To: Daiki Ueno; +Cc: pmr, Eli Zaretskii, stephen, emacs-devel

>> That would mean Rmail/mbox will need to use another unibyte scratch
>> buffer for decoding MIME-encoded text: first qp- or b64-decode it into
>> another unibyte buffer, then decode-coding-region from there to the
>> (multibyte) display buffer.

> Since the output of qp- or b64-decode is unibyte (unlike
> decode-coding-region), we can reuse the same source buffer.

Actually, better not: the real source buffer is the actual mbox file
buffer, i.e. multi-megabyte and that shouldn't be changed unless you
really mean to.  I.e. you could do it in-place, but unless you want to
then save the mbox file back using "content-transfer-encoding: 8bit",
you'd have to be careful to undo the base64/qp decoding afterwards and
make sure that the buffer cannot be saved in the meantime.

        Stefan

* Re: Need some help with Rmail/mbox
  2008-09-21 22:07                       ` Stefan Monnier
  2008-09-22  3:07                         ` Eli Zaretskii
@ 2008-09-22  4:31                         ` Kenichi Handa
  2008-09-22 14:10                           ` Stefan Monnier
  2008-09-22 15:24                           ` Paul Michael Reilly
  1 sibling, 2 replies; 33+ messages in thread
From: Kenichi Handa @ 2008-09-22  4:31 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: pmr, eliz, ueno, stephen, emacs-devel

In pre-unicode-merge Emacs (more exactly, before
2008-03-12), the automatic unibyte -> multibyte conversion
sometimes caused headaches for Emacs Lisp developers
because the behaviour differed in each language environment.
But, with the current Emacs, that conversion is more
developer-friendly; i.e. all bytes with the MSB set are
converted to the corresponding eight-bit characters of the
multibyte representation (* see the attached note).

So, now we have these four ways to get a multibyte buffer
decoded from a unibyte buffer, and they all should work
equally safely.

(1) Do decode-coding-region while specifying a multibyte
buffer as TARGET.

(2) Insert the contents of unibyte buffer into a multibyte
buffer, and then perform decode-coding-region in that
multibyte buffer.

(3) Get a unibyte string from a unibyte buffer, and then
decode it while specifying a multibyte buffer as TARGET.

(4) Decode a unibyte buffer into a multibyte string, and
then insert it into a multibyte buffer.

(Please note that using decode-coding-region directly in a
unibyte buffer is not reliable because if a coding system
has a post-read-conversion function, that function (usually)
works only in a multibyte buffer.)

The efficiency is (1) > (2) > (3) > (4).
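
For reference, options (2), (3) and (4) could be sketched as follows.
The `my-way-N' names are hypothetical, and BYTES is a unibyte string
standing in for the contents of the unibyte buffer.  What is called
TARGET above is the optional DESTINATION argument of
`decode-coding-region' and the optional BUFFER argument of
`decode-coding-string'.

```elisp
;; Sketches of options (2), (3) and (4); `my-way-N' names are hypothetical.

(defun my-way-2 (bytes coding-system)
  "Insert BYTES as raw bytes into a multibyte buffer, then decode in place."
  (with-temp-buffer
    (set-buffer-multibyte t)
    (insert (string-to-multibyte bytes))  ; bytes >= #x80 become eight-bit chars
    (decode-coding-region (point-min) (point-max) coding-system)
    (buffer-string)))

(defun my-way-3 (bytes coding-system)
  "Decode the unibyte string BYTES directly into a multibyte buffer."
  (with-temp-buffer
    (set-buffer-multibyte t)
    ;; 4th argument: insert the decoded text into that buffer.
    (decode-coding-string bytes coding-system nil (current-buffer))
    (buffer-string)))

(defun my-way-4 (bytes coding-system)
  "Decode a unibyte buffer into a string, then insert the string."
  (let ((text (with-temp-buffer
                (set-buffer-multibyte nil)
                (insert bytes)
                ;; DESTINATION t: return the decoded text as a string.
                (decode-coding-region (point-min) (point-max)
                                      coding-system t))))
    (with-temp-buffer
      (set-buffer-multibyte t)
      (insert text)
      (buffer-string))))
```

All three should give the same text for a simple coding system such as
utf-8; they differ only in how many copies of the data are made along
the way.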

And, for the case of Rmail/mbox, before decoding, we may
have to perform base64 or qp decoding, and those functions
can't specify a different buffer/string as target.  And I
don't know if they work for a multibyte buffer/string.

So, at the moment, I think the following strategy is good.

Copy the contents of RMAIL buffer to a temporary unibyte
buffer, perform base64/qp decoding in that buffer, then do
decode-coding-region while specifying the view buffer as
TARGET.
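
A rough sketch of that strategy follows.  `my-render-body' and its
arguments are hypothetical; in particular TRANSFER-ENCODING is assumed
to be a symbol the caller derived from the Content-Transfer-Encoding
header, and CODING-SYSTEM is assumed to be known already.

```elisp
;; Sketch of the strategy above; all `my-' names are hypothetical.
(require 'qp)                             ; quoted-printable-decode-region

(defun my-render-body (mbox-buffer beg end transfer-encoding
                                   coding-system view-buffer)
  "Decode BEG..END of MBOX-BUFFER into VIEW-BUFFER without changing the mbox."
  (with-temp-buffer
    (set-buffer-multibyte nil)            ; temporary unibyte work buffer
    (insert-buffer-substring mbox-buffer beg end)
    (cond ((eq transfer-encoding 'base64)
           (base64-decode-region (point-min) (point-max)))
          ((eq transfer-encoding 'quoted-printable)
           (quoted-printable-decode-region (point-min) (point-max))))
    ;; Charset decoding goes straight into the (multibyte) view buffer.
    (decode-coding-region (point-min) (point-max) coding-system view-buffer)))

(defun my-render-demo ()
  "Return (VIEW-TEXT MBOX-TEXT) after rendering a base64 UTF-8 body."
  (with-temp-buffer                       ; view buffer
    (set-buffer-multibyte t)
    (let ((view (current-buffer)) mbox-text)
      (with-temp-buffer                   ; stand-in for the mbox buffer
        (set-buffer-multibyte nil)
        (insert "Y2Fmw6k=")               ; base64 of the UTF-8 bytes of "café"
        (my-render-body (current-buffer) (point-min) (point-max)
                        'base64 'utf-8 view)
        (setq mbox-text (buffer-string)))
      (list (buffer-string) mbox-text))))
```

Only the temporary buffer is modified by the transfer decoding, so the
mbox buffer keeps its original bytes.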

---
Kenichi Handa
handa@ni.aist.go.jp

* Note: Those eight-bit characters have values
#x3FFF80..#x3FFFFF, and, for instance, char-after and aref
return one of those values.  To get the original byte value,
one needs (encode-char EIGHT-BIT-CHAR 'eight-bit) or
(multibyte-char-to-unibyte EIGHT-BIT-CHAR).  Perhaps, we
have to provide some APIs for directly getting a byte value
of EIGHT-BIT-CHAR, but we have not yet decided what to do.
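
The two conversions named in the note can be tried on a single raw
byte.  This is only a sketch; `my-eight-bit-demo' is a hypothetical
name, and `string-to-multibyte' is used here just to obtain an
eight-bit character to experiment with.

```elisp
;; Sketch: recover the original byte from an eight-bit character.
;; `my-eight-bit-demo' is a hypothetical name.
(defun my-eight-bit-demo ()
  "Return (CHAR BYTE-A BYTE-B) for raw byte #xE9 in multibyte form."
  (let ((ch (aref (string-to-multibyte "\351") 0)))
    (list ch                              ; code in #x3FFF80..#x3FFFFF
          (encode-char ch 'eight-bit)     ; back to the byte value
          (multibyte-char-to-unibyte ch)))) ; same byte, other API
```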

* Re: Need some help with Rmail/mbox
  2008-09-21  0:57           ` Daiki Ueno
@ 2008-09-22  9:14             ` Stephen J. Turnbull
  0 siblings, 0 replies; 33+ messages in thread
From: Stephen J. Turnbull @ 2008-09-22  9:14 UTC (permalink / raw)
  To: Daiki Ueno; +Cc: Paul Michael Reilly, emacs-devel

Daiki Ueno writes:

 > The term "multibyte binary" looks like an oxymoron to me ;-)

It's not, though.  "Multibyte" (more precisely, "variable width")
refers to the representation of integers in the buffer.  "Binary"
refers to the fact that the sequence of characters in the buffer
(interpreted as abstract non-negative integers) is exactly the same as
the sequence of bytes (again, considered as abstract non-negative
integers) in the source.

* Re: Need some help with Rmail/mbox
  2008-09-22  4:31                         ` Kenichi Handa
@ 2008-09-22 14:10                           ` Stefan Monnier
  2008-09-24  0:56                             ` Kenichi Handa
  2008-09-22 15:24                           ` Paul Michael Reilly
  1 sibling, 1 reply; 33+ messages in thread
From: Stefan Monnier @ 2008-09-22 14:10 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: pmr, eliz, ueno, stephen, emacs-devel

> (1) Do decode-coding-region while specifying a multibyte
> buffer as TARGET.

> (2) Insert the contents of unibyte buffer into a multibyte
> buffer, and then perform decode-coding-region in that
> multibyte buffer.

> (3) Get a unibyte string from a unibyte buffer, and then
> decode it while specifying a multibyte buffer as TARGET.

> (4) Decode a unibyte buffer into a multibyte string, and
> then insert it into a multibyte buffer.

> (Please note that using decode-coding-region directly in a
> unibyte buffer is not reliable because if a coding system
> has a post-read-conversion function, that function (usually)
> works only in a multibyte buffer.)

> The efficiency is (1) > (2) > (3) > (4).

I'd have expected 3 to be more efficient than 2 since it doesn't need to
use the variable width multibyte representation of binary data.
[ I'd even expect 3 to be about as efficient as 1. ]

Is this because of the need to copy the string contents to a temp buffer
in order to run any potential pre-read-conversion function?


        Stefan




* Re: Need some help with Rmail/mbox
  2008-09-22  4:31                         ` Kenichi Handa
  2008-09-22 14:10                           ` Stefan Monnier
@ 2008-09-22 15:24                           ` Paul Michael Reilly
  1 sibling, 0 replies; 33+ messages in thread
From: Paul Michael Reilly @ 2008-09-22 15:24 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: stephen, eliz, ueno, Stefan Monnier, emacs-devel

Kenichi Handa wrote:
> In pre-unicode-merge Emacs (more exactly, before
...
> Copy the contents of RMAIL buffer to a temporary unibyte
> buffer, perform base64/qp decoding in that buffer, then do
> decode-coding-region while specifying the view buffer as
> TARGET.

This appears to be the definitive word and is the approach I am using.

Thanks,

-pmr




* Re: Need some help with Rmail/mbox
  2008-09-22  3:58                             ` Stefan Monnier
@ 2008-09-22 18:48                               ` Eli Zaretskii
  0 siblings, 0 replies; 33+ messages in thread
From: Eli Zaretskii @ 2008-09-22 18:48 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: pmr, stephen, ueno, emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Date: Sun, 21 Sep 2008 23:58:49 -0400
> Cc: pmr@pajato.com, Eli Zaretskii <eliz@gnu.org>, stephen@xemacs.org,
> 	emacs-devel@gnu.org
> 
> >> That would mean Rmail/mbox will need to use another unibyte scratch
> >> buffer for decoding MIME-encoded text: first qp- or b64-decode it into
> >> another unibyte buffer, then decode-coding-region from there to the
> >> (multibyte) display buffer.
> 
> > Since the output of qp- or b64-decode is unibyte (unlike
> > decode-coding-region), we can reuse the same source buffer.
> 
> Actually, better not: the real source buffer is the actual mbox file
> buffer

Right, exactly.  Besides, one of the main design goals of Rmail/mbox
was to preserve the original mbox file intact, to avoid irreversible
changes due to some types of decoding.




* Re: Need some help with Rmail/mbox
  2008-09-22 14:10                           ` Stefan Monnier
@ 2008-09-24  0:56                             ` Kenichi Handa
  2008-09-24  2:53                               ` Stefan Monnier
  0 siblings, 1 reply; 33+ messages in thread
From: Kenichi Handa @ 2008-09-24  0:56 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: pmr, eliz, ueno, stephen, emacs-devel

In article <jwv63oojmv8.fsf-monnier+emacs@gnu.org>, Stefan Monnier <monnier@IRO.UMontreal.CA> writes:

> > (1) Do decode-coding-region while specifying a multibyte
> > buffer as TARGET.

> > (2) Insert the contents of unibyte buffer into a multibyte
> > buffer, and then perform decode-coding-region in that
> > multibyte buffer.

> > (3) Get a unibyte string from a unibyte buffer, and then
> > decode it while specifying a multibyte buffer as TARGET.

> > (4) Decode a unibyte buffer into a multibyte string, and
> > then insert it into a multibyte buffer.

> > (Please note that using decode-coding-region directly in a
> > unibyte buffer is not reliable because if a coding system
> > has a post-read-conversion function, that function (usually)
> > works only in a multibyte buffer.)

> > The efficiency is (1) > (2) > (3) > (4).

> I'd have expected 3 to be more efficient than 2 since it doesn't need to
> use the variable width multibyte representation of binary data.
> [ I'd even expect 3 to be about as efficient as 1. ]

> Is this because of the need to copy the string contents to a temp buffer
> in order to run any potential pre-read-conversion function?

We don't have pre-read-conversion but post-read-conversion,
and if the coding system doesn't have post-read-conversion,
a temp buffer is not used.  The reason why I think (2)>(3)
is the cost of making a unibyte string.  And
handling the multibyte representation of binary data within
the decoder/encoder (written in C) is trivial.

---
Kenichi Handa
handa@ni.aist.go.jp





* Re: Need some help with Rmail/mbox
  2008-09-24  0:56                             ` Kenichi Handa
@ 2008-09-24  2:53                               ` Stefan Monnier
  2008-09-24  3:48                                 ` Kenichi Handa
  0 siblings, 1 reply; 33+ messages in thread
From: Stefan Monnier @ 2008-09-24  2:53 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: pmr, eliz, ueno, stephen, emacs-devel

> We don't have pre-read-conversion but post-read-conversion,
> and if the coding system doesn't have post-read-conversion,
> a temp buffer is not used.  The reason why I think (2)>(3)
> is the cost of making a unibyte string.  And

But if we're only talking about the cost of decoding, then that's not
relevant: we may already have the string for some reason.


        Stefan




* Re: Need some help with Rmail/mbox
  2008-09-24  2:53                               ` Stefan Monnier
@ 2008-09-24  3:48                                 ` Kenichi Handa
  0 siblings, 0 replies; 33+ messages in thread
From: Kenichi Handa @ 2008-09-24  3:48 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: pmr, eliz, ueno, stephen, emacs-devel

In article <jwvljxi45ou.fsf-monnier+emacs@gnu.org>, Stefan Monnier <monnier@iro.umontreal.ca> writes:

> > We don't have pre-read-conversion but post-read-conversion,
> > and if the coding system doesn't have post-read-conversion,
> > a temp buffer is not used.  The reason why I think (2)>(3)
> > is the cost of making a unibyte string.  And

> But if we're only talking about the cost of decoding, then that's not
> relevant: we may already have the string for some reason.

Yes, but we are not talking only about the cost of decoding.

I wrote:
> So, now we have these four ways to get a multibyte buffer
> decoded from a unibyte buffer, and they all should work
> equally safely.
[...]

---
Kenichi Handa
handa@ni.aist.go.jp




end of thread, other threads:[~2008-09-24  3:48 UTC | newest]

Thread overview: 33+ messages
2008-09-18 16:02 Need some help with Rmail/mbox Paul Michael Reilly
2008-09-19  3:28 ` Stephen J. Turnbull
2008-09-19  5:35   ` Paul Michael Reilly
2008-09-19  9:32     ` Eli Zaretskii
2008-09-20  7:12     ` Stephen J. Turnbull
2008-09-20 10:04       ` Daiki Ueno
2008-09-20 10:19         ` Eli Zaretskii
2008-09-20 10:46           ` Daiki Ueno
2008-09-20 11:30             ` Eli Zaretskii
2008-09-20 23:33               ` Richard M. Stallman
2008-09-21  3:18                 ` Eli Zaretskii
2008-09-21 13:34               ` Stefan Monnier
2008-09-21 17:59                 ` Eli Zaretskii
2008-09-21 19:26                   ` Stefan Monnier
2008-09-21 20:56                     ` Eli Zaretskii
2008-09-21 22:07                       ` Stefan Monnier
2008-09-22  3:07                         ` Eli Zaretskii
2008-09-22  3:36                           ` Stefan Monnier
2008-09-22  3:41                           ` Daiki Ueno
2008-09-22  3:58                             ` Stefan Monnier
2008-09-22 18:48                               ` Eli Zaretskii
2008-09-22  4:31                         ` Kenichi Handa
2008-09-22 14:10                           ` Stefan Monnier
2008-09-24  0:56                             ` Kenichi Handa
2008-09-24  2:53                               ` Stefan Monnier
2008-09-24  3:48                                 ` Kenichi Handa
2008-09-22 15:24                           ` Paul Michael Reilly
2008-09-20 13:48         ` Stephen J. Turnbull
2008-09-21  0:57           ` Daiki Ueno
2008-09-22  9:14             ` Stephen J. Turnbull
2008-09-19  4:30 ` Richard M. Stallman
2008-09-19  4:30 ` Richard M. Stallman
2008-09-19  9:12 ` Eli Zaretskii
