* Need some help with Rmail/mbox
From: Paul Michael Reilly @ 2008-09-18 16:02 UTC
To: emacs-devel

The basic problem I need to solve now is how to map the values of the
content-type and content-transfer-encoding headers (either of which
could legally be absent) to an Emacs coding system. I am slogging
through this task and if anyone has already done it and has either a
short "how-to" or even better some code, that would be much
appreciated.

As Eli helpfully pointed out, rmail-convert-to-babyl-format provides
some help.

As near as I can tell the task is to decode the message body in two
steps: first to decode according to the character encoding
(e.g. quoted-printable or base64) and then to decode that result to
some coding system. Something along the lines of:

    (let (body)
      (setq body (apply qp or base64 to body of message))
      (decode-coding-string body (detect-coding-string body t)))

Am I even in the ballpark?

-pmr
* Need some help with Rmail/mbox
From: Stephen J. Turnbull @ 2008-09-19 3:28 UTC
To: pmr; +Cc: emacs-devel

Paul Michael Reilly writes:

> As near as I can tell the task is to decode the message body in two
> steps:

But why not just use the existing code to do this? AIUI, the Babyl
format was designed for one-buffer operation on a pseudo-RFC-822
message, so most functions used to wash and display probably assume
that the message is in the current buffer, which is narrowed so that
the presentation header plus the body form an RFC 2822 message.

All you should need to do for a first cut is to copy the message to a
new buffer, which doesn't need to be narrowed, but might need to have
some Babyl sentinels added.

If I'm missing something, feel free to ignore me, but I don't really
understand what all you think is different about presenting a
free-standing RFC 2822 message as opposed to presenting one that is
part of a Babyl-format buffer. I don't think they should be that
different.

The main thing is that the Babyl format caches the set of presentation
headers in the Babyl-format file, but mbox won't. So you'll need to
hide (or remove) the non-presentation headers one-by-one rather than by
just narrowing the buffer.

> first to decode according to the character encoding (e.g. quoted-
> printable or base64) and then to decode that result to some coding
> system.

That's basically it. You should do the processing on buffers, not
strings, though, and

> (decode-coding-string body (detect-coding-string body t))

you want to parse the coding from the *header*, not guess on the body.
If you want you can add guessing and/or user-specified MIME charsets
as a user option, but (a) almost all genuine mail today will contain
an appropriate Content-Type charset parameter, and (b) lack of such
(unless all text is US-ASCII) is an extremely strong indicator of
spam. Few users will need to be able to read messages that have bogus
charset parameters: this feature is not immediately necessary.

The general algorithm should be something like

    Identify message in mbox buffer
    Copy message to presentation buffer
    Identify header and body, add Babyl sentinels if desired
    Parse headers (specifically content type)
    Dispatch on content type and subtype:
      Case type is text and subtype is plain
        Identify charset parameter:
          (or charset-from-content-type "us-ascii")
        Map charset to Emacs coding-system
        (decode-coding-region (body-begin) (body-end) coding-system)
    Wash header for presentation, eg:
      Hide non-displayed header
      Decode RFC 2047-encoded headers
    Wash body for presentation, eg:
      Highlight and activate url-like substrings
      Highlight quoted material
    Display buffer in window
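The "identify charset parameter" and "map charset to Emacs coding-system" steps of the outline above could look roughly like this in Emacs Lisp. This is only a minimal sketch: the function names are hypothetical, the regexp handles just the simple `charset=NAME` and `charset="NAME"` forms, and falling back to `undecided` for a charset Emacs does not know is an assumption rather than something the outline prescribes.

```elisp
(defun sketch-charset-from-content-type (value)
  "Return the charset parameter of the Content-Type header VALUE.
Return nil if VALUE is nil or has no charset parameter.
Hypothetical helper; only simple charset=NAME forms are handled."
  (let ((case-fold-search t))
    (when (and value
               (string-match "charset=\"?\\([^\"; \t\n]+\\)" value))
      (match-string 1 value))))

(defun sketch-coding-system-for-charset (charset)
  "Map MIME CHARSET (a string or nil) to an Emacs coding system.
A missing charset defaults to us-ascii.  An unknown charset falls
back to `undecided', which is an assumption of this sketch."
  (let ((cs (intern (downcase (or charset "us-ascii")))))
    (if (coding-system-p cs) cs 'undecided)))

;; Possible use on the body of a copied message (BODY-BEGIN, BODY-END
;; and CONTENT-TYPE are placeholders for values found while parsing):
;; (decode-coding-region
;;  body-begin body-end
;;  (sketch-coding-system-for-charset
;;   (sketch-charset-from-content-type content-type)))
```

Interning the lowercased charset name relies on Emacs defining coding systems (or aliases) under the usual MIME charset names, which holds for common ones such as utf-8 and iso-8859-1.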
* Re: Need some help with Rmail/mbox
From: Paul Michael Reilly @ 2008-09-19 5:35 UTC
To: Stephen J. Turnbull; +Cc: emacs-devel

Stephen J. Turnbull wrote:

Thanks for stepping up to this. Your help is very much appreciated!

> All you should need to do for a first cut is to copy the message to a
> new buffer, which doesn't need to be narrowed, but might need to have
> some Babyl sentinels added.

I first copy the relevant headers to the view buffer by collecting
them from the PMAIL buffer into a string and insert the string into
the view buffer. I used the rmail.el code pretty much as is but
instead of copying and hiding I do selective copy and insert (ignoring
the case of showing all headers which is trivial).

Then I basically copy the message body into a string and insert it
into the view buffer. But when I started to work on the decoding it
seemed that decoding the string before inserting it seemed like a good
idea. (Pardon my Elisp rustiness ... is it better to use buffer to
buffer copying than insert string?) I copied the logic for this first
part of decoding from rmail-convert-to-babyl-format.

> That's basically it. You should do the processing on buffers, not
> strings, though, and

Are you essentially answering my question above and saying that
copying buffer to buffer is faster/better than operating on strings?

> > (decode-coding-string body (detect-coding-string body t))
>
> you want to parse the coding from the *header*, not guess on the body.

I do parse out quoted-printable and base64 and apply these to the body
before doing the coding system based decoding.
> If you want you can add guessing and/or user-specified MIME charsets
> as a user option, but (a) almost all genuine mail today will contain
> an appropriate Content-Type charset parameter, and (b) lack of such
> (unless all text is US-ASCII) is an extremely strong indicator of
> spam. Few users will need to be able to read messages that have bogus
> charset parameters: this feature is not immediately necessary.

OK, makes sense.

> The general algorithm should be something like
>
> Identify message in mbox buffer

yup

> Copy message to presentation buffer

yup

> Identify header and body, add Babyl sentinels if desired

babyl sentinels? I'm not sure what you mean by this.

> Parse headers (specifically content type)

If you had said content type and content encoding I would have said
"yup" and that is what led to my request for help. Except for the
case of quoted-printable and base64 I'm not sure how to parse those
two headers (Content-Type and Content-Transfer-Encoding) into a coding
system so that I can then do the decoding.

I'm assuming the coding system guesswork becomes relevant for
combinations of the two headers that Rmail does not grok. And I now
see that there is a strong relationship between charset and coding
system.

> Dispatch on content type and subtype:
>   Case type is text and subtype is plain
>     Identify charset parameter:
>       (or charset-from-content-type "us-ascii")
>     Map charset to Emacs coding-system
>     (decode-coding-region (body-begin) (body-end) coding-system)

OK, this is helpful. I assume that for all other type/subtype cases
we punt for now and use guessing or just raw text? But certainly
there are some that we want to process/decode in some fashion,
e.g. text/html or text/xml. Is there another Emacs package/library
that you are aware of that provides a good model for where we want to
take Rmail so that it handles more type/subtype cases seamlessly in
the view buffer? Even perhaps audio and video (not pure MIME,
i.e. multipart ... yet).
> Wash header for presentation, eg:
>   Hide non-displayed header
>   Decode RFC 2047-encoded headers

OK, this is helpful but I would add that non-displayed headers do not
need to be in the view buffer at all. It contains all the headers or
just the displayed headers, depending on the User's current desire.

> Wash body for presentation, eg:
>   Highlight and activate url-like substrings
>   Highlight quoted material

I don't believe Rmail does either of these operations now. Is that
your understanding? If I'm right and this washing is not done, then
it is very high on my priority list to add asap. If I'm wrong then
please point me where it gets done or how to enable it.

> Display buffer in window

yup

Thanks again,

-pmr
* Re: Need some help with Rmail/mbox 2008-09-19 5:35 ` Paul Michael Reilly @ 2008-09-19 9:32 ` Eli Zaretskii 2008-09-20 7:12 ` Stephen J. Turnbull 1 sibling, 0 replies; 33+ messages in thread From: Eli Zaretskii @ 2008-09-19 9:32 UTC (permalink / raw) To: Paul Michael Reilly; +Cc: stephen, emacs-devel > Date: Fri, 19 Sep 2008 01:35:12 -0400 > From: Paul Michael Reilly <pmr@pajato.com> > Cc: emacs-devel@gnu.org > > I first copy the relevant headers to the view buffer by collecting > them from the PMAIL buffer into a string and insert the string into > the view buffer. I hope you use insert-buffer-substring instead of actually making a string and inserting it. Consing large strings is not a good idea, if all you need is copy text from one buffer to another. > Then I basically copy the message body into a string and insert it > into the view buffer. Same comment as above. > But when I started to work on the decoding it seemed that decoding > the string before inserting it seemed like a good idea. Actually, it isn't: in Emacs, whenever you can work on a buffer instead of a string, you should generally prefer a buffer. Specifically, decoding of strings uses scratch buffers behind your back, and you don't gain anything in efficiency. So just copy the text to the view buffer, then decode it in-place. > (Pardon my Elisp rustiness ... is it better to use buffer to > buffer copying than insert string?) Yes. > If you had said content type and content encoding I would have said > "yup" and that is what led to my request for help. Except for the > case of quoted-printable and base64 I'm not sure how to parse those > two headers (Content-Type and Content-Transfer-Encoding) into a coding > system so that I can then do the decoding. You should parse them separately, and use them separately, just like Rmail/Babyl does: first decode qp or b64 into 8-bit encoded bytes, then decode the rest using the charset gleaned from the Content-Type header. 
> I'm assuming the coding system guesswork becomes relevant for
> combinations of the two headers that Rmail does not grok.

This should not happen, in general; but for more robust code, you
could try `undecided' if all else fails; this is what Rmail/Babyl
does. See rmail-decode-region.

> And I now see that there is a strong relationship between charset
> and coding system.

Yes; they are mostly the same. Emacs defines an alias coding-system
for every MIME charset, IIRC.

> OK, this is helpful. I assume that for all other type/subtype cases
> we punt for now and use guessing or just raw text?

It's not raw text, it should be plain ASCII (before you qp- or
b64-decode them; I suggest not to decode their original qp or b64
encoding until you support those additional types). Rmail/Babyl uses
`undecided' for those, and so can you.

> But certainly
> there are some that we want to process/decode in some fashion,
> e.g. text/html or text/xml.

Eventually, yes.

> Is there another Emacs package/library
> that you are aware of that provides a good model for where we want to
> take Rmail so that it handles more type/subtype cases seamlessly in
> the view buffer? Even perhaps audio and video (not pure MIME,
> i.e. multipart ... yet).

Gnus, of course. But again, I suggest not to bother about these
extensions for now: just make Rmail/mbox be no worse than Rmail/Babyl,
so that people could start using it. Extensions can come later.

> > Wash body for presentation, eg:
> >   Highlight and activate url-like substrings
> >   Highlight quoted material
>
> I don't believe Rmail does either of these operations now.

Right, it doesn't. We have ffap and similar features to do that
without highlighting, although highlighting would be nice (again, as
an extension of what Rmail does now).
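The first of the two separate steps described above, undoing the Content-Transfer-Encoding so that only raw 8-bit bytes remain, might be sketched as follows. The helper name `sketch-transfer-decode-region` is hypothetical and this is not Rmail's code; it assumes the `qp` library that ships with Emacs for `quoted-printable-decode-region`, and it assumes the work buffer is unibyte, since the result is bytes rather than characters.

```elisp
(require 'qp)  ; provides `quoted-printable-decode-region'

(defun sketch-transfer-decode-region (start end encoding)
  "Undo Content-Transfer-Encoding ENCODING on START..END in place.
ENCODING is the header value (a string) or nil.  The current buffer
is assumed to be unibyte: the result is a sequence of raw bytes that
still has to be decoded with the coding system for the charset.
Hypothetical helper, not part of Rmail."
  (let ((enc (downcase (or encoding "7bit"))))
    (cond ((string= enc "quoted-printable")
           (quoted-printable-decode-region start end))
          ((string= enc "base64")
           (base64-decode-region start end))
          ;; 7bit, 8bit, binary and unknown values: leave the bytes alone.
          (t nil))))

;; Possible use: copy with `insert-buffer-substring' rather than via a
;; string, then decode in place (MBOX-BUFFER, BODY-START and BODY-END
;; are placeholders):
;; (with-temp-buffer
;;   (set-buffer-multibyte nil)
;;   (insert-buffer-substring mbox-buffer body-start body-end)
;;   (sketch-transfer-decode-region (point-min) (point-max) "base64"))
```

The second step, decoding those bytes with the coding system taken from the charset parameter, is deliberately left out of this sketch.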
* Re: Need some help with Rmail/mbox
From: Stephen J. Turnbull @ 2008-09-20 7:12 UTC
To: Paul Michael Reilly; +Cc: emacs-devel

Paul Michael Reilly writes:

> Thanks for stepping up to this. Your help is very much appreciated!

You're welcome. Eli and Richard have already responded with some
existing Rmail features, but maybe some background (somewhat
duplicating their comments) would be helpful, too.

> I first copy the relevant headers to the view buffer by collecting
> them from the PMAIL buffer into a string and insert the string into
> the view buffer.

Copying to a string uses memory. The amount of memory is not a huge
consideration these days, even with a multimegabyte buffer. But
allocating and deallocating strings is very time-consuming because
malloc requires a system call, and deallocated strings' data gets
compacted, or possibly another system call to deallocate for large
strings (I forget if Emacs uses direct allocation for large strings
instead of expanding the string data pool).

Also, strings are read-only. So to "edit" a string, you actually have
to copy the relevant parts to a new string; if you substitute in the
middle of a string, you have to create a bunch of new strings one for
each fragment, then a final string. Lotsa consing.

> I used the rmail.el code pretty much as is but instead of copying
> and hiding I do selective copy and insert (ignoring the case of
> showing all headers which is trivial).

That's reasonable, I think.

> Then I basically copy the message body into a string and insert it
> into the view buffer.

`insert-buffer-substring' is much more efficient.

> But when I started to work on the decoding it seemed that decoding
> the string before inserting it seemed like a good idea.
In XEmacs, string decoding is implemented by copying to a temporary
buffer and doing decode-coding-region there. Emacs is likely the
same. :-)

> Are you essentially answering my question above and saying that
> copying buffer to buffer is faster/better than operating on strings?

Yes. It's faster and better. Buffers are designed for editing.
Strings are designed for read-only text to save all the editor
overhead that buffers carry around.

Here's just one reason. Emacs strings are *not* arrays of characters,
they are arrays of bytes, which (from Lisp) can only be read at
character boundaries. An ASCII character takes up 1 byte, a Latin-1
character 2 bytes, a Japanese character 3 bytes, and (IIRC) certain
user-defined characters may take 4 bytes! This means that if you
decide to substitute a Latin 1 SMALL LATIN LETTER A WITH GRAVE ACCENT
for ASCII SMALL LATIN LETTER A (thus turning voila into voilà) you
can't do it in a string without allocating a new string.

> I do parse out quoted-printable and base64 and apply these to the body
> before doing the coding system based decoding.

OK.

> > Identify header and body, add Babyl sentinels if desired
>
> babyl sentinels? I'm not sure what you mean by this.

Babyl messages are delimited with "^_" IIRC, and the original headers
with "**** BOOH ****" and "**** EOOH ****" or something like that. I
don't remember whether any code that presents a message uses those
after narrowing (in your implementation, copying), though. If it's
not used, you don't need them.

> "yup" and that is what led to my request for help. Except for the
> case of quoted-printable and base64 I'm not sure how to parse those
> two headers (Content-Type and Content-Transfer-Encoding) into a coding
> system so that I can then do the decoding.

Content-Transfer-Encoding is about how bytes, *not characters*, are
represented.
For practical purposes there are four possibilities: text is
all ASCII (the default, aka 7bit), text is raw unibyte (8bit), text is
encoded as quoted-printable, and text is encoded as BASE64. So you
are done with that. This is entirely independent of Content-Type or
its charset parameter.

> I'm assuming the coding system guesswork becomes relevant for
> combinations of the two headers that Rmail does not grok.

No. If there is no Content-Type header, you "should" assume the MIME
defaults from RFC 2045 (text/plain; charset=US-ASCII). Providing
commands for the user to change those on a per message basis would be
nice, but not needed for a first release as the vast majority of
non-spam mail is MIME-conformant these days.

> And I now see that there is a strong relationship between charset
> and coding system.

Technically, the *MIME charset* concept is broken, or at least a very
poor name. A "character set" is an abstract idea that is (AFAIK)
basically unstandardized. A *coded character set* is an invertible
mapping from a set of non-negative integers to characters. You can
think of Unicode as a universe of characters, although that's not
quite good enough for some esoteric purposes. What Emacs calls a
"charset" is basically a coded character set.

An "encoding" is again an abstract idea which is not really
standardized, but it's pretty close to what Emacs calls a "coding
system", which is a pair of algorithms for decoding an external text
into an Emacs buffer, and for doing the reverse, plus some auxiliary
parameters and functions for specialized purposes (eg, for detecting
the encoding of an unknown text). As you recognized, this is
basically the same thing as a "MIME charset".

You should not need to deal with Emacs charsets, by the way. Just
remember that "MIME charset == Emacs coding-system" and you'll do
fine.

> OK, this is helpful. I assume that for all other type/subtype cases
> we punt for now and use guessing or just raw text?
For text/* types, just use the raw text (there should be a charset
parameter if it is not ASCII).

> But certainly there are some that we want to process/decode in some
> fashion, e.g. text/html or text/xml. Is there another Emacs
> package/library that you are aware of that provides a good model
> for where we want to take Rmail so that it handles more
> type/subtype cases seamlessly in the view buffer?

Gnus, VM, tm (aka "Tiny MIME", obsoleted by SEMI and unsupported),
SEMI (obsolete and unsupported I believe), WEMI (IIRC a C library to
link into Emacs, based on SEMI, obsolete and unsupported I guess),
MH-E, MEW, Wanderlust (these last three I don't know about the
implementations, they may borrow from Gnus).

Both VM and Gnus use the model I suggested of dispatching on type and
subtype. Some naming convention like `mime-handler-TYPE/SUBTYPE'
could be used.

    (let ((handlers (list (intern (format "mime-handler-%s/%s" type subtype))
                          (intern (format "mime-handler-%s/*" type))
                          'mime-handler-*/*))
          handler)
      (while handlers
        (setq handler (car handlers)
              handlers (cdr handlers))
        (if (functionp handler)
            (funcall handler body-start body-end)
          ;; `warn' may be an XEmacs-ism, sorry
          (warn "handler not defined: %s" handler))))

> Even perhaps audio and video (not pure MIME, i.e. multipart
> ... yet).

You *need* multipart as quickly as possible. Too much mail is sent as
multipart. It's not that hard, you just parse the MIME bodies
recursively, and throw away the bodies you don't know how to handle.
I'm sure Rmail already knows how to do this.

You should also provide a way of listing MIME bodies found and saving
their raw bytes to a file. (That's just a matter of applying the
relevant Content-Transfer-Encoding to the MIME body, and then
write-region.)

> > Wash header for presentation, eg:
> >   Hide non-displayed header
> >   Decode RFC 2047-encoded headers
>
> OK, this is helpful but I would add that non-displayed headers do not
> need to be in the view buffer at all. It contains all the headers or
> just the displayed headers, depending on the User's current desire.

I find being able to toggle display of the full set of headers useful,
and I use it several times every day. I would find this easier to
implement if the headers are there but hidden. YMMV, of course.

> > Wash body for presentation, eg:
> >   Highlight and activate url-like substrings
> >   Highlight quoted material
>
> I don't believe Rmail does either of these operations now. Is that
> your understanding?

I count the interval that I've not used Rmail by decades. :-) My
contribution is as a standards geek and having gotten my hands dirty
on several MUAs.

URLs are easy, of course:

    (while (re-search-forward url-re nil t)
      (let ((o (make-overlay (match-beginning 0) (match-end 0))))
        (overlay-put o 'face 'url-active-face)
        ;; sorry, this may also be an XEmacs-ism
        (overlay-put o APPROPRIATE-ARGS-TO-ADD-FOLLOW-URL-TO-KEYMAP)))

Quoting is harder because of the variety of quoting styles. You might
want to make this easy for users to configure. Kyle Jones's filladapt
package is quite good at detecting quoting styles and is configurable.
As you know, Kyle is a curmudgeon about assignment, but reading the
docs for ideas about UI is probably OK (but check with FSF legal or
Richard; IANAL nor an FSF spokesperson).
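A variant of the type/subtype dispatch described in this message, written as a function that calls only the most specific handler that exists, might look like the following. This is a sketch under stated assumptions: it keeps the `mime-handler-TYPE/SUBTYPE` naming convention from the message, but the function `sketch-mime-dispatch`, the stop-at-first-match behaviour, and the example handler are all illustrative choices, not code from VM, Gnus or Rmail.

```elisp
(defun sketch-mime-dispatch (type subtype body-start body-end)
  "Call the most specific MIME handler defined for TYPE/SUBTYPE.
Candidates are tried in the order mime-handler-TYPE/SUBTYPE,
mime-handler-TYPE/*, mime-handler-*/*.  The first one that is
defined is called with BODY-START and BODY-END and its value is
returned.  Return nil if no handler is defined.  Hypothetical helper."
  (let ((candidates
         (list (intern (format "mime-handler-%s/%s"
                               (downcase type) (downcase subtype)))
               (intern (format "mime-handler-%s/*" (downcase type)))
               'mime-handler-*/*))
        handler)
    ;; Stop at the first candidate that names a defined function.
    (while (and candidates (not handler))
      (when (fboundp (car candidates))
        (setq handler (car candidates)))
      (setq candidates (cdr candidates)))
    (if handler
        (funcall handler body-start body-end)
      (message "No MIME handler for %s/%s" type subtype)
      nil)))

;; Example handler for any text/* part; it only reports what it was
;; given, where real code would decode and wash the region.
(defun mime-handler-text/* (body-start body-end)
  "Illustrative handler for text parts between BODY-START and BODY-END."
  (list 'text body-start body-end))
```

With only the example handler defined, `(sketch-mime-dispatch "text" "plain" 1 10)` falls through to `mime-handler-text/*`, while an image part finds no handler and yields nil, which matches the advice to skip bodies you do not know how to handle.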
* Re: Need some help with Rmail/mbox
From: Daiki Ueno @ 2008-09-20 10:04 UTC
To: Stephen J. Turnbull; +Cc: Paul Michael Reilly, emacs-devel

>>>>> In <871vzfi93y.fsf@xemacs.org>
>>>>> "Stephen J. Turnbull" <stephen@xemacs.org> wrote:
> Paul Michael Reilly writes:
> > But when I started to work on the decoding it seemed that decoding
> > the string before inserting it seemed like a good idea.

> In XEmacs, string decoding is implemented by copying to a temporary
> buffer and doing decode-coding-region there. Emacs is likely the
> same. :-)

Nope, XEmacs does not have the concept of buffer multibyteness.

- If buffer multibyteness is on, both input and output of
  `decode-coding-region' are treated as multibyte. I think the input
  should be unibyte since it is byte stream.

- If buffer multibyteness is off, both input and output of
  `decode-coding-region' are treated as unibyte. So, you have to
  convert the output to multibyte manually.

I'd recommend to use `decode-coding-string' and `insert' instead of
`decode-coding-region', if unsure. I heard that the reason why FLIM
does not use `{de|en}code-coding-region' is to avoid this confusion.

Regards,
--
Daiki Ueno
* Re: Need some help with Rmail/mbox
From: Eli Zaretskii @ 2008-09-20 10:19 UTC
To: Daiki Ueno; +Cc: pmr, stephen, emacs-devel

> From: Daiki Ueno <ueno@unixuser.org>
> Date: Sat, 20 Sep 2008 19:04:33 +0900
> Cc: Paul Michael Reilly <pmr@pajato.com>, emacs-devel@gnu.org
>
> > In XEmacs, string decoding is implemented by copying to a temporary
> > buffer and doing decode-coding-region there. Emacs is likely the
> > same. :-)
>
> Nope, XEmacs does not have the concept of buffer multibyteness.
>
> - If buffer multibyteness is on, both input and output of
>   `decode-coding-region' are treated as multibyte. I think the input
>   should be unibyte since it is byte stream.
>
> - If buffer multibyteness is off, both input and output of
>   `decode-coding-region' are treated as unibyte. So, you have to convert
>   the output to multibyte manually.
>
> I'd recommend to use `decode-coding-string' and `insert' instead of
> `decode-coding-region', if unsure.

Why? you can always set-buffer-multibyte to the right mode, at least
in Emacs.

Again, operations on buffers are much more efficient in Emacs than on
strings.
* Re: Need some help with Rmail/mbox
From: Daiki Ueno @ 2008-09-20 10:46 UTC
To: Eli Zaretskii; +Cc: pmr, stephen, emacs-devel

>>>>> In <utzcbp1b4.fsf@gnu.org>
>>>>> Eli Zaretskii <eliz@gnu.org> wrote:
> > I'd recommend to use `decode-coding-string' and `insert' instead of
> > `decode-coding-region', if unsure.

> Why? you can always set-buffer-multibyte to the right mode, at least
> in Emacs.

Yes, we can. But I'm worried that those who are not familiar with
Mule cannot decide which is the right mode. For example, Gnus'
*Original Article* buffer has been multibyte for a decade, even though
its content is a byte stream. Stefan has proposed changing it to
unibyte, though.

http://article.gmane.org/gmane.emacs.devel/90761

> Again, operations on buffers are much more efficient in Emacs than on
> strings.

Oh, I'll try to profile it later.

Regards,
--
Daiki Ueno
* Re: Need some help with Rmail/mbox
From: Eli Zaretskii @ 2008-09-20 11:30 UTC
To: Daiki Ueno; +Cc: pmr, stephen, emacs-devel

> From: Daiki Ueno <ueno@unixuser.org>
> Cc: pmr@pajato.com, stephen@xemacs.org, emacs-devel@gnu.org
> Date: Sat, 20 Sep 2008 19:46:42 +0900
>
> >>>>> In <utzcbp1b4.fsf@gnu.org>
> >>>>> Eli Zaretskii <eliz@gnu.org> wrote:
> > > I'd recommend to use `decode-coding-string' and `insert' instead of
> > > `decode-coding-region', if unsure.
>
> > Why? you can always set-buffer-multibyte to the right mode, at least
> > in Emacs.
>
> Yes, we can. But I'm worried that those who are not familiar with
> Mule cannot decide which is the right mode.

We were discussing Rmail/mbox. In that context, whenever you need to
decode a message body, the buffer needs to be in unibyte mode. That's
it. I think this is simple enough for anyone who writes related code.

> > Again, operations on buffers are much more efficient in Emacs than on
> > strings.
>
> Oh, I'll try to profile it later.

Please do, but please profile memory, not only CPU.
* Re: Need some help with Rmail/mbox
From: Richard M. Stallman @ 2008-09-20 23:33 UTC
To: Eli Zaretskii; +Cc: pmr, stephen, ueno, emacs-devel

    We were discussing Rmail/mbox. In that context, whenever you need to
    decode a message body, the buffer needs to be in unibyte mode.

How can that be right? The decoded non-ASCII characters cannot even
exist in a unibyte buffer.

I just did an experiment, and decode-coding-region worked fine in
a multibyte buffer.
* Re: Need some help with Rmail/mbox
From: Eli Zaretskii @ 2008-09-21 3:18 UTC
To: rms; +Cc: pmr, stephen, ueno, emacs-devel

> From: "Richard M. Stallman" <rms@gnu.org>
> CC: ueno@unixuser.org, pmr@pajato.com, stephen@xemacs.org,
>     emacs-devel@gnu.org
> Date: Sat, 20 Sep 2008 19:33:46 -0400
>
> I just did an experiment, and decode-coding-region worked fine in
> a multibyte buffer.

It mostly works (because of special treatment of raw bytes in
multibyte buffers), but sometimes it backfires. It's safer to use
unibyte buffers, in my experience.
* Re: Need some help with Rmail/mbox
From: Stefan Monnier @ 2008-09-21 13:34 UTC
To: Eli Zaretskii; +Cc: pmr, stephen, Daiki Ueno, emacs-devel

> We were discussing Rmail/mbox. In that context, whenever you need to
> decode a message body, the buffer needs to be in unibyte mode. That's
> it. I think this is simple enough for anyone who writes related code.

The only reliable way to do decoding in buffers is by using
the `destination' argument to decode-coding-region so that you can
decode from a unibyte buffer into a multibyte buffer.

Dealing with "bytes in a multibyte buffer" or with "non-ascii chars in
a unibyte buffer" (as is necessarily the case either as source or as
destination if you do the decoding in-place) is just too delicate in my
experience (and of course, it's also somewhat inefficient).

        Stefan
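The approach suggested in this message, decoding from a unibyte buffer into a separate multibyte buffer through the DESTINATION argument of `decode-coding-region`, can be sketched as below. The helper name `sketch-decode-into` is hypothetical, and the sketch assumes the source buffer already holds transfer-decoded bytes.

```elisp
(defun sketch-decode-into (source start end coding-system destination)
  "Decode bytes START..END of unibyte buffer SOURCE using CODING-SYSTEM.
The decoded text is inserted after point in the multibyte buffer
DESTINATION, so neither buffer ever mixes bytes and characters.
Hypothetical helper illustrating the DESTINATION argument of
`decode-coding-region'."
  (with-current-buffer source
    (decode-coding-region start end coding-system destination)))

;; Possible use with a unibyte work buffer and a multibyte view buffer
;; (WORK-BUFFER and VIEW-BUFFER are placeholders):
;; (sketch-decode-into work-buffer
;;                     (with-current-buffer work-buffer (point-min))
;;                     (with-current-buffer work-buffer (point-max))
;;                     'utf-8 view-buffer)
```

The design choice here is that decoding is never done in place: bytes stay in the unibyte source and characters only ever appear in the multibyte destination.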
* Re: Need some help with Rmail/mbox
From: Eli Zaretskii @ 2008-09-21 17:59 UTC
To: Stefan Monnier; +Cc: pmr, stephen, ueno, emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: Daiki Ueno <ueno@unixuser.org>, pmr@pajato.com, stephen@xemacs.org, emacs-devel@gnu.org
> Date: Sun, 21 Sep 2008 09:34:38 -0400
>
> The only reliable way to do decoding in buffers is by using
> the `destination' argument to decode-coding-region so that you can
> decode from a unibyte buffer into a multibyte buffer.

Why is that the only reliable method, and what do you suggest as the
value of `destination' argument for it to DTRT?
* Re: Need some help with Rmail/mbox
From: Stefan Monnier @ 2008-09-21 19:26 UTC
To: Eli Zaretskii; +Cc: pmr, stephen, ueno, emacs-devel

>> The only reliable way to do decoding in buffers is by using
>> the `destination' argument to decode-coding-region so that you can
>> decode from a unibyte buffer into a multibyte buffer.

> Why is that the only reliable method, and what do you suggest as the
> value of `destination' argument for it to DTRT?

As I said in my message: use the dest arg so as to "decode from
a unibyte buffer into a multibyte buffer", so `destination' should be
... a multibyte buffer.

As for why it's the only reliable method, it's because:

>> Dealing with "bytes in a multibyte buffer" or with "non-ascii chars in
>> a unibyte buffer" (as is necessarily the case either as source or as
>> destination if you do the decoding in-place) is just too delicate in my
>> experience (and of course, it's also somewhat inefficient).

I'm not sure which part of the above paragraph is unclear.

        Stefan
* Re: Need some help with Rmail/mbox
From: Eli Zaretskii @ 2008-09-21 20:56 UTC
To: Stefan Monnier; +Cc: pmr, stephen, ueno, emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: ueno@unixuser.org, pmr@pajato.com, stephen@xemacs.org, emacs-devel@gnu.org
> Date: Sun, 21 Sep 2008 15:26:38 -0400
>
> >> The only reliable way to do decoding in buffers is by using
> >> the `destination' argument to decode-coding-region so that you can
> >> decode from a unibyte buffer into a multibyte buffer.
>
> > Why is that the only reliable method, and what do you suggest as the
> > value of `destination' argument for it to DTRT?
>
> As I said in my message: use the dest arg so as to "decode from
> a unibyte buffer into a multibyte buffer", so `destination' should be
> ... a multibyte buffer.

And the source a unibyte one?

> As for why it's the only reliable method, it's because:
>
> >> Dealing with "bytes in a multibyte buffer" or with "non-ascii chars in
> >> a unibyte buffer" (as is necessarily the case either as source or as
> >> destination if you do the decoding in-place) is just too delicate in my
> >> experience (and of course, it's also somewhat inefficient).
>
> I'm not sure which part of the above paragraph is unclear.

The fact that other methods are not 100% reliable does not yet mean
that this one is. I thought you had a more specific explanation why
this method is reliable.
* Re: Need some help with Rmail/mbox 2008-09-21 20:56 ` Eli Zaretskii @ 2008-09-21 22:07 ` Stefan Monnier 2008-09-22 3:07 ` Eli Zaretskii 2008-09-22 4:31 ` Kenichi Handa 0 siblings, 2 replies; 33+ messages in thread From: Stefan Monnier @ 2008-09-21 22:07 UTC (permalink / raw) To: Eli Zaretskii; +Cc: pmr, stephen, ueno, emacs-devel >> >> The only reliable way to do decoding in buffers is by using >> >> the `destination' argument to decode-coding-region so that you can >> >> decode from a unibyte buffer into a multibyte buffer. >> > Why is that the only reliable method, and what do you suggest as the >> > value of `destination' argument for it to DTRT? >> As I said in my message: use the dest arg so as to "decode from >> a unibyte buffer into a multibyte buffer", so `destination' should be >> ... a multibyte buffer. > And the source a unibyte one? Yes, of course. >> As for why it's the only reliable method, it's because: >> >> Dealing with "bytes in a multibyte buffer" or with "non-ascii chars in >> >> a unibyte buffer" (as is necessarily the case either as source or as >> >> destination if you do the decoding in-place) is just too delicate in my >> >> experience (and of course, it's also somewhat inefficient). >> I'm not sure which part of the above paragraph is unclear. > The fact that other methods are not 100% reliable does not yet mean > that this one is. I thought you had a more specific explanation why > this method is reliable. No, I don't have such an explanation, except that the most natural input for decoding is a unibyte (string|buffer) and the most natural output is a multibyte (string|buffer). I'd expect that to be pretty obvious. Stefan ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Need some help with Rmail/mbox 2008-09-21 22:07 ` Stefan Monnier @ 2008-09-22 3:07 ` Eli Zaretskii 2008-09-22 3:36 ` Stefan Monnier 2008-09-22 3:41 ` Daiki Ueno 2008-09-22 4:31 ` Kenichi Handa 1 sibling, 2 replies; 33+ messages in thread From: Eli Zaretskii @ 2008-09-22 3:07 UTC (permalink / raw) To: Stefan Monnier; +Cc: pmr, stephen, ueno, emacs-devel > From: Stefan Monnier <monnier@iro.umontreal.ca> > Cc: ueno@unixuser.org, pmr@pajato.com, stephen@xemacs.org, emacs-devel@gnu.org > Date: Sun, 21 Sep 2008 18:07:10 -0400 > > >> >> The only reliable way to do decoding in buffers is by using > >> >> the `destination' argument to decode-coding-region so that you can > >> >> decode from a unibyte buffer into a multibyte buffer. > >> > Why is that the only reliable method, and what do you suggest as the > >> > value of `destination' argument for it to DTRT? > >> As I said in my message: use the dest arg so as to "decode from > >> a unibyte buffer into a multibyte buffer", so `destination' should be > >> ... a multibyte buffer. > > And the source a unibyte one? > > Yes, of course. > > >> As for why it's the only reliable method, it's because: > >> >> Dealing with "bytes in a multibyte buffer" or with "non-ascii chars in > >> >> a unibyte buffer" (as is necessarily the case either as source or as > >> >> destination if you do the decoding in-place) is just too delicate in my > >> >> experience (and of course, it's also somewhat inefficient). > >> I'm not sure which part of the above paragraph is unclear. > > The fact that other methods are not 100% reliable does not yet mean > > that this one is. I thought you had a more specific explanation why > > this method is reliable. > > No, I don't have such an explanation, except that the most natural input > for decoding is a unibyte (string|buffer) and the most natural output is > a multibyte (string|buffer). I'd expect that to be pretty obvious. 
That would mean Rmail/mbox will need to use another unibyte scratch buffer for decoding MIME-encoded text: first qp- or b64-decode it into another unibyte buffer, then decode-coding-region from there to the (multibyte) display buffer. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Need some help with Rmail/mbox 2008-09-22 3:07 ` Eli Zaretskii @ 2008-09-22 3:36 ` Stefan Monnier 2008-09-22 3:41 ` Daiki Ueno 1 sibling, 0 replies; 33+ messages in thread From: Stefan Monnier @ 2008-09-22 3:36 UTC (permalink / raw) To: Eli Zaretskii; +Cc: pmr, stephen, ueno, emacs-devel >> No, I don't have such an explanation, except that the most natural input >> for decoding is a unibyte (string|buffer) and the most natural output is >> a multibyte (string|buffer). I'd expect that to be pretty obvious. > That would mean Rmail/mbox will need to use another unibyte scratch > buffer for decoding MIME-encoded text: first qp- or b64-decode it into > another unibyte buffer, then decode-coding-region from there to the > (multibyte) display buffer. Sure, Stefan ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Need some help with Rmail/mbox 2008-09-22 3:07 ` Eli Zaretskii 2008-09-22 3:36 ` Stefan Monnier @ 2008-09-22 3:41 ` Daiki Ueno 2008-09-22 3:58 ` Stefan Monnier 1 sibling, 1 reply; 33+ messages in thread From: Daiki Ueno @ 2008-09-22 3:41 UTC (permalink / raw) To: Eli Zaretskii; +Cc: pmr, stephen, Stefan Monnier, emacs-devel >>>>> In <u63ooq3nt.fsf@gnu.org> >>>>> Eli Zaretskii <eliz@gnu.org> wrote: > That would mean Rmail/mbox will need to use another unibyte scratch > buffer for decoding MIME-encoded text: first qp- or b64-decode it into > another unibyte buffer, then decode-coding-region from there to the > (multibyte) display buffer. Since the output of qp- or b64-decode is unibyte (unlike decode-coding-region), we can reuse the same source buffer. Regards, -- Daiki Ueno ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Need some help with Rmail/mbox 2008-09-22 3:41 ` Daiki Ueno @ 2008-09-22 3:58 ` Stefan Monnier 2008-09-22 18:48 ` Eli Zaretskii 0 siblings, 1 reply; 33+ messages in thread From: Stefan Monnier @ 2008-09-22 3:58 UTC (permalink / raw) To: Daiki Ueno; +Cc: pmr, Eli Zaretskii, stephen, emacs-devel >> That would mean Rmail/mbox will need to use another unibyte scratch >> buffer for decoding MIME-encoded text: first qp- or b64-decode it into >> another unibyte buffer, then decode-coding-region from there to the >> (multibyte) display buffer. > Since the output of qp- or b64-decode is unibyte (unlike > decode-coding-region), we can reuse the same source buffer. Actually, better not: the real source buffer is the actual mbox file buffer, i.e. multi-megabyte and that shouldn't be changed unless you really mean to. I.e. you could do it in-place, but unless you want to then save the mbox file back using "content-transfer-encoding: 8bit", you'd have to be careful to undo the base64/qp decoding afterwards and make sure that the buffer cannot be saved in the mean time. Stefan ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Need some help with Rmail/mbox 2008-09-22 3:58 ` Stefan Monnier @ 2008-09-22 18:48 ` Eli Zaretskii 0 siblings, 0 replies; 33+ messages in thread From: Eli Zaretskii @ 2008-09-22 18:48 UTC (permalink / raw) To: Stefan Monnier; +Cc: pmr, stephen, ueno, emacs-devel > From: Stefan Monnier <monnier@iro.umontreal.ca> > Date: Sun, 21 Sep 2008 23:58:49 -0400 > Cc: pmr@pajato.com, Eli Zaretskii <eliz@gnu.org>, stephen@xemacs.org, > emacs-devel@gnu.org > > >> That would mean Rmail/mbox will need to use another unibyte scratch > >> buffer for decoding MIME-encoded text: first qp- or b64-decode it into > >> another unibyte buffer, then decode-coding-region from there to the > >> (multibyte) display buffer. > > > Since the output of qp- or b64-decode is unibyte (unlike > > decode-coding-region), we can reuse the same source buffer. > > Actually, better not: the real source buffer is the actual mbox file > buffer Right, exactly. Besides, one of the main design goals of Rmail/mbox was to preserve the original mbox file intact, to avoid irreversible changes due to some types of decoding. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Need some help with Rmail/mbox 2008-09-21 22:07 ` Stefan Monnier 2008-09-22 3:07 ` Eli Zaretskii @ 2008-09-22 4:31 ` Kenichi Handa 2008-09-22 14:10 ` Stefan Monnier 2008-09-22 15:24 ` Paul Michael Reilly 1 sibling, 2 replies; 33+ messages in thread From: Kenichi Handa @ 2008-09-22 4:31 UTC (permalink / raw) To: Stefan Monnier; +Cc: pmr, eliz, ueno, stephen, emacs-devel In pre-unicode-merge Emacs (more exactly, before 2008-03-12), the automatic unibyte -> multibyte conversion sometimes caused a headache for Emacs Lisp developers because the behaviour differed in each language environment. But, with the current Emacs, that conversion is more developer-friendly; i.e. all bytes with MSB set are converted to the corresponding eight-bit characters of multibyte representation (* see the attached note). So, now we have these four ways to get a multibyte buffer decoded from a unibyte buffer, and they all should work equally safely. (1) Do decode-coding-region while specifying a multibyte buffer as TARGET. (2) Insert the contents of the unibyte buffer into a multibyte buffer, and then perform decode-coding-region in that multibyte buffer. (3) Get a unibyte string from a unibyte buffer, and then decode it while specifying a multibyte buffer as TARGET. (4) Decode a unibyte buffer into a multibyte string, and then insert it into a multibyte buffer. (Please note that using decode-coding-region directly in a unibyte buffer is not reliable because if a coding system has a post-read-conversion function, that function (usually) works only in a multibyte buffer.) The efficiency is (1) > (2) > (3) > (4). And, for the case of Rmail/mbox, before decoding, we may have to perform base64 or qp decoding, and they can't specify a different buffer/string as target. And I don't know if they work for a multibyte buffer/string. So, at the moment, I think the following strategy is good.
Copy the contents of RMAIL buffer to a temporary unibyte buffer, perform base64/qp decoding in that buffer, then do decode-coding-region while specifying the view buffer as TARGET. --- Kenichi Handa handa@ni.aist.go.jp * Note: Those eight-bit characters have values #x3FFF80..#x3FFFFF, and, for instance, char-after and aref return one of those values. To get the original byte value, one needs (encode-char EIGHT-BIT-CHAR 'eight-bit) or (multibyte-char-to-unibyte EIGHT-BIT-CHAR). Perhaps, we have to provide some APIs for directly getting a byte value of EIGHT-BIT-CHAR, but we have not yet decided what to do. ^ permalink raw reply [flat|nested] 33+ messages in thread
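A minimal sketch of the strategy described above, assuming the mbox buffer holds raw bytes and the view buffer is multibyte. The function name `my-mbox-decode-body` and its argument names are hypothetical and do not come from Rmail; only `base64-decode-region`, `quoted-printable-decode-region` and `decode-coding-region` are existing Emacs functions.

```elisp
(require 'qp)  ; provides `quoted-printable-decode-region'

(defun my-mbox-decode-body (mbox-buffer beg end encoding coding-system view-buffer)
  "Decode the text between BEG and END of MBOX-BUFFER into VIEW-BUFFER.
ENCODING is the symbol `base64', `quoted-printable', or nil (no
transfer encoding).  CODING-SYSTEM names the charset of the body.
MBOX-BUFFER itself is never modified.  All names are hypothetical."
  (with-temp-buffer
    ;; Work on raw bytes: make the scratch buffer unibyte before copying.
    (set-buffer-multibyte nil)
    (insert-buffer-substring mbox-buffer beg end)
    ;; Step 1: undo the transfer encoding in place; the result is still bytes.
    (cond ((eq encoding 'base64)
           (base64-decode-region (point-min) (point-max)))
          ((eq encoding 'quoted-printable)
           (quoted-printable-decode-region (point-min) (point-max))))
    ;; Step 2: decode the bytes into characters, writing straight into
    ;; the multibyte view buffer through the DESTINATION argument.
    (decode-coding-region (point-min) (point-max) coding-system view-buffer)))
```

Because all in-place work happens in the temporary buffer, the mbox buffer stays byte-for-byte intact, which matches the design goal mentioned elsewhere in the thread.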
* Re: Need some help with Rmail/mbox 2008-09-22 4:31 ` Kenichi Handa @ 2008-09-22 14:10 ` Stefan Monnier 2008-09-24 0:56 ` Kenichi Handa 2008-09-22 15:24 ` Paul Michael Reilly 1 sibling, 1 reply; 33+ messages in thread From: Stefan Monnier @ 2008-09-22 14:10 UTC (permalink / raw) To: Kenichi Handa; +Cc: pmr, eliz, ueno, stephen, emacs-devel > (1) Do decode-coding-region while specifying a multibyte > buffer as TARGET. > (2) Insert the contents of the unibyte buffer into a multibyte > buffer, and then perform decode-coding-region in that > multibyte buffer. > (3) Get a unibyte string from a unibyte buffer, and then > decode it while specifying a multibyte buffer as TARGET. > (4) Decode a unibyte buffer into a multibyte string, and > then insert it into a multibyte buffer. > (Please note that using decode-coding-region directly in a > unibyte buffer is not reliable because if a coding system > has a post-read-conversion function, that function (usually) > works only in a multibyte buffer.) > The efficiency is (1) > (2) > (3) > (4). I'd have expected 3 to be more efficient than 2 since it doesn't need to use the variable width multibyte representation of binary data. [ I'd even expect 3 to be about as efficient as 1. ] Is this because of the need to copy the string contents to a temp buffer in order to run any potential pre-read-conversion function? Stefan ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Need some help with Rmail/mbox 2008-09-22 14:10 ` Stefan Monnier @ 2008-09-24 0:56 ` Kenichi Handa 2008-09-24 2:53 ` Stefan Monnier 0 siblings, 1 reply; 33+ messages in thread From: Kenichi Handa @ 2008-09-24 0:56 UTC (permalink / raw) To: Stefan Monnier; +Cc: pmr, eliz, ueno, stephen, emacs-devel In article <jwv63oojmv8.fsf-monnier+emacs@gnu.org>, Stefan Monnier <monnier@IRO.UMontreal.CA> writes: > > (1) Do decode-coding-region while specifying a multibyte > > buffer as TARGET. > > (2) Insert the contents of the unibyte buffer into a multibyte > > buffer, and then perform decode-coding-region in that > > multibyte buffer. > > (3) Get a unibyte string from a unibyte buffer, and then > > decode it while specifying a multibyte buffer as TARGET. > > (4) Decode a unibyte buffer into a multibyte string, and > > then insert it into a multibyte buffer. > > (Please note that using decode-coding-region directly in a > > unibyte buffer is not reliable because if a coding system > > has a post-read-conversion function, that function (usually) > > works only in a multibyte buffer.) > > The efficiency is (1) > (2) > (3) > (4). > I'd have expected 3 to be more efficient than 2 since it doesn't need to > use the variable width multibyte representation of binary data. > [ I'd even expect 3 to be about as efficient as 1. ] > Is this because of the need to copy the string contents to a temp buffer > in order to run any potential pre-read-conversion function? We don't have pre-read-conversion but post-read-conversion, and if the coding system doesn't have post-read-conversion, a temp buffer is not used. The reason why I think (2)>(3) is the cost of making a unibyte string. And handling the multibyte representation of binary data within the decoder/encoder (written in C) is trivial. --- Kenichi Handa handa@ni.aist.go.jp ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Need some help with Rmail/mbox 2008-09-24 0:56 ` Kenichi Handa @ 2008-09-24 2:53 ` Stefan Monnier 2008-09-24 3:48 ` Kenichi Handa 0 siblings, 1 reply; 33+ messages in thread From: Stefan Monnier @ 2008-09-24 2:53 UTC (permalink / raw) To: Kenichi Handa; +Cc: pmr, eliz, ueno, stephen, emacs-devel > We don't have pre-read-conversion but post-read-conversion, > and if the coding system doesn't have post-read-conversion, > a temp buffer is not used. The reason why I think (2)>(3) > is because of a cost of making a unibyte string. And But if we're only talking about the cost of decoding, then that's not relevant: we may already have the string for some reason. Stefan ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Need some help with Rmail/mbox 2008-09-24 2:53 ` Stefan Monnier @ 2008-09-24 3:48 ` Kenichi Handa 0 siblings, 0 replies; 33+ messages in thread From: Kenichi Handa @ 2008-09-24 3:48 UTC (permalink / raw) To: Stefan Monnier; +Cc: pmr, eliz, ueno, stephen, emacs-devel In article <jwvljxi45ou.fsf-monnier+emacs@gnu.org>, Stefan Monnier <monnier@iro.umontreal.ca> writes: > > We don't have pre-read-conversion but post-read-conversion, > > and if the coding system doesn't have post-read-conversion, > > a temp buffer is not used. The reason why I think (2)>(3) > > is because of a cost of making a unibyte string. And > But if we're only talking about the cost of decoding, then that's not > relevant: we may already have the string for some reason. Yes, but we are not only talking about the cost of decoding. I wrote: > So, now we have these four ways to get a multibyte buffer > decoded from a unibyte buffer, and they all should work > equally safely. [...] --- Kenichi Handa handa@ni.aist.go.jp ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Need some help with Rmail/mbox 2008-09-22 4:31 ` Kenichi Handa 2008-09-22 14:10 ` Stefan Monnier @ 2008-09-22 15:24 ` Paul Michael Reilly 1 sibling, 0 replies; 33+ messages in thread From: Paul Michael Reilly @ 2008-09-22 15:24 UTC (permalink / raw) To: Kenichi Handa; +Cc: stephen, eliz, ueno, Stefan Monnier, emacs-devel Kenichi Handa wrote: > In pre-unicode-merge Emacs (more exactly, before ... > Copy the contents of RMAIL buffer to a temporary unibyte > buffer, perform base64/qp decoding in that buffer, then do > decode-coding-region while specifying the view buffer as > TARGET. This appears to be the definitive word and is the approach I am using. Thanks, -pmr ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Need some help with Rmail/mbox 2008-09-20 10:04 ` Daiki Ueno 2008-09-20 10:19 ` Eli Zaretskii @ 2008-09-20 13:48 ` Stephen J. Turnbull 2008-09-21 0:57 ` Daiki Ueno 1 sibling, 1 reply; 33+ messages in thread From: Stephen J. Turnbull @ 2008-09-20 13:48 UTC (permalink / raw) To: Daiki Ueno; +Cc: Paul Michael Reilly, emacs-devel Daiki Ueno writes: > Nope, XEmacs does not have the concept of buffer multibyteness. True. It never will, at the Lisp level. > I'd recommend to use `decode-coding-string' and `insert' instead of > `decode-coding-region', if unsure. How does that help if the target buffer is unibyte? > I heard that the reason why FLIM does not use > `{de|en}code-coding-region' is to avoid this confusion. A better strategy would be to force reading the mbox file as multibyte binary. It's a little bit inefficient, but not as inefficient as the human brain, so who cares? ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Need some help with Rmail/mbox 2008-09-20 13:48 ` Stephen J. Turnbull @ 2008-09-21 0:57 ` Daiki Ueno 2008-09-22 9:14 ` Stephen J. Turnbull 0 siblings, 1 reply; 33+ messages in thread From: Daiki Ueno @ 2008-09-21 0:57 UTC (permalink / raw) To: Stephen J. Turnbull; +Cc: Paul Michael Reilly, emacs-devel >>>>> In <87skrvgc8f.fsf@xemacs.org> >>>>> "Stephen J. Turnbull" <stephen@xemacs.org> wrote: > > I'd recommend to use `decode-coding-string' and `insert' instead of > > `decode-coding-region', if unsure. > How does that help if the target buffer is unibyte? In this context, we can assume the target buffer is multibyte. Pmail uses separate buffers, unlike Rmail, as Paul indicates in <48D33A10.4040102@pajato.com>. Let us call the buffer holding the raw contents of the mbox file A, and the one displaying a message B. I think the most straightforward way is to do: 1. set the buffer A unibyte 2. set the buffer B multibyte 3. extract a message body from A into a string 4. decode the string 5. insert it into B and the only drawback is inefficiency (if it is measurable). > > I heard that the reason why FLIM does not use > > `{de|en}code-coding-region' is to avoid this confusion. > A better strategy would be to force reading the mbox file as multibyte > binary. It's a little bit inefficient, but not as inefficient as the > human brain, so who cares? The term "multibyte binary" looks like an oxymoron to me ;-) Regards, -- Daiki Ueno ^ permalink raw reply [flat|nested] 33+ messages in thread
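The five steps listed in the message above can be sketched as follows. This is only an illustration under stated assumptions: the function name `my-mbox-insert-decoded` and its arguments are hypothetical, buffer A (here RAW-BUFFER) is assumed to have been made unibyte already (step 1), and the transfer encoding (base64 or quoted-printable) is assumed to have been undone beforehand.

```elisp
(defun my-mbox-insert-decoded (raw-buffer beg end coding-system view-buffer)
  "Insert the bytes between BEG and END of RAW-BUFFER, decoded, into VIEW-BUFFER.
RAW-BUFFER is assumed to be unibyte (step 1).  CODING-SYSTEM is the
coding system used for decoding.  All names are hypothetical."
  (let ((body (with-current-buffer raw-buffer
                ;; Step 3: extract the message body as a unibyte string.
                (buffer-substring-no-properties beg end))))
    (with-current-buffer view-buffer
      ;; Step 2: the display buffer holds characters, so it must be multibyte.
      (set-buffer-multibyte t)
      ;; Steps 4 and 5: decode the string, then insert the result at point.
      (insert (decode-coding-string body coding-system)))))
```

The intermediate string is the extra cost the message refers to; in exchange, neither buffer is ever in a mixed state.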
* Re: Need some help with Rmail/mbox 2008-09-21 0:57 ` Daiki Ueno @ 2008-09-22 9:14 ` Stephen J. Turnbull 0 siblings, 0 replies; 33+ messages in thread From: Stephen J. Turnbull @ 2008-09-22 9:14 UTC (permalink / raw) To: Daiki Ueno; +Cc: Paul Michael Reilly, emacs-devel Daiki Ueno writes: > The term "multibyte binary" looks like an oxymoron for me ;-) It's not, though. "Multibyte" (more precisely, "variable width") refers to the representation of integers in the buffer. "Binary" refers to the fact that the sequence of characters in the buffer (interpreted as abstract non-negative integers) is exactly the same as the sequence of bytes (again, considered as abstract non-negative integers) in the source. ^ permalink raw reply [flat|nested] 33+ messages in thread
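A small sketch of what "multibyte binary" means in post-unicode-merge Emacs, where bytes with the high bit set are represented as eight-bit characters. The helper name `my-binary-round-trip` is hypothetical; `string-to-multibyte`, `multibyte-char-to-unibyte` and `string-to-unibyte` are existing Emacs functions.

```elisp
(defun my-binary-round-trip (bytes)
  "Convert the unibyte string BYTES to multibyte binary and back.
Return a list: whether the converted string is multibyte, the byte
value behind each of its characters, and whether converting back
yields BYTES again.  The function name is hypothetical."
  (let ((binary (string-to-multibyte bytes)))
    (list (multibyte-string-p binary)
          ;; Each character, including eight-bit ones, maps back to one byte.
          (mapcar #'multibyte-char-to-unibyte binary)
          ;; The sequence of values is unchanged, so the round trip is lossless.
          (equal (string-to-unibyte binary) bytes))))

;; Example call with the UTF-8 bytes of "c" followed by e-acute:
(my-binary-round-trip (unibyte-string #x63 #xC3 #xA9))
```

The representation is variable width, but the sequence of integer values is the same as in the source bytes, which is the sense of "binary" used in the message above.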
* Re: Need some help with Rmail/mbox 2008-09-18 16:02 Need some help with Rmail/mbox Paul Michael Reilly 2008-09-19 3:28 ` Stephen J. Turnbull @ 2008-09-19 4:30 ` Richard M. Stallman 2008-09-19 4:30 ` Richard M. Stallman 2008-09-19 9:12 ` Eli Zaretskii 3 siblings, 0 replies; 33+ messages in thread From: Richard M. Stallman @ 2008-09-19 4:30 UTC (permalink / raw) To: pmr; +Cc: emacs-devel Congratulations on the new release. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Need some help with Rmail/mbox 2008-09-18 16:02 Need some help with Rmail/mbox Paul Michael Reilly 2008-09-19 3:28 ` Stephen J. Turnbull 2008-09-19 4:30 ` Richard M. Stallman @ 2008-09-19 4:30 ` Richard M. Stallman 2008-09-19 9:12 ` Eli Zaretskii 3 siblings, 0 replies; 33+ messages in thread From: Richard M. Stallman @ 2008-09-19 4:30 UTC (permalink / raw) To: pmr; +Cc: emacs-devel As near as I can tell the task is to decode the message body in two steps: first to decode according to the character encoding (e.g. quoted-printable or base64) and then to decode that result to some coding system. That is correct. Something along the lines of: (let (body) (setq body (apply qp or base64 to body of message) You call `mail-unquote-printable-region' or `base64-decode-region'. They operate on the buffer. (decode-coding-string body (detect-coding-string body t)) Use `decode-coding-region'. It operates on the buffer. When operating on a large amount of text, don't do it in strings. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Need some help with Rmail/mbox 2008-09-18 16:02 Need some help with Rmail/mbox Paul Michael Reilly ` (2 preceding siblings ...) 2008-09-19 4:30 ` Richard M. Stallman @ 2008-09-19 9:12 ` Eli Zaretskii 3 siblings, 0 replies; 33+ messages in thread From: Eli Zaretskii @ 2008-09-19 9:12 UTC (permalink / raw) To: pmr; +Cc: emacs-devel > Date: Thu, 18 Sep 2008 12:02:19 -0400 > From: Paul Michael Reilly <pmr@pajato.com> > > The basic problem I need to solve now is how to map the values of the > content-type and content-transfer-encoding headers (either of which > could legally be absent) to an Emacs coding system. I am slogging > through this task and if anyone has already done it and has either a > short "how-to" or even better some code, that would be much > appreciated. > > As Eli helpfully pointed out, rmail-convert-to-babyl-format provides > some help. Yes, and it already maps the values of content-transfer-encoding into Emacs coding-systems (the mapping is trivial, btw; see rmail-decode-region and its callers). If you still have problems with this after reading the Rmail code, please ask more specific questions. > As near as I can tell the task is to decode the message body in two > steps: first to decode according to the character encoding > (e.g. quoted-printable or base64) and then to decode that result to > some coding system. Something along the lines of: > > (let (body) > (setq body (apply qp or base64 to body of message) > (decode-coding-string body (detect-coding-string body t)) > > Am I even in the ballpark? Yes, this is exactly what rmail-convert-to-babyl-format does. It just assumes that there's only one part in the message, so it does the above only once. You want to do that for every part of a multi-part message. ^ permalink raw reply [flat|nested] 33+ messages in thread
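As a rough illustration of mapping a header value to a coding system, the sketch below reads the charset parameter of a Content-Type value and checks whether Emacs knows a coding system of that name. It is not the code in rmail-decode-region; the function name `my-mime-charset-to-coding-system` is hypothetical, and the fallback to `undecided' (letting Emacs guess) when the header or charset is absent or unknown is an assumption.

```elisp
(defun my-mime-charset-to-coding-system (content-type)
  "Return a coding system for the charset named in CONTENT-TYPE.
CONTENT-TYPE is the value of a Content-Type header, or nil when the
header is absent.  Return `undecided' when no usable charset is found.
The function name and the fallback are assumptions of this sketch."
  (let ((case-fold-search t))           ; header parameters are case-insensitive
    (if (and content-type
             (string-match "charset=\"?\\([^\"; \t\n]+\\)" content-type))
        ;; Many MIME charset names are also coding-system names or aliases.
        (let ((sym (intern (downcase (match-string 1 content-type)))))
          (if (coding-system-p sym) sym 'undecided))
      'undecided)))

;; Example calls:
(my-mime-charset-to-coding-system "text/plain; charset=\"ISO-8859-1\"")
(my-mime-charset-to-coding-system nil)
```

For a multipart message the same lookup would be repeated for the headers of each part.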