From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Stephen J. Turnbull"
Newsgroups: gmane.emacs.devel
Subject: Re: Need some help with Rmail/mbox
Date: Sat, 20 Sep 2008 16:12:33 +0900
Message-ID: <871vzfi93y.fsf@xemacs.org>
References: <87y71o4xw6.fsf@xemacs.org> <48D33A10.4040102@pajato.com>
In-Reply-To: <48D33A10.4040102@pajato.com>
To: Paul Michael Reilly
Cc: emacs-devel@gnu.org

Paul Michael Reilly writes:

> Thanks for stepping up to this.  Your help is very much appreciated!

You're welcome.  Eli and Richard have already responded with some
existing Rmail features, but maybe some background (somewhat
duplicating their comments) would be helpful, too.

> I first copy the relevant headers to the view buffer by collecting
> them from the PMAIL buffer into a string and insert the string into
> the view buffer.

Copying to a string uses memory.  The amount of memory is not a huge
consideration these days, even with a multimegabyte buffer.
But allocating and deallocating strings is very time-consuming,
because malloc may require a system call, and the data of deallocated
strings has to be compacted, or released with possibly another system
call in the case of large strings (I forget if Emacs uses direct
allocation for large strings instead of expanding the string data
pool).  Also, strings are read-only.  So to "edit" a string, you
actually have to copy the relevant parts to a new string; if you
substitute in the middle of a string, you have to create a bunch of
new strings, one for each fragment, and then a final string.  Lotsa
consing.

> I used the rmail.el code pretty much as is but instead of copying
> and hiding I do selective copy and insert (ignoring the case of
> showing all headers which is trivial).

That's reasonable, I think.

> Then I basically copy the message body into a string and insert it
> into the view buffer.

`insert-buffer-substring' is much more efficient.

> But when I started to work on the decoding it seemed that decoding
> the string before inserting it seemed like a good idea.

In XEmacs, string decoding is implemented by copying to a temporary
buffer and doing decode-coding-region there.  Emacs is likely the
same. :-)

> Are you essentially answering my question above and saying that
> copying buffer to buffer is faster/better than operating on strings?

Yes.  It's faster and better.  Buffers are designed for editing.
Strings are designed for read-only text, to save all the editor
overhead that buffers carry around.

Here's just one reason.  Emacs strings are *not* arrays of characters,
they are arrays of bytes, which (from Lisp) can only be read at
character boundaries.  An ASCII character takes up 1 byte, a Latin-1
character 2 bytes, a Japanese character 3 bytes, and (IIRC) certain
user-defined characters may take 4 bytes!
This means that if you decide to substitute a Latin-1 LATIN SMALL
LETTER A WITH GRAVE for ASCII LATIN SMALL LETTER A (thus turning voila
into voilà), the replacement is one byte longer, so you can't do it in
a string without allocating a new string.

> I do parse out quoted-printable and base64 and apply these to the body
> before doing the coding system based decoding.

OK.

> > Identify header and body, add Babyl sentinels if desired
>
> babyl sentinels?  I'm not sure what you mean by this.

Babyl messages are delimited with "^_" IIRC, and the original headers
with "**** BOOH ****" and "**** EOOH ****" or something like that.  I
don't remember whether any code that presents a message uses those
after narrowing (in your implementation, copying), though.  If they're
not used, you don't need them.

> "yup" and that is what led to my request for help.  Except for the
> case of quoted-printable and base64 I'm not sure how to parse those
> two headers (Content-Type and Content-Transfer-Encoding) into a coding
> system so that I can then do the decoding.

Content-Transfer-Encoding is about how bytes, *not characters*, are
represented.  For practical purposes there are four possibilities:
text is all ASCII (the default, aka 7bit), text is raw unibyte (8bit),
text is encoded as quoted-printable, and text is encoded as BASE64.
Since you already handle quoted-printable and BASE64, you are done
with that.  This is entirely independent of Content-Type or its
charset parameter.

> I'm assuming the coding system guesswork becomes relevant for
> combinations of the two headers that Rmail does not grok.

No.  If there is no Content-Type header, you "should" assume the RFC
2045 defaults (text/plain; charset=US-ASCII).  Providing commands for
the user to change those on a per-message basis would be nice, but is
not needed for a first release, as the vast majority of non-spam mail
is MIME-conformant these days.

> And I now see that there is a strong relationship between charset
> and coding system.

Technically, the *MIME charset* concept is broken, or at least very
poorly named.
A "character set" is an abstract idea that is (AFAIK) basically
unstandardized.  A *coded character set* is an invertible mapping from
a set of non-negative integers to characters.  You can think of
Unicode as a universe of characters, although that's not quite good
enough for some esoteric purposes.  What Emacs calls a "charset" is
basically a coded character set.  An "encoding" is again an abstract
idea which is not really standardized, but it's pretty close to what
Emacs calls a "coding system", which is a pair of algorithms for
decoding an external text into an Emacs buffer, and for doing the
reverse, plus some auxiliary parameters and functions for specialized
purposes (eg, for detecting the encoding of an unknown text).  As you
recognized, this is basically the same thing as a "MIME charset".

You should not need to deal with Emacs charsets, by the way.  Just
remember that "MIME charset == Emacs coding-system" and you'll do
fine.

> OK, this is helpful.  I assume that for all other type/subtype cases
> we punt for now and use guessing or just raw text?

For text/* types, just use the raw text (there should be a charset
parameter if it is not ASCII).

> But certainly there are some that we want to process/decode in some
> fashion, e.g. text/html or text/xml.  Is there another Emacs
> package/library that you are aware of that provides a good model
> for where we want to take Rmail so that it handles more
> type/subtype cases seamlessly in the view buffer?

Gnus, VM, tm (aka "Tiny MIME", obsoleted by SEMI and unsupported),
SEMI (obsolete and unsupported I believe), WEMI (IIRC a C library to
link into Emacs, based on SEMI, obsolete and unsupported I guess),
MH-E, Mew, Wanderlust (I don't know about the implementations of these
last three; they may borrow from Gnus).

Both VM and Gnus use the model I suggested of dispatching on type and
subtype.  Some naming convention like `mime-handler-TYPE/SUBTYPE'
could be used.

    (let ((handlers (list (intern (format "mime-handler-%s/%s" type subtype))
                          (intern (format "mime-handler-%s/*" type))
                          'mime-handler-*/*))
          handler)
      ;; Try the most specific handler first, and stop at the first
      ;; one that is actually defined.
      (while handlers
        (setq handler (car handlers)
              handlers (cdr handlers))
        (if (functionp handler)
            (progn
              (funcall handler body-start body-end)
              (setq handlers nil))
          ;; `warn' may be an XEmacs-ism, sorry
          (warn "handler not defined: %s" handler))))

> Even perhaps audio and video (not pure MIME, i.e. multipart
> ... yet).

You *need* multipart as quickly as possible.  Too much mail is sent as
multipart.  It's not that hard, you just parse the MIME bodies
recursively, and throw away the bodies you don't know how to handle.
I'm sure Rmail already knows how to do this.

You should also provide a way of listing the MIME bodies found and
saving their raw bytes to a file.  (That's just a matter of undoing
the relevant Content-Transfer-Encoding on the MIME body, and then
write-region.)

> > Wash header for presentation, eg:
> >   Hide non-displayed header
> >   Decode RFC 2047-encoded headers
>
> OK, this is helpful but I would add that non-displayed headers do not
> need to be in the view buffer at all.  It contains all the headers or
> just the displayed headers, depending on the User's current desire.

I find being able to toggle display of the full set of headers useful,
and I use it several times every day.  I would find this easier to
implement if the headers are there but hidden.  YMMV, of course.

> > Wash body for presentation, eg:
> >   Highlight and activate url-like substrings
> >   Highlight quoted material
>
> I don't believe Rmail does either of these operations now.  Is that
> your understanding?

I count the interval that I've not used Rmail by decades. :-)  My
contribution is as a standards geek who has gotten his hands dirty on
several MUAs.

URLs are easy, of course:

    (while (re-search-forward url-re nil t)
      (let ((o (make-overlay (match-beginning 0) (match-end 0))))
        (overlay-put o 'face 'url-active-face)
        ;; sorry, this may also be an XEmacs-ism
        (overlay-put o APPROPRIATE-ARGS-TO-ADD-FOLLOW-URL-TO-KEYMAP)))

Quoting is harder because of the variety of quoting styles.  You might
want to make this easy for users to configure.  Kyle Jones's filladapt
package is quite good at detecting quoting styles and is configurable.
As you know, Kyle is a curmudgeon about copyright assignment, but
reading the docs for ideas about UI is probably OK (but check with FSF
legal or Richard; IANAL nor an FSF spokesperson).
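
To make the two decoding steps discussed earlier concrete, here is a
minimal sketch for GNU Emacs, not a description of what Rmail does.
It copies a body buffer-to-buffer with `insert-buffer-substring',
undoes the Content-Transfer-Encoding (bytes to bytes), and then
decodes the MIME charset as a coding system (bytes to characters).
The `my-mime-' names are hypothetical, and the charset lookup simply
assumes that the lowercased MIME charset names an Emacs coding system,
which will not hold for every charset.

```elisp
;; Sketch only: the `my-mime-' functions are hypothetical helpers,
;; not part of Rmail or any existing library.
(require 'qp)  ; provides `quoted-printable-decode-region'

(defun my-mime-charset-to-coding-system (charset)
  "Return the coding system named by MIME CHARSET, or nil if unknown.
Assumes the rule of thumb \"MIME charset == Emacs coding system\".
A nil CHARSET means the MIME default, US-ASCII."
  (let ((cs (intern (downcase (or charset "us-ascii")))))
    (and (coding-system-p cs) cs)))

(defun my-mime-insert-decoded-body (src start end encoding charset)
  "Insert the text between START and END of buffer SRC at point, decoded.
ENCODING is the Content-Transfer-Encoding value and CHARSET the MIME
charset parameter; either may be nil, meaning the MIME default."
  (let ((beg (point))
        (cte (downcase (or encoding "7bit")))
        (coding (my-mime-charset-to-coding-system charset)))
    ;; Buffer-to-buffer copy; no intermediate Lisp string is created.
    (insert-buffer-substring src start end)
    (save-restriction
      (narrow-to-region beg (point))
      ;; Step 1: undo the transfer encoding.  7bit and 8bit need no work.
      (cond ((string= cte "base64")
             (base64-decode-region (point-min) (point-max)))
            ((string= cte "quoted-printable")
             (quoted-printable-decode-region (point-min) (point-max))))
      ;; Step 2: decode the charset.  If the charset is unknown,
      ;; leave the raw bytes alone rather than guess.
      (when coding
        (decode-coding-region (point-min) (point-max) coding))
      (goto-char (point-max)))))
```

With the raw message in a buffer RAW and point in the view buffer,
a call such as

    (my-mime-insert-decoded-body RAW body-start body-end
                                 "quoted-printable" "iso-8859-1")

would be the whole job for a single text/plain body.  Keeping the two
steps separate mirrors the point that Content-Transfer-Encoding is
independent of Content-Type and its charset parameter.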