Rmail and the raw-text coding system

unofficial mirror of emacs-devel@gnu.org 

* Rmail and the raw-text coding system
From: Mark Lillibridge @ 2011-01-14 18:14 UTC
  To: emacs-devel; +Cc: monnier


[resend of this message again as eggs.gnu.org refused to accept it for
five days]

----

[this is a follow-up to a previous message which appears to be delayed
so it might appear afterwards]

[all the below is emacs version 22; the code fragments here are for
additional information and probably don't need to be read the first
time]


    Rmail uses encoding and decoding somewhat weirdly because it must
mix messages of different encodings in the same file.  It reads in a
file as follows:

  (let* ((file-name (expand-file-name (or file-name-arg rmail-file-name)))
         ...
	 ;; Since the file may contain messages of different encodings
	 ;; at the tail (non-BABYL part), we can't decode them at once
	 ;; on reading.  So, at first, we read the file without text
	 ;; code conversion, then decode the messages one by one by
	 ;; rmail-decode-babyl-format or
	 ;; rmail-convert-to-babyl-format.
	 (coding-system-for-read (and rmail-enable-multibyte 'raw-text))
	 run-mail-hook msg-shown)
         ...
      (switch-to-buffer
       (let ((enable-local-variables nil))
	 (find-file-noselect file-name))))

That is, it effectively visits the file using the encoding raw-text.
Because of the black magic of raw-text, unlike with most other
encodings, the result is a unibyte buffer.
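
    As a rough illustration of the visiting step (a sketch only, not
code from rmail.el): the function name my-demo-visit-raw-text and the
temporary file are made up for this example.  It writes a few bytes to a
scratch file, visits it with coding-system-for-read bound to raw-text as
in the excerpt above, and reports whether the resulting buffer is
multibyte.

```elisp
(defun my-demo-visit-raw-text ()
  "Visit a scratch file with `raw-text' and describe the resulting buffer.
Hypothetical helper for illustration; not part of Rmail."
  (let ((file (make-temp-file "rmail-raw-text-demo")))
    ;; Write five bytes verbatim, including one non-ASCII byte (0xE9).
    (let ((coding-system-for-write 'binary))
      (write-region "caf\351\n" nil file nil 'silent))
    ;; Visit the file the way the Rmail excerpt does.
    (let* ((coding-system-for-read 'raw-text)
           (enable-local-variables nil)
           (buf (find-file-noselect file)))
      (unwind-protect
          (with-current-buffer buf
            (list :multibyte enable-multibyte-characters
                  :coding buffer-file-coding-system
                  :size (buffer-size)))
        (kill-buffer buf)
        (delete-file file)))))

(my-demo-visit-raw-text)
```

The Emacs Lisp manual describes raw-text as causing a buffer visited
with it to be unibyte, so the :multibyte entry would be expected to come
back nil; I have not checked this on every Emacs version.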


    It then decodes the BABYL message part:

    (unless (and coding-system
		 (coding-system-p coding-system))
      (setq coding-system
	    ;; Emacs 21.1 and later writes RMAIL files in emacs-mule, but
	    ;; earlier versions did that with the current buffer's encoding.
	    ;; So we want to favor detection of emacs-mule (whose normal
	    ;; priority is quite low), but still allow detection of other
	    ;; encodings if emacs-mule won't fit.  The call to
	    ;; detect-coding-with-priority below achieves that.
	    (car (detect-coding-with-priority
		  from to
		  '((coding-category-emacs-mule . emacs-mule))))))
    (unless (memq coding-system
		  '(undecided undecided-unix))
      (set-buffer-modified-p t)		; avoid locking when decoding
      (let ((buffer-undo-list t))
	(decode-coding-region from to coding-system))
      (setq coding-system last-coding-system-used))
    (set-buffer-modified-p modifiedp)
    (setq buffer-file-coding-system nil)
    (setq save-buffer-coding-system
	  (or coding-system 'undecided))))

This process leaves the buffer as a unibyte buffer.  


    It also decodes each non-BABYL message at the end separately.  It
does this after decoding base64 and quoted-printable encoded message
bodies that have type text or message.  (This is a different kind of
decoding than the coding system one.)

		   (let ((mime-charset
			  (if (and rmail-decode-mime-charset
				   (save-excursion
				     (goto-char start)
				     (search-forward "\n\n" nil t)
				     (let ((case-fold-search t))
				       (re-search-backward
					rmail-mime-charset-pattern
					start t))))
			      (intern (downcase (match-string 1))))))
		     (rmail-decode-region start (point) mime-charset)))


;; Decode the region specified by FROM and TO by CODING.
;; If CODING is nil or an invalid coding system, decode by `undecided'.
(defun rmail-decode-region (from to coding)
  (if (or (not coding) (not (coding-system-p coding)))
      (setq coding 'undecided))
  ;; Use -dos decoding, to remove ^M characters left from base64 or
  ;; rogue qp-encoded text.
  (decode-coding-region from to
			(coding-system-change-eol-conversion coding 1))
  ;; Don't reveal the fact we used -dos decoding, as users generally
  ;; will not expect the RMAIL buffer to use DOS EOL format.
  (setq buffer-file-coding-system
	(setq last-coding-system-used
	      (coding-system-change-eol-conversion coding 0))))

Note that if the headers don't specify a coding system, then we fall
back to undecided.  Finally, Rmail converts the buffer to multibyte.  
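
    The fallback and end-of-line handling in rmail-decode-region can be
tried out on a string rather than a buffer region.  The following is a
minimal sketch under that assumption; my-demo-decode-string is a made-up
name, and x-bogus-charset stands in for a charset name that Emacs does
not know.

```elisp
(defun my-demo-decode-string (bytes coding)
  "Decode unibyte string BYTES with CODING, using DOS EOL conversion.
CODING falls back to `undecided' when it is nil or not a coding system,
mirroring the check in `rmail-decode-region'.  Hypothetical helper."
  (unless (and coding (coding-system-p coding))
    (setq coding 'undecided))
  ;; EOL type 1 selects the -dos variant, which strips the ^M of CRLF.
  (decode-coding-string bytes
                        (coding-system-change-eol-conversion coding 1)))

;; A charset named in the headers that Emacs knows about.
(my-demo-decode-string "caf\303\251\r\n" 'utf-8)

;; An unknown charset name: decoding proceeds with undecided-dos.
(my-demo-decode-string "plain\r\n" 'x-bogus-charset)
```

In both calls the carriage return should be gone from the result, since
the -dos variant is what does the decoding.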



    So long as decoding new messages ends up using a coding system other
than raw-text*, this all works correctly.  Unfortunately, it appears
that sometimes decode-coding-region when passed undecided will decide to
use raw-text-unix.  I assume this is due to messages mixing incompatible
encodings (perhaps UTF-8 and Big5?).  I don't know if perfectly valid
messages can cause this problem, but God knows there are enough
badly formatted messages out there that mix formats.
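
    One way to see what undecided picks for a given message is to decode
a sample and look at last-coding-system-used afterwards.  This is only a
sketch: my-demo-undecided-choice is a made-up name and the sample bytes
are invented, not taken from a real message.

```elisp
(defun my-demo-undecided-choice (bytes)
  "Decode unibyte string BYTES with `undecided'.
Return a cons (CODING . DECODED), where CODING is the coding system
Emacs reports having actually used.  Hypothetical helper."
  (let ((decoded (decode-coding-string bytes 'undecided)))
    (cons last-coding-system-used decoded)))

;; Invented sample mixing a UTF-8 sequence with a stray Latin-1 byte.
(my-demo-undecided-choice "caf\303\251 na\357ve\n")
```

Which coding system comes back depends on the language environment and
the coding priorities in effect, so no particular result is claimed here.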

    When visiting a strange file, using raw-text* can make sense, since
the resulting buffer will be unibyte, preserving the exact sequence of
bytes in the file.  When written out, the same bytes will be reproduced.
Unfortunately, however, Rmail buffers are always multibyte (excepting
weird cases where the user has requested all buffers be unibyte),
causing problems.

    Rmail takes the perfectly sensible unibyte raw-text* representation
and converts it to multibyte as part of converting its entire buffer to
multibyte.  This does *not* do what you might expect, namely convert
0x80-0xff characters to raw 8-bit bytes.  Rather, it effectively casts
those bytes directly to the Emacs internal representation unchanged.
(This is true for byte sequences that are valid internal
representations; I believe that malformed internal representation
sequences are escaped so that writing them reproduces the same bytes; in
particular, unaccompanied continuation bytes (160-255) are turned into
raw 8-bit bytes.)
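
    The difference between the two conversions can be sketched as
follows.  Note the assumption: this is written against Emacs 23 or
later, where the internal representation is UTF-8 based, not against the
emacs-mule representation of version 22 discussed here, so the sample
bytes are a UTF-8 sequence.  Both function names are made up.

```elisp
(defun my-demo-reinterpret (bytes)
  "Insert unibyte string BYTES into a unibyte buffer, then make it multibyte.
The bytes are kept as they are and regrouped into characters.
Return the resulting number of characters.  Hypothetical helper."
  (with-temp-buffer
    (set-buffer-multibyte nil)
    (insert bytes)
    (set-buffer-multibyte t)
    (buffer-size)))

(defun my-demo-raw-bytes (bytes)
  "Convert unibyte string BYTES with `string-to-multibyte'.
Each byte above 0x7f becomes one raw 8-bit character.
Return the resulting number of characters.  Hypothetical helper."
  (length (string-to-multibyte bytes)))

;; "\303\251" is the two-byte UTF-8 sequence for U+00E9.
(my-demo-reinterpret "\303\251")
(my-demo-raw-bytes "\303\251")
```

On Emacs 23 or later the first call should return 1, because the two
bytes are taken as one internal character, and the second should return
2, one raw 8-bit character per byte.  The first is the kind of cast
described above; the second is the conversion one might have expected.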

    Note that raw-text* is very weird because it is the only conversion
that does not necessarily leave a valid internal representation in a
unibyte buffer after decoding.  


    The result of all this is a message in the Rmail buffer containing
possibly arbitrary codepoints.  Needless to say, this produces a weird
display instead of the \xxx display the user might expect in this case.
When it comes time to save the Rmail file, we may no longer be able to
use emacs-mule because of the weird code points, forcing us to use
raw-text-unix as the encoding of the Rmail file itself!  This makes
unrmail's life much more difficult, especially in version 23 as the
internal buffer representation has changed.  I'll have more to say about
this in a reply to my previous message.

    I do not believe that any of this causes data loss at the byte level
in and of itself (reading raw-text into unibyte, converting to
multibyte, and then writing it out using encoding scheme raw-text leaves
the bytes unchanged I think, except possibly for the last code point at
the end of the message), but someone who understands these issues better
than me should think hard about this.  Unrmail, if not rewritten
carefully, *will* cause data loss because of this, though...

    I don't know if we are still actively maintaining version 22 of
Rmail, but if so someone should fix Rmail so it no longer uses raw-text*
for decoding messages; instead a coding system should be used which
converts 0x80-0xff to raw 8-bit bytes.  (Is there already such a system?)

- Mark





* Re: Rmail and the raw-text coding system
From: Stefan Monnier @ 2011-01-14 21:00 UTC
  To: mark.lillibridge; +Cc: emacs-devel

>     It then decodes the BABYL message part:

>     (unless (and coding-system
> 		 (coding-system-p coding-system))
>       (setq coding-system
> 	    ;; Emacs 21.1 and later writes RMAIL files in emacs-mule, but
> 	    ;; earlier versions did that with the current buffer's encoding.
> 	    ;; So we want to favor detection of emacs-mule (whose normal
> 	    ;; priority is quite low), but still allow detection of other
> 	    ;; encodings if emacs-mule won't fit.  The call to
> 	    ;; detect-coding-with-priority below achieves that.
> 	    (car (detect-coding-with-priority
> 		  from to
> 		  '((coding-category-emacs-mule . emacs-mule))))))
>     (unless (memq coding-system
> 		  '(undecided undecided-unix))
>       (set-buffer-modified-p t)		; avoid locking when decoding
>       (let ((buffer-undo-list t))
> 	(decode-coding-region from to coding-system))
>       (setq coding-system last-coding-system-used))
>     (set-buffer-modified-p modifiedp)
>     (setq buffer-file-coding-system nil)
>     (setq save-buffer-coding-system
> 	  (or coding-system 'undecided))))

> This process leaves the buffer as a unibyte buffer.  

The question for me is why did it choose raw-text here (which results
indeed in a unibyte buffer)?  It should have been emacs-mule.


        Stefan




* Re: Rmail and the raw-text coding system
From: Mark Lillibridge @ 2011-01-15  0:06 UTC
  To: Stefan Monnier; +Cc: emacs-devel


Stefan wrote:
>  I (Mark) wrote:
>  >     It then decodes the BABYL message part:
>  
>  > ...
>  
>  > This process leaves the buffer as a unibyte buffer.  
>  
>  The question for me is why did it choose raw-text here (which results
>  indeed in a unibyte buffer)?  It should have been emacs-mule.

    I assume it did so because the buffer contained "invalid" code
points.  Remember that loading raw-text then converting to multibyte can
(I believe) produce a buffer with essentially arbitrary bytes modulo no
unaccompanied continuation bytes.  Presumably, the result can be
considered invalid by emacs-mule.

- Mark




* Re: Rmail and the raw-text coding system
From: Stefan Monnier @ 2011-01-15  2:55 UTC
  To: mark.lillibridge; +Cc: emacs-devel

>> I (Mark) wrote:
>> >     It then decodes the BABYL message part:
>> 
>> > ...
>> 
>> > This process leaves the buffer as a unibyte buffer.  
>> 
>> The question for me is why did it choose raw-text here (which results
>> indeed in a unibyte buffer)?  It should have been emacs-mule.

>     I assume it did so because the buffer contained "invalid" code
> points.

That would mean that the BABYL file is corrupted.  Is it?


        Stefan




* Re: Rmail and the raw-text coding system
From: Mark Lillibridge @ 2011-01-16 22:11 UTC
  To: Stefan Monnier; +Cc: emacs-devel


Stefan wrote:
>  I (Mark)  wrote:
>  >     I assume it did so because the buffer contained "invalid" code
>  > points.
>  
>  That would mean that the BABYL file is corrupted.  Is it?

    Not as far as I can tell.  Weird characters are displayed for some
messages, but that is normal with Rmail 22 as it doesn't understand
MIME.  I believe the use of raw-text does not lose data.

- Mark




* Re: Rmail and the raw-text coding system
From: Stefan Monnier @ 2011-01-17 19:19 UTC
  To: mark.lillibridge; +Cc: emacs-devel

>> >     I assume it did so because the buffer contained "invalid" code
>> > points.
>> That would mean that the BABYL file is corrupted.  Is it?

>     Not as far as I can tell.  Weird characters are displayed for some
> messages, but that is normal with Rmail 22 as it doesn't understand
> MIME.  I believe the use of raw-text does not lose data.

The BABYL file is supposed to use the emacs-mule encoding.  So if it
contains invalid emacs-mule byte sequences, it presumably means
it's corrupted.  Of course, maybe they are valid sequences which
Emacs23/24 rejects by mistake, or maybe there's yet something else
going on.

But AFAIK BABYL files use a single encoding for the whole file, and
since around Emacs-21.x that single encoding is supposed to be
emacs-mule (and I seem to remember that the BABYL file is supposed to
contain an annotation at the very beginning saying it's using
emacs-mule, if so).

        Stefan


* Re: Rmail and the raw-text coding system
From: Mark Lillibridge @ 2011-01-17 22:31 UTC
  To: Stefan Monnier; +Cc: emacs-devel


Stefan wrote:
>  >> >     I assume it did so because the buffer contained "invalid" code
>  >> > points.
>  >> That would mean that the BABYL file is corrupted.  Is it?
>  
>  >     Not as far as I can tell.  Weird characters are displayed for some
>  > messages, but that is normal with Rmail 22 as it doesn't understand
>  > MIME.  I believe the use of raw-text does not lose data.
>  
>  The BABYL file is supposed to use the emacs-mule encoding.  So if it
>  contains invalid emacs-mule byte sequences, it presumably means
>  it's corrupted.  Of course, maybe they are valid sequences which
>  Emacs23/24 rejects by mistake, or maybe there's yet something else
>  going on.
>  
>  But AFAIK BABYL files use a single encoding for the whole file, and
>  since around Emacs-21.x that single encoding is supposed to be
>  emacs-mule (and I seem to remember that the BABYL file is supposed to
>  contain an annotation at the very beginning saying it's using
>  emacs-mule, if so).

    Arguably, it is a Rmail 22 bug that some BABYL files are encoded
using raw-text.  This does not necessarily make them "corrupted".  I
have over 85 such files so this problem seems to be fairly common.

    I haven't tried to figure out the logic Emacs 22 uses when trying to
decide if the current buffer can be written out as emacs-mule.  The
weird code points are just a guess on my part at this point.

- Mark




* Re: Rmail and the raw-text coding system
From: Stefan Monnier @ 2011-01-18  2:05 UTC
  To: mark.lillibridge; +Cc: emacs-devel

>> But AFAIK BABYL files use a single encoding for the whole file, and
>> since around Emacs-21.x that single encoding is supposed to be
>> emacs-mule (and I seem to remember that the BABYL file is supposed to
>> contain an annotation at the very beginning saying it's using
>> emacs-mule, if so).

>     Arguably, it is a Rmail 22 bug that some BABYL files are encoded
> using raw-text.

So you're saying that when Emacs-22 wrote those files, it used raw-text?
And that Emacs-22's Rmail reads those files properly?
And that Emacs-22's Rmail reads those files with raw-text?


        Stefan




* Re: Rmail and the raw-text coding system
From: Mark Lillibridge @ 2011-01-19  5:16 UTC
  To: Stefan Monnier; +Cc: emacs-devel


>  >> But AFAIK BABYL files use a single encoding for the whole file, and
>  >> since around Emacs-21.x that single encoding is supposed to be
>  >> emacs-mule (and I seem to remember that the BABYL file is supposed to
>  >> contain an annotation at the very beginning saying it's using
>  >> emacs-mule, if so).
>  
>  >     Arguably, it is a Rmail 22 bug that some BABYL files are encoded
>  > using raw-text.
>  
>  So you're saying that when Emacs-22 wrote those files, it used raw-text?
>  And that Emacs-22's Rmail reads those files properly?
>  And that Emacs-22's Rmail reads those files with raw-text?

Yes, yes, and yes.

The Rmail 22 reading code uses:

    (unless (and coding-system
		 (coding-system-p coding-system))
      (setq coding-system
	    ;; Emacs 21.1 and later writes RMAIL files in emacs-mule, but
	    ;; earlier versions did that with the current buffer's encoding.
	    ;; So we want to favor detection of emacs-mule (whose normal
	    ;; priority is quite low), but still allow detection of other
	    ;; encodings if emacs-mule won't fit.  The call to
	    ;; detect-coding-with-priority below achieves that.
	    (car (detect-coding-with-priority
		  from to
		  '((coding-category-emacs-mule . emacs-mule))))))
    (unless (memq coding-system
		  '(undecided undecided-unix))
      (set-buffer-modified-p t)		; avoid locking when decoding
      (let ((buffer-undo-list t))
	(decode-coding-region from to coding-system))
      (setq coding-system last-coding-system-used))
    (set-buffer-modified-p modifiedp)
    (setq buffer-file-coding-system nil)
    (setq save-buffer-coding-system
	  (or coding-system 'undecided))))

If the file doesn't appear to be valid emacs-mule, it tries raw-text and
succeeds (I think all files are valid raw-text).  On write, it ends up
using undecided-unix (see above), which uses raw-text because the buffer
cannot be written in emacs-mule validly for some reason.  Weird code
points?

    Thus, it both reads and writes using raw-text, preserving the
contents of the buffer so long as we don't upgrade to version 23.

- Mark


