help needed with coding systems (unrmail problems)

* help needed with coding systems (unrmail problems)
@ 2011-01-13 23:22 Mark Lillibridge
  2011-01-14  2:51 ` Stefan Monnier
From: Mark Lillibridge @ 2011-01-13 23:22 UTC
  To: emacs-devel


[eggs.gnu.org took 5 days to bounce this; I have already replied to this
with more information...]


    I'm at my wit's end trying to debug a subtle and nasty unrmail bug
where unrmail mangles the character encodings.  I'm many, many hours
down this particular rathole, but let me try and explain the problem
"briefly".  Please ask for clarification or more experiments as needed.


    Ok, I have a Rmail Babyl file whose contents are correctly encoded
via raw-text-unix (V22) -- for those curious, I believe this can be
caused by receiving a MIME e-mail with two parts with different and
incompatible encodings.  One of the messages contains a Latin-1 u with
two dots over it (below from a V22 emacs):

  character: ü (2300, #o4374, #x8fc, U+00FC)
    charset: latin-iso8859-1
             (Right-Hand Part of Latin Alphabet 1 (ISO/IEC 8859-1): ISO-IR-100.)
 code point: #x7C
     syntax: w  which means: word
   category: l:Latin
buffer code: #x81 #xFC
  file code: #xC3 #xBC (encoded by coding system mule-utf-8-unix)

I have verified that this character is represented on disk as 81 FC
(hex).  If I also visit that file literally, I see \201\374, which is
octal for 81 FC, as expected.
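
The byte values quoted above can be cross-checked from Lisp.  What
follows is only a minimal sketch, assuming Emacs 23 or later (the
Unicode-based internal representation); `my-bytes' is a hypothetical
helper name, not part of Emacs or unrmail.

```elisp
;; Hypothetical helper: return the bytes of a unibyte STRING as a list.
(defun my-bytes (string)
  "Return the bytes of unibyte STRING as a list of integers."
  (append string nil))

;; (string #xFC) builds "ü" without depending on this file's own encoding.
;; The UTF-8 encoding of U+00FC is the pair C3 BC.
(my-bytes (encode-coding-string (string #xFC) 'utf-8))     ; => (195 188)

;; Decoding the on-disk pair 81 FC as emacs-mule gives one character, U+00FC.
(append (decode-coding-string "\201\374" 'emacs-mule) nil) ; => (252)
```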


    When I fire up unrmail on this file, it first reads it in as
"raw-text-unix":

    ;; Read in the old Rmail file with no decoding.
    (let ((coding-system-for-read 'raw-text))
      (insert-file-contents file))
    ;; But make it multibyte.
    (set-buffer-multibyte t)
    (setq buffer-file-coding-system 'raw-text-unix)


It then decodes the main part of the file containing the messages:

      (unless (and coding-system
                   (coding-system-p coding-system))
        (setq coding-system
              ;; Emacs 21.1 and later writes RMAIL files in emacs-mule, but
              ;; earlier versions did that with the current buffer's encoding.
              ;; So we want to favor detection of emacs-mule (whose normal
              ;; priority is quite low), but still allow detection of other
              ;; encodings if emacs-mule won't fit.  The call to
              ;; detect-coding-with-priority below achieves that.
              (car (detect-coding-with-priority
                    from to
                    '((coding-category-emacs-mule . emacs-mule))))))
      (message "decoding file with %s" coding-system)
      (unless (memq coding-system
                    '(undecided undecided-unix))
        (set-buffer-modified-p t)       ; avoid locking when decoding
        (let ((buffer-undo-list t))
          (decode-coding-region from to coding-system))
        (setq coding-system last-coding-system-used))
      (message "actual coding system used: %s" coding-system)

I have verified via the inserted message calls above that it is decoding
using raw-text-unix here.
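
For comparison, here is a small sketch (assuming Emacs 23 or later) of
what decoding those two bytes with raw-text-unix produces.  The variable
name `my-raw' is made up for the illustration.

```elisp
;; Decode the bytes 81 FC with raw-text-unix: no character decoding
;; happens, so the result still has two elements, not the single "ü".
(defvar my-raw (decode-coding-string "\201\374" 'raw-text-unix))

(length my-raw)               ; => 2
(equal my-raw (string #xFC))  ; => nil

;; Decoding the same bytes as emacs-mule does yield "ü".
(equal (decode-coding-string "\201\374" 'emacs-mule)
       (string #xFC))         ; => t
```

If the detection step settles on raw-text-unix, the text handed to the
later encoding step therefore never contains the character ü at all.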


    It then writes out the modified message (after rewriting some
headers and the like; no changes to 8-bit characters) by encoding using
the coding system that message was originally decoded with
(mule-utf-8-unix):

              ;; If the message specifies a coding system, use it.
              (let ((maybe-coding (mail-fetch-field "X-Coding-System")))
                (if maybe-coding
                    (setq coding
                          ;; Force Unix EOLs.
                          (coding-system-change-eol-conversion
                           (intern maybe-coding) 0))
                  ;; If there's no X-Coding-System header, assume the
                  ;; message was never decoded.
                  (setq coding 'raw-text-unix)))
            ...
            ;; Write it to the output file, suitably encoded.
            ;(debug)
            (let ((coding-system-for-write coding))
              (write-region (point-min) (point-max) to-file t
                            'nomsg))
            (message "was %s now %s" coding last-coding-system-used)

Again, I verified via the inserted message call that this is correctly
mule-utf-8-unix.
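
The header-to-coding-system step in the excerpt above can be tried in
isolation.  This is a hedged sketch, not unrmail's code:
`my-coding-from-header' is a hypothetical name, and the
`coding-system-p' check is an addition of mine that the excerpt does not
perform.

```elisp
;; Hypothetical helper mirroring the excerpt above: map the value of an
;; X-Coding-System header (a string or nil) to a coding system with Unix
;; EOLs.  Unlike the excerpt, unknown names also fall back to raw-text-unix.
(defun my-coding-from-header (maybe-coding)
  "Return a Unix-EOL coding system for header value MAYBE-CODING."
  (let ((sym (and maybe-coding (intern maybe-coding))))
    (if (and sym (coding-system-p sym))
        ;; Force Unix EOLs, as the original code does.
        (coding-system-change-eol-conversion sym 0)
      'raw-text-unix)))

(my-coding-from-header "utf-8")           ; => utf-8-unix
(my-coding-from-header nil)               ; => raw-text-unix
(my-coding-from-header "no-such-coding")  ; => raw-text-unix
```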


    In a sane universe, this would result in the message in the output
file containing the UTF-8 for this character, C3 BC.  However, what I
actually get is 81 FC -- the same as we started with!  
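
The difference between the expected and the observed output can be
reproduced on strings.  The following is a sketch under the assumption
of Emacs 23 or later; `my-recode-to-utf-8' is a hypothetical helper, and
utf-8-unix stands in for mule-utf-8-unix.

```elisp
;; Hypothetical helper: decode BYTES with DECODER, re-encode the result
;; as UTF-8, and return the output bytes as a list of integers.
(defun my-recode-to-utf-8 (bytes decoder)
  "Decode unibyte BYTES with DECODER and return its UTF-8 bytes."
  (append (encode-coding-string (decode-coding-string bytes decoder)
                                'utf-8-unix)
          nil))

;; Raw bytes stay raw bytes when encoded: 81 FC comes out unchanged.
(my-recode-to-utf-8 "\201\374" 'raw-text-unix)  ; => (129 252)

;; Once decoded as emacs-mule, the character encodes to C3 BC.
(my-recode-to-utf-8 "\201\374" 'emacs-mule)     ; => (195 188)
```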

    I conjecture that this is caused by the change in Emacs's internal
representation.  Whereas raw-text-unix -> mule-utf-8-unix on V22 is an
encoding change, in V23 it probably is not, at least for sane byte
sequences.  (Remember that we are running unrmail on a V23 emacs.)  Can
anyone verify this conjecture?  Google returns almost nothing useful on
how Emacs's coding systems work.


    Ok, I said, if true, there should be an easy workaround for now: run
unrmail on a V22 emacs instead.  I did so, and the debugging messages
show the same coding system names being used.  However, now the file
contains C2 81 FC, which is still wrong!  More mysteriously, if I read
in that file in V22 using raw-text-unix (being careful to keep Rmail
from starting automatically on the buffer) and then write the file out
using mule-utf-8-unix I *do* get the expected C3 BC.
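
One detail that can be checked independently: C2 81 is the UTF-8
encoding of the control character U+0081.  That suggests, though it is
only a guess and does not establish what Emacs 22 actually did, that the
leading byte \201 was encoded as a character of its own while \374 was
written through unchanged.  The check assumes Emacs 23 or later.

```elisp
;; U+0081 is a C1 control character; its UTF-8 encoding is C2 81.
(append (encode-coding-string (string #x81) 'utf-8) nil)  ; => (194 129)
```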


    So something about the exact way that unrmail is doing things is
messing things up.  As a test, I stopped unrmail after it read in the
file but before it decoded it:

    ;; Read in the old Rmail file with no decoding.
    (let ((coding-system-for-read 'raw-text))
      (insert-file-contents file))
    ;; But make it multibyte.
    (set-buffer-multibyte t)
    (setq buffer-file-coding-system 'raw-text-unix)

If I write out that buffer via write-region using coding system
mule-utf-8-unix, I get the erroneous bytes (C2 81 FC) in the output
file.  The same thing happens if I do this just before the buffer is set
to multibyte.  Mind you, I see the same characters (\201\374) in the
buffer in all three cases before I write it out, so some invisible
property of the buffer must be different.
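
One way to make such invisible state visible is to print the buffer's
multibyte flag together with its character codes.  This is a minimal
sketch, assuming Emacs 23 or later; `my-buffer-text-state' is a
hypothetical helper, and the two buffers are built by hand rather than
by unrmail.

```elisp
;; Hypothetical helper: summarize state that is not visible on screen.
(defun my-buffer-text-state ()
  "Return the current buffer's multibyte flag and its character codes."
  (list :multibyte enable-multibyte-characters
        :chars (append (buffer-string) nil)))

;; Unibyte buffer holding the two bytes 81 FC.
(with-temp-buffer
  (set-buffer-multibyte nil)
  (insert "\201\374")
  (my-buffer-text-state))  ; => (:multibyte nil :chars (129 252))

;; Multibyte buffer holding the same bytes as raw "eight-bit" characters.
(with-temp-buffer
  (insert (string-to-multibyte "\201\374"))
  (my-buffer-text-state))  ; => (:multibyte t :chars (4194177 4194300))
```

Both buffers would display as \201\374, yet they differ in
representation.  Calling the helper at each of the three stopping points
should show which state unrmail's buffer is actually in.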


    So insert-file-contents is doing something that matters differently
from just visiting the file.  Unfortunately, the help documentation for
insert-file-contents gives no help on this.


    Does anyone have any ideas on what might be going on?

- Thanks,
  Mark





* Re: help needed with coding systems (unrmail problems)
  2011-01-13 23:22 help needed with coding systems (unrmail problems) Mark Lillibridge
@ 2011-01-14  2:51 ` Stefan Monnier
  2011-01-14 18:21   ` Mark Lillibridge
From: Stefan Monnier @ 2011-01-14  2:51 UTC
  To: mark.lillibridge; +Cc: emacs-devel

>     Ok, I have a Rmail Babyl file whose contents are correctly encoded
> via raw-text-unix (V22) -- for those curious, I believe this can be

raw-text-unix is an alias for `binary'.  I.e. it takes bytes in and
returns the same bytes unchanged.  Decoding using it should never result
in any non-ascii chars: only ascii chars and "eight-bit chars"
(i.e. bytes between 128-255).

> I have verified that this character is represented on disk as 81 FC
> (hex).  If I visit that file literally (also), I see \201\374, which is
> octal for 81 FC as expected.

>     When I fire up unrmail on this file, it first reads it in as
> "raw-text-unix":

I.e. it read it literally.

> It then decodes the main part of the file containing the messages:

>       (unless (and coding-system
>                    (coding-system-p coding-system))
>         (setq coding-system
>               ;; Emacs 21.1 and later writes RMAIL files in emacs-mule, but
>               ;; earlier versions did that with the current buffer's encoding.
>               ;; So we want to favor detection of emacs-mule (whose normal
>               ;; priority is quite low), but still allow detection of other
>               ;; encodings if emacs-mule won't fit.  The call to
>               ;; detect-coding-with-priority below achieves that.
>               (car (detect-coding-with-priority
>                     from to
>                     '((coding-category-emacs-mule . emacs-mule))))))
>       (message "decoding file with %s" coding-system)
>       (unless (memq coding-system
>                     '(undecided undecided-unix))
>         (set-buffer-modified-p t)       ; avoid locking when decoding
>         (let ((buffer-undo-list t))
>           (decode-coding-region from to coding-system))
>         (setq coding-system last-coding-system-used))
>       (message "actual coding system used: %s" coding-system)

> I have verified via the inserted message calls above that it is decoding
> using raw-text-unix here.

Sounds like you have a problem here: it should be using emacs-mule
(since \201\374 is the emacs-mule encoding of ü).


        Stefan




* Re: help needed with coding systems (unrmail problems)
  2011-01-14  2:51 ` Stefan Monnier
@ 2011-01-14 18:21   ` Mark Lillibridge
From: Mark Lillibridge @ 2011-01-14 18:21 UTC
  To: Stefan Monnier; +Cc: emacs-devel


Stefan wrote:
>  I (Mark) wrote:
>  >     Ok, I have a Rmail Babyl file whose contents are correctly encoded
>  > via raw-text-unix (V22) -- for those curious, I believe this can be
>  
>  raw-text-unix is an alias for `binary'.  I.e. it takes bytes in and
>  returns the same bytes unchanged.

    I think you are thinking of raw-text; I believe raw-text-unix does
end of line conversion (only).
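
The end-of-line behaviour of these coding systems can be checked
directly rather than argued from memory.  A small sketch using only
`coding-system-eol-type', whose docstring says that 0 means LF, 1 means
CRLF, 2 means CR, and a vector means the convention is detected when
decoding:

```elisp
;; raw-text leaves the EOL convention to be detected (a vector of variants).
(coding-system-eol-type 'raw-text)
;; => [raw-text-unix raw-text-dos raw-text-mac]

;; raw-text-unix fixes the EOL type to 0 (LF).
(coding-system-eol-type 'raw-text-unix)  ; => 0

;; binary likewise has EOL type 0.
(coding-system-eol-type 'binary)         ; => 0
```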


>  Decoding using it should never result
>  in any non-ascii chars: only ascii chars and "eight-bit chars"
>  (i.e. bytes between 128-255).

    This is only true if you either read to a multibyte buffer or read
to a unibyte buffer and then never convert it to multibyte (Rmail does
the latter).


>  > I have verified that this character is represented on disk as 81 FC
>  > (hex).  If I visit that file literally (also), I see \201\374, which is
>  > octal for 81 FC as expected.
>  
>  >     When I fire up unrmail on this file, it first reads it in as
>  > "raw-text-unix":
>  
>  I.e. it read it literally.

I think these concepts are also not equivalent for subtle reasons.


>  > It then decodes the main part of the file containing the messages:
>  
>  >       (unless (and coding-system
>  >                    (coding-system-p coding-system))
>  >         (setq coding-system
>  >               ;; Emacs 21.1 and later writes RMAIL files in emacs-mule, but
>  >               ;; earlier versions did that with the current buffer's encoding.
>  >               ;; So we want to favor detection of emacs-mule (whose normal
>  >               ;; priority is quite low), but still allow detection of other
>  >               ;; encodings if emacs-mule won't fit.  The call to
>  >               ;; detect-coding-with-priority below achieves that.
>  >               (car (detect-coding-with-priority
>  >                     from to
>  >                     '((coding-category-emacs-mule . emacs-mule))))))
>  >       (message "decoding file with %s" coding-system)
>  >       (unless (memq coding-system
>  >                     '(undecided undecided-unix))
>  >         (set-buffer-modified-p t)       ; avoid locking when decoding
>  >         (let ((buffer-undo-list t))
>  >           (decode-coding-region from to coding-system))
>  >         (setq coding-system last-coding-system-used))
>  >       (message "actual coding system used: %s" coding-system)
>  
>  > I have verified via the inserted message calls above that it is decoding
>  > using raw-text-unix here.
>  
>  Sounds like you have a problem here: it should be using emacs-mule
>  (since \201\374 is the emacs-mule encoding of ü).

    See my "earlier" message entitled "Rmail and the raw-text coding
system" for why Rmail is using raw-text-unix instead of emacs-mule; I
just resent it so maybe it will get through to the mailing list this
time...

- Mark



