From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: Mark Lillibridge <mark.lillibridge@hp.com>
Newsgroups: gmane.emacs.devel
Subject: help needed with coding systems (unrmail problems)
Date: Thu, 13 Jan 2011 15:22:40 -0800
Message-ID: <qmhipxs2xtr.fsf@hp.com>
Reply-To: mark.lillibridge@hp.com
NNTP-Posting-Host: lo.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
X-Trace: dough.gmane.org 1294960985 8928 80.91.229.12 (13 Jan 2011 23:23:05 GMT)
X-Complaints-To: usenet@dough.gmane.org
NNTP-Posting-Date: Thu, 13 Jan 2011 23:23:05 +0000 (UTC)
To: <emacs-devel@gnu.org>
Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Fri Jan 14 00:23:01 2011
Return-path: <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>
Envelope-to: ged-emacs-devel@m.gmane.org
Original-Received: from lists.gnu.org ([199.232.76.165])
	by lo.gmane.org with esmtp (Exim 4.69)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>)
	id 1PdWVQ-0000HS-Vm
	for ged-emacs-devel@m.gmane.org; Fri, 14 Jan 2011 00:23:01 +0100
Original-Received: from localhost ([127.0.0.1]:35745 helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43)
	id 1PdWVQ-0006ha-CJ
	for ged-emacs-devel@m.gmane.org; Thu, 13 Jan 2011 18:23:00 -0500
Original-Received: from [140.186.70.92] (port=37865 helo=eggs.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43) id 1PdWVG-0006fx-D1
	for emacs-devel@gnu.org; Thu, 13 Jan 2011 18:22:51 -0500
Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <mark.lillibridge@hp.com>) id 1PdWVE-0006in-5J
	for emacs-devel@gnu.org; Thu, 13 Jan 2011 18:22:50 -0500
Original-Received: from madara.hpl.hp.com ([192.6.19.124]:50435)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <mark.lillibridge@hp.com>) id 1PdWVD-0006id-TU
	for emacs-devel@gnu.org; Thu, 13 Jan 2011 18:22:48 -0500
Original-Received: from mailhub-pa1.hpl.hp.com (mailhub-pa1.hpl.hp.com [15.25.115.25])
	by madara.hpl.hp.com (8.14.3/8.14.3/HPL-PA Relay) with ESMTP id
	p0DNMgkK015203
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NOT)
	for <emacs-devel@gnu.org>; Thu, 13 Jan 2011 15:22:43 -0800
Original-Received: from ts-rhel5 (ts-rhel5.hpl.hp.com [15.25.118.27])
	by mailhub-pa1.hpl.hp.com (8.14.3/8.14.3/HPL-PA Hub) with ESMTP id
	p0DNMePJ014321; Thu, 13 Jan 2011 15:22:41 -0800
X-Scanned-By: MIMEDefang 2.69 on 15.0.152.124
X-MIME-Autoconverted: from 8bit to quoted-printable by madara.hpl.hp.com id
	p0DNMgkK015203
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6 (newer, 3)
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: "Emacs development discussions." <emacs-devel.gnu.org>
List-Unsubscribe: <http://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/archive/html/emacs-devel>
List-Post: <mailto:emacs-devel@gnu.org>
List-Help: <mailto:emacs-devel-request@gnu.org?subject=help>
List-Subscribe: <http://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=subscribe>
Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Xref: news.gmane.org gmane.emacs.devel:134506
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/134506>


[eggs.gnu.org took 5 days to bounce this; I have already replied to this
with more information...]


    I'm at my wit's end trying to debug a subtle and nasty unrmail bug
where unrmail mangles the character encodings.  I'm many, many hours
down this particular rathole, but let me try and explain the problem
"briefly".  Please ask for clarification or more experiments as needed.


    Ok, I have an Rmail Babyl file whose contents are correctly encoded
via raw-text-unix (V22) -- for those curious, I believe this can be
caused by receiving a MIME e-mail with two parts with different and
incompatible encodings.  One of the messages contains a Latin-1 u with
two dots over it (below from a V22 emacs):

  character: ü (2300, #o4374, #x8fc, U+00FC)
    charset: latin-iso8859-1
             (Right-Hand Part of Latin Alphabet 1 (ISO/IEC 8859-1): ISO-IR-100.)
 code point: #x7C
     syntax: w  which means: word
   category: l:Latin
buffer code: #x81 #xFC
  file code: #xC3 #xBC (encoded by coding system mule-utf-8-unix)

I have verified that this character is represented on disk as 81 FC
(hex).  If I also visit that file literally, I see \201\374, which is
octal for 81 FC as expected.
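
    The byte values above can be cross-checked outside Emacs.  Below is
a minimal Python sketch, assuming the emacs-mule convention that a
leading byte 0x81 selects latin-iso8859-1 and the next byte is the
Latin-1 code with the high bit set (consistent with the "buffer code:
#x81 #xFC" line above).  The helper decode_mule_latin1 is hypothetical
and handles only ASCII plus this one charset:

```python
def decode_mule_latin1(data: bytes) -> str:
    """Decode ASCII plus 0x81-prefixed Latin-1 pairs (illustrative subset only)."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            # Plain ASCII byte: one byte, one character.
            out.append(chr(b))
            i += 1
        elif b == 0x81 and i + 1 < len(data) and data[i + 1] >= 0xA0:
            # Assumed convention: 0x81 introduces a Latin-1 character whose
            # code is carried in the following byte.
            out.append(chr(data[i + 1]))
            i += 2
        else:
            raise ValueError(f"unsupported byte {b:#04x} at offset {i}")
    return "".join(out)


on_disk = bytes([0x81, 0xFC])                  # what the Babyl file contains
text = decode_mule_latin1(on_disk)             # the character it stands for
utf8 = text.encode("utf-8")                    # what the output file should hold
octal = "".join(f"\\{b:o}" for b in on_disk)   # how a literal visit displays it

print(repr(text), utf8.hex(" "), octal)
```

Under that assumption the pair 81 FC decodes to U+00FC, whose UTF-8
form is C3 BC, matching the "file code" line in the description above.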


    When I fire up unrmail on this file, it first reads it in as
"raw-text-unix":

    ;; Read in the old Rmail file with no decoding.
    (let ((coding-system-for-read 'raw-text))
      (insert-file-contents file))
    ;; But make it multibyte.
    (set-buffer-multibyte t)
    (setq buffer-file-coding-system 'raw-text-unix)


It then decodes the main part of the file containing the messages:

      (unless (and coding-system
                   (coding-system-p coding-system))
        (setq coding-system
              ;; Emacs 21.1 and later writes RMAIL files in emacs-mule, but
              ;; earlier versions did that with the current buffer's encoding.
              ;; So we want to favor detection of emacs-mule (whose normal
              ;; priority is quite low), but still allow detection of other
              ;; encodings if emacs-mule won't fit.  The call to
              ;; detect-coding-with-priority below achieves that.
              (car (detect-coding-with-priority
                    from to
                    '((coding-category-emacs-mule . emacs-mule))))))
      (message "decoding file with %s" coding-system)
      (unless (memq coding-system
                    '(undecided undecided-unix))
        (set-buffer-modified-p t)       ; avoid locking when decoding
        (let ((buffer-undo-list t))
          (decode-coding-region from to coding-system))
        (setq coding-system last-coding-system-used))
      (message "actual coding system used: %s" coding-system)

I have verified via the inserted message calls above that it is decoding
using raw-text-unix here.


    It then writes out the modified message (after rewriting some
headers and the like; no changes to 8-bit characters) by encoding using
the coding system that message was originally decoded with
(mule-utf-8-unix):

              ;; If the message specifies a coding system, use it.
              (let ((maybe-coding (mail-fetch-field "X-Coding-System")))
                (if maybe-coding
                    (setq coding
                          ;; Force Unix EOLs.
                          (coding-system-change-eol-conversion
                           (intern maybe-coding) 0))
                  ;; If there's no X-Coding-System header, assume the
                  ;; message was never decoded.
                  (setq coding 'raw-text-unix)))
            ...
            ;; Write it to the output file, suitably encoded.
            ;(debug)
            (let ((coding-system-for-write coding))
              (write-region (point-min) (point-max) to-file t
                            'nomsg))
            (message "was %s now %s" coding last-coding-system-used)

Again, I verified via the inserted message call that this is correctly
mule-utf-8-unix.


    In a sane universe, this would result in the message in the output
file containing the UTF-8 for this character, C3 BC.  However, what I
actually get is 81 FC -- the same as we started with!

    I conjecture that this is caused by the change in Emacs's internal
representation.  Whereas raw-text-unix -> mule-utf-8-unix on V22 is an
encoding change, in V23 it probably is not, at least for sane byte
sequences.  (Remember that we are running unrmail on a V23 emacs.)  Can
anyone verify this conjecture?  Google pretty much returns nothing
useful for information on how emacs' coding systems work.


    Ok, I said, if true, there should be an easy workaround for now: run
unrmail on a V22 emacs instead.  I did so, and the debugging messages
show the same coding system names being used.  However, now the file
contains C2 81 FC, which is still wrong!  More mysteriously, if I read
in that file in V22 using raw-text-unix (being careful to keep Rmail
from starting automatically on the buffer) and then write the file out
using mule-utf-8-unix I *do* get the expected C3 BC.
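
    For what it's worth, the two byte sequences can be compared outside
Emacs.  The Python sketch below only inspects the bytes and assumes
nothing about how Emacs produced them.  C2 81 happens to be the UTF-8
encoding of U+0081, and FC can never appear in valid UTF-8, which would
be consistent with (but does not prove) the 81 having been encoded as a
character and the FC having been written through as a raw byte:

```python
observed = bytes([0xC2, 0x81, 0xFC])   # what unrmail on V22 wrote
expected = "\u00fc".encode("utf-8")    # UTF-8 for u-with-diaeresis: C3 BC

# The first two observed bytes decode cleanly on their own, to U+0081.
prefix = observed[:2].decode("utf-8")

# The full three-byte sequence is not valid UTF-8, because 0xFC is not
# a legal UTF-8 byte.
try:
    observed.decode("utf-8")
    whole_is_valid = True
except UnicodeDecodeError:
    whole_is_valid = False

print(f"expected {expected.hex(' ')}, observed {observed.hex(' ')}")
print(f"prefix is U+{ord(prefix):04X}, whole sequence valid: {whole_is_valid}")
```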


    So something about the exact way that unrmail is doing things is
messing things up.  As a test, I stopped unrmail after it read in the
file but before it decoded it:

    ;; Read in the old Rmail file with no decoding.
    (let ((coding-system-for-read 'raw-text))
      (insert-file-contents file))
    ;; But make it multibyte.
    (set-buffer-multibyte t)
    (setq buffer-file-coding-system 'raw-text-unix)

If I write out that buffer via write-region using coding system
mule-utf-8-unix, I get the erroneous bytes (C2 81 FC) in the output
file.  The same thing happens if I do this just before the buffer is
set to multibyte.  Mind you, I see the same characters (\201\374) in the
buffer in all three cases before I write it out, so some invisible
property of the buffer must be different.


    So insert-file-contents is doing something differently from just
visiting the file that matters.  Unfortunately, the help documentation
for insert-file-contents gives no help on this.


    Does anyone have any ideas on what might be going on?

- Thanks,
  Mark