From mboxrd@z Thu Jan 1 00:00:00 1970
From: Mark Lillibridge
Newsgroups: gmane.emacs.devel
Subject: help needed with coding systems (unrmail problems)
Date: Thu, 13 Jan 2011 15:22:40 -0800
Reply-To: mark.lillibridge@hp.com
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Xref: news.gmane.org gmane.emacs.devel:134506

[eggs.gnu.org took 5 days to bounce this; I have already replied to
this with more information...]

I'm at my wit's end trying to debug a subtle and nasty unrmail bug
where unrmail mangles the character encodings.  I'm many, many hours
down this particular rathole, but let me try and explain the problem
"briefly".  Please ask for clarification or more experiments as needed.

OK, I have an Rmail Babyl file whose contents are correctly encoded via
raw-text-unix (V22) -- for those curious, I believe this can be caused
by receiving a MIME e-mail with two parts with different and
incompatible encodings.  One of the messages contains a Latin-1 u with
two dots over it (below from a V22 emacs):

      character: ü (2300, #o4374, #x8fc, U+00FC)
        charset: latin-iso8859-1
                 (Right-Hand Part of Latin Alphabet 1 (ISO/IEC 8859-1): ISO-IR-100.)
     code point: #x7C
         syntax: w      which means: word
       category: l:Latin
    buffer code: #x81 #xFC
      file code: #xC3 #xBC (encoded by coding system mule-utf-8-unix)

I have verified that this character is represented on disk as 81 FC
(hex).  If I also visit that file literally, I see \201\374, which is
octal for 81 FC as expected.

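As a sanity check on the byte values quoted above, here is a minimal
sketch, assuming Emacs 23 or later (where character codes are Unicode
code points).  The helper name utf8-hex-bytes is made up for
illustration and is not part of unrmail; decode-char,
encode-coding-string and format are standard functions.

```elisp
;; Hypothetical helper (not part of unrmail): return the UTF-8 encoding
;; of Unicode code point CODE as a list of two-digit hex strings.
(defun utf8-hex-bytes (code)
  (mapcar (lambda (byte) (format "%02X" byte))
          (encode-coding-string (string (decode-char 'ucs code)) 'utf-8)))

;; The octal escapes \201 and \374 are the bytes 81 and FC in hex.
(list (format "%02X" #o201) (format "%02X" #o374))  ; => ("81" "FC")

;; U+00FC (u with diaeresis) in UTF-8: the bytes expected in the
;; output file.
(utf8-hex-bytes #xFC)                               ; => ("C3" "BC")
```

So 81 FC is the old internal (emacs-mule) form of the character, and
C3 BC is what a correct UTF-8 write should produce.
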
When I fire up unrmail on this file, it first reads it in as
"raw-text-unix":

    ;; Read in the old Rmail file with no decoding.
    (let ((coding-system-for-read 'raw-text))
      (insert-file-contents file))
    ;; But make it multibyte.
    (set-buffer-multibyte t)
    (setq buffer-file-coding-system 'raw-text-unix)

It then decodes the main part of the file containing the messages:

    (unless (and coding-system (coding-system-p coding-system))
      (setq coding-system
            ;; Emacs 21.1 and later writes RMAIL files in emacs-mule, but
            ;; earlier versions did that with the current buffer's encoding.
            ;; So we want to favor detection of emacs-mule (whose normal
            ;; priority is quite low), but still allow detection of other
            ;; encodings if emacs-mule won't fit.  The call to
            ;; detect-coding-with-priority below achieves that.
            (car (detect-coding-with-priority
                  from to
                  '((coding-category-emacs-mule . emacs-mule))))))
    (message "decoding file with %s" coding-system)
    (unless (memq coding-system '(undecided undecided-unix))
      (set-buffer-modified-p t)        ; avoid locking when decoding
      (let ((buffer-undo-list t))
        (decode-coding-region from to coding-system))
      (setq coding-system last-coding-system-used))
    (message "actual coding system used: %s" coding-system)

I have verified via the inserted message calls above that it is
decoding using raw-text-unix here.

It then writes out the modified message (after rewriting some headers
and the like; no changes to 8-bit characters) by encoding using the
coding system that message was originally decoded with
(mule-utf-8-unix):

    ;; If the message specifies a coding system, use it.
    (let ((maybe-coding (mail-fetch-field "X-Coding-System")))
      (if maybe-coding
          (setq coding
                ;; Force Unix EOLs.
                (coding-system-change-eol-conversion
                 (intern maybe-coding) 0))
        ;; If there's no X-Coding-System header, assume the
        ;; message was never decoded.
        (setq coding 'raw-text-unix)))
    ...
    ;; Write it to the output file, suitably encoded.
    ;(debug)
    (let ((coding-system-for-write coding))
      (write-region (point-min) (point-max) to-file t 'nomsg))
    (message "was %s now %s" coding last-coding-system-used)

Again, I verified via the inserted message call that this is correctly
mule-utf-8-unix.  In a sane universe, this would result in the message
in the output file containing the UTF-8 for this character, C3 BC.
However, what I actually get is 81 FC -- the same as we started with!

I conjecture that this is caused by the change in Emacs's internal
representation.  Whereas raw-text-unix -> mule-utf-8-unix on V22 is an
encoding change, in V23 it probably is not, at least for sane byte
sequences.  (Remember that we are running unrmail on a V23 emacs.)  Can
anyone verify this conjecture?  Google returns pretty much nothing
useful on how emacs' coding systems work.

OK, I said, if true, there should be an easy workaround for now: run
unrmail on a V22 emacs instead.  I did so, and the debugging messages
show the same coding system names being used.  However, now the file
contains C2 81 FC, which is still wrong!

More mysteriously, if I read in that file in V22 using raw-text-unix
(being careful to disable the part that automatically starts rmail on
the buffer) and then write the file out using mule-utf-8-unix, I *do*
get the expected C3 BC.  So something about the exact way that unrmail
is doing things is messing things up.

As a test, I stopped unrmail after it read in the file but before it
decoded it:

    ;; Read in the old Rmail file with no decoding.
    (let ((coding-system-for-read 'raw-text))
      (insert-file-contents file))
    ;; But make it multibyte.
    (set-buffer-multibyte t)
    (setq buffer-file-coding-system 'raw-text-unix)

If I write out that buffer via write-region using coding system
mule-utf-8-unix, I get the erroneous bytes (C2 81 FC) in the output
file.  The same thing happens if I do this just before the buffer is
set to multibyte.

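One observation about the C2 81 FC result, offered as a reading of the
bytes rather than a diagnosis: C2 81 is what UTF-8 produces for the
single character U+0081, so the output looks as though the leading
byte 81 was encoded as a character in its own right while FC was
written through unchanged.  A minimal sketch of that arithmetic,
assuming Emacs 23 or later (where character codes are Unicode code
points):

```elisp
;; Assumes Emacs 23+.  U+0081 is a C1 control character; its UTF-8
;; encoding is the two bytes C2 81.
(mapcar (lambda (byte) (format "%02X" byte))
        (encode-coding-string (string (decode-char 'ucs #x81)) 'utf-8))
;; => ("C2" "81")
```
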
Mind you, I see the same characters (\201\374) in the buffer in all
three cases before I write it out, so some invisible property of the
buffer must be different.  So insert-file-contents must be doing
something different from plain file visiting, and that difference
matters here.  Unfortunately, the help documentation for
insert-file-contents says nothing about this.

Does anyone have any ideas on what might be going on?

- Thanks,
  Mark