From: Mark Lillibridge
Newsgroups: gmane.emacs.devel
Subject: why unrmail fails with raw-text on version 22 [WAS: Re: help needed with coding systems (unrmail problems)]
Date: Sat, 08 Jan 2011 21:52:27 -0800
Reply-To: mark.lillibridge@hp.com
To: emacs-devel@gnu.org
In-reply-to: (message from Mark Lillibridge on Sat, 8 Jan 2011 14:58:57 -0800)

Ok, I think I now understand why unrmail fails with raw-text*.

Unrmail reads and decodes BABYL files in a subtly different way than Rmail does. In particular, it does:

    (with-temp-buffer
      ;; Read in the old Rmail file with no decoding.
      (let ((coding-system-for-read 'raw-text))
        (insert-file-contents file))
      ;; But make it multibyte.
      (set-buffer-multibyte t)
      (setq buffer-file-coding-system 'raw-text-unix)

Not obvious, but important: with-temp-buffer creates a multibyte buffer so that insert-file-contents is decoding from raw-text to a multibyte buffer, producing raw 8-bit bytes for x80-xff. The (set-buffer-multibyte t) here is a no-op as far as I can tell, as the buffer is already multibyte at that point.

It then decodes the middle part as Rmail does:

    (unless (and coding-system (coding-system-p coding-system))
      (setq coding-system
            ;; Emacs 21.1 and later writes RMAIL files in emacs-mule, but
            ;; earlier versions did that with the current buffer's encoding.
            ;; So we want to favor detection of emacs-mule (whose normal
            ;; priority is quite low), but still allow detection of other
            ;; encodings if emacs-mule won't fit.  The call to
            ;; detect-coding-with-priority below achieves that.
            (car (detect-coding-with-priority
                  from to
                  '((coding-category-emacs-mule .
                     emacs-mule))))))
    (unless (memq coding-system '(undecided undecided-unix))
      (set-buffer-modified-p t) ; avoid locking when decoding
      (let ((buffer-undo-list t))
        (decode-coding-region from to coding-system))
      (setq coding-system last-coding-system-used))

So, Rmail does read unibyte, decode, then convert to multibyte, while unrmail does read multibyte, then decode. This produces the same results for all coding systems except raw-text*. The reason is that reading raw-text unibyte and then converting to multibyte produces a different result than reading raw-text directly to multibyte! The latter produces raw bytes while the former produces code points. Needless to say, trying to encode from raw bytes instead of code points gives different results.

My testing so far shows that this problem can be fixed for version 22 by switching to the Rmail way of doing things (i.e., read unibyte and only convert to multibyte at the end). A more complicated solution will be needed for version 23.

Should I produce a patch for version 22 given that it will not work for version 23?

- Mark