From: Lennart Borgman
Newsgroups: gmane.emacs.devel
Subject: Re: url-retrieve-synchronously and coding
Date: Mon, 24 Jan 2011 13:21:16 +0100
To: Stefan Monnier
Cc: Emacs-Devel devel

On Mon, Jan 24, 2011 at 4:37 AM, Stefan Monnier wrote:
>> If I do something like this
>>          (setq buffer (url-retrieve-synchronously url))
>
>> and the contents of the buffer begins with
>
>>   HTTP/1.1 200 OK
>>   Content-Type: text/xml
>>   X-Content-Type-Options: nosniff
>>   Connection: close
>
>>
>>   Content-type: fix-mhtml
> -->
>
>> should not then the buffer file coding system be utf-8?
>
> I don't think so, because url-retrieve-synchronously handles the HTTP
> part of the protocol only.  Maybe you're thinking of url-insert-file-contents?

Ok, thanks.
It is not easy to navigate among those functions. But I guess we have
said before that better documentation is needed.

Unfortunately url-insert-file-contents does not decode the file as
utf-8. mm-dissect-buffer looks for the charset, but only in the MIME
headers. In this case the charset is specified in the XML content
instead.

I do not know how the retrieved content above should be handled.
However, web browsers seem to handle this case and show the XML
content correctly.

When Content-Type is text/xml, as it is here, it seems natural to look
for the charset specified in the XML content. I think `url-insert'
should do this. Here is a suggestion for how to do it, where I have
just added a search for the encoding declared in the XML content:

;; The XML branch below uses functions from xmltok.el and nxml-mode.el.
(require 'xmltok)
(require 'nxml-mode)

(defun url-insert (buffer &optional beg end)
  "Insert the body of a URL object.
BUFFER should be a complete URL buffer as returned by `url-retrieve'.
If the headers specify a coding-system, it is applied to the body before it is inserted.
Returns a list of the form (SIZE CHARSET), where SIZE is the size in bytes
of the inserted text and CHARSET is the charset that was specified in the header,
or nil if none was found.
BEG and END can be used to only insert a subpart of the body.
They count bytes from the beginning of the body."
  (let* ((handle (with-current-buffer buffer (mm-dissect-buffer t)))
         (data (with-current-buffer (mm-handle-buffer handle)
                 (if beg
                     (buffer-substring (+ (point-min) beg)
                                       (if end (+ (point-min) end) (point-max)))
                   (buffer-string))))
         (charset (mail-content-type-get (mm-handle-type handle)
                                         'charset)))
    (mm-destroy-parts handle)
    (if charset
        (insert (mm-decode-string data (mm-charset-to-coding-system charset)))
      (if (not (string= "xml" (mm-handle-media-subtype handle)))
          (insert data)
        ;; Content is XML, use the specified encoding if any:
        (let ((coding-system
               (with-temp-buffer
                 ;; Only the start of the body is needed to find the XML
                 ;; declaration.  Guard against bodies shorter than 100
                 ;; characters, where a plain (substring data 0 100)
                 ;; would signal an error.
                 (insert (substring data 0 (min 100 (length data))))
                 (let* ((enc-pos (progn
                                   (goto-char (point-min))
                                   (xmltok-get-declared-encoding-position)))
                        (enc-name
                         (and (consp enc-pos)
                              (buffer-substring-no-properties (car enc-pos)
                                                              (cdr enc-pos)))))
                   (cond (enc-name
                          (if (string= (downcase enc-name) "utf-16")
                              (nxml-choose-utf-16-coding-system)
                            (nxml-mime-charset-coding-system enc-name)))
                         (enc-pos (nxml-choose-utf-coding-system)))))))
          (if coding-system
              (insert (mm-decode-string data coding-system))
            (insert data)))))
    (list (length data) charset)))

Is this the right thing to do? Something more is needed to get things
working in my case, but I want to know if this part is ok first. Or is
the coding perhaps handled too late here?
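To make the idea easier to try out in isolation, here is a minimal sketch of the same lookup without the nxml helpers. It only inspects the XML declaration at the very start of the body and ignores byte order marks, so it is cruder than what `xmltok-get-declared-encoding-position' does. The function name `my-xml-declared-coding-system' is hypothetical and not part of url or mm; the regexp and the 100-character window are assumptions for illustration.

```elisp
(defun my-xml-declared-coding-system (data)
  "Return the coding system named in the XML declaration of DATA, or nil.
DATA is the raw body as a string.  Hypothetical helper for illustration:
only the first 100 characters are examined and no BOM detection is done."
  (let ((head (substring data 0 (min 100 (length data))))
        ;; XML declarations are case-sensitive, so match "<?xml" exactly.
        (case-fold-search nil))
    (when (string-match
           (concat "\\`<\\?xml[^>]*encoding[ \t\r\n]*=[ \t\r\n]*"
                   "[\"']\\([A-Za-z][-A-Za-z0-9._]*\\)[\"']")
           head)
      (let ((sym (intern (downcase (match-string 1 head)))))
        ;; Only return names that Emacs knows as coding systems.
        (and (coding-system-p sym) sym)))))

;; Usage: decode the body only when a usable encoding was declared.
(let* ((data "<?xml version=\"1.0\" encoding=\"UTF-8\"?><a/>")
       (cs (my-xml-declared-coding-system data)))
  (if cs (decode-coding-string data cs) data))
```

With the sample string above the helper returns `utf-8'; for a body without an XML declaration it returns nil and the data is left undecoded, which matches the fallback in the suggested `url-insert'.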