From mboxrd@z Thu Jan 1 00:00:00 1970
From: Lennart Borgman
Newsgroups: gmane.emacs.devel
Subject: Re: url-retrieve-synchronously and coding
Date: Wed, 26 Jan 2011 23:08:17 +0100
References: <87aaipjm4r.fsf@keller.adm.naquadah.org>
To: Lennart Borgman, Stefan Monnier, Emacs-Devel devel
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
List-Id: "Emacs development discussions."
Xref: news.gmane.org gmane.emacs.devel:135054

On Tue, Jan 25, 2011 at 12:01 PM, Lennart Borgman wrote:
> On Tue, Jan 25, 2011 at 11:47 AM, Julien Danjou wrote:
>> On Mon, Jan 24 2011, Lennart Borgman wrote:
>>
>>> It looks to me like url-insert-file-contents is a code place for
>>> decoding. So I suggest the following:
>>>
>>> 1) Move the decoding from url-insert to url-insert-file-contents.
>>
>> I'd like to be able to use the coding detection code and decoding on
>> already retrieved buffer, so this can be used in
>> url-insert-file-contents, but it must be a autonomous function that I
>> can call myself.
>
> Yes, of course.
>
>
>>> 2) Replace the call to decode-coding-inserted-region in
>>> url-insert-file-contents with something that also takes care of xml
>>> encoding and similar things.

I changed my mind a bit. It looks like it is best to do all the
URL-related decoding in url-insert, since that is where you have the
information about the HTTP headers.

Below is the new suggestion. (The doc strings need some rework.)

(defvar coding-finders
  '(("text/xml" coding-finder-for-xml)))

(defun coding-finder-for-xml (src)
  (let* ((buffer (if (bufferp src)
                     src
                   (with-current-buffer (generate-new-buffer "coding-getter-for-xml")
                     ;; Only the start of the document is needed to find
                     ;; the XML declaration.  Guard against strings
                     ;; shorter than 100 characters.
                     (insert (substring src 0 (min 100 (length src))))
                     (current-buffer))))
         (here (with-current-buffer buffer (point)))
         (coding-system
          (with-current-buffer buffer
            (let* ((enc-pos (progn
                              (goto-char (point-min))
                              (xmltok-get-declared-encoding-position)))
                   (enc-name (and (consp enc-pos)
                                  (buffer-substring-no-properties (car enc-pos)
                                                                  (cdr enc-pos)))))
              (cond (enc-name
                     (if (string= (downcase enc-name) "utf-16")
                         (nxml-choose-utf-16-coding-system)
                       (nxml-mime-charset-coding-system enc-name)))
                    (enc-pos (nxml-choose-utf-coding-system)))))))
    (if (bufferp src)
        (with-current-buffer buffer
          (goto-char here))
      (kill-buffer buffer))
    coding-system))

(defun url-decode (buffer charset media-type)
  "Decode whole BUFFER using char set CHARSET.
Use MEDIA-TYPE only if CHARSET is nil.  In that case it should be
an HTTP header content type.  Use this to look up a coding finder
function in `coding-finders' and decode the buffer with the coding
system that function returns.

Return non-nil if the buffer was decoded."
  (with-current-buffer buffer
    (save-restriction
      (widen)
      (if charset
          (let ((data (buffer-substring-no-properties (point-min) (point-max))))
            (delete-region (point-min) (point-max))
            (insert (mm-decode-string data charset))
            t)
        (when media-type
          (let* ((rec (assoc media-type coding-finders))
                 (coding-finder (nth 1 rec))
                 (coding (when coding-finder
                           (funcall coding-finder (current-buffer)))))
            (when coding
              (decode-coding-region (point-min) (point-max) coding)
              t)))))))

(defun url-insert (buffer &optional beg end)
  "Insert the body of a URL object.
BUFFER should be a complete URL buffer as returned by `url-retrieve'.
If the headers specify a coding-system, it is applied to the body
before it is inserted.
Returns a list of the form (SIZE CHARSET), where SIZE is the size in
bytes of the inserted text and CHARSET is the charset that was
specified in the header, or nil if none was found.
BEG and END can be used to only insert a subpart of the body.
They count bytes from the beginning of the body."
  (let* ((handle (with-current-buffer buffer (mm-dissect-buffer t)))
         (data (with-current-buffer (mm-handle-buffer handle)
                 (if beg
                     (buffer-substring (+ (point-min) beg)
                                       (if end (+ (point-min) end) (point-max)))
                   (buffer-string))))
         (charset (mail-content-type-get (mm-handle-type handle)
                                         'charset))
         ;;(coding (mm-charset-to-coding-system charset))
         (media-type (mm-handle-media-type handle))
         (codbuf (generate-new-buffer "url-insert"))
         decoded)
    (mm-destroy-parts handle)
    (insert (with-current-buffer codbuf
              (insert data)
              (url-decode (current-buffer) charset media-type)
              (buffer-substring-no-properties (point-min) (point-max))))
    (kill-buffer codbuf)
    (list (length data) charset)))

;;;###autoload
(defun url-insert-file-contents (url &optional visit beg end replace)
  (let ((buffer (url-retrieve-synchronously url)))
    (if (not buffer)
        (error "Opening input file: No such file or directory, %s" url))
    (if visit (setq buffer-file-name url))
    (save-excursion
      (let* ((start (point))
             (size-decoded (url-insert buffer beg end))
             (size (nth 0 size-decoded))
             (decoded (nth 1 size-decoded)))
        (kill-buffer buffer)
        (when replace
          (delete-region (point-min) start)
          (delete-region (point) (point-max)))
        (unless decoded
          ;; If the headers don't specify any particular charset, use the
          ;; usual heuristic/rules that we apply to files.
          (decode-coding-inserted-region start (point) url visit beg end replace))
        (list url size)))))
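
To illustrate how the `coding-finders' table could be extended for another
media type, here is a rough sketch of a second finder. It is only an
illustration: the name `my-coding-finder-for-html', the 1024-character
window and the regexp are all assumptions made for this example and are not
part of url.el or of the suggestion above. The `defvar' is repeated only so
that the snippet can be evaluated on its own.

```elisp
;; Illustration only: `my-coding-finder-for-html' is a hypothetical name.
;; The defvar is a no-op if `coding-finders' is already defined.
(defvar coding-finders nil
  "Alist of (MEDIA-TYPE FINDER-FUNCTION) entries.")

(defun my-coding-finder-for-html (src)
  "Return the coding system named by a charset declaration in SRC, or nil.
SRC is a buffer or a string; only the first 1024 characters are examined."
  (let ((head (if (bufferp src)
                  (with-current-buffer src
                    (save-restriction
                      (widen)
                      (buffer-substring-no-properties
                       (point-min)
                       (min (point-max) (+ (point-min) 1024)))))
                (substring src 0 (min 1024 (length src)))))
        (case-fold-search t))
    (when (string-match "charset=[\"']?\\([-a-z0-9_]+\\)" head)
      (let ((coding (intern (downcase (match-string 1 head)))))
        ;; Only return names that Emacs knows as coding systems.
        (and (coding-system-p coding) coding)))))

;; Register the finder for text/html.
(add-to-list 'coding-finders '("text/html" my-coding-finder-for-html))

;; Look the finder up by media type and call it, the same way the
;; media-type branch of `url-decode' does.
(let ((finder (nth 1 (assoc "text/html" coding-finders))))
  (funcall finder "<meta charset=\"utf-8\">"))
```

Unlike the XML finder, this one does not move point in the source buffer,
so it has nothing to restore afterwards.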