* url-retrieve-synchronously and coding
From: Lennart Borgman
To: Emacs-Devel devel
Date: 2011-01-23 19:03

If I do something like this

  (setq buffer (url-retrieve-synchronously url))

and the contents of the buffer begins with

  HTTP/1.1 200 OK
  Content-Type: text/xml
  X-Content-Type-Options: nosniff
  Connection: close

  <?xml version = "1.0" encoding="UTF-8" standalone="yes"?><!--
  Content-type: fix-mhtml
  -->

should not the buffer's file coding system then be utf-8? This does
not happen for me, but I am unsure if this is a bug or not.
* Re: url-retrieve-synchronously and coding
From: Stefan Monnier
To: Lennart Borgman; +Cc: Emacs-Devel devel
Date: 2011-01-24 3:37

> If I do something like this
>   (setq buffer (url-retrieve-synchronously url))
>
> and the contents of the buffer begins with
>
>   HTTP/1.1 200 OK
>   Content-Type: text/xml
>   X-Content-Type-Options: nosniff
>   Connection: close
>
>   <?xml version = "1.0" encoding="UTF-8" standalone="yes"?><!--
>   Content-type: fix-mhtml
>   -->
>
> should not the buffer's file coding system then be utf-8?

I don't think so, because url-retrieve-synchronously handles the HTTP
part of the protocol only.  Maybe you're thinking of
url-insert-file-contents?


        Stefan
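[For illustration, a minimal sketch of the distinction Stefan draws here.
The URL is only a placeholder; `url-retrieve-synchronously' (url.el) and
`url-insert-file-contents' (url-handlers.el) are the two entry points
being compared.]

  (require 'url)
  (require 'url-handlers)

  ;; Low level: the buffer returned by `url-retrieve-synchronously'
  ;; still contains the status line and the raw HTTP headers, and no
  ;; coding system has been applied to the body.
  (let ((buf (url-retrieve-synchronously "http://example.com/feed.xml")))
    (with-current-buffer buf
      (goto-char (point-min))
      (buffer-substring (point) (line-end-position)))) ; e.g. "HTTP/1.1 200 OK"

  ;; Higher level: inserts only the body into the current buffer and
  ;; decodes it according to the headers, when they say anything.
  (with-temp-buffer
    (url-insert-file-contents "http://example.com/feed.xml")
    (buffer-string))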
* Re: url-retrieve-synchronously and coding
From: Lennart Borgman
To: Stefan Monnier; +Cc: Emacs-Devel devel
Date: 2011-01-24 12:21

On Mon, Jan 24, 2011 at 4:37 AM, Stefan Monnier
<monnier@iro.umontreal.ca> wrote:
>> If I do something like this
>>   (setq buffer (url-retrieve-synchronously url))
>>
>> and the contents of the buffer begins with
>>
>>   HTTP/1.1 200 OK
>>   Content-Type: text/xml
>>   X-Content-Type-Options: nosniff
>>   Connection: close
>>
>>   <?xml version = "1.0" encoding="UTF-8" standalone="yes"?><!--
>>   Content-type: fix-mhtml
>>   -->
>>
>> should not the buffer's file coding system then be utf-8?
>
> I don't think so, because url-retrieve-synchronously handles the HTTP
> part of the protocol only.  Maybe you're thinking of
> url-insert-file-contents?

Ok, thanks. It is not easy to navigate among those functions. But I
guess we have said before that better documentation is needed.

Unfortunately url-insert-file-contents does not decode the file as
utf-8. mm-dissect-buffer looks for the charset, but only in the MIME
headers. In this case the charset is specified in the XML content
instead.

I do not know how the retrieved content above should be handled. It
looks, however, like web browsers handle this case and show the XML
content correctly.

It seems natural, in a case like this where Content-Type is text/xml,
to look for the specified charset in the XML content. I think
`url-insert' should do this. Here is a suggestion for how to do it,
where I have just added a search for <?xml encoding=...?>:

(defun url-insert (buffer &optional beg end)
  "Insert the body of a URL object.
BUFFER should be a complete URL buffer as returned by `url-retrieve'.
If the headers specify a coding-system, it is applied to the body
before it is inserted.
Returns a list of the form (SIZE CHARSET), where SIZE is the size in
bytes of the inserted text and CHARSET is the charset that was
specified in the header, or nil if none was found.
BEG and END can be used to only insert a subpart of the body.
They count bytes from the beginning of the body."
  (let* ((handle (with-current-buffer buffer
                   (mm-dissect-buffer t)))
         (data (with-current-buffer (mm-handle-buffer handle)
                 (if beg
                     (buffer-substring (+ (point-min) beg)
                                       (if end (+ (point-min) end) (point-max)))
                   (buffer-string))))
         (charset (mail-content-type-get (mm-handle-type handle)
                                         'charset)))
    (mm-destroy-parts handle)
    (if charset
        (insert (mm-decode-string data (mm-charset-to-coding-system charset)))
      (if (not (string= "xml" (mm-handle-media-subtype handle)))
          (insert data)
        ;; Content is XML; use the encoding declared in the XML
        ;; prolog, if any.  Only the first 100 bytes are examined,
        ;; which is enough for the <?xml ...?> declaration.
        (let ((coding-system
               (with-temp-buffer
                 (insert (substring data 0 (min 100 (length data))))
                 (let* ((enc-pos (progn
                                   (goto-char (point-min))
                                   (xmltok-get-declared-encoding-position)))
                        (enc-name (and (consp enc-pos)
                                       (buffer-substring-no-properties
                                        (car enc-pos) (cdr enc-pos)))))
                   (cond (enc-name
                          (if (string= (downcase enc-name) "utf-16")
                              (nxml-choose-utf-16-coding-system)
                            (nxml-mime-charset-coding-system enc-name)))
                         (enc-pos
                          (nxml-choose-utf-coding-system)))))))
          (if coding-system
              (insert (mm-decode-string data coding-system))
            (insert data)))))
    (list (length data) charset)))

Is this the right thing to do? Something more is needed to get things
working in my case, but I want to know if this part is ok first. Or is
perhaps the coding handled too late here?
* Re: url-retrieve-synchronously and coding
From: Julien Danjou
To: Lennart Borgman; +Cc: Stefan Monnier, Emacs-Devel devel
Date: 2011-01-24 15:11

On Mon, Jan 24 2011, Lennart Borgman wrote:

> Ok, thanks. It is not easy to navigate among those functions. But I
> guess we have said before that better documentation is needed.
>
> Unfortunately url-insert-file-contents does not decode the file as
> utf-8. mm-dissect-buffer looks for the charset, but only in the MIME
> headers. In this case the charset is specified in the XML content
> instead.
>
> I do not know how the retrieved content above should be handled. It
> looks, however, like web browsers handle this case and show the XML
> content correctly.

Probably because your browser understands XML. Firefox seems to.

> It seems natural, in a case like this where Content-Type is text/xml,
> to look for the specified charset in the XML content. I think
> `url-insert' should do this. Here is a suggestion for how to do it,
> where I have just added a search for <?xml encoding=...?>:

Damn no, I don't think *url*-insert should parse XML, or you'll end up
parsing a lot of file types. This is not what url is about.

What you need is another layer on top of mm (or an enhanced mm) with
something like this:

#+begin_src emacs-lisp
(defvar mm-decoder-helper-functions
  '(("text/xml" . mm-decoder-xml-helper)))

(defun mm-decoder-xml-helper (string-or-buffer)
  "Return the encoding declared in an XML document.
Read an XML string or a buffer containing XML (this depends on the
API type you choose to implement) and return its encoding."
  ...)

(defun mm-decoder-please-decode-this (content content-type
                                      &optional content-encoding)
  "Decode CONTENT based on CONTENT-TYPE and possibly CONTENT-ENCODING.
Use CONTENT-ENCODING if provided, or a helper from
`mm-decoder-helper-functions' to find the right coding system based
on CONTENT-TYPE."
  ...)
#+end_src

That is just a raw idea. Feel free to enhance. :)

-- 
Julien Danjou
❱ http://julien.danjou.info
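[For concreteness, one hypothetical way Julien's placeholders could be
filled in.  `mm-decoder-helper-functions' and the two helper names come
from his sketch above and do not exist in mm; the regexp scan of the XML
prolog is just one possible detection strategy, not the nxml-based one
Lennart uses.]

  (require 'mm-util) ; for `mm-charset-to-coding-system'

  (defvar mm-decoder-helper-functions
    '(("text/xml" . mm-decoder-xml-helper)))

  (defun mm-decoder-xml-helper (buffer)
    "Return the charset declared in BUFFER's <?xml ...?> prolog, or nil."
    (with-current-buffer buffer
      (save-excursion
        (goto-char (point-min))
        (when (re-search-forward
               "<\\?xml[^>]*encoding=[\"']\\([^\"']+\\)[\"']"
               (min 200 (point-max)) t)
          (intern (downcase (match-string 1)))))))

  (defun mm-decoder-please-decode-this (buffer content-type &optional charset)
    "Decode BUFFER in place, using CHARSET or a CONTENT-TYPE helper.
Return the coding system used, or nil if nothing was decoded."
    (let* ((helper (cdr (assoc content-type mm-decoder-helper-functions)))
           (declared (or charset
                         (and helper (funcall helper buffer))))
           (coding (and declared (mm-charset-to-coding-system declared))))
      (when (coding-system-p coding)
        (with-current-buffer buffer
          (decode-coding-region (point-min) (point-max) coding))
        coding)))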
* Re: url-retrieve-synchronously and coding
From: Lennart Borgman
To: Lennart Borgman, Stefan Monnier, Emacs-Devel devel
Date: 2011-01-24 17:29

On Mon, Jan 24, 2011 at 4:11 PM, Julien Danjou <julien@danjou.info> wrote:
> On Mon, Jan 24 2011, Lennart Borgman wrote:
>
>> It seems natural, in a case like this where Content-Type is text/xml,
>> to look for the specified charset in the XML content. I think
>> `url-insert' should do this. Here is a suggestion for how to do it,
>> where I have just added a search for <?xml encoding=...?>:
>
> Damn no, I don't think *url*-insert should parse XML, or you'll end up
> parsing a lot of file types. This is not what url is about.

url-insert already does character decoding, but only if the
information is in the MIME headers.

Isn't it easier to understand and maintain if all decoding is done in
the same place? Maybe url-insert is not the right place to do any
decoding?

> What you need is another layer on top of mm (or an enhanced mm) with
> something like this:
>
> #+begin_src emacs-lisp
> (defvar mm-decoder-helper-functions
>   '(("text/xml" . mm-decoder-xml-helper)))

Yes, that looks like a good structure. But where exactly should this
be done? Where is character decoding for multi-part messages handled?
* Re: url-retrieve-synchronously and coding
From: Lennart Borgman
To: Lennart Borgman, Stefan Monnier, Emacs-Devel devel
Date: 2011-01-24 19:59

On Mon, Jan 24, 2011 at 6:29 PM, Lennart Borgman
<lennart.borgman@gmail.com> wrote:
> On Mon, Jan 24, 2011 at 4:11 PM, Julien Danjou <julien@danjou.info> wrote:
>> On Mon, Jan 24 2011, Lennart Borgman wrote:
>>
>>> It seems natural, in a case like this where Content-Type is text/xml,
>>> to look for the specified charset in the XML content. I think
>>> `url-insert' should do this. Here is a suggestion for how to do it,
>>> where I have just added a search for <?xml encoding=...?>:
>>
>> Damn no, I don't think *url*-insert should parse XML, or you'll end up
>> parsing a lot of file types. This is not what url is about.
>
> url-insert already does character decoding, but only if the
> information is in the MIME headers.
>
> Isn't it easier to understand and maintain if all decoding is done in
> the same place? Maybe url-insert is not the right place to do any
> decoding?
>
>> What you need is another layer on top of mm (or an enhanced mm) with
>> something like this:
>>
>> #+begin_src emacs-lisp
>> (defvar mm-decoder-helper-functions
>>   '(("text/xml" . mm-decoder-xml-helper)))
>
> Yes, that looks like a good structure. But where exactly should this
> be done? Where is character decoding for multi-part messages handled?

It looks to me like url-insert-file-contents is a good place for
decoding. So I suggest the following:

1) Move the decoding from url-insert to url-insert-file-contents.

2) Replace the call to decode-coding-inserted-region in
url-insert-file-contents with something that also takes care of XML
encoding and similar things.

But I wonder what to use for 2. Something like Julien suggested seems
good to me. If no entry is found (or used) in
mm-decoder-helper-functions, then decode-coding-inserted-region should
probably be called.

This of course means that the functions for decoding would be in
url-handlers.el (which Julien objected to).
* Re: url-retrieve-synchronously and coding
From: Julien Danjou
To: Lennart Borgman; +Cc: Stefan Monnier, Emacs-Devel devel
Date: 2011-01-25 10:47

On Mon, Jan 24 2011, Lennart Borgman wrote:

> It looks to me like url-insert-file-contents is a good place for
> decoding. So I suggest the following:
>
> 1) Move the decoding from url-insert to url-insert-file-contents.

I'd like to be able to use the coding detection code and decoding on
an already retrieved buffer, so this can be used in
url-insert-file-contents, but it must be an autonomous function that I
can call myself.

> 2) Replace the call to decode-coding-inserted-region in
> url-insert-file-contents with something that also takes care of XML
> encoding and similar things.
>
> But I wonder what to use for 2. Something like Julien suggested seems
> good to me. If no entry is found (or used) in
> mm-decoder-helper-functions, then decode-coding-inserted-region should
> probably be called.

Sounds good.

> This of course means that the functions for decoding would be in
> url-handlers.el (which Julien objected to).

Why can't it be in mm (or another, possibly new, package)?

-- 
Julien Danjou
❱ http://julien.danjou.info
* Re: url-retrieve-synchronously and coding
From: Lennart Borgman
To: Lennart Borgman, Stefan Monnier, Emacs-Devel devel
Date: 2011-01-25 11:01

On Tue, Jan 25, 2011 at 11:47 AM, Julien Danjou <julien@danjou.info> wrote:
> On Mon, Jan 24 2011, Lennart Borgman wrote:
>
>> It looks to me like url-insert-file-contents is a good place for
>> decoding. So I suggest the following:
>>
>> 1) Move the decoding from url-insert to url-insert-file-contents.
>
> I'd like to be able to use the coding detection code and decoding on
> an already retrieved buffer, so this can be used in
> url-insert-file-contents, but it must be an autonomous function that I
> can call myself.

Yes, of course.

>> 2) Replace the call to decode-coding-inserted-region in
>> url-insert-file-contents with something that also takes care of XML
>> encoding and similar things.
>>
>> But I wonder what to use for 2. Something like Julien suggested seems
>> good to me. If no entry is found (or used) in
>> mm-decoder-helper-functions, then decode-coding-inserted-region should
>> probably be called.
>
> Sounds good.
>
>> This of course means that the functions for decoding would be in
>> url-handlers.el (which Julien objected to).
>
> Why can't it be in mm (or another, possibly new, package)?

Yes, it could be a new package, or a different one from
url-handlers.el. But url-handlers.el seems a reasonable choice to me.
* Re: url-retrieve-synchronously and coding
From: Lennart Borgman
To: Lennart Borgman, Stefan Monnier, Emacs-Devel devel
Date: 2011-01-26 22:08

On Tue, Jan 25, 2011 at 12:01 PM, Lennart Borgman
<lennart.borgman@gmail.com> wrote:
> On Tue, Jan 25, 2011 at 11:47 AM, Julien Danjou <julien@danjou.info> wrote:
>> On Mon, Jan 24 2011, Lennart Borgman wrote:
>>
>>> It looks to me like url-insert-file-contents is a good place for
>>> decoding. So I suggest the following:
>>>
>>> 1) Move the decoding from url-insert to url-insert-file-contents.
>>
>> I'd like to be able to use the coding detection code and decoding on
>> an already retrieved buffer, so this can be used in
>> url-insert-file-contents, but it must be an autonomous function that I
>> can call myself.
>
> Yes, of course.
>
>>> 2) Replace the call to decode-coding-inserted-region in
>>> url-insert-file-contents with something that also takes care of XML
>>> encoding and similar things.

I changed my mind a bit. It looks like it is best to do all the
URL-related decoding in url-insert, since that is where you have the
information about the HTTP headers. Below is the new suggestion.
(Doc strings need some rework.)

(defvar coding-finders '(("text/xml" coding-finder-for-xml)))

(defun coding-finder-for-xml (src)
  "Return the coding system declared in the XML prolog of SRC.
SRC is a buffer or a string; only its beginning is examined."
  (let* ((buffer (if (bufferp src)
                     src
                   (with-current-buffer
                       (generate-new-buffer "coding-getter-for-xml")
                     (insert (substring src 0 (min 100 (length src))))
                     (current-buffer))))
         (here (with-current-buffer buffer (point)))
         (coding-system
          (with-current-buffer buffer
            (let* ((enc-pos (progn
                              (goto-char (point-min))
                              (xmltok-get-declared-encoding-position)))
                   (enc-name (and (consp enc-pos)
                                  (buffer-substring-no-properties
                                   (car enc-pos) (cdr enc-pos)))))
              (cond (enc-name
                     (if (string= (downcase enc-name) "utf-16")
                         (nxml-choose-utf-16-coding-system)
                       (nxml-mime-charset-coding-system enc-name)))
                    (enc-pos
                     (nxml-choose-utf-coding-system)))))))
    (if (bufferp src)
        (with-current-buffer buffer (goto-char here))
      (kill-buffer buffer))
    coding-system))

(defun url-decode (buffer charset media-type)
  "Decode whole BUFFER using char set CHARSET.
Use MEDIA-TYPE only if CHARSET is nil.  In that case it should be a
http header content type.  Use this to look up a coding finder
function in `coding-finders' and decode the buffer with the coding
system that function returns.
Return non-nil if the buffer was decoded."
  (with-current-buffer buffer
    (save-restriction
      (widen)
      (if charset
          (let ((data (buffer-substring-no-properties (point-min) (point-max))))
            (delete-region (point-min) (point-max))
            (insert (mm-decode-string data charset))
            t)
        (when media-type
          (let* ((rec (assoc media-type coding-finders))
                 (coding-finder (nth 1 rec))
                 (coding (when coding-finder
                           (funcall coding-finder (current-buffer)))))
            (when coding
              (decode-coding-region (point-min) (point-max) coding)
              t)))))))

(defun url-insert (buffer &optional beg end)
  "Insert the body of a URL object.
BUFFER should be a complete URL buffer as returned by `url-retrieve'.
If the headers specify a coding-system, it is applied to the body
before it is inserted.
Returns a list of the form (SIZE CHARSET), where SIZE is the size in
bytes of the inserted text and CHARSET is the charset that was
specified in the header, or nil if none was found.
BEG and END can be used to only insert a subpart of the body.
They count bytes from the beginning of the body."
  (let* ((handle (with-current-buffer buffer
                   (mm-dissect-buffer t)))
         (data (with-current-buffer (mm-handle-buffer handle)
                 (if beg
                     (buffer-substring (+ (point-min) beg)
                                       (if end (+ (point-min) end) (point-max)))
                   (buffer-string))))
         (charset (mail-content-type-get (mm-handle-type handle) 'charset))
         ;;(coding (mm-charset-to-coding-system charset))
         (media-type (mm-handle-media-type handle))
         (codbuf (generate-new-buffer "url-insert"))
         decoded)
    (mm-destroy-parts handle)
    (insert (with-current-buffer codbuf
              (insert data)
              (setq decoded (url-decode (current-buffer) charset media-type))
              (buffer-substring-no-properties (point-min) (point-max))))
    (kill-buffer codbuf)
    (list (length data) decoded)))

;;;###autoload
(defun url-insert-file-contents (url &optional visit beg end replace)
  (let ((buffer (url-retrieve-synchronously url)))
    (if (not buffer)
        (error "Opening input file: No such file or directory, %s" url))
    (if visit (setq buffer-file-name url))
    (save-excursion
      (let* ((start (point))
             (size-decoded (url-insert buffer beg end))
             (size (nth 0 size-decoded))
             (decoded (nth 1 size-decoded)))
        (kill-buffer buffer)
        (when replace
          (delete-region (point-min) start)
          (delete-region (point) (point-max)))
        (unless decoded
          ;; If the headers don't specify any particular charset, use the
          ;; usual heuristic/rules that we apply to files.
          (decode-coding-inserted-region start (point) url
                                         visit beg end replace))
        (list url size)))))
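[Illustrative usage only; the URL is a placeholder.  With the changes
above, the body of a text/xml response whose charset is declared only
in the XML prolog should come out decoded:]

  (with-temp-buffer
    (url-insert-file-contents "http://example.com/feed.xml")
    (buffer-string)) ; body decoded via `coding-finder-for-xml'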
* Re: url-retrieve-synchronously and coding
From: Julien Danjou
To: Lennart Borgman; +Cc: Stefan Monnier, Emacs-Devel devel
Date: 2011-01-25 10:44

On Mon, Jan 24 2011, Lennart Borgman wrote:

> url-insert already does character decoding, but only if the
> information is in the MIME headers.

Ok, did not spot that.

> Isn't it easier to understand and maintain if all decoding is done in
> the same place? Maybe url-insert is not the right place to do any
> decoding?

Yes. That should be split out of url-insert. I needed such a function
and never found it, and never thought url-insert might do some
decoding… since I never needed to "insert" a URL-retrieved buffer. :)

-- 
Julien Danjou
❱ http://julien.danjou.info
* Re: url-retrieve-synchronously and coding
From: Stefan Monnier
To: Lennart Borgman; +Cc: Emacs-Devel devel
Date: 2011-01-24 16:44

> Unfortunately url-insert-file-contents does not decode the file as utf-8.

Hmm... the encoding info is part of the content, so it should be linked
to the associated major mode, or something like that.  More
specifically, I'd expect this to be handled just like the
coding-cookie.  So at least if you use url-handler-mode and use
`find-file-noselect', the encoding should be detected correctly.

At which level exactly this should be handled, I'm not completely
sure.  But I do think that if it works for XML it should also work for
other contents which use other ways to specify the encoding
(e.g. coding-cookie, or \usepackage[utf8]{inputenc}, or ...).


        Stefan
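[A side note on the mechanism Stefan alludes to: the ordinary
file-visiting coding detection, the same machinery that honours coding
cookies, consults the `auto-coding-functions' hook, whose default value
already includes an XML-declaration detector.  The sketch below is
purely illustrative; the function name is made up, and mapping LaTeX
inputenc option names to Emacs coding systems is only approximated.]

  ;; A hypothetical detector for \usepackage[...]{inputenc}, hooked in
  ;; the same way the built-in XML/HTML detectors are.
  (defun my-latex-inputenc-auto-coding-function (size)
    "Return the coding system named by an inputenc declaration, if any.
SIZE is the number of characters following point, as passed by
`auto-coding-functions'."
    (save-excursion
      (when (re-search-forward
             "\\\\usepackage\\[\\([^]]+\\)\\]{inputenc}" (+ (point) size) t)
        (let ((cs (intern (downcase (match-string 1)))))
          (and (coding-system-p cs) cs)))))

  (add-to-list 'auto-coding-functions #'my-latex-inputenc-auto-coding-function)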
Thread overview: 11 messages (newest: 2011-01-26 22:08 UTC)

2011-01-23 19:03 url-retrieve-synchronously and coding  Lennart Borgman
2011-01-24  3:37 ` Stefan Monnier
2011-01-24 12:21   ` Lennart Borgman
2011-01-24 15:11     ` Julien Danjou
2011-01-24 17:29       ` Lennart Borgman
2011-01-24 19:59         ` Lennart Borgman
2011-01-25 10:47           ` Julien Danjou
2011-01-25 11:01             ` Lennart Borgman
2011-01-26 22:08               ` Lennart Borgman
2011-01-25 10:44         ` Julien Danjou
2011-01-24 16:44     ` Stefan Monnier