url-retrieve-synchronously and coding

unofficial mirror of emacs-devel@gnu.org 

* url-retrieve-synchronously and coding
@ 2011-01-23 19:03 Lennart Borgman
  2011-01-24  3:37 ` Stefan Monnier
  0 siblings, 1 reply; 11+ messages in thread
From: Lennart Borgman @ 2011-01-23 19:03 UTC (permalink / raw)
  To: Emacs-Devel devel

If I do something like this

         (setq buffer (url-retrieve-synchronously url))

and the contents of the buffer begins with

  HTTP/1.1 200 OK
  Content-Type: text/xml
  X-Content-Type-Options: nosniff
  Connection: close

  <?xml version = "1.0" encoding="UTF-8" standalone="yes"?><!--

  Content-type: fix-mhtml
  -->

shouldn't the buffer's file coding system then be utf-8?

This does not happen for me, but I am unsure if this is a bug or not.




* Re: url-retrieve-synchronously and coding
  2011-01-23 19:03 url-retrieve-synchronously and coding Lennart Borgman
@ 2011-01-24  3:37 ` Stefan Monnier
  2011-01-24 12:21   ` Lennart Borgman
  0 siblings, 1 reply; 11+ messages in thread
From: Stefan Monnier @ 2011-01-24  3:37 UTC (permalink / raw)
  To: Lennart Borgman; +Cc: Emacs-Devel devel

> If I do something like this
>          (setq buffer (url-retrieve-synchronously url))

> and the contents of the buffer begins with

>   HTTP/1.1 200 OK
>   Content-Type: text/xml
>   X-Content-Type-Options: nosniff
>   Connection: close

>   <?xml version = "1.0" encoding="UTF-8" standalone="yes"?><!--

>   Content-type: fix-mhtml
>   -->

> should not then the buffer file coding system be utf-8?

I don't think so, because url-retrieve-synchronously handles the HTTP
part of the protocol only.  Maybe you're thinking of url-insert-file-contents?


        Stefan




* Re: url-retrieve-synchronously and coding
  2011-01-24  3:37 ` Stefan Monnier
@ 2011-01-24 12:21   ` Lennart Borgman
  2011-01-24 15:11     ` Julien Danjou
  2011-01-24 16:44     ` Stefan Monnier
  0 siblings, 2 replies; 11+ messages in thread
From: Lennart Borgman @ 2011-01-24 12:21 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: Emacs-Devel devel

On Mon, Jan 24, 2011 at 4:37 AM, Stefan Monnier
<monnier@iro.umontreal.ca> wrote:
>> If I do something like this
>>          (setq buffer (url-retrieve-synchronously url))
>
>> and the contents of the buffer begins with
>
>>   HTTP/1.1 200 OK
>>   Content-Type: text/xml
>>   X-Content-Type-Options: nosniff
>>   Connection: close
>
>>   <?xml version = "1.0" encoding="UTF-8" standalone="yes"?><!--
>
>>   Content-type: fix-mhtml
>>   -->
>
>> should not then the buffer file coding system be utf-8?
>
> I don't think so, because url-retrieve-synchronously handles the HTTP
> part of the protocol only.  Maybe you're thinking of url-insert-file-contents?


Ok, thanks. It is not easy to navigate among those functions. But I
guess we have said before that better documentation is needed.

Unfortunately url-insert-file-contents does not decode the file as
utf-8. mm-dissect-buffer looks for the charset, but only in the MIME
headers. In this case the charset is specified in the XML declaration
in the content instead.
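
To make the gap concrete, here is a minimal sketch of reading the
encoding from the XML declaration when the MIME headers carry no charset.
The function name my-xml-declared-coding-system is hypothetical (it is
not part of url or mm), and the sketch assumes an ASCII-compatible
encoding; UTF-16 bodies would need separate handling.

```emacs-lisp
;; Hypothetical helper for illustration; not part of url.el or mm-decode.el.
(defun my-xml-declared-coding-system (data)
  "Return the coding system named in the XML declaration of DATA, or nil.
DATA is a string holding the undecoded body."
  (when (string-match
         "\\`<\\?xml[^>]*?encoding[ \t\n]*=[ \t\n]*[\"']\\([A-Za-z][-A-Za-z0-9._]*\\)[\"']"
         data)
    (let ((coding (intern (downcase (match-string 1 data)))))
      ;; Only return names that Emacs actually knows as coding systems.
      (and (coding-system-p coding) coding))))
```

A caller could then decode the body with something like
(decode-coding-string data (or (my-xml-declared-coding-system data) 'undecided)).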

I do not know how the retrieved content above should be handled.
However, web browsers seem to handle this case and show the XML
content correctly.

It seems natural in a case like this where Content-Type is text/xml to
look for the specified charset in the xml content. I think
`url-insert' should do this. Here is a suggestion for how to do it
where I just have added a search for <?xml encoding=...>:


(defun url-insert (buffer &optional beg end)
  "Insert the body of a URL object.
BUFFER should be a complete URL buffer as returned by `url-retrieve'.
If the headers specify a coding-system, it is applied to the body
before it is inserted.
Returns a list of the form (SIZE CHARSET), where SIZE is the size in bytes
of the inserted text and CHARSET is the charset that was specified in
the header,
or nil if none was found.
BEG and END can be used to only insert a subpart of the body.
They count bytes from the beginning of the body."
  (let* ((handle (with-current-buffer buffer (mm-dissect-buffer t)))
         (data (with-current-buffer (mm-handle-buffer handle)
                 (if beg
                     (buffer-substring (+ (point-min) beg)
                                       (if end (+ (point-min) end) (point-max)))
                   (buffer-string))))
         (charset (mail-content-type-get (mm-handle-type handle)
                                          'charset)))
    (mm-destroy-parts handle)
    (if charset
        (insert (mm-decode-string data (mm-charset-to-coding-system charset)))
      (if (not (string= "xml" (mm-handle-media-subtype handle)))
          (insert data)
        ;; Content is XML, use the specified encoding if any:
        (let ((coding-system
               (with-temp-buffer
                 ;; Guard against bodies shorter than 100 characters.
                 (insert (substring data 0 (min 100 (length data))))
                 (let* ((enc-pos (progn
                                   (goto-char (point-min))
                                   (xmltok-get-declared-encoding-position)))
                        (enc-name
                         (and (consp enc-pos)
                              (buffer-substring-no-properties (car enc-pos)
                                                              (cdr enc-pos)))))
                   (cond (enc-name
                          (if (string= (downcase enc-name) "utf-16")
                              (nxml-choose-utf-16-coding-system)
                            (nxml-mime-charset-coding-system enc-name)))
                         (enc-pos (nxml-choose-utf-coding-system)))))))
          (if coding-system
              (insert (mm-decode-string data coding-system))
            (insert data)))))
    (list (length data) charset)))

Is this the right thing to do?


Something more is needed to get things working in my case, but I want
to know if this part is ok first. Or is perhaps the coding handled too
late here?




* Re: url-retrieve-synchronously and coding
  2011-01-24 12:21   ` Lennart Borgman
@ 2011-01-24 15:11     ` Julien Danjou
  2011-01-24 17:29       ` Lennart Borgman
  2011-01-24 16:44     ` Stefan Monnier
  1 sibling, 1 reply; 11+ messages in thread
From: Julien Danjou @ 2011-01-24 15:11 UTC (permalink / raw)
  To: Lennart Borgman; +Cc: Stefan Monnier, Emacs-Devel devel


On Mon, Jan 24 2011, Lennart Borgman wrote:

> Ok, thanks. It is not easy to navigate among those functions. But I
> guess we have said before that better documentation is needed.
>
> Unfortunately url-insert-file-contents does not decode the file as
> utf-8. mm-dissect-buffer looks for the charset, but only in the mime
> headers. In this case the charset is specified instead in the xml
> content.
>
> I do not know how the retrieved content above should be handled. It
> looks however like the web browsers handles this case and shows the
> xml content correctly.

Probably because your browser understands XML. Firefox seems to.

> It seems natural in a case like this where Content-Type is text/xml to
> look for the specified charset in the xml content. I think
> `url-insert' should do this. Here is a suggestion for how to do it
> where I just have added a search for <?xml encoding=...>:

Damn no, I don't think *url*-insert should parse XML, or you'll end up
parsing a lot of file types. This is not what url is about.

What you need is another layer on top of mm (or enhance mm) with
something like this:

#+begin_src emacs-lisp
(defvar mm-decoder-helper-functions
  '(("text/xml" . mm-decoder-xml-helper)))

(defun mm-decoder-xml-helper (string-or-buffer)
  "Return the encoding of an XML document.

This function reads an XML string or a buffer containing XML (this
depends on the API type you choose to implement) and returns its encoding."
  ...)

(defun mm-decoder-please-decode-this (content content-type &optional content-encoding)
  "Decode CONTENT based on CONTENT-TYPE and possibly CONTENT-ENCODING.

Here you use CONTENT-ENCODING if provided, or a helper from
`mm-decoder-helper-functions' to find the right encoding based on
CONTENT-TYPE."
  ...)
#+end_src

That is just a raw idea. Feel free to enhance. :)

-- 
Julien Danjou
❱ http://julien.danjou.info


* Re: url-retrieve-synchronously and coding
  2011-01-24 12:21   ` Lennart Borgman
  2011-01-24 15:11     ` Julien Danjou
@ 2011-01-24 16:44     ` Stefan Monnier
  1 sibling, 0 replies; 11+ messages in thread
From: Stefan Monnier @ 2011-01-24 16:44 UTC (permalink / raw)
  To: Lennart Borgman; +Cc: Emacs-Devel devel

> Unfortunately url-insert-file-contents does not decode the file as utf-8.

Hmm... the encoding-info is part of the content, so it should be linked
to the associated major mode, or something like that.
More specifically, I'd expect this to be handled just like the
coding-cookie.  So at least if you use url-handler-mode and use
`find-file-noselect', the encoding should be detected correctly.

At which level exactly should this be handled, I'm not completely sure.
But I do think that if it works for XML it should also work for other
contents which use other ways to specify the encoding
(e.g. coding-cookie, or \usepackage[utf8]{inputenc}, or ...).
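
As a rough illustration of the coding-cookie analogy: Emacs consults
`auto-coding-functions' when it decides how to decode a file, and as far
as I know its default value includes a function that reads the XML
declaration. The sketch below runs those functions over an undecoded
string. The name my-auto-coding-for-string is hypothetical, and whether
this is the right level to hook into is exactly the open question.

```emacs-lisp
;; Hypothetical helper for illustration only.
(defun my-auto-coding-for-string (data)
  "Return the coding system that `auto-coding-functions' find in DATA, or nil.
DATA is a string holding the undecoded content."
  (with-temp-buffer
    (set-buffer-multibyte nil)
    (insert data)
    (goto-char (point-min))
    ;; Each function is called with the number of characters to inspect
    ;; and returns a coding system, or nil if it has no opinion.
    (run-hook-with-args-until-success 'auto-coding-functions
                                      (buffer-size))))
```

The exact symbol returned for a UTF-8 declaration may be a UTF-8 variant
depending on the environment, so callers should not compare it with
`eq' against utf-8.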

        Stefan


* Re: url-retrieve-synchronously and coding
  2011-01-24 15:11     ` Julien Danjou
@ 2011-01-24 17:29       ` Lennart Borgman
  2011-01-24 19:59         ` Lennart Borgman
  2011-01-25 10:44         ` Julien Danjou
  0 siblings, 2 replies; 11+ messages in thread
From: Lennart Borgman @ 2011-01-24 17:29 UTC (permalink / raw)
  To: Lennart Borgman, Stefan Monnier, Emacs-Devel devel

On Mon, Jan 24, 2011 at 4:11 PM, Julien Danjou <julien@danjou.info> wrote:
> On Mon, Jan 24 2011, Lennart Borgman wrote:
>
>> It seems natural in a case like this where Content-Type is text/xml to
>> look for the specified charset in the xml content. I think
>> `url-insert' should do this. Here is a suggestion for how to do it
>> where I just have added a search for <?xml encoding=...>:
>
> Damn no, I don't think *url*-insert should parse XML, or you'll end up
> parsing a lot of file type. This is not what url is about.

url-insert already does character decoding, but only if the
information is in the mime headers.

Isn't it easier to understand and maintain if all decoding is done at
the same place? Maybe url-insert is not the right place to do any
decoding?


> What you need is another layer on top of mm (or enhance mm) with
> something like this:
>
> #+begin_src emacs-lisp
> (defvar mm-decoder-helper-functions
>  '(("text/xml" . 'mm-decoder-xml-helper)))

Yes, that looks like a good structure. But where exactly should this
be done? Where is character decoding of multi-part messages handled?




* Re: url-retrieve-synchronously and coding
  2011-01-24 17:29       ` Lennart Borgman
@ 2011-01-24 19:59         ` Lennart Borgman
  2011-01-25 10:47           ` Julien Danjou
  2011-01-25 10:44         ` Julien Danjou
  1 sibling, 1 reply; 11+ messages in thread
From: Lennart Borgman @ 2011-01-24 19:59 UTC (permalink / raw)
  To: Lennart Borgman, Stefan Monnier, Emacs-Devel devel

On Mon, Jan 24, 2011 at 6:29 PM, Lennart Borgman
<lennart.borgman@gmail.com> wrote:
> On Mon, Jan 24, 2011 at 4:11 PM, Julien Danjou <julien@danjou.info> wrote:
>> On Mon, Jan 24 2011, Lennart Borgman wrote:
>>
>>> It seems natural in a case like this where Content-Type is text/xml to
>>> look for the specified charset in the xml content. I think
>>> `url-insert' should do this. Here is a suggestion for how to do it
>>> where I just have added a search for <?xml encoding=...>:
>>
>> Damn no, I don't think *url*-insert should parse XML, or you'll end up
>> parsing a lot of file type. This is not what url is about.
>
> url-insert already does character decoding, but only if the
> information is in the mime headers.
>
> Isn't it easier to understand and maintain if all decoding is done at
> the same place? Maybe url-insert is not the right place to do any
> decoding?
>
>
>> What you need is another layer on top of mm (or enhance mm) with
>> something like this:
>>
>> #+begin_src emacs-lisp
>> (defvar mm-decoder-helper-functions
>>  '(("text/xml" . 'mm-decoder-xml-helper)))
>
> Yes, that looks like a good structure. But where exactly should this
> be done? Where is multi-part messages char decoding handled?

It looks to me like url-insert-file-contents is a code place for
decoding. So I suggest the following:

1) Move the decoding from url-insert to url-insert-file-contents.
2) Replace the call to decode-coding-inserted-region in
url-insert-file-contents with something that also takes care of xml
encoding and similar things.

But I wonder what to use for 2. Something like Julien suggested seems
good to me. If no entry is found (or used) in
mm-decoder-helper-functions then probably
decode-coding-inserted-region should be called.

This of course means that the functions for decoding should be in
url-handlers.el (which Julien objected to).
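
The lookup order in 2) could be sketched roughly as follows. All names
here (my-coding-finder-alist, my-decode-body) are hypothetical, the body
is handled as a string rather than a buffer region, and the final
fallback uses the `undecided' coding system as a stand-in for calling
decode-coding-inserted-region.

```emacs-lisp
;; All names below are hypothetical; none of them exist in url or mm.
(defvar my-coding-finder-alist nil
  "Alist mapping a media type string to a coding finder function.
Each function takes the undecoded body as a string and returns a
coding system symbol, or nil if it cannot tell.")

(defun my-decode-body (data media-type &optional charset)
  "Decode the string DATA and return the decoded string.
Prefer CHARSET (a coding system symbol) taken from the headers,
otherwise ask the finder registered for MEDIA-TYPE, otherwise let
Emacs guess."
  (let* ((finder (cdr (assoc media-type my-coding-finder-alist)))
         (coding (or charset
                     (and finder (funcall finder data))
                     'undecided)))
    (decode-coding-string data coding)))
```

A finder for "text/xml" would be registered with something like
(push (cons "text/xml" #'some-xml-finder) my-coding-finder-alist),
where some-xml-finder is whatever function inspects the XML declaration.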




* Re: url-retrieve-synchronously and coding
  2011-01-24 17:29       ` Lennart Borgman
  2011-01-24 19:59         ` Lennart Borgman
@ 2011-01-25 10:44         ` Julien Danjou
  1 sibling, 0 replies; 11+ messages in thread
From: Julien Danjou @ 2011-01-25 10:44 UTC (permalink / raw)
  To: Lennart Borgman; +Cc: Stefan Monnier, Emacs-Devel devel


On Mon, Jan 24 2011, Lennart Borgman wrote:

> url-insert already does character decoding, but only if the
> information is in the mime headers.

Ok, did not spot that.

> Isn't it easier to understand and maintain if all decoding is done at
> the same place? Maybe url-insert is not the right place to do any
> decoding?

Yes. That should be split out of url-insert. I needed such a function
and never found it, and never thought url-insert might do some decoding…
Since I never needed to "insert" a url retrieved buffer. :)

-- 
Julien Danjou
❱ http://julien.danjou.info


* Re: url-retrieve-synchronously and coding
  2011-01-24 19:59         ` Lennart Borgman
@ 2011-01-25 10:47           ` Julien Danjou
  2011-01-25 11:01             ` Lennart Borgman
  0 siblings, 1 reply; 11+ messages in thread
From: Julien Danjou @ 2011-01-25 10:47 UTC (permalink / raw)
  To: Lennart Borgman; +Cc: Stefan Monnier, Emacs-Devel devel


On Mon, Jan 24 2011, Lennart Borgman wrote:

> It looks to me like url-insert-file-contents is a code place for
> decoding. So I suggest the following:
>
> 1) Move the decoding from url-insert to url-insert-file-contents.

I'd like to be able to use the coding detection and decoding code on an
already retrieved buffer, so this can be used in
url-insert-file-contents, but it must be a standalone function that I
can call myself.

> 2) Replace the call to decode-coding-inserted-region in
> url-insert-file-contents with something that also takes care of xml
> encoding and similar things.
>
> But I wonder what to use for 2. Something like Julien suggested seems
> good to me. If no entry is find (or used) in
> mm-decoder-helper-functions then probably
> decode-coding-inserted-region should be called.

Sounds good.

> This of course means that the functions for decoding should be in
> url-handlers.el (which Julien objected against).

Why can't it be in mm (or another, possibly new, package)?

-- 
Julien Danjou
❱ http://julien.danjou.info


* Re: url-retrieve-synchronously and coding
  2011-01-25 10:47           ` Julien Danjou
@ 2011-01-25 11:01             ` Lennart Borgman
  2011-01-26 22:08               ` Lennart Borgman
  0 siblings, 1 reply; 11+ messages in thread
From: Lennart Borgman @ 2011-01-25 11:01 UTC (permalink / raw)
  To: Lennart Borgman, Stefan Monnier, Emacs-Devel devel

On Tue, Jan 25, 2011 at 11:47 AM, Julien Danjou <julien@danjou.info> wrote:
> On Mon, Jan 24 2011, Lennart Borgman wrote:
>
>> It looks to me like url-insert-file-contents is a code place for
>> decoding. So I suggest the following:
>>
>> 1) Move the decoding from url-insert to url-insert-file-contents.
>
> I'd like to be able to use the coding detection code and decoding on
> already retrieved buffer, so this can be used in
> url-insert-file-contents, but it must be a autonomous function that I
> can call myself.

Yes, of course.


>> 2) Replace the call to decode-coding-inserted-region in
>> url-insert-file-contents with something that also takes care of xml
>> encoding and similar things.
>>
>> But I wonder what to use for 2. Something like Julien suggested seems
>> good to me. If no entry is find (or used) in
>> mm-decoder-helper-functions then probably
>> decode-coding-inserted-region should be called.
>
> Sounds good.
>
>> This of course means that the functions for decoding should be in
>> url-handlers.el (which Julien objected against).
>
> Why cannot be it in mm (or another possibly new package)?

Yes, it could be a new package or another one than url-handlers.el.
But url-handlers.el seems to be a reasonable choice to me.




* Re: url-retrieve-synchronously and coding
  2011-01-25 11:01             ` Lennart Borgman
@ 2011-01-26 22:08               ` Lennart Borgman
  0 siblings, 0 replies; 11+ messages in thread
From: Lennart Borgman @ 2011-01-26 22:08 UTC (permalink / raw)
  To: Lennart Borgman, Stefan Monnier, Emacs-Devel devel

On Tue, Jan 25, 2011 at 12:01 PM, Lennart Borgman
<lennart.borgman@gmail.com> wrote:
> On Tue, Jan 25, 2011 at 11:47 AM, Julien Danjou <julien@danjou.info> wrote:
>> On Mon, Jan 24 2011, Lennart Borgman wrote:
>>
>>> It looks to me like url-insert-file-contents is a code place for
>>> decoding. So I suggest the following:
>>>
>>> 1) Move the decoding from url-insert to url-insert-file-contents.
>>
>> I'd like to be able to use the coding detection code and decoding on
>> already retrieved buffer, so this can be used in
>> url-insert-file-contents, but it must be a autonomous function that I
>> can call myself.
>
> Yes, of course.
>
>
>>> 2) Replace the call to decode-coding-inserted-region in
>>> url-insert-file-contents with something that also takes care of xml
>>> encoding and similar things.


I changed my mind a bit. It looks like it is best to do all the
url-related decoding in url-insert, since that is where you have the
information about the HTTP headers. Below is the new suggestion. (The
doc strings need some rework.)



(defvar coding-finders
  '(("text/xml" coding-finder-for-xml)))

(defun coding-finder-for-xml (src)
  (let* ((buffer (if (bufferp src)
                     src
                   (with-current-buffer (generate-new-buffer
                                         "coding-getter-for-xml")
                     ;; Only the start of the string is needed; guard
                     ;; against strings shorter than 100 characters.
                     (insert (substring src 0 (min 100 (length src))))
                     (current-buffer))))
         (here (with-current-buffer buffer (point)))
         (coding-system
          (with-current-buffer buffer
            (let* ((enc-pos (progn
                              (goto-char (point-min))
                              (xmltok-get-declared-encoding-position)))
                   (enc-name
                    (and (consp enc-pos)
                         (buffer-substring-no-properties (car enc-pos)
                                                         (cdr enc-pos)))))
              (cond (enc-name
                     (if (string= (downcase enc-name) "utf-16")
                         (nxml-choose-utf-16-coding-system)
                       (nxml-mime-charset-coding-system enc-name)))
                    (enc-pos (nxml-choose-utf-coding-system)))))))
    (if (bufferp src)
        (with-current-buffer buffer (goto-char here))
      (kill-buffer buffer))
    coding-system))

(defun url-decode (buffer charset media-type)
  "Decode whole BUFFER using char set CHARSET.
Use MEDIA-TYPE only if CHARSET is nil.  In that case it should be
a http header content type.  Use this to lookup a coding finder
function in `coding-finders' and decode the buffer with the
coding system that function returns.

Return non-nil if the buffer was decoded."
  (with-current-buffer buffer
    (save-restriction
      (widen)
      (if charset
          (let ((data (buffer-substring-no-properties (point-min) (point-max))))
            (delete-region (point-min) (point-max))
            ;; CHARSET is a MIME charset name; convert it to a coding system.
            (insert (mm-decode-string
                     data (mm-charset-to-coding-system charset)))
            t)
        (when media-type
          (let* ((rec (assoc media-type coding-finders))
                 (coding-finder (nth 1 rec))
                 (coding (when coding-finder
                           (funcall coding-finder (current-buffer)))))
            (when coding
              (decode-coding-region (point-min) (point-max) coding)
              t)))))))

(defun url-insert (buffer &optional beg end)
  "Insert the body of a URL object.
BUFFER should be a complete URL buffer as returned by `url-retrieve'.
If the headers specify a coding-system, it is applied to the body
before it is inserted.
Returns a list of the form (SIZE CHARSET), where SIZE is the size in bytes
of the inserted text and CHARSET is the charset that was specified in
the header,
or nil if none was found.
BEG and END can be used to only insert a subpart of the body.
They count bytes from the beginning of the body."
  (let* ((handle (with-current-buffer buffer (mm-dissect-buffer t)))
         (data (with-current-buffer (mm-handle-buffer handle)
                 (if beg
                     (buffer-substring (+ (point-min) beg)
                                       (if end (+ (point-min) end) (point-max)))
                   (buffer-string))))
         (charset (mail-content-type-get (mm-handle-type handle)
                                          'charset))
         ;;(coding (mm-charset-to-coding-system charset))
         (media-type (mm-handle-media-type handle))
         (codbuf (generate-new-buffer "url-insert"))
         decoded)
    (mm-destroy-parts handle)
    (insert
     (with-current-buffer codbuf
       (insert data)
       (url-decode (current-buffer) charset media-type)
       (buffer-substring-no-properties (point-min) (point-max))))
    (kill-buffer codbuf)
    (list (length data) charset)))

;;;###autoload
(defun url-insert-file-contents (url &optional visit beg end replace)
  (let ((buffer (url-retrieve-synchronously url)))
    (if (not buffer)
	(error "Opening input file: No such file or directory, %s" url))
    (if visit (setq buffer-file-name url))
    (save-excursion
      (let* ((start (point))
             (size-decoded (url-insert buffer beg end))
             (size    (nth 0 size-decoded))
             (decoded (nth 1 size-decoded)))
        (kill-buffer buffer)
        (when replace
          (delete-region (point-min) start)
          (delete-region (point) (point-max)))
        (unless decoded
          ;; If the headers don't specify any particular charset, use the
          ;; usual heuristic/rules that we apply to files.
          (decode-coding-inserted-region start (point) url visit beg
                                         end replace))
        (list url size)))))


