all messages for Emacs-related lists mirrored at yhetil.org
* url-retrieve and utf-8
@ 2008-02-04  8:28 William Xu
  2008-02-04 12:43 ` William Xu
  0 siblings, 1 reply; 13+ messages in thread
From: William Xu @ 2008-02-04  8:28 UTC (permalink / raw)
  To: help-gnu-emacs

What is the correct way to handle utf-8-encoded HTML pages fetched by
url-retrieve? Or is there a way to specify a coding system for
reading and writing in the buffer returned by url-retrieve?

So far, I have tried calling:

  (decode-coding-string (buffer-string) 'utf-8)

But the result is only partially correct. For example, when there is a
mix of ASCII and Japanese characters, it returns only the ASCII part.
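
To make the question concrete, here is roughly the kind of call I have
in mind (a minimal, untested sketch; the URL is only an example, and
skipping the HTTP headers by hand may well be the wrong approach):

  (with-current-buffer (url-retrieve-synchronously "http://www.example.org/")
    (goto-char (point-min))
    ;; Move past the blank line that ends the HTTP headers.
    (re-search-forward "\r?\n\r?\n" nil 'move)
    ;; Decode only the body, assuming it is utf-8.
    (decode-coding-string
     (buffer-substring-no-properties (point) (point-max))
     'utf-8))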

-- 
William

http://williamxu.net9.org





^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: url-retrieve and utf-8
  2008-02-04  8:28 url-retrieve and utf-8 William Xu
@ 2008-02-04 12:43 ` William Xu
  2008-02-04 16:02   ` Andreas Röhler
  0 siblings, 1 reply; 13+ messages in thread
From: William Xu @ 2008-02-04 12:43 UTC (permalink / raw)
  To: help-gnu-emacs

William Xu <william.xwl@gmail.com> writes:

> So far, I have tried calling:
>
>   (decode-coding-string (buffer-string) 'utf-8)
>
> But the result is only partially correct. For example, when there is a
> mix of ASCII and Japanese characters, it returns only the ASCII part.

It turns out this is because I had called (skip-chars-backward
"[[:space:]]") before decode-coding-string, and apparently
skip-chars-backward mistook some non-ASCII characters for
whitespace.

-- 
William

http://williamxu.net9.org





^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: url-retrieve and utf-8
  2008-02-04 12:43 ` William Xu
@ 2008-02-04 16:02   ` Andreas Röhler
  2008-02-05  3:55     ` William Xu
       [not found]     ` <mailman.6982.1202183739.18990.help-gnu-emacs@gnu.org>
  0 siblings, 2 replies; 13+ messages in thread
From: Andreas Röhler @ 2008-02-04 16:02 UTC (permalink / raw)
  To: help-gnu-emacs; +Cc: William Xu

On Monday, 4 February 2008 at 13:43, William Xu wrote:
> William Xu <william.xwl@gmail.com> writes:
> > So far, I have tried calling:
> >
> >   (decode-coding-string (buffer-string) 'utf-8)
> >
> > But the result is only partially correct. For example, when there is a
> > mix of ASCII and Japanese characters, it returns only the ASCII part.
>
> It turns out this is because I had called (skip-chars-backward
> "[[:space:]]") before decode-coding-string, and apparently
> skip-chars-backward mistook some non-ASCII characters for
> whitespace.


AFAICS that's not a mistake; that is how it is implemented.

See the Elisp info node 34.3.1.2, Character Classes:


`[:space:]'
     This matches any character that has whitespace syntax (*note

....


Here is a table of syntax classes, the characters that stand for them,
their meanings, and examples of their use.

 -- Syntax class: whitespace character
     "Whitespace characters" (designated by ` ' or `-') separate
     symbols and words from each other.  Typically, whitespace
     characters have no other syntactic significance, and multiple
     whitespace characters are syntactically equivalent to a single
     one.  

======> Space, tab, newline and formfeed <============

are classified as
     whitespace in almost all major modes.

;;;;;;;

[:blank:] should DTRT.
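
Untested sketch, just to illustrate (note that skip-chars-backward
reads its argument like the inside of a regexp bracket expression, so
the class is written "[:space:]", without an extra pair of brackets):

  (with-temp-buffer
    (insert "foo \t\n")
    (goto-char (point-max))
    ;; [:space:] follows whitespace *syntax*; [:blank:] matches only
    ;; horizontal whitespace (space and tab).
    (skip-chars-backward "[:space:]")
    (point))   ; => 4, right after "foo"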

Andreas Röhler




^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: url-retrieve and utf-8
  2008-02-04 16:02   ` Andreas Röhler
@ 2008-02-05  3:55     ` William Xu
  2008-02-05 10:50       ` Andreas Röhler
  2008-02-05 20:02       ` Andreas Röhler
       [not found]     ` <mailman.6982.1202183739.18990.help-gnu-emacs@gnu.org>
  1 sibling, 2 replies; 13+ messages in thread
From: William Xu @ 2008-02-05  3:55 UTC (permalink / raw)
  To: help-gnu-emacs

Andreas Röhler <andreas.roehler@online.de> writes:

> ======> Space, tab, newline and formfeed <============
>
> are classified as
>      whitespace in almost all major modes.

If [:space:] is for the above purpose, how come it eats some non-ASCII
characters (Japanese, in this case)?

Maybe it is because the buffer is not correctly decoded? Back to my
original question: does url-retrieve respect the charset in the
"Content-Type" header of an HTML page?

-- 
William

http://williamxu.net9.org





^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: url-retrieve and utf-8
  2008-02-05  3:55     ` William Xu
@ 2008-02-05 10:50       ` Andreas Röhler
  2008-02-06  6:28         ` William Xu
  2008-02-05 20:02       ` Andreas Röhler
  1 sibling, 1 reply; 13+ messages in thread
From: Andreas Röhler @ 2008-02-05 10:50 UTC (permalink / raw)
  To: help-gnu-emacs

On Tuesday, 5 February 2008 at 04:55, William Xu wrote:
> Andreas Röhler <andreas.roehler@online.de> writes:
> > ======> Space, tab, newline and formfeed <============
> >
> > are classified as
> >      whitespace in almost all major modes.
>
> If [:space:] is for the above purpose, how come it eats some non-ASCII
> characters (Japanese, in this case)?


As that depends on the definition of whitespace syntax,
it may happen; however, it would indicate an error in that
definition.

To rule out errors here, you could use
(skip-chars-backward " "), i.e. not rely on the
character class at all, AFAIU.
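
If tabs, newlines and formfeeds should be skipped as well, a literal
set like this (untested) avoids the syntax-table question entirely:

  (skip-chars-backward " \t\n\f")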


>
> Maybe it is because the buffer is not correctly decoded? Back to my
> original question: does url-retrieve respect the charset in the
> "Content-Type" header of an HTML page?

Good question. Unfortunately I am ignorant of these things. From this
point of ignorance I would expect `url-retrieve' to deliver the
stuff as-is, to be parsed afterwards. Right or
wrong?

Should I discover more, I'll let you know.

Andreas Röhler




^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: url-retrieve and utf-8
  2008-02-05  3:55     ` William Xu
  2008-02-05 10:50       ` Andreas Röhler
@ 2008-02-05 20:02       ` Andreas Röhler
  2008-02-06  6:34         ` William Xu
  1 sibling, 1 reply; 13+ messages in thread
From: Andreas Röhler @ 2008-02-05 20:02 UTC (permalink / raw)
  To: help-gnu-emacs

...

> Back to my original
> question: does url-retrieve respect the charset in the
> "Content-Type" header of an HTML page?



While searching, I came across `w3m-retrieve'.

The call below fills the buffer with the contents of
"http://www.emacswiki.org". Does this work for you too?

(w3m-retrieve "http://www.emacswiki.org")
     

Maybe you can use this form?

Andreas Röhler




^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: url-retrieve and utf-8
  2008-02-05 10:50       ` Andreas Röhler
@ 2008-02-06  6:28         ` William Xu
  0 siblings, 0 replies; 13+ messages in thread
From: William Xu @ 2008-02-06  6:28 UTC (permalink / raw)
  To: help-gnu-emacs

Andreas Röhler <andreas.roehler@online.de> writes:

> From this point of
> ignorance I would expect `url-retrieve' to deliver the
> stuff as-is, to be parsed afterwards. Right or
> wrong?

Right. More precisely, it may be that it treats all data as ASCII
characters, i.e., single bytes.

-- 
William

http://williamxu.net9.org





^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: url-retrieve and utf-8
  2008-02-05 20:02       ` Andreas Röhler
@ 2008-02-06  6:34         ` William Xu
  2008-02-06 12:36           ` Andreas Röhler
  0 siblings, 1 reply; 13+ messages in thread
From: William Xu @ 2008-02-06  6:34 UTC (permalink / raw)
  To: help-gnu-emacs

Andreas Röhler <andreas.roehler@online.de> writes:

> While searching, I came across `w3m-retrieve'.
>
> The call below fills the buffer with the contents of
> "http://www.emacswiki.org". Does this work for you too?
>
> (w3m-retrieve "http://www.emacswiki.org")

Thanks. This works much better than the url library.

> Maybe you can use this form?

But w3m-el is not included in Emacs, and it requires an external tool
-- "w3m".  If I were to use an external tool, curl or wget would also
be good options.  I chose the url library to reduce dependencies.

-- 
William

http://williamxu.net9.org





^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: url-retrieve and utf-8
  2008-02-06  6:34         ` William Xu
@ 2008-02-06 12:36           ` Andreas Röhler
  0 siblings, 0 replies; 13+ messages in thread
From: Andreas Röhler @ 2008-02-06 12:36 UTC (permalink / raw)
  To: help-gnu-emacs

On Wednesday, 6 February 2008 at 07:34, William Xu wrote:
> Andreas Röhler <andreas.roehler@online.de> writes:
> > While searching, I came across `w3m-retrieve'.
> >
> > The call below fills the buffer with the contents of
> > "http://www.emacswiki.org". Does this work for you too?
> >
> > (w3m-retrieve "http://www.emacswiki.org")
>
> Thanks. This works much better than the url library.
>
> > Maybe you can use this form?
>
> But w3m-el is not included in Emacs,

That's bad, really.  

> and it requires an external tool
> -- "w3m".  If I were to use an external tool, curl or wget would also
> be good options.  I chose the url library to reduce dependencies.


IMO, always use the fastest, most widely used, well-maintained
free tool, no matter whether it is shipped with Emacs
or not. Don't waste your time... free time is joy :).

Andreas Röhler





^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: url-retrieve and utf-8
       [not found]     ` <mailman.6982.1202183739.18990.help-gnu-emacs@gnu.org>
@ 2008-02-06 15:09       ` Stefan Monnier
  2008-02-07  8:05         ` William Xu
       [not found]         ` <mailman.7082.1202371563.18990.help-gnu-emacs@gnu.org>
  0 siblings, 2 replies; 13+ messages in thread
From: Stefan Monnier @ 2008-02-06 15:09 UTC (permalink / raw)
  To: help-gnu-emacs

> Maybe it is because the buffer is not correctly decoded? Back to my
> original question: does url-retrieve respect the charset in the
> "Content-Type" header of an HTML page?

I can't remember exactly, but I think it doesn't (it just returns the
raw undecoded bytes).  url-insert-file-contents should try and obey
"Content-Type"'s charset info, tho.


        Stefan


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: url-retrieve and utf-8
  2008-02-06 15:09       ` Stefan Monnier
@ 2008-02-07  8:05         ` William Xu
       [not found]         ` <mailman.7082.1202371563.18990.help-gnu-emacs@gnu.org>
  1 sibling, 0 replies; 13+ messages in thread
From: William Xu @ 2008-02-07  8:05 UTC (permalink / raw)
  To: help-gnu-emacs

Stefan Monnier <monnier@iro.umontreal.ca> writes:

> I can't remember exactly, but I think it doesn't (it just returns the
> raw undecoded bytes).  url-insert-file-contents should try and obey
> "Content-Type"'s charset info, tho.

Hmm, url-insert-file-contents' implementation appears to obey
"Content-Type": 

,----
| ;;;###autoload
| (defun url-insert-file-contents (url &optional visit beg end replace)
|   (let ((buffer (url-retrieve-synchronously url)))
|     (if (not buffer)
| 	(error "Opening input file: No such file or directory, %s" url))
|     (if visit (setq buffer-file-name url))
|     (save-excursion
|       (let* ((start (point))
|              (size-and-charset (url-insert buffer beg end)))
|         (kill-buffer buffer)
|         (when replace
|           (delete-region (point-min) start)
|           (delete-region (point) (point-max)))
|         (unless (cadr size-and-charset)
|           ;; If the headers don't specify any particular charset, use the
|           ;; usual heuristic/rules that we apply to files.
|           (decode-coding-inserted-region start (point) url visit beg end replace))
|         (list url (car size-and-charset))))))
`----

Except that it never succeeds.  For example, with a header like

,----
| <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
`----

it only finds "text/html", completely missing the "charset" value.
It looks like the final header-detection job falls to
mm-decode.el. Maybe it is mm-decode.el's fault?

-- 
William

http://williamxu.net9.org





^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: url-retrieve and utf-8
       [not found]         ` <mailman.7082.1202371563.18990.help-gnu-emacs@gnu.org>
@ 2008-02-08  3:06           ` Stefan Monnier
       [not found]           ` <foefpr$ifg$1@news.sap-ag.de>
  1 sibling, 0 replies; 13+ messages in thread
From: Stefan Monnier @ 2008-02-08  3:06 UTC (permalink / raw)
  To: help-gnu-emacs

> Except that it never succeeds.  For example, with a header like

> ,----
> | <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
> `----

This is not an HTTP header.  It is an HTML (XML? SGML?) header, so it
would need to be interpreted not by the URL library but by the user of
that HTML data.
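
Something along these lines (rough and untested; the helper name is
made up, and the regexp is only a heuristic) could do that on the
consumer side, after url-retrieve has returned the raw bytes:

  (defun my-decode-html-buffer ()
    "Decode the HTML body of the current url-retrieve buffer."
    (goto-char (point-min))
    ;; Skip the HTTP headers.
    (re-search-forward "\r?\n\r?\n" nil 'move)
    (let* ((body (buffer-substring-no-properties (point) (point-max)))
           ;; Look for charset=... in a <meta http-equiv=...> tag.
           (coding (if (string-match "charset=\"?\\([-A-Za-z0-9_]+\\)" body)
                       (intern (downcase (match-string 1 body)))
                     'utf-8)))
      (decode-coding-string
       body (if (coding-system-p coding) coding 'utf-8))))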

> it only finds "text/html", completely missing the "charset" value.
> It looks like the final header-detection job falls to
> mm-decode.el. Maybe it is mm-decode.el's fault?

Could be.  Not sure where mm-decode comes into the picture; I guess that
was in a part of the thread that I missed.


        Stefan



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: url-retrieve and utf-8
       [not found]           ` <foefpr$ifg$1@news.sap-ag.de>
@ 2008-02-08  4:30             ` William Xu
  0 siblings, 0 replies; 13+ messages in thread
From: William Xu @ 2008-02-08  4:30 UTC (permalink / raw)
  To: help-gnu-emacs

Klaus Straubinger <KSNetz@UseNet.ArcorNews.DE> writes:

> In the whole URL library you won't find any instance of "http-equiv".
> Therefore it is not surprising that it can't handle inline headers like
> the meta tag you describe. What the URL library could do is parsing
> HTTP headers.

Thank you for your explanations.  Looks like I was mixing things up. 

> The W3 library has the function w3-nasty-disgusting-http-equiv-handling.
> Its name probably indicates what fun it is to parse http-equiv headers.

Yes, parsing HTML is always such fun...  If only every HTML page
were well structured...

-- 
William

http://williamxu.net9.org





^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2008-02-08  4:30 UTC | newest]

Thread overview: 13+ messages
-- links below jump to the message on this page --
2008-02-04  8:28 url-retrieve and utf-8 William Xu
2008-02-04 12:43 ` William Xu
2008-02-04 16:02   ` Andreas Röhler
2008-02-05  3:55     ` William Xu
2008-02-05 10:50       ` Andreas Röhler
2008-02-06  6:28         ` William Xu
2008-02-05 20:02       ` Andreas Röhler
2008-02-06  6:34         ` William Xu
2008-02-06 12:36           ` Andreas Röhler
     [not found]     ` <mailman.6982.1202183739.18990.help-gnu-emacs@gnu.org>
2008-02-06 15:09       ` Stefan Monnier
2008-02-07  8:05         ` William Xu
     [not found]         ` <mailman.7082.1202371563.18990.help-gnu-emacs@gnu.org>
2008-02-08  3:06           ` Stefan Monnier
     [not found]           ` <foefpr$ifg$1@news.sap-ag.de>
2008-02-08  4:30             ` William Xu
