How to get title of web page by url?

unofficial mirror of help-gnu-emacs@gnu.org

* How to get title of web page by url?
@ 2010-07-27 12:14 filebat Mark
  2010-07-28  5:08 ` Thamer Mahmoud
  0 siblings, 1 reply; 12+ messages in thread
From: filebat Mark @ 2010-07-27 12:14 UTC (permalink / raw)
  To: help-gnu-emacs


Hi emacsers

I'm wondering whether there is an existing Lisp snippet that gets the title
of a web page from its URL.
Such as, given "http://www.emacswiki.org/emacs/Git", we will get the title
of this web page, which is "EmacsWiki: Git:".

Function of w3m-current-title is quite close, but a standalone lisp function
is much preferred.

-- 
Thanks & Regards

Denny Zhang


* Re: How to get title of web page by url?
  2010-07-27 12:14 How to get title of web page by url? filebat Mark
@ 2010-07-28  5:08 ` Thamer Mahmoud
  2010-07-28 13:44   ` filebat Mark
                     ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Thamer Mahmoud @ 2010-07-28  5:08 UTC (permalink / raw)
  To: help-gnu-emacs

filebat Mark <filebat.mark@gmail.com> writes:

> Such as, given "http://www.emacswiki.org/emacs/Git", we will get the title
> of this web page, which is "EmacsWiki: Git:".
>
> Function of w3m-current-title is quite close, but a standalone lisp function
> is much preferred.

Using the url.el package,

(defun www-get-page-title (url)
  (with-current-buffer (url-retrieve-synchronously url)
    (goto-char 0)
    (re-search-forward "<title>\\(.*\\)<[/]title>" nil t 1)
    (match-string 1)))

(www-get-page-title "http://www.emacswiki.org/emacs/Git")
=> "EmacsWiki: Git"

hth,
Thamer





* Re: How to get title of web page by url?
  2010-07-28  5:08 ` Thamer Mahmoud
@ 2010-07-28 13:44   ` filebat Mark
  2010-07-28 15:34     ` Thamer Mahmoud
  2010-07-28 14:12   ` Deniz Dogan
       [not found]   ` <mailman.2.1280326418.17798.help-gnu-emacs@gnu.org>
  2 siblings, 1 reply; 12+ messages in thread
From: filebat Mark @ 2010-07-28 13:44 UTC (permalink / raw)
  To: Thamer Mahmoud; +Cc: help-gnu-emacs


Thanks, Thamer. It works.

Below is the code snippet.

Well, I still have an encoding problem.
To get the title of "http://www.baidu.com", the title we get is displayed as
unrecognizable codes.

I have tried to encode it, in the way of "(setq web_title_str
(encode-coding-string  web_title_str 'utf-8-dos))", but it fails.
Since I am new to Emacs encodings, could you please help me find out what
the problem is?

;; -------------------------- separator --------------------------
(defun get-page-title()
  "Get title of web page, whose url can be found in current line"
  (interactive)
  ;; Get url from current line
  (copy-region-as-kill (re-search-backward "^") (re-search-forward "$"))
  (setq url (substring-no-properties (current-kill 0)))
  ;; Get title of web page, with the help of functions in url.el
  (with-current-buffer (url-retrieve-synchronously url)
    (goto-char 0)
    (re-search-forward "<title>\\(.*\\)<[/]title>" nil t 1)
    (setq web_title_str (match-string 1)))
  (setq web_title_str (encode-coding-string web_title_str 'utf-8-dos))
  ;; Insert the title in the next line
  (reindent-then-newline-and-indent)
  (insert web_title_str)
  )




-- 
Thanks & Regards

Denny Zhang


* Re: How to get title of web page by url?
  2010-07-28  5:08 ` Thamer Mahmoud
  2010-07-28 13:44   ` filebat Mark
@ 2010-07-28 14:12   ` Deniz Dogan
  2010-07-28 14:53     ` Teemu Likonen
       [not found]   ` <mailman.2.1280326418.17798.help-gnu-emacs@gnu.org>
  2 siblings, 1 reply; 12+ messages in thread
From: Deniz Dogan @ 2010-07-28 14:12 UTC (permalink / raw)
  To: Thamer Mahmoud; +Cc: help-gnu-emacs

2010/7/28 Thamer Mahmoud <thamer.mahmoud@gmail.com>:
>    (re-search-forward "<title>\\(.*\\)<[/]title>" nil t 1)

Why [/] and not /?

By the way, this will not work in scenarios where the title is spread
out across multiple lines:

<title>
  Hello
</title>

How would you solve this in Emacs Lisp?

-- 
Deniz Dogan




* Re: How to get title of web page by url?
       [not found]   ` <mailman.2.1280326418.17798.help-gnu-emacs@gnu.org>
@ 2010-07-28 14:49     ` Ted Zlatanov
  0 siblings, 0 replies; 12+ messages in thread
From: Ted Zlatanov @ 2010-07-28 14:49 UTC (permalink / raw)
  To: help-gnu-emacs

On Wed, 28 Jul 2010 16:12:52 +0200 Deniz Dogan <deniz.a.m.dogan@gmail.com> wrote: 

DD> 2010/7/28 Thamer Mahmoud <thamer.mahmoud@gmail.com>:
>>    (re-search-forward "<title>\\(.*\\)<[/]title>" nil t 1)

DD> Why [/] and not /?

DD> By the way, this will not work in scenarios where the title is spread
DD> out across multiple lines:

DD> <title>
DD>   Hello
DD> </title>

DD> How would you solve this in Emacs Lisp?

It will break in many other situations.  For example, it doesn't check
that the <title> tag is inside the <head> tag or remove comments.  It's
hard to parse HTML properly.
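
A minimal sketch of how the two cases named above could be handled while still using regexps, assuming the page source is already available as a string. The function name `my-html-title-in-head` is made up for illustration and does not come from this thread. It remains a heuristic: script bodies, CDATA sections and malformed markup are not handled.

```elisp
(defun my-html-title-in-head (html)
  "Return the raw <title> text found inside <head> in HTML, or nil.
HTML comments are removed first, so a commented-out title is ignored."
  (let ((case-fold-search t))            ; tags may be upper case
    (with-temp-buffer
      (insert html)
      ;; Delete every <!-- ... --> comment (to end of buffer if unclosed).
      (goto-char (point-min))
      (while (search-forward "<!--" nil t)
        (let ((start (match-beginning 0)))
          (delete-region start
                         (if (search-forward "-->" nil t)
                             (point)
                           (point-max)))))
      ;; Restrict the title search to the <head> element.
      (goto-char (point-min))
      (when (re-search-forward "<head\\(?:[ \t\r\n][^>]*\\)?>" nil t)
        (let ((head-start (point))
              (head-end (if (re-search-forward "</head>" nil t)
                            (match-beginning 0)
                          (point-max))))
          (goto-char head-start)
          (when (re-search-forward "<title[^>]*>\\([^<]*\\)</title>"
                                   head-end t)
            (match-string 1)))))))
```

A title that only appears in a comment or in the body is skipped, and the function returns nil when the page has no <head> at all.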

Ted


* Re: How to get title of web page by url?
  2010-07-28 14:12   ` Deniz Dogan
@ 2010-07-28 14:53     ` Teemu Likonen
  2010-07-28 16:03       ` Andreas Röhler
  0 siblings, 1 reply; 12+ messages in thread
From: Teemu Likonen @ 2010-07-28 14:53 UTC (permalink / raw)
  To: Deniz Dogan; +Cc: help-gnu-emacs, Thamer Mahmoud

* 2010-07-28 16:12 (+0200), Deniz Dogan wrote:

> 2010/7/28 Thamer Mahmoud <thamer.mahmoud@gmail.com>:
>>    (re-search-forward "<title>\\(.*\\)<[/]title>" nil t 1)

> By the way, this will not work in scenarios where the title is spread
> out across multiple lines:
>
> <title>
>   Hello
> </title>
>
> How would you solve this in Emacs Lisp?

Regexps can match whitespace too. Just leave out spaces, tabs and
newlines in the beginning and end of title text. Also note that the
title text itself may contain newlines. We should probably replace
newlines with spaces in the matching string.
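
A minimal sketch of that suggestion, assuming the page source is already available as a string. The helper name `my-html-title-from-string` is hypothetical and not from this thread. It lets the title span several lines, drops the surrounding whitespace, and replaces inner newlines with single spaces.

```elisp
(require 'subr-x)                        ; for `string-trim' on older Emacs

(defun my-html-title-from-string (html)
  "Return the text of the first <title> element in HTML, or nil.
Leading and trailing whitespace is dropped, and every run of
spaces, tabs and newlines inside the title becomes one space."
  (let ((case-fold-search t))            ; also match <TITLE>
    ;; [^<]* matches newlines too, unlike `.', so the title may
    ;; span several lines.
    (when (string-match "<title[^>]*>\\([^<]*\\)</title>" html)
      (let ((raw (match-string 1 html)))
        ;; Collapse inner whitespace first, then trim both ends.
        (string-trim
         (replace-regexp-in-string "[ \t\r\n]+" " " raw))))))
```

The match is copied into `raw` before `replace-regexp-in-string' runs, because that call overwrites the match data.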

The real solution for extracting the title from HTML text is not regular
expressions but a dedicated HTML parser. The Lisp way to write such a
parser would be to turn the document (or only the head part) into nested
lists and other s-expressions and then dive into the list to find the
title. Such parsers already exist for Common Lisp, but I'm not sure about
Emacs Lisp.
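
For later readers: Emacs releases newer than this thread can be built with libxml2 support, which provides `libxml-parse-html-region', and they ship dom.el for walking the resulting nested lists. A hedged sketch of the nested-list approach, assuming such a build. The function name `my-html-title-via-dom` is hypothetical.

```elisp
(require 'dom)
(require 'subr-x)                        ; for `string-trim' on older Emacs

(defun my-html-title-via-dom (html)
  "Parse HTML (a string) into a nested-list DOM and return its title.
Return nil when libxml2 support is missing or no <title> exists."
  (when (fboundp 'libxml-parse-html-region)
    (with-temp-buffer
      (insert html)
      (let* ((dom (libxml-parse-html-region (point-min) (point-max)))
             ;; `dom-by-tag' returns every matching node; take the first.
             (node (car (dom-by-tag dom 'title))))
        (when node
          ;; Normalize whitespace in the title text.
          (string-trim
           (replace-regexp-in-string "[ \t\r\n]+" " "
                                     (dom-text node))))))))
```

Because the parser produces real element nodes, a <title> that sits inside an HTML comment is not returned as a title, which the plain regexp cannot guarantee.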


* Re: How to get title of web page by url?
  2010-07-28 13:44   ` filebat Mark
@ 2010-07-28 15:34     ` Thamer Mahmoud
  2010-07-28 15:44       ` Lennart Borgman
  2010-07-28 18:14       ` Thamer Mahmoud
  0 siblings, 2 replies; 12+ messages in thread
From: Thamer Mahmoud @ 2010-07-28 15:34 UTC (permalink / raw)
  To: help-gnu-emacs

filebat Mark <filebat.mark@gmail.com> writes:

> Thanks, Thamer. It works.
>
> Below is the code snippet.
>
> Well, I still have an encoding problem.
> To get the title of "http://www.baidu.com", the title we get is displayed as
> unrecognizable codes.
>
> I have tried to encode it, in the way of "(setq web_title_str
> (encode-coding-string  web_title_str 'utf-8-dos))", but it fails.

I'm also new to Elisp (well sort of). 

But here is a modified version that should handle both charsets and
newlines (and other issues noticed by Deniz Dogan. Thanks).

(defun www-get-page-title (url)
  (let ((title))
    (with-current-buffer (url-retrieve-synchronously url)
      (goto-char (point-min))
      (re-search-forward "<title>\\([^<]*\\)</title>" nil t 1)
      (setq title (match-string 1))
      (goto-char (point-min))
      (re-search-forward "charset=\\([-0-9a-zA-Z]*\\)" nil t 1)
      (decode-coding-string title (intern (match-string 1))))))

The robustness of this code would still depend on whether the HTML is
well-formed, but it should be good enough I think.

--
Thamer














* Re: How to get title of web page by url?
  2010-07-28 15:34     ` Thamer Mahmoud
@ 2010-07-28 15:44       ` Lennart Borgman
  2010-07-28 18:14       ` Thamer Mahmoud
  1 sibling, 0 replies; 12+ messages in thread
From: Lennart Borgman @ 2010-07-28 15:44 UTC (permalink / raw)
  To: Thamer Mahmoud; +Cc: help-gnu-emacs

On Wed, Jul 28, 2010 at 5:34 PM, Thamer Mahmoud
<thamer.mahmoud@gmail.com> wrote:
> filebat Mark <filebat.mark@gmail.com> writes:
>
>> Thanks, Thamer. It works.
>>
>> Below is the code snippet.
>>
>> Well, I still have an encoding problem.
>> To get the title of "http://www.baidu.com", the title we get is displayed as
>> unrecognizable codes.
>>
>> I have tried to encode it, in the way of "(setq web_title_str
>> (encode-coding-string  web_title_str 'utf-8-dos))", but it fails.
>
> I'm also new to Elisp (well sort of).
>
> But here is a modified version that should handle both charsets and
> newlines (and other issues noticed by Deniz Dogan. Thanks).
>
> (defun www-get-page-title (url)
>  (let ((title))
>    (with-current-buffer (url-retrieve-synchronously url)
>      (goto-char (point-min))
>      (re-search-forward "<title>\\([^<]*\\)</title>" nil t 1)
>      (setq title (match-string 1))
>      (goto-char (point-min))
>      (re-search-forward "charset=\\([-0-9a-zA-Z]*\\)" nil t 1)
>      (decode-coding-string title (intern (match-string 1))))))
>
> The robustness of this code would still depend on whether the HTML is
> well-formed, but it should be good enough I think.


Have a look at url-copy-file for how to get this correct. (Or
web-vcs-url-copy-file in nXhtml which is a little bit more careful.)




* Re: How to get title of web page by url?
  2010-07-28 14:53     ` Teemu Likonen
@ 2010-07-28 16:03       ` Andreas Röhler
  2010-07-28 19:52         ` Andreas Röhler
  0 siblings, 1 reply; 12+ messages in thread
From: Andreas Röhler @ 2010-07-28 16:03 UTC (permalink / raw)
  To: Teemu Likonen; +Cc: help-gnu-emacs

[ ... ]

> The real solution for extracting title from a HTML text are not regular
> expressions but a specific HTML parser. The Lisp way to write such
> parser would be to turn the document (or only the head part) to nested
> lists and other s-expressions and then dive into the list to find the
> title. Such parsers already exist for Common Lisp but I'm not sure about
> Emacs Lisp.
>
>

beg-end.el, at

http://bazaar.launchpad.net/~a-roehler/s-x-emacs-werkstatt

is an attempt at such a parser.

See also thing-at-point-markup.el, which serves markup languages such as
XML and HTML.

thing-at-point-utils.el offers functions to grab everything between
angle brackets, and it counts nesting.

Try ar-angled-lesser-atpt, for example.

All of this needs thingatpt-utils-base.el, where the core routines reside.

Have a look at how the parser mentioned is employed via
beginning-of-form-base and end-of-form-base from there.


Andreas



--
https://code.launchpad.net/~a-roehler/python-mode
https://code.launchpad.net/s-x-emacs-werkstatt/











* Re: How to get title of web page by url?
  2010-07-28 15:34     ` Thamer Mahmoud
  2010-07-28 15:44       ` Lennart Borgman
@ 2010-07-28 18:14       ` Thamer Mahmoud
  2010-07-29 15:07         ` filebat Mark
  1 sibling, 1 reply; 12+ messages in thread
From: Thamer Mahmoud @ 2010-07-28 18:14 UTC (permalink / raw)
  To: help-gnu-emacs


> (defun www-get-page-title (url)
>   (let ((title))
>     (with-current-buffer (url-retrieve-synchronously url)
>       (goto-char (point-min))
>       (re-search-forward "<title>\\([^<]*\\)</title>" nil t 1)
>       (setq title (match-string 1))
>       (goto-char (point-min))
>       (re-search-forward "charset=\\([-0-9a-zA-Z]*\\)" nil t 1)
>       (decode-coding-string title (intern (match-string 1))))))

I just ran a test on a Wikipedia page, and it looks like
`decode-coding-string' doesn't handle upper-case charset names such as
UTF-8, only utf-8.

So the last line should be:

(decode-coding-string title (intern (downcase (match-string 1)))))))
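
A small sketch of a more defensive version of that last step, assuming it is acceptable to fall back to utf-8 when the page declares no charset or one that Emacs does not know. The names `my-charset-to-coding-system` and `my-decode-title` are made up for illustration.

```elisp
(defun my-charset-to-coding-system (charset &optional fallback)
  "Return the coding system named by the string CHARSET.
CHARSET is downcased before lookup.  If it is nil, empty, or does
not name a coding system known to this Emacs, return FALLBACK
\(default `utf-8')."
  (let ((sym (and (stringp charset)
                  (not (string= charset ""))
                  (intern (downcase charset)))))
    ;; `coding-system-p' accepts nil, so check SYM itself as well.
    (if (and sym (coding-system-p sym))
        sym
      (or fallback 'utf-8))))

(defun my-decode-title (raw-title charset)
  "Decode RAW-TITLE, a string of raw bytes, using the declared CHARSET."
  (decode-coding-string raw-title (my-charset-to-coding-system charset)))
```

With this, `(match-string 1)' from the charset search can be passed in directly, including when that search found nothing and the value is nil.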

--
Thamer





* Re: How to get title of web page by url?
  2010-07-28 16:03       ` Andreas Röhler
@ 2010-07-28 19:52         ` Andreas Röhler
  0 siblings, 0 replies; 12+ messages in thread
From: Andreas Röhler @ 2010-07-28 19:52 UTC (permalink / raw)
  To: Teemu Likonen; +Cc: help-gnu-emacs


>

> http://bazaar.launchpad.net/~a-roehler/s-x-emacs-werkstatt

That link doesn't work, sorry.

This one does:

https://code.launchpad.net/s-x-emacs-werkstatt/




* Re: How to get title of web page by url?
  2010-07-28 18:14       ` Thamer Mahmoud
@ 2010-07-29 15:07         ` filebat Mark
  0 siblings, 0 replies; 12+ messages in thread
From: filebat Mark @ 2010-07-29 15:07 UTC (permalink / raw)
  To: Thamer Mahmoud; +Cc: help-gnu-emacs


Thank you very much, Thamer! It serves my need very well.

Though an HTML parser would be more powerful, grepping the string is good
enough for my needs.
Thank you all for the attention and valuable discussion.

Here is the complete Lisp function, in case someone else needs it.
;; -------------------------- separator --------------------------
(defun get-page-title ()
  "Insert the title of the web page whose URL is on the current line."
  (interactive)
  ;; Get url from current line, without touching the kill ring
  (let ((url (buffer-substring-no-properties
              (line-beginning-position) (line-end-position)))
        web_title_str coding_charset)
    ;; Get title of web page, with the help of functions in url.el
    (with-current-buffer (url-retrieve-synchronously url)
      ;; Find the title by searching the HTML code
      (goto-char (point-min))
      (when (re-search-forward "<title>\\([^<]*\\)</title>" nil t 1)
        (setq web_title_str (match-string 1)))
      ;; Find the charset by searching the response
      (goto-char (point-min))
      (when (re-search-forward "charset=\\([-0-9a-zA-Z]*\\)" nil t 1)
        ;; Downcase the charset, e.g. UTF-8 is not acceptable for Emacs,
        ;; while utf-8 is ok.
        (setq coding_charset (downcase (match-string 1))))
      ;; Decode the title string, if both searches succeeded
      (when (and web_title_str coding_charset)
        (setq web_title_str
              (decode-coding-string web_title_str (intern coding_charset)))))
    ;; Insert the title in the next line
    (end-of-line)
    (reindent-then-newline-and-indent)
    (insert (or web_title_str ""))))





-- 
Thanks & Regards

Denny Zhang
