From: filebat Mark
Newsgroups: gmane.emacs.help
Subject: Re: How to get title of web page by url?
Date: Thu, 29 Jul 2010 23:07:43 +0800
To: Thamer Mahmoud
Cc: help-gnu-emacs@gnu.org
In-Reply-To: <87k4of324m.fsf@zemblan.newkuwait.org>

Thank you very much, Thamer! It serves my need very well.

Though an HTML parser would be more powerful, grepping the string is good enough for my requirement.
Thank you all for the attention and valuable discussion.

I'm posting the complete Lisp function here, in case someone else needs it.
;; -------------------------- separator --------------------------
(defun get-page-title ()
  "Get the title of the web page whose URL is on the current line."
  (interactive)
  ;; Bind the working variables locally instead of setting globals
  (let (url web_title_str coding_charset)
    ;; Get the URL from the current line
    (copy-region-as-kill (re-search-backward "^") (re-search-forward "$"))
    (setq url (substring-no-properties (current-kill 0)))
    ;; Get the title of the web page, with the help of functions in url.el
    (with-current-buffer (url-retrieve-synchronously url)
      ;; Find the title by grepping the HTML code
      (goto-char (point-min))
      (re-search-forward "<title>\\([^<]*\\)</title>" nil t 1)
      (setq web_title_str (match-string 1))
      ;; Find the charset by grepping the HTML code
      (goto-char (point-min))
      (re-search-forward "charset=\\([-0-9a-zA-Z]*\\)" nil t 1)
      ;; Downcase the charset, e.g. UTF-8 is not acceptable to Emacs, while utf-8 is OK.
      (setq coding_charset (downcase (match-string 1)))
      ;; Decode the title string.
      (setq web_title_str (decode-coding-string web_title_str (intern coding_charset))))
    ;; Insert the title on the next line
    (reindent-then-newline-and-indent)
    (insert web_title_str)))

On Thu, Jul 29, 2010 at 2:14 AM, Thamer Mahmoud wrote:
>
> > (defun www-get-page-title (url)
> >   (let ((title))
> >     (with-current-buffer (url-retrieve-synchronously url)
> >       (goto-char (point-min))
> >       (re-search-forward "<title>\\([^<]*\\)</title>" nil t 1)
> >       (setq title (match-string 1))
> >       (goto-char (point-min))
> >       (re-search-forward "charset=\\([-0-9a-zA-Z]*\\)" nil t 1)
> >       (decode-coding-string title (intern (match-string 1))))))
>
> Just did a test on a wikipedia page, and looks like
> `decode-coding-string' doesn't handle upper-case charsets, like UTF-8,
> only utf-8.
>
> So the last line should be:
>
> (decode-coding-string title (intern (downcase (match-string 1)))))))
>
> --
> Thamer

--
Thanks & Regards

Denny Zhang
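
The grep-based approach in the function above could also be written more defensively. What follows is only a rough sketch, not a tested replacement: the names `my-title-from-html` and `my-get-page-title` are made up for illustration, and falling back to utf-8 when no usable charset is found is my own assumption. It returns nil instead of signalling an error when the page has no title, and it keeps the downcasing of the charset name that Thamer pointed out.

```elisp
(require 'url)

;; Hypothetical helper (not from the original post): pull the title out
;; of a string holding the raw HTTP headers and HTML of a response.
(defun my-title-from-html (html)
  "Return the decoded <title> text found in HTML, or nil if there is none."
  (let ((case-fold-search t))           ; match <TITLE> and Charset= too
    (when (string-match "<title[^>]*>\\([^<]*\\)</title>" html)
      (let* ((raw (match-string 1 html))
             ;; Downcase the charset name: `utf-8' names a coding
             ;; system, while `UTF-8' does not.
             (charset (and (string-match
                            "charset=[\"']?\\([-_0-9a-zA-Z]+\\)" html)
                           (intern-soft (downcase (match-string 1 html)))))
             ;; Assumption: fall back to utf-8 for a missing or unknown charset.
             (coding (if (and charset (coding-system-p charset))
                         charset
                       'utf-8)))
        (decode-coding-string raw coding)))))

;; Hypothetical wrapper: fetch URL with url.el and return the title, or nil.
(defun my-get-page-title (url)
  "Fetch URL synchronously and return its page title, or nil."
  (let ((buffer (url-retrieve-synchronously url)))
    (when buffer
      (unwind-protect
          (with-current-buffer buffer
            (my-title-from-html (buffer-string)))
        ;; Do not leave the response buffer behind.
        (kill-buffer buffer)))))
```

Keeping the string parsing separate from the network call means the parsing can be tried on a literal string, e.g. (my-title-from-html "charset=UTF-8 <title>Hello</title>"), without fetching anything.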
