From: Thamer Mahmoud
Newsgroups: gmane.emacs.help
Subject: Re: How to get title of web page by url?
Date: Wed, 28 Jul 2010 18:34:56 +0300
Message-ID: <87ocdr39i7.fsf@zemblan.newkuwait.org>
References: <87vd802nx4.fsf@zemblan.newkuwait.org>
To: help-gnu-emacs@gnu.org
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.2 (gnu/linux)

filebat Mark writes:

> Thanks, Thamer. It works.
>
> Below is the code snippet.
>
> Well, I still have an encoding problem.
> To get the title of "http://www.baidu.com", the title we get is displayed as
> unrecognizable codes.
>
> I have tried to encode it, in the way of "(setq web_title_str
> (encode-coding-string web_title_str 'utf-8-dos))", but it fails.

I'm also new to Elisp (well, sort of). But here is a modified version
that should handle both charsets and newlines (and other issues noticed
by Deniz Dogan. Thanks).

(defun www-get-page-title (url)
  (let ((title))
    (with-current-buffer (url-retrieve-synchronously url)
      (goto-char (point-min))
      ;; [^<]* instead of .* so that a title spanning several lines matches
      (re-search-forward "<title>\\([^<]*\\)</title>" nil t 1)
      (setq title (match-string 1))
      (goto-char (point-min))
      ;; Take the charset from the HTTP headers or a meta tag
      (re-search-forward "charset=\\([-0-9a-zA-Z]*\\)" nil t 1)
      ;; The raw bytes need decoding (not encoding) to become readable text
      (decode-coding-string title (intern (match-string 1))))))

The robustness of this code still depends on whether the HTML is
well-formed, but it should be good enough, I think.
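
The function above still assumes that both searches succeed and that the
charset name is a coding system Emacs knows under exactly that spelling.
Here is a rough sketch of a more defensive variant. The names
my-www-title-from-response and my-www-get-page-title are made up for
this example, and falling back to utf-8 when no usable charset is found
is my own assumption, not something the page promises.

```elisp
(require 'url)

(defun my-www-title-from-response (response)
  "Return the decoded <title> text in RESPONSE, or nil if there is none.
RESPONSE is the raw (undecoded) HTTP headers and HTML as one string.
Hypothetical helper; the utf-8 fallback is an assumption."
  (let ((case-fold-search t)            ; also match <TITLE> and CHARSET=
        title charset)
    (when (string-match "<title[^>]*>\\([^<]*\\)</title>" response)
      (setq title (match-string 1 response))
      ;; Coding system names are lowercase symbols, so downcase first.
      (when (string-match "charset=[\"']?\\([-0-9a-zA-Z_]+\\)" response)
        (setq charset (intern (downcase (match-string 1 response)))))
      (decode-coding-string
       title
       ;; Only trust the declared charset if Emacs recognizes it.
       (if (and charset (coding-system-p charset)) charset 'utf-8)))))

(defun my-www-get-page-title (url)
  "Fetch URL and return its page title, or nil on failure."
  (let ((buffer (url-retrieve-synchronously url)))
    (when buffer
      (unwind-protect
          (with-current-buffer buffer
            (my-www-title-from-response (buffer-string)))
        ;; Do not leave the retrieval buffer lying around.
        (kill-buffer buffer)))))
```

Keeping the parsing in a function that takes a string means it can be
tried on canned input without touching the network; only
my-www-get-page-title needs a live URL.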
-- 
Thamer

> Since I am a newbie for emacs encoding, can you please help me to point what
> the problem is?
>
> ;; -------------------------- separator --------------------------
> (defun get-page-title()
>   "Get title of web page, whose url can be found in current line"
>   (interactive)
>   ;; Get url from current line
>   (copy-region-as-kill (re-search-backward "^") (re-search-forward "$"))
>   (setq url (substring-no-properties (current-kill 0)))
>   ;; Get title of web page, with the help of functions in url.el
>   (with-current-buffer (url-retrieve-synchronously url)
>     (goto-char 0)
>     (re-search-forward "<title>\\(.*\\)<[/]title>" nil t 1)
>     (setq web_title_str (match-string 1)))
>   (setq web_title_str (encode-coding-string web_title_str 'utf-8-dos))
>   ;; Insert the title in the next line
>   (reindent-then-newline-and-indent)
>   (insert web_title_str)
>   )
>
>
> On 7/28/10, Thamer Mahmoud <thamer.mahmoud@gmail.com> wrote:
>>
>> filebat Mark <filebat.mark@gmail.com> writes:
>>
>> > Such as, given "http://www.emacswiki.org/emacs/Git", we will get the title
>> > of this web page, which is "EmacsWiki: Git:".
>> >
>> > Function of w3m-current-title is quite close, but a standalone lisp function
>> > is much preferred.
>>
>>
>> Using the url.el package,
>>
>> (defun www-get-page-title (url)
>>   (with-current-buffer (url-retrieve-synchronously url)
>>     (goto-char 0)
>>     (re-search-forward "<title>\\(.*\\)<[/]title>" nil t 1)
>>     (match-string 1)))
>>
>> (www-get-page-title "http://www.emacswiki.org/emacs/Git")
>> => "EmacsWiki: Git"
>>
>> hth,
>>
>> Thamer