From: Andreas Röhler
Newsgroups: gmane.emacs.help
Subject: Re: How to get title of web page by url?
Date: Wed, 28 Jul 2010 18:03:58 +0200
Message-ID: <4C5054EE.4060106@easy-emacs.de>
References: <87vd802nx4.fsf@zemblan.newkuwait.org> <87mxtbmzdu.fsf@mithlond.arda>
To: Teemu Likonen
Cc: help-gnu-emacs@gnu.org

[ ... ]

> The real solution for extracting title from a HTML text are not regular
> expressions but a specific HTML parser. The Lisp way to write such
> parser would be to turn the document (or only the head part) to nested
> lists and other s-expressions and then dive into the list to find the
> title. Such parsers already exist for Common Lisp but I'm not sure about
> Emacs Lisp.

beg-end.el at

http://bazaar.launchpad.net/~a-roehler/s-x-emacs-werkstatt

is an attempt at such a parser.

See thing-at-point-markup.el too, which handles markup languages such as
XML and HTML.

thing-at-point-utils.el offers functions to grab everything between angle
brackets, and it takes nesting into account.

Try ar-angled-lesser-atpt, for example.

All of this needs thingatpt-utils-base.el, where the core routines reside.

Have a look there at how the parser mentioned is employed via
beginning-of-form-base and end-of-form-base.

Andreas

--
https://code.launchpad.net/~a-roehler/python-mode
https://code.launchpad.net/s-x-emacs-werkstatt/
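
The nested-list approach described in the quoted text can be sketched in a few lines of Emacs Lisp. This is only an illustration, not the code in beg-end.el or thingatpt-utils-base.el: the names my-html-find-node, my-html-title and my-html-title-in-buffer are made up for this example, and the tree shape (TAG ATTRIBUTES CHILD...) is an assumption. It matches what libxml-parse-html-region returns in Emacs 24 and later when Emacs is built with libxml2, which is newer than this thread.

```elisp
;; Sketch only: all `my-html-' names are hypothetical helpers.

(defun my-html-find-node (tree tag)
  "Return the first node in TREE whose tag is TAG, searching depth-first.
TREE is assumed to be a nested list (TAG ATTRIBUTES CHILD...), where
each CHILD is either a string or another such list."
  (cond
   ((not (consp tree)) nil)             ; strings and nil carry no tag
   ((eq (car tree) tag) tree)
   (t (let (found)
        ;; Walk the children until one subtree yields a match.
        (dolist (child (cddr tree) found)
          (unless found
            (setq found (my-html-find-node child tag))))))))

(defun my-html-title (tree)
  "Return the text of the first title node in TREE, or nil if none."
  (let ((node (my-html-find-node tree 'title)))
    (when node
      ;; Join the string children; nested elements are ignored.
      (mapconcat #'identity
                 (delq nil (mapcar (lambda (c) (and (stringp c) c))
                                   (cddr node)))
                 ""))))

(defun my-html-title-in-buffer ()
  "Return the title of the HTML in the current buffer, or nil.
Assumes an Emacs built with libxml2; returns nil when the parser
function is not defined."
  (when (fboundp 'libxml-parse-html-region)
    (my-html-title (libxml-parse-html-region (point-min) (point-max)))))

;; Usage on a hand-built tree:
(my-html-title
 '(html nil
        (head nil (title nil "Example page"))
        (body nil (p nil "Hello"))))
;; => "Example page"
```

The tree walker does not depend on any parser, so it also works on lists produced some other way, as long as they have the same shape.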