From: David Engster
Newsgroups: gmane.emacs.devel
Subject: Re: "Readability" feature in eww
Date: Mon, 03 Nov 2014 22:37:36 +0100
Message-ID: <87mw88artr.fsf@engster.org>
To: emacs-devel@gnu.org

Lars Magne Ingebrigtsen writes:

> The `R' command in eww will try to find the parts of the current page
> where most of the text is, and only display that part. This makes all
> the menus and stuff disappear, and you don't have to page forever to
> find the actual article on newspaper sites.
>
> This is a heuristic, of course, so it can be tweaked endlessly. The
> current algorithm just gives most words a positive score, HTML markup a
> negative score, and words inside tags a negative score. For such a
> simple algorithm, it seems to give pretty good results.
>
> But tweaking is necessary for it to be ... better. If anybody has ideas
> for tweaks or better algorithms, please be my guest and have at it.

I looked into this a bit years ago, when I was working on emacs-w3m's
'shimbun' feature for Gnus. I took a peek at the algorithm used by the
'boilerplate' library[1], but never got around to implementing it.
Since I mostly needed it for reading blogs, I coded a quick solution
that looks at the 'generator' meta-tag and extracts the main content for
CMSes like WordPress, TypePad or Blogspot/Blogger, which was already
enough for me.

It'd be great if you could make this extraction method flexible, similar
to the 'washing' feature in Gnus, so that users could hook their own
methods for extracting the main content into eww. The user would provide
an extraction function and a corresponding regexp that is matched
against the URL, or optionally also against the page source to catch
things like the 'generator' meta-tag.

-David

[1] http://www.l3s.de/~kohlschuetter/publications/wsdm187-kohlschuetter.pdf
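
To make the hook idea above a bit more concrete, here is a minimal
sketch of the dispatch, written in Python rather than Emacs Lisp purely
for illustration. Everything in it is an assumption: the names
EXTRACTORS, register_extractor, extract_main_content and
wordpress_extractor are hypothetical and are not existing eww or Gnus
API, and the WordPress extractor is a deliberately crude stand-in that
assumes the article lives in a single, non-nested "entry-content" div.

```python
import re
from typing import Callable, List, Optional, Tuple

# Registry of (url regexp, optional source regexp, extraction function).
# Hypothetical structure; eww has no such registry in this form.
EXTRACTORS: List[Tuple[re.Pattern, Optional[re.Pattern], Callable[[str], str]]] = []


def register_extractor(func: Callable[[str], str],
                       url_regexp: str,
                       source_regexp: Optional[str] = None) -> None:
    """Register FUNC for pages whose URL (and optionally source) match."""
    source_re = re.compile(source_regexp, re.I) if source_regexp else None
    EXTRACTORS.append((re.compile(url_regexp), source_re, func))


def extract_main_content(url: str, source: str) -> str:
    """Run the first registered extractor that matches; else return source."""
    for url_re, source_re, func in EXTRACTORS:
        if not url_re.search(url):
            continue
        # The source regexp is optional; when given, it must match too.
        if source_re is not None and not source_re.search(source):
            continue
        return func(source)
    # No extractor matched: hand the page back unchanged so the caller
    # can fall back to a generic heuristic.
    return source


def wordpress_extractor(source: str) -> str:
    """Crude example: take the first 'entry-content' div (no nesting)."""
    m = re.search(r'<div class="entry-content">(.*?)</div>', source, re.S)
    return m.group(1).strip() if m else source


# Match any http(s) URL, but only pages that declare WordPress as generator.
register_extractor(wordpress_extractor,
                   url_regexp=r"^https?://",
                   source_regexp=r'<meta\s+name="generator"\s+content="WordPress')

page = """<html><head><meta name="generator" content="WordPress 4.0" /></head>
<body><div id="menu">Home | About</div>
<div class="entry-content">The actual article.</div></body></html>"""

result = extract_main_content("http://example.org/blog/post", page)
untouched = extract_main_content("http://example.org/", "<p>plain</p>")
```

Here the first matching entry wins, so more specific extractors would
need to be registered before more general ones; a real implementation
might prefer priorities or let the user reorder the list.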