From: Rüdiger Sonderfeld
Newsgroups: gmane.emacs.devel
Subject: Re: "Readability" feature in eww
Date: Mon, 03 Nov 2014 10:37:47 +0100
Message-ID: <7820496.BS1QHyORAs@descartes>
To: emacs-devel@gnu.org
Cc: Lars Magne Ingebrigtsen

On Monday 03 November 2014 01:41:14 Lars Magne Ingebrigtsen wrote:
> This is a heuristic, of course, so it can be tweaked endlessly.  The
> current algorithm just gives most words a positive score, HTML markup a
> negative score, and words inside […] tags a negative score.  For such a
> simple algorithm, it seems to give pretty good results.
>
> But tweaking is necessary for it to be ... better.  If anybody has ideas
> for tweaks or better algorithms, please be my guest and have at it.

HTML5 has introduced tags such as <main> and […], which can be used to
identify the important parts.  I'm not sure how widespread their use thus far
is (I think org-mode supports it already if one sets the HTML5 export option).
But at least adding them to the heuristic might help.

E.g., https://developer.mozilla.org/en-US/docs/Web/HTML/Element/main

Regards,
Rüdiger