From: Stefan Monnier
Newsgroups: gmane.emacs.devel,gmane.emacs.pretest.bugs
Subject: Re: paragraphs.el: do forward-sentence and friends not work?
Date: Thu, 14 Feb 2008 09:43:58 -0500
To: David Reitter
Cc: schwab@suse.de, "Stephen J. Turnbull", rms@gnu.org, emacs-pretest-bug@gnu.org
In-Reply-To: (David Reitter's message of "Thu, 14 Feb 2008 09:45:01 +0000")

>> Using two spaces after end of sentence enables Emacs to distinguish
>> between periods that end sentences and periods for abbreviations.
>> That is why it should be the default.

> We can improve this to make it work without depending on the
> double-space.

> Sentence tokenization is a known problem.  You can throw machine learning
> algorithms at it, but that's not a viable option in our case.  However,
> Grefenstette&Tapanainen (1994) examined this in detail for English, using
> the Brown corpus.  They basically say that using a small lexicon of common
> abbreviations, they can classify 99.1% of all periods correctly.  Even
> without the lexicon, you can achieve 97.7% accuracy (on English) using the
> right regular expressions, and I think this will be similar for other
> languages as well.  I think that's good enough for M-e and M-a.

But the period-single-space vs period-double-space distinction allows us
to get it right 100% in many more languages than just English.


        Stefan "Who switched to non-French spacing even when
                writing French"
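
To make the trade-off discussed above concrete, here is a minimal sketch in Python (not Emacs Lisp, and not the code in paragraphs.el) contrasting the two approaches.  Everything in it is an assumption made for illustration: the function names, the regular expressions, and the tiny abbreviation lexicon are hypothetical, and the heuristic is far cruder than the method of Grefenstette & Tapanainen (1994).  It only shows why a period followed by a single space is ambiguous, while the double-space convention is not.

```python
import re

# Hypothetical, deliberately tiny abbreviation lexicon (an assumption for
# this sketch; a usable lexicon would be much larger and language-specific).
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "e.g.", "i.e.", "etc.", "vs."}


def split_double_space(text):
    """Split only where ., ? or ! is followed by two or more spaces."""
    parts = re.split(r"(?<=[.?!])  +", text.strip())
    return [p for p in parts if p]


def split_single_space(text):
    """Heuristic split after ., ? or ! plus whitespace, skipping any
    period whose word is in the abbreviation lexicon."""
    sentences, start = [], 0
    for m in re.finditer(r"[.?!]\s+", text):
        candidate = text[start:m.start() + 1]
        last_word = candidate.split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue  # treat as an abbreviation, not a sentence end
        sentences.append(candidate.strip())
        start = m.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    # "St." is not in the lexicon, so the single-space heuristic splits
    # this text into three pieces instead of two sentences.
    print(split_single_space("The ship docked at St. Helena. Nobody left."))
    # With the double-space convention the same text splits into two
    # sentences, with no lexicon needed.
    print(split_double_space("The ship docked at St. Helena.  Nobody left."))
```

The single-space version is only as good as its lexicon and needs a new one for each language, whereas the double-space version needs none, provided the writer follows the convention consistently.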