From: David Reitter <david.reitter@gmail.com>
To: rms@gnu.org
Cc: schwab@suse.de, "Stephen J. Turnbull" <stephen@xemacs.org>,
emacs-pretest-bug@gnu.org
Subject: Re: paragraphs.el: do forward-sentence and friends not work?
Date: Thu, 14 Feb 2008 09:45:01 +0000
Message-ID: <C0B5F7FE-E61F-4D3A-A102-87B9DD5D83EC@gmail.com>
In-Reply-To: <E1JPVvs-0005lS-VS@fencepost.gnu.org>
On 14 Feb 2008, at 04:42, Richard Stallman wrote:
> Using two spaces after end of sentence enables Emacs to distinguish
> between periods that end sentences and periods for abbreviations.
> That is why it should be the default.
We can improve this to make it work without depending on the double-
space.
Sentence tokenization is a well-known problem. You can throw machine
learning algorithms at it, but that's not a viable option in our case.
However, Grefenstette & Tapanainen (1994) examined this in detail for
English, using the Brown corpus. They report that with a small lexicon
of common abbreviations, they can classify 99.1% of all periods
correctly. Even without the lexicon, you can achieve 97.7% accuracy
(on English) using the right regular expressions, and I expect the
results to be similar for other languages. I think that's good enough
for M-e and M-a.
http://citeseer.ist.psu.edu/grefenstette94what.html
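To make the idea concrete, here is a rough sketch of the "regular
expression plus small abbreviation lexicon" approach, written in Python
purely for illustration. It is not the method from the paper and not
what paragraphs.el does; the lexicon contents, the single-letter
initial rule, and the names ABBREVIATIONS, CANDIDATE and
split_sentences are all assumptions made up for this example.

```python
import re

# Hypothetical, deliberately tiny abbreviation lexicon (an assumption for
# this sketch; a real one would be larger and language-specific).
ABBREVIATIONS = {"mr", "mrs", "dr", "prof", "e.g", "i.e", "etc", "vs"}

# Candidate boundary: '.', '!' or '?', optional closing quotes/brackets,
# whitespace (one space is enough), then something that looks like the
# start of a sentence (optional opening quote/bracket, capital or digit).
CANDIDATE = re.compile(r"""([.!?])["')\]]*\s+(?=["'(\[]*[A-Z0-9])""")


def split_sentences(text, lexicon=ABBREVIATIONS):
    """Split text into sentences without relying on double spaces."""
    sentences = []
    start = 0
    for m in CANDIDATE.finditer(text):
        if m.group(1) == ".":
            # Look at the word that the period is attached to.
            words = text[start:m.start(1)].split()
            word = words[-1].lstrip("\"'([").lower() if words else ""
            # Skip known abbreviations. Treating a single letter as an
            # initial ("J. Smith") is an extra heuristic of this sketch.
            if word in lexicon or (len(word) == 1 and word.isalpha()):
                continue
        sentences.append(text[start:m.end()].strip())
        start = m.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    for s in split_sentences("Dr. Smith arrived at 5 p.m. on Monday. He left early."):
        print(s)
```

The period after "Dr" is rejected by the lexicon, and the one after
"p.m" is never a candidate because a lowercase word follows it. A
sentence that really ends in an abbreviation followed by a capitalized
word is the kind of case such a heuristic will still get wrong.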