unofficial mirror of bug-gnu-emacs@gnu.org 
From: Daphne Preston-Kendal <dpk@nonceword.org>
To: 48192@debbugs.gnu.org
Subject: bug#48192: forward-word and friends have inconsistent behaviour with Unicode and ASCII punctuation
Date: Mon, 3 May 2021 17:26:44 +0200
Message-ID: <FC5169CD-CFA6-4BF1-B0BC-394F278B5BDA@nonceword.org>
In-Reply-To: <6D537AD9-6B73-42C6-BA7D-D10071135E66@nonceword.org>

I should note that I just tried to reproduce this bug in a different
buffer under emacs -q, and this time the behaviour was consistently the
one I describe for the curly quotes below. When I then restarted without
-q, it behaved that way consistently in all buffers as well. Pfui.
(Sorry, I should have documented my environment more thoroughly before
submitting this bug report. I no longer know what was causing the
inconsistency.)

However, treating "don't", "can't" etc. and almost any English
possessive as two words for the purposes of count-words etc. is
undoubtedly wrong for most users, in my book. That said, I appreciate
there are cross-linguistic issues here: French speakers would be
equally annoyed if "l'allemand" started to count as one word, not two.
(Thanks to John Cowan for this example.)

On 3 May 2021, at 16:37, Daphne Preston-Kendal <dpk@nonceword.org> wrote:

> forward-word, backward-word etc. have inconsistent behaviour when
> applied to text containing ASCII straight quotation marks vs. Unicode
> quotation marks. The word
>    don't
> with a straight quote (U+0027) counts as a single word, and forward-word
> and backward-word will move over the whole thing. Meanwhile,
>    don’t
> with a curly quote (U+2019) counts as two words, and the cursor will
> stop at ‘don’ and ‘t’ separately. (Fundamental mode, Emacs 27.2.)
> 
> This also means count-words/count-words-region give surprising results
> when applied to text containing Unicode curly apostrophes, since they
> work by counting the number of times the cursor can move
> forward-word-strictly between given start and end points. (Since it uses
> forward-word-strictly and not forward-word, the problem can’t be solved
> by customizing find-word-boundary-function-table.)
> 
> The Right Thing in my view would be for Emacs to use the Unicode TR29
> word boundary rules to work out where to put the cursor when
> forward-word and backward-word are invoked. They handle punctuation
> characters correctly, and the rules are not too complicated.
> <http://www.unicode.org/reports/tr29/#Word_Boundaries>
> However, how this would interact with the existing
> find-word-boundary-function-table customization method, I don’t know.
> CLDR makes customizations of the rules for specific (human) languages;
> perhaps they could be ported into Emacs somehow.
> 
> As a temporary workaround to get correct-ish word counts for my
> documents, I’ve hacked up a function that uses how-many instead of
> forward-word to count the number of words in a region.
> <https://gitlab.com/dpk/dotfiles/-/blob/master/.emacs.d/lisp/wc-mode.el>
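
To make the difference concrete outside Emacs, here is a minimal sketch
in Python (not Emacs Lisp, and not the code from wc-mode.el). It
compares a naive count, where any apostrophe splits a word, with a count
that keeps U+0027 and U+2019 inside a word when they sit between
letters, which is roughly what the TR29 rules WB6/WB7 do for these two
characters as I understand them. The function names and the regular
expressions are my own assumptions for illustration; this is nowhere
near a full TR29 implementation.

```python
import re

# Apostrophe characters treated as word-internal in this sketch:
# U+0027 (straight quote) and U+2019 (curly quote).
APOSTROPHES = "'\u2019"

# A run of letters/digits ([^\W_] is "word character except underscore"),
# optionally continued by "apostrophe + letters/digits" groups.
WORD_RE = re.compile(rf"[^\W_]+(?:[{APOSTROPHES}][^\W_]+)*")


def count_words_naive(text: str) -> int:
    """Count maximal letter/digit runs; any apostrophe splits a word."""
    return len(re.findall(r"[^\W_]+", text))


def count_words(text: str) -> int:
    """Count words, keeping an apostrophe between letters inside the word."""
    return len(WORD_RE.findall(text))


if __name__ == "__main__":
    for sample in ["don't", "don\u2019t", "l'allemand"]:
        print(sample, count_words_naive(sample), count_words(sample))
```

Note that the second function counts "l'allemand" as one word, which is
exactly the cross-linguistic problem mentioned above: a single
language-independent rule cannot satisfy both English and French
expectations.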

Thread overview: 6+ messages
2021-05-03 14:37 bug#48192: forward-word and friends have inconsistent behaviour with Unicode and ASCII punctuation Daphne Preston-Kendal
2021-05-03 15:26 ` Daphne Preston-Kendal [this message]
2022-07-01 11:34   ` Lars Ingebrigtsen
2022-07-01 15:14     ` Robert Pluim
2022-07-30 14:07       ` Lars Ingebrigtsen
2021-05-03 15:49 ` Andreas Schwab
