bug#48192: forward-word and friends have inconsistent behaviour with Unicode and ASCII punctuation

unofficial mirror of bug-gnu-emacs@gnu.org 

* bug#48192: forward-word and friends have inconsistent behaviour with Unicode and ASCII punctuation
@ 2021-05-03 14:37 Daphne Preston-Kendal
  2021-05-03 15:26 ` Daphne Preston-Kendal
  2021-05-03 15:49 ` Andreas Schwab
  0 siblings, 2 replies; 6+ messages in thread
From: Daphne Preston-Kendal @ 2021-05-03 14:37 UTC (permalink / raw)
  To: 48192

forward-word, backward-word etc. have inconsistent behaviour when
applied to text containing ASCII straight quotation marks vs. Unicode
quotation marks. The word
    don't
with a straight quote (U+0027) counts as a single word, and forward-word
and backward-word will move over the whole thing. Meanwhile,
    don’t
with a curly quote (U+2019) counts as two words, and the cursor will
stop at ‘don’ and ‘t’ separately. (Fundamental mode, Emacs 27.2.)

This also means count-words/count-words-region give surprising results
when applied to text containing Unicode curly apostrophes, since they
work by counting the number of times the cursor can move
forward-word-strictly between given start and end points. (Since they use
forward-word-strictly and not forward-word, the problem can’t be solved
by customizing find-word-boundary-function-table.)

The Right Thing in my view would be for Emacs to use the Unicode TR29
word boundary rules to work out where to put the cursor when
forward-word and backward-word are invoked. They handle punctuation
characters correctly, and the rules are not too complicated.
<http://www.unicode.org/reports/tr29/#Word_Boundaries>
However, how this would interact with the existing
find-word-boundary-function-table customization method, I don’t know.
CLDR makes customizations of the rules for specific (human) languages;
perhaps they could be ported into Emacs somehow.
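
A rough sketch of the idea, in Python rather than Emacs Lisp, and not
taken from the original message or from Emacs: treat an apostrophe-like
character as part of a word only when it sits between two letters or
digits, loosely in the spirit of TR29 rules WB6/WB7. The set of
apostrophe characters and the names APOSTROPHES, WORD_RE and count_words
are assumptions made for this illustration; full TR29 segmentation
covers many more cases.

```python
import re

# Characters treated as "apostrophe-like" when they sit between two
# letters or digits. This set is an assumption of the sketch:
# U+0027 APOSTROPHE and U+2019 RIGHT SINGLE QUOTATION MARK.
APOSTROPHES = "'\u2019"

# A word is a run of letters/digits, optionally continued by
# (apostrophe + more letters/digits). [^\W_] means "a \w character
# other than underscore".
WORD_RE = re.compile(rf"[^\W_]+(?:[{APOSTROPHES}][^\W_]+)*")


def count_words(text: str) -> int:
    """Count words, keeping an apostrophe between letters inside the word."""
    return len(WORD_RE.findall(text))


if __name__ == "__main__":
    for sample in ["don't", "don\u2019t", "\u2018this\u2019 is quoted"]:
        print(repr(sample), count_words(sample))
```

Under this rule both spellings of "don't" count as one word, while the
quotation marks around ‘this’ are not attached to it. The same rule also
makes "l'allemand" one word, which is exactly the kind of
language-specific case that would need tailoring.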

As a temporary workaround to get correct-ish word counts for my
documents, I’ve hacked up a function that uses how-many instead of
forward-word to count the number of words in a region.
<https://gitlab.com/dpk/dotfiles/-/blob/master/.emacs.d/lisp/wc-mode.el>


* bug#48192: forward-word and friends have inconsistent behaviour with Unicode and ASCII punctuation
  2021-05-03 14:37 bug#48192: forward-word and friends have inconsistent behaviour with Unicode and ASCII punctuation Daphne Preston-Kendal
@ 2021-05-03 15:26 ` Daphne Preston-Kendal
  2022-07-01 11:34   ` Lars Ingebrigtsen
  2021-05-03 15:49 ` Andreas Schwab
  1 sibling, 1 reply; 6+ messages in thread
From: Daphne Preston-Kendal @ 2021-05-03 15:26 UTC (permalink / raw)
  To: 48192

I should note that I just tried to reproduce this bug in a different
buffer in emacs -q, and the behaviour this time was consistently the one
I describe for the curly quotes below; then when I restarted again
without -q, it was behaving like that consistently in all buffers again.
Pfui. (Sorry, I should have documented my environment more thoroughly
before submitting this bug report. I don’t know any more what was
causing the inconsistency.)

However, the behaviour of considering "don't", "can't" etc. and almost
any English possessive as two words for the purposes of count-words etc.
is undoubtedly wrong for most users in my book. That said, I appreciate
there are cross-linguistic issues here, and French speakers would be
equally annoyed if "l'allemand" started to count as one word, not two.
(Thanks to John Cowan for this example.)

On 3 May 2021, at 16:37, Daphne Preston-Kendal <dpk@nonceword.org> wrote:

> forward-word, backward-word etc. have inconsistent behaviour when
> applied to text containing ASCII straight quotation marks vs. Unicode
> quotation marks. The word
>    don't
> with a straight quote (U+0027) counts as a single word, and forward-word
> and backward-word will move over the whole thing. Meanwhile,
>    don’t
> with a curly quote (U+2019) counts as two words, and the cursor will
> stop at ‘don’ and ‘t’ separately. (Fundamental mode, Emacs 27.2.)
> 
> This also means count-words/count-words-region give surprising results
> when applied to text containing Unicode curly apostrophes, since they
> work by counting the number of times the cursor can move
> forward-word-strictly between given start and end points. (Since it uses
> forward-word-strictly and not forward-word, the problem can’t be solved
> by customizing find-word-boundary-function-table.)
> 
> The Right Thing in my view would be for Emacs to use the Unicode TR29
> word boundary rules to work out where to put the cursor when
> forward-word and backward-word are invoked. They handle punctuation
> characters correctly, and rules are not too complicated.
> <http://www.unicode.org/reports/tr29/#Word_Boundaries>
> However, how this would interact with the existing
> find-word-boundary-function-table customization method, I don’t know.
> CLDR makes customizations of the rules for specific (human) languages;
> perhaps they could be ported into Emacs somehow.
> 
> As a temporary workaround to get correct-ish word counts for my
> documents, I’ve hacked up a function that uses how-many instead of
> forward-word to count the number of words in a region.
> <https://gitlab.com/dpk/dotfiles/-/blob/master/.emacs.d/lisp/wc-mode.el>







* bug#48192: forward-word and friends have inconsistent behaviour with Unicode and ASCII punctuation
  2021-05-03 14:37 bug#48192: forward-word and friends have inconsistent behaviour with Unicode and ASCII punctuation Daphne Preston-Kendal
  2021-05-03 15:26 ` Daphne Preston-Kendal
@ 2021-05-03 15:49 ` Andreas Schwab
  1 sibling, 0 replies; 6+ messages in thread
From: Andreas Schwab @ 2021-05-03 15:49 UTC (permalink / raw)
  To: Daphne Preston-Kendal; +Cc: 48192

On May 03 2021, Daphne Preston-Kendal wrote:

> forward-word, backward-word etc. have inconsistent behaviour when
> applied to text containing ASCII straight quotation marks vs. Unicode
> quotation marks. The word
>     don't
> with a straight quote (U+0027) counts as a single word, and forward-word
> and backward-word will move over the whole thing. Meanwhile,
>     don’t
> with a curly quote (U+2019) counts as two words, and the cursor will
> stop at ‘don’ and ‘t’ separately. (Fundamental mode, Emacs 27.2.)

Looks like you have customized the syntax table, because by default,
both ' and ’ have punctuation syntax, thus are not part of a word.  But
text-mode uses a different syntax table, where ' has word syntax.

Andreas.
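
To make the explanation above concrete, here is a toy model in Python
(not Emacs code, and not part of the original message). It counts
maximal runs of "word-constituent" characters, where the set of extra
word constituents stands in for a syntax table. The function name
count_words_by_syntax and the constants DEFAULT and TEXT_MODE_LIKE are
hypothetical; real Emacs word motion involves more than this.

```python
def count_words_by_syntax(text: str, extra_word_chars: str = "") -> int:
    """Count maximal runs of word-constituent characters.

    A character is a word constituent if it is alphanumeric or listed
    in extra_word_chars (a stand-in for syntax-table customization).
    """
    def is_word(ch: str) -> bool:
        return ch.isalnum() or ch in extra_word_chars

    count = 0
    in_word = False
    for ch in text:
        if is_word(ch):
            if not in_word:
                count += 1  # a new run of word characters starts here
            in_word = True
        else:
            in_word = False
    return count


# Assumed stand-ins for the two situations described above.
DEFAULT = ""          # both ' and U+2019 behave as punctuation
TEXT_MODE_LIKE = "'"  # only the ASCII apostrophe is a word constituent

if __name__ == "__main__":
    for word in ["don't", "don\u2019t"]:
        print(repr(word),
              count_words_by_syntax(word, DEFAULT),
              count_words_by_syntax(word, TEXT_MODE_LIKE))
```

With the default-like setting both spellings split into two words; with
the text-mode-like setting only the straight-quote spelling stays whole,
which matches the inconsistency in the original report.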

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."






* bug#48192: forward-word and friends have inconsistent behaviour with Unicode and ASCII punctuation
  2021-05-03 15:26 ` Daphne Preston-Kendal
@ 2022-07-01 11:34   ` Lars Ingebrigtsen
  2022-07-01 15:14     ` Robert Pluim
  0 siblings, 1 reply; 6+ messages in thread
From: Lars Ingebrigtsen @ 2022-07-01 11:34 UTC (permalink / raw)
  To: Daphne Preston-Kendal; +Cc: 48192

Daphne Preston-Kendal <dpk@nonceword.org> writes:

> However, the behaviour of considering "don't", "can't" etc. and almost
> any English possessive as two words for the purposes of count-words etc.
> is undoubtedly wrong for most users in my book.

(I'm going through old bug reports that unfortunately weren't resolved
at the time.)

I think it would make sense to make text-mode give ’ (RIGHT SINGLE
QUOTATION MARK) a word constituent syntax, because many people use that
character interchangeably with ' (APOSTROPHE).

But that's not really the intention behind that character.  RIGHT SINGLE
QUOTATION MARK is to allow quoting like ‘this’ -- i.e., the ’ is not
meant to be used inside words.

So changing the syntax here would be controversial since it's "wrong" to
use the ’ character instead of APOSTROPHE, even though it's common.

Does anybody have an opinion here?

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no


* bug#48192: forward-word and friends have inconsistent behaviour with Unicode and ASCII punctuation
  2022-07-01 11:34   ` Lars Ingebrigtsen
@ 2022-07-01 15:14     ` Robert Pluim
  2022-07-30 14:07       ` Lars Ingebrigtsen
  0 siblings, 1 reply; 6+ messages in thread
From: Robert Pluim @ 2022-07-01 15:14 UTC (permalink / raw)
  To: Lars Ingebrigtsen; +Cc: 48192, Daphne Preston-Kendal

>>>>> On Fri, 01 Jul 2022 13:34:36 +0200, Lars Ingebrigtsen <larsi@gnus.org> said:

    Lars> Daphne Preston-Kendal <dpk@nonceword.org> writes:
    >> However, the behaviour of considering "don't", "can't" etc. and almost
    >> any English possessive as two words for the purposes of count-words etc.
    >> is undoubtedly wrong for most users in my book.

    Lars> (I'm going through old bug reports that unfortunately weren't resolved
    Lars> at the time.)

    Lars> I think it would make sense to make text-mode give ’ (RIGHT SINGLE
    Lars> QUOTATION MARK) a word constituent syntax, because many people use that
    Lars> character interchangeably with ' (APOSTROPHE).

    Lars> But that's not really the intention behind that character.  RIGHT SINGLE
    Lars> QUOTATION MARK is to allow quoting like ‘this’ -- i.e., the ’ is not
    Lars> meant to be used inside words.

    Lars> So changing the syntax here would be controversial since it's "wrong" to
    Lars> use the ’ character instead of APOSTROPHE, even though it's common.

    Lars> Does anybody have an opinion here?

What people should do is use U+02BC, MODIFIER LETTER APOSTROPHE, since
that has word-constituent syntax already.

(what, the worldʼs not going to change to suit me, you say? 😼)

Robert
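
For reference, the three characters discussed in this thread can be
inspected with Python's standard unicodedata module. This is a small
sketch that is not part of the original message; the helper name
describe is hypothetical. It shows only the Unicode side, since Emacs
syntax tables are a separate mechanism.

```python
import unicodedata


def describe(ch: str) -> tuple[str, str, str]:
    """Return (code point, Unicode name, general category) for ch."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))


if __name__ == "__main__":
    # U+0027, U+2019 and U+02BC, in that order.
    for ch in "'\u2019\u02bc":
        print(describe(ch))
```

U+0027 and U+2019 are both in punctuation categories (Po and Pf), while
U+02BC is a modifier letter (Lm), which fits the remark that it already
behaves as a word constituent.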

* bug#48192: forward-word and friends have inconsistent behaviour with Unicode and ASCII punctuation
  2022-07-01 15:14     ` Robert Pluim
@ 2022-07-30 14:07       ` Lars Ingebrigtsen
  0 siblings, 0 replies; 6+ messages in thread
From: Lars Ingebrigtsen @ 2022-07-30 14:07 UTC (permalink / raw)
  To: Robert Pluim; +Cc: Daphne Preston-Kendal, 48192

Robert Pluim <rpluim@gmail.com> writes:

> What people should do is use U+02BC, MODIFIER LETTER APOSTROPHE, since
> that has word-constituent syntax already.
>
> (what, the worldʼs not going to change to suit me, you say? 😼)

😀

In any case, I think the conclusion here is that we don't want to change
anything here, and I'm therefore closing this bug report.





