From: Daphne Preston-Kendal
Newsgroups: gmane.emacs.bugs
Subject: bug#48192: forward-word and friends have inconsistent behaviour with Unicode and ASCII punctuation
Date: Mon, 3 May 2021 17:26:44 +0200
References: <6D537AD9-6B73-42C6-BA7D-D10071135E66@nonceword.org>
In-Reply-To: <6D537AD9-6B73-42C6-BA7D-D10071135E66@nonceword.org>
To: 48192@debbugs.gnu.org
Xref: news.gmane.io gmane.emacs.bugs:205528

I should note that I just tried to reproduce this bug in a different
buffer under emacs -q, and this time the behaviour was consistently the
one I describe for the curly quotes below. When I restarted again
without -q, it behaved that way consistently in all buffers too. Pfui.
(Sorry, I should have documented my environment more thoroughly before
submitting this bug report. I no longer know what was causing the
inconsistency.)

Still, treating "don't", "can't" etc. and almost any English possessive
as two words for the purposes of count-words etc. is, in my book,
undoubtedly wrong for most users. I do appreciate that there are
cross-linguistic issues here: French speakers would be equally annoyed
if "l'allemand" started to count as one word, not two. (Thanks to John
Cowan for this example.)

On 3 May 2021, at 16:37, Daphne Preston-Kendal wrote:

> forward-word, backward-word etc. have inconsistent behaviour when
> applied to text containing ASCII straight quotation marks vs. Unicode
> quotation marks. The word
> don't
> with a straight quote (U+0027) counts as a single word, and forward-word
> and backward-word will move over the whole thing. Meanwhile,
> don’t
> with a curly quote (U+2019) counts as two words, and the cursor will
> stop at ‘don’ and ‘t’ separately. (Fundamental mode, Emacs 27.2.)
>
> This also means count-words/count-words-region give surprising results
> when applied to text containing Unicode curly apostrophes, since they
> work by counting the number of times the cursor can move
> forward-word-strictly between given start and end points.
> (Since it uses forward-word-strictly and not forward-word, the problem
> can’t be solved by customizing find-word-boundary-function-table.)
>
> The Right Thing in my view would be for Emacs to use the Unicode TR29
> word boundary rules to work out where to put the cursor when
> forward-word and backward-word are invoked. They handle punctuation
> characters correctly, and rules are not too complicated.
>
> However, how this would interact with the existing
> find-word-boundary-function-table customization method, I don’t know.
> CLDR makes customizations of the rules for specific (human) languages;
> perhaps they could be ported into Emacs somehow.
>
> As a temporary workaround to get correct-ish word counts for my
> documents, I’ve hacked up a function that uses how-many instead of
> forward-word to count the number of words in a region.
>
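
For illustration only, here is a rough model of the difference in word
counts discussed above. It is written in Python rather than Emacs Lisp,
it is not the function mentioned in the message, and it does not use any
Emacs API. It assumes a heavily simplified reading of TR29 rules WB6 and
WB7, covering only U+0027 and U+2019 between word characters; the names
WORD_RE, NAIVE_RE, count_words and count_words_naive are made up for
this sketch.

```python
import re

# Apostrophe-like characters handled by this sketch:
# U+0027 APOSTROPHE and U+2019 RIGHT SINGLE QUOTATION MARK.
# (TR29 treats more characters as word-internal; they are ignored here.)
APOSTROPHES = "'\u2019"

# Simplified TR29-style word: a run of word characters, optionally
# continued across a single apostrophe when more word characters follow.
WORD_RE = re.compile(rf"\w+(?:[{APOSTROPHES}]\w+)*")

# Naive model: any apostrophe ends the current word.
NAIVE_RE = re.compile(r"\w+")


def count_words(text):
    """Count words, keeping word-internal apostrophes inside the word."""
    return len(WORD_RE.findall(text))


def count_words_naive(text):
    """Count words, splitting at every apostrophe."""
    return len(NAIVE_RE.findall(text))


if __name__ == "__main__":
    for sample in ["don't", "don\u2019t", "l'allemand"]:
        print(sample, count_words(sample), count_words_naive(sample))
```

Under this model "don't" and "don’t" each count as one word, whereas the
naive split yields two. "l'allemand" also counts as one word, which is
the cross-linguistic concern raised above and the kind of case a
language-specific tailoring would have to address.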