From mboxrd@z Thu Jan 1 00:00:00 1970
Path: news.gmane.org!.POSTED!not-for-mail
From: Eli Zaretskii
Newsgroups: gmane.emacs.bugs
Subject: bug#29871: 25.3; ZWJ word-boundaries in regexps
Date: Wed, 27 Dec 2017 22:33:22 +0200
Message-ID: <83bmijhpwt.fsf@gnu.org>
References: <87k1x8f0qr.fsf@nagas.meson.org>
In-reply-to: <87k1x8f0qr.fsf@nagas.meson.org> (mark@nagas.meson.org)
Reply-To: Eli Zaretskii
Cc: 29871@debbugs.gnu.org
To: "Mark Shoulson"
X-GNU-PR-Message: followup 29871
X-GNU-PR-Package: emacs
List-Id: "Bug reports for GNU Emacs, the Swiss army knife of text editors"
Xref: news.gmane.org gmane.emacs.bugs:141545

> From: "Mark Shoulson"
> Date: Wed, 27 Dec 2017 14:07:40 -0500
>
> According to http://unicode.org/reports/tr29/#Word_Boundaries rule WB4,
> it would seem that a ZWJ character (U+200D ZERO WIDTH JOINER) between
> two "word" characters should not constitute a word boundary.  And yet:
>
> (string-match "\\<" "foo\u200Dfbar" 1)
>
> evaluates to 4 (the 1 is to skip the word-beginning at the start of the
> string).  Or you can search for "\\b" or "\\>" and get 3.  Either way,
> indicative of a word-break at the ZWJ character.  Is this correct?

Emacs considers a change of script as a word break, and U+200D's
script is 'symbol', which is different from 'latin', the script of the
ASCII characters.
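
One way to see the script change for yourself is to look each character up in char-script-table. The following is only a sketch: the helper name my-char-scripts is made up for illustration and is not part of Emacs, and the script reported for U+200D may differ between Emacs versions.

```elisp
;; Hypothetical helper (not part of Emacs): pair each character of
;; STRING with the script that `char-script-table' assigns to it.
(defun my-char-scripts (string)
  "Return a list of (CHAR . SCRIPT) pairs, one per character of STRING."
  (mapcar (lambda (ch)
            ;; `char-script-table' is a char-table, so `aref' with a
            ;; character returns that character's script symbol.
            (cons ch (aref char-script-table ch)))
          (string-to-list string)))

;; Inspect the characters on either side of the ZWJ from the report.
;; A script value for U+200D that differs from the one for the
;; neighbouring letters is what makes the regexp engine see a word
;; boundary there.
(my-char-scripts "o\u200Df")
```

Evaluating the last form in *scratch* with C-j shows the script symbol next to each character code, so you can compare the value for U+200D with the value for the ASCII letters around it.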