From: "yury.t"
To: David Bremner
Cc: Notmuch
Date: Sat, 24 Aug 2019 16:39:10 +0200 (CEST)
Subject: Re: regex [X-Z] with non-ascii char returns different results from (X|Y|Z)
In-Reply-To: <87pnkxiclr.fsf@tethera.net>
References: <878srmk2i8.fsf@tethera.net> <87pnkxiclr.fsf@tethera.net>

Although this thread may now be off-topic, let me send a follow-up.

Searching with C-related terms, I found some articles about this issue. It seems to be a common problem with regex and multibyte strings in C (e.g. https://stackoverflow.com/a/15895746 ).

On Wed, Aug 21, 2019 at 12:58:04PM +0000, tptlab@tuta.io wrote:
> - [１] (U+FF11) is treated as [\x{F000}-\x{FFFF}]

Actually, it becomes [\xef\xbc\x91], a bracket expression over the three UTF-8 bytes of U+FF11. That's why it matches U+Fxxx (every code point in that range starts with \xef in UTF-8). And without ^, it also matches a single byte in the middle of a character: U+4444 (\xe4\x91\x84) and U+5C11 (\xe5\xb0\x91), for example, both contain \x91.

I'm not familiar with C and don't know whether pcre or \k would solve this issue, but it might be hard to fix if the root cause is how C handles multibyte strings.
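
To make the byte-level effect concrete, here is a minimal sketch in Python rather than C. It only emulates the behaviour described above and is not notmuch or Xapian code. The assumption is that a byte-oriented engine sees [１] as a set of three bytes, while a Unicode-aware engine sees a set of one code point. The names byte_class, char_class, and samples are made up for illustration.

```python
import re

# Assumption: a byte-oriented engine receives "[１]" as the bytes
# '[' 0xEF 0xBC 0x91 ']', i.e. a set of three separate bytes.
byte_class = re.compile(b"[\xef\xbc\x91]")

# A Unicode-aware engine sees a set containing the single code point U+FF11.
char_class = re.compile("[\uff11]")

# U+FF11 itself, another U+Fxxx character, the two examples from the
# message, and a plain ASCII digit for comparison.
samples = ["\uff11", "\uff5a", "\u4444", "\u5c11", "1"]


def byte_hit(s):
    """Unanchored search over the UTF-8 bytes of s."""
    return byte_class.search(s.encode("utf-8")) is not None


def char_hit(s):
    """Unanchored search over the code points of s."""
    return char_class.search(s) is not None


results = {f"U+{ord(s):04X}": (byte_hit(s), char_hit(s)) for s in samples}

for code_point, (by_bytes, by_chars) in results.items():
    print(code_point, "bytes:", by_bytes, "chars:", by_chars)
```

In this sketch only U+FF11 matches in both modes. U+FF5A, U+4444, and U+5C11 match only in byte mode, because their UTF-8 encodings contain \xef or \x91.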