regex [X-Z] with non-ascii char returns different results from (X|Y|Z)

unofficial mirror of notmuch@notmuchmail.org

* regex [X-Z] with non-ascii char returns different results from (X|Y|Z)
@ 2019-08-21 12:58 yury.t
  2019-08-21 14:38 ` David Bremner
  0 siblings, 1 reply; 6+ messages in thread
From: yury.t @ 2019-08-21 12:58 UTC (permalink / raw)
  To: notmuch

Some regular expressions return incorrect results when the pattern contains multibyte characters inside square brackets.  The following bracket expression also matches subjects that do not start with `[１-９]`, so it returns more results than the equivalent parenthesized alternation.

(Please note that the digits are full-width Unicode characters.)

    notmuch count -- 'subject:"/^[１-９]/"' # 961

    notmuch count -- 'subject:"/^(１|２|３|４|５|６|７|８|９)/"' # 32

Somehow a non-ASCII character in brackets matches any character whose code point starts with the same hex digit.  For example:

- [１] (U+FF11) is treated as [\x{F000}-\x{FFFF}]

- ^[倀] (U+5000), ^[啕] (U+5555) and ^[忿] (U+5fff) return same results since they are all "U+5xxx".

Without ^, their results vary but still contain unrelated subjects.
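One way to see why these characters group together is to look at their UTF-8 encodings. The following is a minimal Python sketch, not part of notmuch; the helper names `utf8_hex` and `lead_bytes` are made up for illustration. It shows that every code point in U+5000..U+5FFF shares the same first UTF-8 byte, and likewise for U+F000..U+FFFF.

```python
def utf8_hex(ch: str) -> str:
    """Return the UTF-8 bytes of ch as space-separated hex."""
    return " ".join(f"{b:02x}" for b in ch.encode("utf-8"))


def lead_bytes(first: int, last: int) -> set:
    """Collect the first UTF-8 byte of every code point in [first, last]."""
    return {chr(cp).encode("utf-8")[0] for cp in range(first, last + 1)}


# The characters mentioned above: U+FF11, U+5000, U+5555, U+5FFF.
for ch in ["１", "倀", "啕", "忿"]:
    print(f"U+{ord(ch):04X} -> {utf8_hex(ch)}")

# Every code point in each block starts with one and the same byte.
print({hex(b) for b in lead_bytes(0x5000, 0x5FFF)})  # {'0xe5'}
print({hex(b) for b in lead_bytes(0xF000, 0xFFFF)})  # {'0xef'}
```

If the regex engine compares bytes rather than characters (an assumption at this point in the thread), an anchored `^[倀]` only ever tests the first byte of the subject, which would explain why all "U+5xxx" characters behave alike.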

Curly braces for repetition also behave strangely.

If there are two emails whose subject is (A) "１人" and (B) "１２人":

- ^(１|２...|９)人 - matches A, does not match B (expected)

- ^(１|２...|９){2}人 - does not match A, matches B (expected)

- ^[１-９]人 and ^[１-９]{2}人 - match neither

- ^[１-９]{3}人, and likewise {4} and {5} - match A, do not match B

- ^[１-９]{6}人, and likewise {7} and {8} - do not match A, match B
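The counts 3 and 6 are suggestive, because each full-width digit occupies three bytes in UTF-8. The Python sketch below is a rough model, not notmuch's code: it assumes the engine treats both pattern and subject as raw bytes, and the helper `matches` is hypothetical. It only checks the repetition counts 1, 2, 3 and 6; it does not attempt to model the other counts listed above.

```python
import re

# Assumption: pattern and subject are compared as raw UTF-8 bytes.
cls = "[１-９]".encode("utf-8")   # b"[\xef\xbc\x91-\xef\xbc\x99]"
tail = "人".encode("utf-8")       # b"\xe4\xba\xba"
a = "１人".encode("utf-8")        # subject (A): 3 + 3 bytes
b = "１２人".encode("utf-8")      # subject (B): 3 + 3 + 3 bytes


def matches(count: int, subject: bytes) -> bool:
    """Match ^[class]{count}人 against subject, byte by byte."""
    pattern = b"^" + cls + b"{%d}" % count + tail
    return re.search(pattern, subject) is not None


for count in (1, 2, 3, 6):
    print(count, "A:", matches(count, a), "B:", matches(count, b))
```

Under this model the bracket expression consumes one byte per repetition, so {3} swallows exactly one full-width digit and {6} swallows two.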

As the notmuch-search-terms manpage instructs, I wrap the regular expression in double quotes and the entire query in single quotes.  I also tried increasing and decreasing $XAPIAN_CJK_NGRAM and rebuilding the index, but the results did not change.

* Re: regex [X-Z] with non-ascii char returns different results from (X|Y|Z)
  2019-08-21 12:58 regex [X-Z] with non-ascii char returns different results from (X|Y|Z) yury.t
@ 2019-08-21 14:38 ` David Bremner
  2019-08-22 12:28   ` yury.t
  0 siblings, 1 reply; 6+ messages in thread
From: David Bremner @ 2019-08-21 14:38 UTC (permalink / raw)
  To: yury.t, notmuch

"yury.t" <tptlab@tuta.io> writes:

> Some regular expressions return incorrect results when the pattern
> contains multibyte characters inside square brackets.  The following
> bracket expression also matches subjects that do not start with
> `[１-９]`, so it returns more results than the equivalent
> parenthesized alternation.

We rely on POSIX.2 regex functions (regcomp, regexec). I would be
interested to know if the searches you are interested in work in a
standalone C program using regcomp and regexec.

d

* Re: regex [X-Z] with non-ascii char returns different results from (X|Y|Z)
  2019-08-21 14:38 ` David Bremner
@ 2019-08-22 12:28   ` yury.t
  2019-08-22 12:55     ` David Bremner
  0 siblings, 1 reply; 6+ messages in thread
From: yury.t @ 2019-08-22 12:28 UTC (permalink / raw)
  To: David Bremner; +Cc: Notmuch

Thank you for your reply.
I confirmed that the issue also reproduces in a standalone C program: https://pastebin.com/5NaCM45G

Sorry for bothering you...

* Re: regex [X-Z] with non-ascii char returns different results from (X|Y|Z)
  2019-08-22 12:28   ` yury.t
@ 2019-08-22 12:55     ` David Bremner
  2019-08-22 19:53       ` Tomi Ollila
  2019-08-24 14:39       ` yury.t
  0 siblings, 2 replies; 6+ messages in thread
From: David Bremner @ 2019-08-22 12:55 UTC (permalink / raw)
  To: yury.t; +Cc: Notmuch

"yury.t" <tptlab@tuta.io> writes:

> Thank you for your reply.
> I confirmed that the issue also reproduces in a standalone C program: https://pastebin.com/5NaCM45G
>
> Sorry for bothering you...

I'm not sure, but it might be a glibc bug. Since we are already using
glib, maybe we should use

      https://developer.gnome.org/glib/stable/glib-Perl-compatible-regular-expressions.html

I don't know if it also has this problem with [] and non-ascii
characters.

d

* Re: regex [X-Z] with non-ascii char returns different results from (X|Y|Z)
  2019-08-22 12:55     ` David Bremner
@ 2019-08-22 19:53       ` Tomi Ollila
  2019-08-24 14:39       ` yury.t
  1 sibling, 0 replies; 6+ messages in thread
From: Tomi Ollila @ 2019-08-22 19:53 UTC (permalink / raw)
  Cc: Notmuch

On Thu, Aug 22 2019, David Bremner wrote:

> "yury.t" <tptlab@tuta.io> writes:
>
>> Thank you for your reply.
>> I confirmed that the issue also reproduces in a standalone C program: https://pastebin.com/5NaCM45G
>>
>> Sorry for bothering you...
>
> I'm not sure, but it might be a glibc bug. Since we are already using
> glib, maybe we should use
>
>       https://developer.gnome.org/glib/stable/glib-Perl-compatible-regular-expressions.html
>
> I don't know if it also has this problem with [] and non-ascii
> characters.

Since pcre2 supports \K, that gives me a positive vibe about the above

( 'Resetting the match start' 
  in http://www.pcre.org/current/doc/html/pcre2pattern.html )

Tomi 

>
> d

* Re: regex [X-Z] with non-ascii char returns different results from (X|Y|Z)
  2019-08-22 12:55     ` David Bremner
  2019-08-22 19:53       ` Tomi Ollila
@ 2019-08-24 14:39       ` yury.t
  1 sibling, 0 replies; 6+ messages in thread
From: yury.t @ 2019-08-24 14:39 UTC (permalink / raw)
  To: David Bremner; +Cc: Notmuch

Although this thread may now be off-topic, let me send a follow-up.
By searching with C-related terms, I found some articles about this issue.  It seems to be a common problem with regex and multibyte characters in C (e.g. https://stackoverflow.com/a/15895746).

On Wed, Aug 21, 2019 at 12:58:04PM +0000, tptlab@tuta.io wrote:
> - [１] (U+FF11) is treated as [\x{F000}-\x{FFFF}]

Actually, it becomes the byte class [\xef\xbc\x91].  That is why it matches any U+Fxxx character (each starts with \xef in UTF-8).  Without ^, it also matches a single byte inside another character, for example U+4444 (\xe4\x91\x84) or U+5C11 (\xe5\xb0\x91).
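The byte-class reading can be checked with a small Python sketch that compiles the bracket expression as bytes. This only imitates a byte-oriented engine; it is not the regcomp/regexec code path notmuch uses, and `first_hit` is a hypothetical helper.

```python
import re

# "[１]" as an engine working on raw UTF-8 bytes would see it.
byte_class = re.compile("[１]".encode("utf-8"))  # b"[\xef\xbc\x91]"


def first_hit(ch: str):
    """Return the first byte of ch matched by the byte class, or None."""
    m = byte_class.search(ch.encode("utf-8"))
    return m.group(0) if m else None


print(first_hit("１"))       # b'\xef', the intended character
print(first_hit("\uf000"))   # b'\xef', an unrelated U+Fxxx character
print(first_hit("\u4444"))   # b'\x91', a continuation byte inside U+4444
print(first_hit("\u5c11"))   # b'\x91', a continuation byte inside U+5C11
print(first_hit("A"))        # None
```

Each hit is a single byte, never a whole character, which is consistent with the unrelated subjects showing up in the unanchored searches.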

I'm not familiar with C and don't know whether PCRE or \K solves this issue, but it might be hard to fix if the root cause is how C handles multibyte strings.
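For comparison, here is what the two patterns do in an engine that works on code points rather than bytes. The sketch uses Python's `re` module on `str` values purely as a stand-in; it says nothing about how GLib's regex API or PCRE would behave inside notmuch, which the thread leaves open. The sample subjects are invented.

```python
import re

# On str patterns, Python's re treats [１-９] as the code point range
# U+FF11..U+FF19, not as a set of UTF-8 bytes.
bracket = re.compile(r"^[１-９]")
alternation = re.compile(r"^(１|２|３|４|５|６|７|８|９)")

# Invented sample subjects, including characters that share bytes with １.
subjects = ["１人", "１２人", "人１", "\u4444 test", "\uf000 test", "9 people"]

for s in subjects:
    print(repr(s), bool(bracket.search(s)), bool(alternation.search(s)))

matched = [s for s in subjects if bracket.search(s)]
```

Here the bracket expression and the alternation agree on every subject, and repetition counts characters, so `^[１-９]{2}人` matches "１２人" but not "１人".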
