unofficial mirror of notmuch@notmuchmail.org
From: "yury.t" <tptlab@tuta.io>
To: <notmuch@notmuchmail.org>
Subject: regex [X-Z] with non-ascii char returns different results from (X|Y|Z)
Date: Wed, 21 Aug 2019 14:58:04 +0200 (CEST)
Message-ID: <LmoFLlW--3-1@tuta.io>


Some regular expressions return incorrect results when the pattern contains multibyte characters inside square brackets.  The bracket expression below matches subjects that do not start with `[１-９]`, and it returns far more results than the equivalent parenthesized alternation.

(Note that the digits are full-width Unicode characters.)




    notmuch count -- 'subject:"/^[１-９]/"' # 961


    notmuch count -- 'subject:"/^(１|２|３|４|５|６|７|８|９)/"' # 32
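
Since the alternation form is the one that gives the expected count, it can be generated instead of typed by hand. The following is only a sketch: the helper name `alternation_for_range` is made up for illustration, and it assumes the range contains no regex metacharacters (true for the full-width digits).

```python
def alternation_for_range(first: str, last: str) -> str:
    """Expand an inclusive code point range into a (a|b|c) alternation."""
    # Assumption: no character in the range is a regex metacharacter,
    # which holds for the full-width digits U+FF11..U+FF19 used here.
    chars = [chr(cp) for cp in range(ord(first), ord(last) + 1)]
    return "(" + "|".join(chars) + ")"


# Build the query string that would be passed to `notmuch count --`.
# "\uff11" and "\uff19" are the full-width digits one and nine.
pattern = "^" + alternation_for_range("\uff11", "\uff19")
query = f'subject:"/{pattern}/"'
print(query)
```

The printed string is meant to be pasted inside single quotes on the shell command line, as in the second command above.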





Somehow a non-ASCII character in brackets matches any character whose code point starts with the same hex digit.  For example:





- [１] (U+FF11) is treated as [\x{F000}-\x{FFFF}]


- ^[倀] (U+5000), ^[啕] (U+5555) and ^[忿] (U+5FFF) return the same results, since they are all "U+5xxx".


Without ^, their results vary but still contain unrelated subjects.
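
One possible reading of these symptoms (an assumption on my part, not something established in this message) is that the bracket contents are compared byte by byte on the UTF-8 encoding rather than character by character. In UTF-8, every code point in U+5000..U+5FFF starts with the byte 0xE5, and every code point in U+F000..U+FFFF starts with 0xEF. The sketch below uses Python's byte-mode `re` only to imitate that behaviour; the helper names are hypothetical and this is not how notmuch itself is implemented.

```python
import re


def lead_byte(ch: str) -> int:
    """Return the first byte of the character's UTF-8 encoding."""
    return ch.encode("utf-8")[0]


def bytewise_bracket_match(bracket_char: str, subject: str) -> bool:
    """Imitate ^[X] when each UTF-8 byte of X is a separate set member."""
    pattern = b"^[" + bracket_char.encode("utf-8") + b"]"
    return re.search(pattern, subject.encode("utf-8")) is not None


# Show the UTF-8 bytes of the characters mentioned in the list above.
for ch in ("\u5000", "\u5555", "\u5fff", "\uff11"):
    encoded = " ".join(f"{b:02X}" for b in ch.encode("utf-8"))
    print(f"U+{ord(ch):04X} -> {encoded}")

# Under the byte-wise assumption, ^[U+5000] also matches a subject that
# starts with U+5555, because both begin with the byte 0xE5.
print(bytewise_bracket_match("\u5000", "\u5555 unrelated subject"))
```

Under the same assumption, dropping the ^ lets the trailing bytes (0x80, 0x95, 0xBF) match anywhere in a subject, which would be consistent with the results differing while still including unrelated subjects.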





Curly-bracket repetition also behaves strangely.


Suppose there are two emails whose subjects are (A) "１人" and (B) "１２人":



- ^(１|２...|９)人 - matches A, does not match B (expected)


- ^(１|２...|９){2}人 - does not match A, matches B (expected)


- ^[１-９]人 and ^[１-９]{2}人 - match neither


- ^[１-９]{3}人, {4} and {5} - match A, do not match B


- ^[１-９]{6}人, {7} and {8} - do not match A, match B
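
The {3} and {6} rows line up with the fact that each full-width digit is three bytes long in UTF-8. As a rough check, the sketch below imitates byte-wise matching with Python's byte-mode `re`. This is an assumption about the cause, not a description of notmuch internals, and the helper `bytewise_match` is hypothetical.

```python
import re

# Byte-wise view of [U+FF11-U+FF19]: the set holds single bytes, including
# the accidental range 0x91-0xEF formed across the hyphen.
BYTE_SET = b"[" + "\uff11-\uff19".encode("utf-8") + b"]"
TAIL = "\u4eba".encode("utf-8")  # the trailing character U+4EBA


def bytewise_match(repeat: int, subject: str) -> bool:
    """Imitate ^[set]{repeat}<U+4EBA> applied to the UTF-8 bytes of subject."""
    count = str(repeat).encode("ascii")
    pattern = b"^" + BYTE_SET + b"{" + count + b"}" + TAIL
    return re.search(pattern, subject.encode("utf-8")) is not None


subject_a = "\uff11\u4eba"        # (A): one full-width digit, then U+4EBA
subject_b = "\uff11\uff12\u4eba"  # (B): two full-width digits, then U+4EBA

for n in (1, 2, 3, 6):
    print(n, bytewise_match(n, subject_a), bytewise_match(n, subject_b))
```

This simple model agrees with the list above for {1}, {2}, {3} and {6}. It does not reproduce the reported matches for {4}, {5}, {7} and {8}, so byte-wise matching alone may not be the whole story.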





As noted in the notmuch-search-terms manpage, I made sure to wrap the regular expression in double quotes and the entire query in single quotes.  I also tried increasing and decreasing $XAPIAN_CJK_NGRAM and rebuilding the index, but the results did not change.
