From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from localhost (localhost [127.0.0.1])
 by arlo.cworth.org (Postfix) with ESMTP id 1D21D6DE0F48
 for ; Tue, 12 Mar 2019 00:34:34 -0700 (PDT)
X-Virus-Scanned: Debian amavisd-new at cworth.org
X-Spam-Flag: NO
X-Spam-Score: -0.582
X-Spam-Level:
X-Spam-Status: No, score=-0.582 tagged_above=-999 required=5
 tests=[AWL=0.108, FREEMAIL_FROM=0.001, RCVD_IN_DNSWL_LOW=-0.7,
 SPF_PASS=-0.001, T_FILL_THIS_FORM_SHORT=0.01] autolearn=disabled
Received: from arlo.cworth.org ([127.0.0.1])
 by localhost (arlo.cworth.org [127.0.0.1]) (amavisd-new, port 10024)
 with ESMTP id rLANC2OUMdJY for ; Tue, 12 Mar 2019 00:34:32 -0700 (PDT)
Received: from mout.gmx.net (mout.gmx.net [212.227.15.15])
 by arlo.cworth.org (Postfix) with ESMTPS id 3F1286DE0C6D
 for ; Tue, 12 Mar 2019 00:34:31 -0700 (PDT)
Received: from len.workgroup ([84.185.98.59]) by mail.gmx.com (mrgmx002
 [212.227.17.190]) with ESMTPSA (Nemesis) id 0MWSwU-1hZz4L3BMI-00XXzw for ;
 Tue, 12 Mar 2019 08:34:27 +0100
From: Gregor Zattler
To: notmuch@notmuchmail.org
Subject: Re: how to search for hyphenated words? (was: how to search for Morse code?)
In-Reply-To: <87wol4dhe7.fsf@tethera.net>
References: <87muui87om.fsf@len.workgroup> <87ef7hyxqs.fsf@len.workgroup>
 <87a7i4c3t5.fsf@wondoo.home.cworth.org> <87wol4dhe7.fsf@tethera.net>
Mail-Followup-To: notmuch@notmuchmail.org
Date: Tue, 12 Mar 2019 08:34:19 +0100
Message-ID: <87a7i0v950.fsf@len.workgroup>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="=-=-="
X-Provags-ID: V03:K1:RTPpRKczjZYbzvPLCgqIN9LhDxokfk2nXCPFf37uUWGE+cdxZ3O
 x/ZYg5r8jHtll7Mj+UUrGBSc93CDFTXKA3OnhUiZTccTj4K6cSGmbZ+0t4q543l7AP2x8GI
 9pt1MlgNie53NvV3Yu88CHmx8Z/0MjwdZngpnH96UMuZW9mU1NZgGwKe1OgK2LEGJtC0Z/y
 RjwdQQb7ZpboxBaPiFPCg==
X-UI-Out-Filterresults: notjunk:1;V03:K0:OfcHFhgB150=:BP5u+qGjE74keeR65uqwKq
 0sJmsPjANB23iailXdakUaaaPKTKKQJmZbg5K/nxR5keX1uK4pePYvnWlG8/kMyEYDYXjaYFH
 sFUhKxiEKaemT4LISjk06EDECwjmMVHCscX+/5iUFLwqke3YHTMGJ0SWqkBDUrNKo3FF4UeBC
 TgO9r8QVE7J7DgLwaZooIofIYTZolSdva1Hbew5IWambUL3h6dC5+k056Izi3HiYiQcy5nz0k
 qdiZkrPrrC6GZOPi+SqaS2LGENWT7CWvK9Jr5YocoYs7sg1ty7jnTgUvDGqow52cnA9uqQsuM
 IWjUUy9gVc5Bb8h88RgxM2vwD9POkWGInC2hpQLg3FlUhU+PQav6iZc0g18xZh3N/Ei7egvnv
 qfetNqrymZVekJ5HiIprvlCNAcNFWmeGR37sdetyx8nOf9ny2qwW/DeSK1FCiTo+XOVetE/YK
 ZcnJeuSg5zPr6Rp2B2Cn6/yP68vKXHTDdmgVgz16/5gAXK8gyraNjVpwzQjxYmuivWEOf3Bik
 EpJGK5wPuS+3kGRGCgwU73QHPEphShW2kAltJGEJ1haxYwP+zLT95xLelQfcojF0ttDSPfkE0
 BWXk3I2K+Ib07B0a3il5rs9R5Sw22M8zigX1H8sA2LK7G+kU2Thy8nefQLW1EyMYNVKj1CLyi
 Tul3oC01iK0/6OgPnwjA2MElD9RB4//YDr9dkixFJxq5/rPyACcwgOX1qPxwAFfFRK6AtINo/
 3Uym8STtSXO9X+fZLY1aqgC+wtcSThkKXTUnEsCpjTlVaIsd0Ub3CdaOSBq1Zo1Lk4rM2/Oxe
 LPardqsqb86bhLIyF7gH5haxvKIrWa/lsAc7Hvjsdqkz/KiYcF9TVBkBXDHIO8npXBmS2J9mD
 15x/BsRRgaZhS+vVgABseuHcRQ9zAYX2kZCLHwma72WWPnVFuDpDa7piiK2Klq
X-BeenThere: notmuch@notmuchmail.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: "Use and development of the notmuch mail system."
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Tue, 12 Mar 2019 07:34:34 -0000

--=-=-=
Content-Type: text/plain

Hi David, Matt, Carl, notmuch developers,

* David Bremner [2019-03-11; 22:13]:
> Matt Armstrong writes:
>> Carl Worth writes:
>>> The trick here is that when notmuch is indexing body text it feeds it
>>> into a Xapian function that parses the text by finding "terms" in the
>>> text. And this parser considers both punctuation and whitespace as
>>> separators between terms.
>>
>> I notice that Xapian supports something called "phrase searches",
>> documented as:
>>
>>   "A phrase surrounded with double quotes ("") matches documents
>>   containing that exact phrase. Hyphenated words are also treated as
>>   phrases, as are cases such as filenames and email addresses
>>   (e.g. /etc/passwd or president@whitehouse.gov)."
>>
>> I assume that this particular Xapian feature is unavailable in notmuch?
>> If so, I wonder if enabling has ever been considered?
>
> It is enabled, and documented in notmuch-search-terms(7). Unfortunately
> I don't think it's related to the original request. The mention of
> hyphenated words is about the input to the query parser, not the
> (necessarily) the retrieved text.

What I do not understand is that it doesn't matter whether I search for
org-notmuch, "org-notmuch", '"org-notmuch"', or even org ADJ/1 notmuch:

$ notmuch count --output=messages '"org-notmuch"'
581
$ notmuch count --output=messages 'org-notmuch'
581
$ notmuch count --output=messages org-notmuch
581
$ notmuch count --output=messages org ADJ/1 notmuch
581

A typical example of a matched message is the attached one. Somehow the
search matches the address of this very mailing list in the body of the
email (I assume).
But obviously there are many more emails with this address in them:

$ notmuch count --output=messages 'notmuch@notmuchmail.org'
27396
$ notmuch count --output=messages '"notmuch@notmuchmail.org"'
27396

Or, with a naive search (no decoding of possibly base64-encoded parts),
there are

$ find /home/grfz/Mail/~ml/emacs-orgmode@gnu.org /home/grfz/Mail/~ml/notmuch@notmuchmail.org* -type f -print0 |
    xargs -0r grep -l -- 'notmuch@notmuchmail.org' |
    xargs -IXXXX sh -c "cat XXXX | sed -e '1,/^$/ d' | grep -c notmuch@notmuchmail.org " |
    egrep -c "1|2|3|4|5|6|7|8|9"
16795

emails with the address at least once in the body. Therefore I wonder
why notmuch matches 581 messages.

A naive search for org-notmuch on the files (no decoding of possibly
base64-encoded parts) only shows 79 files (77 unique emails):

mkdir -vp /tmp/test/{cur,new,tmp}
$ find /home/grfz/Mail/~ml/emacs-orgmode@gnu.org /home/grfz/Mail/~ml/notmuch@notmuchmail.org* -type f -print0 |
    xargs -0r grep -l -- 'org-notmuch' |
    xargs ln -vs --target-directory=/tmp/kolp/cur/ |
    wc -l
79

Therefore I wonder why notmuch matches 581 messages, not 16795 messages
or 77 messages. Somehow these numbers do not fit!?

Ciao; Gregor
-- 
 -... --- .-. . -.. ..--..
...-.-

--=-=-=
Content-Type: message/rfc822
Content-Disposition: attachment; filename="1514563210.28210_1.len:2,S"

Return-path:
X-Spam-Checker-Version: SpamAssassin 3.4.1 (2015-04-28) on len.workgroup
X-Spam-grfz-Status: No, hits=-1.5 required=2.0 bayes=0.0000
 tests=BAYES_00=-3.599,NO_RELAYS=-0.001,TO_MALFORMED=2.099
 autolearn=no autolearn_force=no version=3.4.1
 date="Fri, 29 Dec 2017 17:00:10 +0100" languages=
Envelope-to: root@localhost
Delivery-date: Fri, 29 Dec 2017 17:00:09 +0100
Received: from grfz by len.workgroup with local (Exim 4.89)
 (envelope-from ) id 1eUx4r-0007Kx-5w
 for root@localhost; Fri, 29 Dec 2017 17:00:09 +0100
From: root@len.workgroup (Cron Daemon)
To: root@localhost
Subject: Cron ~/bin/mailwiederdurchschleusen
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
X-Cron-Env:
X-Cron-Env:
X-Cron-Env:
X-Cron-Env:
X-Cron-Env:
Message-Id:
Date: Fri, 29 Dec 2017 17:00:09 +0100

Date: Thu, 28 Dec 2017 21:04:52 -0500
From: Maxim Cournoyer
To: help-gnu-emacs@gnu.org
Subject: Re: Gnus and emails sent by me
----------------------------------------------------------
Date: Thu, 28 Dec 2017 22:00:56 -0400
From: David Bremner
To: David Edmondson , notmuch@notmuchmail.org
Subject: Re: Xapian exception leading to database corruption
----------------------------------------------------------

--=-=-=--
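
The quoted explanation above says the indexer treats both punctuation
and whitespace as separators between terms. The sketch below is only a
rough illustration of that idea, not Xapian's or notmuch's actual
parser. The helpers terms and phrase_match and the sample address
someone@example.org are hypothetical, invented for this example. Under
that simplified model, org-notmuch becomes the two adjacent terms "org"
and "notmuch", so it could match any text where an address ending in
.org is directly followed by one starting with notmuch. Whether this
accounts for the 581 matches is not established by the thread.

```shell
#!/bin/sh
# Simplified model (assumption): a term is a run of alphanumerics,
# lowercased; every other character is a separator.

# Hypothetical helper: print the terms of $1 separated by single spaces.
terms() {
    printf '%s\n' "$1" |
        tr '[:upper:]' '[:lower:]' |
        tr -cs '[:alnum:]' ' ' |
        sed -e 's/^ *//' -e 's/ *$//'
}

# Hypothetical helper: succeed if the terms of $1 occur as consecutive
# terms in $2, which is a crude stand-in for a phrase search.
phrase_match() {
    case " $(terms "$2") " in
        *" $(terms "$1") "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Try the query against a few made-up body lines.
for body in \
    'To: someone@example.org, notmuch@notmuchmail.org' \
    'Cc: notmuch@notmuchmail.org' \
    'see org-notmuch.el'
do
    if phrase_match 'org-notmuch' "$body"; then
        echo "match:    $body"
    else
        echo "no match: $body"
    fi
done
```

In this model the list address on its own does not match, because its
terms are notmuch, notmuchmail, org, in that order. That would be
consistent with the 581 count being far smaller than the count for the
address itself.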