From: Panayotis Manganaris <panos.manganaris@gmail.com>
To: frederik@ofb.net
Cc: notmuch@notmuchmail.org
Subject: Re: searching for a message by path
Date: Sat, 21 Sep 2024 12:24:14 -0400
Message-ID: <87wmj52cwh.fsf@ASCALON.mail-host-address-is-not-set>
In-Reply-To: <20240921032340.opozeclfbyqzw2yt@localhost>
Frederick Eaton <frederik@ofb.net> writes:
>
> Suppose the filter script reads a message from a particular file and decides that it is
> spam. How does the filter tell Notmuch that the message corresponding to that file is spam?
> You seem to be saying below that the filter script should extract the Message-ID and use it
> to identify the message to Notmuch, since file paths of the messages are not
> indexed. Probably what my script should be doing for each message is appending a line to a
> batch file like this:
>
> +spam -new -- id:some_message_id@foo
> +inbox -new -- id:some_other@baz
>
> and then passing the batch file to "notmuch tag"?
>
Hello Frederick, you are exactly correct. This is what I've written to handle spam filtering in
my notmuch post-new hook. Like you, I have notmuch configured to assign the tag "new" to newly
fetched mail:
notmuch search --output=messages 'tag:new' > /tmp/msgs
notmuch search --output=files 'tag:new' |\
bogofilter -o0.7,0.7 -bt |\
paste - /tmp/msgs |\
awk '$1 ~ /S/ { print "-new +spam", "--", $3 }' |\
notmuch tag --batch
This should run under any POSIX shell. My chosen filter is bogofilter. The -bt flags tell it to
operate on a stdin "batch" of file paths and return a "terse" summary of results, e.g.
H 0.248913
S 0.999999
This script assumes that the two notmuch queries return their results in the same order,
which is fortunately true.
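The pairing step above could also be wrapped in a small function that refuses to run when the
two lists differ in length. This is only a sketch: the name spam_batch_lines and its two file
arguments are made up for illustration. The length check is there because, as far as I know,
--output=files can print more than one path for a message that has duplicate files, which would
shift the pairing.

```shell
#!/bin/sh
# Hypothetical helper (not part of notmuch or bogofilter).
# $1: file with bogofilter's terse verdicts, one per line, e.g. "S 0.999999"
# $2: file with message ids, one per line, e.g. "id:some_message_id@foo"
# Prints one "notmuch tag --batch" line per message classified as spam.
spam_batch_lines() {
    verdicts=$1
    ids=$2
    # Unquoted on purpose: some wc implementations pad the count with spaces.
    if [ $(wc -l < "$verdicts") -ne $(wc -l < "$ids") ]; then
        echo "verdict and id lists differ in length; not pairing them" >&2
        return 1
    fi
    # paste joins line N of both files; fields are: verdict, score, id.
    paste "$verdicts" "$ids" |
        awk '$1 == "S" { print "-new +spam --", $3 }'
}
```

In the hook, the bogofilter output would be saved to a file first, and the function's output
piped into "notmuch tag --batch".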
>>>I've tentatively concluded that the best way to locate each message in the Notmuch database
>>>is to extract the Message-ID and search for it with "id:"? But the FAQ says that multiple
>>>messages can have the same Message-ID (and some spam messages don't have one at all).
Your instinct to use batch tagging and id: queries is correct. I collect my new message ids in
/tmp/msgs. These ids are definitely unique enough to be used to tag individual messages on a
daily basis. If you prefer to tag entire threads as spam the moment a single message is spam,
you can simply use
notmuch search --output=threads 'tag:new' > /tmp/msgs
I prefer to manually mute threads with a mute tag, but thread ids are definitely unique.
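One caveat with the thread variant: as far as I can tell, --output=threads prints each matching
thread once, so the list can be shorter than the list of files when several new messages share
a thread. A sketch that avoids relying on the pairing is to pick out the spam message ids first
and look up each one's thread afterwards. The function name spam_thread_batch_lines is
hypothetical, and its input is assumed to be the output of the "paste" step (verdict, score,
message id on each line).

```shell
#!/bin/sh
# Hypothetical helper (not part of notmuch).
# stdin:  lines such as "S 0.999999<TAB>id:some_message_id@foo"
# stdout: one "notmuch tag --batch" line per distinct thread containing spam
spam_thread_batch_lines() {
    awk '$1 == "S" { print $3 }' |
    while IFS= read -r id; do
        # Look up the thread of this one message; stdin is redirected so the
        # lookup cannot consume the id list being read by the loop.
        notmuch search --output=threads --exclude=false "$id" < /dev/null
    done |
    sort -u |
    sed 's/^/-new +spam -- /'
}
```

The result can be piped into "notmuch tag --batch" just like the per-message version.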
If you want to auto-tag spam in an existing archive, then you will first need to manually tag a
good quantity of messages (100-1000) you consider to be spam and a good quantity of messages
(100-1000) you consider to be ham, and use them to train the filter, e.g.
notmuch search --output=files 'tag:spam' | bogofilter -bs
notmuch search --output=files 'tag:inbox' | bogofilter -bn
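The training commands above could be guarded so that they only run once enough mail has been
tagged by hand. This is a rough sketch: train_if_enough and its default minimum of 100 are my
own placeholders, and "notmuch count" is used to get the number of messages carrying the tag.

```shell
#!/bin/sh
# Hypothetical helper (not part of notmuch or bogofilter).
# $1: notmuch tag to train from, e.g. "spam" or "inbox"
# $2: bogofilter flags, "-bs" (register as spam) or "-bn" (register as ham)
# $3: minimum number of tagged messages required (assumed default: 100)
train_if_enough() {
    tag=$1
    flag=$2
    min=${3:-100}
    n=$(notmuch count "tag:$tag") || return 1
    if [ "$n" -lt "$min" ]; then
        echo "only $n messages tagged $tag; want at least $min" >&2
        return 1
    fi
    # Feed the file paths of all tagged messages to bogofilter in batch mode.
    notmuch search --output=files "tag:$tag" | bogofilter "$flag"
}
```

For example, "train_if_enough spam -bs" followed by "train_if_enough inbox -bn".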
>>>If I could access the message using the filename that the script is processing, it would
>>>seem slightly more reliable. It seems like there should be some way to allow a Notmuch
>>>database entry to be accessed directly by filename, without even creating a Notmuch-style
>>>search query containing that filename, but rather by passing the filename as a command-line
>>>argument to "notmuch". It would be nice not to have to worry about quoting and unquoting.
>>
>>I am not sure if this is useful, given that (presumably) Notmuch uses message IDs as
>>keys. Besides, those filenames are usually generated automatically and quite cryptic.
>
> It might be useful for the reasons I stated, namely in case the Message-ID does not exist or
> is not unique.
I think mail that is successfully transmitted through a mail host necessarily obtains a message
id, but I might be wrong. I believe notmuch indexes on both its own unique thread ids and the
message ids, thereby further decreasing the already minuscule chance of message id collisions.
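If you do want to go from a file to an id: query yourself, a sketch along these lines would
also show which files lack a Message-ID header. The name msgid_query is hypothetical, the
function only handles a header written on a single line (no folding), and ids containing
unusual characters would still need quoting before being used in a query. For files without the
header, I believe notmuch generates an id of its own, so those would need a different lookup.

```shell
#!/bin/sh
# Hypothetical helper (not part of notmuch).
# $1: path to a mail file
# Prints "id:<message-id>" for the first Message-ID header found, or nothing.
msgid_query() {
    # Stop at the first blank line so that only the header block is examined,
    # then rewrite "Message-ID: <foo@bar>" into "id:foo@bar".
    sed -n \
        -e '/^$/q' \
        -e 's/^[Mm]essage-[Ii][Dd]:[[:space:]]*<\(.*\)>[[:space:]]*$/id:\1/p' \
        "$1" | head -n 1
}
```

An empty result means the header block had no single-line Message-ID.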
--
Best,
Panos