Re: searching for a message by path

unofficial mirror of notmuch@notmuchmail.org

From: Frederick Eaton <frederik@ofb.net>
To: Michael J Gruber <michaeljgruber+grubix+git@gmail.com>
Cc: Pengji Zhang <me@pengjiz.com>,
	notmuch@notmuchmail.org,
	Panayotis Manganaris <panos.manganaris@gmail.com>
Subject: Re: searching for a message by path
Date: Fri, 27 Sep 2024 19:56:30 -0700
Message-ID: <20240928025630.km2tcgjgt3xub4jo@localhost>
In-Reply-To: <CAA19uiTji8k3UARPM=3xDMaqeSY_sFN1DHrtpVEDvnR-mnMV7A@mail.gmail.com> <CAA19uiTKyTPA5mPG4X2E8XwWYVKTo03Yk1ZHrF32=MN8uRc5-A@mail.gmail.com>

Thank you all for your helpful replies. It seems pretty clear that the recommended Notmuch usage for someone who wants to incorporate a script that classifies a batch of messages is one of two things: either write the whole script to use Notmuch from the beginning, with the messages specified as a list of "id:" IDs or even a general Notmuch query; or, if you are using an existing script that accepts a list of files, extract the message ID from each file at the end so that the new tags can be communicated from the script back to Notmuch. Both options seem a little hacky, especially since it is rather common to receive multiple distinct messages with the same ID, for example when someone replies to a mailing list post and Cc's me. For security reasons, I would want these to be separately viewable (they are linked together with an "=" in the Mutt thread view). If Notmuch is meant to function as an abstraction layer over message files stored on the file system, then why doesn't it provide a standard way to go from file paths to Notmuch messages?
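
A minimal sketch of the second option (recovering the ID from each file and handing tags back to Notmuch) might look like the following. It assumes the Message-ID header can be read with Python's standard email parser and that always double-quoting the "id:" term is acceptable input for "notmuch tag --batch". The function names are hypothetical, and the sketch does not help with the duplicate-ID case described above, because every file with the same ID maps to the same Notmuch message.

```python
from email.parser import BytesHeaderParser
from typing import Iterable, Optional


def message_id_from_bytes(raw: bytes) -> Optional[str]:
    """Return the Message-ID header without angle brackets, or None if absent."""
    value = BytesHeaderParser().parsebytes(raw).get("Message-ID")
    if value is None:
        return None
    mid = str(value).strip()
    if mid.startswith("<") and mid.endswith(">"):
        mid = mid[1:-1]
    return mid or None


def encode_tag(tag: str) -> str:
    """Hex-encode (%NN) every byte of a tag that is not plainly safe."""
    safe = b"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_-."
    return "".join(
        chr(b) if b in safe else "%{:02x}".format(b) for b in tag.encode("utf-8")
    )


def batch_tag_line(mid: str, add: Iterable[str] = (), remove: Iterable[str] = ()) -> str:
    """Build one line of input for `notmuch tag --batch`."""
    ops = ["+" + encode_tag(t) for t in add] + ["-" + encode_tag(t) for t in remove]
    # Double quotes inside a quoted boolean term are doubled.
    quoted = mid.replace('"', '""')
    return " ".join(ops) + ' -- id:"' + quoted + '"'
```

An existing file-based filter could read each file with open(path, "rb").read(), call message_id_from_bytes() on the result, and print one batch_tag_line() per message for "notmuch tag --batch" to consume. Files without a Message-ID header return None here and would need separate handling.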

As for why it would be a security issue to ignore new messages with duplicate message IDs, consider that one can apparently play the following game on a Notmuch user. (1) Send a private email that the user will never see because it contains spam keywords. (2) Send a public email to a mailing list with the same ID. The Notmuch user will not see the second email, and everyone will think he is unable to reply to the allegations it presumably contains, and that he is therefore guilty and should be arrested.

My recommendation would be to split the Notmuch project into three teams: one to work on the source code, another on the documentation, and a third on test cases. There should be separate Git repositories for each team, so that I can for example run current test cases against a fork of the source repo, or use recent manual pages with an older version of the source. This way, the documentation team will be able to document deficiencies in various source releases as well as standardize proposed new features or syntax. Or, someone would submit a pull request to the source team, which would then be discussed on the mailing list or in the issue tracker, and someone on another team would then use that discussion to write documentation or test cases before the PR is accepted. The teams would have a "checks and balances" relationship, like with the three branches of government. (I think that all software projects should be run this way, so please don't be offended.)

I wrote some Perl scripts a long time ago, which work together to tag mail and put links to each message in a tag-specific directory for each of its tags. The script would add headers to the message, however, and it rewrote the Message-ID if it wasn't unique. It did not create a full-text index like Notmuch does. It did seem fairly reliable. I am trying to adapt it to send the tags to Notmuch. I am having to use Notmuch because of a third piece of software that depends on it. It is somewhat perplexing to me that no one else has had my use case before.

Best wishes,

Frederick

On Sat, Sep 21, 2024 at 11:38:18AM +0200, Michael J Gruber wrote:
>Am Sa., 21. Sept. 2024 um 05:23 Uhr schrieb Frederick Eaton <frederik@ofb.net>:
>>
>> Thank you for your response, Pengji.
>>
>> On Sat, Sep 21, 2024 at 08:25:10AM +0800, Pengji Zhang wrote:
>> >Hi Frederick,
>> >
>> >Frederick Eaton <frederik@ofb.net> writes:
>> >
>> >>I am trying to figure out how to adapt a script I wrote for
>> >>filtering messages, to apply notmuch tags to each message. A
>> >>difficulty is that the messages are already in the Notmuch database,
>> >>because another tool has delivered them to a maildir and run
>> >>"notmuch new".
>> >>
>> >>Now, Notmuch can provide me with the paths of all the new
>> >>(unfiltered) messages, which I can give to my script. The question I
>> >>have is, once the filter is done, how can the script tell Notmuch
>> >>which message to apply the tags to?
>> >
>> >
>> >I am not sure if I understand you correctly. If the problem here is to
>> >distinguish existing messages and new messages, would the config
>> >option 'new.tags' work? For example, use
>> >
>> >   notmuch config set new.tags new
>> >
>> >to give all new messages a 'new' tag.
>>
>> No, I already have that configuration. The first sentence described what I already know how to do, the second sentence is what I'm trying to do.
>
>It seems that we're still guess-working-out what your script is
>doing/trying to do. Do you mind sharing a trimmed down version?
>
>> It might be useful for the reasons I stated, namely in case the Message-ID does not exist or is not unique.
>
>This is probably at the heart of the problem. Within notmuch, a
>"message" is something identified by a message-id (mid), and all
>information in the notmuch database is tied to a mid.
>
>When you speak about a message, you probably mean the content of an
>individual "message file" - which is a natural, but different notion.
>A "path:" refers to a message file, a "mid:" to message id.
>
>When "notmuch new" encounters a new message file, it
>- checks if it contains a valid "Message-ID" header
>- uses that as mid or generates a mid using a sha1 checksum of the message file
>- checks whether that mid (!) is in the database already
>- adds the path to the existing db entry, or creates a new db entry
>
>So, you may have several files (path entries) for the same mid, and
>which one is used for indexing purposes depends on the order of
>arrival (or, in the case of reindexing, probably on file system
>ordering). notmuch assumes that this makes no difference - same mid
>same "message". This assumption can break, for example for list
>copies, different headers on sent versus received etc.
>
>I'm elaborating on this because we have to guess about your script -
>what is a "new message" for your script, and which kind of information
>does it want to process?
>
>Typical processing would be done in a notmuch post-hook, and it would:
>- check for new messages (tag:new)
>- get their file paths from `notmuch search --output=files mid:XYZ` or such
>- do whatever it needs using the file if you really need to parse that yourself
>
>I guess most of us have some sort of script running on new messages as
>part of a hook, be it `afew` or something homegrown, and this
>typically clears the new tag afterwards.
>
>Michael
>
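
The ID rule Michael describes (use the Message-ID header if present, otherwise derive an ID from a SHA-1 checksum of the file) can be illustrated roughly as below. This is a sketch only: the "notmuch-sha1-" prefix is an assumption about the form of the generated ID, the header handling is simplified compared with whatever Notmuch really does, and both function names are hypothetical. Grouping paths by the resulting ID shows how several files can end up attached to one database entry.

```python
import hashlib
from collections import defaultdict
from email.parser import BytesHeaderParser
from typing import Dict, List


def notmuch_style_id(raw: bytes) -> str:
    """Approximate the ID a message file would be stored under."""
    value = BytesHeaderParser().parsebytes(raw).get("Message-ID")
    mid = str(value).strip().strip("<>") if value is not None else ""
    if mid:
        return mid
    # Assumed fallback form: a fixed prefix plus the SHA-1 of the whole file.
    return "notmuch-sha1-" + hashlib.sha1(raw).hexdigest()


def group_paths_by_id(files: Dict[str, bytes]) -> Dict[str, List[str]]:
    """Map each ID to every path whose file content resolves to that ID."""
    groups = defaultdict(list)
    for path, raw in files.items():
        groups[notmuch_style_id(raw)].append(path)
    return dict(groups)
```

Any ID that maps to more than one path in the result is a case where the list copy and the direct copy (or a sent and a received version) are treated as one "message".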

On Tue, Sep 24, 2024 at 11:09:26AM +0200, Michael J Gruber wrote:
>Am Sa., 21. Sept. 2024 um 18:24 Uhr schrieb Panayotis Manganaris
><panos.manganaris@gmail.com>:
>...
>> notmuch search --output=messages 'tag:new' > /tmp/msgs
>> notmuch search --output=files 'tag:new' |\
>>     bogofilter -o0.7,0.7 -bt |\
>>     paste - /tmp/msgs |\
>>     awk '$1 ~ /S/ { print "-new +spam", "-", $3 }' |\
>>     notmuch tag --batch
>>
>...
>> This script operates on the assumption that the order of results from notmuch queries is
>> always the same, which is fortunately true.
>
>It also operates under the assumption that you receive no duplicate
>messages with the same message-id (such as list copies,
>sent/reveived), or else `paste` will have a hard time matching lines.
>
>Note that you can loop over the msgs, treat them individually, and
>still collect input for `notmuch tag --batch`, which solves both the
>problem with duplicate messages and potential ordering instability
>while keeping batch efficiency.
>
>> Your instinct to use batch tagging and id: queries is correct. I collect my new message ids in
>> /tmp/msgs. These ids are unique, they are definitely unique enough to be used to tag individual
>> messages on a daily basis.
>
>I'm sorry, but either they're unique or not. What's unique enough? I'm
>pestering on this because part of the OP's problem is being clear
>about the notion of message, which is uniquely identified by a message
>id in the notmuch db. I tried to clear that up in my previous answer
>in this thread.
>
>
>> > It might be useful for the reasons I stated, namely in case the Message-ID does not exist or
>> > is not unique.
>>
>> I think mail that is successfully transmitted through a mail host necessarily obtains a message
>> id, but I might be wrong. I believe notmuch indexes on both its own unique thread ids and the
>> message ids, thereby further decreasing the already minuscule chance of message id collisions.
>
>No. Messages can arrive without mid. In that case, notmuch creates one
>(without altering the message file) and uses it for indexing.
>"Thread-id" is something completely different from message-ids. They
>do not identify a message uniquely (but a thread of messages "joint"
>by references), albeit indirectly (such as "root message of the
>thread", assuming one root).
>
>Cheers
>Michael
>
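
Michael's suggestion to loop over the messages individually while still collecting input for "notmuch tag --batch" could be sketched as follows. The classify callback, the run parameter, and the "spam" policy (tag the message if any of its files is classified as spam) are hypothetical choices made for illustration. The sketch assumes that "notmuch search --output=messages" prints one "id:..." term per line and that file names contain no newlines.

```python
import subprocess
from typing import Callable, List


def run_notmuch(args: List[str]) -> List[str]:
    """Run notmuch with the given arguments and return its non-empty output lines."""
    result = subprocess.run(
        ["notmuch", *args], check=True, capture_output=True, text=True
    )
    return [line for line in result.stdout.splitlines() if line]


def batch_lines_for_new(
    classify: Callable[[str], bool],
    run: Callable[[List[str]], List[str]] = run_notmuch,
) -> List[str]:
    """Return `notmuch tag --batch` lines for every message tagged 'new'."""
    lines = []
    for term in run(["search", "--output=messages", "tag:new"]):
        mid = term[3:] if term.startswith("id:") else term
        # Re-quote the ID so unusual characters cannot break the query.
        query = 'id:"' + mid.replace('"', '""') + '"'
        # One message may have several files (list copy, direct copy, ...).
        paths = run(["search", "--output=files", query])
        is_spam = any(classify(path) for path in paths)
        ops = "-new +spam" if is_spam else "-new"
        lines.append(ops + " -- " + query)
    return lines
```

Because each message is looked up by its own ID, the result does not depend on two separate queries returning rows in the same order, and a message with several files still produces exactly one batch line. The collected lines can then be joined with newlines and piped to "notmuch tag --batch" in a single call.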


Thread overview: 16+ messages
2024-09-20 17:52 searching for a message by path Frederick Eaton
2024-09-21  0:25 ` Pengji Zhang
2024-09-21  3:23   ` Frederick Eaton
2024-09-21  9:01     ` Pengji Zhang
2024-09-21  9:38     ` Michael J Gruber
2024-09-21 10:44     ` Gregor Zattler
2024-09-21 16:24     ` Panayotis Manganaris
2024-09-21 17:30       ` Teemu Likonen
2024-09-23 22:14         ` Panayotis Manganaris
2024-09-24 13:00           ` David Bremner
2024-09-24  9:09       ` Michael J Gruber
2024-09-28  2:56         ` Frederick Eaton [this message]
2024-09-29 12:08           ` David Bremner
2024-10-12 22:59             ` David Bremner
2024-10-14  6:50               ` Michael J Gruber
2024-10-14 10:58                 ` David Bremner
