searching for a message by path

unofficial mirror of notmuch@notmuchmail.org

* searching for a message by path
@ 2024-09-20 17:52 Frederick Eaton
  2024-09-21  0:25 ` Pengji Zhang
  0 siblings, 1 reply; 16+ messages in thread
From: Frederick Eaton @ 2024-09-20 17:52 UTC (permalink / raw)
  To: notmuch

Dear Notmuch,

I am trying to figure out how to adapt a script I wrote for filtering messages, to apply notmuch tags to each message. A difficulty is that the messages are already in the Notmuch database, because another tool has delivered them to a maildir and run "notmuch new".

Now, Notmuch can provide me with the paths of all the new (unfiltered) messages, which I can give to my script. The question I have is, once the filter is done, how can the script tell Notmuch which message to apply the tags to?

I've tentatively concluded that the best way to locate each message in the Notmuch database is to extract the Message-ID and search for it with "id:"? But the FAQ says that multiple messages can have the same Message-ID (and some spam messages don't have one at all).

If I could access the message using the filename that the script is processing, it would seem slightly more reliable. It seems like there should be some way to allow a Notmuch database entry to be accessed directly by filename, without even creating a Notmuch-style search query containing that filename, but rather by passing the filename as a command-line argument to "notmuch". It would be nice not to have to worry about quoting and unquoting.

When I try to search for a message using "path:", nothing seems to work.

     $ notmuch show id:"enwiki.66eda276579298.42329355@en.wikipedia.org" | grep filename
     message{ id:enwiki.66eda276579298.42329355@en.wikipedia.org depth:0 match:1 excluded:0 filename:/home/me/mail/notmuch//gmail-me/mail/cur/1921042a466e0b3b:2,
     $ notmuch search id:"enwiki.66eda276579298.42329355@en.wikipedia.org"
     thread:000000000003a7a9 33 mins. ago [1/1] Wikipedia; Wikipedia page SpaceX Raptor has been changed by Canterbury Tail (important inbox unread updates)
     $ notmuch search path:"/home/me/mail/notmuch//gmail-me/mail/cur/1921042a466e0b3b:2,"
     notmuch search: A Xapian exception occurred
     A Xapian exception occurred parsing query: unmatched regex delimiter in '/home/me/mail/notmuch//gmail-me/mail/cur/1921042a466e0b3b:2,'
     Query string was: path:/home/me/mail/notmuch//gmail-me/mail/cur/1921042a466e0b3b:2,
     [1]$ notmuch search path:"\/home\/me\/mail\/notmuch\/gmail-me\/mail\/cur\/1921042a466e0b3b:2,"
     $ notmuch search path:"\/home\/me\/mail\/notmuch\/\/gmail-me\/mail\/cur\/1921042a466e0b3b:2,"
     $ notmuch search path:"*1921042a466e0b3b*"
     $ notmuch search path:".*1921042a466e0b3b.*"
     $ notmuch search path:"/.*1921042a466e0b3b.*/"
     $ notmuch search path:"notmuch/gmail-me/mail/cur/1921042a466e0b3b:2,"
     $ notmuch search path:"notmuch//gmail-me/mail/cur/1921042a466e0b3b:2,"
     $ notmuch search path:"gmail-me/mail/cur/1921042a466e0b3b:2,"
     $ notmuch search path:"mail/cur/1921042a466e0b3b:2,"
     $ notmuch search path:"cur/1921042a466e0b3b:2,"
     $ notmuch search path:"/\/home\/me\/mail\/notmuch\/gmail-me\/mail\/cur\/1921042a466e0b3b:2,/"

There were no results for any of the "path:" searches, although the "id:" search worked. I am using version 0.32.2 and can update if this may be related to a bug that was fixed in the past few years.

I am not subscribed so please kindly Cc me if you have any ideas.

Thank you,

Frederick

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: searching for a message by path
  2024-09-20 17:52 searching for a message by path Frederick Eaton
@ 2024-09-21  0:25 ` Pengji Zhang
  2024-09-21  3:23   ` Frederick Eaton
  0 siblings, 1 reply; 16+ messages in thread
From: Pengji Zhang @ 2024-09-21  0:25 UTC (permalink / raw)
  To: notmuch; +Cc: Frederick Eaton

Hi Frederick,

Frederick Eaton <frederik@ofb.net> writes:

> I am trying to figure out how to adapt a script I wrote for 
> filtering messages, to apply notmuch tags to each message. A 
> difficulty is that the messages are already in the Notmuch 
> database, because another tool has delivered them to a maildir 
> and run "notmuch new".
>
> Now, Notmuch can provide me with the paths of all the new 
> (unfiltered) messages, which I can give to my script. The 
> question I have is, once the filter is done, how can the script 
> tell Notmuch which message to apply the tags to? 

I am not sure if I understand you correctly. If the problem here 
is to distinguish existing messages and new messages, would the 
config option 'new.tags' work? For example, use

    notmuch config set new.tags new

to give all new messages a 'new' tag.

> I've tentatively concluded that the best way to locate each 
> message in the Notmuch database is to extract the Message-ID and 
> search for it with "id:"? But the FAQ says that multiple 
> messages can have the same Message-ID (and some spam messages 
> don't have one at all). 

IIRC, in the Notmuch database tags are associated with message 
IDs, so you probably do not need to worry about this. 

> If I could access the message using the filename that the script 
> is processing, it would seem slightly more reliable. It seems 
> like there should be some way to allow a Notmuch database entry 
> to be accessed directly by filename, without even creating a 
> Notmuch-style search query containing that filename, but rather 
> by passing the filename as a command-line argument to "notmuch". 
> It would be nice not to have to worry about quoting and 
> unquoting. 

I am not sure if this is useful, given that (presumably) Notmuch 
uses message IDs as keys. Besides, those filenames are usually 
generated automatically and quite cryptic.

> When I try to search for a message using "path:", nothing seems 
> to work.
>
> [...]
>
> There were no results for any of the "path:" searches, although 
> the "id:" search worked. I am using version 0.32.2 and can 
> update if this may be related to a bug that was fixed in the 
> past few years.

I have never used 0.32.2 so I am not sure if there are any 
differences, but for version 0.38.3, the prefix "path:" is used to 
search for messages in some *directory*, and the query should be 
*relative* to the maildir.

I highly recommend the manual page 'notmuch-search-terms(7)' and 
also other pages if you have time. They are informative and well 
written, and very helpful for writing message processing scripts.
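
A minimal sketch of the directory-relative form of "path:" described above. The helper name path_term_for_file is hypothetical, and the quoting of the term is an assumption based on the usual double-quote style for boolean terms. It strips the mail root and the filename from a message file path, so the resulting query selects the whole directory rather than the single file.

```shell
#!/bin/sh
# Hypothetical helper (not part of notmuch): build a path: term for the
# directory holding a message file, relative to the mail root.
#   $1 = mail root, e.g. /home/me/mail/notmuch
#   $2 = absolute path of a message file below that root
path_term_for_file() (
    root=${1%/}                 # drop one trailing slash from the root
    dir=$(dirname -- "$2")      # path: matches directories, so drop the filename
    case $dir in
        "$root"/*) rel=${dir#"$root"/} ;;
        *) echo "not under mail root: $2" >&2; exit 1 ;;
    esac
    # Collapse repeated slashes and remove a leading one, so the result
    # is relative and carries no leading slash.
    rel=$(printf '%s\n' "$rel" | sed -e 's|//*|/|g' -e 's|^/||')
    printf 'path:"%s"\n' "$rel"
)
```

For the file in the original report this prints path:"gmail-me/mail/cur", which could be passed to "notmuch search" as a single argument. It matches every message in that directory, so it narrows the search but does not identify one file.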

Regards,
Pengji


* Re: searching for a message by path
  2024-09-21  0:25 ` Pengji Zhang
@ 2024-09-21  3:23   ` Frederick Eaton
  2024-09-21  9:01     ` Pengji Zhang
                       ` (3 more replies)
  0 siblings, 4 replies; 16+ messages in thread
From: Frederick Eaton @ 2024-09-21  3:23 UTC (permalink / raw)
  To: Pengji Zhang; +Cc: notmuch

Thank you for your response, Pengji.

On Sat, Sep 21, 2024 at 08:25:10AM +0800, Pengji Zhang wrote:
>Hi Frederick,
>
>Frederick Eaton <frederik@ofb.net> writes:
>
>>I am trying to figure out how to adapt a script I wrote for 
>>filtering messages, to apply notmuch tags to each message. A 
>>difficulty is that the messages are already in the Notmuch database, 
>>because another tool has delivered them to a maildir and run 
>>"notmuch new".
>>
>>Now, Notmuch can provide me with the paths of all the new 
>>(unfiltered) messages, which I can give to my script. The question I 
>>have is, once the filter is done, how can the script tell Notmuch 
>>which message to apply the tags to?
>
>
>I am not sure if I understand you correctly. If the problem here is to 
>distinguish existing messages and new messages, would the config 
>option 'new.tags' work? For example, use
>
>   notmuch config set new.tags new
>
>to give all new messages a 'new' tag.

No, I already have that configuration. The first sentence described what I already know how to do, the second sentence is what I'm trying to do.

Suppose the filter script reads a message from a particular file and decides that it is spam. How does the filter tell Notmuch that the message corresponding to that file is spam? You seem to be saying below that the filter script should extract the Message-ID and use it to identify the message to Notmuch, since file paths of the messages are not indexed. Probably what my script should be doing for each message is appending a line to a batch file like this:

     +spam -new -- id:some_message_id@foo
     +inbox -new -- id:some_other@baz

and then passing the batch file to "notmuch tag"?
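
A rough sketch of that idea, assuming the filter script has the message file path at hand. The function names message_id_of and batch_line_for_file are made up for illustration. The id is wrapped in double quotes with inner double quotes doubled, which is my understanding of the boolean-term quoting that the batch format expects; please check notmuch-tag(1) before relying on it.

```shell
#!/bin/sh
# Print the Message-ID of mail file $1 without the angle brackets.
# Only the header block is read; exits non-zero if no Message-ID is found.
message_id_of() {
    awk '
        /^\r?$/ { exit }                         # blank line ends the headers
        grab && /^[ \t]/ { val = val $0; next }  # folded continuation line
        { grab = 0 }
        tolower($0) ~ /^message-id:/ && !seen {
            seen = 1; grab = 1
            val = substr($0, 12)                 # text after "Message-ID:"
        }
        END {
            if (!match(val, /<[^<>]+>/)) exit 1
            print substr(val, RSTART + 1, RLENGTH - 2)
        }
    ' "$1"
}

# Print one batch line applying the tag changes in $1 (e.g. "+spam -new")
# to the message stored in file $2.
batch_line_for_file() (
    mid=$(message_id_of "$2") || exit 1
    quoted=$(printf '%s' "$mid" | sed 's/"/""/g')   # double any inner quotes
    printf '%s -- id:"%s"\n' "$1" "$quoted"
)

# Example use (not run here):
#   batch_line_for_file "+spam -new" "$file" >>batch.txt
#   notmuch tag --batch <batch.txt
```

A file without a Message-ID header makes batch_line_for_file fail, so the caller has to decide how to handle that case separately.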

>>I've tentatively concluded that the best way to locate each message 
>>in the Notmuch database is to extract the Message-ID and search for 
>>it with "id:"? But the FAQ says that multiple messages can have the 
>>same Message-ID (and some spam messages don't have one at all).
>
>IIRC, in the Notmuch database tags are associated with message IDs, so 
>you probably do not need to worry about this.

This time, I'm not sure I understand.

>>If I could access the message using the filename that the script is 
>>processing, it would seem slightly more reliable. It seems like 
>>there should be some way to allow a Notmuch database entry to be 
>>accessed directly by filename, without even creating a Notmuch-style 
>>search query containing that filename, but rather by passing the 
>>filename as a command-line argument to "notmuch". It would be nice 
>>not to have to worry about quoting and unquoting.
>
>I am not sure if this is useful, given that (presumably) Notmuch uses 
>message IDs as keys. Besides, those filenames are usually generated 
>automatically and quite cryptic.

It might be useful for the reasons I stated, namely in case the Message-ID does not exist or is not unique.

>>When I try to search for a message using "path:", nothing seems to 
>>work.
>>
>>[...]
>>
>>There were no results for any of the "path:" searches, although the 
>>"id:" search worked. I am using version 0.32.2 and can update if 
>>this may be related to a bug that was fixed in the past few years.
>
>I have never used 0.32.2 so I am not sure if there are any 
>differences, but for version 0.38.3, the prefix "path:" is used to 
>search for messages in some *directory*, and the query should be 
>*relative* to the maildir.
>
>I highly recommend the manual page 'notmuch-search-terms(7)' and also 
>other pages if you have time. They are informative and well written, 
>and very helpful for writing message processing scripts.

Thank you for interpreting that section for me. The manual pages may be informative and well written, but if my opinion matters, then I think that they could be made slightly clearer than they are. For example, explaining directly to the user that there is no index of path names would help clarify what can be done with the software. Also, a short example of using Notmuch in a filter script would be useful in one of the manual pages, particularly illustrating the case where the programmer wants to re-tag a message that is provided as a file or on stdin.

My copy of the notmuch-search-terms manual page says:

        path:<directory-path> or path:<directory-path>/** or path:/<regex>/
               The path: prefix searches for email messages that are in partic-
               ular directories within the mail store. The  directory  must  be
               specified  relative  to  the  top-level maildir (and without the
               leading slash). ...

I see now that this text is only suggesting that Notmuch supports searches for directory names, but on first read it wasn't really clear to me whether "directory-path" means a "path to a directory" or a "file path consisting of directories followed by a filename", particularly as there is no obvious reason for Notmuch not to index filenames. I think "path:<directory>" would be clearer, and saying "The path: prefix matches email messages that are stored in a specified directory on the filesystem, which must be specified relative to the top-level maildir, and here is how to find out what the 'top-level maildir' is when you have for example $HOME/mail/notmuch/ configured as your database path in ~/.notmuch-config ...". Even clearer would be to explain why the "path:" search prefix only accepts directories, point out that it should be called "dir:" instead of "path:", and warn the user that the search will be inefficient because there is no index of filenames.

Thank you,

Frederick


* Re: searching for a message by path
  2024-09-21  3:23   ` Frederick Eaton
@ 2024-09-21  9:01     ` Pengji Zhang
  2024-09-21  9:38     ` Michael J Gruber
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 16+ messages in thread
From: Pengji Zhang @ 2024-09-21  9:01 UTC (permalink / raw)
  To: frederik; +Cc: notmuch

Frederick Eaton <frederik@ofb.net> writes:

> No, I already have that configuration. The first sentence 
> described what I already know how to do, the second sentence is 
> what I'm trying to do. 
> 
> Suppose the filter script reads a message from a particular file 
> and decides that it is spam. How does the filter tell Notmuch 
> that the message corresponding to that file is spam? You seem to 
> be saying below that the filter script should extract the 
> Message-ID and use it to identify the message to Notmuch, since 
> file paths of the messages are not indexed. Probably what my 
> script should be doing for each message is appending a line to a 
> batch file like this: 
> 
>      +spam -new -- id:some_message_id@foo
>      +inbox -new -- id:some_other@baz
> 
> and then passing the batch file to "notmuch tag"?

Yes, message ID is considered unique for each email, according to 
'src/database.cc':

    A mail document is associated with a particular email message. 
    It is stored in one or more files on disk and is uniquely 
    identified by its "id" field (which is generally the message 
    ID).

I remember that if there are multiple emails sharing a message ID, 
then they will share the same set of tags as well. At least that 
is the case if I add a duplicate message using 'notmuch insert'. 
 
> This time, I'm not sure I understand.

As mentioned above, they share the set of tags so I guess there is 
no need to bother.

> It might be useful for the reasons I stated, namely in case the 
> Message-ID does not exist or is not unique.

I suppose it does not help either in that case, because they are 
considered the same in the database. 
 
> Thank you for interpreting that section for me. The manual pages 
> may be informative and well written, but if my opinion matters, 
> then I think that they could be made slightly clearer than they 
> are.
> For example, explaining directly to the user that there is no 
> index of path names would help clarify what can be done with the 
> software.  Also, a short example of using Notmuch in a filter 
> script would be useful in one of the manual pages, particularly 
> illustrating the case where the programmer wants to re-tag a 
> message that is provided as a file or on stdin.
> 
> [...]
> 
> I see now that this text is only suggesting that Notmuch 
> supports searches for directory names, but on first read it 
> wasn't really clear to me whether "directory-path" means a "path 
> to a directory" or a "file path consisting of directories 
> followed by a filename", particularly as there is no obvious 
> reason for Notmuch not to index filenames. I think 
> "path:<directory>" would be clearer, and saying "The path: 
> prefix matches email messages that are stored in a specified 
> directory on the filesystem, which must be specified relative to 
> the top-level maildir, and here is how to find out what the 
> 'top-level maildir' is when you have for example 
> $HOME/mail/notmuch/ configured as your database path in 
> ~/.notmuch-config ...". Even clearer would be to explain why the 
> "path:" search prefix only accepts directories, point out that 
> it should be called "dir:" instead of "path:", and warn the user 
> that the search will be inefficient because there is no index of 
> filenames. 

Sorry that I assumed you had ignored that paragraph in the manual! 
I am not a native speaker so I have no opinions on the wording.

I am happy with the existing search terms, but using a custom 
filter script in query does sound useful, and I would love to see 
such an example if Notmuch already has support for that. Just my 
two cents.

Pengji


* Re: searching for a message by path
  2024-09-21  3:23   ` Frederick Eaton
  2024-09-21  9:01     ` Pengji Zhang
@ 2024-09-21  9:38     ` Michael J Gruber
  2024-09-21 10:44     ` Gregor Zattler
  2024-09-21 16:24     ` Panayotis Manganaris
  3 siblings, 0 replies; 16+ messages in thread
From: Michael J Gruber @ 2024-09-21  9:38 UTC (permalink / raw)
  To: frederik; +Cc: Pengji Zhang, notmuch

Am Sa., 21. Sept. 2024 um 05:23 Uhr schrieb Frederick Eaton <frederik@ofb.net>:
>
> Thank you for your response, Pengji.
>
> On Sat, Sep 21, 2024 at 08:25:10AM +0800, Pengji Zhang wrote:
> >Hi Frederick,
> >
> >Frederick Eaton <frederik@ofb.net> writes:
> >
> >>I am trying to figure out how to adapt a script I wrote for
> >>filtering messages, to apply notmuch tags to each message. A
> >>difficulty is that the messages are already in the Notmuch database,
> >>because another tool has delivered them to a maildir and run
> >>"notmuch new".
> >>
> >>Now, Notmuch can provide me with the paths of all the new
> >>(unfiltered) messages, which I can give to my script. The question I
> >>have is, once the filter is done, how can the script tell Notmuch
> >>which message to apply the tags to?
> >
> >
> >I am not sure if I understand you correctly. If the problem here is to
> >distinguish existing messages and new messages, would the config
> >option 'new.tags' work? For example, use
> >
> >   notmuch config set new.tags new
> >
> >to give all new messages a 'new' tag.
>
> No, I already have that configuration. The first sentence described what I already know how to do, the second sentence is what I'm trying to do.

It seems that we're still guess-working-out what your script is
doing/trying to do. Do you mind sharing a trimmed down version?

> It might be useful for the reasons I stated, namely in case the Message-ID does not exist or is not unique.

This is probably at the heart of the problem. Within notmuch, a
"message" is something identified by a message-id (mid), and all
information in the notmuch database is tied to a mid.

When you speak about a message, you probably mean the content of an
individual "message file" - which is a natural, but different notion.
A "path:" refers to a message file, a "mid:" to message id.

When "notmuch new" encounters a new message file, it
- checks if it contains a valid "Message-ID" header
- uses that as mid or generates a mid using a sha1 checksum of the message file
- checks whether that mid (!) is in the database already
- adds the path to the existing db entry, or creates a new db entry

So, you may have several files (path entries) for the same mid, and
which one is used for indexing purposes depends on the order of
arrival (or, in the case of reindexing, probably on file system
ordering). notmuch assumes that this makes no difference - same mid
same "message". This assumption can break, for example for list
copies, different headers on sent versus received etc.

I'm elaborating on this because we have to guess about your script -
what is a "new message" for your script, and which kind of information
does it want to process?

Typical processing would be done in a notmuch post-hook, and it would:
- check for new messages (tag:new)
- get their file paths from `notmuch search --output=files mid:XYZ` or such
- do whatever it needs using the file if you really need to parse that yourself

I guess most of us have some sort of script running on new messages as
part of a hook, be it `afew` or something homegrown, and this
typically clears the new tag afterwards.
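
One way to sketch the lookup step above: build an explicit table that pairs each new message id with every file notmuch has recorded for it, so a script that only knows a file path can find the id to tag. The function names id_file_table and id_for_file are hypothetical, and the sketch assumes "notmuch search --output=messages" prints one "id:..." term per line.

```shell
#!/bin/sh
# Print "id-term<TAB>file" for every file of every message matching the
# query $1 (default: tag:new). A message with several files yields
# several lines that share the same id term.
id_file_table() {
    notmuch search --output=messages "${1:-tag:new}" |
    while IFS= read -r msgid; do
        notmuch search --output=files "$msgid" </dev/null |
        while IFS= read -r file; do
            printf '%s\t%s\n' "$msgid" "$file"
        done
    done
}

# Read such a table on stdin and print the id term recorded for file $1.
# Exits non-zero if the file is not listed.
id_for_file() {
    FILE=$1 awk -F '\t' '
        $2 == ENVIRON["FILE"] { print $1; found = 1; exit }
        END { exit !found }
    '
}

# Example use (not run here):
#   id_file_table tag:new >/tmp/table
#   id_for_file "$file" </tmp/table
```

This runs one extra notmuch invocation per message, so it trades speed for not depending on two separate searches returning results in the same order.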

Michael


* Re: searching for a message by path
  2024-09-21  3:23   ` Frederick Eaton
  2024-09-21  9:01     ` Pengji Zhang
  2024-09-21  9:38     ` Michael J Gruber
@ 2024-09-21 10:44     ` Gregor Zattler
  2024-09-21 16:24     ` Panayotis Manganaris
  3 siblings, 0 replies; 16+ messages in thread
From: Gregor Zattler @ 2024-09-21 10:44 UTC (permalink / raw)
  To: frederik, Pengji Zhang; +Cc: notmuch

Hi Frederick,
* Frederick Eaton <frederik@ofb.net> [2024-09-20; 20:23 -07]:
> On Sat, Sep 21, 2024 at 08:25:10AM +0800, Pengji Zhang wrote:
>>Frederick Eaton <frederik@ofb.net> writes:
>>>I am trying to figure out how to adapt a script I wrote for
>>>filtering messages, to apply notmuch tags to each message. A
>>>difficulty is that the messages are already in the Notmuch database,
>>>because another tool has delivered them to a maildir and run
>>>"notmuch new".
>>>
>>>Now, Notmuch can provide me with the paths of all the new
>>>(unfiltered) messages, which I can give to my script. The question I
>>>have is, once the filter is done, how can the script tell Notmuch
>>>which message to apply the tags to?
[...]
> Suppose the filter script reads a message from a particular file and decides that it is spam. How does the filter tell Notmuch that the message corresponding to that file is spam? You seem to be saying below that the filter script should extract the Message-ID and use it to identify the message to Notmuch, since file paths of the messages are not indexed. Probably what my script should be doing for each message is appending a line to a batch file like this:
>
>      +spam -new -- id:some_message_id@foo
>      +inbox -new -- id:some_other@baz
>
> and then passing the batch file to "notmuch tag"?

Yes.  I assume your script somehow loops
over directories and picks up one file
at a time investigating it and then
decides if it is spam.  Then what you
wrote above is the way to go.

The possibility that the Message-ID of
two different emails might be identical,
is real, but that's luckily mostly the
case for spam messages in my experience.

But there is a safety hatch:  While
notmuch cannot search by file names, it
is able to output file names, like so:

notmuch search --output=files -- id:"100822114556.GC10314@example.com/"

If you want to be extra cautious you
could start with a file, get the
Message-ID from it, decide upon the
email in the file, then search for file
names which notmuch associates to the
Message-ID and in case of several file
names, check them in turn...  In case of
the spam/ham decision, you then would
have to decide what to do, if some files
containing emails with the same
Message-ID are spam and others are ham,
since these are usually considered to be
mutually exclusive.
But I would be astonished if you would
find such a case in a real file corpus
of emails.
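
The extra-cautious check described above could look roughly like this. The function name verdict_for_files is invented, and the classifier is whatever command is passed as arguments; the sketch assumes it exits 0 for spam and treats any other exit status as ham, which is a simplification.

```shell
#!/bin/sh
# Read file paths on stdin, one per line, and run the classifier command
# given as arguments on each (the path is appended as the last argument).
# Prints "spam" or "ham" if all files agree, "mixed" if they disagree,
# and "none" if no paths were read.
verdict_for_files() (
    spam=0
    ham=0
    while IFS= read -r file; do
        # </dev/null keeps the classifier from eating the path list.
        if "$@" "$file" </dev/null; then
            spam=$((spam + 1))
        else
            ham=$((ham + 1))
        fi
    done
    if [ "$spam" -gt 0 ] && [ "$ham" -gt 0 ]; then
        echo mixed
    elif [ "$spam" -gt 0 ]; then
        echo spam
    elif [ "$ham" -gt 0 ]; then
        echo ham
    else
        echo none
    fi
)

# Example use (not run here; my_filter is a placeholder):
#   notmuch search --output=files -- "$msgid" | verdict_for_files my_filter
```

A "mixed" result is the case where the script has to choose a policy, since all files sharing a Message-ID also share one set of tags.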

HTH, Gregor


* Re: searching for a message by path
  2024-09-21  3:23   ` Frederick Eaton
                       ` (2 preceding siblings ...)
  2024-09-21 10:44     ` Gregor Zattler
@ 2024-09-21 16:24     ` Panayotis Manganaris
  2024-09-21 17:30       ` Teemu Likonen
  2024-09-24  9:09       ` Michael J Gruber
  3 siblings, 2 replies; 16+ messages in thread
From: Panayotis Manganaris @ 2024-09-21 16:24 UTC (permalink / raw)
  To: frederik; +Cc: notmuch

Frederick Eaton <frederik@ofb.net> writes:

>
> Suppose the filter script reads a message from a particular file and decides that it is
> spam. How does the filter tell Notmuch that the message corresponding to that file is spam?
> You seem to be saying below that the filter script should extract the Message-ID and use it
> to identify the message to Notmuch, since file paths of the messages are not
> indexed. Probably what my script should be doing for each message is appending a line to a
> batch file like this:
>
>      +spam -new -- id:some_message_id@foo
>      +inbox -new -- id:some_other@baz
>
> and then passing the batch file to "notmuch tag"?
>

Hello Frederick, you are exactly correct. This is what I've written to handle spam filtering in
my notmuch post-new hook. Like you, I have notmuch configured to assign newly fetched mail with
tag "new"

notmuch search --output=messages 'tag:new' > /tmp/msgs
notmuch search --output=files 'tag:new' |\
    bogofilter -o0.7,0.7 -bt |\
    paste - /tmp/msgs |\
    awk '$1 ~ /S/ { print "-new +spam", "-", $3 }' |\
    notmuch tag --batch

This should run under any shell. My chosen filter is bogofilter. The -bt flags tell it to
operate on a stdin "batch" of file paths and return a "terse" summary of results e.g.

H 0.248913
S 0.999999

This script operates on the assumption that the order of results from notmuch queries is
always the same, which is fortunately true.

>>>I've tentatively concluded that the best way to locate each message in the Notmuch database
>>>is to extract the Message-ID and search for it with "id:"? But the FAQ says that multiple
>>>messages can have the same Message-ID (and some spam messages don't have one at all).

Your instinct to use batch tagging and id: queries is correct. I collect my new message ids in
/tmp/msgs. These ids are unique, they are definitely unique enough to be used to tag individual
messages on a daily basis. If you prefer to tag entire threads as spam the moment a single
message is spam, you can simply use

notmuch search --output=threads 'tag:new' > /tmp/msgs

I prefer to manually mute threads with a mute tag, but Thread ids are definitely unique.

If you want to auto-tag spam in an existing archive, then you will need to first manually tag a
good quantity of messages (100-1000) you consider to be spam and a good quantity of messages
(100-1000) you consider to be ham and use them to train the filter e.g.

notmuch search --output=files 'tag:spam' | bogofilter -bs
notmuch search --output=files 'tag:inbox' | bogofilter -bn

>>>If I could access the message using the filename that the script is processing, it would
>>>seem slightly more reliable. It seems like there should be some way to allow a Notmuch
>>>database entry to be accessed directly by filename, without even creating a Notmuch-style
>>>search query containing that filename, but rather by passing the filename as a command-line
>>>argument to "notmuch". It would be nice not to have to worry about quoting and unquoting.
>>
>>I am not sure if this is useful, given that (presumably) Notmuch uses message IDs as
>>keys. Besides, those filenames are usually generated automatically and quite cryptic.
>
> It might be useful for the reasons I stated, namely in case the Message-ID does not exist or
> is not unique.

I think mail that is successfully transmitted through a mail host necessarily obtains a message
id, but I might be wrong. I believe notmuch indexes on both its own unique thread ids and the
message ids, thereby further decreasing the already minuscule chance of message id collisions.

--
Best,
Panos


* Re: searching for a message by path
  2024-09-21 16:24     ` Panayotis Manganaris
@ 2024-09-21 17:30       ` Teemu Likonen
  2024-09-23 22:14         ` Panayotis Manganaris
  2024-09-24  9:09       ` Michael J Gruber
  1 sibling, 1 reply; 16+ messages in thread
From: Teemu Likonen @ 2024-09-21 17:30 UTC (permalink / raw)
  To: Panayotis Manganaris, frederik; +Cc: notmuch



* 2024-09-21 12:24:14-0400, Panayotis Manganaris wrote:

> Like you, I have notmuch configured to assign newly fetched mail with
> tag "new"
>
> notmuch search --output=messages 'tag:new' > /tmp/msgs
> notmuch search --output=files 'tag:new' |\
>     bogofilter -o0.7,0.7 -bt |\
>     paste - /tmp/msgs |\
>     awk '$1 ~ /S/ { print "-new +spam", "-", $3 }' |\
>     notmuch tag --batch

I think that is unnecessarily complex. How about simply using
Bogofilter's output value:


    for msgid in $(notmuch search --output=messages tag:new); do
            notmuch show --exclude=false --format=raw "$msgid" | bogofilter
            case $? in
            0) # Spam
                    notmuch tag +spam "$msgid"
                    # Add some logic and conditionally run:
                    #notmuch tag +learn-spam "$msgid"
                    ;;

            1) # Ham
                    # Add some logic and conditionally run:
                    #notmuch tag +learn-ham "$msgid"
                    ;;

            2) notmuch tag +unsure "$msgid" ;;
            *) exit $? ;;
            esac
    done


Then later process tags "learn-spam" and "learn-ham" by piping such
messages to "bogofilter -s" or "-n".
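
The later training pass might be sketched as follows. The function name train_from_tag is hypothetical, and passing --exclude=false to "notmuch search" is an assumption on my part, added so that messages already tagged spam are still found if spam is among the excluded tags.

```shell
#!/bin/sh
# Feed every message carrying tag $1 to "bogofilter $2" (-s registers
# spam, -n registers non-spam) and remove the tag once bogofilter has
# exited successfully for that message.
train_from_tag() {
    notmuch search --exclude=false --output=messages "tag:$1" |
    while IFS= read -r msgid; do
        notmuch show --exclude=false --format=raw "$msgid" </dev/null |
            bogofilter "$2" &&
            notmuch tag "-$1" -- "$msgid" </dev/null
    done
}

# Example use (not run here):
#   train_from_tag learn-spam -s
#   train_from_tag learn-ham -n
```

Dropping the tag only after a successful run means a failed training attempt is retried on the next pass.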

-- 
/// Teemu Likonen - .-.. https://www.iki.fi/tlikonen/
// OpenPGP: 6965F03973F0D4CA22B9410F0F2CAE0E07608462



* Re: searching for a message by path
  2024-09-21 17:30       ` Teemu Likonen
@ 2024-09-23 22:14         ` Panayotis Manganaris
  2024-09-24 13:00           ` David Bremner
  0 siblings, 1 reply; 16+ messages in thread
From: Panayotis Manganaris @ 2024-09-23 22:14 UTC (permalink / raw)
  To: Teemu Likonen, frederik; +Cc: notmuch

Teemu Likonen <tlikonen@iki.fi> writes:

>
> How about simply using Bogofilter's output value
>

To each their own, of course. I just prefer "wholemeal" programming, as Geraint Jones called it.

I did think of some relevant considerations, though:

"notmuch tag --batch" is much faster than repeatedly invoking "notmuch tag"

Of course, this performance difference is probably not noticeable in a well curated inbox.

Also, the awk-print-tag-batch pattern is re-usable if you'd like to auto-tag messages in other ways e.g.
- tag "vip" threads from known addresses
- pick out coupon codes using a full text search (again, to each their own)

I find this way to be more flexible than the current provisions for "named queries" in the notmuch config/database:

query:<name>

     The 'query:' prefix allows queries to refer to previously saved queries added with ‘notmuch-config(1)’.

Though, I'm not poo-pooing that feature. I would certainly appreciate further development on this front.


* Re: searching for a message by path
  2024-09-21 16:24     ` Panayotis Manganaris
  2024-09-21 17:30       ` Teemu Likonen
@ 2024-09-24  9:09       ` Michael J Gruber
  2024-09-28  2:56         ` Frederick Eaton
  1 sibling, 1 reply; 16+ messages in thread
From: Michael J Gruber @ 2024-09-24  9:09 UTC (permalink / raw)
  To: Panayotis Manganaris; +Cc: frederik, notmuch

Am Sa., 21. Sept. 2024 um 18:24 Uhr schrieb Panayotis Manganaris
<panos.manganaris@gmail.com>:
...
> notmuch search --output=messages 'tag:new' > /tmp/msgs
> notmuch search --output=files 'tag:new' |\
>     bogofilter -o0.7,0.7 -bt |\
>     paste - /tmp/msgs |\
>     awk '$1 ~ /S/ { print "-new +spam", "-", $3 }' |\
>     notmuch tag --batch
>
...
> This script operates on the assumption that the order of results from notmuch queries is
> always the same, which is fortunately true.

It also operates under the assumption that you receive no duplicate
messages with the same message-id (such as list copies,
sent/received), or else `paste` will have a hard time matching lines.

Note that you can loop over the msgs, treat them individually, and
still collect input for `notmuch tag --batch`, which solves both the
problem with duplicate messages and potential ordering instability
while keeping batch efficiency.
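
A sketch of that combination, under the assumption that the classifier reads one raw message on stdin and exits 0 for spam (as bogofilter does according to the script earlier in this thread). The function name collect_spam_batch is made up, and the id quoting follows my reading of the batch format rather than anything confirmed here.

```shell
#!/bin/sh
# Read "id:..." terms on stdin, one per line. Pipe each raw message to
# the classifier command given as arguments and print one batch line
# for every message the classifier reports as spam (exit status 0).
collect_spam_batch() {
    while IFS= read -r msgid; do
        if notmuch show --exclude=false --format=raw "$msgid" </dev/null | "$@"
        then
            mid=${msgid#id:}
            # Quote the id: wrap in double quotes, double any inner ones.
            quoted=$(printf '%s' "$mid" | sed 's/"/""/g')
            printf '+spam -new -- id:"%s"\n' "$quoted"
        fi
    done
}

# Example use (not run here):
#   notmuch search --output=messages tag:new |
#       collect_spam_batch bogofilter |
#       notmuch tag --batch
```

Each decision is tied to one message id, so nothing depends on two searches lining up, and the tag changes are still applied in a single "notmuch tag --batch" run.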

> Your instinct to use batch tagging and id: queries is correct. I collect my new message ids in
> /tmp/msgs. These ids are unique, they are definitely unique enough to be used to tag individual
> messages on a daily basis.

I'm sorry, but either they're unique or not. What's unique enough? I'm
pestering on this because part of the OP's problem is being clear
about the notion of message, which is uniquely identified by a message
id in the notmuch db. I tried to clear that up in my previous answer
in this thread.

> > It might be useful for the reasons I stated, namely in case the Message-ID does not exist or
> > is not unique.
>
> I think mail that is successfully transmitted through a mail host necessarily obtains a message
> id, but I might be wrong. I believe notmuch indexes on both its own unique thread ids and the
> message ids. Thereby further decreasing the already minuscule chance of message id collisions.

No. Messages can arrive without mid. In that case, notmuch creates one
(without altering the message file) and uses it for indexing.
"Thread-id" is something completely different from message-ids. They
do not identify a message uniquely (but a thread of messages "joint"
by references), albeit indirectly (such as "root message of the
thread", assuming one root).

Cheers
Michael


* Re: searching for a message by path
  2024-09-23 22:14         ` Panayotis Manganaris
@ 2024-09-24 13:00           ` David Bremner
  0 siblings, 0 replies; 16+ messages in thread
From: David Bremner @ 2024-09-24 13:00 UTC (permalink / raw)
  To: Panayotis Manganaris; +Cc: notmuch

Panayotis Manganaris <panos.manganaris@gmail.com> writes:

> I find this way to be more flexible than the current provisions for "named queries" in the notmuch config/database:
>
> query:<name>
>
>      The 'query:' prefix allows queries to refer to previously saved queries added with ‘notmuch-config(1)’.
>
> Though, I'm not poo-pooing that feature. I would certainly appreciate further development on this front.

You might be interested in the 'macro' feature of saved sexp queries
notmuch-sexp-queries(7). I'm not sure if it meets your specific
requirements, but it is a further development of query: syntax.


* Re: searching for a message by path
  2024-09-24  9:09       ` Michael J Gruber
@ 2024-09-28  2:56         ` Frederick Eaton
  2024-09-29 12:08           ` David Bremner
  0 siblings, 1 reply; 16+ messages in thread
From: Frederick Eaton @ 2024-09-28  2:56 UTC (permalink / raw)
  To: Michael J Gruber; +Cc: Pengji Zhang, notmuch, Panayotis Manganaris

Thank you all for your helpful replies. It seems pretty clear that the recommended Notmuch usage for someone who wants to incorporate a script that classifies a batch of messages, is either to write the whole script to use Notmuch from the beginning, and to have the messages specified as a list of "id:" IDs or even a general Notmuch query; or, if you are using an existing script that accepts a list of files, to try to extract the message ID from each file at the end so that the new tags can be communicated from the script back to Notmuch. Both options seem a little hacky - especially since it is rather common to receive multiple distinct messages with the same ID, for example when someone replies to a mailing list post and Cc's me, and I would want these to be separately viewable (they are linked together with an "=" in the Mutt thread view) for security reasons. If Notmuch is meant to function as an abstraction layer over message files stored on the file system, then why doesn't it provide a standard way to go from file paths to Notmuch messages?
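
A minimal sketch of the second option (extracting the ID from a file), under several assumptions: the function name `mid_for_file` is hypothetical, folded (multi-line) Message-ID headers and CRLF line endings are not handled, `sha1sum` from GNU coreutils is available, and the fallback form `notmuch-sha1-<checksum>` for files without a Message-ID is an assumption about what Notmuch generates, not something confirmed in this thread.

```shell
# Print an "id:" query term for a message file.
mid_for_file() {
    # Look only at the header block (everything before the first blank
    # line) and take the first Message-ID header, without the brackets.
    mid=$(sed -n '/^$/q; s/^[Mm]essage-[Ii][Dd]:[[:space:]]*<\(.*\)>[[:space:]]*$/\1/p' "$1" | head -n 1)
    if [ -z "$mid" ]; then
        # No usable header: fall back to a checksum-based id.
        # The "notmuch-sha1-" prefix is an assumption (see above).
        mid="notmuch-sha1-$(sha1sum < "$1" | cut -d' ' -f1)"
    fi
    printf 'id:%s\n' "$mid"
}
```

The result identifies the Notmuch message, not the individual file, so two files that share a Message-ID still map to the same term.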

As for why it would be a security issue to ignore new messages with duplicate message IDs, consider that one can apparently play the following game on a Notmuch user. (1) Send a private email that the user will never see because it contains spam keywords. (2) Send a public email to a mailing list with the same ID. The Notmuch user will not see the second email, and everyone will think he is unable to reply to the allegations it presumably contains, and that he is therefore guilty and should be arrested.

My recommendation would be to split the Notmuch project into three teams: one to work on the source code, another on the documentation, and a third on test cases. There should be separate Git repositories for each team, so that I can for example run current test cases against a fork of the source repo, or use recent manual pages with an older version of the source. This way, the documentation team will be able to document deficiencies in various source releases as well as standardizing proposed new features or syntax. Or, someone would submit a pull request to the source team, that would then be discussed on the mailing list or in the issue tracker, and someone on another team would then use that discussion to write documentation or test cases before the PR is accepted. The teams would have a "checks and balances" relationship, like with the three branches of government. (I think that all software projects should be run this way, so please don't be offended.)

I wrote some Perl scripts a long time ago, which work together to tag mail and put links to each message in a tag-specific directory for each of its tags. The script would add headers to the message, however, and it rewrote the Message-ID if it wasn't unique. It did not create a full-text index like Notmuch does. It did seem fairly reliable. I am trying to adapt it to send the tags to Notmuch. I am having to use Notmuch because of a third piece of software that depends on it. It is somewhat perplexing to me that no one else has had my use case before.

Best wishes,

Frederick

On Sat, Sep 21, 2024 at 11:38:18AM +0200, Michael J Gruber wrote:
>Am Sa., 21. Sept. 2024 um 05:23 Uhr schrieb Frederick Eaton <frederik@ofb.net>:
>>
>> Thank you for your response, Pengji.
>>
>> On Sat, Sep 21, 2024 at 08:25:10AM +0800, Pengji Zhang wrote:
>> >Hi Frederick,
>> >
>> >Frederick Eaton <frederik@ofb.net> writes:
>> >
>> >>I am trying to figure out how to adapt a script I wrote for
>> >>filtering messages, to apply notmuch tags to each message. A
>> >>difficulty is that the messages are already in the Notmuch database,
>> >>because another tool has delivered them to a maildir and run
>> >>"notmuch new".
>> >>
>> >>Now, Notmuch can provide me with the paths of all the new
>> >>(unfiltered) messages, which I can give to my script. The question I
>> >>have is, once the filter is done, how can the script tell Notmuch
>> >>which message to apply the tags to?
>> >
>> >
>> >I am not sure if I understand you correctly. If the problem here is to
>> >distinguish existing messages and new messages, would the config
>> >option 'new.tags' work? For example, use
>> >
>> >   notmuch config set new.tags new
>> >
>> >to give all new messages a 'new' tag.
>>
>> No, I already have that configuration. The first sentence described what I already know how to do, the second sentence is what I'm trying to do.
>
>It seems that we're still guess-working-out what your script is
>doing/trying to do. Do you mind sharing a trimmed down version?
>
>> It might be useful for the reasons I stated, namely in case the Message-ID does not exist or is not unique.
>
>This is probably at the heart of the problem. Within notmuch, a
>"message" is something identified by a message-id (mid), and all
>information in the notmuch database is tied to a mid.
>
>When you speak about a message, you probably mean the content of an
>individual "message file" - which is a natural, but different notion.
>A "path:" refers to a message file, a "mid:" to message id.
>
>When "notmuch new" encounters a new message file, it
>- checks if it contains a valid "Message-ID" header
>- uses that as mid or generates a mid using a sha1 checksum of the message file
>- checks whether that mid (!) is in the database already
>- adds the path to the existing db entry, or creates a new db entry
>
>So, you may have several files (path entries) for the same mid, and
>which one is used for indexing purposes depends on the order of
>arrival (or, in the case of reindexing, probably on file system
>ordering). notmuch assumes that this makes no difference - same mid
>same "message". This assumption can break, for example for list
>copies, different headers on sent versus received etc.
>
>I'm elaborating on this because we have to guess about your script -
>what is a "new message" for your script, and which kind of information
>does it want to process?
>
>Typical processing would be done in a notmuch post-hook, and it would:
>- check for new messages (tag:new)
>- get their file paths from `notmuch search --output=files mid:XYZ` or such
>- do whatever it needs using the file if you really need to parse that yourself
>
>I guess most of us have some sort of script running on new messages as
>part of a hook, be it `afew` or something homegrown, and this
>typically clears the new tag afterwards.
>
>Michael
>

On Tue, Sep 24, 2024 at 11:09:26AM +0200, Michael J Gruber wrote:
>Am Sa., 21. Sept. 2024 um 18:24 Uhr schrieb Panayotis Manganaris
><panos.manganaris@gmail.com>:
>...
>> notmuch search --output=messages 'tag:new' > /tmp/msgs
>> notmuch search --output=files 'tag:new' |\
>>     bogofilter -o0.7,0.7 -bt |\
>>     paste - /tmp/msgs |\
>>     awk '$1 ~ /S/ { print "-new +spam", "-", $3 }' |\
>>     notmuch tag --batch
>>
>...
>> This script operates on the assumption that the order of results from notmuch queries are
>> always the same, which is fortunately true.
>
>It also operates under the assumption that you receive no duplicate
>messages with the same message-id (such as list copies,
>sent/received), or else `paste` will have a hard time matching lines.
>
>Note that you can loop over the msgs, treat them individually, and
>still collect input for `notmuch tag --batch`, which solves both the
>problem with duplicate messages and potential ordering instability
>while keeping batch efficiency.
>
>> Your instinct to use batch tagging and id: queries is correct. I collect my new message ids in
>> /tmp/msgs. These ids are unique, they are definitely unique enough to be used to tag individual
>> messages on a daily basis.
>
>I'm sorry, but either they're unique or not. What's unique enough? I'm
>pestering on this because part of the OP's problem is being clear
>about the notion of message, which is uniquely identified by a message
>id in the notmuch db. I tried to clear that up in my previous answer
>in this thread.
>
>
>> > It might be useful for the reasons I stated, namely in case the Message-ID does not exist or
>> > is not unique.
>>
>> I think mail that is successfully transmitted through a mail host necessarily obtains a message
>> id, but I might be wrong. I believe notmuch indexes on both its own unique thread ids and the
>> message ids. Thereby further decreasing the already minuscule chance of message id collisions.
>
>No. Messages can arrive without mid. In that case, notmuch creates one
>(without altering the message file) and uses it for indexing.
>"Thread-id" is something completely different from message-ids. They
>do not identify a message uniquely (but a thread of messages "joint"
>by references), albeit indirectly (such as "root message of the
>thread", assuming one root).
>
>Cheers
>Michael
>


* Re: searching for a message by path
  2024-09-28  2:56         ` Frederick Eaton
@ 2024-09-29 12:08           ` David Bremner
  2024-10-12 22:59             ` David Bremner
  0 siblings, 1 reply; 16+ messages in thread
From: David Bremner @ 2024-09-29 12:08 UTC (permalink / raw)
  To: frederik; +Cc: notmuch

Frederick Eaton <frederik@ofb.net> writes:

> it is rather common to receive multiple distinct messages with the
> same ID, for example when someone replies to a mailing list post and
> Cc's me, and I would want these to be separately viewable (they are
> linked together with an "=" in the Mutt thread view) for security
> reasons.

Just for the record, both copies are also separately viewable in the
Emacs interface in current notmuch. I don't know how other front ends
handle it.

> If Notmuch is meant to function as an abstraction layer over message
> files stored on the file system, then why doesn't it provide a
> standard way to go from file paths to Notmuch messages?

Although I think notmuch as it exists is far from "an abstraction
layer", the specific feature request seems reasonable. It would need
someone who wants it to get familiar with the low level implementation
details of notmuch. In particular it would require writing a database
upgrade and having a new version of the database schema.

It might happen eventually as a side effect of reworking the way
threading works in notmuch. I'd have to dig up the details of that, but
the core idea is to use Xapian "collapse keys" for more flexible
handling of duplicates.

> As for why it would be a security issue to ignore new messages with
> duplicate message IDs,
[...]

Although perhaps not so colourfully, this has been extensively discussed
on the mailing list, which is what led to the functionality of indexing
all files with the same message IDs. In short, new messages are not
being ignored.

>
> My recommendation would be to split the Notmuch project into three
> teams:

Since there is a very small number of active developers (in some sense
just me at the moment), I guess that won't happen.

> (I think that all software projects should be run this way, so please
> don't be offended.)

I understand you probably didn't set out to become a notmuch
contributor, but if you want your organizational suggestions to be taken
seriously, I'm afraid that is a prerequisite. Of course contributions to
all three of the areas you mentioned are welcome.

> I wrote some Perl scripts a long time ago
[...]
> I am having to use Notmuch because of a third piece of software that
> depends on it

I can't speak to third party software dependencies, but you might
investigate mu [1], or mairix [2] which might make design choices more
to your liking.

> It is somewhat perplexing to me that no one else has had my use case
> before

I guess one of the side effects of working on software like notmuch,
which has several different interfaces (CLI, library, bindings), is that
even after almost 15 years, it is impossible to predict all of the
different ways people will try to use it.  We try to accommodate new use
cases as they arrive, but our priority is not breaking things for the
existing users. This means being cautious about adding new features,
given our limited resources for maintenance and support.

[1]: https://www.djcbsoftware.nl/code/mu/
[2]: https://github.com/rc0/mairix


* Re: searching for a message by path
  2024-09-29 12:08           ` David Bremner
@ 2024-10-12 22:59             ` David Bremner
  2024-10-14  6:50               ` Michael J Gruber
  0 siblings, 1 reply; 16+ messages in thread
From: David Bremner @ 2024-10-12 22:59 UTC (permalink / raw)
  To: frederik; +Cc: notmuch

David Bremner <david@tethera.net> writes:

> Frederick Eaton <frederik@ofb.net> writes:
>
>> If Notmuch is meant to function as an abstraction layer over message
>> files stored on the file system, then why doesn't it provide a
>> standard way to go from file paths to Notmuch messages?
>
> Although I think notmuch as it exists is far from "an abstraction
> layer", the specific feature request seems reasonable. It would need
> someone who wants it to get familiar with the low level implementation
> details of notmuch. In particular it would require writing a database
> upgrade and having a new version of the database schema.

I was looking at the code, and I realized it is not actually as hard as
that.  Essentially the code of notmuch_database_find_message_by_filename
needs to be wrapped in a PostingSource following the model of
RegexpPostingSource (regexp-fields.cc). The fact that no database format
changes (or even reindexing) are needed, makes this a much lower risk
project.

d


* Re: searching for a message by path
  2024-10-12 22:59             ` David Bremner
@ 2024-10-14  6:50               ` Michael J Gruber
  2024-10-14 10:58                 ` David Bremner
  0 siblings, 1 reply; 16+ messages in thread
From: Michael J Gruber @ 2024-10-14  6:50 UTC (permalink / raw)
  To: David Bremner; +Cc: frederik, notmuch

Am So., 13. Okt. 2024 um 00:59 Uhr schrieb David Bremner <david@tethera.net>:
>
> David Bremner <david@tethera.net> writes:
>
> > Frederick Eaton <frederik@ofb.net> writes:
> >
> >> If Notmuch is meant to function as an abstraction layer over message
> >> files stored on the file system, then why doesn't it provide a
> >> standard way to go from file paths to Notmuch messages?
> >
> > Although I think notmuch as it exists is far from "an abstraction
> > layer", the specific feature request seems reasonable. It would need
> > someone who wants it to get familiar with the low level implementation
> > details of notmuch. In particular it would require writing a database
> > upgrade and having a new version of the database schema.
>
> I was looking at the code, and I realized it is not actually as hard as
> that.  Essentially the code of notmuch_database_find_message_by_filename
> needs to be wrapped in a PostingSource following the model of
> RegexpPostingSource (regexp-fields.cc). The fact that no database format
> changes (or even reindexing) are needed, makes this a much lower risk
> project.

If you want to map filenames to mids, you can use xapian.
Say, $relone is the filename path relative to the notmuch basedir
~/.mail/.notmuch/.

```
dirterm=XDIRECTORY$(dirname $relone)
dirdocid=$(xapian-delve -1 -t $dirterm ~/.mail/.notmuch/xapian/ | tail -n1)
docterm=XFDIRENTRY${dirdocid}:$(basename $relone)
docid=$(xapian-delve -1 -t $docterm ~/.mail/.notmuch/xapian/ | tail -n1)
xapian-delve -1 -t $docterm -r $docid -V1 ~/.mail/.notmuch/xapian/ |
grep Value | cut -d' ' -f6-
```
... or grep Message-ID :)
Cheers
Michael


* Re: searching for a message by path
  2024-10-14  6:50               ` Michael J Gruber
@ 2024-10-14 10:58                 ` David Bremner
  0 siblings, 0 replies; 16+ messages in thread
From: David Bremner @ 2024-10-14 10:58 UTC (permalink / raw)
  To: Michael J Gruber; +Cc: frederik, notmuch

Michael J Gruber <michaeljgruber+grubix+git@gmail.com> writes:

> ```
> dirterm=XDIRECTORY$(dirname $relone)
> dirdocid=$(xapian-delve -1 -t $dirterm ~/.mail/.notmuch/xapian/ | tail -n1)
> docterm=XFDIRENTRY${dirdocid}:$(basename $relone)
> docid=$(xapian-delve -1 -t $docterm ~/.mail/.notmuch/xapian/ | tail -n1)
> xapian-delve -1 -t $docterm -r $docid -V1 ~/.mail/.notmuch/xapian/ |
> grep Value | cut -d' ' -f6-
> ```

Impressive hack! ;)

Roughly speaking, that's what the code in question does, but a
PostingSource is a Xapian way of wrapping up that double query (two
calls to xapian-delve in this case) so that it can be part of a regular
query with e.g. a file: prefix.


end of thread, other threads:[~2024-10-14 10:58 UTC | newest]

Thread overview: 16+ messages
2024-09-20 17:52 searching for a message by path Frederick Eaton
2024-09-21  0:25 ` Pengji Zhang
2024-09-21  3:23   ` Frederick Eaton
2024-09-21  9:01     ` Pengji Zhang
2024-09-21  9:38     ` Michael J Gruber
2024-09-21 10:44     ` Gregor Zattler
2024-09-21 16:24     ` Panayotis Manganaris
2024-09-21 17:30       ` Teemu Likonen
2024-09-23 22:14         ` Panayotis Manganaris
2024-09-24 13:00           ` David Bremner
2024-09-24  9:09       ` Michael J Gruber
2024-09-28  2:56         ` Frederick Eaton
2024-09-29 12:08           ` David Bremner
2024-10-12 22:59             ` David Bremner
2024-10-14  6:50               ` Michael J Gruber
2024-10-14 10:58                 ` David Bremner
