feature request: caching message arrival time

unofficial mirror of notmuch@notmuchmail.org

* feature request: caching message arrival time
@ 2019-06-01  3:29 Daniel Kahn Gillmor
  2019-06-01 14:13 ` David Bremner
  2019-06-01 14:19 ` Ralph Seichter
  0 siblings, 2 replies; 10+ messages in thread
From: Daniel Kahn Gillmor @ 2019-06-01  3:29 UTC (permalink / raw)
  To: Notmuch Mail

Hi Notmuch folks--

I'm working on Autocrypt integration for notmuch right now, and it
occurs to me that it might be useful to know the time that any given
message was first seen by notmuch.

I'm trying to not get distracted by implementing such a feature, but I
wanted to log this as a feature request, along with a few thoughts about
it.

My idea is that the first time notmuch indexes a message, it would add a
property to the message like firstseen=2019-05-31T23:15:24Z.
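
A rough Python sketch of the set-once behaviour may make the idea concrete. The dict here stands in for a message's property map, and `stamp_first_seen` is a made-up name for illustration, not an existing notmuch API:

```python
from datetime import datetime, timezone


def stamp_first_seen(properties, now=None):
    """Record a firstseen value only if the message does not have one yet."""
    if "firstseen" not in properties:
        now = now or datetime.now(timezone.utc)
        # Store as UTC in the same format as the example above.
        properties["firstseen"] = now.astimezone(timezone.utc).strftime(
            "%Y-%m-%dT%H:%M:%SZ"
        )
    return properties


props = stamp_first_seen({}, datetime(2019, 5, 31, 23, 15, 24, tzinfo=timezone.utc))
# A later call (e.g. during reindexing) leaves the original value alone.
stamp_first_seen(props, datetime(2020, 1, 1, tzinfo=timezone.utc))
```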

Some nuances spring to mind:

 * This should *not* be cleared and reset on reindexing, so it doesn't
   belong in the index.* property namespace.

 * What happens when you delete a message?  Maybe we should keep that
   value around for "ghosts" too -- can ghost documents have properties?
   Or is it bad to remember that we've seen the message if someone
   deletes it?

 * When even the ghost goes away (e.g. full thread deletion), presumably
   this property would go away.  So if you deleted the message from your
   message store, notmuch would forget about it, and then the next time
   you ingest it, it would get a later "firstseen=" property.  I'm ok
   with this.

 * i don't think we have a way to search properties by range (e.g. the
   way that we can search date ranges).  i don't need that feature for
   my use case, but maybe someone will come up with a use case that
   wants it?  is there a way to store the datestamp in a way that it can
   be scanned the way that "date" can?  or do we already have this and
   i'm just unaware?

 * What is the cost in terms of database size?  It doesn't look like it
   would be expensive to me, but i haven't profiled it.

 * if we make such a change, how should we deal with already-indexed
   messages?

Anyone have any thoughts, suggestions, or objections to this?  I'm happy
to explain more about my use case if people are interested too.

       --dkg

* Re: feature request: caching message arrival time
  2019-06-01  3:29 feature request: caching message arrival time Daniel Kahn Gillmor
@ 2019-06-01 14:13 ` David Bremner
  2019-06-01 14:19 ` Ralph Seichter
  1 sibling, 0 replies; 10+ messages in thread
From: David Bremner @ 2019-06-01 14:13 UTC (permalink / raw)
  To: Daniel Kahn Gillmor, Notmuch Mail

Daniel Kahn Gillmor <dkg@fifthhorseman.net> writes:

>  * i don't think we have a way to search properties by range (e.g. the
>    way that we can search date ranges).  i don't need that feature for
>    my use case, but maybe someone will come up with a use case that
>    wants it?  is there a way to store the datestamp in a way that it can
>    be scanned the way that "date" can?  or do we already have this and
>    i'm just unaware?

You'd need to use a value slot to get (native Xapian) range
searches. To quote the Xapian docs:

      For performance it is important to keep the amount of data stored
      in the values to a minimum, since the values for a large number of
      documents may be read during the search - the more data that has
      to be read, the slower the search will be.

So it's definitely something that would need to be profiled.

Probably the patches that added lastmod: are a good example for someone
wanting to investigate this.
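
As a rough illustration of why a value slot suits range searches: if each timestamp is stored in a fixed-width encoding whose byte order matches numeric order, a range check is a plain comparison on the stored bytes. The Python below is only a stand-in under that assumption. Xapian has its own serialisation for numeric values, and `encode_timestamp`, `in_range`, and the `slots` dict are hypothetical names, not notmuch or Xapian code:

```python
import struct


def encode_timestamp(ts):
    """Fixed-width big-endian encoding: byte order matches numeric order."""
    return struct.pack(">Q", ts)


def in_range(value, lo, hi):
    """Range check done purely on the encoded bytes."""
    return encode_timestamp(lo) <= value <= encode_timestamp(hi)


# One small value per document; every one of them may be read during a search.
slots = {
    "msg-a": encode_timestamp(1559359764),
    "msg-b": encode_timestamp(1559570000),
}
hits = sorted(m for m, v in slots.items() if in_range(v, 1559300000, 1559400000))
```

Each value here costs 8 bytes per message, which is the kind of per-document overhead the profiling would need to measure.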

* Re: feature request: caching message arrival time
  2019-06-01  3:29 feature request: caching message arrival time Daniel Kahn Gillmor
  2019-06-01 14:13 ` David Bremner
@ 2019-06-01 14:19 ` Ralph Seichter
  2019-06-01 15:30   ` Daniel Kahn Gillmor
  1 sibling, 1 reply; 10+ messages in thread
From: Ralph Seichter @ 2019-06-01 14:19 UTC (permalink / raw)
  To: notmuch

* Daniel Kahn Gillmor:

> I'm working on Autocrypt integration for notmuch right now [...]

Woot! :-)

> I'm happy to explain more about my use case if people are interested
> too.

I'm interested. Right now I frankly don't know what knowing when a
message was first seen by Notmuch might be useful for. That makes it
a bit difficult for me to contemplate your questions.

-Ralph

* Re: feature request: caching message arrival time
  2019-06-01 14:19 ` Ralph Seichter
@ 2019-06-01 15:30   ` Daniel Kahn Gillmor
  2019-06-03  8:57     ` Örjan Ekeberg
  0 siblings, 1 reply; 10+ messages in thread
From: Daniel Kahn Gillmor @ 2019-06-01 15:30 UTC (permalink / raw)
  To: Ralph Seichter, notmuch

On Sat 2019-06-01 16:19:19 +0200, Ralph Seichter wrote:
> I'm interested. Right now I frankly don't know what knowing when a
> message was first seen by Notmuch might be useful for. That makes it
> a bit difficult for me to contemplate your questions.

Sure, thanks for asking!

As i went to write this down, it became a lot longer than i'd expected.
sorry about that!  On the positive side, i may have convinced myself in
the process that the threat this mechanism would defend against is small
enough that it may not be worth the additional implementation (though if
the implementation were there, we'd certainly want to use it).

So, this is a story about Autocrypt state, out-of-order delivery, and
e-mails with suspicious date stamps ("from the future"). (if you're
reading this message and haven't been following Autocrypt closely, you
can read up at https://www.autocrypt.org/)

------

When receiving an e-mail sent From: the peer foo@example.org, an
Autocrypt-capable client needs to update the Autocrypt state for that
peer's e-mail address ("foo@example.org").  This is the case for
messages that have an Autocrypt: header *and* for messages that *don't*
have one.

Both kinds of messages update the Autocrypt peer state, because if you
start receiving Autocrypt-free messages from someone who used Autocrypt
in the past, your client needs to make a note of that and consider it
when it makes its recommendation for new outbound messages to that peer.

Additionally, sometimes we receive e-mail messages out of order.
sometimes this is because we're suddenly running across a cache of old
messages, sometimes it's because we've just popped online after a day
off, and sometimes it's because SMTP had a hiccup (there are probably
many other reasons).

We also probably don't want to store state about everyone who has ever
sent us mail *without* using Autocrypt.  At the moment, at least, that's
probably most senders, and it's both a waste of space and a potential
privacy concern to record a lot of empty state that just indicates that
you got mail from someone at some point in the past.  So if we've never
seen an Autocrypt header from a given peer, there's no state to update.

So now consider the following set of e-mail messages all from the same
sender; mails with a * have an Autocrypt header, and the time
following each message indicates its Date: header in an abstract way
(higher numbers are later than lower numbers).

 A: (time 1)
 B*: (time 2)
 C: (time 3)

Let's assume that i update Autocrypt state about the peer upon receipt
of each message, regardless of what order the messages were sent.  We
want the resulting Autocrypt state to be the same, independent of the
order of delivery.

If i receive them at times 4, 5, and 6 in order (A, B, C) then i'll
think that the Autocrypt state for the peer is "we had an Autocrypt
header earlier (from B), but a more recent delivery (C) suggests that
they might not be using Autocrypt reliably" (depending on the actual
difference in time between the Date:s of B and C, the peer might end up
with an Autocrypt recommendation called "discourage").  This is the
correct state for us to end up in.

But now imagine that at times 4, 5, and 6 i receive the messages in the
order A, C, B.  I don't store Autocrypt state for the peer at times 4
and 5, because i've never seen an Autocrypt header for the peer before,
and there is none in messages A and C.  So my end result is that i'll
think that the Autocrypt state for the peer is just the Autocrypt header
from B.

But that's different from what we ended up with when we received
the messages in order.

Now, we can improve on this with the following extra technique: when a
peer goes from no Autocrypt state to having an Autocrypt state, we can
search the existing index for messages from that peer with a later Date:
header.  If we find such a message, then we should include it in our
calculations.  If we do that, then we end up with the correct state,
regardless of the order of delivery.  good!
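
The A/B*/C scenario and the extra index search can be sketched as a small simulation. This is a simplified model under stated assumptions, not an Autocrypt implementation: the peer state is reduced to two fields named after the `last_seen` and `autocrypt_timestamp` values in the Autocrypt spec, and `Msg`, `process`, and the `backfill` flag are invented for illustration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Msg:
    name: str
    date: int          # abstract Date: header value (higher is later)
    autocrypt: bool    # True if the message carries an Autocrypt: header


def process(delivery_order, backfill):
    """Return the peer state dict after delivery, or None if none was created."""
    index = []      # messages already indexed, in delivery order
    state = None    # no peer state until the first Autocrypt header is seen
    for msg in delivery_order:
        if state is None:
            if msg.autocrypt:
                state = {"autocrypt_timestamp": msg.date, "last_seen": msg.date}
                if backfill:
                    # Search the existing index for later-dated messages.
                    for old in index:
                        state["last_seen"] = max(state["last_seen"], old.date)
        else:
            state["last_seen"] = max(state["last_seen"], msg.date)
            if msg.autocrypt:
                state["autocrypt_timestamp"] = max(
                    state["autocrypt_timestamp"], msg.date
                )
        index.append(msg)
    return state


A, B, C = Msg("A", 1, False), Msg("B", 2, True), Msg("C", 3, False)
in_order = process([A, B, C], backfill=False)
out_of_order = process([A, C, B], backfill=False)
out_of_order_fixed = process([A, C, B], backfill=True)
```

Without the search, the A, C, B delivery never notices C; with it, both delivery orders end in the same state.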

So far, we haven't needed the firstseen= property yet.  There's one
final wrinkle that introduces the need for it: message Date: headers can
be wrong.  They can even be grossly wrong -- they can be from the
future.  This can happen when the sender's clock is bad, mainly, but it
can also happen through malice (someone wanting to forge a message to
mess with the recipient's state about a given peer, for example).

So Autocrypt defines the "effective date" of a message as the *earliest*
of two dates: the date that the message is first seen, and the Date:
header itself.  So we want our augmented Autocrypt header ingestion
routine to search for all other messages we know about from the sender
that have both a later firstseen= property *and* a later Date: header.

Otherwise, one poorly formed e-mail without an Autocrypt header with the
Date: set to the year 3000 (the "bogus future message") would make it so
that the peer's recommendation would be set to "discourage" when a
message that contains an Autocrypt: header first comes in.
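
The "effective date" rule itself is tiny. A minimal sketch, assuming both values are available as timezone-aware datetimes (`effective_date` is a name chosen here, and the sample times are made up):

```python
from datetime import datetime, timezone


def effective_date(date_header, first_seen):
    """Earlier of the Date: header and the time the message was first seen."""
    return min(date_header, first_seen)


first_seen = datetime(2019, 6, 1, 15, 30, tzinfo=timezone.utc)
bogus_future = datetime(3000, 1, 1, tzinfo=timezone.utc)
ordinary = datetime(2019, 6, 1, 15, 29, tzinfo=timezone.utc)

clamped = effective_date(bogus_future, first_seen)    # falls back to first_seen
unchanged = effective_date(ordinary, first_seen)      # Date: header is kept
```

The year-3000 message can then count as no later than the moment it actually showed up, which limits how far a bad Date: header can skew things.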

Conclusion
----------

Upon writing all this down, perhaps that's not such a troubling threat.
Having such a bogus future message stored in the database would indeed
leave the peer with a "discouraged" Autocrypt state upon receipt of the
first Autocrypt: header.  But if that database search only happens upon
the first Autocrypt: header seen, then a second message from the same
peer would clear the "discouraged" recommendation without consulting the
bogus future message at all.

So if the threat of a bogus future message is overcome by just a single
additional Autocrypt-enabled message from the same peer, that's not
particularly bad.  And "bogus future message" probably isn't all that
likely either.

So this isn't very high on my list of priorities after all, though if
such a firstseen property were available, i'd definitely use it to
improve the Autocrypt experience in this minor way.

        --dkg

* Re: feature request: caching message arrival time
  2019-06-01 15:30   ` Daniel Kahn Gillmor
@ 2019-06-03  8:57     ` Örjan Ekeberg
  2019-06-03 13:17       ` Daniel Kahn Gillmor
  0 siblings, 1 reply; 10+ messages in thread
From: Örjan Ekeberg @ 2019-06-03  8:57 UTC (permalink / raw)
  To: Daniel Kahn Gillmor, Ralph Seichter, notmuch

Daniel Kahn Gillmor <dkg@fifthhorseman.net> writes:

> So Autocrypt defines the "effective date" of a message as the *earliest*
> of two dates: the date that the message is first seen, and the Date:
> header itself.  So we want our augmented Autocrypt header ingestion
> routine to search for all other messages we know about from the sender
> that have both a later firstseen= property *and* a later Date: header.

Would it be possible to use the earliest date seen in any of the
Received: headers as a safeguard against future-dated messages?

/Örjan

* Re: feature request: caching message arrival time
  2019-06-03  8:57     ` Örjan Ekeberg
@ 2019-06-03 13:17       ` Daniel Kahn Gillmor
  2019-06-03 14:02         ` Ralph Seichter
  2019-06-03 16:02         ` Örjan Ekeberg
  0 siblings, 2 replies; 10+ messages in thread
From: Daniel Kahn Gillmor @ 2019-06-03 13:17 UTC (permalink / raw)
  To: Örjan Ekeberg, Ralph Seichter, notmuch

On Mon 2019-06-03 10:57:15 +0200, Örjan Ekeberg wrote:
> Daniel Kahn Gillmor <dkg@fifthhorseman.net> writes:
>
>> So Autocrypt defines the "effective date" of a message as the *earliest*
>> of two dates: the date that the message is first seen, and the Date:
>> header itself.  So we want our augmented Autocrypt header ingestion
>> routine to search for all other messages we know about from the sender
>> that have both a later firstseen= property *and* a later Date: header.
>
> Would it be possible to use the earliest date seen in any of the
> Received: headers as a safeguard against future-dated messages?

Sure, assuming that you trust the closest MTA in the chain of MTAs that
handed the message off to you, since an adversarial proximal MTA could
manipulate all the existing Received: headers as well.

But I'm a bit uncomfortable with it: this sort of protection actually
opens up a new attack vector that didn't exist before -- any MTA in the
chain can now make the message seem like it was actually from the
*past*, just by setting its own Received: header.

Technically, of course, any MTA could munge the actual Date: header as
well to perform this kind of attack, but that munging would at least
have the potential to be detected by anyone who cares to verify DKIM
headers; but Received: headers are impossible to cover with DKIM.

If there was no expense to the indexing and storage, i'd say it would be
good to just go ahead and index the earliest Received: header as well,
to have that data trivially available as a data point in evaluating
incoming messages.  But since it sounds like there's a cost (in
performance and storage) that would need to be profiled, i don't know
that i can say it's worth the tradeoff.

Since notmuch actually knows when it received the message, it seems like
it would be simplest (and less vulnerable to manipulation) to just
record that timestamp directly.

         --dkg

* Re: feature request: caching message arrival time
  2019-06-03 13:17       ` Daniel Kahn Gillmor
@ 2019-06-03 14:02         ` Ralph Seichter
  2019-06-03 22:16           ` Daniel Kahn Gillmor
  2019-06-03 16:02         ` Örjan Ekeberg
  1 sibling, 1 reply; 10+ messages in thread
From: Ralph Seichter @ 2019-06-03 14:02 UTC (permalink / raw)
  To: notmuch

* Daniel Kahn Gillmor:

> Since notmuch actually knows when it received the message [...]

Not meaning to complicate things, but Notmuch does not receive messages
at all. ;-) One needs to rely on some software to populate the Maildir
tree (Dovecot LMTP in my case, Postfix or some other MTA for local
delivery in other cases). Any software transporting the raw messages
can, and sometimes must, manipulate the header data, and the order in
which files within the Maildir tree are created is also not determined
by Notmuch.

As an example: My nightly backup script disables local delivery for the
duration of the backup process. Once reactivated, delivery of queued
messages resumes, but it is not guaranteed to happen in the order of
arrival. So even the local MTA, although trusted, might induce issues in
terms of delivery time.

-Ralph

* Re: feature request: caching message arrival time
  2019-06-03 14:02         ` Ralph Seichter
@ 2019-06-03 22:16           ` Daniel Kahn Gillmor
  0 siblings, 0 replies; 10+ messages in thread
From: Daniel Kahn Gillmor @ 2019-06-03 22:16 UTC (permalink / raw)
  To: Ralph Seichter, notmuch

On Mon 2019-06-03 16:02:48 +0200, Ralph Seichter wrote:
> Not meaning to complicate things, but Notmuch does not receive messages
> at all. ;-) One needs to rely on some software to populate the Maildir
> tree (Dovecot LMTP in my case, Postfix or some other MTA for local
> delivery in other cases). Any software transporting the raw messages
> can, and sometimes must, manipulate the header data, and the order in
> which files within the Maildir tree are created is also not determined
> by Notmuch.
>
> As an example: My nightly backup script disables local delivery for the
> duration of the backup process. Once reactivated, delivery of queued
> messages resumes, but it is not guaranteed to happen in the order of
> arrival. So even the local MTA, although trusted, might induce issues in
> terms of delivery time.

I agree with you!  the e-mail system, like any other store-and-forward
ecosystem, offers no guarantees of message delivery.

fwiw, i'm not claiming that the time notmuch receives the message is
guaranteed to be close to the time that the message was sent.

but i can guarantee two things:

 * notmuch cannot receive the message *before* it was sent :)

 * if the local system clock is correct, notmuch can place a plausible
   upper bound on the Date: header that is included in the message.

This alone is useful data.

     --dkg

* Re: feature request: caching message arrival time
  2019-06-03 13:17       ` Daniel Kahn Gillmor
  2019-06-03 14:02         ` Ralph Seichter
@ 2019-06-03 16:02         ` Örjan Ekeberg
  2019-06-03 22:21           ` Daniel Kahn Gillmor
  1 sibling, 1 reply; 10+ messages in thread
From: Örjan Ekeberg @ 2019-06-03 16:02 UTC (permalink / raw)
  To: Daniel Kahn Gillmor, Ralph Seichter, notmuch

Daniel Kahn Gillmor <dkg@fifthhorseman.net> writes:

> Sure, assuming that you trust the closest MTA in the chain of MTAs that
> handed the message off to you, since an adversarial proximal MTA could
> manipulate all the existing Received: headers as well.
>
> But I'm a bit uncomfortable with it: this sort of protection actually
> opens up a new attack vector that didn't exist before -- any MTA in the
> chain can now make the message seem like it was actually from the
> *past*, just by setting its own Received: header.

As far as I understand the autocrypt protocol (i.e. not much;-) ), the
vulnerability is that an incoming message with a later time-stamp than
the locally saved autocrypt status can update the stored state
(e.g. turn off encryption).  Manipulating the time-stamp to make the
message appear to be *older* than it really is should only mean that it is
less likely to update the saved state?

If this is correct, using the oldest of all the time-stamps seen in the
Date-header and any of the Received-headers should be the most
defensive.
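
Roughly, in Python with only the standard library, that could look like the sketch below. It assumes the timestamp in a Received: header is whatever follows the last ';', skips anything that does not parse or has no usable timezone, and takes every header at face value. `earliest_timestamp` and the sample message are made up for illustration:

```python
from email import message_from_string
from email.utils import parsedate_to_datetime


def earliest_timestamp(raw_message):
    """Earliest parseable time among Date: and the Received: headers, or None."""
    msg = message_from_string(raw_message)
    candidates = []
    if msg["Date"]:
        candidates.append(msg["Date"])
    for received in msg.get_all("Received", []):
        # Assumption: the timestamp follows the last ';' in the header.
        _, sep, stamp = received.rpartition(";")
        if sep:
            candidates.append(stamp.strip())
    parsed = []
    for text in candidates:
        try:
            when = parsedate_to_datetime(text)
        except (TypeError, ValueError):
            continue  # unparseable date, ignore it
        if when.tzinfo is not None:  # naive and aware values cannot be compared
            parsed.append(when)
    return min(parsed) if parsed else None


raw = (
    "Received: from a.example by b.example; Mon, 3 Jun 2019 08:00:00 +0000\n"
    "Received: from c.example by a.example; Mon, 3 Jun 2019 07:59:00 +0000\n"
    "Date: Fri, 1 Jan 3000 00:00:00 +0000\n"
    "From: foo@example.org\n"
    "\n"
    "body\n"
)
earliest = earliest_timestamp(raw)
```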

/Örjan

* Re: feature request: caching message arrival time
  2019-06-03 16:02         ` Örjan Ekeberg
@ 2019-06-03 22:21           ` Daniel Kahn Gillmor
  0 siblings, 0 replies; 10+ messages in thread
From: Daniel Kahn Gillmor @ 2019-06-03 22:21 UTC (permalink / raw)
  To: Örjan Ekeberg, Ralph Seichter, notmuch

On Mon 2019-06-03 18:02:53 +0200, Örjan Ekeberg wrote:
> As far as I understand the autocrypt protocol (i.e. not much;-) ), the
> vulnerability is that an incoming message with a later time-stamp than
> the locally saved autocrypt status can update the stored state
> (e.g. turn off encryption).  Manipulating the time-stamp to make the
> message appear to be *older* than it really is should only mean that it is
> less likely to update the saved state?
>
> If this is correct, using the oldest of all the time-stamps seen in the
> Date-header and any of the Received-headers should be the most
> defensive.

It's the most defensive against one form of attack: forging e-mails
intended to update the user's Autocrypt state about a given peer.

But another form of attack is also possible: convincing the user to
*not* update their Autocrypt state about a given peer, while leaving the
original message otherwise plausible and intact, thereby raising no
suspicions about delivery problems.

I'd like notmuch's Autocrypt implementation to try to defend against
either attack where possible.

       --dkg
