From: Stefan Beller <sbeller@google.com>
To: Eric Wong <e@80x24.org>, Jeff King <peff@peff.net>
Cc: meta@public-inbox.org
Subject: Re: Nonlinear history?
Date: Wed, 23 Aug 2017 11:29:24 -0700
Message-ID: <CAGZ79kYLcy_Pe0POUjUC+SaZYzFnLYhBYbY+ZNEBwc+32j8b1A@mail.gmail.com>
In-Reply-To: <20170823014239.GA4113@starla>

On Tue, Aug 22, 2017 at 6:42 PM, Eric Wong <e@80x24.org> wrote:
> Stefan Beller <sbeller@google.com> wrote:
>> So I happened to search an old post of mine today,
>> specifically I knew only a couple of bits of it:
>> * I authored a patch series, that had a given
>>   string in the name ("protocolv2")
>> * I was looking for an answer by Peff
>>
>> To find the post in question I used both the local git mailing list repository
>> (as cloned from https://public-inbox.org/git) to find the starting point[1]
>> as well as the online list to see the relations between posts, such
>> that I finally arrived at [2].
>>
>> However in the process of searching locally I wondered if the
>> repository data could be organized better, instead of linearly.
>
> The reason it is organized linearly is so it can be
> up-to-the-minute and fetched incrementally as soon as mail
> arrives (or it is marked as spam).

So the design decision is to be as fast as possible on
relaying the message, reducing the time from receiving
to publishing?

> However, I've been considering after-the-fact organization, too
> (similar to how packing works in git).  However, it's not
> to optimize search, but to improve storage efficiency:
>
> 1) purge spam messages from history

I thought this would already happen. Every once in a while
I get a spam mail via the mailing list, and the last time I checked
it was not to be found in the public-inbox archive, so I assumed
spam filtering was already part of the decision for each new
message.

> 2) squash to reduce tree and commit objects

This would only work for patch series, such that the
author is kept (and the time is only skewed by a few seconds).

> 3) perhaps choose smarter filenames which can improve
>    packing heuristics
>
> So I'm also strongly favoring moving away from the 2/38
> Message-ID naming scheme we currently use, too.

I wondered if the email Message-ID (e.g.
20170823014239.GA4113@starla) is a good base for a
naming scheme? (Sharding into directories would need to
be added. Maybe even 'in reverse'? That would help
to separate mails by host/sender.)

>> So what if the git history would reflect the parent relationships
>> of the emails? Essentially each email is comparable to
>> a topic branch in the git workflow (potentially with other
>> series/email on top of it). Each topic would be merged
>> to master immediately, such that the first parent master
>> branch history consists of merges only; the second parent
>> is the new ingested email, which is either a root-commit
>> (when a new topic is started), or a commit building on top
>> of another commit (which contains the email it is responding
>> to; that other commit is merged to master already).
>
> That may not work well because emails arrive out-of-order,
> especially when many are sent in rapid succession with
> "git send-email".  I've had to make bugfixes to some of the
> Perl+Xapian logic to deal with OOO message delivery, too :)
>
> And having to map Message-ID to a particular commit would
> require extra overhead to keep track of parents, no?

Yes, but that is already a problem when using the data as a viewer.
Every time I visit https://public-inbox.org/meta/20170823014239.GA4113@starla/
the server needs to compute the "thread overview". If the git history
were grouped by message relationships, the querying could be
done via git, which (now that I think about it) may not actually be
cheaper than searching the "unstructured" data as it is now.

Out-of-order delivery does seem to be a problem for this design.
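
A toy model of why: assume each mail becomes a commit whose parent
is the commit of the mail it replies to, and that history is never
rewritten. The names below (ingest, the example Message-IDs) are
made up for illustration only.

```python
def ingest(messages):
    """Ingest (message_id, in_reply_to) pairs in arrival order.

    Returns (commits, orphans): commits maps each message_id to the
    message_id of its parent commit (None for a root commit), and
    orphans lists replies whose parent had not arrived yet.
    """
    commits = {}
    orphans = []
    for mid, parent in messages:
        if parent is None or parent in commits:
            commits[mid] = parent
        else:
            # Parent not ingested yet. Without rewriting history the
            # reply can only be stored as a root commit (or be held
            # back, which delays publishing).
            commits[mid] = None
            orphans.append(mid)
    return commits, orphans


# A series sent with "git send-email" where 2/2 overtakes 1/2.
arrival = [
    ("cover@x", None),
    ("patch2@x", "patch1@x"),
    ("patch1@x", "cover@x"),
]
commits, orphans = ingest(arrival)
print(orphans)  # ['patch2@x']
```

So either the reply is recorded with the wrong parentage forever, or
ingestion has to wait, which conflicts with the up-to-the-minute
goal.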

Note that Peff seems to have built tooling around public-inbox
(https://public-inbox.org/git/20170823154747.vxtyy2v2ofkxwrkx@sigill.intra.peff.net/)
that already produces exactly this lookup.

> So, I think what you're describing already happens in the Perl
> search code as every message gets assigned a thread_id when it
> is indexed in Xapian.  I suppose you still cannot look at AGPL-3
> code (being a Googler), but I stole the logic from notmuch
> (C++, GPL-3+, no 'A') circa 2015/2016(*).  I believe mairix(**)
> uses similar logic for mapping messages to thread IDs, too.
>
> So the thread skeleton you see at the bottom of every message
> page is done using a boolean thread_id search OR-ed with a
> Subject search.

That sounds efficient.
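
For my own understanding, a minimal sketch of how such a thread_id
assignment could tolerate out-of-order arrival. This is a guess at
the general idea, not the actual public-inbox or notmuch code; the
class and method names are invented, and the Subject-based matching
is left out.

```python
import itertools


class ThreadIndex:
    """Toy index: maps Message-IDs to thread ids at indexing time."""

    def __init__(self):
        # Also holds ids that were only referenced, not yet received.
        self.thread_of = {}
        self._next_id = itertools.count(1)

    def add(self, mid, references=()):
        ids = [mid, *references]
        known = {self.thread_of[i] for i in ids if i in self.thread_of}
        tid = min(known) if known else next(self._next_id)
        # This message links several existing threads: merge them.
        for other in known - {tid}:
            for key, value in self.thread_of.items():
                if value == other:
                    self.thread_of[key] = tid
        for i in ids:
            self.thread_of[i] = tid
        return tid


idx = ThreadIndex()
# Same out-of-order series as before: the reply arrives first.
t1 = idx.add("patch2@x", ["patch1@x"])
t2 = idx.add("patch1@x", ["cover@x"])
t3 = idx.add("cover@x")
print(t1 == t2 == t3)  # True
```

Because the mapping lives in a mutable index rather than in
immutable commits, a late parent simply joins the thread its
children already created.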

> In short, I would like to depend more on the search engine for
> logic and keep that flexible; but continue to keep the (git)
> storage layer "dumb".  The smarts would be in Xapian, which can
> be tuned and refined after-the-fact with minimal refetching.

Eh, I see your point (and your motivation as the maintainer of
public-inbox).

As a user I would have hoped for a "smart" git layer, as I like
searching the data using git tools, which would be enhanced if
the git layer is not "dumb".

> And Xapian could also be swapped out for alternative search
> engines, too (Groonga, maybe).  I consider it having a similar
> in philosophy to git itself w.r.t. storage optimization,
> merge strategies, and rename detection.

So dumb data, with a smart (and potentially even smarter
in the future) program on top.

>
>> Has this idea been come up before or even discussed before?
>
> Not exactly what you're asking, but I guess what I described is
> similar to what we already do via Xapian.
>
>> Thanks,
>> Stefan
>>
>> [1] I really like the search by author feature!
>> [2] https://public-inbox.org/git/20150604130902.GA12404@peff.net/
>
> Thanks.   Fwiw, I did not know Xapian at all when I started this
> project, but it's become one of my favorite things about it :)
> And I stole the f:, tc:, s: and several other prefixes from
> mairix.
>
>
>
> (*) I've never used notmuch as I don't use Maildir for long-term
>     archival and I don't think notmuch ever supported anything else.
>     I also don't know C++, so maybe I interpreted wrong :x
>
> (**) I still use mairix, but I'm considering starting a
>      separate project to replace it for my personal mail(***)
>      It may also be useful for prototyping future public-inbox
>      changes, too.
>
> (***) For private emails, I want IMAP support + offline
>       memoization instead of my current mairix + offlineimap +
>       archive-old-mail-to-mboxrd script.  I'd still want
>       to rely on git for message caching/memoization.

Thanks for your considerations.

Thread overview: 5+ messages
2017-08-16 21:01 Nonlinear history? Stefan Beller
2017-08-23  1:42 ` Eric Wong
2017-08-23 18:29   ` Stefan Beller [this message]
2017-08-23 19:40     ` Eric Wong
2017-08-23 20:06     ` Jeff King
