From: Eric Wong <e@80x24.org>
To: Stefan Beller <sbeller@google.com>
Cc: meta@public-inbox.org
Subject: Re: Nonlinear history?
Date: Wed, 23 Aug 2017 01:42:39 +0000
Message-ID: <20170823014239.GA4113@starla>
In-Reply-To: <CAGZ79kZW6O_wCZRMrWDc1yXvQzTDbFOLrcjt=81XGJj=VUjBzw@mail.gmail.com>

Stefan Beller <sbeller@google.com> wrote:
> So I happened to search an old post of mine today,
> specifically I knew only a couple of bits of it:
> * I authored a patch series, that had a given
> string in the name ("protocolv2")
> * I was looking for an answer by Peff
>
> To find the post in question I used both the local git mailing list repository
> (as cloned from https://public-inbox.org/git) to find the starting point[1]
> as well as the online list to see the relations between posts, such
> that I finally arrived at [2].
>
> However in the process of searching locally I wondered if the
> repository data could be organized better, instead of linearly.
The reason it is organized linearly is so it can be
up-to-the-minute and fetched incrementally as soon as mail
arrives (or it is marked as spam).
However, I've been considering after-the-fact organization, too
(similar to how packing works in git). The goal there is not
to optimize search, but to improve storage efficiency:
1) purge spam messages from history
2) squash to reduce tree and commit objects
3) perhaps choose smarter filenames which can improve
packing heuristics
So I'm also strongly in favor of moving away from the 2/38
Message-ID naming scheme we currently use.
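For readers unfamiliar with the 2/38 scheme, here is a minimal sketch of
how such a path could be derived. It assumes the path is the SHA-1 hex
digest of the Message-ID (angle brackets stripped), split into a
2-character directory and a 38-character filename; the function name
mid_to_path is hypothetical and this is not the project's actual code.

```python
import hashlib


def mid_to_path(message_id: str) -> str:
    """Map a Message-ID to a 2/38 path.

    Assumption: the path is the SHA-1 hex digest of the Message-ID
    without its surrounding angle brackets.
    """
    mid = message_id.strip()
    if mid.startswith("<") and mid.endswith(">"):
        mid = mid[1:-1]
    digest = hashlib.sha1(mid.encode("utf-8")).hexdigest()
    # First 2 hex characters name the directory, the remaining 38 the file.
    return f"{digest[:2]}/{digest[2:]}"


path = mid_to_path("<20170823014239.GA4113@starla>")
print(path)
```

Because a hash-derived name carries no relation to the message content,
related messages end up with unrelated filenames, which is presumably why
smarter filenames could help the packing heuristics.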
> So what if the git history would reflect the parent relationships
> of the emails? Essentially each email is comparable to
> a topic branch in the git workflow (potentially with other
> series/email on top of it). Each topic would be merged
> to master immediately, such that the first parent master
> branch history consists of merges only; the second parent
> is the new ingested email, which is either a root-commit
> (when a new topic is started), or a commit building on top
> of another commit (which contains the email it is responding
> to; that other commit is merged to master already).
That may not work well because emails arrive out-of-order,
especially when many are sent in rapid succession with
"git send-email". I've had to make bugfixes to some of the
Perl+Xapian logic to deal with OOO message delivery, too :)
And having to map Message-ID to a particular commit would
require extra overhead to keep track of parents, no?
So, I think what you're describing already happens in the Perl
search code as every message gets assigned a thread_id when it
is indexed in Xapian. I suppose you still cannot look at AGPL-3
code (being a Googler), but I stole the logic from notmuch
(C++, GPL-3+, no 'A') circa 2015/2016(*). I believe mairix(**)
uses similar logic for mapping messages to thread IDs, too.
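To illustrate how thread IDs can be assigned at index time even when
mail arrives out of order, here is a rough sketch. It is not the
public-inbox or notmuch code; the class ThreadIndex, its add() method,
and the idea of recording placeholder entries for not-yet-seen parents
are all assumptions made for the example.

```python
from itertools import count


class ThreadIndex:
    """Toy thread_id assignment that tolerates out-of-order delivery."""

    def __init__(self):
        self._next_id = count(1)
        # Message-ID -> thread_id; also holds placeholder entries for
        # Message-IDs seen only in References/In-Reply-To so far.
        self.thread_of = {}

    def add(self, mid, refs=()):
        ids = [mid, *refs]
        # Thread ids already known for this message or anything it references.
        known = sorted({self.thread_of[m] for m in ids if m in self.thread_of})
        tid = known[0] if known else next(self._next_id)
        if len(known) > 1:
            # A late message links several threads: fold them into the oldest.
            merged = set(known[1:])
            for m, t in self.thread_of.items():
                if t in merged:
                    self.thread_of[m] = tid
        for m in ids:
            self.thread_of[m] = tid
        return tid


idx = ThreadIndex()
reply_tid = idx.add("reply@example", refs=["root@example"])  # reply arrives first
root_tid = idx.add("root@example")                           # parent arrives later
print(reply_tid == root_tid)
```

The point of the sketch is that nothing in the storage layer needs to
know about parents: the mapping lives entirely in the index and can be
rebuilt or corrected later.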
So the thread skeleton you see at the bottom of every message
page is done using a boolean thread_id search OR-ed with a
Subject search.
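As a sketch of that OR-ed lookup, the following uses a plain in-memory
list instead of Xapian. The Msg record, normalize_subject(), and
thread_skeleton() are hypothetical names, and matching on an exactly
equal normalized Subject is a simplification of whatever the real
Subject search does.

```python
import re
from dataclasses import dataclass


@dataclass
class Msg:
    mid: str
    thread_id: int
    subject: str


def normalize_subject(subject: str) -> str:
    # Drop any leading "Re:" prefixes and compare case-insensitively.
    return re.sub(r"^(\s*re:\s*)+", "", subject, flags=re.I).strip().lower()


def thread_skeleton(msgs, target):
    """Return messages sharing the thread_id OR the normalized Subject."""
    subj = normalize_subject(target.subject)
    return [
        m for m in msgs
        if m.thread_id == target.thread_id or normalize_subject(m.subject) == subj
    ]


msgs = [
    Msg("a@example", 1, "Nonlinear history?"),
    Msg("b@example", 1, "Re: Nonlinear history?"),
    Msg("c@example", 2, "Re: Nonlinear history?"),  # broken threading headers
    Msg("d@example", 3, "unrelated topic"),
]
skeleton = thread_skeleton(msgs, msgs[1])
print([m.mid for m in skeleton])
```

The Subject half of the OR catches replies whose In-Reply-To or
References headers were lost, at the cost of occasionally pulling in an
unrelated thread that happens to reuse the same Subject.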
In short, I would like to depend more on the search engine for
logic and keep that flexible; but continue to keep the (git)
storage layer "dumb". The smarts would be in Xapian, which can
be tuned and refined after-the-fact with minimal refetching.
And Xapian could also be swapped out for alternative search
engines, too (Groonga, maybe). I consider it similar in
philosophy to git itself w.r.t. storage optimization,
merge strategies, and rename detection.
> Has this idea come up before or even been discussed before?
Not exactly what you're asking, but I guess what I described is
similar to what we already do via Xapian.
> Thanks,
> Stefan
>
> [1] I really like the search by author feature!
> [2] https://public-inbox.org/git/20150604130902.GA12404@peff.net/
Thanks. FWIW, I did not know Xapian at all when I started this
project, but it's become one of my favorite things about it :)
And I stole the f:, tc:, s: and several other prefixes from
mairix.
(*) I've never used notmuch as I don't use Maildir for long-term
archival and I don't think notmuch ever supported anything else.
I also don't know C++, so maybe I interpreted wrong :x
(**) I still use mairix, but I'm considering starting a
separate project to replace it for my personal mail(***)
It may also be useful for prototyping future public-inbox
changes, too.
(***) For private emails, I want IMAP support + offline
memoization instead of my current mairix + offlineimap +
archive-old-mail-to-mboxrd script. I'd still want
to rely on git for message caching/memoization.
Thread overview: 5+ messages
2017-08-16 21:01 Nonlinear history? Stefan Beller
2017-08-23 1:42 ` Eric Wong [this message]
2017-08-23 18:29 ` Stefan Beller
2017-08-23 19:40 ` Eric Wong
2017-08-23 20:06 ` Jeff King