From: Eric Wong <e@80x24.org>
To: "W. Trevor King" <wking@tremily.us>
Cc: notmuch@notmuchmail.org, meta@public-inbox.org
Subject: Re: Mail archives in Git using ssoma
Date: Sun, 21 Aug 2016 21:14:55 +0000
Message-ID: <20160821211455.GA11841@starla>
In-Reply-To: <20160821202820.GC30347@odin.tremily.us>
"W. Trevor King" <wking@tremily.us> wrote:
> On Sun, Aug 21, 2016 at 06:37:04PM +0000, Eric Wong wrote:
> > Btw, for public-inbox, I'm using git-fast-import now, so imports are
> > a bit faster and $GIT_DIR/ssoma.index is no longer used. This was
> > crucial for getting git@vger archives imported in a reasonable time.
>
> ssoma-mda imports 22k notmuch messages in around 15 minutes (with
> profiling enabled), and:
In contrast, git@vger is around 300K messages. LKML is well
into the millions, and I hope public-inbox (and git!) can handle
that one day, even on cheap hardware (haven't tried).
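For a sense of scale, here is a rough linear extrapolation from the quoted figures (22k messages in about 15 minutes) to a git@vger-sized archive. It assumes throughput stays constant from the first message to the last, so treat the result as a lower bound at best.

```python
# Figures taken from the thread; the constant-rate assumption is illustrative.
measured_messages = 22000   # "22k notmuch messages"
measured_minutes = 15.0     # "around 15 minutes"
target_messages = 300000    # "git@vger is around 300K messages"

rate_per_minute = measured_messages / measured_minutes
estimate_hours = target_messages / rate_per_minute / 60

print(round(estimate_hours, 1))
```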
One problem I noticed with ssoma-mda is that it gets slower as
more messages get imported, since all those files sit in the
index, and the git index format is bad for incremental updates
with big, flat trees. Big trees are a general problem with git:
I'm now storing blob IDs directly in Xapian and will be
using them more to avoid tree lookups. Tree creation and
lookups degrade the same way the index does as trees
get bigger.
Currently public-inbox splits the SHA-1 2/38, like git loose
objects; a goal might be to move towards supporting 2/2/36
(or deeper) as Jeff noted substantial object traversal
improvements:
https://public-inbox.org/git/20160805092805.w3nwv2l6jkbuwlzf@sigill.intra.peff.net/
Of course, support for 2/38 will be retained for old
archives/messages.
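A minimal sketch of the two layouts, to make the difference concrete. The helper name `fanout_path` is hypothetical, and hashing the Message-ID to get the SHA-1 is an assumption made only for illustration.

```python
import hashlib


def fanout_path(hex_id, widths=(2,)):
    """Split a hex ID into directories of the given widths.

    Whatever is left after the last directory becomes the file name,
    so widths=(2,) gives 2/38 and widths=(2, 2) gives 2/2/36 for SHA-1.
    """
    parts = []
    pos = 0
    for width in widths:
        parts.append(hex_id[pos:pos + width])
        pos += width
    parts.append(hex_id[pos:])
    return "/".join(parts)


# Assumption: the ID is the SHA-1 of the Message-ID (illustrative only).
message_id = "20160821211455.GA11841@starla"
hex_id = hashlib.sha1(message_id.encode("utf-8")).hexdigest()

print(fanout_path(hex_id, (2,)))    # one directory level: 2/38
print(fanout_path(hex_id, (2, 2)))  # two directory levels: 2/2/36
```

With two levels, each tree holds at most 256 entries at the top and 256 below that, so any single tree object stays smaller as the archive grows.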
> $ python -m cProfile -o profile import.py notmuch.mbox
> $ python -c "import pstats; p=pstats.Stats('profile'); p.sort_stats('cumulative').print_stats(10)"
> Sun Aug 21 12:56:49 2016 profile
>
> 101823722 function calls (99078415 primitive calls) in 885.069 seconds
>
> Ordered by: cumulative time
> List reduced from 1145 to 10 due to restriction <10>
>
>    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
>      70/1    0.002    0.000  885.069  885.069 {built-in method exec}
>         1    0.111    0.111  885.069  885.069 /home/wking/src/notmuch/notmuch-archives.git/import.py:9(<module>)
>         1    0.400    0.400  884.915  884.915 /home/wking/src/notmuch/notmuch-archives.git/import.py:17(import_mbox)
>     22875    0.601    0.000  863.371    0.038 /home/wking/src/notmuch/notmuch-archives.git/ssoma_mda.py:362(deliver)
>     22875    8.943    0.000  810.459    0.035 /home/wking/src/notmuch/notmuch-archives.git/ssoma_mda.py:207(append)
>     22875    0.418    0.000  308.353    0.013 /home/wking/.local/lib64/python3.4/site-packages/pygit2/index.py:146(write_tree)
>     22875  307.855    0.013  307.855    0.013 {built-in method git_index_write_tree}
>     22874    0.575    0.000  279.293    0.012 /home/wking/.local/lib64/python3.4/site-packages/pygit2/index.py:238(diff_to_tree)
>     22874  278.501    0.012  278.501    0.012 {built-in method git_diff_tree_to_index}
It looks like writing the index is already the slowest step here in
terms of total time, too. It might be interesting if you
profiled each *-mda invocation to see the degradation from the
first to last message.
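One way to see that degradation is to time deliveries in batches rather than over the whole run. This is only a sketch: `time_batches` is a hypothetical helper, and `deliver` stands for any callable that performs a single delivery (its real signature in ssoma_mda.py is not shown in this thread).

```python
import time


def time_batches(messages, deliver, batch_size=1000, clock=time.perf_counter):
    """Deliver each message and record the average cost per batch.

    Returns a list of (messages_delivered_so_far, seconds_per_message).
    A rising second value across the list indicates degradation.
    """
    results = []
    start = clock()
    in_batch = 0
    total = 0
    for message in messages:
        deliver(message)
        total += 1
        in_batch += 1
        if in_batch == batch_size:
            now = clock()
            results.append((total, (now - start) / in_batch))
            start = now
            in_batch = 0
    if in_batch:
        # Final partial batch.
        results.append((total, (clock() - start) / in_batch))
    return results
```

Comparing the first and last entries shows how much slower a delivery is once the repository is large. Note that this times in-process calls, so it leaves out per-invocation process startup.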
>     22875    0.088    0.000   80.413    0.004 /home/wking/.local/lib64/python3.4/site-packages/pygit2/index.py:99(read)
>
> 38 ms per ssoma delivery is probably fast enough, especially if you
Not even close for me :)
> are invoking ssoma-mda once per message, since process setup will
> take a similar amount of time:
>
> $ time python -c 'print("hello")'
> hello
>
> real 0m0.016s
> user 0m0.013s
> sys 0m0.003s
>
> It's possible that fast-import would shave a few ms off the pygit2
> addition (I'm not sure, and maybe pygit2 is faster than fast-import).
> But I doubt it matters enough either way to be worth changing unless
> you are dealing with a really large corpus.
One key feature is fast-import avoids writing an index entirely.
I think pygit2 would have to learn that, too.
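To illustrate what "no index" means in practice, here is a sketch that builds a git-fast-import command stream adding one file in one commit. The function name, ref, path, committer and timestamp are all made up for the example; this is not the code public-inbox uses.

```python
def fast_import_commit(ref, path, content, subject, committer, when, mark):
    """Return fast-import commands (bytes) that add one blob in one commit.

    content is the raw file body as bytes; when is seconds since the
    epoch, written with a fixed +0000 offset to keep the sketch short.
    """
    def line(text):
        return text.encode("utf-8")

    message = subject.encode("utf-8")
    return b"".join([
        line("blob\n"),
        line("mark :%d\n" % mark),
        line("data %d\n" % len(content)), content, b"\n",
        line("commit %s\n" % ref),
        line("committer %s %d +0000\n" % (committer, when)),
        line("data %d\n" % len(message)), message, b"\n",
        # Attach the blob by mark; no index file is read or written.
        line("M 100644 :%d %s\n" % (mark, path)),
        b"\n",
    ])


# Hypothetical values, for illustration only.
stream = fast_import_commit(
    "refs/heads/master", "aa/" + "b" * 38, b"hello",
    "example subject", "Example <user@example.com>", 1471800000, 1)
```

The resulting bytes could be written to the standard input of `git fast-import` run inside a repository. Many such commits can go through one long-running process, which is where the savings over one index write per message would come from.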