unofficial mirror of notmuch@notmuchmail.org
* Mail archives in Git using ssoma
@ 2014-11-07 19:03 W. Trevor King
  2016-08-21  4:36 ` W. Trevor King
  0 siblings, 1 reply; 10+ messages in thread
From: W. Trevor King @ 2014-11-07 19:03 UTC (permalink / raw)
  To: notmuch; +Cc: Eric Wong

[-- Attachment #1: Type: text/plain, Size: 7808 bytes --]

Hello everyone :),

I like Git, so when folks suggest storing things in Git, I'm usually
excited ;).  Eric Wong has been working on some tools to store email
in a Git repository, and his client-side code is ssoma [1].  I wanted
a bit more metadata than the stock ssoma-mda [2], and ended up just
writing a ssoma-mda in Python [3].  It needs Python ≥3.4 and pygit2.
I had pygit2 already installed for Python 3.3 (which gave me a local
libgit2), so I used pip to install it for 3.4:

  $ python3.4 -m ensurepip --user
  $ pip3.4 install --user pygit2

Then I grabbed the archives, and pulled them into Git:

  $ wget http://notmuchmail.org/archives/notmuch.mbox
  $ git init --bare notmuch-archives.git
  $ cd notmuch-archives.git
  $ python3.4
  >>> import email.utils
  >>> import mailbox
  >>> import ssoma_mda
  >>> mbox = mailbox.mbox('../notmuch.mbox', factory=None, create=False)
  >>> messages = sorted(mbox, key=lambda m: email.utils.mktime_tz(email.utils.parsedate_tz(m['date'])))
  >>> for message in messages:
  ...     if ((message['message-id'] == '<m2k4gmyjer.fsf@ecocode.net>' and
  ...             message['X-List-Received-Date'] == 'Sat, 26 Feb 2011 14:23:34 -0000') or
  ...           (message['message-id'] == '<4EDF728E.3050204@gmail.com>' and
  ...             message['X-List-Received-Date'] == 'Wed, 07 Dec 2011 14:05:16 -0000') or
  ...           (message['message-id'] == '<4FE369F2.5080804@gmail.com>' and
  ...             message['X-List-Received-Date'] == 'Thu, 21 Jun 2012 18:38:07 -0000') or
  ...           (message['message-id'] == '<5122353D.4060601@gmail.com>' and
  ...             message['X-List-Received-Date'] == 'Mon, 18 Feb 2013 14:06:12 -0000') or
  ...           (message['message-id'] == '<CA+eQo_1hMsTD4+6ifqgEQXW0_qYXGOdfkO6tBuGQKV+W7OSaKA@mail.gmail.com>' and
  ...             message['X-List-Received-Date'] == 'Wed, 24 Apr 2013 18:09:55 -0000') or
  ...           (message['message-id'] == '<527B9E8C.5000001@krugs.de>' and
  ...             message['X-List-Received-Date'] == 'Thu, 07 Nov 2013 14:07:32 -0000') or
  ...           (message['message-id'] == '<1399645162-8653-1-git-send-email-wael.nasreddine@gmail.com>' and
  ...             message['X-List-Received-Date'] == 'Fri, 09 May 2014 14:19:36 -0000') or
  ...           (message['message-id'] == '<m2mw9xkyvg.fsf@krugs.de>' and
  ...             message['X-List-Received-Date'] == 'Thu, 18 Sep 2014 10:27:35 -0000') or
  ...           (message['message-id'] == '<cover.1411379395.git.jani@nikula.org>' and
  ...             message['X-List-Received-Date'] != 'Mon, 22 Sep 2014 09:54:16 -0000')):
  ...         continue
  ...     ssoma_mda.deliver(message=message, once=True)
  >>> ^D
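
The long boolean in the session above can also be written as data.
This is only a sketch under the same assumptions as that session: the
pairs are copied from it, and the names DROP, KEEP_ONLY, and
should_skip are illustrative, not part of ssoma_mda.

```python
import email.message

# (Message-ID, X-List-Received-Date) pairs to drop, copied from the
# interactive session above.
DROP = {
    ('<m2k4gmyjer.fsf@ecocode.net>', 'Sat, 26 Feb 2011 14:23:34 -0000'),
    ('<4EDF728E.3050204@gmail.com>', 'Wed, 07 Dec 2011 14:05:16 -0000'),
    ('<4FE369F2.5080804@gmail.com>', 'Thu, 21 Jun 2012 18:38:07 -0000'),
    ('<5122353D.4060601@gmail.com>', 'Mon, 18 Feb 2013 14:06:12 -0000'),
    ('<CA+eQo_1hMsTD4+6ifqgEQXW0_qYXGOdfkO6tBuGQKV+W7OSaKA@mail.gmail.com>',
     'Wed, 24 Apr 2013 18:09:55 -0000'),
    ('<527B9E8C.5000001@krugs.de>', 'Thu, 07 Nov 2013 14:07:32 -0000'),
    ('<1399645162-8653-1-git-send-email-wael.nasreddine@gmail.com>',
     'Fri, 09 May 2014 14:19:36 -0000'),
    ('<m2mw9xkyvg.fsf@krugs.de>', 'Thu, 18 Sep 2014 10:27:35 -0000'),
}

# Message-IDs with more than two copies: keep only the copy with this
# received date and drop every other one.
KEEP_ONLY = {
    '<cover.1411379395.git.jani@nikula.org>':
        'Mon, 22 Sep 2014 09:54:16 -0000',
}


def should_skip(message):
    """Return True if this copy of the message should not be delivered."""
    mid = message['message-id']
    received = message['X-List-Received-Date']
    if (mid, received) in DROP:
        return True
    keep = KEEP_ONLY.get(mid)
    return keep is not None and received != keep
```

The delivery loop then reduces to "if should_skip(message): continue"
followed by the same ssoma_mda.deliver() call.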

On my 1.1GHz Intel Celeron 847 Sandy Bridge netbook, that took about
half an hour.  The initial repository was large:

  $ du -hs .
  394M    .

But packing it up made it small:

  $ git gc --aggressive
  $ du -hs .
  51M     .

With a few fewer messages than the mbox:

  $ git log --oneline | wc -l
  19650

Compared with 19660 messages in the mbox at 107 MB (160 MB for the
associated Maildir).

The messages I dropped removed duplicate Message-IDs:

* id:m2k4gmyjer.fsf@ecocode.net had different received dates:

    -X-List-Received-Date: Sat, 26 Feb 2011 14:12:20 -0000
    +X-List-Received-Date: Sat, 26 Feb 2011 14:23:34 -0000

  but no significant differences.

* id:4EDF728E.3050204@gmail.com had a real address in the
  first-to-arrive version:

    -X-List-Received-Date: Wed, 07 Dec 2011 14:10:13 -0000
    -> <4winter@informatik.uni-hamburg.de>

  and an obfuscated one in the second-to-arrive version:

    +X-List-Received-Date: Wed, 07 Dec 2011 14:05:16 -0000
    +> <4winter-jNDFPZUTrfQBEfOqpokbeYV0Y/DQsy6Ps0AfqQuZ5sE@public.gmane.org>

* id:4FE369F2.5080804@gmail.com had the same:

    -X-List-Received-Date: Thu, 21 Jun 2012 18:37:54 -0000
    -> <R.M.Krug@gmail.com

    +X-List-Received-Date: Thu, 21 Jun 2012 18:38:07 -0000
    -> <mailto:R.M.Krug@gmail.com>> wrote:

* id:5122353D.4060601@gmail.com had different received dates:

    -X-List-Received-Date: Mon, 18 Feb 2013 14:06:05 -0000
    +X-List-Received-Date: Mon, 18 Feb 2013 14:06:12 -0000

  but no significant differences.

* id:CA+eQo_1hMsTD4+6ifqgEQXW0_qYXGOdfkO6tBuGQKV+W7OSaKA@mail.gmail.com
  had different MIME boundaries:

    -Content-Type: multipart/alternative; boundary=f46d043be11ac45a0904db1f3428
    -X-List-Received-Date: Wed, 24 Apr 2013 18:09:46 -0000

    +Content-Type: multipart/alternative; boundary=e89a8f646ff3faa11d04db1f3294
    +X-List-Received-Date: Wed, 24 Apr 2013 18:09:55 -0000

  but no significant differences.

* id:527B9E8C.5000001@krugs.de had obfuscated addresses:

    -X-List-Received-Date: Thu, 07 Nov 2013 14:07:33 -0000
    -> Rainer M Krug <Rainer@krugs.de> writes:

    +X-List-Received-Date: Thu, 07 Nov 2013 14:07:32 -0000
    +> Rainer M Krug <Rainer-vfylz/Ys1k4@public.gmane.org> writes:

* id:1399645162-8653-1-git-send-email-wael.nasreddine@gmail.com had
  additional content in the later submission:

    -Subject: [PATCH] Add Travis-CI config file.
    -Date: Fri,  9 May 2014 07:19:22 -0700
    -X-List-Received-Date: Fri, 09 May 2014 14:19:36 -0000
    - .travis.yml | 10 ++++++++++
    - 1 file changed, 10 insertions(+)

    +Subject: [PATCH v2] Enable Travis-CI as a backup continuous integration
    +       service.
    +Date: Fri,  9 May 2014 14:44:50 -0700
    +X-List-Received-Date: Fri, 09 May 2014 21:45:16 -0000
    +
    +The v2 adds a notification section to send failure (or back to passing) notifications
    +to the mailing list and to the IRC channel
    +
    + .travis.yml | 13 +++++++++++++
    + 1 file changed, 13 insertions(+)

* id:m2mw9xkyvg.fsf@krugs.de had an obfuscated address and different signature:

    -X-List-Received-Date: Thu, 18 Sep 2014 10:27:31 -0000
    ->> guyzmo <guyzmo@m0g.net> writes:
     -----BEGIN PGP SIGNATURE-----
     Version: GnuPG/MacGPG2 v2.0.22 (Darwin)
    -iQEcBAEBAgAGBQJUGrN3AAoJENvXNx4PUvmC4J0IAN9Wf+0ArvirJCoewItnEZoo
    -ySg4VRP7uWVqDxHVl5N9XFv4YE2bZ2E2eMGvbo6v7I82lhqeR5dauZhlgCMki+ZI

    +X-List-Received-Date: Thu, 18 Sep 2014 10:27:35 -0000
    +>> guyzmo <guyzmo-kMjww5mZloE@public.gmane.org> writes:
     -----BEGIN PGP SIGNATURE-----
     Version: GnuPG/MacGPG2 v2.0.22 (Darwin)
    +iQEcBAEBAgAGBQJUGrN4AAoJENvXNx4PUvmC6LsIAIaFrd4MFnm8EixrAHPGfW6j
    +L3KNG7Dv+hQuNRUN6qn+emZHI8wX4O74HOZOpZWkE09CmjkPJBmf7IuJwtz2ONbM

* id:cover.1411379395.git.jani@nikula.org came in three times, with
  three dates, but no significant differences:

    Date: Mon, 22 Sep 2014 11:54:20 +0200
    X-List-Received-Date: Mon, 22 Sep 2014 09:54:16 -0000

    Date: Mon, 22 Sep 2014 11:54:42 +0200
    X-List-Received-Date: Mon, 22 Sep 2014 09:54:37 -0000

    Date: Mon, 22 Sep 2014 11:54:51 +0200
    X-List-Received-Date: Mon, 22 Sep 2014 09:54:49 -0000
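
Duplicates like the ones listed above can be found by grouping
messages on Message-ID.  A minimal sketch, not necessarily how the
list above was produced; duplicate_message_ids is a hypothetical
helper name:

```python
import collections
import email.message


def duplicate_message_ids(messages):
    """Map each repeated Message-ID to the received dates of its copies."""
    dates = collections.defaultdict(list)
    for message in messages:
        dates[message['message-id']].append(message['X-List-Received-Date'])
    # Keep only the Message-IDs that were seen more than once.
    return {mid: seen for mid, seen in dates.items() if len(seen) > 1}
```

Passing it the mailbox.mbox object from the earlier session should
list every Message-ID that needs a decision before delivery.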

Anyhow, I've pushed the Git archive [4,5] if anyone wants to play
around with ssoma.  I think this would be a nice backend for folks
building notmuch-based web archives, and pulling from Git is easier
than downloading a new mbox ;).

Cheers,
Trevor

[1]: http://ssoma.public-inbox.org/README
[2]: http://public-inbox.org/meta/m/ec8f54cf6451eef6e9f59eff691cd9002f4fdf65.html
[3]: http://git.tremily.us/?p=ssoma-mda.git;a=shortlog;h=refs/heads/python
     I have an uncommitted patch to work around http://bugs.python.org/issue22684
[4]: http://git.tremily.us/?p=notmuch-archives.git
[5]: git://tremily.us/notmuch-archives.git

-- 
This email may be signed or encrypted with GnuPG (http://www.gnupg.org).
For more information, see http://en.wikipedia.org/wiki/Pretty_Good_Privacy

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Mail archives in Git using ssoma
  2014-11-07 19:03 Mail archives in Git using ssoma W. Trevor King
@ 2016-08-21  4:36 ` W. Trevor King
  2016-08-21  9:48   ` Mail archives in Git using ssoma (Docker image) W. Trevor King
  2016-08-21 18:37   ` Mail archives in Git using ssoma Eric Wong
  0 siblings, 2 replies; 10+ messages in thread
From: W. Trevor King @ 2016-08-21  4:36 UTC (permalink / raw)
  To: notmuch; +Cc: Eric Wong

[-- Attachment #1: Type: text/plain, Size: 1549 bytes --]

On Fri, Nov 07, 2014 at 11:03:21AM -0800, W. Trevor King wrote:
> Eric Wong has been working on some tools to store email in a Git
> repository, and his client-side code is ssoma [1].  I wanted a bit
> more metadata than the stock ssoma-mda [2], and ended up just
> writing a ssoma-mda in Python [3]…
>
> Then I grabbed the archives, and pulled them into Git:
> …
> The messages I dropped removed duplicate Message-IDs:
> …

ssoma and public-inbox came up recently (with the end of Gmane) in
[1].  I've brought my archives [2] up to speed with a fresh mbox
downloaded today [3].  Beyond the ignored messages mentioned in my
initial email, I had to ignore:

* id:67EEA3E1-918F-47AE-8AD7-EF0A5923D800@m0g.net

  Which had different headers up through:

  -X-List-Received-Date: Wed, 06 Jan 2016 15:49:49 -0000
  +X-List-Received-Date: Wed, 06 Jan 2016 15:50:34 -0000

  but the same body in both instances.

I also had to remove two control characters:

  $ tr -d '\034' <notmuch.mbox >notmuch-fixed.mbox

to get the mbox into a format that Python could parse without errors.
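
The same cleanup can be done in Python before parsing.  A small
sketch; the function names are made up for illustration, and '\034'
in the tr call is octal for byte 0x1c:

```python
FILE_SEPARATOR = b'\x1c'  # octal 034, the byte deleted by tr above


def count_file_separators(data):
    """Return how many 0x1c bytes the raw mbox data contains."""
    return data.count(FILE_SEPARATOR)


def strip_file_separators(data):
    """Return the raw mbox data with every 0x1c byte removed."""
    return data.replace(FILE_SEPARATOR, b'')
```

Reading notmuch.mbox in binary mode, passing the bytes through
strip_file_separators(), and writing the result to notmuch-fixed.mbox
should give the same file as the tr invocation.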

I've pushed the mbox → ssoma(ish) import script to the ‘import’ branch
of [2] if folks want to play around.

Cheers,
Trevor

[1]: id:20160820062931.GY30347@odin.tremily.us
[2]: git://tremily.us/notmuch-archives.git
[3]: http://notmuchmail.org/archives/notmuch.mbox

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Mail archives in Git using ssoma (Docker image)
  2016-08-21  4:36 ` W. Trevor King
@ 2016-08-21  9:48   ` W. Trevor King
       [not found]     ` <20160821120852.GA12964@dcvr>
  2016-08-21 18:37   ` Mail archives in Git using ssoma Eric Wong
  1 sibling, 1 reply; 10+ messages in thread
From: W. Trevor King @ 2016-08-21  9:48 UTC (permalink / raw)
  To: notmuch; +Cc: David Bremner, Steven Allen, Tomi Ollila, Carl Worth, Eric Wong

[-- Attachment #1: Type: text/plain, Size: 2208 bytes --]

On Sat, Aug 20, 2016 at 09:36:31PM -0700, W. Trevor King wrote:
> [2]: git://tremily.us/notmuch-archives.git

This is the ssoma archive (with the data in it).  I just set up a
basic HTTP archive (following [1]) based on a Docker image [2] (Gentoo
doesn't package all the Perl dependencies public-inbox needs).
Dockerfile for rebuilding the image is in [2].  I'm currently hosting
the archives (HTTP only) at [3].  Spinning that up from the Docker
image looks like:

  $ mkdir srv
  $ git clone --bare git://tremily.us/notmuch-archives.git srv/notmuch.git
  $ echo 'Notmuch -- Just an email system' >srv/notmuch.git/description
  $ git config -f srv/notmuch.git/config publicinbox.http http://tremily.us
  $ git config -f srv/notmuch.git/config publicinbox.email notmuch@notmuchmail.org
  $ docker run --name notmuch-archives -d -p 80:8080 -v ${PWD}/srv/:/srv/ wking/public-inbox

(although I'm using -p ###:8080 and have an Nginx reverse-proxy in
front).  It's not updating automatically yet, but that will probably
look like:

1. Pull new mbox [4].
2. Import into notmuch-archives [5].
3. Re-run public-inbox-index (this could probably be via ‘docker exec …’).

But I'll have to test that to confirm.  And ideally we'd be using
ssoma-mda or similar directly, instead of going through mbox, but I'd
rather get the official headers on the stored mail than be efficient
;).
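
The three steps above might be scripted roughly as follows.  This is
an untested sketch: the import.py invocation, the container name, and
the /srv/notmuch.git path inside the container are assumptions pieced
together from this thread, and update_commands and run_update are
hypothetical names.

```python
import subprocess

# URL [4] from this message.
MBOX_URL = 'https://notmuchmail.org/archives/notmuch.mbox'


def update_commands(container='notmuch-archives', inbox='/srv/notmuch.git'):
    """Return the three update steps as argv lists (paths are assumed)."""
    return [
        # 1. Pull the new mbox.
        ['wget', '-O', 'notmuch.mbox', MBOX_URL],
        # 2. Import it into the Git archive (assumed script invocation).
        ['python3', 'import.py', 'notmuch.mbox'],
        # 3. Re-index inside the running container.
        ['docker', 'exec', container, 'public-inbox-index', inbox],
    ]


def run_update(run=subprocess.check_call):
    """Run each step in order; check_call raises on the first failure."""
    for command in update_commands():
        run(command)
```

Passing a different "run" callable (for example a list's append
method) gives a dry run that only collects the commands.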

One shift from Gmane's mid.gmane.org/… is that the public-inbox UI
Message-ID lookup is per-bucket, and public-inbox seems to be
encouraging per-list buckets.

And while I feel like I had a good grasp of the ssoma format two years
ago, I know very little about Perl and public-inbox.  I'm sure you
could set up a public-inbox host that is more efficient than what's
currently in my Docker image.

Cheers,
Trevor

[1]: http://public-inbox.org/INSTALL
[2]: https://hub.docker.com/r/wking/public-inbox/
[3]: http://tremily.us/notmuch/
[4]: https://notmuchmail.org/archives/notmuch.mbox
[5]: id:20160821043631.GA2338@odin.tremily.us

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Mail archives in Git using ssoma (Docker image)
       [not found]     ` <20160821120852.GA12964@dcvr>
@ 2016-08-21 13:49       ` David Bremner
  2016-09-07  1:23         ` Eric Wong
  2016-08-21 17:36       ` W. Trevor King
  1 sibling, 1 reply; 10+ messages in thread
From: David Bremner @ 2016-08-21 13:49 UTC (permalink / raw)
  To: W. Trevor King; +Cc: notmuch

Eric Wong <e@80x24.org> writes:


> For mirroring existing lists, I started using public-inbox-watch
> which currently watches Maildirs.  The config knobs are sorta
> documented from my announcement to git@vger:
>
> https://public-inbox.org/git/20160710004813.GA20210@dcvr.yhbt.net/
> http://hjrcffqmbrq6wope.onion/git/20160710004813.GA20210@dcvr.yhbt.net/
>
> Initial import (w/o spamassassin) was done with
> scripts/import_vger_from_mbox in the source:
>
>         torsocks git clone http://hjrcffqmbrq6wope.onion/public-inbox
>         git clone https://public-inbox.org/ public-inbox
>         git clone git://repo.or.cz/public-inbox
>

FWIW, I already have a Maildir with a complete (and updated) archive of the list (and
only that) for use of nmbug. So at the risk of putting all eggs in one
basket, perhaps public-inbox-watch could watch that maildir.

d

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Mail archives in Git using ssoma (Docker image)
       [not found]     ` <20160821120852.GA12964@dcvr>
  2016-08-21 13:49       ` David Bremner
@ 2016-08-21 17:36       ` W. Trevor King
  2016-08-21 18:28         ` Eric Wong
  1 sibling, 1 reply; 10+ messages in thread
From: W. Trevor King @ 2016-08-21 17:36 UTC (permalink / raw)
  To: Eric Wong
  Cc: notmuch, David Bremner, Steven Allen, Tomi Ollila, Carl Worth,
	meta

[-- Attachment #1: Type: text/plain, Size: 4424 bytes --]

On Sun, Aug 21, 2016 at 12:08:52PM +0000, Eric Wong wrote:
> "W. Trevor King" <wking@tremily.us> wrote:
> > This is the ssoma archive (with the data in it).  I just set up a
> > basic HTTP archive (following [1]) based on a Docker image [2] (Gentoo
> > doesn't package all the Perl dependencies public-inbox needs).
> 
> Ugh, that sucks (sorry, not a fan of Docker).
> 
> What's missing from Gentoo?

Gentoo doesn't package (or I couldn't find the package for)
Encode::MIME::Header or Mail::Thread.  I tried installing things from
CPAN, but ran into a compile-time error from the ‘cpan’ invocation and
gave up ;).  I can try and reproduce the error if you're curious, but
I don't have it handy at the moment.

> >   $ git config -f srv/notmuch.git/config publicinbox.http http://tremily.us
> >   $ git config -f srv/notmuch.git/config publicinbox.email notmuch@notmuchmail.org
> 
> That should probably be:
> 
> 	; based on your [3]
> 	git config -f srv/notmuch.git/config \
> 		publicinbox.notmuch.url http://tremily.us/notmuch
> 
> 	git config -f srv/notmuch.git/config \
> 		publicinbox.notmuch.address notmuch@notmuchmail.org
> 
> 	; this is crucial for all the public-inbox-* tools
> 	git config -f srv/notmuch.git/config \
> 		publicinbox.notmuch.mainrepo /path/to/notmuch.git

I was using these in the Dockerfile's CMD:

  (cd /srv;
   for NAME in *;
   do
     CONF="/srv/${NAME}/config";
     public-inbox-init "${NAME}" "/srv/${NAME}" $(git config -f "${CONF}" publicinbox.http) $(git config -f "${CONF}" publicinbox.email);
   done) && …

Are you saying that I can skip the ~/.public-inbox/config entries
set up by public-inbox-init if I set publicinbox.{name}.* in the ssoma
repository's config?  That would be nice.

I don't see a point to having {name} in ssoma-config settings though,
since you're already in a single bucket by that point (using
publicinbox.{name}.* makes sense in the multi-bucket
~/.public-inbox/config).

> > It's not updating automatically yet, but that will probably look
> > like:
> > 
> > 1. Pull new mbox [4].
> > 2. Import into notmuch-archives [5].
> > 3. Re-run public-inbox-index (this could probably be via ‘docker exec …’).
> > 
> > But I'll have to test that to confirm.  And ideally we'd be using
> > ssoma-mda or similar directly, instead of going through mbox, but I'd
> > rather get the official headers on the stored mail than be efficient
> > ;).
> 
> For mirroring existing lists, I started using public-inbox-watch
> which currently watches Maildirs.

If I had a Maildir locally, I'd just use procmail and push new
messages into ssoma-mda.  I'm using the import script because my local
mail has “how we delivered this to Trevor” headers (which I don't want
to add) but the downloaded mbox has “how we delivered this to
notmuch@notmuchmail.org” (which seems like a better fit for a shared
ssoma repo).

> I recommend public-inbox-watch for mirroring existing lists (such as
> what I did with git@vger) but public-inbox-mda for self-hosted lists
> (such as meta@public-inbox.org).

Why is that?  Procmail + public-inbox-mda (or my Python ssoma-mda fork
[1,2]) seems simpler and equally effective if you want to insert a
message that your mail system is delivering locally.

> > One shift from Gmane's mid.gmane.org/… is that the public-inbox UI
> > Message-ID lookup is per-bucket, and public-inbox seems to be
> > encouraging per-list buckets.
> 
> The public-inbox-nntpd interface supports mid lookups across all
> inboxes in that instance; so it should be doable in the WWW
> interface, too.  Either way, I think it has to be O(n) where (n) is
> the number of Xapian DBs, though.

I'm more concerned about the interface, and less about the
implementation (which can be improved later).  The (n) lookups are
trivially parallelizable, and you can always add a Message-ID →
buckets lookup table if (n) lookups turns out to be too slow.
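
Both ideas fit in a few lines.  In this sketch plain Python sets stand
in for the per-bucket Xapian DBs, and find_buckets and
build_lookup_table are hypothetical names, not public-inbox APIs:

```python
import collections
import concurrent.futures


def find_buckets(contents, message_id):
    """Check every bucket in parallel and return the names that match."""
    def holds(item):
        name, message_ids = item
        return name if message_id in message_ids else None

    with concurrent.futures.ThreadPoolExecutor() as pool:
        hits = list(pool.map(holds, contents.items()))
    return sorted(name for name in hits if name is not None)


def build_lookup_table(contents):
    """Precompute Message-ID -> set of bucket names for O(1) lookups."""
    table = collections.defaultdict(set)
    for name, message_ids in contents.items():
        for message_id in message_ids:
            table[message_id].add(name)
    return dict(table)
```

The table trades extra storage, and an update on every delivery, for
a single lookup per query.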

Cheers,
Trevor

[1]: id:20141107190321.GL23609@odin.tremily.us
[2]: id:af679af8257e250ac606e35a1307ad02907b8426.1413663212.git.wking@tremily.us
     http://public-inbox.org/meta/af679af8257e250ac606e35a1307ad02907b8426.1413663212.git.wking@tremily.us/t/#u

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Mail archives in Git using ssoma (Docker image)
  2016-08-21 17:36       ` W. Trevor King
@ 2016-08-21 18:28         ` Eric Wong
  0 siblings, 0 replies; 10+ messages in thread
From: Eric Wong @ 2016-08-21 18:28 UTC (permalink / raw)
  To: W. Trevor King
  Cc: notmuch, David Bremner, Steven Allen, Tomi Ollila, Carl Worth,
	meta

"W. Trevor King" <wking@tremily.us> wrote:
> On Sun, Aug 21, 2016 at 12:08:52PM +0000, Eric Wong wrote:
> > "W. Trevor King" <wking@tremily.us> wrote:
> > > This is the ssoma archive (with the data in it).  I just set up a
> > > basic HTTP archive (following [1]) based on a Docker image [2] (Gentoo
> > > doesn't package all the Perl dependencies public-inbox needs).
> > 
> > Ugh, that sucks (sorry, not a fan of Docker).
> > 
> > What's missing from Gentoo?
> 
> Gentoo doesn't package (or I couldn't find the package for)
> Encode::MIME::Header or Mail::Thread.  I tried installing things from
> CPAN, but ran into a compile-time error from the ‘cpan’ invocation and
> gave up ;).  I can try and reproduce the error if you're curious, but
> I don't have it handy at the moment.

Encode::MIME::Header is distributed with perl itself on Debian and also
the stock upstream install.  Not sure if there's an option you missed or
disabled.

Which perl version do you use?

Perl 5.14 on Debian wheezy even seems to have it.  I actually
still want everything to work on 5.8, since that seems to be
the de-facto baseline in the wild.


Mail::Thread is one .pm, and I'll probably replace it with
something (same algorithm) which can use half the memory by
avoiding wrapper object abstractions (it's probably the biggest
memory hog at the moment).

lib/PublicInbox/Thread.pm already has 3 monkey patches to work around
upstream bugs in Mail::Thread.  It's dead upstream, and not available on
FreeBSD, either.

> > >   $ git config -f srv/notmuch.git/config publicinbox.http http://tremily.us
> > >   $ git config -f srv/notmuch.git/config publicinbox.email notmuch@notmuchmail.org
> > 
> > That should probably be:
> > 
> > 	; based on your [3]
> > 	git config -f srv/notmuch.git/config \
> > 		publicinbox.notmuch.url http://tremily.us/notmuch
> > 
> > 	git config -f srv/notmuch.git/config \
> > 		publicinbox.notmuch.address notmuch@notmuchmail.org
> > 
> > 	; this is crucial for all the public-inbox-* tools
> > 	git config -f srv/notmuch.git/config \
> > 		publicinbox.notmuch.mainrepo /path/to/notmuch.git
> 
> I was using these in the Dockerfile's CMD:
> 
>   (cd /srv;
>    for NAME in *;
>    do
>      CONF="/srv/${NAME}/config";
>      public-inbox-init "${NAME}" "/srv/${NAME}" $(git config -f "${CONF}" publicinbox.http) $(git config -f "${CONF}" publicinbox.email);
>    done) && …
> 
> Are you saying that I can skip the ~/.public-inbox/config entries
> set up by public-inbox-init if I set publicinbox.{name}.* in the ssoma
> repository's config?  That would be nice.

Erm, sorry, no, I mean ~/.public-inbox/config as the "git config -f"
arg in the above commands.  Your original config was
meaningless in the context of public-inbox itself; I don't
recall public-inbox relying on $GIT_DIR/config much (if at all)
outside of standard git things.

Using ~/.public-inbox/config is required for multi-inbox lookups
(since you normally run MDA w/o args).

You can also override ~/.public-inbox/config by setting the
PI_CONFIG env (like GIT_CONFIG).
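
For a script that needs to find the same file, that precedence can be
sketched like this (public_inbox_config_path is a hypothetical helper,
and how public-inbox itself treats an empty PI_CONFIG is not covered
here):

```python
import os


def public_inbox_config_path(environ=None):
    """Return PI_CONFIG if set, else the default ~/.public-inbox/config."""
    environ = os.environ if environ is None else environ
    default = os.path.expanduser('~/.public-inbox/config')
    return environ.get('PI_CONFIG', default)
```

The result can then be passed as the "git config -f" argument in the
commands above.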

> I don't see a point to having {name} in ssoma-config settings though,
> since you're already in a single bucket by that point (using
> publicinbox.{name}.* makes sense in the multi-bucket
> ~/.public-inbox/config).
> 
> > > It's not updating automatically yet, but that will probably look
> > > like:
> > > 
> > > 1. Pull new mbox [4].
> > > 2. Import into notmuch-archives [5].
> > > 3. Re-run public-inbox-index (this could probably be via ‘docker exec …’).
> > > 
> > > But I'll have to test that to confirm.  And ideally we'd be using
> > > ssoma-mda or similar directly, instead of going through mbox, but I'd
> > > rather get the official headers on the stored mail than be efficient
> > > ;).
> > 
> > For mirroring existing lists, I started using public-inbox-watch
> > which currently watches Maildirs.
> 
> If I had a Maildir locally, I'd just use procmail and push new
> messages into ssoma-mda.  I'm using the import script because my local
> mail has “how we delivered this to Trevor” headers (which I don't want
> to add) but the downloaded mbox has “how we delivered this to
> notmuch@notmuchmail.org” (which seems like a better fit for a shared
> ssoma repo).

I don't mind extra/different headers.   The majority of messages in
public-inbox.org/git/ were delivered to gmane; recent ones are
delivered to me, and some holes were filled in by Jeff King's
archives.  All of our mail systems add different headers.

> > I recommend public-inbox-watch for mirroring existing lists (such as
> > what I did with git@vger) but public-inbox-mda for self-hosted lists
> > (such as meta@public-inbox.org).
> 
> Why is that?  Procmail + public-inbox-mda (or my Python ssoma-mda fork
> [1,2]) seems simpler and equally effective if you want to insert a
> message that your mail system is delivering locally.

-watch is usable for importing big archives or bursts of traffic
since it doesn't have to reload Perl/python on every mail (this
is probably not a problem for notmuch; but is for vger lists).
The defaults are also less-opinionated so it won't reject
attachments that passed through the list server.

You can also add messages to the watched Maildir with your MUA (in
case you missed some earlier, and got them from another user or
archive).

There's also a Filter interface I added (see
lib/PublicInbox/Filter/Vger.pm as an example) for dropping
list trailers before SA, so those trailers don't influence
Bayes, but you can do that in the MDA stage, too.


But *-mda is fine, too :>

> [1]: id:20141107190321.GL23609@odin.tremily.us
> [2]: id:af679af8257e250ac606e35a1307ad02907b8426.1413663212.git.wking@tremily.us
>      http://public-inbox.org/meta/af679af8257e250ac606e35a1307ad02907b8426.1413663212.git.wking@tremily.us/t/#u

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Mail archives in Git using ssoma
  2016-08-21  4:36 ` W. Trevor King
  2016-08-21  9:48   ` Mail archives in Git using ssoma (Docker image) W. Trevor King
@ 2016-08-21 18:37   ` Eric Wong
  2016-08-21 20:28     ` W. Trevor King
  1 sibling, 1 reply; 10+ messages in thread
From: Eric Wong @ 2016-08-21 18:37 UTC (permalink / raw)
  To: W. Trevor King; +Cc: notmuch

"W. Trevor King" <wking@tremily.us> wrote:
> On Fri, Nov 07, 2014 at 11:03:21AM -0800, W. Trevor King wrote:
> > Eric Wong has been working on some tools to store email in a Git
> > repository, and his client-side code is ssoma [1].  I wanted a bit
> > more metadata than the stock ssoma-mda [2], and ended up just
> > writing a ssoma-mda in Python [3]…

Btw, for public-inbox, I'm using git-fast-import now, so imports
are a bit faster and $GIT_DIR/ssoma.index is no longer used.
This was crucial for getting git@vger archives imported in
a reasonable time.

public-inbox-* still keeps ssoma.index up-to-date for backwards
compatibility with ssoma, and will probably do so until 2020 or
later (there'll be a few years of deprecation notices).

So I or someone else needs to update Perl ssoma to use fast-import at
some point, too; and I suggest your python version do the same.
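
For the Python side, the input to git-fast-import is a plain byte
stream, so a starting point could look like the sketch below.  It is
not the public-inbox implementation: the committer identity, the
commit message, and hashing the Message-ID string exactly as given are
all assumptions, and every function name is made up.

```python
import hashlib


def message_path(message_id):
    """Return a 2/38 path from the SHA-1 of the Message-ID (assumed input)."""
    digest = hashlib.sha1(message_id.encode('utf-8')).hexdigest()
    return digest[:2] + '/' + digest[2:]


def data_block(payload):
    """Format a fast-import 'data' command with an exact byte count."""
    return b'data ' + str(len(payload)).encode('ascii') + b'\n' + payload + b'\n'


def fast_import_commit(message_id, raw_message, when,
                       ref='refs/heads/master',
                       committer='Importer <importer@example.invalid>'):
    """Return one fast-import 'commit' command that adds a single message."""
    log = ('add ' + message_id).encode('utf-8')
    return b''.join([
        b'commit ' + ref.encode('ascii') + b'\n',
        b'committer ' + committer.encode('utf-8') + b' '
        + str(int(when)).encode('ascii') + b' +0000\n',
        data_block(log),
        # Store the raw message inline at its hashed path.
        b'M 100644 inline ' + message_path(message_id).encode('ascii') + b'\n',
        data_block(raw_message),
        b'\n',
    ])
```

Concatenating one such command per message and piping the result to
"git fast-import" builds all commits in a single process.  To append
to a branch that already exists, the first commit would also need a
"from" line.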

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Mail archives in Git using ssoma
  2016-08-21 18:37   ` Mail archives in Git using ssoma Eric Wong
@ 2016-08-21 20:28     ` W. Trevor King
  2016-08-21 21:14       ` Eric Wong
  0 siblings, 1 reply; 10+ messages in thread
From: W. Trevor King @ 2016-08-21 20:28 UTC (permalink / raw)
  To: Eric Wong; +Cc: notmuch, meta

[-- Attachment #1: Type: text/plain, Size: 3363 bytes --]

On Sun, Aug 21, 2016 at 06:37:04PM +0000, Eric Wong wrote:
> "W. Trevor King" <wking@tremily.us> wrote:
> > On Fri, Nov 07, 2014 at 11:03:21AM -0800, W. Trevor King wrote:
> > > Eric Wong has been working on some tools to store email in a Git
> > > repository, and his client-side code is ssoma [1].  I wanted a bit
> > > more metadata than the stock ssoma-mda [2], and ended up just
> > > writing a ssoma-mda in Python [3]…
>
> Btw, for public-inbox, I'm using git-fast-import now, so imports are
> a bit faster and $GIT_DIR/ssoma.index is no longer used.  This was
> crucial for getting git@vger archives imported in a reasonable time.
>
> public-inbox-* still keeps ssoma.index up-to-date for backwards
> compatibility with ssoma, and will probably do so until 2020 or
> later (there'll be a few years of deprecation notices)
>
> So I or someone else needs to update Perl ssoma to use fast-import
> at some point, too; and I suggest your python version do the same.

ssoma-mda imports 22k notmuch messages in around 15 minutes (with
profiling enabled), and:

  $ python -m cProfile -o profile import.py notmuch.mbox
  $ python -c "import pstats; p=pstats.Stats('profile'); p.sort_stats('cumulative').print_stats(10)"
  Sun Aug 21 12:56:49 2016    profile

           101823722 function calls (99078415 primitive calls) in 885.069 seconds

     Ordered by: cumulative time
     List reduced from 1145 to 10 due to restriction <10>

     ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       70/1    0.002    0.000  885.069  885.069 {built-in method exec}
          1    0.111    0.111  885.069  885.069 /home/wking/src/notmuch/notmuch-archives.git/import.py:9(<module>)
          1    0.400    0.400  884.915  884.915 /home/wking/src/notmuch/notmuch-archives.git/import.py:17(import_mbox)
      22875    0.601    0.000  863.371    0.038 /home/wking/src/notmuch/notmuch-archives.git/ssoma_mda.py:362(deliver)
      22875    8.943    0.000  810.459    0.035 /home/wking/src/notmuch/notmuch-archives.git/ssoma_mda.py:207(append)
      22875    0.418    0.000  308.353    0.013 /home/wking/.local/lib64/python3.4/site-packages/pygit2/index.py:146(write_tree)
      22875  307.855    0.013  307.855    0.013 {built-in method git_index_write_tree}
      22874    0.575    0.000  279.293    0.012 /home/wking/.local/lib64/python3.4/site-packages/pygit2/index.py:238(diff_to_tree)
      22874  278.501    0.012  278.501    0.012 {built-in method git_diff_tree_to_index}
      22875    0.088    0.000   80.413    0.004 /home/wking/.local/lib64/python3.4/site-packages/pygit2/index.py:99(read)

38 ms per ssoma delivery is probably fast enough, especially if you
are invoking ssoma-mda once per message, since process setup will
take a similar amount of time:

  $ time python -c 'print("hello")'
  hello

  real    0m0.016s
  user    0m0.013s
  sys     0m0.003s

It's possible that fast-import would shave a few ms off the pygit2
addition (I'm not sure, and maybe pygit2 is faster than fast-import).
But I doubt it matters enough either way to be worth changing unless
you are dealing with a really large corpus.
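
For reference, the figures quoted here follow from the profile above.
The numbers below are copied from that output; only the variable
names are new:

```python
deliveries = 22875         # ncalls for deliver()
deliver_cumtime = 863.371  # seconds, cumtime for deliver()
total = 885.069            # seconds, whole run
write_tree = 307.855       # seconds, git_index_write_tree
diff_to_index = 278.501    # seconds, git_diff_tree_to_index

per_delivery_ms = 1000 * deliver_cumtime / deliveries  # about 37.7 ms
total_minutes = total / 60                             # just under 15
index_share = (write_tree + diff_to_index) / total     # roughly two thirds

print('%.1f ms per delivery' % per_delivery_ms)
print('%.1f minutes in total' % total_minutes)
print('%.0f%% of the run in the two libgit2 calls' % (100 * index_share))
```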

Cheers,
Trevor

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Mail archives in Git using ssoma
  2016-08-21 20:28     ` W. Trevor King
@ 2016-08-21 21:14       ` Eric Wong
  0 siblings, 0 replies; 10+ messages in thread
From: Eric Wong @ 2016-08-21 21:14 UTC (permalink / raw)
  To: W. Trevor King; +Cc: notmuch, meta

"W. Trevor King" <wking@tremily.us> wrote:
> On Sun, Aug 21, 2016 at 06:37:04PM +0000, Eric Wong wrote:
> > Btw, for public-inbox, I'm using git-fast-import now, so imports are
> > a bit faster and $GIT_DIR/ssoma.index is no longer used.  This was
> > crucial for getting git@vger archives imported in a reasonable time.
> 
> ssoma-mda imports 22k notmuch messages in around 15 minutes (with
> profiling enabled), and:

In contrast, git@vger is around 300K messages.  LKML is well
into the millions, and I hope public-inbox (and git!) can handle
that one day, even on cheap hardware (haven't tried).

One problem I noticed with ssoma-mda is that it gets slower as
more messages get imported, since all those files sit in the
index, and the git index format is bad for incremental updates
with big, flat trees.  Big trees are a general problem with git:

    I'm now storing blob IDs directly in Xapian and will be
    using them more to avoid tree lookups.  Tree creation
    lookups degrade the same way the index does as the trees
    get bigger.

    Currently it's using 2/38 of the SHA-1 like git loose
    objects; a goal might be to move towards supporting 2/2/36
    (or deeper) as Jeff noted substantial object traversal
    improvements:

https://public-inbox.org/git/20160805092805.w3nwv2l6jkbuwlzf@sigill.intra.peff.net/

    Of course, support for 2/38 will be retained for old
    archives/messages.
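
To make the 2/38 and 2/2/36 layouts concrete, here is a minimal
Python sketch of how a 40-character hex SHA-1 could be split into
path components.  The `shard_path` helper and its `widths` parameter
are hypothetical names for illustration only, not public-inbox code:

```python
def shard_path(hex_digest, widths=(2,)):
    """Split a hex digest into directory levels of the given widths.

    widths=(2,) gives the 2/38 layout; widths=(2, 2) gives 2/2/36.
    (Hypothetical helper, for illustration only.)
    """
    parts = []
    pos = 0
    for width in widths:
        parts.append(hex_digest[pos:pos + width])
        pos += width
    parts.append(hex_digest[pos:])  # whatever remains is the leaf name
    return '/'.join(parts)


digest = 'da39a3ee5e6b4b0d3255bfef95601890afd80709'  # SHA-1 of empty input
print(shard_path(digest))          # da/39a3ee5e6b4b0d3255bfef95601890afd80709
print(shard_path(digest, (2, 2)))  # da/39/a3ee5e6b4b0d3255bfef95601890afd80709
```

With 2/38 there are at most 256 first-level trees, so each one keeps
growing with the archive; 2/2/36 spreads the same entries over up to
65536 second-level trees, which should keep each individual tree
smaller.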

>   $ python -m cProfile -o profile import.py notmuch.mbox
>   $ python -c "import pstats; p=pstats.Stats('profile'); p.sort_stats('cumulative').print_stats(10)"
>   Sun Aug 21 12:56:49 2016    profile
> 
>            101823722 function calls (99078415 primitive calls) in 885.069 seconds
> 
>      Ordered by: cumulative time
>      List reduced from 1145 to 10 due to restriction <10>
> 
>      ncalls  tottime  percall  cumtime  percall filename:lineno(function)
>        70/1    0.002    0.000  885.069  885.069 {built-in method exec}
>           1    0.111    0.111  885.069  885.069 /home/wking/src/notmuch/notmuch-archives.git/import.py:9(<module>)
>           1    0.400    0.400  884.915  884.915 /home/wking/src/notmuch/notmuch-archives.git/import.py:17(import_mbox)
>       22875    0.601    0.000  863.371    0.038 /home/wking/src/notmuch/notmuch-archives.git/ssoma_mda.py:362(deliver)
>       22875    8.943    0.000  810.459    0.035 /home/wking/src/notmuch/notmuch-archives.git/ssoma_mda.py:207(append)
>       22875    0.418    0.000  308.353    0.013 /home/wking/.local/lib64/python3.4/site-packages/pygit2/index.py:146(write_tree)
>       22875  307.855    0.013  307.855    0.013 {built-in method git_index_write_tree}
>       22874    0.575    0.000  279.293    0.012 /home/wking/.local/lib64/python3.4/site-packages/pygit2/index.py:238(diff_to_tree)
>       22874  278.501    0.012  278.501    0.012 {built-in method git_diff_tree_to_index}

It looks like writing the index is already the slowest step here
in terms of total time, too.  It might be interesting to profile
each *-mda invocation to see the degradation from the first to
the last message.
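
A rough way to see that degradation without a full profiler is to
time each delivery and average over buckets of messages.  This is
only a sketch: `time_deliveries` is a hypothetical helper, and
`deliver` stands for whatever callable performs one delivery (its
real signature in ssoma_mda.py is not assumed here).  It times calls
inside one process, so it leaves out per-invocation process startup:

```python
import time


def time_deliveries(deliver, messages, bucket_size=1000):
    """Call deliver(message) for each message.

    Returns a list with the mean wall-clock seconds per delivery for
    each consecutive bucket of bucket_size messages.
    """
    means = []
    bucket = []
    for message in messages:
        start = time.perf_counter()
        deliver(message)
        bucket.append(time.perf_counter() - start)
        if len(bucket) == bucket_size:
            means.append(sum(bucket) / len(bucket))
            bucket = []
    if bucket:  # partial final bucket
        means.append(sum(bucket) / len(bucket))
    return means


# Stand-in delivery function; swap in the real per-message call.
for i, mean in enumerate(time_deliveries(lambda message: None, range(2500))):
    print('bucket {}: {:.6f} s/delivery'.format(i, mean))
```

If the per-bucket mean climbs steadily from the first bucket to the
last, that would point at the growing index/tree rather than a fixed
per-message cost.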

>       22875    0.088    0.000   80.413    0.004 /home/wking/.local/lib64/python3.4/site-packages/pygit2/index.py:99(read)
> 
> 38 ms per ssoma delivery is probably fast enough, especially if you

Not even close for me :)

> are invoking ssoma-mda once per message, since process setup will take a similar amount of time:
> 
>   $ time python -c 'print("hello")'
>   hello
> 
>   real    0m0.016s
>   user    0m0.013s
>   sys     0m0.003s
> 
> It's possible that fast-import would shave a few ms off the pygit2
> addition (I'm not sure, and maybe pygit2 is faster than fast-import).
> But I doubt it matters enough either way to be worth changing unless
> you are dealing with a really large corpus.

One key feature is fast-import avoids writing an index entirely.
I think pygit2 would have to learn that, too.
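
For reference, a minimal sketch of what a fast-import stream with one
commit per message can look like.  This is not the public-inbox
import code: `fast_import_stream`, the committer identity, the
timestamp, and the choice to name each file after the SHA-1 of the
raw message bytes are all assumptions made for illustration, and it
assumes a fresh import into an empty repository:

```python
import hashlib


def _data(payload):
    """Format a fast-import 'data' command with an exact byte count."""
    return b'data ' + str(len(payload)).encode('ascii') + b'\n' + payload + b'\n'


def fast_import_stream(messages, ref='refs/heads/master',
                       committer=b'Importer <importer@example.com>',
                       when=b'1471813200 +0000'):
    """Build a fast-import stream: one commit per raw message (bytes).

    committer and when are placeholder values; the path layout (2/38 of
    the SHA-1 of the raw bytes) is an assumption for this sketch.
    """
    chunks = []
    for raw in messages:
        digest = hashlib.sha1(raw).hexdigest()
        path = digest[:2] + '/' + digest[2:]
        chunks.append(b'commit ' + ref.encode('ascii') + b'\n')
        chunks.append(b'committer ' + committer + b' ' + when + b'\n')
        chunks.append(_data(b'import ' + digest.encode('ascii')))
        # 'inline' means the file content follows as a data command.
        chunks.append(b'M 100644 inline ' + path.encode('ascii') + b'\n')
        chunks.append(_data(raw))
    return b''.join(chunks)


stream = fast_import_stream([b'Subject: a\n\nbody\n', b'Subject: b\n\nbody\n'])
print(stream.decode('ascii'))
```

The resulting bytes could be fed to `git fast-import` on stdin from
inside a bare repository (for example with `subprocess.run(['git',
'fast-import'], input=stream)`).  Each commit only names the path it
changes, so nothing here reads or writes $GIT_DIR/index.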

* Re: Mail archives in Git using ssoma (Docker image)
  2016-08-21 13:49       ` David Bremner
@ 2016-09-07  1:23         ` Eric Wong
  0 siblings, 0 replies; 10+ messages in thread
From: Eric Wong @ 2016-09-07  1:23 UTC (permalink / raw)
  To: David Bremner; +Cc: W. Trevor King, notmuch

David Bremner <david@tethera.net> wrote:
> Eric Wong <e@80x24.org> writes:
> > For mirroring existing lists, I started using public-inbox-watch
> > which currently watches Maildirs.  The config knobs are sorta
> > documented from my announcement to git@vger:
> >
> > https://public-inbox.org/git/20160710004813.GA20210@dcvr.yhbt.net/
> > http://hjrcffqmbrq6wope.onion/git/20160710004813.GA20210@dcvr.yhbt.net/
> >
> > Initial import (w/o spamassassin) was done with
> > scripts/import_vger_from_mbox in the source:
> >
> >         torsocks git clone http://hjrcffqmbrq6wope.onion/public-inbox
> >         git clone https://public-inbox.org/ public-inbox
> >         git clone git://repo.or.cz/public-inbox
> >
> 
> FWIW, I already have a Maildir with a complete (and updated) archive of the list (and
> only that) for use of nmbug. So at the risk of putting all eggs in one
> basket, perhaps public-inbox-watch could watch that maildir.

Yes, public-inbox-watch(1) is probably preferable for any subscriber to
start archiving the notmuch list.  I just pushed out some POD manpages
which should probably help (along with the existing INSTALL doc):

   https://public-inbox.org/meta/20160907004907.1479-1-e@80x24.org/

public-inbox-overview(7) should be a good starting point of ways
to start mirroring/hosting.  Please feel free to ask me directly
and/or meta@public-inbox.org if you need clarification or help.
I'm scatterbrained and tend to omit things when writing
documentation (it's hard to tell what a reader wants to know :x)


Anyways, thanks for notmuch (and being GPL-3.0+)!  I'm not a
user myself(*), but I've found the notmuch source to be a good
place to steal Xapian usage examples from for public-inbox :>



(*) I have trouble with Maildir-only scalability and
    still use gzipped mbox for old mail.

Thread overview: 10+ messages
2014-11-07 19:03 Mail archives in Git using ssoma W. Trevor King
2016-08-21  4:36 ` W. Trevor King
2016-08-21  9:48   ` Mail archives in Git using ssoma (Docker image) W. Trevor King
     [not found]     ` <20160821120852.GA12964@dcvr>
2016-08-21 13:49       ` David Bremner
2016-09-07  1:23         ` Eric Wong
2016-08-21 17:36       ` W. Trevor King
2016-08-21 18:28         ` Eric Wong
2016-08-21 18:37   ` Mail archives in Git using ssoma Eric Wong
2016-08-21 20:28     ` W. Trevor King
2016-08-21 21:14       ` Eric Wong
