Re: Q: V2 format - Eric Wong

From: Eric Wong <e@80x24.org>
To: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: meta@public-inbox.org
Subject: Re: Q: V2 format
Date: Thu, 12 Jul 2018 01:47:15 +0000
Message-ID: <20180712014715.dn5aouayoa3uejp4@dcvr>
In-Reply-To: <87k1q1bky6.fsf@xmission.com>

"Eric W. Biederman" <ebiederm@xmission.com> wrote:
> I have been digging through the code so I can understand the v2
> format, and I have some ideas on how things might be improved, and some
> questions so that I understand.

Great to know you're interested!  Fwiw, I've still been meaning
to turn my v2 docs into a POD manpage:

  https://public-inbox.org/meta/20180419015813.GA20051@dcvr/

> V1 supported the concept of messages being added and deleted from
> the git repository all while keeping a full history of everything that
> went on.  The V2 code appears to have the name 'm' for added and 'd' for
> deleted, but the public-inbox-index code appears to expect deletes to
> happen by way of an altered history that totally purges the commits,
> and does not process the 'd' entries.

"Purge" is a new concept for v2 and not even exposed (yet) in
tools.  Normal operations to remove files using 'd' (via
-watch or -rm) don't rewrite old history so it won't disrupt
non-force fetches.

> What is the thinking about deleted entries, and for v2 what is the
> preferred way to delete mail from a public inbox git repository and why?

Definitely prefer the normal way with 'd' files to not break
people using non-force fetches.  "Purge" is too disruptive
and reserved for extraordinary cases (e.g. legal reasons).
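
To illustrate why 'd' commits keep non-force fetches working, here is a
minimal Python sketch. It is not public-inbox code: the Commit class and
replay() function are hypothetical stand-ins, and a Message-ID string
stands in for the blob content. The only thing taken from this thread is
that 'm' marks an added message and 'd' a deleted one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    """One commit in a hypothetical append-only history."""
    name: str        # 'm' = message added, 'd' = message deleted
    message_id: str  # stand-in for the blob content


def replay(history):
    """Replay commits oldest-first and return the visible Message-IDs."""
    visible = set()
    for commit in history:
        if commit.name == "m":
            visible.add(commit.message_id)
        elif commit.name == "d":
            # A delete is just another commit; nothing earlier is rewritten.
            visible.discard(commit.message_id)
        else:
            raise ValueError(f"unexpected file name: {commit.name!r}")
    return visible


history = [
    Commit("m", "a@example"),
    Commit("m", "spam@example"),
    Commit("d", "spam@example"),
    Commit("m", "b@example"),
]
print(sorted(replay(history)))
```

A mirror that stopped fetching after the second commit still holds a valid
prefix of this history, so its next fetch is a plain fast-forward. A purge
would instead drop the second and third commits and force every mirror to
re-fetch.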

> Size.  Reading the history of the public inbox meta mailing list and
> playing around I discovered that I can shave off about 100M of the V2
> size of the git public inbox git repository by pushing all of the
> messages into a single commit.  Not great for day to day operation,
> but if rebases are part of the plan, and old archives part of the
> challenge I see quite a lot of potential for old archives to be reduced
> to a git repository with a single commit.

Rebases/rewriting history is definitely not part of the plan and
a last resort.

> Names.  Is there a good reason not to use message numbers as the names
> in the git repositories?  (Other than the cost to change the code?) That
> would remove the need to treat the sqlite msgmap database as precious,
> and it would make it easier to recover if an nntp server goes away.  In
> V2 format the git mailing list git repository is only about 2M larger if
> each message has its msg number as its name.  Plus the git log
> is easier to read as messages are all + or -.

Big trees in git were a scalability problem in v1 because of the
long 2/38 names.  With the shorter names you propose (base-10 serial
numbers?), the scalability problem gets pushed off a bit, I suppose.
But not indefinitely; and later v2 partitions will suffer more
from longer names.
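
For a sense of the name lengths being compared, here is a rough Python
sketch. It assumes a 2/38 name is the SHA-1 hex digest of the Message-ID
split after the first two characters; that detail and both function names
are assumptions for illustration, not taken from this message.

```python
import hashlib


def v1_style_path(message_id: str) -> str:
    """Build a 2/38 path: 2 hex chars, a slash, then the other 38."""
    digest = hashlib.sha1(message_id.encode("utf-8")).hexdigest()
    return f"{digest[:2]}/{digest[2:]}"


def serial_name(num: int) -> str:
    """Hypothetical base-10 serial number used directly as a file name."""
    return str(num)


path = v1_style_path("20180712014715.dn5aouayoa3uejp4@dcvr")
print(len(path))                     # always 41 characters
print(len(serial_name(1234)))        # 4 characters
print(len(serial_name(1234567)))     # 7 characters
```

The hashed name has a fixed length, while a serial-number name grows by
one character with every factor of ten, which is why later partitions
would carry longer names than early ones.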

I also want to limit the use and exposure of serial numbers as
much as possible.  It's unavoidable with the NNTP interface;
but reliance on serial numbers in public interfaces leads to
centralization.

The current v2 is also better for inode-starved users in case
somebody forgets to type "--mirror" or "--bare" with clone.  For
the most part (unless purge is used), the SQLite database is
actually recoverable.

So no, I don't think having serial numbers stored in filenames
is the right thing.

> xapian.  Can the Xapian database be made optional in V2?  

Definitely in the TODO :)

> I absolutely
> think a quick search for terms and other things very valuable, so I
> would never suggest giving up Xapian.  On the other hand on my personal
> laptop the xapian database for lkml takes ages and ages to build, and it
> pushes the system into swap.  Which is all around unpleasant.  That
> seems to eat into the distributed nature of the goal of public inbox.
> I have tried to see what could be done that might shrink the size of
> the xapian database.  The only thing I could think of is perhaps
> sharding the xapian database by time/msgnum ranges.   That would allow
> the old xapian databases to be compacted and forgotten about, and I
> think it would allow less wastage in the current xapian database as it
> would be smaller, so wasting 50% space (or whatever the btrees waste)
> would be less of an issue.  And as smaller databases are faster I think
> that would in general be a help.

One big killer for Xapian is position information required for
"quoted phrase searches".  I seem to remember deleting the position.*
files was safe as it would only break phrase searches (but I
haven't tried it).

So there should be an option to toggle between the "index_text"
and "index_text_without_positions" routines in Xapian.
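
To show what is lost without positions, here is a toy inverted index in
plain Python. It does not use Xapian, and build_index(), term_search() and
phrase_search() are made-up names; it only mimics the general idea that
phrase matching needs per-term positions while single-term matching does
not.

```python
from collections import defaultdict


def build_index(docs, with_positions=True):
    """Map term -> {doc_id: [positions]}; lists stay empty when disabled."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            postings = index[term].setdefault(doc_id, [])
            if with_positions:
                postings.append(pos)
    return index


def term_search(index, term):
    """Return the ids of documents containing the term anywhere."""
    return set(index.get(term.lower(), {}))


def phrase_search(index, phrase):
    """Return the ids of documents containing the terms consecutively."""
    terms = phrase.lower().split()
    if not terms:
        return set()
    candidates = set.intersection(*(term_search(index, t) for t in terms))
    hits = set()
    for doc_id in candidates:
        starts = index[terms[0]][doc_id]
        if any(all(start + i in index[t][doc_id]
                   for i, t in enumerate(terms))
               for start in starts):
            hits.add(doc_id)
    return hits


docs = {
    1: "quoted phrase searches need positions",
    2: "phrase quoted out of order",
}
full = build_index(docs, with_positions=True)
slim = build_index(docs, with_positions=False)

print(phrase_search(full, "quoted phrase"))  # only doc 1 matches
print(phrase_search(slim, "quoted phrase"))  # no positions, no matches
print(term_search(slim, "quoted"))           # plain term search still works
```

In this toy model the position lists are the only extra data the full
index stores, so dropping them saves space at the cost of phrase queries.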

Given the way the indexing only works on the most recent data,
I think one could also write a script to delete old data/results
from Xapian without affecting current/future indexing.
The deleted data would pop back up if/when there are schema upgrades
requiring a rebuild, though...

I believe there should be 3 levels of v2 operation:

1) SQLite-only (NNTP and all the threading stuff works)
2) SQLite + Xapian w/o positions (good enough for most things)
3) SQLite + Xapian w/ positions (current, default)

2) seems like a reasonable trade-off for most sites; I'm not
sure how often phrase searching gets used.
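
The three levels above could be modeled roughly as follows. This is only
a sketch: the IndexLevel enum, its member names and the feature strings
are invented for illustration and are not an existing public-inbox
setting. It also assumes that level 2 keeps ordinary term search and only
loses phrase search.

```python
from enum import IntEnum


class IndexLevel(IntEnum):
    """Hypothetical names for the three proposed levels of operation."""
    SQLITE_ONLY = 1
    XAPIAN_NO_POSITIONS = 2
    XAPIAN_WITH_POSITIONS = 3


def features(level):
    """Return the features available at a level; each level adds to the last."""
    feats = {"nntp", "threading"}
    if level >= IndexLevel.XAPIAN_NO_POSITIONS:
        feats.add("term search")
    if level >= IndexLevel.XAPIAN_WITH_POSITIONS:
        feats.add("phrase search")
    return feats


for level in IndexLevel:
    print(level.name, sorted(features(level)))
```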

> Time permitting I am willing to do some of this work so that
> public-inbox works well for me.  I want to see what your vision is for
> the code before I start anything.

Thanks for running this by me first.  I'm not convinced git layout
changes are warranted at this point for v2.

Making Xapian optional and configurable to use
index_text_without_positions is something I definitely want to
see happen, though.
