From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on dcvr.yhbt.net X-Spam-Level: X-Spam-Status: No, score=-3.9 required=3.0 tests=ALL_TRUSTED,AWL,BAYES_00 shortcircuit=no autolearn=ham autolearn_force=no version=3.4.2 Received: from localhost (dcvr.yhbt.net [127.0.0.1]) by dcvr.yhbt.net (Postfix) with ESMTP id D4E061F4B4 for ; Sat, 12 Dec 2020 09:31:42 +0000 (UTC) From: Eric Wong To: meta@public-inbox.org Subject: [PATCH] doc: add public-inbox-extindex-format(5) manpage Date: Sat, 12 Dec 2020 09:31:42 +0000 Message-Id: <20201212093142.9222-1-e@80x24.org> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit List-Id: The CLI tool still needs usability work, and "misc" is still in flux, but the core message indexing part is stable (since it's stolen from v2 :P). --- .../public-inbox-extindex-format.pod | 110 ++++++++++++++++++ MANIFEST | 1 + Makefile.PL | 2 +- 3 files changed, 112 insertions(+), 1 deletion(-) create mode 100644 Documentation/public-inbox-extindex-format.pod diff --git a/Documentation/public-inbox-extindex-format.pod b/Documentation/public-inbox-extindex-format.pod new file mode 100644 index 00000000..757685e5 --- /dev/null +++ b/Documentation/public-inbox-extindex-format.pod @@ -0,0 +1,110 @@ +% public-inbox developer manual + +=head1 NAME + +public-inbox extindex format description + +=head1 DESCRIPTION + +The extindex is an index-only evolution of the per-inbox +SQLite and Xapian indices used by L +and L. It exists to facilitate +searches across multiple inboxes as well as to reduce index +space when messages are cross-posted to several existing +inboxes. + +It transparently indexes messages across any combination of v1 and v2 +inboxes and data about inboxes themselves. + +=head1 DIRECTORY LAYOUT + +While inspired by v2, there is no git blob storage nor +C DB. + +Instead, there is an C (all caps) git repo which treats +every indexed v1 inbox or v2 epoch as a git alternate. 
+ +As with v2 inboxes, it uses C and Xapian "shards" +for WWW and IMAP use. Several exclusive new tables are added +to deal with L and metadata. + +Unlike v1 and v2 inboxes, it is NOT designed to map to an NNTP +newsgroup. Thus it lacks C to enforce the +unique Message-ID requirement of NNTP. + +=head2 INDEX OVERVIEW AND DEFINITIONS + + $SCHEMA_VERSION - DB schema version (for Xapian) + $SHARD - Integer starting with 0 based on parallelism + + foo/ # "foo" is the name of the index + - ei.lock # lock file to protect global state + - ALL.git # empty, alternates for inboxes + - ei$SCHEMA_VERSION/$SHARD # per-shard Xapian DB + - ei$SCHEMA_VERSION/over.sqlite3 # overview DB for WWW, IMAP + - ei$SCHEMA_VERSION/misc # misc Xapian DB + +File and directory names are intentionally different from +analogous v2 names to ensure extindex and v2 inboxes can +easily be distinguished from each other. + +=head2 XREF3 DEDUPLICATION + +Due to cross-posted messages being the norm in the large Linux kernel +development community and Xapian indices being the primary consumer of +storage, it makes sense to deduplicate indexing as much as possible. + +The internal storage format is based on the NNTP "Xref" tuple, +but with the addition of a third element: the git blob OID. +Thus the triple is expressed in string form as: + + $NEWSGROUP_NAME:$ARTICLE_NUM:$OID + +If no C is configured for an inbox, the C +of the inbox is used. + +This data is stored in the C table of over.sqlite3. + +=head2 misc XAPIAN DB + +In addition to the numeric Xapian shards for indexing messages, +there is a new, in-development Xapian index for storing data +about inboxes themselves and other non-message data. This +index allows us to speed up operations involving hundreds or +thousands of inboxes. + +=head1 BENEFITS + +In addition to providing cross-inbox search capabilities, it can +also replace per-inbox Xapian shards (but not per-inbox +over.sqlite3). 
This allows reduction in disk space, open file +handles, and associated memory use. + +=head1 CAVEATS + +Relocating v1 and v2 inboxes on the filesystem will require +extindex to be garbage-collected and/or reindexed. + +Configuring and maintaining stable C names before any +messages are indexed from every inbox can avoid expensive +reindexing and rely exclusively on GC. + +=head1 LOCKING + +L locking exclusively locks the empty ei.lock file +for all non-atomic operations. + +=head1 THANKS + +Thanks to the Linux Foundation for sponsoring the development +and testing. + +=head1 COPYRIGHT + +Copyright 2020 all contributors L + +License: AGPL-3.0+ L + +=head1 SEE ALSO + +L diff --git a/MANIFEST b/MANIFEST index b39f63db..ac442606 100644 --- a/MANIFEST +++ b/MANIFEST @@ -27,6 +27,7 @@ Documentation/public-inbox-config.pod Documentation/public-inbox-convert.pod Documentation/public-inbox-daemon.pod Documentation/public-inbox-edit.pod +Documentation/public-inbox-extindex-format.pod Documentation/public-inbox-httpd.pod Documentation/public-inbox-imapd.pod Documentation/public-inbox-index.pod diff --git a/Makefile.PL b/Makefile.PL index 56679598..8e710df2 100644 --- a/Makefile.PL +++ b/Makefile.PL @@ -44,7 +44,7 @@ $v->{-m1} = [ map { } } @EXE_FILES ]; $v->{-m5} = [ qw(public-inbox-config public-inbox-v1-format - public-inbox-v2-format) ]; + public-inbox-v2-format public-inbox-extindex-format) ]; $v->{-m7} = [ qw(public-inbox-overview public-inbox-tuning) ]; $v->{-m8} = [ qw(public-inbox-daemon) ]; my @sections = (1, 5, 7, 8);
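To illustrate the string form of the triple described under XREF3 DEDUPLICATION ($NEWSGROUP_NAME:$ARTICLE_NUM:$OID), here is a minimal Python sketch of splitting and re-joining it. This is not code from public-inbox, which is written in Perl. The `Xref3` type, the function names, and the sample values are hypothetical, and the sketch assumes the OID is written as lowercase hex and that the article number is a non-negative decimal integer. It says nothing about how the data is laid out inside over.sqlite3.

```python
from typing import NamedTuple


class Xref3(NamedTuple):
    """Hypothetical in-memory form of one xref3 triple."""
    newsgroup: str    # newsgroup name (or the inbox name when none is set)
    article_num: int  # article number within that newsgroup
    oid: str          # git blob OID, assumed here to be lowercase hex


def parse_xref3(s: str) -> Xref3:
    """Split 'NEWSGROUP:ARTICLE_NUM:OID' into its three elements."""
    # Split from the right so only the last two colons act as separators.
    try:
        newsgroup, num, oid = s.rsplit(":", 2)
    except ValueError:
        raise ValueError(f"not an xref3 triple: {s!r}") from None
    if not newsgroup:
        raise ValueError(f"empty newsgroup name: {s!r}")
    if not (num.isascii() and num.isdigit()):
        raise ValueError(f"bad article number: {s!r}")
    # Assumption: the OID is non-empty lowercase hex; length is not checked.
    if not oid or any(c not in "0123456789abcdef" for c in oid):
        raise ValueError(f"bad OID: {s!r}")
    return Xref3(newsgroup, int(num), oid)


def format_xref3(x: Xref3) -> str:
    """Join the three elements back into the string form."""
    return f"{x.newsgroup}:{x.article_num}:{x.oid}"


if __name__ == "__main__":
    # Two hypothetical cross-posts of one message share the same blob OID.
    oid = "a" * 40
    a = parse_xref3(f"inbox.comp.example.one:42:{oid}")
    b = parse_xref3(f"inbox.comp.example.two:7:{oid}")
    print(a.oid == b.oid)
    print(format_xref3(a))
```

Because both sample triples carry the same OID, a consumer could index the message content once and keep one triple per inbox, which is the deduplication idea the section describes.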