unofficial mirror of meta@public-inbox.org
* [PATCH] doc: add public-inbox-extindex-format(5) manpage
@ 2020-12-12  9:31 Eric Wong
  2020-12-12 18:24 ` Kyle Meyer
  0 siblings, 1 reply; 2+ messages in thread
From: Eric Wong @ 2020-12-12  9:31 UTC
  To: meta

The CLI tool still needs usability work, and "misc" is still in
flux, but the core message indexing part is stable (since it's
stolen from v2 :P).
---
 .../public-inbox-extindex-format.pod          | 110 ++++++++++++++++++
 MANIFEST                                      |   1 +
 Makefile.PL                                   |   2 +-
 3 files changed, 112 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/public-inbox-extindex-format.pod

diff --git a/Documentation/public-inbox-extindex-format.pod b/Documentation/public-inbox-extindex-format.pod
new file mode 100644
index 00000000..757685e5
--- /dev/null
+++ b/Documentation/public-inbox-extindex-format.pod
@@ -0,0 +1,110 @@
+% public-inbox developer manual
+
+=head1 NAME
+
+public-inbox extindex format description
+
+=head1 DESCRIPTION
+
+The extindex is an index-only evolution of the per-inbox
+SQLite and Xapian indices used by L<public-inbox-v2-format(5)>
+and L<public-inbox-v1-format(5)>.  It exists to facilitate
+searches across multiple inboxes as well as to reduce index
+space when messages are cross-posted to several existing
+inboxes.
+
+It transparently indexes messages across any combination of v1 and v2
+inboxes and data about inboxes themselves.
+
+=head1 DIRECTORY LAYOUT
+
+While inspired by v2, there is no git blob storage nor
+C<msgmap.sqlite3> DB.
+
+Instead, there is an C<ALL.git> (all caps) git repo which treats
+every indexed v1 inbox or v2 epoch as a git alternate.
+
+As with v2 inboxes, it uses C<over.sqlite3> and Xapian "shards"
+for WWW and IMAP use.  Several exclusive new tables are added
+to deal with L</XREF3 DEDUPLICATION> and metadata.
+
+Unlike v1 and v2 inboxes, it is NOT designed to map to a NNTP
+newsgroup.  Thus it lacks C<msgmap.sqlite3> to enforce the
+unique Message-ID requirement of NNTP.
+
+=head2 INDEX OVERVIEW AND DEFINITIONS
+
+  $SCHEMA_VERSION - DB schema version (for Xapian)
+  $SHARD - Integer starting with 0 based on parallelism
+
+  foo/                              # "foo" is the name of the index
+  - ei.lock                         # lock file to protect global state
+  - ALL.git                         # empty, alternates for inboxes
+  - ei$SCHEMA_VERSION/$SHARD        # per-shard Xapian DB
+  - ei$SCHEMA_VERSION/over.sqlite3  # overview DB for WWW, IMAP
+  - ei$SCHEMA_VERSION/misc          # misc Xapian DB
+
+File and directory names are intentionally different from
+analogous v2 names to ensure extindex and v2 inboxes can
+easily be distinguished from each other.
+
+=head2 XREF3 DEDUPLICATION
+
+Due cross-posted messages being the norm in the large Linux kernel
+development community and Xapian indices being the primary consumer of
+storage, it makes sense to deduplicate indexing as much as possible.
+
+The internal storage format is based on the NNTP "Xref" tuple,
+but with the addition of a third element: the git blob OID.
+Thus the triple is expressed in string form as:
+
+	$NEWSGROUP_NAME:$ARTICLE_NUM:$OID
+
+If no C<newsgroup> is configured for an inbox, the C<inboxdir>
+of the inbox is used.
+
+This data is stored in the C<xref3> table of over.sqlite3.
+
+=head2 misc XAPIAN DB
+
+In addition to the numeric Xapian shards for indexing messages,
+there is a new, in-development Xapian index for storing data
+about inboxes themselves and other non-message data.  This
+index allows us to speed up operations involving hundreds or
+thousands of inboxes.
+
+=head1 BENEFITS
+
+In addition to providing cross-inbox search capabilities, it can
+also replace per-inbox Xapian shards (but not per-inbox
+over.sqlite3).  This allows reduction in disk space, open file
+handles, and associated memory use.
+
+=head1 CAVEATS
+
+Relocating v1 and v2 inboxes on the filesystem will require
+extindex to be garbage-collected and/or reindexed.
+
+Configuring and maintaining stable C<newsgroup> names before any
+messages are indexed from every inbox can avoid expensive
+reindexing and rely exclusively on GC.
+
+=head1 LOCKING
+
+L<flock(2)> locking exclusively locks the empty ei.lock file
+for all non-atomic operations.
+
+=head1 THANKS
+
+Thanks to the Linux Foundation for sponsoring the development
+and testing.
+
+=head1 COPYRIGHT
+
+Copyright 2020 all contributors L<mailto:meta@public-inbox.org>
+
+License: AGPL-3.0+ L<http://www.gnu.org/licenses/agpl-3.0.txt>
+
+=head1 SEE ALSO
+
+L<public-inbox-v2-format(5)>
diff --git a/MANIFEST b/MANIFEST
index b39f63db..ac442606 100644
--- a/MANIFEST
+++ b/MANIFEST
@@ -27,6 +27,7 @@ Documentation/public-inbox-config.pod
 Documentation/public-inbox-convert.pod
 Documentation/public-inbox-daemon.pod
 Documentation/public-inbox-edit.pod
+Documentation/public-inbox-extindex-format.pod
 Documentation/public-inbox-httpd.pod
 Documentation/public-inbox-imapd.pod
 Documentation/public-inbox-index.pod
diff --git a/Makefile.PL b/Makefile.PL
index 56679598..8e710df2 100644
--- a/Makefile.PL
+++ b/Makefile.PL
@@ -44,7 +44,7 @@ $v->{-m1} = [ map {
 		}
 	} @EXE_FILES ];
 $v->{-m5} = [ qw(public-inbox-config public-inbox-v1-format
-		public-inbox-v2-format) ];
+		public-inbox-v2-format public-inbox-extindex-format) ];
 $v->{-m7} = [ qw(public-inbox-overview public-inbox-tuning) ];
 $v->{-m8} = [ qw(public-inbox-daemon) ];
 my @sections = (1, 5, 7, 8);
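
The xref3 triple described under XREF3 DEDUPLICATION in the patch above can be illustrated with a small sketch. This is not code from public-inbox (which is written in Perl); the Xref3 type, the helper names, and the example values are hypothetical. Splitting from the right is an assumption made here because the inboxdir fallback is a filesystem path that might itself contain a colon.

```python
from typing import NamedTuple


class Xref3(NamedTuple):
    """One (inbox, article number, blob OID) triple; the type name is hypothetical."""
    ibx_id: str  # the newsgroup name, or the inboxdir when no newsgroup is configured
    num: int     # article number within that inbox
    oid: str     # git blob OID of the message


def format_xref3(x: Xref3) -> str:
    # String form from the manpage: $NEWSGROUP_NAME:$ARTICLE_NUM:$OID
    return f"{x.ibx_id}:{x.num}:{x.oid}"


def parse_xref3(s: str) -> Xref3:
    # Assumption: split from the right so a ':' inside an inboxdir
    # path cannot shift the article number and OID fields.
    ibx_id, num, oid = s.rsplit(":", 2)
    return Xref3(ibx_id, int(num), oid)
```

With this shape, a message cross-posted to two inboxes would carry two triples that share the same OID but differ in the first two fields, which is what would let a single index entry serve both inboxes.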

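The LOCKING section of the patch above says flock(2) takes an exclusive lock on the empty ei.lock file for non-atomic operations. A minimal sketch of that pattern follows, assuming a POSIX system. The ei_locked helper is hypothetical, and creating the lock file on demand is an assumption of this sketch, not something the manpage states.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def ei_locked(index_dir):
    """Hold an exclusive flock(2) on index_dir/ei.lock for the duration of the block."""
    path = os.path.join(index_dir, "ei.lock")
    # Assumption: create the (empty) lock file if it does not exist yet.
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o666)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until no other holder remains
        try:
            yield path
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)
```

A caller would wrap any multi-step update of the index in "with ei_locked(index_dir):" so that concurrent writers serialize on the lock file. The lock file itself is never written to.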
* Re: [PATCH] doc: add public-inbox-extindex-format(5) manpage
  2020-12-12  9:31 [PATCH] doc: add public-inbox-extindex-format(5) manpage Eric Wong
@ 2020-12-12 18:24 ` Kyle Meyer
  0 siblings, 0 replies; 2+ messages in thread
From: Kyle Meyer @ 2020-12-12 18:24 UTC
  To: Eric Wong; +Cc: meta


Eric Wong writes:

> The CLI tool still needs usability work, and "misc" is still in
> flux, but the core message indexing part is stable (since it's
> stolen from v2 :P).

All very exciting :)  Thanks.

> +=head2 XREF3 DEDUPLICATION
> +
> +Due cross-posted messages being the norm in the large Linux kernel

s/Due/& to/
