unofficial mirror of meta@public-inbox.org
* [PATCH] update and add documentation for repository formats
From: Eric Wong @ 2019-01-02  8:33 UTC
  To: meta

Remove confusing documentation around ssoma now that we
have NNTP and downloadable mbox support.

Only lightly-checked for grammar and speling, and not yet
formatting.  Edits, corrections and addendums expected :>
---
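
Note for reviewers (not part of the commit): a minimal sketch of the
v1 pathname scheme documented in public-inbox-v1-format.pod below.
The helper name v1_blob_path and the UTF-8 encoding step are
assumptions made for illustration; the real logic lives in the Perl
code, not here.

```python
import hashlib


def v1_blob_path(message_id_header):
    """Hypothetical helper: map a Message-ID header value to a 2/38 tree path."""
    # Leading and trailing white space is ignored for hashing.
    mid = message_id_header.strip()
    # The leading "<" and trailing ">" are excluded from the hash input.
    if mid.startswith("<"):
        mid = mid[1:]
    if mid.endswith(">"):
        mid = mid[:-1]
    # Assumption: the Message-ID is hashed as UTF-8 bytes.
    digest = hashlib.sha1(mid.encode("utf-8")).hexdigest()
    # Split the 40-character hexdigest into a 2/38 path.
    return digest[:2] + "/" + digest[2:]


if __name__ == "__main__":
    print(v1_blob_path("<20131106023245.GA20224@dcvr.yhbt.net>"))
```

The 2/38 split mirrors how git fans out loose objects, which keeps any
single tree level from growing too wide.
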
 Documentation/design_notes.txt           |  18 +-
 Documentation/include.mk                 |   4 +-
 Documentation/public-inbox-mda.pod       |   2 +-
 Documentation/public-inbox-v1-format.pod | 171 +++++++++++++++++
 Documentation/public-inbox-v2-format.pod | 234 +++++++++++++++++++++++
 INSTALL                                  |   4 +-
 MANIFEST                                 |   2 +
 README                                   |  20 +-
 lib/PublicInbox/Import.pm                |   6 +-
 9 files changed, 430 insertions(+), 31 deletions(-)
 create mode 100644 Documentation/public-inbox-v1-format.pod
 create mode 100644 Documentation/public-inbox-v2-format.pod
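
Note for reviewers (not part of the commit): a rough sketch of how the
"all.git" alternates file described in public-inbox-v2-format.pod
below could be generated.  The function name and the choice of
relative paths are assumptions for illustration; git resolves
relative alternates entries against the objects directory that
contains the alternates file.

```python
import posixpath


def alternates_content(inbox_dir, epochs):
    """Hypothetical helper: build all.git/objects/info/alternates content."""
    # Directory that relative alternates entries are resolved against.
    all_objects = posixpath.join(inbox_dir, "all.git", "objects")
    lines = []
    for epoch in sorted(epochs):
        # Each epoch is a normal git repository at git/$EPOCH.git.
        epoch_objects = posixpath.join(
            inbox_dir, "git", "%d.git" % epoch, "objects")
        lines.append(posixpath.relpath(epoch_objects, start=all_objects))
    # One object directory per line, newline-terminated.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(alternates_content("/srv/foo", [0, 1, 2]), end="")
```

Readers would then only need to open all.git, while writers keep
writing blobs directly to the current epoch.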

diff --git a/Documentation/design_notes.txt b/Documentation/design_notes.txt
index c5d9427..9ad4977 100644
--- a/Documentation/design_notes.txt
+++ b/Documentation/design_notes.txt
@@ -27,9 +27,7 @@ Use existing infrastructure
 * Existing spam filtering on an SMTP server is also effective on
   public-inbox.
 
-* readers may continue using use their choice of mail clients and
-  mailbox formats, only learning a few commands of the ssoma(1) tool
-  is required.
+* Readers may continue using their choice of NNTP and mail clients.
 
 * Atom is a reasonable feed format for casual readers and is supported
   by a variety of feed readers.
@@ -145,19 +143,11 @@ What sucks about public-inbox
 Scalability notes
 -----------------
 
-Even with shallow clone, storing the history of large/busy mailing lists
-may place much burden on subscribers and servers.  However, having a
-single (or few) refs representing the entire history of a list is good
-for small lists since it's easier to look up a message by Message-ID, so
-we want to avoid splitting refs with independent histories.
-
-ssoma will likely grow its own built-in ref rotation system based on
-message count (not rotating at fixed time intervals).  This would
-split the histories and require O(n) lookup time based on Message-ID,
-where `n' is the number of history splits.
+See the public-inbox-v2-format(5) manpage for all the scalability
+problems solved.
 
 Copyright
 ---------
 
-Copyright 2013-2018 all contributors <meta@public-inbox.org>
+Copyright 2013-2019 all contributors <meta@public-inbox.org>
 License: AGPL-3.0+ <http://www.gnu.org/licenses/agpl-3.0.txt>
diff --git a/Documentation/include.mk b/Documentation/include.mk
index ad7b80a..28fa757 100644
--- a/Documentation/include.mk
+++ b/Documentation/include.mk
@@ -1,4 +1,4 @@
-# Copyright (C) 2013-2018 all contributors <meta@public-inbox.org>
+# Copyright (C) 2013-2019 all contributors <meta@public-inbox.org>
 # License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
 all::
 
@@ -24,6 +24,8 @@ m1 += public-inbox-watch
 m1 += public-inbox-index
 m5 =
 m5 += public-inbox-config
+m5 += public-inbox-v1-format
+m5 += public-inbox-v2-format
 m7 =
 m7 += public-inbox-overview
 m8 =
diff --git a/Documentation/public-inbox-mda.pod b/Documentation/public-inbox-mda.pod
index 1a5ade8..41a697b 100644
--- a/Documentation/public-inbox-mda.pod
+++ b/Documentation/public-inbox-mda.pod
@@ -56,4 +56,4 @@ License: AGPL-3.0+ L<https://www.gnu.org/licenses/agpl-3.0.txt>
 
 =head1 SEE ALSO
 
-L<git(1)>, L<git-config(1)>, L<ssoma_repository(5)>
+L<git(1)>, L<git-config(1)>, L<public-inbox-v1-format(5)>
diff --git a/Documentation/public-inbox-v1-format.pod b/Documentation/public-inbox-v1-format.pod
new file mode 100644
index 0000000..2a6b8d3
--- /dev/null
+++ b/Documentation/public-inbox-v1-format.pod
@@ -0,0 +1,171 @@
+% public-inbox developer manual
+
+=head1 NAME
+
+public-inbox v1 git repository and tree description (aka "ssoma")
+
+=head1 DESCRIPTION
+
+WARNING: this does NOT describe the scalable v2 format used
+by public-inbox.  Use of ssoma is not recommended for new
+installations due to scalability problems.
+
+ssoma uses a git repository to store each email as a git blob.
+The tree filename of the blob is based on the SHA-1 hexdigest of
+the first Message-ID header.  A commit is made for each message
+delivered.  The commit SHA-1 identifier is used by ssoma clients
+to track synchronization state.
+
+=head1 PATHNAMES IN TREES
+
+A Message-ID may be extremely long and also contain slashes, so using
+them as a path name is challenging.  Instead we use the SHA-1 hexdigest
+of the Message-ID (excluding the leading "E<lt>" and trailing "E<gt>")
+to generate a path name.  Leading and trailing white space in the
+Message-ID header is ignored for hashing.
+
+A message with Message-ID of: E<lt>20131106023245.GA20224@dcvr.yhbt.netE<gt>
+
+Would be stored as: f2/8c6cfd2b0a65f994c3e1be266105413b3d3f63
+
+Thus it is easy to look up the contents of a message matching a given
+Message-ID.
+
+=head1 MESSAGE-ID CONFLICTS
+
+public-inbox v1 repositories currently do not resolve conflicting
+Message-IDs or messages with multiple Message-IDs.
+
+=head1 HEADERS
+
+The Message-ID header is required.
+"Bytes", "Lines" and "Content-Length" headers are stripped and not
+allowed, as they can interfere with further processing.
+When using ssoma with public-inbox-mda, the "Status" mbox header
+is also stripped as that header makes no sense in a public archive.
+
+=head1 LOCKING
+
+L<flock(2)> locking exclusively locks the empty $GIT_DIR/ssoma.lock file
+for all non-atomic operations.
+
+=head1 EXAMPLE INPUT FLOW (SERVER-SIDE MDA)
+
+1. Message is delivered to a mail transport agent (MTA)
+
+1a. (optional) reject/discard spam, this should run before ssoma-mda
+
+1b. (optional) reject/strip unwanted attachments
+
+ssoma-mda handles all steps once invoked.
+
+2. Mail transport agent invokes ssoma-mda
+
+3. reads message via stdin, extracting Message-ID
+
+4. acquires exclusive flock lock on $GIT_DIR/ssoma.lock
+
+5. creates or updates the blob of associated 2/38 SHA-1 path
+
+6. updates the index and commits
+
+7. releases $GIT_DIR/ssoma.lock
+
+ssoma-mda can also be used as an L<inotify(7)> trigger to monitor maildirs,
+and the ability to monitor IMAP mailboxes using IDLE will be available
+in the future.
+
+=head1 GIT REPOSITORIES (SERVERS)
+
+ssoma uses bare git repositories on both servers and clients.
+
+Using the L<git-init(1)> command with --bare is the recommended method
+of creating a git repository on a server:
+
+	git init --bare /path/to/wherever/you/want.git
+
+There are no standardized paths for servers; administrators make
+all the choices regarding git repository locations.
+
+Special files in $GIT_DIR on the server:
+
+=over
+
+=item $GIT_DIR/ssoma.lock
+
+An empty file for L<flock(2)> locking.
+This is necessary to ensure the index and commits are updated
+consistently and multiple processes running MDA do not step on
+each other.
+
+=item $GIT_DIR/public-inbox/msgmap.sqlite3
+
+SQLite3 database maintaining a stable mapping of Message-IDs to NNTP
+article numbers.  Used by L<public-inbox-nntpd(1)> and created
+and updated by L<public-inbox-index(1)>.
+
+Automatically updated by L<public-inbox-mda(1)>,
+L<public-inbox-learn(1)> and L<public-inbox-watch(1)>.
+
+Losing or damaging this file will cause synchronization problems for
+NNTP clients.  This file is expected to be stable and require no
+updates to its schema.
+
+Requires L<DBD::SQLite>.
+
+=item $GIT_DIR/public-inbox/xapian$N/
+
+Xapian database for search indices in the PSGI web UI.
+
+$N is the value of PublicInbox::Search::SCHEMA_VERSION, and
+installations may have parallel versions on disk during upgrades
+or to roll-back upgrades.
+
+This is created and updated by L<public-inbox-index(1)>.
+
+Automatically updated by L<public-inbox-mda(1)>,
+L<public-inbox-learn(1)> and L<public-inbox-watch(1)>.
+
+This directory can always be regenerated with L<public-inbox-index(1)>
+if lost or damaged.  There is no need to back it up unless the
+CPU/memory cost of regenerating it outweighs the storage/transfer cost.
+
+Since SCHEMA_VERSION 15 and the development of the v2 format,
+the "overview" DB also exists in the xapian directory for v1
+repositories.  See L<public-inbox-v2-format(5)/OVERVIEW DB>.
+
+=item $GIT_DIR/ssoma.index
+
+This file is no longer used or created by public-inbox, but it is
+updated if it exists to remain compatible with ssoma installations.
+
+A git index file used for MDA updates.  The normal git index (in
+$GIT_DIR/index) is not used at all as there is typically no working
+tree.
+
+=back
+
+Each client $GIT_DIR may have multiple mbox/maildir/command targets.
+It is possible for a client to extract the mail stored in the git
+repository to multiple mboxes for compatibility with a variety of
+different tools.
+
+=head1 CAVEATS
+
+It is NOT recommended to check out the working directory of a git
+repository, as there may be many files.
+
+It is impossible to completely expunge messages, even spam, as git
+retains full history.  Projects may (with adequate notice) cycle to new
+repositories/branches with history cleaned up via L<git-filter-branch(1)>.
+This is up to the administrators.
+
+=head1 COPYRIGHT
+
+Copyright 2013-2019 all contributors L<mailto:meta@public-inbox.org>
+
+License: AGPL-3.0+ L<http://www.gnu.org/licenses/agpl-3.0.txt>
+
+=head1 SEE ALSO
+
+L<gitrepository-layout(5)>, L<ssoma(1)>
diff --git a/Documentation/public-inbox-v2-format.pod b/Documentation/public-inbox-v2-format.pod
new file mode 100644
index 0000000..05ef32a
--- /dev/null
+++ b/Documentation/public-inbox-v2-format.pod
@@ -0,0 +1,234 @@
+% public-inbox developer manual
+
+=head1 NAME
+
+public-inbox v2 repository description
+
+=head1 DESCRIPTION
+
+The v2 format is designed primarily to address several
+scalability problems of the original format described at
+L<public-inbox-v1-format(5)>.  It also handles messages with
+conflicting or multiple Message-IDs.
+
+=head1 INBOX LAYOUT
+
+The key change in v2 is the inbox is no longer a bare git
+repository, but a directory with two or more git repositories.
+v2 divides git repositories by time "epochs" and Xapian
+databases for parallelism by "partitions".
+
+=head2 INBOX OVERVIEW AND DEFINITIONS
+
+	$EPOCH - Integer starting with 0 based on time
+	$SCHEMA_VERSION - PublicInbox::Search::SCHEMA_VERSION used by Xapian
+	$PART - Integer (0..NPROCESSORS)
+
+	foo/ # assuming "foo" is the name of the list
+	- inbox.lock                 # lock file (flock) to protect global state
+	- git/$EPOCH.git             # normal git repositories
+	- all.git                    # empty git repo, alternates to git/$EPOCH.git
+	- xap$SCHEMA_VERSION/$PART   # per-partition Xapian DB
+	- xap$SCHEMA_VERSION/over.sqlite3 # OVER-view DB for NNTP and threading
+	- msgmap.sqlite3             # same as the v1 msgmap
+
+For blob lookups, the reader only needs to open the "all.git"
+repository with $GIT_DIR/objects/info/alternates which references
+every $EPOCH.git repo.
+
+Individual $EPOCH.git repos DO NOT use alternates themselves as
+git currently limits recursion of alternates nesting depth to 5.
+
+=head2 GIT EPOCHS
+
+One of the inherent scalability problems with git itself is the
+full history of a project must be stored and carried around to
+all clients.  To address this problem, the v2 format uses
+multiple git repositories, stored as time-based "epochs".
+
+We currently divide epochs into roughly one gigabyte segments;
+but this size can be configurable (if needed) in the future.
+
+A pleasant side-effect of this design is the git packs of older
+epochs are stable, allowing them to be cloned without requiring
+expensive pack generation.  This also allows clients to clone
+only the epochs they are interested in to save bandwidth and
+storage.
+
+To minimize changes to existing v1-based code and simplify our
+code, we use the "alternates" mechanism described in
+L<gitrepository-layout(5)> to link all the epoch repositories
+with a single read-only "all.git" endpoint.
+
+Processes retrieve blobs via the "all.git" repository, while
+writers write blobs directly to epochs.
+
+=head2 GIT TREE LAYOUT
+
+One key problem specific to v1 was large trees were frequently a
+performance problem as name lookups are expensive and there were
+limited deltafication opportunities with unpredictable file
+names.  As a result, all Xapian-enabled installations retrieve
+blob object_ids directly in v1, bypassing tree lookups.
+
+While dividing git repositories into epochs caps the growth of
+trees, worst-case tree size was still unnecessary overhead and
+worth eliminating.
+
+So in contrast to the big trees of v1, the v2 git tree contains
+only a single file at the top-level of the tree, either 'm' (for
+'mail' or 'message') or 'd' (for deleted).  A tree does not have
+'m' and 'd' at the same time.
+
+Mail is still stored in blobs (instead of inline with the commit
+object) as we still need a stable reference in the indices in
+case commit history is rewritten to comply with legal
+requirements.
+
+After-the-fact invocations of L<public-inbox-index(1)> will ignore
+messages written to 'd' after they are written to 'm'.
+
+Deltafication is not significantly improved over v1, but overall
+storage for trees is made as small as possible.  Initial
+statistics and benchmarks showing the benefits of this approach
+are documented at:
+
+L<https://public-inbox.org/meta/20180209205140.GA11047@dcvr/>
+
+=head2 XAPIAN PARTITIONS
+
+A second scalability problem in v1 was the inability to
+utilize multiple CPU cores for Xapian indexing.  This is
+addressed by using partitions in Xapian to perform import
+indexing in parallel.
+
+As with git alternates, Xapian natively supports a read-only
+interface which transparently abstracts away the knowledge of
+multiple partitions.  This allows us to simplify our read-only
+code paths.
+
+The performance of the storage device is now the bottleneck on
+larger multi-core systems.  In our experience, performance
+improves with high-quality and high-quantity solid-state storage.
+Issuing TRIM commands with L<fstrim(8)> was necessary to maintain
+consistent performance while developing this feature.
+
+Rotational storage devices are NOT recommended for indexing of
+large mail archives; but are fine for backup and usable for
+small instances.
+
+=head2 OVERVIEW DB
+
+Towards the end of v2 development, it became apparent Xapian did
+not perform well for sorting large result sets used to generate
+the landing page in the PSGI UI (/$INBOX/) or many queries used
+by the NNTP server.  Thus, SQLite was employed and the Xapian
+"skeleton" DB was renamed to the "overview" DB (after the NNTP
+OVER/XOVER commands).
+
+The overview DB maintains all the header information necessary
+to implement the NNTP OVER/XOVER commands and non-search
+endpoints of the PSGI UI.
+
+In the future, Xapian will become completely optional for v2 (as
+it is for v1) as SQLite turns out to be powerful enough to
+maintain overview information.  Most of the PSGI and all of the
+NNTP functionality will be possible with only SQLite in addition
+to git.
+
+The overview DB was an instrumental piece in maintaining near
+constant-time read performance on a dataset 2-3 times larger
+than LKML history as of 2018.
+
+=head3 GHOST MESSAGES
+
+The overview DB also includes references to "ghost" messages,
+or messages which have replies but have not been seen by us.
+Thus it is expected to have more rows than the "msgmap" DB
+described below.
+
+=head2 msgmap.sqlite3
+
+The SQLite msgmap DB is unchanged from v1, but it is now at the
+top-level of the directory.
+
+=head1 OBJECT IDENTIFIERS
+
+There are three distinct types of identifiers.  content_id is the
+new one for v2 and should make message removal and deduplication
+easier.  object_id and Message-ID are already known.
+
+=over
+
+=item object_id
+
+The blob identifier git uses (currently SHA-1).  No need to
+publicly expose this outside of normal git ops (cloning) and
+there's no need to make this searchable.  As with v1 of
+public-inbox, this is stored as part of the Xapian document so
+expensive name lookups can be avoided for document retrieval.
+
+=item Message-ID
+
+The email header; duplicates allowed for archival purposes.
+This remains a searchable field in Xapian.  Note: it's possible
+for emails to have multiple Message-ID headers (and L<git-send-email(1)>
+had that bug for a bit); so we take all of them into account.
+In case of conflicts detected by content_id below, we generate a new
+Message-ID based on content_id; if the generated Message-ID still
+conflicts, a random one is generated.
+
+=item content_id
+
+A hash of relevant headers and raw body content for
+purging of unwanted content.  This is not stored anywhere,
+but always calculated on-the-fly.
+
+For now, the relevant headers are:
+
+	Subject, From, Date, References, In-Reply-To, To, Cc
+
+Received, List-Id, and similar headers are NOT part of content_id as
+they differ across lists and we will want removal to be able to cross
+lists.
+
+The textual parts of the body are decoded, CRLF normalized to
+LF, and trailing whitespace stripped.  Notably, hashing the
+raw body risks being broken by list signatures; but we can use
+filters (e.g. PublicInbox::Filter::Vger) to clean the body for
+imports.
+
+content_id is SHA-256 for now; but can be changed at any time
+without making DB changes.
+
+=back
+
+=head1 LOCKING
+
+L<flock(2)> locking exclusively locks the empty inbox.lock file
+for all non-atomic operations.
+
+=head1 HEADERS
+
+Same handling as with v1, except the Message-ID header will
+be generated if not provided or conflicting.  "Bytes", "Lines"
+and "Content-Length" headers are stripped and not allowed, as they
+can interfere with further processing.
+
+The "Status" mbox header is also stripped as that header makes
+no sense in a public archive.
+
+=head1 THANKS
+
+Thanks to the Linux Foundation for sponsoring the development
+and testing of the v2 repository format.
+
+=head1 COPYRIGHT
+
+Copyright 2018-2019 all contributors L<mailto:meta@public-inbox.org>
+
+License: AGPL-3.0+ L<http://www.gnu.org/licenses/agpl-3.0.txt>
+
+=head1 SEE ALSO
+
+L<gitrepository-layout(5)>, L<public-inbox-v1-format(5)>
diff --git a/INSTALL b/INSTALL
index 3fe0e4f..aa4afb5 100644
--- a/INSTALL
+++ b/INSTALL
@@ -2,7 +2,7 @@ public-inbox (server-side) installation
 ---------------------------------------
 
 This is for folks who want to setup their own public-inbox instance.
-Clients should see https://ssoma.public-inbox.org/INSTALL.html instead
+Clients should use normal git-clone/git-fetch, or NNTP clients
 if they want to import mail into their personal inboxes.
 
 TODO: this still needs to be documented better,
@@ -134,5 +134,5 @@ installation is complete.
 Copyright
 ---------
 
-Copyright 2013-2018 all contributors <meta@public-inbox.org>
+Copyright 2013-2019 all contributors <meta@public-inbox.org>
 License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
diff --git a/MANIFEST b/MANIFEST
index f25a580..d56cd85 100644
--- a/MANIFEST
+++ b/MANIFEST
@@ -16,6 +16,8 @@ Documentation/public-inbox-index.pod
 Documentation/public-inbox-mda.pod
 Documentation/public-inbox-nntpd.pod
 Documentation/public-inbox-overview.pod
+Documentation/public-inbox-v1-format.pod
+Documentation/public-inbox-v2-format.pod
 Documentation/public-inbox-watch.pod
 Documentation/txt2pre
 HACKING
diff --git a/README b/README
index 26e0b69..ffd433d 100644
--- a/README
+++ b/README
@@ -22,8 +22,9 @@ to run their own instances with minimal overhead.
 Implementation
 --------------
 
-public-inbox stores mail in a git repository keyed by Message-ID
-as documented in: https://ssoma.public-inbox.org/ssoma_repository.txt
+public-inbox stores mail in git repositories as documented
+in https://public-inbox.org/public-inbox-v2-format.txt and
+https://public-inbox.org/public-inbox-v1-format.txt
 
 By storing (and optionally) exposing an inbox via git, it is
 fast and efficient to host and mirror public-inboxes.
@@ -35,10 +36,10 @@ discussions if archives do not expose Message-ID and References
 headers.  List server admins are also burdened with delivery
 failures.
 
-public-inbox uses the "pull" model.  Casual readers may also
+public-inbox uses the "pull" model.  Casual readers may
 follow the list via NNTP, Atom feed or HTML archives.
 
-If a reader loses interest, they simply stop syncing.
+If a reader loses interest, they simply stop following.
 
 Since we use git, mirrors are easy-to-setup, and lists are
 easy-to-relocate to different mail addresses without losing
@@ -75,6 +76,9 @@ Requirements (participant)
   their mailers to reduce the impact of a public-inbox as a
   single point of failure.
 
+* The HTTP web interface exposes mboxrd files, and NNTP clients often
+  feature reply-by-email functionality.
+
 * participants do not need to install public-inbox, only server admins
 
 Requirements (server)
@@ -123,10 +127,6 @@ You may also clone all messages via git:
 	git clone --mirror https://public-inbox.org/meta/
 	torsocks git clone --mirror http://hjrcffqmbrq6wope.onion/meta/
 
-Or pass the same git repository URL for ssoma using the instructions at:
-
-	https://ssoma.public-inbox.org/README.html
-
 Anti-Spam
 ---------
 
@@ -140,7 +140,7 @@ Content Filtering
 -----------------
 
 To discourage phishing, trackers, exploits and other nuisances,
-only plain-text emails are allowed and HTML is rejected.
+only plain-text emails are allowed and HTML is rejected by default.
 This improves accessibility, and saves bandwidth and storage
 as mail is archived forever.
 
@@ -151,7 +151,7 @@ aims to preserve the focus on content, and not presentation.
 Copyright
 ---------
 
-Copyright 2013-2018 all contributors <meta@public-inbox.org>
+Copyright 2013-2019 all contributors <meta@public-inbox.org>
 License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
 
 This program is free software: you can redistribute it and/or modify
diff --git a/lib/PublicInbox/Import.pm b/lib/PublicInbox/Import.pm
index 3df7d98..29c482f 100644
--- a/lib/PublicInbox/Import.pm
+++ b/lib/PublicInbox/Import.pm
@@ -1,4 +1,4 @@
-# Copyright (C) 2016-2018 all contributors <meta@public-inbox.org>
+# Copyright (C) 2016-2019 all contributors <meta@public-inbox.org>
 # License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
 #
 # git fast-import-based ssoma-mda MDA replacement
@@ -635,8 +635,8 @@ =head1 SYNOPSYS
 =head1 DESCRIPTION
 
 An importer and remover for public-inboxes which takes L<Email::MIME>
-messages as input and stores them in a ssoma repository as
-documented in L<https://ssoma.public-inbox.org/ssoma_repository.txt>,
+messages as input and stores them in a git repository as
+documented in L<https://public-inbox.org/public-inbox-v1-format.txt>,
 except it does not allow duplicate Message-IDs.
 
 It requires L<git(1)> and L<git-fast-import(1)> to be installed.
-- 
EW

