Date: Thu, 19 Apr 2018 01:58:13 +0000
From: Eric Wong
To: meta@public-inbox.org
Subject: v2 development notes
Message-ID: <20180419015813.GA20051@dcvr>

Just some notes I was maintaining along the way.  It should
probably be turned into .pod for a manpage at some point...

public-inbox repository layout v2
---------------------------------

First off, Xapian remains a swappable component for another
search engine; but for now, only Xapian (along with SQLite) is
supported and required for v2 to work.

The basic idea is that v2 won't be git-centered; in other words,
it will not be a bare git repository polluted by
public-inbox-specific files.  Instead, one or more bare git
repositories will exist within the hierarchy.

Partitioning
------------

There are two types of partitioning done in v2 to address
different performance problems with the original v1 (ssoma)
inboxes.

The first is size/time-based partitioning based on epoch.  Each
git repository is limited to roughly 1G (to be made configurable,
later).  Once a git repository hits its size threshold, a new one
is created and new messages only go to it.

For the server administrator, this has a pleasant side effect of
limiting pack sizes and clone times.  It also allows mirrors to
do a partial mirror for inboxes spanning several git repos.

The second partition type is by CPU core count.  Xapian indexing
is an expensive operation and consumes a significant amount of
CPU time.
Since multi-core CPUs are common nowadays, we split Xapian
indexing across multiple cores.  Fortunately, its read-only
interface can transparently abstract away the multiple
partitions.

object identifiers
------------------

There will be three distinct types of identifiers.  content_id is
the new one for v2 and should make message removal and
deduplication easier.  object_id and Message-ID are already known.

* object_id - the blob identifier git uses (currently SHA-1)

  No need to publicly expose this outside of normal git ops
  (cloning) and there's no need to make this searchable.
  As with v1 of public-inbox, this will be stored as part of
  the Xapian document so expensive name lookups can be avoided
  for document retrieval.

* Message-ID - the email header; duplicates allowed for archival
  purposes.  Needs to be a searchable field in Xapian.

  Note: it's possible for emails to have multiple Message-ID
  headers (and git-send-email(1) had that bug for a bit); so we
  take all of them into account.  In case of conflicts detected
  by content_id below, we generate a new Message-ID based on
  content_id; if the generated Message-ID still conflicts, a
  random one is generated.

* content_id - a hash of relevant headers and raw body content,
  used for purging of unwanted content.  This is not stored
  anywhere, but calculated on-the-fly.

  For now, the relevant headers are:

      Subject, From, Date, References, In-Reply-To, To, Cc

  Received, List-Id, and similar headers are NOT part of
  content_id as they differ across lists and we will want removal
  to be able to cross lists.

  The textual parts of the body are decoded, CRLF normalized to
  LF, and trailing whitespace stripped.  Notably, hashing the raw
  body risks being broken by list signatures; but we can use
  filters (e.g. PublicInbox::Filter::Vger) to clean the body for
  imports.

  The hash is SHA-256 for now; since content_id is never stored,
  it can be changed at any time without DB changes.
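
A rough sketch of how content_id could be computed from the
description above, in Python rather than Perl.  This is not the
actual public-inbox code: the function names, the order and
framing in which header names and values are fed to the digest,
the NUL separators, and the reading of "trailing whitespace
stripped" as per-line plus end-of-body stripping are all
assumptions made for illustration.

```python
import hashlib
from email import message_from_bytes, policy

# Headers the notes list as relevant to content_id.
RELEVANT_HEADERS = ("Subject", "From", "Date", "References",
                    "In-Reply-To", "To", "Cc")


def normalize_text(text):
    """Normalize CRLF to LF and strip trailing whitespace.

    Assumption: whitespace is stripped at the end of every line
    and at the end of the body.
    """
    text = text.replace("\r\n", "\n")
    return "\n".join(line.rstrip() for line in text.split("\n")).rstrip()


def content_id(raw):
    """Return a hex SHA-256 over the relevant headers and the body.

    Received, List-Id and other per-list headers are never read,
    so copies of one message from different lists hash the same.
    """
    msg = message_from_bytes(raw, policy=policy.default)
    dig = hashlib.sha256()

    # Fixed iteration order, so header order in the mail is irrelevant.
    for name in RELEVANT_HEADERS:
        for value in msg.get_all(name, []):
            field = f"{name.lower()}\0{str(value).strip()}\0"
            dig.update(field.encode("utf-8", errors="replace"))

    for part in msg.walk():
        if part.is_multipart():
            continue
        if part.get_content_maintype() == "text":
            # Decoded per the part's charset, then normalized.
            body = normalize_text(part.get_content())
            dig.update(b"b\0" + body.encode("utf-8", errors="replace") + b"\0")
        else:
            # Non-text parts are hashed as decoded bytes (assumption).
            dig.update(b"b\0" + (part.get_payload(decode=True) or b"") + b"\0")

    return dig.hexdigest()
```

With this sketch, two copies of a message that differ only in
Received/List-Id headers, line endings, or trailing whitespace
yield the same content_id, which is what cross-list removal
needs.  Nothing is stored, so swapping hashlib.sha256 for another
digest would not require touching any DB.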

repository layout
-----------------

$EPOCH - Integer starting with 0 based on time
$SCHEMA_VERSION - SCHEMA_VERSION used by Xapian, we'll inherit
                  and start with '14' from v1.0.0
$PART - Integer (0..NPROCESSORS)

foo/                              # assuming "foo" is the name of the list
- inbox.lock                      # lock file (flock) to protect global state
- git/$EPOCH.git                  # normal git repositories
- all.git                         # empty git repo, alternates to git/$EPOCH.git
- xap$SCHEMA_VERSION/$PART        # per-partition Xapian DB
- xap$SCHEMA_VERSION/over.sqlite3 # OVER-view DB for NNTP and threading

For blob lookups, the reader only needs to open the "all.git"
repository, whose $GIT_DIR/objects/info/alternates references
every $EPOCH.git repo.

Individual $EPOCH.git repos DO NOT use alternates themselves as
git currently limits recursion of alternates nesting depth to 5.

git tree layout
---------------

During the original (v1) development, large trees were frequently
a performance problem as name lookups are expensive and there
were limited deltafication opportunities.

Unlike the ssoma-based layout in v1, the v2 git tree contains
only a single file at the top-level of the tree, either 'm' (for
'mail' or 'message') or 'd' (for deleted).  Mail is still stored
in blobs (instead of inline with the commit object) as we still
need a stable reference in the indices in case history is
rewritten to comply with legal requirements.

After-the-fact invocations of public-inbox-index will ignore
messages written to 'd' after they are written to 'm'.

Deltafication is not significantly improved over v1, but overall
storage for trees is greatly reduced.

https://public-inbox.org/meta/20180209205140.GA11047@dcvr/T/

Overview DB
-----------

Late into v2 development, it became apparent that Xapian did not
perform well when sorting the large result sets used to generate
the landing page in the PSGI UI (/$INBOX/) or for many of the
queries used by the NNTP server.
Thus, SQLite was employed and the Xapian "skeleton" DB was
renamed to the "overview" DB (after the NNTP XOVER/OVER commands).

In the future, Xapian will become optional for v2.  Most of the
PSGI and all of the NNTP functionality will be possible with only
SQLite in addition to git.

https://public-inbox.org/meta/20180402000456.13446-1-e@80x24.org/T/
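
To illustrate the kind of queries the overview DB is meant to
serve, here is a small Python sketch using the standard sqlite3
module.  The table name, column names (num, ts, mid, subject) and
helper functions are hypothetical and do not reflect the real
over.sqlite3 schema; the point is only that an indexed timestamp
and an integer article number make "most recent N" and "articles
in a range" lookups simple sorted scans.

```python
import sqlite3


def open_overview(path=":memory:"):
    """Open (or create) a toy overview DB; the schema is hypothetical."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS overview ("
        " num INTEGER PRIMARY KEY,"   # NNTP article number
        " ts INTEGER NOT NULL,"       # message date as Unix time
        " mid TEXT NOT NULL,"         # Message-ID
        " subject TEXT)"
    )
    # Index on the timestamp so date-sorted queries avoid a full sort.
    db.execute("CREATE INDEX IF NOT EXISTS idx_overview_ts ON overview (ts)")
    return db


def add_message(db, num, ts, mid, subject):
    """Record one message's overview data."""
    db.execute(
        "INSERT INTO overview (num, ts, mid, subject) VALUES (?, ?, ?, ?)",
        (num, ts, mid, subject),
    )


def recent(db, limit=25):
    """Newest messages first, as a landing page might list them."""
    cur = db.execute(
        "SELECT num, subject FROM overview ORDER BY ts DESC LIMIT ?",
        (limit,),
    )
    return cur.fetchall()


def article_range(db, lo, hi):
    """Overview rows for an inclusive article number range (XOVER-style)."""
    cur = db.execute(
        "SELECT num, mid, subject FROM overview"
        " WHERE num BETWEEN ? AND ? ORDER BY num",
        (lo, hi),
    )
    return cur.fetchall()
```

Neither helper touches Xapian or git, which is in line with the
goal of serving NNTP and most of the PSGI UI from SQLite plus git
alone.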