* Relationship between public-inbox and ssoma? @ 2018-03-05 0:54 Nicolás Ojeda Bär 2018-03-05 2:07 ` Eric Wong 0 siblings, 1 reply; 12+ messages in thread From: Nicolás Ojeda Bär @ 2018-03-05 0:54 UTC (permalink / raw) To: meta Hello, Thanks very much for this great project. I am a bit puzzled about the difference between public-inbox and ssoma. In particular: - What is the difference between public-inbox-mda and ssoma-mda ? - Are the git repository formats the same for public-inbox and ssoma ? Any comments appreciated. Thanks a lot! Best wishes, Nicolás ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Relationship between public-inbox and ssoma? 2018-03-05 0:54 Relationship between public-inbox and ssoma? Nicolás Ojeda Bär @ 2018-03-05 2:07 ` Eric Wong 2018-03-05 11:45 ` Nicolás Ojeda Bär 2018-03-15 15:30 ` internal format (was: Relationship between public-inbox and ssoma?) Stefan Monnier 0 siblings, 2 replies; 12+ messages in thread From: Eric Wong @ 2018-03-05 2:07 UTC (permalink / raw) To: Nicolás Ojeda Bär; +Cc: meta Nicolás Ojeda Bär <n.oje.bar@gmail.com> wrote: > Hello, > > Thanks very much for this great project. > > I am a bit puzzled about the difference between public-inbox and ssoma. In particular: > > - What is the difference between public-inbox-mda and ssoma-mda ? public-inbox-mda is more suitable for public endpoints where it's the primary entry point for a publically-shared mail. ssoma-mda is/was intended for personal mail. Originally, public-inbox depended on and used ssoma, but that was given up for more performance. Sidenote: I don't recommend public-inbox-mda for running _mirrors_ of existing mailing lists since it's stricter than what most lists accept. public-inbox-watch is more lenient and more performant (on Linux with inotify, at least); so I wrote it for mirroring. > - Are the git repository formats the same for public-inbox and ssoma ? Currently they are the same with one exception: ssoma allows two different messages (different blob SHA-1) to have the same Message-Id by default; public-inbox (current version) does not. (ssoma-mda has a "-1" option to disable duplicate Message-Id). The work-in-progress "v2" public-inbox format diverges and I don't currently have plans to port ssoma to use it. The v1 format will remain supported in public-inbox. I'm not sure if ssoma is worth the effort any more, as it's too much effort to promote a new sync protocol (even if based on git). I'd rather improve NNTP servers and clients as an option for people to read public inboxes. > Any comments appreciated. > > Thanks a lot! 
No problem, thanks for your interest.
* Re: Relationship between public-inbox and ssoma? 2018-03-05 2:07 ` Eric Wong @ 2018-03-05 11:45 ` Nicolás Ojeda Bär 2018-03-05 17:50 ` Eric Wong 2018-03-15 15:30 ` internal format (was: Relationship between public-inbox and ssoma?) Stefan Monnier 1 sibling, 1 reply; 12+ messages in thread From: Nicolás Ojeda Bär @ 2018-03-05 11:45 UTC (permalink / raw) To: Eric Wong; +Cc: meta Hello Eric, Thanks for the prompt reply. I am trying to migrate a long-lived mailing list (65k messages over 26 years), below are some troubles/questions I am having; any suggestions would be greatly appreciated. - public-inbox-watch seems to struggle with very big maildirs; for now I am moving the data into the maildir a little at a time and that seems to work. Is there a particular obstacle to making the importing process more incremental? - Trouble due to missing/malformed headers (mostly on very old messages). For example, here is the header of a message that trips public-inbox-watch: From weis@margaux Fri Nov 27 16:24:50 1992 Received: by margaux.inria.fr, Fri, 27 Nov 92 16:24:50 +0100 Message-ID: <9211271524.AA29971@margaux.inria.fr> To: caml-list@margaux Sender: weis@margaux Status: O The error is: fatal: Invalid rfc2822 date "" in ident: <> (I guess due to the lack of a Date: field). I added a Date: field just to test and noticed that Author: in the git commit was empty, I guess due to the use of Sender: rather than From: header. Do you think it is feasible to improve public-inbox-watch to try to extract the date from some other header like above? and to use Sender: when From: is not found? - There are some messages that do not have Message-Id, but public-inbox-watch seems to be able to handle them. Is it the case that Date: is the only header that is absolutely necessary for public-inbox-watch to process the message? - Does public-inbox-watch ever modify the message data? 
- In general public-inbox-watch prints very little about what it is doing, which makes it hard(er) to trace problems; a verbose flag would be a nice addition, I think. Thanks! Best wishes, Nicolás
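The two fallbacks asked about above (take the date from another header when Date: is missing, and use Sender: when From: is missing) could look roughly like the sketch below. It is illustrative Python built on the standard `email` package, not public-inbox's Perl code. The names `parse_date`, `message_date`, `message_author` and the `DATE_RE` pattern are hypothetical, and trying the Received: headers from the top down is an assumption. The sample headers are the ones quoted in this message, minus the mbox "From " line.

```python
import re
from datetime import datetime, timezone
from email import message_from_string
from email.utils import mktime_tz, parsedate_tz

# Hypothetical pattern: an RFC 2822-style date with a numeric zone,
# wherever it appears inside a Received: header.
DATE_RE = re.compile(
    r"(?:[A-Za-z]{3},\s*)?\d{1,2}\s+[A-Za-z]{3}\s+\d{2,4}\s+"
    r"\d{1,2}:\d{2}(?::\d{2})?\s+[+-]\d{4}"
)


def parse_date(text):
    """Return an aware UTC datetime, or None if `text` is not a usable date."""
    parsed = parsedate_tz(text or "")
    if parsed is None:
        return None
    return datetime.fromtimestamp(mktime_tz(parsed), timezone.utc)


def message_date(msg):
    """Prefer Date:, then fall back to the first parseable Received: date."""
    when = parse_date(msg["Date"])
    if when is not None:
        return when
    for received in msg.get_all("Received", []):
        found = DATE_RE.search(received)
        when = parse_date(found.group(0)) if found else None
        if when is not None:
            return when
    return None


def message_author(msg):
    """Prefer From:, then fall back to Sender:."""
    return msg["From"] or msg["Sender"]


raw = (
    "Received: by margaux.inria.fr, Fri, 27 Nov 92 16:24:50 +0100\n"
    "Message-ID: <9211271524.AA29971@margaux.inria.fr>\n"
    "To: caml-list@margaux\n"
    "Sender: weis@margaux\n"
    "Status: O\n"
    "\n"
)
msg = message_from_string(raw)
print(message_date(msg).isoformat())
print(message_author(msg))
```

A message with neither a Date: nor a parseable Received: header yields `None` here; what an importer should do in that case is left open.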
* Re: Relationship between public-inbox and ssoma? 2018-03-05 11:45 ` Nicolás Ojeda Bär @ 2018-03-05 17:50 ` Eric Wong 2018-03-05 18:06 ` Nicolás Ojeda Bär 0 siblings, 1 reply; 12+ messages in thread From: Eric Wong @ 2018-03-05 17:50 UTC (permalink / raw) To: Nicolás Ojeda Bär; +Cc: meta Nicolás Ojeda Bär <n.oje.bar@gmail.com> wrote: > Hello Eric, > > Thanks for the prompt reply. I am trying to migrate a long-lived > mailing list (65k messages over 26 years), below are some > troubles/questions I am having; > any suggestions would be greatly appreciated. > > - public-inbox-watch seems to struggle with very big maildirs; for now > I am moving the data into the maildir a little at a time and that > seems to work. Is there a particular obstacle > to making the importing process more incremental? Do you know if it's SpamAssassin being slow? I disable network checks for large imports in ~/.spamassassin/user_prefs (if I'm using SA at all during the imports): # uncomment the following for importing archives: # dns_available no # skip_rbl_checks 1 # skip_uribl_checks 1 Fwiw, large directories are a performance killer in any application. Seek times and cache overheads are two problems, at least, so an SSD will definitely help; and maybe even shorter filenames. I usually prefer one-off scripts like scripts/import_vger_from_mbox for initial imports and store large archives in compressed mboxes instead of Maildir. Lack of mbox support is one reason I never used notmuch despite studying it. > - Trouble due to missing/malformed headers (mostly on very old > messages). For example, here is the header of a message that trips > public-inbox-watch: > > From weis@margaux Fri Nov 27 16:24:50 1992 > Received: by margaux.inria.fr, Fri, 27 Nov 92 16:24:50 +0100 > Message-ID: <9211271524.AA29971@margaux.inria.fr> > To: caml-list@margaux > Sender: weis@margaux > Status: O > > The error is: fatal: Invalid rfc2822 date "" in ident: <> (I guess > due to the lack of a Date: field). 
I added a Date: field just to test
> and
> noticed that Author: in the git commit was empty, I guess due to the
> use of Sender: rather than From: header.

I have a patch in the wings to use the Received: date:

https://public-inbox.org/meta/20180215110840.30413-16-e@80x24.org/raw

And I'm thinking about favoring Received: over Date: if both
exist, since Date: headers are more often wrong...

> Do you think it is feasible to improve public-inbox-watch to try to
> extract the date from some other header like above?
> and to use Sender: when From: is not found?

Sure, I suppose falling back to Sender is correct if From is
missing.

> - There are some messages that do not have Message-Id, but
> public-inbox-watch seems to be able to handle them.

Yes, we generate a Message-Id if one is missing.

> Is it the case that Date: is the only header that is absolutely
> necessary for public-inbox-watch to process the message?

Probably none of them are, actually.

> - Does public-inbox-watch ever modify the message data?

Message-ID generation is one modification (when the header is missing).
Status, Lines, Bytes, Content-Length, and @BAD_HEADERS in
lib/PublicInbox/MDA.pm are all dropped:

    our @BAD_HEADERS = (
        # postfix
        qw(delivered-to x-original-to), # prevent training loops

        # The rest are taken from Mailman 2.1.15:
        # could contain passwords:
        qw(approved approve x-approved x-approve urgent),
        # could be used phishing:
        qw(return-receipt-to disposition-notification-to x-confirm-reading-to),
        # Pegasus mail:
        qw(x-pmrqc)
    );

Email::MIME might modify invalid characters in the headers (or
if there are bugs in Email::MIME). I don't think bodies are
modified outside of the not-really-documented
PublicInbox::Filter API.
You can check out some filters at lib/PublicInbox/Filter/*.pm (some commit messages document them, but I don't think there's manpages, yet) > - In general public-inbox-watch prints very little about what it is > doing, which makes it hard(er) to trace problems; a verbose flag would > be a nice > addition, I think. I usually use strace on Linux to track down problems. I'm not sure it's worth the effort to introduce new options/features if generic tracing utilities are more detailed and accurate. Also, I'm going to be mostly offline for about a week starting tomorrow; so don't expect prompt replies for a bit. ^ permalink raw reply [flat|nested] 12+ messages in thread
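For readers who want to see the header dropping described in this message in isolation, here is a minimal sketch. It is Python using the standard `email` package, not the Perl in lib/PublicInbox/MDA.pm. The header names are copied from the list quoted above; `scrub_headers` and `MBOX_HEADERS` are hypothetical names.

```python
from email import message_from_string

# Names copied from the @BAD_HEADERS list quoted in this message.
BAD_HEADERS = (
    "delivered-to", "x-original-to",
    "approved", "approve", "x-approved", "x-approve", "urgent",
    "return-receipt-to", "disposition-notification-to",
    "x-confirm-reading-to",
    "x-pmrqc",
)
# The other headers the message says are dropped.
MBOX_HEADERS = ("status", "lines", "bytes", "content-length")


def scrub_headers(msg):
    """Delete every occurrence of the unwanted headers, in place."""
    for name in BAD_HEADERS + MBOX_HEADERS:
        del msg[name]  # case-insensitive; no error if the header is absent
    return msg


raw = (
    "Message-ID: <9211271524.AA29971@margaux.inria.fr>\n"
    "To: caml-list@margaux\n"
    "Sender: weis@margaux\n"
    "Status: O\n"
    "\n"
    "body\n"
)
msg = scrub_headers(message_from_string(raw))
print(msg.keys())
```

Only headers are touched; the body is passed through unchanged, which matches the statement above that bodies are not modified outside of the filter API.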
* Re: Relationship between public-inbox and ssoma? 2018-03-05 17:50 ` Eric Wong @ 2018-03-05 18:06 ` Nicolás Ojeda Bär 2018-03-19 7:43 ` watch performance [was: Relationship between public-inbox and ssoma?] Eric Wong 0 siblings, 1 reply; 12+ messages in thread From: Nicolás Ojeda Bär @ 2018-03-05 18:06 UTC (permalink / raw) To: Eric Wong; +Cc: meta Hi Eric, Thanks for the quick reply. On Mon, Mar 5, 2018 at 6:50 PM, Eric Wong <e@80x24.org> wrote: > Nicolás Ojeda Bär <n.oje.bar@gmail.com> wrote: >> Hello Eric, >> >> Thanks for the prompt reply. I am trying to migrate a long-lived >> mailing list (65k messages over 26 years), below are some >> troubles/questions I am having; >> any suggestions would be greatly appreciated. >> >> - public-inbox-watch seems to struggle with very big maildirs; for now >> I am moving the data into the maildir a little at a time and that >> seems to work. Is there a particular obstacle >> to making the importing process more incremental? > > Do you know if it's SpamAssassin being slow? > > I disable network checks for large imports in ~/.spamassassin/user_prefs > (if I'm using SA at all during the imports): > # uncomment the following for importing archives: > # dns_available no > # skip_rbl_checks 1 > # skip_uribl_checks 1 I don't think it is even installed and I have not set it up at all, so probably not. > Fwiw, large directories are a performance killer in any > application. Seek times and cache overheads are two problems, > at least, so an SSD will definitely help; and maybe even shorter > filenames. OK. > I usually prefer one-off scripts like > scripts/import_vger_from_mbox for initial imports and store > large archives in compressed mboxes instead of Maildir. Lack of > mbox support is one reason I never used notmuch despite studying > it. Thanks for the pointer, I will take a look, hopefully it will nudge me in the right direction. >> - Trouble due to missing/malformed headers (mostly on very old >> messages). 
For example, here is the header of a message that trips >> public-inbox-watch: >> >> From weis@margaux Fri Nov 27 16:24:50 1992 >> Received: by margaux.inria.fr, Fri, 27 Nov 92 16:24:50 +0100 >> Message-ID: <9211271524.AA29971@margaux.inria.fr> >> To: caml-list@margaux >> Sender: weis@margaux >> Status: O >> >> The error is: fatal: Invalid rfc2822 date "" in ident: <> (I guess >> due to the lack of a Date: field). I added a Date: field just to test >> and >> noticed that Author: in the git commit was empty, I guess due to the >> use of Sender: rather than From: header. > > I have a patch in the wings to use the Received: date: > > https://public-inbox.org/meta/20180215110840.30413-16-e@80x24.org/raw > > And I'm thinking about favoring Received: over Date: if both > exist, since Date: headers are more often wrong... Great, I will try your patch to see if I can get my messages past public-inbox-watch. >> Do you think it is feasible to improve public-inbox-watch to try to >> extract the date from some other header like above? >> and to use Sender: when From: is not found? > > Sure, I suppose falling back to Sender is correct if From is > missing. OK, I will see if I can patch this on my own, since I am keen on getting this mailing list imported. >> - There are some messages that do not have Message-Id, but >> public-inbox-watch seems to be able to handle them. > > Yes, we generate a Message-Id if one is missing > >> Is it the case that Date: is the only header that is absolutely >> necessary for public-inbox-watch to process the message? > > Probably none of them are, actually. Currently, public-inbox-watch refuses to process the message with the header quoted above due to a missing Date: header. >> - Does public-inbox-watch ever modify the message data? > > Message-ID generation is one that's generated.
> Status, Lines, Bytes, Content-Length, and @BAD_HEADERS in > lib/PublicInbox/MDA.pm are all dropped: > > our @BAD_HEADERS = ( > # postfix > qw(delivered-to x-original-to), # prevent training loops > > # The rest are taken from Mailman 2.1.15: > # could contain passwords: > qw(approved approve x-approved x-approve urgent), > # could be used phishing: > qw(return-receipt-to disposition-notification-to x-confirm-reading-to), > # Pegasus mail: > qw(x-pmrqc) > ); > > Email::MIME might modify invalid characters in the headers (or > if there's bugs in Email::MIME). I don't think bodies are > modified outside of the not-really-documented > PublicInbox::Filter API. You can check out some filters at > lib/PublicInbox/Filter/*.pm (some commit messages document them, > but I don't think there's manpages, yet) OK, will take a look. >> - In general public-inbox-watch prints very little about what it is >> doing, which makes it hard(er) to trace problems; a verbose flag would >> be a nice >> addition, I think. > > I usually use strace on Linux to track down problems. I'm not > sure it's worth the effort to introduce new options/features > if generic tracing utilities are more detailed and accurate. > Makes sense. Thanks for the suggestion. > Also, I'm going to be mostly offline for about a week starting > tomorrow; so don't expect prompt replies for a bit. Sure, thanks for the heads-up. Best wishes, Nicolás ^ permalink raw reply [flat|nested] 12+ messages in thread
* watch performance [was: Relationship between public-inbox and ssoma?] 2018-03-05 18:06 ` Nicolás Ojeda Bär @ 2018-03-19 7:43 ` Eric Wong 0 siblings, 0 replies; 12+ messages in thread From: Eric Wong @ 2018-03-19 7:43 UTC (permalink / raw) To: Nicolás Ojeda Bär; +Cc: meta Nicolás Ojeda Bär <n.oje.bar@gmail.com> wrote: > On Mon, Mar 5, 2018 at 6:50 PM, Eric Wong <e@80x24.org> wrote: > > Nicolás Ojeda Bär <n.oje.bar@gmail.com> wrote: > >> Hello Eric, > >> > >> Thanks for the prompt reply. I am trying to migrate a long-lived > >> mailing list (65k messages over 26 years), below are some > >> troubles/questions I am having; > >> any suggestions would be greatly appreciated. > >> > >> - public-inbox-watch seems to struggle with very big maildirs; for now > >> I am moving the data into the maildir a little at a time and that > >> seems to work. Is there a particular obstacle > >> to making the importing process more incremental? Heh, I've been adjusting some of that code to support v2 and -watch has actually been incremental for a while. It tries to balance work between inboxes fairly and might be writing data out to disk more than you want it to for initial imports. It was a trade-off between letting readers see up-to-date data and throughput. Also, I forgot to ask, are you on Linux with Inotify support? I haven't tried Filesys::Notify::Simple (used by -watch) without it so maybe other OSes struggle. > > I usually prefer one-off scripts like > > scripts/import_vger_from_mbox for initial imports and store > > large archives in compressed mboxes instead of Maildir. Lack of > > mbox support is one reason I never used notmuch despite studying > > it. Ah, another thing I do almost subconsciously for running imports and tests is to use "eatmydata" to disable fsync: https://www.flamingspork.com/projects/libeatmydata/ Running -watch with eatmydata on my desktop with an SSD, I didn't notice any problems with ~28K mail from LKML from the past month or so.
It might be a pain to support our own knobs for disabling fsync: There's one knob for Xapian (only 1.4.x, I think), one knob for SQLite, and git doesn't allow disabling fsync on packs, yet, only loose objects at the moment; so "eatmydata" is probably the easiest. > > And I'm thinking about favoring Received: over Date: if both > > exist, since Date: headers are more often wrong... Ugh, but there's patchbombs and git adjusts Date: to get sorting right for MUAs, so using Received: makes those out-of-order :< So overall inbox sorting might use Received:, but sorting within individual threads will need to use the Date: header. ^ permalink raw reply [flat|nested] 12+ messages in thread
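The split described at the end of this message (order the inbox as a whole by Received:, but order messages inside one thread by Date:) can be illustrated with a small sketch. This is hypothetical Python, not public-inbox code. The `Mail` record, its `thread` field, and both function names are assumptions made for the example, as are the timestamps.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Mail:
    subject: str
    thread: str         # hypothetical thread identifier
    date: datetime      # taken from the Date: header
    received: datetime  # taken from the top Received: header


def inbox_order(mails):
    """Whole-inbox view: sort by arrival time."""
    return sorted(mails, key=lambda m: m.received)


def thread_order(mails, thread):
    """Single-thread view: sort by the sender-supplied Date: header."""
    return sorted((m for m in mails if m.thread == thread),
                  key=lambda m: m.date)


t0 = datetime(2018, 3, 19, 7, 0, tzinfo=timezone.utc)
sec = timedelta(seconds=1)
mails = [
    # A two-patch series whose second mail happened to arrive first.
    Mail("[PATCH 2/2]", "series", t0 + sec, t0 + 5 * sec),
    Mail("[PATCH 1/2]", "series", t0, t0 + 9 * sec),
    Mail("unrelated", "other", t0 + 2 * sec, t0 + 7 * sec),
]
print([m.subject for m in inbox_order(mails)])
print([m.subject for m in thread_order(mails, "series")])
```

The patch series stays in its intended order within the thread even though arrival order interleaves it with other mail.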
* internal format (was: Relationship between public-inbox and ssoma?) 2018-03-05 2:07 ` Eric Wong 2018-03-05 11:45 ` Nicolás Ojeda Bär @ 2018-03-15 15:30 ` Stefan Monnier 2018-03-15 16:40 ` Eric Wong 1 sibling, 1 reply; 12+ messages in thread From: Stefan Monnier @ 2018-03-15 15:30 UTC (permalink / raw) To: meta > The work-in-progress "v2" public-inbox format diverges and I > don't currently have plans to port ssoma to use it. The v1 > format will remain supported in public-inbox. Which reminds me: do you have some document that explains the reasoning behind the choice of format (especially which alternatives were considered and dropped and why)? Stefan ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: internal format (was: Relationship between public-inbox and ssoma?) 2018-03-15 15:30 ` internal format (was: Relationship between public-inbox and ssoma?) Stefan Monnier @ 2018-03-15 16:40 ` Eric Wong 2018-03-15 18:49 ` internal format Stefan Monnier 0 siblings, 1 reply; 12+ messages in thread From: Eric Wong @ 2018-03-15 16:40 UTC (permalink / raw) To: Stefan Monnier; +Cc: meta Stefan Monnier <monnier@IRO.UMontreal.CA> wrote: > > The work-in-progress "v2" public-inbox format diverges and I > > don't currently have plans to port ssoma to use it. The v1 > > format will remain supported in public-inbox. > > Which reminds me: do you have some document that explains the reasoning > behind the choice of format (especially which alternatives were > considered and dropped and why)? v1 or v2? Some of the reasoning for v2 was here: https://public-inbox.org/meta/20180209205140.GA11047@dcvr/ v1 was similar to what git did with loose objects and prevented dupes based on Message-ID. That worked well enough for small non-mirror lists and wasn't designed with search (Xapian) in mind. As for git itself: reliability, ease-of-replication, storage efficiency. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: internal format 2018-03-15 16:40 ` Eric Wong @ 2018-03-15 18:49 ` Stefan Monnier 2018-03-15 20:14 ` Eric Wong 0 siblings, 1 reply; 12+ messages in thread From: Stefan Monnier @ 2018-03-15 18:49 UTC (permalink / raw) To: Eric Wong; +Cc: meta > v1 or v2? Some of the reasoning for v2 was here: > https://public-inbox.org/meta/20180209205140.GA11047@dcvr/ IIUC, the issues you consider important are: - Size - Time to perform "git rev-list --objects --all" - Flexibility, e.g. to be able to remove messages. For size your benchmarks seem to indicate that as long as it's kept inside Git, the choice of format doesn't actually affect it significantly (and this matches my expectations). Tho I guess it's probably possible to improve on it with enough effort (e.g. storing attachments separately, or splitting large messages into chunks like `bup` does), but I doubt it's worth the effort (especially if you assume that the mailing-list imposes a limit on message size). For timing, I'm curious why you only consider "git rev-list --objects --all". Which operation does this correspond to in public-inbox and is that really the only one that is performance-sensitive? > As for git itself: reliability, ease-of-replication, storage > efficiency. Yes, that part I totally understand (same reason I used Git in BuGit https://gitlab.com/monnier/bugit). Part of my question was related to the fact that in BuGit I store the messages in the commit-object rather than in files (which trivially gives me conflict-free merges as well as "discussion threads") so I was wondering if it would make sense in the case of public-inbox to keep the email messages in the commit objects rather than in files, but since I don't really know which operations are frequent/important I really have no idea.
One thing that strikes me is that you don't seem to use its "decentralization": IIUC public-inbox always assumes one of the repositories is the "master" and others are mirrors (or mirrors of mirrors), so you get efficient "fast-forward" updates, but you don't do "merges". This probably means that keeping the email messages in commit objects wouldn't bring any benefits. Also this means that public-inbox could freely rewrite history, for example (which you'll need to really expunge messages) and just use "forced updates" in mirrors. Now I'm left wondering what it would mean for something like public-inbox to support merging. Stefan ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: internal format 2018-03-15 18:49 ` internal format Stefan Monnier @ 2018-03-15 20:14 ` Eric Wong 2018-03-15 21:05 ` Stefan Monnier 0 siblings, 1 reply; 12+ messages in thread From: Eric Wong @ 2018-03-15 20:14 UTC (permalink / raw) To: Stefan Monnier; +Cc: meta Stefan Monnier <monnier@IRO.UMontreal.CA> wrote: > > v1 or v2? Some of the reasoning for v2 was here: > > https://public-inbox.org/meta/20180209205140.GA11047@dcvr/ > > IIUC, the issues you consider important are: > > - Size > - Time to perform "git rev-list --objects --all" > - Flexibility, e.g. to be able to remove messages. > > For size your benchmarks seem to indicate that as long as it's kept > inside Git, the choice of format doesn't actually affect it > significantly (and this matches my expectations). > Tho I guess it's probably possible to improve on it with enough efforts > (e.g. storing attachments separately, or splitting large messages into > chunks, e.g. like `bup` does), but I doubt it's worth the effort > (especially if you assume that the mailing-list imposes a limit on > message size). Right, I decided splitting big messages wasn't worth the complexity and we leave it up to the (usually reasonable) mail server. > For timing, I'm curious why you only consider > "git rev-list --objects --all". Which operation does this corresponds > to in public-inbox and is that really the only one that is > performance-sensitive? That traverses the object graph (same walk used for repacking where bitmaps don't help). I got it from Peff https://public-inbox.org/git/20160805092805.w3nwv2l6jkbuwlzf@sigill.intra.peff.net/ That's the main thing we can control with repository layout. Large packs are generally a problem with git, so v2 partitions repositories at roughly 1G. > > As for git itself: reliability, ease-of-replication, storage > > efficiency. > > Yes, that part I totally understand (same reason I used Git in BuGit > https://gitlab.com/monnier/bugit). 
Part of my question was related to > the fact that in BuGit I store the messages in the commit-object rather > than in files (which trivially gives me conflict-free merges as well as > "discussion threads") so I was wondering if it would make sense in the > case of public-inbox to keep the email messages in the commit objects > rather than in files, but since I don't really know which operations are > frequent/important I really have no idea. I thought about storing messages in the commit object, but that would break our current use of Xapian if history rewrites are required for legal reasons. > One thing that strikes me is that you don't seem to use its > "decentralization": IIUC public-inbox always assumes one of the > repositories is the "master" and others are mirrors (or mirrors of > mirrors), so you get efficient "fast-forward" updates, but you > don't do "merges". Right, git merges require the use of pre-established communications channels (e.g. email) to coordinate. I don't believe merging and keeping an authoritative history/order makes sense with public-inbox (more on this later). What's important to decentralization is the "root" can change easily (change of URLs / archival addresses) and all the messages eventually end up replicatable. I consider ease-of-replication and efficiency the building blocks of decentralization. Beyond that, I believe encouraging "pull" via NNTP and discouraging "push" via SMTP with mlmmj/mailman/etc. can eventually lend itself to entirely forkable communities. > This probably means that keeping the email messages in commit objects > wouldn't bring any benefits. > > Also this means that public-inbox could freely rewrite history, for > example (which you'll need to really expunge messages) and just use > "forced updates" in mirrors. We currently store blob SHA-1s in Xapian to avoid tree lookups in git. Having a history rewrite can break an entire chain of unrelated messages if we store commit SHA-1 in Xapian instead of blobs. 
> Now I'm left wondering what it would mean for something like > public-inbox to support merging. I consider it a waste of effort to maintain an authoritative commit history when archiving mail. There are too many variables when it comes to mail servers and headers and no guarantees on message ordering. Among other things, the last (top) Received: header will surely differ if multiple people start archiving a list independently of each other. The email messages are what's important, so replaying an mbox/Maildir into an importer will get the data that matters (and deduplication checks will avoid redundant mails).
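Replaying an archive into an importer with a deduplication check, as suggested at the end of this message, might look like the sketch below. It is illustrative Python using the standard `mailbox` module, not public-inbox's importer. `replay_mbox` and `message_key` are hypothetical names, deduplicating on Message-ID alone is an assumption based on the v1 behaviour described earlier in the thread, and the generated ID for messages lacking one is an invented scheme.

```python
import hashlib
import mailbox
import os
import tempfile


def message_key(msg):
    """Message-ID if present, else a made-up ID derived from the content."""
    msgid = (msg["Message-ID"] or "").strip()
    if msgid:
        return msgid
    digest = hashlib.sha1(msg.as_bytes()).hexdigest()
    return "<%s@generated.invalid>" % digest


def replay_mbox(path, seen):
    """Import messages not already in `seen`; return the newly added keys."""
    imported = []
    box = mailbox.mbox(path, create=False)
    try:
        for msg in box:
            key = message_key(msg)
            if key in seen:
                continue  # redundant mail, skip it
            seen.add(key)
            imported.append(key)
    finally:
        box.close()
    return imported


MBOX = (
    "From a@example.com Mon Mar  5 00:00:00 2018\n"
    "Message-ID: <1@example.com>\n"
    "Subject: first\n"
    "\n"
    "hello\n"
    "\n"
    "From b@example.com Mon Mar  5 00:00:01 2018\n"
    "Message-ID: <1@example.com>\n"
    "Subject: same ID, different content\n"
    "\n"
    "hello again\n"
    "\n"
    "From c@example.com Mon Mar  5 00:00:02 2018\n"
    "Message-ID: <2@example.com>\n"
    "Subject: second\n"
    "\n"
    "bye\n"
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "archive.mbox")
    with open(path, "w", newline="\n") as fh:
        fh.write(MBOX)
    seen = set()
    first = replay_mbox(path, seen)
    second = replay_mbox(path, seen)
print(first, second)
```

Replaying the same archive a second time imports nothing, which is the property that lets independent archivists converge on the same set of messages without sharing a commit history.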
* Re: internal format 2018-03-15 20:14 ` Eric Wong @ 2018-03-15 21:05 ` Stefan Monnier 2018-03-15 21:21 ` Eric Wong 0 siblings, 1 reply; 12+ messages in thread From: Stefan Monnier @ 2018-03-15 21:05 UTC (permalink / raw) To: Eric Wong; +Cc: meta >> For timing, I'm curious why you only consider >> "git rev-list --objects --all". Which operation does this corresponds >> to in public-inbox and is that really the only one that is >> performance-sensitive? > That traverses the object graph (same walk used for repacking > where bitmaps don't help). Yes, I understand what it does in Git, but I wonder why a full traversal of the graph is the only/main operation you care about. Hmm... I guess your other operations are: - lookup by message-id (which is made efficient because you index files by the message-id). - everything else is done by keeping another index (from NNTP article number to message-id (or to blob?)), as in the case of Xapian. Actually, if you directly index the blobs, you don't really need to index your file by message-id (you could keep the index from message-id to blobs external, just as is done for Xapian, right?). > We currently store blob SHA-1s in Xapian to avoid tree lookups > in git. Having a history rewrite can break an entire chain of > unrelated messages if we store commit SHA-1 in Xapian instead of > blobs. Ah, indeed, keeping them as files means that the file's own SHA won't change when you rewrite history so it makes it much easier to rewrite history if you rely on this (also probably a lot more efficient within Git). >> Now I'm left wondering what it would mean for something like >> public-inbox to support merging. > I consider it a waste of effort to maintain an authoritive > commit history when archiving mail. Indeed, as long as we're left wondering what good it would do to be able to merge, we're left with its downsides. Stefan ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: internal format 2018-03-15 21:05 ` Stefan Monnier @ 2018-03-15 21:21 ` Eric Wong 0 siblings, 0 replies; 12+ messages in thread From: Eric Wong @ 2018-03-15 21:21 UTC (permalink / raw) To: Stefan Monnier; +Cc: meta Stefan Monnier <monnier@IRO.UMontreal.CA> wrote: > >> For timing, I'm curious why you only consider > >> "git rev-list --objects --all". Which operation does this corresponds > >> to in public-inbox and is that really the only one that is > >> performance-sensitive? > > That traverses the object graph (same walk used for repacking > > where bitmaps don't help). > > Yes, I understand what it does in Git, but I wonder why a full traversal > of the graph is the only/main operation you care about. > > Hmm... I guess your other operations are: > - lookup by message-id (which is made efficient because you index files > by the message-id). > - everything else is done by keeping another index (from NNTP article > number to message-id (or to blob?)), as in the case of Xapian. > > Actually, if you directly index the blobs, you don't really need to > index your file by message-id (you could keep the index from message-id > to blobs external, just as is done for Xapian, right?). Right, storing blob OIDs in Xapian means tree lookups are irrelevant to read performance. Since we can rely on Xapian for v2, we can fix the graph traversal problem by simplifying the trees and speed up writes by having smaller trees. The only remaining performance pain point is the overall size of repos (which we work around by partitioning). ^ permalink raw reply [flat|nested] 12+ messages in thread
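The reason a blob OID is a stable key for an external index is that git derives it from the file content alone, so rewriting commits or trees leaves it unchanged. A minimal Python sketch of that calculation and of a Message-ID to blob mapping follows. The hashing rule is git's standard SHA-1 blob format; the `index` dictionary and `index_message` are hypothetical stand-ins for the Xapian-backed index discussed here.

```python
import hashlib
from email import message_from_bytes


def git_blob_oid(data: bytes) -> str:
    """SHA-1 object ID git assigns to a blob with this exact content."""
    header = b"blob %d\0" % len(data)
    return hashlib.sha1(header + data).hexdigest()


# Hypothetical external index: Message-ID -> blob OID.
index = {}


def index_message(raw: bytes) -> str:
    """Record the blob OID for a raw message under its Message-ID."""
    msg = message_from_bytes(raw)
    oid = git_blob_oid(raw)
    index[msg["Message-ID"]] = oid
    return oid


raw = (
    b"Message-ID: <20180209205140.GA11047@dcvr>\n"
    b"Subject: example\n"
    b"\n"
    b"body\n"
)
oid = index_message(raw)
print(index["<20180209205140.GA11047@dcvr>"] == oid)
```

With such a mapping, fetching a message is a direct blob read by OID and never has to walk a tree, which is why tree layout stops mattering for read performance.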