* [v2] one file to rule them all?
@ 2018-02-09 20:51 Eric Wong
2018-02-15 10:55 ` Eric Wong
0 siblings, 1 reply; 21+ messages in thread
From: Eric Wong @ 2018-02-09 20:51 UTC (permalink / raw)
To: meta
[-- Attachment #1: Type: text/plain, Size: 2218 bytes --]
Since 95acd5901491e4f333f5d2bbeed6fb5e6b53e07c
("searchmsg: add git object ID to doc_data")
the need for having files stored in trees is reduced
since Xapian stores the git object_id and asks git to
retrieve it without doing tree lookups.
So, as long as git knows an object exists, it should be no
problem to just continually replace a single blob at the top
level.
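
As a rough illustration of "replace a single blob at the top level",
here is a sketch of the git-fast-import(1) stream one such commit could
use.  The sub name, the log message and the path "m" are placeholders
chosen for this example, and $raw is assumed to be a byte string:

```perl
use strict;
use warnings;

# Build a git-fast-import(1) stream which commits $raw as the single
# top-level file "m", replacing whatever blob was there before.
# $mark is the first free mark number; $parent is a mark (":N") or a
# commit ID, or undef for the first commit on the ref.
# $when uses git's raw date format: "<epoch seconds> <tz offset>".
sub one_file_commit {
	my ($raw, $ident, $when, $mark, $parent) = @_;
	my $blob = $mark;
	my $commit = $mark + 1;
	my $log = 'add message'; # placeholder commit message
	my $s = "blob\nmark :$blob\ndata " . length($raw) . "\n$raw\n";
	$s .= "commit refs/heads/master\nmark :$commit\n";
	$s .= "author $ident $when\ncommitter $ident $when\n";
	$s .= 'data ' . length($log) . "\n$log\n";
	$s .= "from $parent\n" if defined $parent;
	# every commit overwrites the same path, so the tree stays tiny
	$s .= "M 100644 :$blob m\n\n";
	$s;
}
```

The returned string would be written to the stdin of "git fast-import";
old blobs stay reachable through history even though the tree only ever
holds one entry.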
Testing with git@vger history (https://public-inbox.org/git/ at
10066bdacf246bf885f7a11d06a5a87946d7af73 <20180208172153.GA30760@tor.lan>
by Torsten Bögershausen <tboegi@web.de> @ Thu Feb 8 18:21:53 2018 +0100)
For the 2-2-36 and 2-2-2-34 trees I took into account naming the
last 16 bytes since that's what git.git uses for
sorting/grouping for packing (pack_name_hash in pack-objects.h)
2-38 914M (baseline)
2-2-2-34 849M
2-2-36 832M
1-file 839M
2-2-2-34 has the most trees, so it's not great in terms of
space. 2-2-36 optimizes deltas better than the 1-file route;
but not significantly so.
It seems optimizing for deltafication isn't worth the effort...
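
For reference, a Perl port of the name hash mentioned above, written
from memory of pack_name_hash in git.git; treat it as a sketch and check
pack-objects.h before relying on the details:

```perl
use strict;
use warnings;

# Sketch of git's pack_name_hash: each non-whitespace byte is added
# into the top 8 bits and everything before it is shifted down by 2,
# so the trailing bytes of a path carry the most weight.
sub pack_name_hash {
	my ($name) = @_;
	my $hash = 0;
	foreach my $c (unpack('C*', $name)) {
		# skip space, tab, LF and CR
		next if $c == 32 || $c == 9 || $c == 10 || $c == 13;
		$hash = (($hash >> 2) + ($c << 24)) & 0xffffffff;
	}
	$hash;
}
```

This is why the 2-2-36 and 2-2-2-34 conversion scripts append a 16-byte
suffix derived from the Subject: messages with similar subjects end up
with similar name hashes and are more likely to be tried as delta bases
for each other.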
Timing "git rev-list --objects --all |wc -l" reveals much bigger
differences. Timings are a bit fuzzy since (AFAIK) this is a
shared system, but it's not really a contest:
2-38 ~5 minutes
2-2-2-34 ~30 seconds
2-2-36 ~30 seconds
1-file ~5 seconds
Smaller trees are way faster :)
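
A small sketch for repeating this kind of measurement from Perl
instead of the shell; count_lines_timed is a name made up for this
example, and the numbers it gives will be as fuzzy as the ones above on
a shared system:

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Run @$cmd, count its output lines and return (count, seconds).
sub count_lines_timed {
	my ($cmd) = @_;
	my $t0 = time;
	open(my $fh, '-|', @$cmd) or die "failed to run $cmd->[0]: $!\n";
	my $n = 0;
	$n++ while <$fh>;
	close $fh or die "$cmd->[0] failed: \$?=$?\n";
	($n, time - $t0);
}

# Inside a git repository, something like:
#   my ($n, $sec) = count_lines_timed([qw(git rev-list --objects --all)]);
#   printf "%d objects in %0.1fs\n", $n, $sec;
```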
The downside of this change is squashing history will no longer
be possible; but it won't be needed for efficiency reasons.
In other words, git scales infinitely well to deep history
depths, but not to breadth of large trees[1].
Marking spam and handling message removals might be a little
trickier as chronology will have to be taken into account...
(will post more on this, later)
I also considered storing messages in the commit object itself
but that would be tougher to reconcile if rewriting git history
is necessary for legal reasons (DMCA).
[1] - we currently process history with --reverse to walk
in chronological order to ease processing of message
removals; but --reverse has an O(n) cost associated
with it so we should avoid it. The thread association
logic should be robust enough to be time-independent.
[-- Attachment #2: 1file-convert.perl --]
[-- Type: text/plain, Size: 889 bytes --]
#!/usr/bin/perl -w
# Copyright 2018 The Linux Foundation
# License: AGPL-3.0+ <http://www.gnu.org/licenses/agpl-3.0.txt>
use strict;
use warnings;
use Email::MIME;
use Digest::MD5 qw(md5_hex);
$| = 0;
my $h = '[0-9a-f]';
my $state = '';
my $blob;
my $suff; # 16 bytes for git hashing
while (<STDIN>) {
	if ($_ eq "blob\n") {
		$state = 'blob';
	} elsif (/^commit /) {
		$state = 'commit';
	} elsif ($state eq 'commit') {
		if (m{^(M 100644 :\d+) ${h}{2}/${h}{38}}o) {
			my ($pfx) = ($1);
			print "$pfx msg\n";
			next;
		}
		if (/^data (\d+)/) {
			print $_;
			my $len = $1;
			if ($len) {
				my $tmp;
				read(STDIN, $tmp, $len) or die "read: $!\n";
				print $tmp;
			}
			next;
		}
	} elsif ($state eq 'blob') {
		if (/^data (\d+)/) {
			my $len = $1;
			print $_;
			next unless $len;
			read(STDIN, $blob, $len) or die "read: $!\n";
			print $blob;
			next;
		}
	}
	print $_;
}
[-- Attachment #3: 2-2-36-convert.perl --]
[-- Type: text/plain, Size: 1141 bytes --]
#!/usr/bin/perl -w
# Copyright 2018 The Linux Foundation
# License: AGPL-3.0+ <http://www.gnu.org/licenses/agpl-3.0.txt>
use strict;
use warnings;
use Email::MIME;
use Digest::MD5 qw(md5_hex);
$| = 0;
my $h = '[0-9a-f]';
my $state = '';
my $blob;
my $suff; # 16 bytes for git hashing
while (<STDIN>) {
	if ($_ eq "blob\n") {
		$state = 'blob';
	} elsif (/^commit /) {
		$state = 'commit';
	} elsif ($state eq 'commit') {
		if (m{^(M 100644 :\d+) (${h}{2})/(${h}{2})(${h}{36})}o) {
			my ($pfx, $x2, $x4, $x36) = ($1, $2, $3, $4);
			print "$pfx $x2/$x4/$x36.$suff\n";
			next;
		}
		if (/^data (\d+)/) {
			print $_;
			my $len = $1;
			if ($len) {
				my $tmp;
				read(STDIN, $tmp, $len) or die "read: $!\n";
				print $tmp;
			}
			next;
		}
	} elsif ($state eq 'blob') {
		if (/^data (\d+)/) {
			my $len = $1;
			print $_;
			next unless $len;
			read(STDIN, $blob, $len) or die "read: $!\n";
			print $blob;
			my $mime = Email::MIME->new($blob);
			$suff = $mime->header('Subject');
			utf8::encode($suff);
			# git uses the last 16 bytes for deltas
			$suff = substr(md5_hex(substr($suff, -16)), -16);
			next;
		}
	}
	print $_;
}
[-- Attachment #4: 2-2-2-34-convert.perl --]
[-- Type: text/plain, Size: 1163 bytes --]
#!/usr/bin/perl -w
# Copyright 2018 The Linux Foundation
# License: AGPL-3.0+ <http://www.gnu.org/licenses/agpl-3.0.txt>
use strict;
use warnings;
use Email::MIME;
use Digest::MD5 qw(md5_hex);
$| = 0;
my $h = '[0-9a-f]';
my $state = '';
my $blob;
my $suff; # 16 bytes for git hashing
while (<STDIN>) {
	if ($_ eq "blob\n") {
		$state = 'blob';
	} elsif (/^commit /) {
		$state = 'commit';
	} elsif ($state eq 'commit') {
		if (m{^(M 100644 :\d+) (${h}{2})/(${h}{2})(${h}{2})(${h}{34})}o) {
			my ($pfx, $x2, $x4, $x6, $x34) = ($1, $2, $3, $4, $5);
			print "$pfx $x2/$x4/$x6/$x34.$suff\n";
			next;
		}
		if (/^data (\d+)/) {
			print $_;
			my $len = $1;
			if ($len) {
				my $tmp;
				read(STDIN, $tmp, $len) or die "read: $!\n";
				print $tmp;
			}
			next;
		}
	} elsif ($state eq 'blob') {
		if (/^data (\d+)/) {
			my $len = $1;
			print $_;
			next unless $len;
			read(STDIN, $blob, $len) or die "read: $!\n";
			print $blob;
			my $mime = Email::MIME->new($blob);
			$suff = $mime->header('Subject');
			utf8::encode($suff);
			# git uses the last 16 bytes for deltas
			$suff = substr(md5_hex(substr($suff, -16)), -16);
			next;
		}
	}
	print $_;
}
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [v2] one file to rule them all?
2018-02-09 20:51 [v2] one file to rule them all? Eric Wong
@ 2018-02-15 10:55 ` Eric Wong
2018-02-15 11:08 ` [WIP 0/17] initial v2 work based on one-file tree Eric Wong (Contractor, The Linux Foundation)
0 siblings, 1 reply; 21+ messages in thread
From: Eric Wong @ 2018-02-15 10:55 UTC (permalink / raw)
To: meta
Eric Wong <e@80x24.org> wrote:
> Timing "git rev-list --objects --all |wc -l" reveals much bigger
> differences. Timings are a bit fuzzy since (AFAIK) this is a
> shared system, but it's not really a contest:
>
> 2-38 ~5 minutes
> 2-2-2-34 ~30 seconds
> 2-2-36 ~30 seconds
> 1-file ~5 seconds
>
> Smaller trees are way faster :)
The LKML 2000-2017 archives (16GB uncompressed mboxes) I have
are 6.3G of objects with 1-file storage in git and took around
33 minutes to do a full import utilizing a single core and
single git repo (no deduplication checks).
"git repack -adb" takes about 2 minutes
"git rev-list --objects --all |wc -l"
takes around 1 minute with over 8 million objects
As a baseline, pure Perl parsing of the mboxes (no writing to
git) was around 23 minutes on a single core; so git-fast-import
does add some overhead but probably not as much as Xapian will
add for the initial import.
The v1 2-38 code slowed to a crawl as more data got into the
repo and I gave up after it hit 18G and hit snags with
badly-formatted dates (worked around by relying on Date::Parse
instead of git's RFC2822 parser).
Side note: Using 4 parallel processes for the parse-only tests
took around 10.5 minutes; while 2 processes took around 11-12
minutes. Then I realized 2 of the 4 processors were HT,
so it appears HT doesn't help much with Perl parsing...
> In other words, git scales infinitely well to deep history
> depths, but not to breadth of large trees[1].
Serving ~6G for clones is still a lot of bandwidth; so
partitioning the git repos to limit the size of each clone
seems worth it.
Yearly partitions are probably too frequent and we'd end up with
too many packs (resulting in more open FDs, cache misses, and
metadata stored in Xapian). I think partitioning based on
message-count/sizes might be a better metric for splitting as
LKML seems to get more traffic year-after-year.
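
To make the size-based idea concrete, here is a minimal sketch which
starts a new partition whenever adding the next message would exceed a
byte budget.  The sub name and the budget are hypothetical; nothing like
this exists in the tree yet:

```perl
use strict;
use warnings;

# Given message sizes in arrival order and a byte budget per
# partition, return an arrayref with the partition number of each
# message.  A message larger than the budget gets a partition to itself.
sub assign_partitions {
	my ($sizes, $max_bytes) = @_;
	my ($part, $used) = (0, 0);
	my @out;
	foreach my $size (@$sizes) {
		if ($used > 0 && $used + $size > $max_bytes) {
			$part++;
			$used = 0;
		}
		$used += $size;
		push @out, $part;
	}
	\@out;
}
```

Unlike yearly splits, the number of partitions then grows with the
amount of mail rather than with the calendar, so busy years get split
more often and quiet years share a partition.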
> Marking spam and handling message removals might be a little
> trickier as chronology will have to be taken into account...
> (will post more on this, later)
Keeping track of everything in Xapian while walking backwards
through git history shouldn't be a big problem, actually.
(Xapian has read-after-write consistency)
However, trying to reason about partitioning of Xapian DBs
across time/message-count boundaries was making my head hurt and
now I'm not sure if it's necessary to partition Xapian DBs.
While normal text indexing is trivial to partition and
parallelize, associating messages with thread IDs requires
"global" knowledge spanning all partitions (since mail threads
span multiple periods). Unfortunately, this requires much
synchronization and synchronization hurts parallelism.
Partitioning Xapian DBs is useful to speed up full-indexing and
not much else. Full-(re)indexing is a rare event, and can be
done on a cold DB while the hot one is taking traffic. In fact,
I would expect lookups on partitioned DBs to be slower since it
has more files to go through and has to map things like
internal document_ids to non-conflicting ones.
Also, we don't serve Xapian data to be cloned; which is the main
reason to do partitioning of git storage...
^ permalink raw reply [flat|nested] 21+ messages in thread
* [WIP 0/17] initial v2 work based on one-file tree
2018-02-15 10:55 ` Eric Wong
@ 2018-02-15 11:08 ` Eric Wong (Contractor, The Linux Foundation)
2018-02-15 11:08 ` [WIP 01/17] AUTHORS: add The Linux Foundation Eric Wong (Contractor, The Linux Foundation)
` (16 more replies)
0 siblings, 17 replies; 21+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-02-15 11:08 UTC (permalink / raw)
To: meta
The basic idea is to outsource deduplication to Xapian
and use git as dumb storage. This yields huge dividends
in object traversal based on preliminary tests:
https://public-inbox.org/meta/20180209205140.GA11047@dcvr/
Additionally, insertion time does not degrade due to giant
tree objects which plagued the initial v1 design.
There are also a couple of small fixes along the way to make it
tolerate some crap in older archives.
The search indexer and content-based deduplication will
still need to be worked on.
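
A rough sketch of what content-based deduplication could look like.
The %seen hash is only an in-memory stand-in for a lookup in Xapian, and
hashing the whole raw message is a simplification (the ContentId module
in this series hashes selected headers plus the body instead):

```perl
use strict;
use warnings;
use Digest::SHA qw(sha256_hex);

# Returns a closure which maps a raw message (a byte string) to a
# content ID the first time it is seen, and to undef for duplicates.
sub make_deduper {
	my %seen; # stand-in for the Xapian index
	sub {
		my ($raw) = @_;
		my $id = 'SHA-256:' . sha256_hex($raw);
		return if $seen{$id}++;
		$id;
	};
}
```

An importer would only feed a message to git-fast-import when the
closure returns an ID, leaving git itself free of any lookups.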
Eric Wong (Contractor, The Linux Foundation) (17):
AUTHORS: add The Linux Foundation
watch_maildir: allow '-' in mail filename
scripts/import_vger_from_mbox: relax From_ line match slightly
import: stop writing legacy ssoma.index by default
import: begin supporting this without ssoma.lock
import: initial handling for v2
t/import: test for last_object_id insertion
content_id: add test case
searchmsg: add mid_mime import for _extract_mid
scripts/import_vger_from_mbox: support --dry-run option
import: APIs to support v2 use
search: free up 'Q' prefix for a real unique identifier
searchidx: fix comment around next_thread_id
address: extract more characters from email addresses
import: pass "raw" dates to git-fast-import(1)
scripts/import_vger_from_mbox: use v2 layout for import
import: quiet down warnings from bogus From: lines
^ permalink raw reply [flat|nested] 21+ messages in thread
* [WIP 01/17] AUTHORS: add The Linux Foundation
2018-02-15 11:08 ` [WIP 0/17] initial v2 work based on one-file tree Eric Wong (Contractor, The Linux Foundation)
@ 2018-02-15 11:08 ` Eric Wong (Contractor, The Linux Foundation)
2018-02-15 11:08 ` [WIP 02/17] watch_maildir: allow '-' in mail filename Eric Wong (Contractor, The Linux Foundation)
` (15 subsequent siblings)
16 siblings, 0 replies; 21+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-02-15 11:08 UTC (permalink / raw)
To: meta
I'll be working as a contractor for The Linux Foundation on v2
in an effort to support LKML and associated lists.
---
AUTHORS | 1 +
1 file changed, 1 insertion(+)
diff --git a/AUTHORS b/AUTHORS
index 201ed03..1ad02cd 100644
--- a/AUTHORS
+++ b/AUTHORS
@@ -4,3 +4,4 @@ See history in git (via `git clone https://public-inbox.org/public-inbox')
for a full history of the project.
* Eric Wong <e@80x24.org> (BDFL)
+* The Linux Foundation (v2 work)
--
EW
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [WIP 02/17] watch_maildir: allow '-' in mail filename
2018-02-15 11:08 ` [WIP 0/17] initial v2 work based on one-file tree Eric Wong (Contractor, The Linux Foundation)
2018-02-15 11:08 ` [WIP 01/17] AUTHORS: add The Linux Foundation Eric Wong (Contractor, The Linux Foundation)
@ 2018-02-15 11:08 ` Eric Wong (Contractor, The Linux Foundation)
2018-02-15 11:08 ` [WIP 03/17] scripts/import_vger_from_mbox: relax From_ line match slightly Eric Wong (Contractor, The Linux Foundation)
` (14 subsequent siblings)
16 siblings, 0 replies; 21+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-02-15 11:08 UTC (permalink / raw)
To: meta
Hostnames can contain '-' and this allows public-inbox-watch(1)
to work on machines which generate Maildir files with '-' in
them.
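
A quick check of the old and new patterns against a hypothetical
Maildir filename containing a hyphenated hostname (the filename and the
sub names are made up for illustration):

```perl
use strict;
use warnings;

# true if the last path component matches the old (pre-patch) pattern
sub basename_ok_old {
	my @p = split(m!/+!, $_[0]);
	$p[-1] =~ /\A[a-zA-Z0-9][\w:,=\.]+\z/ ? 1 : 0;
}

# true if the last path component matches the new pattern with '-'
sub basename_ok_new {
	my @p = split(m!/+!, $_[0]);
	$p[-1] =~ /\A[a-zA-Z0-9][\-\w:,=\.]+\z/ ? 1 : 0;
}

my $path = 'Maildir/cur/1518692880.M1P2.my-host:2,S';
printf "old=%d new=%d\n", basename_ok_old($path), basename_ok_new($path);
```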
---
lib/PublicInbox/WatchMaildir.pm | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/lib/PublicInbox/WatchMaildir.pm b/lib/PublicInbox/WatchMaildir.pm
index a3fab42..403b6cf 100644
--- a/lib/PublicInbox/WatchMaildir.pm
+++ b/lib/PublicInbox/WatchMaildir.pm
@@ -170,7 +170,7 @@ sub _force_mid {
sub _try_path {
my ($self, $path) = @_;
my @p = split(m!/+!, $path);
- return if $p[-1] !~ /\A[a-zA-Z0-9][\w:,=\.]+\z/;
+ return if $p[-1] !~ /\A[a-zA-Z0-9][\-\w:,=\.]+\z/;
if ($p[-1] =~ /:2,([A-Z]+)\z/i) {
my $flags = $1;
return if $flags =~ /[DT]/; # no [D]rafts or [T]rashed mail
--
EW
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [WIP 03/17] scripts/import_vger_from_mbox: relax From_ line match slightly
2018-02-15 11:08 ` [WIP 0/17] initial v2 work based on one-file tree Eric Wong (Contractor, The Linux Foundation)
2018-02-15 11:08 ` [WIP 01/17] AUTHORS: add The Linux Foundation Eric Wong (Contractor, The Linux Foundation)
2018-02-15 11:08 ` [WIP 02/17] watch_maildir: allow '-' in mail filename Eric Wong (Contractor, The Linux Foundation)
@ 2018-02-15 11:08 ` Eric Wong (Contractor, The Linux Foundation)
2018-02-15 11:08 ` [WIP 04/17] import: stop writing legacy ssoma.index by default Eric Wong (Contractor, The Linux Foundation)
` (13 subsequent siblings)
16 siblings, 0 replies; 21+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-02-15 11:08 UTC (permalink / raw)
To: meta
The mboxes I got from cregit have two spaces after the email
address, while the "git format-patch" output I'm used to dealing
with only has one space.
It's still a "strict" match in that it checks for something
resembling a timestamp, but it relaxes the number of spaces
between the email address and date.
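
To illustrate the difference, a small sketch comparing the old and
relaxed patterns on From_ lines with one and two spaces after the
address (the sample lines and sub names are made up for this example):

```perl
use strict;
use warnings;

# pattern before this change: exactly one space after the address
sub from_old {
	$_[0] =~ /^From \S+ \S+ \S+ +\S+ [^:]+:[^:]+:[^:]+ [^:]+/ ? 1 : 0;
}

# relaxed pattern: one or more spaces after the address
sub from_new {
	$_[0] =~ /^From \S+ +\S+ \S+ +\S+ [^:]+:[^:]+:[^:]+ [^:]+/ ? 1 : 0;
}

my $one = 'From example@example.com Fri Jun 23 02:56:55 2000';
my $two = 'From example@example.com  Fri Jun 23 02:56:55 2000';
foreach my $l ($one, $two) {
	printf "old=%d new=%d %s\n", from_old($l), from_new($l), $l;
}
```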
---
scripts/import_vger_from_mbox | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/scripts/import_vger_from_mbox b/scripts/import_vger_from_mbox
index 44055ff..9b3afc8 100644
--- a/scripts/import_vger_from_mbox
+++ b/scripts/import_vger_from_mbox
@@ -28,7 +28,7 @@ sub do_add ($$) {
}
# asctime: From example@example.com Fri Jun 23 02:56:55 2000
-my $from_strict = qr/^From \S+ \S+ \S+ +\S+ [^:]+:[^:]+:[^:]+ [^:]+/;
+my $from_strict = qr/^From \S+ +\S+ \S+ +\S+ [^:]+:[^:]+:[^:]+ [^:]+/;
my $prev = undef;
while (defined(my $l = <STDIN>)) {
if ($l =~ /$from_strict/o) {
--
EW
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [WIP 04/17] import: stop writing legacy ssoma.index by default
2018-02-15 11:08 ` [WIP 0/17] initial v2 work based on one-file tree Eric Wong (Contractor, The Linux Foundation)
` (2 preceding siblings ...)
2018-02-15 11:08 ` [WIP 03/17] scripts/import_vger_from_mbox: relax From_ line match slightly Eric Wong (Contractor, The Linux Foundation)
@ 2018-02-15 11:08 ` Eric Wong (Contractor, The Linux Foundation)
2018-02-15 11:08 ` [WIP 05/17] import: begin supporting this without ssoma.lock Eric Wong (Contractor, The Linux Foundation)
` (12 subsequent siblings)
16 siblings, 0 replies; 21+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-02-15 11:08 UTC (permalink / raw)
To: meta
Machines which have never seen ssoma don't need the index,
so stop creating it.
---
lib/PublicInbox/Import.pm | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/lib/PublicInbox/Import.pm b/lib/PublicInbox/Import.pm
index 8eec17e..299329b 100644
--- a/lib/PublicInbox/Import.pm
+++ b/lib/PublicInbox/Import.pm
@@ -229,10 +229,9 @@ sub done {
# for compatibility with existing ssoma installations
# we can probably remove this entirely by 2020
my $git_dir = $self->{git}->{git_dir};
- # XXX: change the following scope to: if (-e $index) # in 2018 or so..
my @cmd = ('git', "--git-dir=$git_dir");
- if ($nchg && !$ENV{FAST}) {
- my $index = "$git_dir/ssoma.index";
+ my $index = "$git_dir/ssoma.index";
+ if ($nchg && -e $index && !$ENV{FAST}) {
my $env = { GIT_INDEX_FILE => $index };
run_die([@cmd, qw(read-tree -m -v -i), $self->{ref}], $env);
}
--
EW
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [WIP 05/17] import: begin supporting this without ssoma.lock
2018-02-15 11:08 ` [WIP 0/17] initial v2 work based on one-file tree Eric Wong (Contractor, The Linux Foundation)
` (3 preceding siblings ...)
2018-02-15 11:08 ` [WIP 04/17] import: stop writing legacy ssoma.index by default Eric Wong (Contractor, The Linux Foundation)
@ 2018-02-15 11:08 ` Eric Wong (Contractor, The Linux Foundation)
2018-02-15 11:08 ` [WIP 06/17] import: initial handling for v2 Eric Wong (Contractor, The Linux Foundation)
` (11 subsequent siblings)
16 siblings, 0 replies; 21+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-02-15 11:08 UTC (permalink / raw)
To: meta
We'll reuse this class in v2, but won't be utilizing
per-git-repository ssoma.lock files.
Meanwhile, stop treating ::Inbox objects as an afterthought
and allow importing name and email into them.
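
The optional-lock pattern boils down to something like the sketch
below.  with_optional_lock is a hypothetical helper for illustration
only; the patch itself keeps the logic inline in gfi_start and done:

```perl
use strict;
use warnings;
use Fcntl qw(:flock :DEFAULT);

# Run $cb while holding an exclusive flock on $lockpath.
# If $lockpath is undef (the v2 case), run $cb without any locking.
sub with_optional_lock {
	my ($lockpath, $cb) = @_;
	my $lockfh;
	if (defined $lockpath) {
		sysopen($lockfh, $lockpath, O_WRONLY|O_CREAT) or
			die "failed to open lock $lockpath: $!\n";
		# wait for other processes to be done
		flock($lockfh, LOCK_EX) or die "lock failed: $!\n";
	}
	my @ret = $cb->();
	if ($lockfh) {
		flock($lockfh, LOCK_UN) or die "unlock failed: $!\n";
		close $lockfh or die "close lock failed: $!\n";
	}
	@ret;
}
```

If $cb dies, $lockfh goes out of scope and the lock is dropped when the
handle is closed, so no explicit cleanup is attempted in that case.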
---
lib/PublicInbox/Import.pm | 28 ++++++++++++++++++++--------
1 file changed, 20 insertions(+), 8 deletions(-)
diff --git a/lib/PublicInbox/Import.pm b/lib/PublicInbox/Import.pm
index 299329b..56633a8 100644
--- a/lib/PublicInbox/Import.pm
+++ b/lib/PublicInbox/Import.pm
@@ -13,13 +13,20 @@ use PublicInbox::MID qw(mid_mime mid2path);
use PublicInbox::Address;
sub new {
- my ($class, $git, $name, $email, $inbox) = @_;
+ my ($class, $git, $name, $email, $ibx) = @_;
+ my $ref = 'refs/heads/master';
+ if ($ibx) {
+ $ref = $ibx->{ref_head} || 'refs/heads/master';
+ $name ||= $ibx->{name};
+ $email ||= $ibx->{-primary_address};
+ }
bless {
git => $git,
ident => "$name <$email>",
mark => 1,
- ref => 'refs/heads/master',
- inbox => $inbox,
+ ref => $ref,
+ inbox => $ibx,
+ ssoma_lock => 1, # disable for v2
}, $class
}
@@ -34,12 +41,16 @@ sub gfi_start {
pipe($out_r, $out_w) or die "pipe failed: $!";
my $git = $self->{git};
my $git_dir = $git->{git_dir};
- my $lockpath = "$git_dir/ssoma.lock";
- sysopen(my $lockfh, $lockpath, O_WRONLY|O_CREAT) or
- die "failed to open lock $lockpath: $!";
- # wait for other processes to be done
- flock($lockfh, LOCK_EX) or die "lock failed: $!\n";
+ my $lockfh;
+ if ($self->{ssoma_lock}) {
+ my $lockpath = "$git_dir/ssoma.lock";
+ sysopen($lockfh, $lockpath, O_WRONLY|O_CREAT) or
+ die "failed to open lock $lockpath: $!";
+ # wait for other processes to be done
+ flock($lockfh, LOCK_EX) or die "lock failed: $!\n";
+ }
+
local $/ = "\n";
chomp($self->{tip} = $git->qx(qw(rev-parse --revs-only), $self->{ref}));
@@ -247,6 +258,7 @@ sub done {
eval { run_die([@cmd, qw(gc --auto)], undef) };
}
+ $self->{ssoma_lock} or return;
my $lockfh = delete $self->{lockfh} or die "BUG: not locked: $!";
flock($lockfh, LOCK_UN) or die "unlock failed: $!";
close $lockfh or die "close lock failed: $!";
--
EW
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [WIP 06/17] import: initial handling for v2
2018-02-15 11:08 ` [WIP 0/17] initial v2 work based on one-file tree Eric Wong (Contractor, The Linux Foundation)
` (4 preceding siblings ...)
2018-02-15 11:08 ` [WIP 05/17] import: begin supporting this without ssoma.lock Eric Wong (Contractor, The Linux Foundation)
@ 2018-02-15 11:08 ` Eric Wong (Contractor, The Linux Foundation)
2018-02-15 11:08 ` [WIP 07/17] t/import: test for last_object_id insertion Eric Wong (Contractor, The Linux Foundation)
` (10 subsequent siblings)
16 siblings, 0 replies; 21+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-02-15 11:08 UTC (permalink / raw)
To: meta
Call order will need to change a bit since this is going to be
tied to Xapian
---
MANIFEST | 1 +
lib/PublicInbox/ContentId.pm | 30 ++++++++++++++++++
lib/PublicInbox/Import.pm | 74 +++++++++++++++++++++++++++++++++++---------
3 files changed, 91 insertions(+), 14 deletions(-)
create mode 100644 lib/PublicInbox/ContentId.pm
diff --git a/MANIFEST b/MANIFEST
index 5074d8d..85e8503 100644
--- a/MANIFEST
+++ b/MANIFEST
@@ -46,6 +46,7 @@ examples/varnish-4.vcl
lib/PublicInbox/Address.pm
lib/PublicInbox/AltId.pm
lib/PublicInbox/Config.pm
+lib/PublicInbox/ContentId.pm
lib/PublicInbox/Daemon.pm
lib/PublicInbox/Emergency.pm
lib/PublicInbox/EvCleanup.pm
diff --git a/lib/PublicInbox/ContentId.pm b/lib/PublicInbox/ContentId.pm
new file mode 100644
index 0000000..65d5a76
--- /dev/null
+++ b/lib/PublicInbox/ContentId.pm
@@ -0,0 +1,30 @@
+# Copyright (C) 2018 all contributors <meta@public-inbox.org>
+# License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
+
+package PublicInbox::ContentId;
+use strict;
+use warnings;
+use base qw/Exporter/;
+our @EXPORT_OK = qw/content_id/;
+
+# not sure if less-widely supported hash families are worth bothering with
+use Digest::SHA;
+
+# Content-* headers are often no-ops, so maybe we don't need them
+my @ID_HEADERS = qw(Subject From Date Message-ID References To Cc In-Reply-To);
+
+sub content_id ($;$) {
+ my ($mime, $alg) = @_;
+ $alg ||= 256;
+ my $dig = Digest::SHA->new($alg);
+ my $hdr = $mime->header_obj;
+
+ foreach my $h (@ID_HEADERS) {
+ my @v = $hdr->header_raw($h);
+ $dig->add($_) foreach @v;
+ }
+ $dig->add($mime->body_raw);
+ 'SHA-' . $dig->algorithm . ':' . $dig->hexdigest;
+}
+
+1;
diff --git a/lib/PublicInbox/Import.pm b/lib/PublicInbox/Import.pm
index 56633a8..b8e9dd0 100644
--- a/lib/PublicInbox/Import.pm
+++ b/lib/PublicInbox/Import.pm
@@ -11,6 +11,7 @@ use Fcntl qw(:flock :DEFAULT);
use PublicInbox::Spawn qw(spawn);
use PublicInbox::MID qw(mid_mime mid2path);
use PublicInbox::Address;
+use PublicInbox::ContentId qw(content_id);
sub new {
my ($class, $git, $name, $email, $ibx) = @_;
@@ -26,6 +27,7 @@ sub new {
mark => 1,
ref => $ref,
inbox => $ibx,
+ path_type => '2/38', # or 'v2'
ssoma_lock => 1, # disable for v2
}, $class
}
@@ -88,6 +90,7 @@ sub norm_body ($) {
$b
}
+# only used for v1 (ssoma) inboxes
sub _check_path ($$$$) {
my ($r, $w, $tip, $path) = @_;
return if $tip eq '';
@@ -97,17 +100,9 @@ sub _check_path ($$$$) {
$info =~ /\Amissing / ? undef : $info;
}
-# returns undef on non-existent
-# ('MISMATCH', msg) on mismatch
-# (:MARK, msg) on success
-sub remove {
- my ($self, $mime, $msg) = @_; # mime = Email::MIME
-
- my $mid = mid_mime($mime);
- my $path = mid2path($mid);
+sub check_remove_v1 {
+ my ($r, $w, $tip, $path, $mime) = @_;
- my ($r, $w) = $self->gfi_start;
- my $tip = $self->{tip};
my $info = _check_path($r, $w, $tip, $path) or return ('MISSING',undef);
$info =~ m!\A100644 blob ([a-f0-9]{40})\t!s or die "not blob: $info";
my $blob = $1;
@@ -140,6 +135,34 @@ sub remove {
if ($cur_s ne $cur_m || norm_body($cur) ne norm_body($mime)) {
return ('MISMATCH', $cur);
}
+ (undef, $cur);
+}
+
+# returns undef on non-existent
+# ('MISMATCH', msg) on mismatch
+# (:MARK, msg) on success
+#
+# For v2 inboxes, the content_id is returned instead of the msg
+# v2 callers should check with Xapian before calling this as
+# it is not idempotent.
+sub remove {
+ my ($self, $mime, $msg) = @_; # mime = Email::MIME
+
+ my $path_type = $self->{path_type};
+ my ($path, $err, $cur, $blob);
+
+ my ($r, $w) = $self->gfi_start;
+ my $tip = $self->{tip};
+ if ($path_type eq '2/38') {
+ $path = mid2path(mid_mime($mime));
+ ($err, $cur) = check_remove_v1($r, $w, $tip, $path, $mime);
+ return ($err, $cur) if $err;
+ } else {
+ $cur = content_id($mime);
+ my $len = length($cur);
+ $blob = $self->{mark}++;
+ print $w "blob\nmark :$blob\ndata $len\n$cur\n" or wfail;
+ }
my $ref = $self->{ref};
my $commit = $self->{mark}++;
@@ -156,7 +179,11 @@ sub remove {
"committer $ident $now\n",
"data $len\n$msg\n\n",
'from ', ($parent ? $parent : $tip), "\n" or wfail;
- print $w "D $path\n\n" or wfail;
+ if (defined $path) {
+ print $w "D $path\n\n" or wfail;
+ } else {
+ print $w "M 100644 :$blob d\n\n" or wfail;
+ }
$self->{nchg}++;
(($self->{tip} = ":$commit"), $cur);
}
@@ -177,15 +204,25 @@ sub add {
my $date = $mime->header('Date');
my $subject = $mime->header('Subject');
$subject = '(no subject)' unless defined $subject;
- my $mid = mid_mime($mime);
- my $path = mid2path($mid);
+ my $path_type = $self->{path_type};
+
+ my $path;
+ if ($path_type eq '2/38') {
+ $path = mid2path(mid_mime($mime));
+ } else { # v2 layout, one file:
+ $path = 'm';
+ }
my ($r, $w) = $self->gfi_start;
my $tip = $self->{tip};
- _check_path($r, $w, $tip, $path) and return;
+ if ($path_type eq '2/38') {
+ _check_path($r, $w, $tip, $path) and return;
+ }
# kill potentially confusing/misleading headers
$mime->header_set($_) for qw(bytes lines content-length status);
+
+ # spam check:
if ($check_cb) {
$mime = $check_cb->($mime) or return;
}
@@ -194,6 +231,15 @@ sub add {
my $blob = $self->{mark}++;
print $w "blob\nmark :$blob\ndata ", length($mime), "\n" or wfail;
print $w $mime, "\n" or wfail;
+
+ # v2: we need this for Xapian
+ if ($self->{want_object_id}) {
+ print $w "get-mark :$blob\n" or wfail;
+ defined(my $object_id = <$r>) or
+ die "get-mark failed, need git 2.6.0+\n";
+ chomp($self->{last_object_id} = $object_id);
+ }
+
my $ref = $self->{ref};
my $commit = $self->{mark}++;
my $parent = $tip =~ /\A:/ ? $tip : undef;
--
EW
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [WIP 07/17] t/import: test for last_object_id insertion
2018-02-15 11:08 ` [WIP 0/17] initial v2 work based on one-file tree Eric Wong (Contractor, The Linux Foundation)
` (5 preceding siblings ...)
2018-02-15 11:08 ` [WIP 06/17] import: initial handling for v2 Eric Wong (Contractor, The Linux Foundation)
@ 2018-02-15 11:08 ` Eric Wong (Contractor, The Linux Foundation)
2018-02-15 11:08 ` [WIP 08/17] content_id: add test case Eric Wong (Contractor, The Linux Foundation)
` (9 subsequent siblings)
16 siblings, 0 replies; 21+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-02-15 11:08 UTC (permalink / raw)
To: meta
Check for this before doing the Xapian-based v2 importer.
---
t/import.t | 25 ++++++++++++++++++++++++-
1 file changed, 24 insertions(+), 1 deletion(-)
diff --git a/t/import.t b/t/import.t
index fb6238e..92c82b9 100644
--- a/t/import.t
+++ b/t/import.t
@@ -6,7 +6,10 @@ use Test::More;
use PublicInbox::MIME;
use PublicInbox::Git;
use PublicInbox::Import;
-use File::Temp qw/tempdir/;
+use PublicInbox::Spawn qw(spawn);
+use IO::File;
+use Fcntl qw(:DEFAULT);
+use File::Temp qw/tempdir tempfile/;
my $dir = tempdir('pi-import-XXXXXX', TMPDIR => 1, CLEANUP => 1);
is(system(qw(git init -q --bare), $dir), 0, 'git init successful');
@@ -20,10 +23,30 @@ my $mime = PublicInbox::MIME->create(
'Content-Type' => 'text/plain',
Subject => 'this is a subject',
'Message-ID' => '<a@example.com>',
+ Date => 'Fri, 02 Oct 1993 00:00:00 +0000',
],
body => "hello world\n",
);
+
+$im->{want_object_id} = 1 if 'v2';
like($im->add($mime), qr/\A:\d+\z/, 'added one message');
+
+if ('v2') {
+ like($im->{last_object_id}, qr/\A[a-f0-9]{40}\z/, 'got last_object_id');
+ my @cmd = ('git', "--git-dir=$git->{git_dir}", qw(hash-object --stdin));
+ my $in = tempfile();
+ print $in $mime->as_string or die "write failed: $!";
+ $in->flush or die "flush failed: $!";
+ $in->seek(0, SEEK_SET);
+ my $out = tempfile();
+ my $pid = spawn(\@cmd, {}, { 0 => fileno($in), 1 => fileno($out)});
+ is(waitpid($pid, 0), $pid, 'waitpid succeeds on hash-object');
+ is($?, 0, 'hash-object');
+ $out->seek(0, SEEK_SET);
+ chomp(my $hashed_obj = <$out>);
+ is($hashed_obj, $im->{last_object_id}, "last_object_id matches exp");
+}
+
$im->done;
my @revs = $git->qx(qw(rev-list HEAD));
is(scalar @revs, 1, 'one revision created');
--
EW
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [WIP 08/17] content_id: add test case
2018-02-15 11:08 ` [WIP 0/17] initial v2 work based on one-file tree Eric Wong (Contractor, The Linux Foundation)
` (6 preceding siblings ...)
2018-02-15 11:08 ` [WIP 07/17] t/import: test for last_object_id insertion Eric Wong (Contractor, The Linux Foundation)
@ 2018-02-15 11:08 ` Eric Wong (Contractor, The Linux Foundation)
2018-02-15 11:08 ` [WIP 09/17] searchmsg: add mid_mime import for _extract_mid Eric Wong (Contractor, The Linux Foundation)
` (8 subsequent siblings)
16 siblings, 0 replies; 21+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-02-15 11:08 UTC (permalink / raw)
To: meta
---
MANIFEST | 1 +
t/content_id.t | 24 ++++++++++++++++++++++++
2 files changed, 25 insertions(+)
create mode 100644 t/content_id.t
diff --git a/MANIFEST b/MANIFEST
index 85e8503..1df27f2 100644
--- a/MANIFEST
+++ b/MANIFEST
@@ -128,6 +128,7 @@ t/check-www-inbox.perl
t/common.perl
t/config.t
t/config_limiter.t
+t/content_id.t
t/emergency.t
t/fail-bin/spamc
t/feed.t
diff --git a/t/content_id.t b/t/content_id.t
new file mode 100644
index 0000000..c0ae6ec
--- /dev/null
+++ b/t/content_id.t
@@ -0,0 +1,24 @@
+# Copyright (C) 2018 all contributors <meta@public-inbox.org>
+# License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
+use strict;
+use warnings;
+use Test::More;
+use PublicInbox::ContentId qw(content_id);
+use Email::MIME;
+
+my $mime = Email::MIME->create(
+ header => [
+ From => 'a@example.com',
+ To => 'b@example.com',
+ 'Content-Type' => 'text/plain',
+ Subject => 'this is a subject',
+ 'Message-ID' => '<a@example.com>',
+ Date => 'Fri, 02 Oct 1993 00:00:00 +0000',
+ ],
+ body => "hello world\n",
+);
+
+my $res = content_id($mime);
+like($res, qr/\ASHA-256:[a-f0-9]{64}\z/, 'cid in format expected');
+
+done_testing();
--
EW
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [WIP 09/17] searchmsg: add mid_mime import for _extract_mid
2018-02-15 11:08 ` [WIP 0/17] initial v2 work based on one-file tree Eric Wong (Contractor, The Linux Foundation)
` (7 preceding siblings ...)
2018-02-15 11:08 ` [WIP 08/17] content_id: add test case Eric Wong (Contractor, The Linux Foundation)
@ 2018-02-15 11:08 ` Eric Wong (Contractor, The Linux Foundation)
2018-02-15 11:08 ` [WIP 10/17] scripts/import_vger_from_mbox: support --dry-run option Eric Wong (Contractor, The Linux Foundation)
` (7 subsequent siblings)
16 siblings, 0 replies; 21+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-02-15 11:08 UTC (permalink / raw)
To: meta
Oops, I guess this code was never called and may not be
needed. But for now, import it so it can run properly.
---
lib/PublicInbox/SearchMsg.pm | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/lib/PublicInbox/SearchMsg.pm b/lib/PublicInbox/SearchMsg.pm
index afba8b1..70aa706 100644
--- a/lib/PublicInbox/SearchMsg.pm
+++ b/lib/PublicInbox/SearchMsg.pm
@@ -8,7 +8,7 @@ use strict;
use warnings;
use Search::Xapian;
use Date::Parse qw/str2time/;
-use PublicInbox::MID qw/mid_clean/;
+use PublicInbox::MID qw/mid_clean mid_mime/;
use PublicInbox::Address;
sub new {
--
EW
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [WIP 10/17] scripts/import_vger_from_mbox: support --dry-run option
2018-02-15 11:08 ` [WIP 0/17] initial v2 work based on one-file tree Eric Wong (Contractor, The Linux Foundation)
` (8 preceding siblings ...)
2018-02-15 11:08 ` [WIP 09/17] searchmsg: add mid_mime import for _extract_mid Eric Wong (Contractor, The Linux Foundation)
@ 2018-02-15 11:08 ` Eric Wong (Contractor, The Linux Foundation)
2018-02-15 11:08 ` [WIP 11/17] import: APIs to support v2 use Eric Wong (Contractor, The Linux Foundation)
` (6 subsequent siblings)
16 siblings, 0 replies; 21+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-02-15 11:08 UTC (permalink / raw)
To: meta
This can be useful for getting a performance baseline
of just Email::MIME and Date: header parsing. We'll need
to do some Date: header parsing for LKML since there are
some wonky date formats which cause the git RFC822 parser
to choke.
---
scripts/import_vger_from_mbox | 29 ++++++++++++++++++++++++++---
1 file changed, 26 insertions(+), 3 deletions(-)
diff --git a/scripts/import_vger_from_mbox b/scripts/import_vger_from_mbox
index 9b3afc8..3fa5c77 100644
--- a/scripts/import_vger_from_mbox
+++ b/scripts/import_vger_from_mbox
@@ -3,16 +3,21 @@
# License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
use strict;
use warnings;
+use Getopt::Long qw/:config gnu_getopt no_ignore_case auto_abbrev/;
+use Date::Parse qw/str2time/;
use Email::MIME;
$Email::MIME::ContentType::STRICT_PARAMS = 0; # user input is imperfect
use PublicInbox::Git;
use PublicInbox::Import;
my $usage = "usage: $0 NAME EMAIL <MBOX\n";
+my $dry_run;
+my %opts = ( 'n|dry-run' => \$dry_run );
+GetOptions(%opts) or die $usage;
chomp(my $git_dir = `git rev-parse --git-dir`);
my $git = PublicInbox::Git->new($git_dir);
my $name = shift or die $usage; # git
my $email = shift or die $usage; # git@vger.kernel.org
-my $im = PublicInbox::Import->new($git, $name, $email);
+my $im = $dry_run ? undef : PublicInbox::Import->new($git, $name, $email);
binmode STDIN;
my $msg = '';
use PublicInbox::Filter::Vger;
@@ -22,9 +27,27 @@ sub do_add ($$) {
$$msg =~ s/(\r?\n)+\z/$1/s;
$msg = Email::MIME->new($$msg);
$msg = $vger->scrub($msg);
+ my $hdr = $msg->header_obj;
+ my $date = $hdr->header_raw('Date');
+ if ($date) {
+ eval { str2time($date) };
+ if ($@) {
+ warn "bad Date: $date in ",
+ $hdr->header_raw('Message-ID'), ": $@\n";
+ }
+ } else {
+ warn "missing Date: $date in ",
+ $hdr->header_raw('Message-ID'), ": $@\n";
+ my $n = 0;
+ foreach my $r ($hdr->header_raw('Received')) {
+ warn "$n Received: $r\n";
+ }
+ warn(('-' x 72), "\n");
+ }
+ return unless $im;
$im->add($msg) or
warn "duplicate: ",
- $msg->header_obj->header_raw('Message-ID'), "\n";
+ $hdr->header_raw('Message-ID'), "\n";
}
# asctime: From example@example.com Fri Jun 23 02:56:55 2000
@@ -44,4 +67,4 @@ while (defined(my $l = <STDIN>)) {
$msg .= $l;
}
do_add($im, \$msg) if $msg;
-$im->done;
+$im->done if $im;
--
EW
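
A minimal sketch of the --dry-run pattern described above, using only
the core Getopt::Long module. The parse_args() helper and the
placeholder importer hash are hypothetical and exist only for
illustration; the real script calls GetOptions on @ARGV and builds a
PublicInbox::Import object.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);
Getopt::Long::Configure(qw(gnu_getopt no_ignore_case));

# Hypothetical helper: parse a copy of the argument list and return
# the dry-run flag followed by the remaining positional arguments.
sub parse_args {
    my (@argv) = @_;
    my $dry_run = 0;
    GetOptionsFromArray(\@argv, 'n|dry-run' => \$dry_run)
        or die "usage: $0 [-n|--dry-run] NAME EMAIL <MBOX\n";
    return ($dry_run, @argv);
}

my ($dry_run, $name, $email) =
    parse_args('--dry-run', 'git', 'git@vger.kernel.org');

# In dry-run mode no importer is created, so every later use must be
# guarded (as the patch does with "return unless $im" and
# "$im->done if $im").
my $im = $dry_run ? undef : { name => $name, email => $email };
print $im ? "importing as $name <$email>\n" : "dry run: parsing only\n";
```

Leaving the importer undefined keeps the parsing path identical in both
modes, which is what makes the timing comparison meaningful.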
* [WIP 11/17] import: APIs to support v2 use
2018-02-15 11:08 ` [WIP 0/17] initial v2 work based on one-file tree Eric Wong (Contractor, The Linux Foundation)
` (9 preceding siblings ...)
2018-02-15 11:08 ` [WIP 10/17] scripts/import_vger_from_mbox: support --dry-run option Eric Wong (Contractor, The Linux Foundation)
@ 2018-02-15 11:08 ` Eric Wong (Contractor, The Linux Foundation)
2018-02-15 11:08 ` [WIP 12/17] search: free up 'Q' prefix for a real unique identifier Eric Wong (Contractor, The Linux Foundation)
` (5 subsequent siblings)
16 siblings, 0 replies; 21+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-02-15 11:08 UTC (permalink / raw)
To: meta
Wrap "get-mark" and "checkpoint" commands for git-fast-import
while documenting/cementing parts of the API.
---
lib/PublicInbox/Import.pm | 28 ++++++++++++++++++++++------
t/import.t | 4 +++-
2 files changed, 25 insertions(+), 7 deletions(-)
diff --git a/lib/PublicInbox/Import.pm b/lib/PublicInbox/Import.pm
index b8e9dd0..811e355 100644
--- a/lib/PublicInbox/Import.pm
+++ b/lib/PublicInbox/Import.pm
@@ -138,9 +138,27 @@ sub check_remove_v1 {
(undef, $cur);
}
+# used for v2 (maybe)
+sub checkpoint {
+ my ($self) = @_;
+ return unless $self->{pid};
+ print { $self->{out} } "checkpoint\n" or wfail;
+ undef;
+}
+
+# used for v2
+sub get_mark {
+ my ($self, $mark) = @_;
+ die "not active\n" unless $self->{pid};
+ my ($r, $w) = $self->gfi_start;
+ print $w "get-mark $mark\n" or wfail;
+ defined(my $oid = <$r>) or die "get-mark failed, need git 2.6.0+\n";
+ $oid;
+}
+
# returns undef on non-existent
-# ('MISMATCH', msg) on mismatch
-# (:MARK, msg) on success
+# ('MISMATCH', Email::MIME) on mismatch
+# (:MARK, Email::MIME) on success
#
# For v2 inboxes, the content_id is returned instead of the msg
# v2 callers should check with Xapian before calling this as
@@ -189,6 +207,7 @@ sub remove {
}
# returns undef on duplicate
+# returns the :MARK of the most recent commit
sub add {
my ($self, $mime, $check_cb) = @_; # mime = Email::MIME
@@ -234,10 +253,7 @@ sub add {
# v2: we need this for Xapian
if ($self->{want_object_id}) {
- print $w "get-mark :$blob\n" or wfail;
- defined(my $object_id = <$r>) or
- die "get-mark failed, need git 2.6.0+\n";
- chomp($self->{last_object_id} = $object_id);
+ chomp($self->{last_object_id} = $self->get_mark(":$blob"));
}
my $ref = $self->{ref};
diff --git a/t/import.t b/t/import.t
index 92c82b9..ca59772 100644
--- a/t/import.t
+++ b/t/import.t
@@ -87,6 +87,8 @@ isnt($msg->header('Subject'), $mime->header('Subject'), 'subject mismatch');
$mime->header_set('Message-Id', '<failcheck@example.com>');
is($im->add($mime, sub { undef }), undef, 'check callback fails');
is($im->remove($mime), undef, 'message not added, so not removed');
-
+is(undef, $im->checkpoint, 'checkpoint works before ->done');
$im->done;
+is(undef, $im->checkpoint, 'checkpoint works after ->done');
+$im->checkpoint;
done_testing();
--
EW
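
For readers unfamiliar with the fast-import stream, here is a rough
sketch of the commands these wrappers emit. The helper names
(blob_cmd, get_mark_cmd, checkpoint_cmd) are made up for this
illustration and are not part of PublicInbox::Import; the sketch only
builds the text and does not talk to git.

```perl
use strict;
use warnings;

# Hypothetical helper: fast-import commands to store $data as a blob
# under the numeric mark $mark. "data <count>" takes a byte count, so
# $data is assumed to already be a string of octets.
sub blob_cmd {
    my ($mark, $data) = @_;
    "blob\nmark :$mark\ndata " . length($data) . "\n$data\n";
}

# Ask fast-import to print the object ID assigned to a mark
# (the patch notes this needs git 2.6.0+).
sub get_mark_cmd { "get-mark :$_[0]\n" }

# Ask fast-import to flush the current packfile and update refs
# without ending the stream.
sub checkpoint_cmd { "checkpoint\n" }

my $stream = blob_cmd(1, "hello world\n")
           . get_mark_cmd(1)
           . checkpoint_cmd();
print $stream;
```

In the real module the reply to "get-mark" is read back from the
fast-import process, which is why get_mark dies when no process is
active.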
* [WIP 12/17] search: free up 'Q' prefix for a real unique identifier
2018-02-15 11:08 ` [WIP 0/17] initial v2 work based on one-file tree Eric Wong (Contractor, The Linux Foundation)
` (10 preceding siblings ...)
2018-02-15 11:08 ` [WIP 11/17] import: APIs to support v2 use Eric Wong (Contractor, The Linux Foundation)
@ 2018-02-15 11:08 ` Eric Wong (Contractor, The Linux Foundation)
2018-02-22 21:08 ` Eric Wong
2018-02-15 11:08 ` [WIP 13/17] searchidx: fix comment around next_thread_id Eric Wong (Contractor, The Linux Foundation)
` (4 subsequent siblings)
16 siblings, 1 reply; 21+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-02-15 11:08 UTC (permalink / raw)
To: meta
This will allow easier compatibility with v2 code, which will
introduce content_id as the unique identifier.
The old "XMID" becomes "XM" as a free text searchable term.
"Q" becomes "XMID" as a boolean prefix.
There are no user-visible changes in this, but there needs to
be a schema version bump later on...
(more changes planned which can affect v1)
---
lib/PublicInbox/Search.pm | 8 ++++----
lib/PublicInbox/SearchIdx.pm | 8 ++++----
lib/PublicInbox/SearchMsg.pm | 2 +-
3 files changed, 9 insertions(+), 9 deletions(-)
diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm
index 9ab5afe..3ec96ca 100644
--- a/lib/PublicInbox/Search.pm
+++ b/lib/PublicInbox/Search.pm
@@ -56,13 +56,13 @@ my %bool_pfx_internal = (
);
my %bool_pfx_external = (
- mid => 'Q', # uniQue id (Message-ID)
+ mid => 'XMID', # uniQue id (Message-ID)
);
my %prob_prefix = (
# for mairix compatibility
s => 'S',
- m => 'XMID', # 'mid:' (bool) is exact, 'm:' (prob) can do partial
+ m => 'XM', # 'mid:' (bool) is exact, 'm:' (prob) can do partial
f => 'A',
t => 'XTO',
tc => 'XTO XCC',
@@ -85,7 +85,7 @@ my %prob_prefix = (
dfblob => 'XDFPRE XDFPOST',
# default:
- '' => 'XMID S A XNQ XQUOT XFN',
+ '' => 'XM S A XNQ XQUOT XFN',
);
# not documenting m: and mid: for now, the using the URLs works w/o Xapian
@@ -285,7 +285,7 @@ sub lookup_message {
my ($self, $mid) = @_;
$mid = mid_clean($mid);
- my $doc_id = $self->find_unique_doc_id('Q' . $mid);
+ my $doc_id = $self->find_unique_doc_id('XMID' . $mid);
my $smsg;
if (defined $doc_id) {
# raises on error:
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 66faed3..0ee0779 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -276,7 +276,7 @@ sub add_message {
}
$smsg = PublicInbox::SearchMsg->new($mime);
my $doc = $smsg->{doc};
- $doc->add_term('Q' . $mid);
+ $doc->add_term('XMID' . $mid);
my $subj = $smsg->subject;
if ($subj ne '') {
@@ -334,7 +334,7 @@ sub add_message {
});
link_message($self, $smsg, $old_tid);
- $tg->index_text($mid, 1, 'XMID');
+ $tg->index_text($mid, 1, 'XM');
$doc->set_data($smsg->to_doc_data($blob));
if (my $altid = $self->{-altid}) {
@@ -366,7 +366,7 @@ sub remove_message {
$mid = mid_clean($mid);
eval {
- $doc_id = $self->find_unique_doc_id('Q' . $mid);
+ $doc_id = $self->find_unique_doc_id('XMID' . $mid);
if (defined $doc_id) {
$db->delete_document($doc_id);
} else {
@@ -683,7 +683,7 @@ sub create_ghost {
my $tid = $self->next_thread_id;
my $doc = Search::Xapian::Document->new;
- $doc->add_term('Q' . $mid);
+ $doc->add_term('XMID' . $mid);
$doc->add_term('G' . $tid);
$doc->add_term('T' . 'ghost');
diff --git a/lib/PublicInbox/SearchMsg.pm b/lib/PublicInbox/SearchMsg.pm
index 70aa706..25c1abb 100644
--- a/lib/PublicInbox/SearchMsg.pm
+++ b/lib/PublicInbox/SearchMsg.pm
@@ -157,7 +157,7 @@ sub mid ($;$) {
} elsif (my $rv = $self->{mid}) {
$rv;
} else {
- $self->{mid} = _get_term_val($self, 'Q', qr/\AQ/) ||
+ $self->{mid} = _get_term_val($self, 'XMID', qr/\AXMID/) ||
$self->_extract_mid;
}
}
--
EW
* [WIP 13/17] searchidx: fix comment around next_thread_id
2018-02-15 11:08 ` [WIP 0/17] initial v2 work based on one-file tree Eric Wong (Contractor, The Linux Foundation)
` (11 preceding siblings ...)
2018-02-15 11:08 ` [WIP 12/17] search: free up 'Q' prefix for a real unique identifier Eric Wong (Contractor, The Linux Foundation)
@ 2018-02-15 11:08 ` Eric Wong (Contractor, The Linux Foundation)
2018-02-15 11:08 ` [WIP 14/17] address: extract more characters from email addresses Eric Wong (Contractor, The Linux Foundation)
` (3 subsequent siblings)
16 siblings, 0 replies; 21+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-02-15 11:08 UTC (permalink / raw)
To: meta
I decided not to copy the notmuch implementation regarding
serialization of integers to Xapian metadata.
---
lib/PublicInbox/SearchIdx.pm | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 0ee0779..fa5057f 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -394,7 +394,7 @@ sub term_generator { # write-only
}
# increments last_thread_id counter
-# returns a 64-bit integer represented as a hex string
+# returns a 64-bit integer represented as a decimal string
sub next_thread_id {
my ($self) = @_;
my $db = $self->{xdb};
--
EW
* [WIP 14/17] address: extract more characters from email addresses
2018-02-15 11:08 ` [WIP 0/17] initial v2 work based on one-file tree Eric Wong (Contractor, The Linux Foundation)
` (12 preceding siblings ...)
2018-02-15 11:08 ` [WIP 13/17] searchidx: fix comment around next_thread_id Eric Wong (Contractor, The Linux Foundation)
@ 2018-02-15 11:08 ` Eric Wong (Contractor, The Linux Foundation)
2018-02-15 11:08 ` [WIP 15/17] import: pass "raw" dates to git-fast-import(1) Eric Wong (Contractor, The Linux Foundation)
` (2 subsequent siblings)
16 siblings, 0 replies; 21+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-02-15 11:08 UTC (permalink / raw)
To: meta
There are a lot of weird characters which show up in LKML archives
which we did not support before. Furthermore, allow spaces
before the '>' in the From: line, as at least one non-spam
poster used it.
---
lib/PublicInbox/Address.pm | 3 ++-
t/address.t | 5 +++--
2 files changed, 5 insertions(+), 3 deletions(-)
diff --git a/lib/PublicInbox/Address.pm b/lib/PublicInbox/Address.pm
index f334ade..548f417 100644
--- a/lib/PublicInbox/Address.pm
+++ b/lib/PublicInbox/Address.pm
@@ -8,7 +8,8 @@ use warnings;
# just enough to make thing sanely displayable and pass to git
sub emails {
- ($_[0] =~ /([\w\.\+=\-]+\@[\w\.\-]+)>?\s*(?:\(.*?\))?(?:,\s*|\z)/g)
+ ($_[0] =~ /([\w\.\+=\?"\(\)\-!#\$%&'\*\/\^\`\|\{\}~]+\@[\w\.\-\(\)]+)
+ (?:\s[^>]*)?>?\s*(?:\(.*?\))?(?:,\s*|\z)/gx)
}
sub names {
diff --git a/t/address.t b/t/address.t
index e35e4f8..eced5c4 100644
--- a/t/address.t
+++ b/t/address.t
@@ -9,8 +9,9 @@ is_deeply([qw(e@example.com e@example.org)],
[PublicInbox::Address::emails('User <e@example.com>, e@example.org')],
'address extraction works as expected');
-is_deeply([PublicInbox::Address::emails('"ex@example.com" <ex@example.com>')],
- [qw(ex@example.com)]);
+is_deeply(['user@example.com'],
+ [PublicInbox::Address::emails('<user@example.com (Comment)>')],
+ 'comment after domain accepted before >');
my @names = PublicInbox::Address::names(
'User <e@e>, e@e, "John A. Doe" <j@d>, <x@x>');
--
EW
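
A standalone sketch for trying the relaxed pattern outside the test
suite. The regex is copied from the patch above; the sample inputs are
the ones used in t/address.t, and the bare emails() function here is a
local copy rather than PublicInbox::Address itself.

```perl
use strict;
use warnings;

# Local copy of the extraction regex from the patch above.
sub emails {
    ($_[0] =~ /([\w\.\+=\?"\(\)\-!#\$%&'\*\/\^\`\|\{\}~]+\@[\w\.\-\(\)]+)
        (?:\s[^>]*)?>?\s*(?:\(.*?\))?(?:,\s*|\z)/gx)
}

# Two addresses, one in angle brackets and one bare.
print join(' ', emails('User <e@example.com>, e@example.org')), "\n";
# prints "e@example.com e@example.org"

# A comment between the domain and the closing '>' is skipped.
print join(' ', emails('<user@example.com (Comment)>')), "\n";
# prints "user@example.com"
```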
* [WIP 15/17] import: pass "raw" dates to git-fast-import(1)
2018-02-15 11:08 ` [WIP 0/17] initial v2 work based on one-file tree Eric Wong (Contractor, The Linux Foundation)
` (13 preceding siblings ...)
2018-02-15 11:08 ` [WIP 14/17] address: extract more characters from email addresses Eric Wong (Contractor, The Linux Foundation)
@ 2018-02-15 11:08 ` Eric Wong (Contractor, The Linux Foundation)
2018-02-15 11:08 ` [WIP 16/17] scripts/import_vger_from_mbox: use v2 layout for import Eric Wong (Contractor, The Linux Foundation)
2018-02-15 11:08 ` [WIP 17/17] import: quiet down warnings from bogus From: lines Eric Wong (Contractor, The Linux Foundation)
16 siblings, 0 replies; 21+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-02-15 11:08 UTC (permalink / raw)
To: meta
For LKML, it appears we need an even more liberal parser than
the RFC2822 date parser in git. I have not validated that
Date::Parse parses dates correctly, but this at least prevents
git-fast-import(1) from choking.
---
lib/PublicInbox/Import.pm | 65 +++++++++++++++++++++++++++++++++++------------
1 file changed, 49 insertions(+), 16 deletions(-)
diff --git a/lib/PublicInbox/Import.pm b/lib/PublicInbox/Import.pm
index 811e355..845fbb6 100644
--- a/lib/PublicInbox/Import.pm
+++ b/lib/PublicInbox/Import.pm
@@ -12,6 +12,8 @@ use PublicInbox::Spawn qw(spawn);
use PublicInbox::MID qw(mid_mime mid2path);
use PublicInbox::Address;
use PublicInbox::ContentId qw(content_id);
+use Date::Parse qw(str2time);
+use Time::Zone qw(tz_offset);
sub new {
my ($class, $git, $name, $email, $ibx) = @_;
@@ -57,7 +59,7 @@ sub gfi_start {
chomp($self->{tip} = $git->qx(qw(rev-parse --revs-only), $self->{ref}));
my @cmd = ('git', "--git-dir=$git_dir", qw(fast-import
- --quiet --done --date-format=rfc2822));
+ --quiet --done --date-format=raw));
my $rdr = { 0 => fileno($out_r), 1 => fileno($in_w) };
my $pid = spawn(\@cmd, undef, $rdr);
die "spawn fast-import failed: $!" unless defined $pid;
@@ -74,14 +76,7 @@ sub gfi_start {
sub wfail () { die "write to fast-import failed: $!" }
-sub now2822 () {
- my @t = gmtime(time);
- my $day = qw(Sun Mon Tue Wed Thu Fri Sat)[$t[6]];
- my $mon = qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec)[$t[4]];
-
- sprintf('%s, %2d %s %d %02d:%02d:%02d +0000',
- $day, $t[3], $mon, $t[5] + 1900, $t[2], $t[1], $t[0]);
-}
+sub now_raw () { time . ' +0000' }
sub norm_body ($) {
my ($mime) = @_;
@@ -189,7 +184,7 @@ sub remove {
print $w "reset $ref\n" or wfail;
}
my $ident = $self->{ident};
- my $now = now2822();
+ my $now = now_raw();
$msg ||= 'rm';
my $len = length($msg) + 1;
print $w "commit $ref\nmark :$commit\n",
@@ -206,6 +201,43 @@ sub remove {
(($self->{tip} = ":$commit"), $cur);
}
+sub parse_date ($) {
+ my ($mime) = @_;
+ my $hdr = $mime->header_obj;
+ my $date = $hdr->header_raw('Date');
+ my ($ts, $zone);
+ my $mid = $hdr->header_raw('Message-ID');
+ if ($date) {
+ $ts = eval { str2time($date) };
+ if ($@) {
+ warn "bad Date: $date in $mid: $@\n";
+ } elsif ($date =~ /\s+([\+\-]\d+)\s*\z/) {
+ $zone = $1;
+ }
+ }
+ unless ($ts) {
+ my @recvd = $hdr->header_raw('Received');
+ foreach my $r (@recvd) {
+ $zone = undef;
+ $r =~ /\s*(\d+\s+[[:alpha:]]+\s+\d{2,4}\s+
+ \d+\D\d+(?:\D\d+)\s+([\+\-]\d+))/osx or next;
+ $zone = $2;
+ $ts = eval { str2time($1) } and last;
+ warn "no date in Received: $r\n";
+ }
+ }
+ $zone ||= '+0000';
+ # "-1200" is the westernmost zone offset,
+ # but git fast-import is liberal so we use "-1400"
+ if ($zone >= 1400 || $zone <= -1400) {
+ warn "bogus TZ offset: $zone, ignoring and assuming +0000\n";
+ $zone = '+0000';
+ }
+ $ts ||= time;
+ $ts = 0 if $ts < 0; # git uses unsigned times
+ "$ts $zone";
+}
+
# returns undef on duplicate
# returns the :MARK of the most recent commit
sub add {
@@ -220,7 +252,7 @@ sub add {
# <CAD0k6qSUYANxbjjbE4jTW4EeVwOYgBD=bXkSu=akiYC_CB7Ffw@mail.gmail.com>
$name =~ tr/<>//d;
- my $date = $mime->header('Date');
+ my $date_raw = parse_date($mime);
my $subject = $mime->header('Subject');
$subject = '(no subject)' unless defined $subject;
my $path_type = $self->{path_type};
@@ -246,10 +278,11 @@ sub add {
$mime = $check_cb->($mime) or return;
}
- $mime = $mime->as_string;
my $blob = $self->{mark}++;
- print $w "blob\nmark :$blob\ndata ", length($mime), "\n" or wfail;
- print $w $mime, "\n" or wfail;
+ my $str = $mime->as_string;
+ print $w "blob\nmark :$blob\ndata ", length($str), "\n" or wfail;
+ print $w $str, "\n" or wfail;
+ $str = undef;
# v2: we need this for Xapian
if ($self->{want_object_id}) {
@@ -269,8 +302,8 @@ sub add {
utf8::encode($subject);
# quiet down wide character warnings:
print $w "commit $ref\nmark :$commit\n",
- "author $name <$email> $date\n",
- "committer $self->{ident} ", now2822(), "\n" or wfail;
+ "author $name <$email> $date_raw\n",
+ "committer $self->{ident} ", now_raw(), "\n" or wfail;
print $w "data ", (length($subject) + 1), "\n",
$subject, "\n\n" or wfail;
if ($tip ne '') {
--
EW
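
With --date-format=raw, fast-import expects "<seconds since epoch>
<offset>" rather than an RFC2822 string. The sketch below isolates the
clamping done at the end of parse_date(); raw_date() is a hypothetical
name, and it assumes the caller has already extracted the epoch time
and zone offset (the patch uses Date::Parse for that step).

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the tail of parse_date() above:
# normalize an epoch timestamp and a "+HHMM"/"-HHMM" offset into the
# "raw" date format accepted by git-fast-import.
sub raw_date {
    my ($ts, $zone) = @_;

    # Fall back to UTC when the offset is missing or malformed.
    $zone = '+0000' unless defined $zone && $zone =~ /\A[\+\-]\d{4}\z/;

    # Same bounds as the patch: treat offsets at or beyond 14 hours
    # as bogus and fall back to UTC.
    $zone = '+0000' if $zone >= 1400 || $zone <= -1400;

    $ts = time unless defined $ts; # no usable date at all: use "now"
    $ts = 0 if $ts < 0;            # git uses unsigned times
    "$ts $zone";
}

print raw_date(1518692880, '-0800'), "\n"; # prints "1518692880 -0800"
print raw_date(-5, '+9999'), "\n";         # prints "0 +0000"
```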
* [WIP 16/17] scripts/import_vger_from_mbox: use v2 layout for import
2018-02-15 11:08 ` [WIP 0/17] initial v2 work based on one-file tree Eric Wong (Contractor, The Linux Foundation)
` (14 preceding siblings ...)
2018-02-15 11:08 ` [WIP 15/17] import: pass "raw" dates to git-fast-import(1) Eric Wong (Contractor, The Linux Foundation)
@ 2018-02-15 11:08 ` Eric Wong (Contractor, The Linux Foundation)
2018-02-15 11:08 ` [WIP 17/17] import: quiet down warnings from bogus From: lines Eric Wong (Contractor, The Linux Foundation)
16 siblings, 0 replies; 21+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-02-15 11:08 UTC (permalink / raw)
To: meta
Big lists are orders of magnitude more efficient with v2.
---
scripts/import_vger_from_mbox | 24 ++++++------------------
1 file changed, 6 insertions(+), 18 deletions(-)
diff --git a/scripts/import_vger_from_mbox b/scripts/import_vger_from_mbox
index 3fa5c77..6ea2ca5 100644
--- a/scripts/import_vger_from_mbox
+++ b/scripts/import_vger_from_mbox
@@ -22,32 +22,20 @@ binmode STDIN;
my $msg = '';
use PublicInbox::Filter::Vger;
my $vger = PublicInbox::Filter::Vger->new;
+if ($im) {
+ $im->{ssoma_lock} = 0;
+ $im->{path_type} = 'v2';
+}
+
sub do_add ($$) {
my ($im, $msg) = @_;
$$msg =~ s/(\r?\n)+\z/$1/s;
$msg = Email::MIME->new($$msg);
$msg = $vger->scrub($msg);
- my $hdr = $msg->header_obj;
- my $date = $hdr->header_raw('Date');
- if ($date) {
- eval { str2time($date) };
- if ($@) {
- warn "bad Date: $date in ",
- $hdr->header_raw('Message-ID'), ": $@\n";
- }
- } else {
- warn "missing Date: $date in ",
- $hdr->header_raw('Message-ID'), ": $@\n";
- my $n = 0;
- foreach my $r ($hdr->header_raw('Received')) {
- warn "$n Received: $r\n";
- }
- warn(('-' x 72), "\n");
- }
return unless $im;
$im->add($msg) or
warn "duplicate: ",
- $hdr->header_raw('Message-ID'), "\n";
+ $msg->header_obj->header_raw('Message-ID'), "\n";
}
# asctime: From example@example.com Fri Jun 23 02:56:55 2000
--
EW
* [WIP 17/17] import: quiet down warnings from bogus From: lines
2018-02-15 11:08 ` [WIP 0/17] initial v2 work based on one-file tree Eric Wong (Contractor, The Linux Foundation)
` (15 preceding siblings ...)
2018-02-15 11:08 ` [WIP 16/17] scripts/import_vger_from_mbox: use v2 layout for import Eric Wong (Contractor, The Linux Foundation)
@ 2018-02-15 11:08 ` Eric Wong (Contractor, The Linux Foundation)
16 siblings, 0 replies; 21+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-02-15 11:08 UTC (permalink / raw)
To: meta
There's a lot of crap in archives and git-fast-import
accepts empty names and email addresses for authors
just fine.
---
lib/PublicInbox/Import.pm | 27 +++++++++++++++++++--------
1 file changed, 19 insertions(+), 8 deletions(-)
diff --git a/lib/PublicInbox/Import.pm b/lib/PublicInbox/Import.pm
index 845fbb6..f8d1003 100644
--- a/lib/PublicInbox/Import.pm
+++ b/lib/PublicInbox/Import.pm
@@ -246,11 +246,6 @@ sub add {
my $from = $mime->header('From');
my ($email) = PublicInbox::Address::emails($from);
my ($name) = PublicInbox::Address::names($from);
- # git gets confused with:
- # "'A U Thor <u@example.com>' via foo" <foo@example.com>
- # ref:
- # <CAD0k6qSUYANxbjjbE4jTW4EeVwOYgBD=bXkSu=akiYC_CB7Ffw@mail.gmail.com>
- $name =~ tr/<>//d;
my $date_raw = parse_date($mime);
my $subject = $mime->header('Subject');
@@ -297,10 +292,26 @@ sub add {
print $w "reset $ref\n" or wfail;
}
- utf8::encode($email);
- utf8::encode($name);
+ # quiet down wide character warnings with utf8::encode
+ if (defined $email) {
+ utf8::encode($email);
+ } else {
+ $email = '';
+ warn "no email in From: $from\n";
+ }
+
+ # git gets confused with:
+ # "'A U Thor <u@example.com>' via foo" <foo@example.com>
+ # ref:
+ # <CAD0k6qSUYANxbjjbE4jTW4EeVwOYgBD=bXkSu=akiYC_CB7Ffw@mail.gmail.com>
+ if (defined $name) {
+ $name =~ tr/<>//d;
+ utf8::encode($name);
+ } else {
+ $name = '';
+ warn "no name in From: $from\n";
+ }
utf8::encode($subject);
- # quiet down wide character warnings:
print $w "commit $ref\nmark :$commit\n",
"author $name <$email> $date_raw\n",
"committer $self->{ident} ", now_raw(), "\n" or wfail;
--
EW
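
A small sketch of the fallback behaviour this patch introduces, pulled
out into a hypothetical author_line() helper (the real code stays
inline in add() and also warns about the missing values).

```perl
use strict;
use warnings;

# Hypothetical helper: build the fast-import "author" line, tolerating
# a From: header from which no name or address could be extracted.
sub author_line {
    my ($name, $email, $date_raw) = @_;
    $name  = '' unless defined $name;
    $email = '' unless defined $email;
    $name =~ tr/<>//d;     # git gets confused by angle brackets in names
    utf8::encode($name);   # avoid wide character warnings on output
    utf8::encode($email);
    "author $name <$email> $date_raw\n";
}

print author_line('A U Thor', 'u@example.com', '1518692880 +0000');
print author_line(undef, undef, '0 +0000'); # empty name and address
```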
* Re: [WIP 12/17] search: free up 'Q' prefix for a real unique identifier
2018-02-15 11:08 ` [WIP 12/17] search: free up 'Q' prefix for a real unique identifier Eric Wong (Contractor, The Linux Foundation)
@ 2018-02-22 21:08 ` Eric Wong
0 siblings, 0 replies; 21+ messages in thread
From: Eric Wong @ 2018-02-22 21:08 UTC (permalink / raw)
To: meta
"Eric Wong (Contractor, The Linux Foundation)" <e@80x24.org> wrote:
> This will allow easier-compatibility with v2 code which will
> introduce content_id as the unique identifier.
> The old "XMID" becomes "XM" as a free text searchable term.
> "Q" becomes "XMID" as a boolean prefix.
Leaning towards rejecting this one. These prefixes such as "Q"
are merely customary measures with Xapian, and Message-ID is
customarily unique...