* [WIP 01/17] AUTHORS: add The Linux Foundation
2018-02-15 11:08 ` [WIP 0/17] initial v2 work based on one-file tree Eric Wong (Contractor, The Linux Foundation)
@ 2018-02-15 11:08 ` Eric Wong (Contractor, The Linux Foundation)
2018-02-15 11:08 ` [WIP 02/17] watch_maildir: allow '-' in mail filename Eric Wong (Contractor, The Linux Foundation)
` (15 subsequent siblings)
16 siblings, 0 replies; 21+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-02-15 11:08 UTC (permalink / raw)
To: meta
I'll be working as a contractor for The Linux Foundation on v2
in an effort to support LKML and associated lists.
---
AUTHORS | 1 +
1 file changed, 1 insertion(+)
diff --git a/AUTHORS b/AUTHORS
index 201ed03..1ad02cd 100644
--- a/AUTHORS
+++ b/AUTHORS
@@ -4,3 +4,4 @@ See history in git (via `git clone https://public-inbox.org/public-inbox')
for a full history of the project.
* Eric Wong <e@80x24.org> (BDFL)
+* The Linux Foundation (v2 work)
--
EW
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [WIP 02/17] watch_maildir: allow '-' in mail filename
2018-02-15 11:08 ` [WIP 0/17] initial v2 work based on one-file tree Eric Wong (Contractor, The Linux Foundation)
2018-02-15 11:08 ` [WIP 01/17] AUTHORS: add The Linux Foundation Eric Wong (Contractor, The Linux Foundation)
@ 2018-02-15 11:08 ` Eric Wong (Contractor, The Linux Foundation)
2018-02-15 11:08 ` [WIP 03/17] scripts/import_vger_from_mbox: relax From_ line match slightly Eric Wong (Contractor, The Linux Foundation)
` (14 subsequent siblings)
16 siblings, 0 replies; 21+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-02-15 11:08 UTC (permalink / raw)
To: meta
Hostnames can contain '-' and this allows public-inbox-watch(1)
to work on machines which generate Maildir files with '-' in
them.
---
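A quick way to see the effect of this change is to run the old and
new patterns from the hunk below side by side.  This is only a
sketch: the filenames are made up for illustration and were not
taken from a real Maildir.

```perl
#!/usr/bin/perl
# Sketch only: compare the old and new basename patterns.
use strict;
use warnings;

my $old = qr/\A[a-zA-Z0-9][\w:,=\.]+\z/;
my $new = qr/\A[a-zA-Z0-9][\-\w:,=\.]+\z/;

# hypothetical filenames, not taken from a real Maildir
my @names = (
	'1518692880.M1P2.host:2,S',	# no '-' in the hostname
	'1518692880.M1P2.my-host:2,S',	# '-' in the hostname
	'.hidden-file',			# leading '.', rejected by both
);

for my $n (@names) {
	printf "%-30s old=%s new=%s\n", $n,
		($n =~ $old ? 'match' : 'skip'),
		($n =~ $new ? 'match' : 'skip');
}
```

Only the second name should change from "skip" to "match"; the
first-character check is untouched, so names with a leading '.'
are still ignored.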
lib/PublicInbox/WatchMaildir.pm | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/lib/PublicInbox/WatchMaildir.pm b/lib/PublicInbox/WatchMaildir.pm
index a3fab42..403b6cf 100644
--- a/lib/PublicInbox/WatchMaildir.pm
+++ b/lib/PublicInbox/WatchMaildir.pm
@@ -170,7 +170,7 @@ sub _force_mid {
sub _try_path {
my ($self, $path) = @_;
my @p = split(m!/+!, $path);
- return if $p[-1] !~ /\A[a-zA-Z0-9][\w:,=\.]+\z/;
+ return if $p[-1] !~ /\A[a-zA-Z0-9][\-\w:,=\.]+\z/;
if ($p[-1] =~ /:2,([A-Z]+)\z/i) {
my $flags = $1;
return if $flags =~ /[DT]/; # no [D]rafts or [T]rashed mail
--
EW
* [WIP 03/17] scripts/import_vger_from_mbox: relax From_ line match slightly
2018-02-15 11:08 ` [WIP 0/17] initial v2 work based on one-file tree Eric Wong (Contractor, The Linux Foundation)
2018-02-15 11:08 ` [WIP 01/17] AUTHORS: add The Linux Foundation Eric Wong (Contractor, The Linux Foundation)
2018-02-15 11:08 ` [WIP 02/17] watch_maildir: allow '-' in mail filename Eric Wong (Contractor, The Linux Foundation)
@ 2018-02-15 11:08 ` Eric Wong (Contractor, The Linux Foundation)
2018-02-15 11:08 ` [WIP 04/17] import: stop writing legacy ssoma.index by default Eric Wong (Contractor, The Linux Foundation)
` (13 subsequent siblings)
16 siblings, 0 replies; 21+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-02-15 11:08 UTC (permalink / raw)
To: meta
The mboxes I got from cregit have two spaces after the email
address, while the "git format-patch" output I'm used to dealing
with only has one space.
It's still a "strict" match in that it checks for something
resembling a timestamp, but it relaxes the number of spaces
between the email address and date.
---
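To illustrate the difference, the two patterns from the hunk below
can be tried against a From_ line with one space and one with two
spaces after the address.  The sample lines reuse the address from
the asctime comment in the script; the two-space variant is my own
reconstruction of what the cregit mboxes look like.

```perl
#!/usr/bin/perl
# Sketch only: old vs. new $from_strict on sample From_ lines.
use strict;
use warnings;

my $old = qr/^From \S+ \S+ \S+ +\S+ [^:]+:[^:]+:[^:]+ [^:]+/;
my $new = qr/^From \S+ +\S+ \S+ +\S+ [^:]+:[^:]+:[^:]+ [^:]+/;

# sample lines (assumed, not copied from a real mbox)
my $one_space = "From example\@example.com Fri Jun 23 02:56:55 2000\n";
my $two_spaces = "From example\@example.com  Fri Jun 23 02:56:55 2000\n";

for my $l ($one_space, $two_spaces) {
	chomp(my $show = $l);
	printf "old=%-5s new=%-5s %s\n",
		($l =~ $old ? 'match' : 'miss'),
		($l =~ $new ? 'match' : 'miss'), $show;
}
```

The old pattern misses the two-space line, the new one matches
both, and neither matches a line without the HH:MM:SS part.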
scripts/import_vger_from_mbox | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/scripts/import_vger_from_mbox b/scripts/import_vger_from_mbox
index 44055ff..9b3afc8 100644
--- a/scripts/import_vger_from_mbox
+++ b/scripts/import_vger_from_mbox
@@ -28,7 +28,7 @@ sub do_add ($$) {
}
# asctime: From example@example.com Fri Jun 23 02:56:55 2000
-my $from_strict = qr/^From \S+ \S+ \S+ +\S+ [^:]+:[^:]+:[^:]+ [^:]+/;
+my $from_strict = qr/^From \S+ +\S+ \S+ +\S+ [^:]+:[^:]+:[^:]+ [^:]+/;
my $prev = undef;
while (defined(my $l = <STDIN>)) {
if ($l =~ /$from_strict/o) {
--
EW
* [WIP 04/17] import: stop writing legacy ssoma.index by default
2018-02-15 11:08 ` [WIP 0/17] initial v2 work based on one-file tree Eric Wong (Contractor, The Linux Foundation)
` (2 preceding siblings ...)
2018-02-15 11:08 ` [WIP 03/17] scripts/import_vger_from_mbox: relax From_ line match slightly Eric Wong (Contractor, The Linux Foundation)
@ 2018-02-15 11:08 ` Eric Wong (Contractor, The Linux Foundation)
2018-02-15 11:08 ` [WIP 05/17] import: begin supporting this without ssoma.lock Eric Wong (Contractor, The Linux Foundation)
` (12 subsequent siblings)
16 siblings, 0 replies; 21+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-02-15 11:08 UTC (permalink / raw)
To: meta
Machines which have never seen ssoma don't need the
index, so stop creating it.
---
lib/PublicInbox/Import.pm | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/lib/PublicInbox/Import.pm b/lib/PublicInbox/Import.pm
index 8eec17e..299329b 100644
--- a/lib/PublicInbox/Import.pm
+++ b/lib/PublicInbox/Import.pm
@@ -229,10 +229,9 @@ sub done {
# for compatibility with existing ssoma installations
# we can probably remove this entirely by 2020
my $git_dir = $self->{git}->{git_dir};
- # XXX: change the following scope to: if (-e $index) # in 2018 or so..
my @cmd = ('git', "--git-dir=$git_dir");
- if ($nchg && !$ENV{FAST}) {
- my $index = "$git_dir/ssoma.index";
+ my $index = "$git_dir/ssoma.index";
+ if ($nchg && -e $index && !$ENV{FAST}) {
my $env = { GIT_INDEX_FILE => $index };
run_die([@cmd, qw(read-tree -m -v -i), $self->{ref}], $env);
}
--
EW
* [WIP 05/17] import: begin supporting this without ssoma.lock
2018-02-15 11:08 ` [WIP 0/17] initial v2 work based on one-file tree Eric Wong (Contractor, The Linux Foundation)
` (3 preceding siblings ...)
2018-02-15 11:08 ` [WIP 04/17] import: stop writing legacy ssoma.index by default Eric Wong (Contractor, The Linux Foundation)
@ 2018-02-15 11:08 ` Eric Wong (Contractor, The Linux Foundation)
2018-02-15 11:08 ` [WIP 06/17] import: initial handling for v2 Eric Wong (Contractor, The Linux Foundation)
` (11 subsequent siblings)
16 siblings, 0 replies; 21+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-02-15 11:08 UTC (permalink / raw)
To: meta
We'll reuse this class in v2, but won't be utilizing
per-git-repository ssoma.lock files.
Meanwhile, stop treating ::Inbox objects as an afterthought
and allow importing name and email into them.
---
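The optional-lock pattern can be sketched outside of
PublicInbox::Import using only core modules.  maybe_lock and
maybe_unlock are hypothetical helpers written for this note; they
are not part of the patch, and the flag argument stands in for
$self->{ssoma_lock}.

```perl
#!/usr/bin/perl
# Sketch only: take ssoma.lock when asked to, otherwise do nothing.
use strict;
use warnings;
use Fcntl qw(:flock :DEFAULT);
use File::Temp qw(tempdir);

# hypothetical helper: returns a locked filehandle, or undef
sub maybe_lock {
	my ($dir, $want_lock) = @_;
	return undef unless $want_lock;	# v2-style caller: no lock
	my $lockpath = "$dir/ssoma.lock";
	sysopen(my $fh, $lockpath, O_WRONLY|O_CREAT) or
		die "failed to open lock $lockpath: $!";
	# wait for other processes to be done
	flock($fh, LOCK_EX) or die "lock failed: $!\n";
	$fh;
}

# hypothetical helper: no-op when nothing was locked
sub maybe_unlock {
	my ($fh) = @_;
	return unless $fh;
	flock($fh, LOCK_UN) or die "unlock failed: $!";
	close $fh or die "close lock failed: $!";
}

my $v1_dir = tempdir(CLEANUP => 1);
my $v2_dir = tempdir(CLEANUP => 1);
my $v1 = maybe_lock($v1_dir, 1);
my $v2 = maybe_lock($v2_dir, 0);
my $v1_has_file = -e "$v1_dir/ssoma.lock" ? 1 : 0;
my $v2_has_file = -e "$v2_dir/ssoma.lock" ? 1 : 0;
print "v1 lock file: $v1_has_file, v2 lock file: $v2_has_file\n";
maybe_unlock($v1);
maybe_unlock($v2);
```

The point of the design is that callers which disable the flag
never touch the lock file, while the unlock path stays safe to
call either way.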
lib/PublicInbox/Import.pm | 28 ++++++++++++++++++++--------
1 file changed, 20 insertions(+), 8 deletions(-)
diff --git a/lib/PublicInbox/Import.pm b/lib/PublicInbox/Import.pm
index 299329b..56633a8 100644
--- a/lib/PublicInbox/Import.pm
+++ b/lib/PublicInbox/Import.pm
@@ -13,13 +13,20 @@ use PublicInbox::MID qw(mid_mime mid2path);
use PublicInbox::Address;
sub new {
- my ($class, $git, $name, $email, $inbox) = @_;
+ my ($class, $git, $name, $email, $ibx) = @_;
+ my $ref = 'refs/heads/master';
+ if ($ibx) {
+ $ref = $ibx->{ref_head} || 'refs/heads/master';
+ $name ||= $ibx->{name};
+ $email ||= $ibx->{-primary_address};
+ }
bless {
git => $git,
ident => "$name <$email>",
mark => 1,
- ref => 'refs/heads/master',
- inbox => $inbox,
+ ref => $ref,
+ inbox => $ibx,
+ ssoma_lock => 1, # disable for v2
}, $class
}
@@ -34,12 +41,16 @@ sub gfi_start {
pipe($out_r, $out_w) or die "pipe failed: $!";
my $git = $self->{git};
my $git_dir = $git->{git_dir};
- my $lockpath = "$git_dir/ssoma.lock";
- sysopen(my $lockfh, $lockpath, O_WRONLY|O_CREAT) or
- die "failed to open lock $lockpath: $!";
- # wait for other processes to be done
- flock($lockfh, LOCK_EX) or die "lock failed: $!\n";
+ my $lockfh;
+ if ($self->{ssoma_lock}) {
+ my $lockpath = "$git_dir/ssoma.lock";
+ sysopen($lockfh, $lockpath, O_WRONLY|O_CREAT) or
+ die "failed to open lock $lockpath: $!";
+ # wait for other processes to be done
+ flock($lockfh, LOCK_EX) or die "lock failed: $!\n";
+ }
+
local $/ = "\n";
chomp($self->{tip} = $git->qx(qw(rev-parse --revs-only), $self->{ref}));
@@ -247,6 +258,7 @@ sub done {
eval { run_die([@cmd, qw(gc --auto)], undef) };
}
+ $self->{ssoma_lock} or return;
my $lockfh = delete $self->{lockfh} or die "BUG: not locked: $!";
flock($lockfh, LOCK_UN) or die "unlock failed: $!";
close $lockfh or die "close lock failed: $!";
--
EW
* [WIP 06/17] import: initial handling for v2
2018-02-15 11:08 ` [WIP 0/17] initial v2 work based on one-file tree Eric Wong (Contractor, The Linux Foundation)
` (4 preceding siblings ...)
2018-02-15 11:08 ` [WIP 05/17] import: begin supporting this without ssoma.lock Eric Wong (Contractor, The Linux Foundation)
@ 2018-02-15 11:08 ` Eric Wong (Contractor, The Linux Foundation)
2018-02-15 11:08 ` [WIP 07/17] t/import: test for last_object_id insertion Eric Wong (Contractor, The Linux Foundation)
` (10 subsequent siblings)
16 siblings, 0 replies; 21+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-02-15 11:08 UTC (permalink / raw)
To: meta
Call order will need to change a bit since this is going to be
tied to Xapian.
---
MANIFEST | 1 +
lib/PublicInbox/ContentId.pm | 30 ++++++++++++++++++
lib/PublicInbox/Import.pm | 74 +++++++++++++++++++++++++++++++++++---------
3 files changed, 91 insertions(+), 14 deletions(-)
create mode 100644 lib/PublicInbox/ContentId.pm
diff --git a/MANIFEST b/MANIFEST
index 5074d8d..85e8503 100644
--- a/MANIFEST
+++ b/MANIFEST
@@ -46,6 +46,7 @@ examples/varnish-4.vcl
lib/PublicInbox/Address.pm
lib/PublicInbox/AltId.pm
lib/PublicInbox/Config.pm
+lib/PublicInbox/ContentId.pm
lib/PublicInbox/Daemon.pm
lib/PublicInbox/Emergency.pm
lib/PublicInbox/EvCleanup.pm
diff --git a/lib/PublicInbox/ContentId.pm b/lib/PublicInbox/ContentId.pm
new file mode 100644
index 0000000..65d5a76
--- /dev/null
+++ b/lib/PublicInbox/ContentId.pm
@@ -0,0 +1,30 @@
+# Copyright (C) 2018 all contributors <meta@public-inbox.org>
+# License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
+
+package PublicInbox::ContentId;
+use strict;
+use warnings;
+use base qw/Exporter/;
+our @EXPORT_OK = qw/content_id/;
+
+# not sure if less-widely supported hash families are worth bothering with
+use Digest::SHA;
+
+# Content-* headers are often no-ops, so maybe we don't need them
+my @ID_HEADERS = qw(Subject From Date Message-ID References To Cc In-Reply-To);
+
+sub content_id ($;$) {
+ my ($mime, $alg) = @_;
+ $alg ||= 256;
+ my $dig = Digest::SHA->new($alg);
+ my $hdr = $mime->header_obj;
+
+ foreach my $h (@ID_HEADERS) {
+ my @v = $hdr->header_raw($h);
+ $dig->add($_) foreach @v;
+ }
+ $dig->add($mime->body_raw);
+ 'SHA-' . $dig->algorithm . ':' . $dig->hexdigest;
+}
+
+1;
diff --git a/lib/PublicInbox/Import.pm b/lib/PublicInbox/Import.pm
index 56633a8..b8e9dd0 100644
--- a/lib/PublicInbox/Import.pm
+++ b/lib/PublicInbox/Import.pm
@@ -11,6 +11,7 @@ use Fcntl qw(:flock :DEFAULT);
use PublicInbox::Spawn qw(spawn);
use PublicInbox::MID qw(mid_mime mid2path);
use PublicInbox::Address;
+use PublicInbox::ContentId qw(content_id);
sub new {
my ($class, $git, $name, $email, $ibx) = @_;
@@ -26,6 +27,7 @@ sub new {
mark => 1,
ref => $ref,
inbox => $ibx,
+ path_type => '2/38', # or 'v2'
ssoma_lock => 1, # disable for v2
}, $class
}
@@ -88,6 +90,7 @@ sub norm_body ($) {
$b
}
+# only used for v1 (ssoma) inboxes
sub _check_path ($$$$) {
my ($r, $w, $tip, $path) = @_;
return if $tip eq '';
@@ -97,17 +100,9 @@ sub _check_path ($$$$) {
$info =~ /\Amissing / ? undef : $info;
}
-# returns undef on non-existent
-# ('MISMATCH', msg) on mismatch
-# (:MARK, msg) on success
-sub remove {
- my ($self, $mime, $msg) = @_; # mime = Email::MIME
-
- my $mid = mid_mime($mime);
- my $path = mid2path($mid);
+sub check_remove_v1 {
+ my ($r, $w, $tip, $path, $mime) = @_;
- my ($r, $w) = $self->gfi_start;
- my $tip = $self->{tip};
my $info = _check_path($r, $w, $tip, $path) or return ('MISSING',undef);
$info =~ m!\A100644 blob ([a-f0-9]{40})\t!s or die "not blob: $info";
my $blob = $1;
@@ -140,6 +135,34 @@ sub remove {
if ($cur_s ne $cur_m || norm_body($cur) ne norm_body($mime)) {
return ('MISMATCH', $cur);
}
+ (undef, $cur);
+}
+
+# returns undef on non-existent
+# ('MISMATCH', msg) on mismatch
+# (:MARK, msg) on success
+#
+# For v2 inboxes, the content_id is returned instead of the msg
+# v2 callers should check with Xapian before calling this as
+# it is not idempotent.
+sub remove {
+ my ($self, $mime, $msg) = @_; # mime = Email::MIME
+
+ my $path_type = $self->{path_type};
+ my ($path, $err, $cur, $blob);
+
+ my ($r, $w) = $self->gfi_start;
+ my $tip = $self->{tip};
+ if ($path_type eq '2/38') {
+ $path = mid2path(mid_mime($mime));
+ ($err, $cur) = check_remove_v1($r, $w, $tip, $path, $mime);
+ return ($err, $cur) if $err;
+ } else {
+ $cur = content_id($mime);
+ my $len = length($cur);
+ $blob = $self->{mark}++;
+ print $w "blob\nmark :$blob\ndata $len\n$cur\n" or wfail;
+ }
my $ref = $self->{ref};
my $commit = $self->{mark}++;
@@ -156,7 +179,11 @@ sub remove {
"committer $ident $now\n",
"data $len\n$msg\n\n",
'from ', ($parent ? $parent : $tip), "\n" or wfail;
- print $w "D $path\n\n" or wfail;
+ if (defined $path) {
+ print $w "D $path\n\n" or wfail;
+ } else {
+ print $w "M 100644 :$blob d\n\n" or wfail;
+ }
$self->{nchg}++;
(($self->{tip} = ":$commit"), $cur);
}
@@ -177,15 +204,25 @@ sub add {
my $date = $mime->header('Date');
my $subject = $mime->header('Subject');
$subject = '(no subject)' unless defined $subject;
- my $mid = mid_mime($mime);
- my $path = mid2path($mid);
+ my $path_type = $self->{path_type};
+
+ my $path;
+ if ($path_type eq '2/38') {
+ $path = mid2path(mid_mime($mime));
+ } else { # v2 layout, one file:
+ $path = 'm';
+ }
my ($r, $w) = $self->gfi_start;
my $tip = $self->{tip};
- _check_path($r, $w, $tip, $path) and return;
+ if ($path_type eq '2/38') {
+ _check_path($r, $w, $tip, $path) and return;
+ }
# kill potentially confusing/misleading headers
$mime->header_set($_) for qw(bytes lines content-length status);
+
+ # spam check:
if ($check_cb) {
$mime = $check_cb->($mime) or return;
}
@@ -194,6 +231,15 @@ sub add {
my $blob = $self->{mark}++;
print $w "blob\nmark :$blob\ndata ", length($mime), "\n" or wfail;
print $w $mime, "\n" or wfail;
+
+ # v2: we need this for Xapian
+ if ($self->{want_object_id}) {
+ print $w "get-mark :$blob\n" or wfail;
+ defined(my $object_id = <$r>) or
+ die "get-mark failed, need git 2.6.0+\n";
+ chomp($self->{last_object_id} = $object_id);
+ }
+
my $ref = $self->{ref};
my $commit = $self->{mark}++;
my $parent = $tip =~ /\A:/ ? $tip : undef;
--
EW
* [WIP 07/17] t/import: test for last_object_id insertion
2018-02-15 11:08 ` [WIP 0/17] initial v2 work based on one-file tree Eric Wong (Contractor, The Linux Foundation)
` (5 preceding siblings ...)
2018-02-15 11:08 ` [WIP 06/17] import: initial handling for v2 Eric Wong (Contractor, The Linux Foundation)
@ 2018-02-15 11:08 ` Eric Wong (Contractor, The Linux Foundation)
2018-02-15 11:08 ` [WIP 08/17] content_id: add test case Eric Wong (Contractor, The Linux Foundation)
` (9 subsequent siblings)
16 siblings, 0 replies; 21+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-02-15 11:08 UTC (permalink / raw)
To: meta
Check for this before doing the Xapian-based v2 importer.
---
t/import.t | 25 ++++++++++++++++++++++++-
1 file changed, 24 insertions(+), 1 deletion(-)
diff --git a/t/import.t b/t/import.t
index fb6238e..92c82b9 100644
--- a/t/import.t
+++ b/t/import.t
@@ -6,7 +6,10 @@ use Test::More;
use PublicInbox::MIME;
use PublicInbox::Git;
use PublicInbox::Import;
-use File::Temp qw/tempdir/;
+use PublicInbox::Spawn qw(spawn);
+use IO::File;
+use Fcntl qw(:DEFAULT);
+use File::Temp qw/tempdir tempfile/;
my $dir = tempdir('pi-import-XXXXXX', TMPDIR => 1, CLEANUP => 1);
is(system(qw(git init -q --bare), $dir), 0, 'git init successful');
@@ -20,10 +23,30 @@ my $mime = PublicInbox::MIME->create(
'Content-Type' => 'text/plain',
Subject => 'this is a subject',
'Message-ID' => '<a@example.com>',
+ Date => 'Fri, 02 Oct 1993 00:00:00 +0000',
],
body => "hello world\n",
);
+
+$im->{want_object_id} = 1 if 'v2';
like($im->add($mime), qr/\A:\d+\z/, 'added one message');
+
+if ('v2') {
+ like($im->{last_object_id}, qr/\A[a-f0-9]{40}\z/, 'got last_object_id');
+ my @cmd = ('git', "--git-dir=$git->{git_dir}", qw(hash-object --stdin));
+ my $in = tempfile();
+ print $in $mime->as_string or die "write failed: $!";
+ $in->flush or die "flush failed: $!";
+ $in->seek(0, SEEK_SET);
+ my $out = tempfile();
+ my $pid = spawn(\@cmd, {}, { 0 => fileno($in), 1 => fileno($out)});
+ is(waitpid($pid, 0), $pid, 'waitpid succeeds on hash-object');
+ is($?, 0, 'hash-object');
+ $out->seek(0, SEEK_SET);
+ chomp(my $hashed_obj = <$out>);
+ is($hashed_obj, $im->{last_object_id}, "last_object_id matches exp");
+}
+
$im->done;
my @revs = $git->qx(qw(rev-list HEAD));
is(scalar @revs, 1, 'one revision created');
--
EW
* [WIP 08/17] content_id: add test case
2018-02-15 11:08 ` [WIP 0/17] initial v2 work based on one-file tree Eric Wong (Contractor, The Linux Foundation)
` (6 preceding siblings ...)
2018-02-15 11:08 ` [WIP 07/17] t/import: test for last_object_id insertion Eric Wong (Contractor, The Linux Foundation)
@ 2018-02-15 11:08 ` Eric Wong (Contractor, The Linux Foundation)
2018-02-15 11:08 ` [WIP 09/17] searchmsg: add mid_mime import for _extract_mid Eric Wong (Contractor, The Linux Foundation)
` (8 subsequent siblings)
16 siblings, 0 replies; 21+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-02-15 11:08 UTC (permalink / raw)
To: meta
---
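The test below only checks the format of the returned string.  For
reference, that format can be reproduced with Digest::SHA alone.
cid_sketch is a hypothetical stand-in which takes plain strings
instead of an Email::MIME object, so it does not give the same
digest as content_id for a real message.

```perl
#!/usr/bin/perl
# Sketch only: build a "SHA-256:<hex>" label from a list of strings.
use strict;
use warnings;
use Digest::SHA;

# hypothetical stand-in for content_id(), operating on plain strings
sub cid_sketch {
	my ($alg, @parts) = @_;
	my $dig = Digest::SHA->new($alg);
	$dig->add($_) foreach @parts;
	'SHA-' . $dig->algorithm . ':' . $dig->hexdigest;
}

my @parts = ('this is a subject', 'a@example.com', "hello world\n");
my $cid = cid_sketch(256, @parts);
print "$cid\n";
print "format ok\n" if $cid =~ /\ASHA-256:[a-f0-9]{64}\z/;

# add() concatenates, so part boundaries do not affect the digest
my $joined = cid_sketch(256, join('', @parts));
print "boundaries ignored\n" if $joined eq $cid;
```

Since successive add() calls are equivalent to hashing the
concatenated input, two messages whose header values differ only
in where one value ends and the next begins would get the same
digest in this sketch.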
MANIFEST | 1 +
t/content_id.t | 24 ++++++++++++++++++++++++
2 files changed, 25 insertions(+)
create mode 100644 t/content_id.t
diff --git a/MANIFEST b/MANIFEST
index 85e8503..1df27f2 100644
--- a/MANIFEST
+++ b/MANIFEST
@@ -128,6 +128,7 @@ t/check-www-inbox.perl
t/common.perl
t/config.t
t/config_limiter.t
+t/content_id.t
t/emergency.t
t/fail-bin/spamc
t/feed.t
diff --git a/t/content_id.t b/t/content_id.t
new file mode 100644
index 0000000..c0ae6ec
--- /dev/null
+++ b/t/content_id.t
@@ -0,0 +1,24 @@
+# Copyright (C) 2018 all contributors <meta@public-inbox.org>
+# License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
+use strict;
+use warnings;
+use Test::More;
+use PublicInbox::ContentId qw(content_id);
+use Email::MIME;
+
+my $mime = Email::MIME->create(
+ header => [
+ From => 'a@example.com',
+ To => 'b@example.com',
+ 'Content-Type' => 'text/plain',
+ Subject => 'this is a subject',
+ 'Message-ID' => '<a@example.com>',
+ Date => 'Fri, 02 Oct 1993 00:00:00 +0000',
+ ],
+ body => "hello world\n",
+);
+
+my $res = content_id($mime);
+like($res, qr/\ASHA-256:[a-f0-9]{64}\z/, 'cid in format expected');
+
+done_testing();
--
EW
* [WIP 09/17] searchmsg: add mid_mime import for _extract_mid
2018-02-15 11:08 ` [WIP 0/17] initial v2 work based on one-file tree Eric Wong (Contractor, The Linux Foundation)
` (7 preceding siblings ...)
2018-02-15 11:08 ` [WIP 08/17] content_id: add test case Eric Wong (Contractor, The Linux Foundation)
@ 2018-02-15 11:08 ` Eric Wong (Contractor, The Linux Foundation)
2018-02-15 11:08 ` [WIP 10/17] scripts/import_vger_from_mbox: support --dry-run option Eric Wong (Contractor, The Linux Foundation)
` (7 subsequent siblings)
16 siblings, 0 replies; 21+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-02-15 11:08 UTC (permalink / raw)
To: meta
Oops, I guess this code was never called and may not be
needed. But for now, import it so it can run properly.
---
lib/PublicInbox/SearchMsg.pm | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/lib/PublicInbox/SearchMsg.pm b/lib/PublicInbox/SearchMsg.pm
index afba8b1..70aa706 100644
--- a/lib/PublicInbox/SearchMsg.pm
+++ b/lib/PublicInbox/SearchMsg.pm
@@ -8,7 +8,7 @@ use strict;
use warnings;
use Search::Xapian;
use Date::Parse qw/str2time/;
-use PublicInbox::MID qw/mid_clean/;
+use PublicInbox::MID qw/mid_clean mid_mime/;
use PublicInbox::Address;
sub new {
--
EW
* [WIP 10/17] scripts/import_vger_from_mbox: support --dry-run option
2018-02-15 11:08 ` [WIP 0/17] initial v2 work based on one-file tree Eric Wong (Contractor, The Linux Foundation)
` (8 preceding siblings ...)
2018-02-15 11:08 ` [WIP 09/17] searchmsg: add mid_mime import for _extract_mid Eric Wong (Contractor, The Linux Foundation)
@ 2018-02-15 11:08 ` Eric Wong (Contractor, The Linux Foundation)
2018-02-15 11:08 ` [WIP 11/17] import: APIs to support v2 use Eric Wong (Contractor, The Linux Foundation)
` (6 subsequent siblings)
16 siblings, 0 replies; 21+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-02-15 11:08 UTC (permalink / raw)
To: meta
This can be useful for getting a performance baseline
of just Email::MIME and Date: header parsing.  We'll need
to do some Date: header parsing for LKML since there are
some wonky date formats which cause the git RFC822 parser
to choke.
---
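The shape of the dry-run switch can be sketched with core modules
only.  parse_args is a hypothetical helper, the hash reference
stands in for a PublicInbox::Import object, and the two inline
"messages" are placeholders; none of these are in the patch.

```perl
#!/usr/bin/perl
# Sketch only: -n/--dry-run parses everything but writes nothing.
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);
Getopt::Long::Configure(qw(gnu_getopt no_ignore_case auto_abbrev));

# hypothetical helper: returns the dry-run flag and leftover args
sub parse_args {
	my (@argv) = @_;
	my $dry_run;
	GetOptionsFromArray(\@argv, 'n|dry-run' => \$dry_run) or
		die "usage: $0 NAME EMAIL <MBOX\n";
	($dry_run, @argv);
}

my ($dry_run, $name, $email) = parse_args(qw(-n git git@vger.kernel.org));

# stand-in for PublicInbox::Import->new; undef means "parse only"
my $im = $dry_run ? undef : { added => 0 };

my $parsed = 0;
foreach my $raw ("Subject: a\n\nhello\n", "Subject: b\n\nworld\n") {
	$parsed++;		# parsing and Date: checks would go here
	next unless $im;	# dry run: skip the write
	$im->{added}++;
}
my $written = $im ? $im->{added} : 0;
print "name=$name email=$email parsed=$parsed written=$written\n";
```

With -n given, both placeholder messages are counted as parsed and
none are written; dropping -n from the argument list makes the
written count follow the parsed count.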
scripts/import_vger_from_mbox | 29 ++++++++++++++++++++++++++---
1 file changed, 26 insertions(+), 3 deletions(-)
diff --git a/scripts/import_vger_from_mbox b/scripts/import_vger_from_mbox
index 9b3afc8..3fa5c77 100644
--- a/scripts/import_vger_from_mbox
+++ b/scripts/import_vger_from_mbox
@@ -3,16 +3,21 @@
# License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
use strict;
use warnings;
+use Getopt::Long qw/:config gnu_getopt no_ignore_case auto_abbrev/;
+use Date::Parse qw/str2time/;
use Email::MIME;
$Email::MIME::ContentType::STRICT_PARAMS = 0; # user input is imperfect
use PublicInbox::Git;
use PublicInbox::Import;
my $usage = "usage: $0 NAME EMAIL <MBOX\n";
+my $dry_run;
+my %opts = ( 'n|dry-run' => \$dry_run );
+GetOptions(%opts) or die $usage;
chomp(my $git_dir = `git rev-parse --git-dir`);
my $git = PublicInbox::Git->new($git_dir);
my $name = shift or die $usage; # git
my $email = shift or die $usage; # git@vger.kernel.org
-my $im = PublicInbox::Import->new($git, $name, $email);
+my $im = $dry_run ? undef : PublicInbox::Import->new($git, $name, $email);
binmode STDIN;
my $msg = '';
use PublicInbox::Filter::Vger;
@@ -22,9 +27,27 @@ sub do_add ($$) {
$$msg =~ s/(\r?\n)+\z/$1/s;
$msg = Email::MIME->new($$msg);
$msg = $vger->scrub($msg);
+ my $hdr = $msg->header_obj;
+ my $date = $hdr->header_raw('Date');
+ if ($date) {
+ eval { str2time($date) };
+ if ($@) {
+ warn "bad Date: $date in ",
+ $hdr->header_raw('Message-ID'), ": $@\n";
+ }
+ } else {
+ warn "missing Date: $date in ",
+ $hdr->header_raw('Message-ID'), ": $@\n";
+ my $n = 0;
+ foreach my $r ($hdr->header_raw('Received')) {
+ warn "$n Received: $r\n";
+ }
+ warn(('-' x 72), "\n");
+ }
+ return unless $im;
$im->add($msg) or
warn "duplicate: ",
- $msg->header_obj->header_raw('Message-ID'), "\n";
+ $hdr->header_raw('Message-ID'), "\n";
}
# asctime: From example@example.com Fri Jun 23 02:56:55 2000
@@ -44,4 +67,4 @@ while (defined(my $l = <STDIN>)) {
$msg .= $l;
}
do_add($im, \$msg) if $msg;
-$im->done;
+$im->done if $im;
--
EW
* [WIP 11/17] import: APIs to support v2 use
2018-02-15 11:08 ` [WIP 0/17] initial v2 work based on one-file tree Eric Wong (Contractor, The Linux Foundation)
` (9 preceding siblings ...)
2018-02-15 11:08 ` [WIP 10/17] scripts/import_vger_from_mbox: support --dry-run option Eric Wong (Contractor, The Linux Foundation)
@ 2018-02-15 11:08 ` Eric Wong (Contractor, The Linux Foundation)
2018-02-15 11:08 ` [WIP 12/17] search: free up 'Q' prefix for a real unique identifier Eric Wong (Contractor, The Linux Foundation)
` (5 subsequent siblings)
16 siblings, 0 replies; 21+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-02-15 11:08 UTC (permalink / raw)
To: meta
Wrap "get-mark" and "checkpoint" commands for git-fast-import
while documenting/cementing parts of the API.
---
lib/PublicInbox/Import.pm | 28 ++++++++++++++++++++++------
t/import.t | 4 +++-
2 files changed, 25 insertions(+), 7 deletions(-)
diff --git a/lib/PublicInbox/Import.pm b/lib/PublicInbox/Import.pm
index b8e9dd0..811e355 100644
--- a/lib/PublicInbox/Import.pm
+++ b/lib/PublicInbox/Import.pm
@@ -138,9 +138,27 @@ sub check_remove_v1 {
(undef, $cur);
}
+# used for v2 (maybe)
+sub checkpoint {
+ my ($self) = @_;
+ return unless $self->{pid};
+ print { $self->{out} } "checkpoint\n" or wfail;
+ undef;
+}
+
+# used for v2
+sub get_mark {
+ my ($self, $mark) = @_;
+ die "not active\n" unless $self->{pid};
+ my ($r, $w) = $self->gfi_start;
+ print $w "get-mark $mark\n" or wfail;
+ defined(my $oid = <$r>) or die "get-mark failed, need git 2.6.0+\n";
+ $oid;
+}
+
# returns undef on non-existent
-# ('MISMATCH', msg) on mismatch
-# (:MARK, msg) on success
+# ('MISMATCH', Email::MIME) on mismatch
+# (:MARK, Email::MIME) on success
#
# For v2 inboxes, the content_id is returned instead of the msg
# v2 callers should check with Xapian before calling this as
@@ -189,6 +207,7 @@ sub remove {
}
# returns undef on duplicate
+# returns the :MARK of the most recent commit
sub add {
my ($self, $mime, $check_cb) = @_; # mime = Email::MIME
@@ -234,10 +253,7 @@ sub add {
# v2: we need this for Xapian
if ($self->{want_object_id}) {
- print $w "get-mark :$blob\n" or wfail;
- defined(my $object_id = <$r>) or
- die "get-mark failed, need git 2.6.0+\n";
- chomp($self->{last_object_id} = $object_id);
+ chomp($self->{last_object_id} = $self->get_mark(":$blob"));
}
my $ref = $self->{ref};
diff --git a/t/import.t b/t/import.t
index 92c82b9..ca59772 100644
--- a/t/import.t
+++ b/t/import.t
@@ -87,6 +87,8 @@ isnt($msg->header('Subject'), $mime->header('Subject'), 'subject mismatch');
$mime->header_set('Message-Id', '<failcheck@example.com>');
is($im->add($mime, sub { undef }), undef, 'check callback fails');
is($im->remove($mime), undef, 'message not added, so not removed');
-
+is(undef, $im->checkpoint, 'checkpoint works before ->done');
$im->done;
+is(undef, $im->checkpoint, 'checkpoint works after ->done');
+$im->checkpoint;
done_testing();
--
EW
* [WIP 12/17] search: free up 'Q' prefix for a real unique identifier
2018-02-15 11:08 ` [WIP 0/17] initial v2 work based on one-file tree Eric Wong (Contractor, The Linux Foundation)
` (10 preceding siblings ...)
2018-02-15 11:08 ` [WIP 11/17] import: APIs to support v2 use Eric Wong (Contractor, The Linux Foundation)
@ 2018-02-15 11:08 ` Eric Wong (Contractor, The Linux Foundation)
2018-02-22 21:08 ` Eric Wong
2018-02-15 11:08 ` [WIP 13/17] searchidx: fix comment around next_thread_id Eric Wong (Contractor, The Linux Foundation)
` (4 subsequent siblings)
16 siblings, 1 reply; 21+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-02-15 11:08 UTC (permalink / raw)
To: meta
This will allow easier compatibility with v2 code, which will
introduce content_id as the unique identifier.
The old "XMID" becomes "XM" as a free text searchable term.
"Q" becomes "XMID" as a boolean prefix.
There are no user-visible changes in this, but there needs to
be a schema version bump later on...
(more changes planned which can affect v1)
---
lib/PublicInbox/Search.pm | 8 ++++----
lib/PublicInbox/SearchIdx.pm | 8 ++++----
lib/PublicInbox/SearchMsg.pm | 2 +-
3 files changed, 9 insertions(+), 9 deletions(-)
diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm
index 9ab5afe..3ec96ca 100644
--- a/lib/PublicInbox/Search.pm
+++ b/lib/PublicInbox/Search.pm
@@ -56,13 +56,13 @@ my %bool_pfx_internal = (
);
my %bool_pfx_external = (
- mid => 'Q', # uniQue id (Message-ID)
+ mid => 'XMID', # uniQue id (Message-ID)
);
my %prob_prefix = (
# for mairix compatibility
s => 'S',
- m => 'XMID', # 'mid:' (bool) is exact, 'm:' (prob) can do partial
+ m => 'XM', # 'mid:' (bool) is exact, 'm:' (prob) can do partial
f => 'A',
t => 'XTO',
tc => 'XTO XCC',
@@ -85,7 +85,7 @@ my %prob_prefix = (
dfblob => 'XDFPRE XDFPOST',
# default:
- '' => 'XMID S A XNQ XQUOT XFN',
+ '' => 'XM S A XNQ XQUOT XFN',
);
# not documenting m: and mid: for now, the using the URLs works w/o Xapian
@@ -285,7 +285,7 @@ sub lookup_message {
my ($self, $mid) = @_;
$mid = mid_clean($mid);
- my $doc_id = $self->find_unique_doc_id('Q' . $mid);
+ my $doc_id = $self->find_unique_doc_id('XMID' . $mid);
my $smsg;
if (defined $doc_id) {
# raises on error:
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 66faed3..0ee0779 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -276,7 +276,7 @@ sub add_message {
}
$smsg = PublicInbox::SearchMsg->new($mime);
my $doc = $smsg->{doc};
- $doc->add_term('Q' . $mid);
+ $doc->add_term('XMID' . $mid);
my $subj = $smsg->subject;
if ($subj ne '') {
@@ -334,7 +334,7 @@ sub add_message {
});
link_message($self, $smsg, $old_tid);
- $tg->index_text($mid, 1, 'XMID');
+ $tg->index_text($mid, 1, 'XM');
$doc->set_data($smsg->to_doc_data($blob));
if (my $altid = $self->{-altid}) {
@@ -366,7 +366,7 @@ sub remove_message {
$mid = mid_clean($mid);
eval {
- $doc_id = $self->find_unique_doc_id('Q' . $mid);
+ $doc_id = $self->find_unique_doc_id('XMID' . $mid);
if (defined $doc_id) {
$db->delete_document($doc_id);
} else {
@@ -683,7 +683,7 @@ sub create_ghost {
my $tid = $self->next_thread_id;
my $doc = Search::Xapian::Document->new;
- $doc->add_term('Q' . $mid);
+ $doc->add_term('XMID' . $mid);
$doc->add_term('G' . $tid);
$doc->add_term('T' . 'ghost');
diff --git a/lib/PublicInbox/SearchMsg.pm b/lib/PublicInbox/SearchMsg.pm
index 70aa706..25c1abb 100644
--- a/lib/PublicInbox/SearchMsg.pm
+++ b/lib/PublicInbox/SearchMsg.pm
@@ -157,7 +157,7 @@ sub mid ($;$) {
} elsif (my $rv = $self->{mid}) {
$rv;
} else {
- $self->{mid} = _get_term_val($self, 'Q', qr/\AQ/) ||
+ $self->{mid} = _get_term_val($self, 'XMID', qr/\AXMID/) ||
$self->_extract_mid;
}
}
--
EW
* [WIP 13/17] searchidx: fix comment around next_thread_id
2018-02-15 11:08 ` [WIP 0/17] initial v2 work based on one-file tree Eric Wong (Contractor, The Linux Foundation)
` (11 preceding siblings ...)
2018-02-15 11:08 ` [WIP 12/17] search: free up 'Q' prefix for a real unique identifier Eric Wong (Contractor, The Linux Foundation)
@ 2018-02-15 11:08 ` Eric Wong (Contractor, The Linux Foundation)
2018-02-15 11:08 ` [WIP 14/17] address: extract more characters from email addresses Eric Wong (Contractor, The Linux Foundation)
` (3 subsequent siblings)
16 siblings, 0 replies; 21+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-02-15 11:08 UTC (permalink / raw)
To: meta
I decided not to copy the notmuch implementation regarding
serialization of integers to Xapian metadata.
---
lib/PublicInbox/SearchIdx.pm | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 0ee0779..fa5057f 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -394,7 +394,7 @@ sub term_generator { # write-only
}
# increments last_thread_id counter
-# returns a 64-bit integer represented as a hex string
+# returns a 64-bit integer represented as a decimal string
sub next_thread_id {
my ($self) = @_;
my $db = $self->{xdb};
--
EW
* [WIP 14/17] address: extract more characters from email addresses
2018-02-15 11:08 ` [WIP 0/17] initial v2 work based on one-file tree Eric Wong (Contractor, The Linux Foundation)
` (12 preceding siblings ...)
2018-02-15 11:08 ` [WIP 13/17] searchidx: fix comment around next_thread_id Eric Wong (Contractor, The Linux Foundation)
@ 2018-02-15 11:08 ` Eric Wong (Contractor, The Linux Foundation)
2018-02-15 11:08 ` [WIP 15/17] import: pass "raw" dates to git-fast-import(1) Eric Wong (Contractor, The Linux Foundation)
` (2 subsequent siblings)
16 siblings, 0 replies; 21+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-02-15 11:08 UTC (permalink / raw)
To: meta
LKML archives contain many unusual characters in email addresses
which we did not support before. Furthermore, allow spaces
before the '>' in the From: line, as at least one non-spam
poster used them.
---
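For illustration only (not part of the patch): a minimal sketch that
copies the new regex from the Address.pm hunk into a standalone script
and runs it on the two inputs exercised in t/address.t. The third
input, with a space before the '>', is a made-up sample, and the
script itself is an assumption for demonstration purposes.

```perl
use strict;
use warnings;

# standalone copy of the regex this patch adds to PublicInbox::Address::emails
sub emails {
	($_[0] =~ /([\w\.\+=\?"\(\)\-!#\$%&'\*\/\^\`\|\{\}~]+\@[\w\.\-\(\)]+)
		(?:\s[^>]*)?>?\s*(?:\(.*?\))?(?:,\s*|\z)/gx)
}

# sample From: values; the last one is a hypothetical case
# with a space before the closing '>'
my @samples = (
	'User <e@example.com>, e@example.org',
	'<user@example.com (Comment)>',
	'User <e@example.com >',
);
foreach my $from (@samples) {
	# print the extracted addresses for each header value, one line each
	print join(' ', emails($from)), "\n";
}
```

Each sample should yield only the bare addresses, with the comment and
the stray space left out of the capture.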
lib/PublicInbox/Address.pm | 3 ++-
t/address.t | 5 +++--
2 files changed, 5 insertions(+), 3 deletions(-)
diff --git a/lib/PublicInbox/Address.pm b/lib/PublicInbox/Address.pm
index f334ade..548f417 100644
--- a/lib/PublicInbox/Address.pm
+++ b/lib/PublicInbox/Address.pm
@@ -8,7 +8,8 @@ use warnings;
# just enough to make thing sanely displayable and pass to git
sub emails {
- ($_[0] =~ /([\w\.\+=\-]+\@[\w\.\-]+)>?\s*(?:\(.*?\))?(?:,\s*|\z)/g)
+ ($_[0] =~ /([\w\.\+=\?"\(\)\-!#\$%&'\*\/\^\`\|\{\}~]+\@[\w\.\-\(\)]+)
+ (?:\s[^>]*)?>?\s*(?:\(.*?\))?(?:,\s*|\z)/gx)
}
sub names {
diff --git a/t/address.t b/t/address.t
index e35e4f8..eced5c4 100644
--- a/t/address.t
+++ b/t/address.t
@@ -9,8 +9,9 @@ is_deeply([qw(e@example.com e@example.org)],
[PublicInbox::Address::emails('User <e@example.com>, e@example.org')],
'address extraction works as expected');
-is_deeply([PublicInbox::Address::emails('"ex@example.com" <ex@example.com>')],
- [qw(ex@example.com)]);
+is_deeply(['user@example.com'],
+ [PublicInbox::Address::emails('<user@example.com (Comment)>')],
+ 'comment after domain accepted before >');
my @names = PublicInbox::Address::names(
'User <e@e>, e@e, "John A. Doe" <j@d>, <x@x>');
--
EW
* [WIP 15/17] import: pass "raw" dates to git-fast-import(1)
2018-02-15 11:08 ` [WIP 0/17] initial v2 work based on one-file tree Eric Wong (Contractor, The Linux Foundation)
` (13 preceding siblings ...)
2018-02-15 11:08 ` [WIP 14/17] address: extract more characters from email addresses Eric Wong (Contractor, The Linux Foundation)
@ 2018-02-15 11:08 ` Eric Wong (Contractor, The Linux Foundation)
2018-02-15 11:08 ` [WIP 16/17] scripts/import_vger_from_mbox: use v2 layout for import Eric Wong (Contractor, The Linux Foundation)
2018-02-15 11:08 ` [WIP 17/17] import: quiet down warnings from bogus From: lines Eric Wong (Contractor, The Linux Foundation)
16 siblings, 0 replies; 21+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-02-15 11:08 UTC (permalink / raw)
To: meta
For LKML, it appears we need an even more liberal parser than
the RFC2822 date parser in git. I have not validated that
Date::Parse parses dates correctly, but this at least prevents
git-fast-import(1) from choking.
---
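For illustration only (not part of the patch): the "raw" format of
git-fast-import(1) is the number of seconds since the Unix epoch
followed by a UTC offset such as "-0800". The sketch below assembles
such a string in the same way as the tail of parse_date(). The
raw_date() helper is hypothetical, and its strict four-digit zone
check is an assumption which the patch itself does not make.

```perl
use strict;
use warnings;

# hypothetical helper, loosely following the tail of parse_date():
# turn an epoch and a zone offset into "<seconds> <+/-HHMM>"
sub raw_date {
	my ($ts, $zone) = @_;
	# assumption: fall back to UTC when the zone is missing or malformed
	$zone = '+0000' unless defined($zone) && $zone =~ /\A[\+\-]\d{4}\z/;
	# same sanity bound as the patch: reject offsets at or beyond 14 hours
	$zone = '+0000' if $zone >= 1400 || $zone <= -1400;
	$ts = 0 if $ts < 0; # git uses unsigned times
	"$ts $zone";
}

print raw_date(1518692880, '-0800'), "\n"; # sane input passes through
print raw_date(-5, '+9999'), "\n";         # both fields are clamped
print raw_date(1, undef), "\n";            # missing zone becomes +0000
```

No date parsing happens here on purpose; in the patch that part is
delegated to str2time() from Date::Parse.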
lib/PublicInbox/Import.pm | 65 +++++++++++++++++++++++++++++++++++------------
1 file changed, 49 insertions(+), 16 deletions(-)
diff --git a/lib/PublicInbox/Import.pm b/lib/PublicInbox/Import.pm
index 811e355..845fbb6 100644
--- a/lib/PublicInbox/Import.pm
+++ b/lib/PublicInbox/Import.pm
@@ -12,6 +12,8 @@ use PublicInbox::Spawn qw(spawn);
use PublicInbox::MID qw(mid_mime mid2path);
use PublicInbox::Address;
use PublicInbox::ContentId qw(content_id);
+use Date::Parse qw(str2time);
+use Time::Zone qw(tz_offset);
sub new {
my ($class, $git, $name, $email, $ibx) = @_;
@@ -57,7 +59,7 @@ sub gfi_start {
chomp($self->{tip} = $git->qx(qw(rev-parse --revs-only), $self->{ref}));
my @cmd = ('git', "--git-dir=$git_dir", qw(fast-import
- --quiet --done --date-format=rfc2822));
+ --quiet --done --date-format=raw));
my $rdr = { 0 => fileno($out_r), 1 => fileno($in_w) };
my $pid = spawn(\@cmd, undef, $rdr);
die "spawn fast-import failed: $!" unless defined $pid;
@@ -74,14 +76,7 @@ sub gfi_start {
sub wfail () { die "write to fast-import failed: $!" }
-sub now2822 () {
- my @t = gmtime(time);
- my $day = qw(Sun Mon Tue Wed Thu Fri Sat)[$t[6]];
- my $mon = qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec)[$t[4]];
-
- sprintf('%s, %2d %s %d %02d:%02d:%02d +0000',
- $day, $t[3], $mon, $t[5] + 1900, $t[2], $t[1], $t[0]);
-}
+sub now_raw () { time . ' +0000' }
sub norm_body ($) {
my ($mime) = @_;
@@ -189,7 +184,7 @@ sub remove {
print $w "reset $ref\n" or wfail;
}
my $ident = $self->{ident};
- my $now = now2822();
+ my $now = now_raw();
$msg ||= 'rm';
my $len = length($msg) + 1;
print $w "commit $ref\nmark :$commit\n",
@@ -206,6 +201,43 @@ sub remove {
(($self->{tip} = ":$commit"), $cur);
}
+sub parse_date ($) {
+ my ($mime) = @_;
+ my $hdr = $mime->header_obj;
+ my $date = $hdr->header_raw('Date');
+ my ($ts, $zone);
+ my $mid = $hdr->header_raw('Message-ID');
+ if ($date) {
+ $ts = eval { str2time($date) };
+ if ($@) {
+ warn "bad Date: $date in $mid: $@\n";
+ } elsif ($date =~ /\s+([\+\-]\d+)\s*\z/) {
+ $zone = $1;
+ }
+ }
+ unless ($ts) {
+ my @recvd = $hdr->header_raw('Received');
+ foreach my $r (@recvd) {
+ $zone = undef;
+ $r =~ /\s*(\d+\s+[[:alpha:]]+\s+\d{2,4}\s+
+ \d+\D\d+(?:\D\d+)\s+([\+\-]\d+))/osx or next;
+ $zone = $2;
+ $ts = eval { str2time($1) } and last;
+ warn "no date in Received: $r\n";
+ }
+ }
+ $zone ||= '+0000';
+ # "-1200" is the westernmost zone offset,
+ # but git fast-import is liberal so we use "-1400"
+ if ($zone >= 1400 || $zone <= -1400) {
+ warn "bogus TZ offset: $zone, ignoring and assuming +0000\n";
+ $zone = '+0000';
+ }
+ $ts ||= time;
+ $ts = 0 if $ts < 0; # git uses unsigned times
+ "$ts $zone";
+}
+
# returns undef on duplicate
# returns the :MARK of the most recent commit
sub add {
@@ -220,7 +252,7 @@ sub add {
# <CAD0k6qSUYANxbjjbE4jTW4EeVwOYgBD=bXkSu=akiYC_CB7Ffw@mail.gmail.com>
$name =~ tr/<>//d;
- my $date = $mime->header('Date');
+ my $date_raw = parse_date($mime);
my $subject = $mime->header('Subject');
$subject = '(no subject)' unless defined $subject;
my $path_type = $self->{path_type};
@@ -246,10 +278,11 @@ sub add {
$mime = $check_cb->($mime) or return;
}
- $mime = $mime->as_string;
my $blob = $self->{mark}++;
- print $w "blob\nmark :$blob\ndata ", length($mime), "\n" or wfail;
- print $w $mime, "\n" or wfail;
+ my $str = $mime->as_string;
+ print $w "blob\nmark :$blob\ndata ", length($str), "\n" or wfail;
+ print $w $str, "\n" or wfail;
+ $str = undef;
# v2: we need this for Xapian
if ($self->{want_object_id}) {
@@ -269,8 +302,8 @@ sub add {
utf8::encode($subject);
# quiet down wide character warnings:
print $w "commit $ref\nmark :$commit\n",
- "author $name <$email> $date\n",
- "committer $self->{ident} ", now2822(), "\n" or wfail;
+ "author $name <$email> $date_raw\n",
+ "committer $self->{ident} ", now_raw(), "\n" or wfail;
print $w "data ", (length($subject) + 1), "\n",
$subject, "\n\n" or wfail;
if ($tip ne '') {
--
EW
* [WIP 16/17] scripts/import_vger_from_mbox: use v2 layout for import
2018-02-15 11:08 ` [WIP 0/17] initial v2 work based on one-file tree Eric Wong (Contractor, The Linux Foundation)
` (14 preceding siblings ...)
2018-02-15 11:08 ` [WIP 15/17] import: pass "raw" dates to git-fast-import(1) Eric Wong (Contractor, The Linux Foundation)
@ 2018-02-15 11:08 ` Eric Wong (Contractor, The Linux Foundation)
2018-02-15 11:08 ` [WIP 17/17] import: quiet down warnings from bogus From: lines Eric Wong (Contractor, The Linux Foundation)
16 siblings, 0 replies; 21+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-02-15 11:08 UTC (permalink / raw)
To: meta
Big lists are orders of magnitude more efficient with v2.
---
scripts/import_vger_from_mbox | 24 ++++++------------------
1 file changed, 6 insertions(+), 18 deletions(-)
diff --git a/scripts/import_vger_from_mbox b/scripts/import_vger_from_mbox
index 3fa5c77..6ea2ca5 100644
--- a/scripts/import_vger_from_mbox
+++ b/scripts/import_vger_from_mbox
@@ -22,32 +22,20 @@ binmode STDIN;
my $msg = '';
use PublicInbox::Filter::Vger;
my $vger = PublicInbox::Filter::Vger->new;
+if ($im) {
+ $im->{ssoma_lock} = 0;
+ $im->{path_type} = 'v2';
+}
+
sub do_add ($$) {
my ($im, $msg) = @_;
$$msg =~ s/(\r?\n)+\z/$1/s;
$msg = Email::MIME->new($$msg);
$msg = $vger->scrub($msg);
- my $hdr = $msg->header_obj;
- my $date = $hdr->header_raw('Date');
- if ($date) {
- eval { str2time($date) };
- if ($@) {
- warn "bad Date: $date in ",
- $hdr->header_raw('Message-ID'), ": $@\n";
- }
- } else {
- warn "missing Date: $date in ",
- $hdr->header_raw('Message-ID'), ": $@\n";
- my $n = 0;
- foreach my $r ($hdr->header_raw('Received')) {
- warn "$n Received: $r\n";
- }
- warn(('-' x 72), "\n");
- }
return unless $im;
$im->add($msg) or
warn "duplicate: ",
- $hdr->header_raw('Message-ID'), "\n";
+ $msg->header_obj->header_raw('Message-ID'), "\n";
}
# asctime: From example@example.com Fri Jun 23 02:56:55 2000
--
EW
* [WIP 17/17] import: quiet down warnings from bogus From: lines
2018-02-15 11:08 ` [WIP 0/17] initial v2 work based on one-file tree Eric Wong (Contractor, The Linux Foundation)
` (15 preceding siblings ...)
2018-02-15 11:08 ` [WIP 16/17] scripts/import_vger_from_mbox: use v2 layout for import Eric Wong (Contractor, The Linux Foundation)
@ 2018-02-15 11:08 ` Eric Wong (Contractor, The Linux Foundation)
16 siblings, 0 replies; 21+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-02-15 11:08 UTC (permalink / raw)
To: meta
There's a lot of crap in archives and git-fast-import
accepts empty names and email addresses for authors
just fine.
---
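For illustration only (not part of the patch): a minimal sketch of how
the author line ends up looking once undefined names and addresses
fall back to empty strings. The author_line() helper is hypothetical;
in the patch this logic lives inline in add(), which also warns and
calls utf8::encode on the defined values.

```perl
use strict;
use warnings;

# hypothetical helper mirroring the fallback logic this patch adds to add()
sub author_line {
	my ($name, $email, $date_raw) = @_;
	$email = '' unless defined $email;
	if (defined $name) {
		$name =~ tr/<>//d; # stray angle brackets in the name confuse git
	} else {
		$name = '';
	}
	"author $name <$email> $date_raw\n";
}

print author_line('A U Thor', 'u@example.com', '1518692880 +0000');
# a bogus From: line with nothing extractable still gives a usable line
print author_line(undef, undef, '0 +0000');
```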
lib/PublicInbox/Import.pm | 27 +++++++++++++++++++--------
1 file changed, 19 insertions(+), 8 deletions(-)
diff --git a/lib/PublicInbox/Import.pm b/lib/PublicInbox/Import.pm
index 845fbb6..f8d1003 100644
--- a/lib/PublicInbox/Import.pm
+++ b/lib/PublicInbox/Import.pm
@@ -246,11 +246,6 @@ sub add {
my $from = $mime->header('From');
my ($email) = PublicInbox::Address::emails($from);
my ($name) = PublicInbox::Address::names($from);
- # git gets confused with:
- # "'A U Thor <u@example.com>' via foo" <foo@example.com>
- # ref:
- # <CAD0k6qSUYANxbjjbE4jTW4EeVwOYgBD=bXkSu=akiYC_CB7Ffw@mail.gmail.com>
- $name =~ tr/<>//d;
my $date_raw = parse_date($mime);
my $subject = $mime->header('Subject');
@@ -297,10 +292,26 @@ sub add {
print $w "reset $ref\n" or wfail;
}
- utf8::encode($email);
- utf8::encode($name);
+ # quiet down wide character warnings with utf8::encode
+ if (defined $email) {
+ utf8::encode($email);
+ } else {
+ $email = '';
+ warn "no email in From: $from\n";
+ }
+
+ # git gets confused with:
+ # "'A U Thor <u@example.com>' via foo" <foo@example.com>
+ # ref:
+ # <CAD0k6qSUYANxbjjbE4jTW4EeVwOYgBD=bXkSu=akiYC_CB7Ffw@mail.gmail.com>
+ if (defined $name) {
+ $name =~ tr/<>//d;
+ utf8::encode($name);
+ } else {
+ $name = '';
+ warn "no name in From: $from\n";
+ }
utf8::encode($subject);
- # quiet down wide character warnings:
print $w "commit $ref\nmark :$commit\n",
"author $name <$email> $date_raw\n",
"committer $self->{ident} ", now_raw(), "\n" or wfail;
--
EW