From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on dcvr.yhbt.net
X-Spam-Level:
X-Spam-ASN:
X-Spam-Status: No, score=-4.0 required=3.0 tests=ALL_TRUSTED,BAYES_00
	shortcircuit=no autolearn=ham autolearn_force=no version=3.4.0
Received: from localhost (dcvr.yhbt.net [127.0.0.1])
	by dcvr.yhbt.net (Postfix) with ESMTP id F3C751FAE2
	for ; Wed, 28 Feb 2018 23:42:07 +0000 (UTC)
From: "Eric Wong (Contractor, The Linux Foundation)"
To: meta@public-inbox.org
Subject: [PATCH 01/21] v2writable: warn on duplicate Message-IDs
Date: Wed, 28 Feb 2018 23:41:42 +0000
Message-Id: <20180228234202.8839-2-e@80x24.org>
In-Reply-To: <20180228234202.8839-1-e@80x24.org>
References: <20180228234202.8839-1-e@80x24.org>
List-Id:

This should give us an idea of how much of a problem
deduplication will be.
---
 lib/PublicInbox/SearchIdx.pm  | 6 ++++--
 lib/PublicInbox/V2Writable.pm | 2 +-
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index cc7e7ec..f9207e9 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -515,13 +515,15 @@ sub unindex_blob {
 }
 
 sub index_mm {
-	my ($self, $mime) = @_;
+	my ($self, $mime, $warn_existing) = @_;
 	my $mid = mid_clean(mid_mime($mime));
 	my $mm = $self->{mm};
 	my $num = $mm->mid_insert($mid);
+	return $num if defined $num;
 
+	warn "<$mid> reused\n" if $warn_existing;
 	# fallback to num_for since filters like RubyLang set the number
-	defined $num ? $num : $mm->num_for($mid);
+	$mm->num_for($mid);
 }
 
 sub unindex_mm {
diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index cf19c76..29ed23c 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -63,7 +63,7 @@ sub add {
 	my ($len, $msgref) = @{$im->{last_object}};
 
 	$self->idx_init;
-	my $num = $self->{all}->index_mm($mime);
+	my $num = $self->{all}->index_mm($mime, 1);
 	my $nparts = $self->{partitions};
 	my $part = $num % $nparts;
 	my $idx = $self->idx_part($part);
-- 
EW