From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <tomi.ollila@iki.fi>
Received: from localhost (localhost [127.0.0.1])
	by olra.theworths.org (Postfix) with ESMTP id 55757431FAF
	for <notmuch@notmuchmail.org>; Mon,  2 Jun 2014 07:17:47 -0700 (PDT)
X-Virus-Scanned: Debian amavisd-new at olra.theworths.org
X-Spam-Flag: NO
X-Spam-Score: 0
X-Spam-Level: 
X-Spam-Status: No, score=0 tagged_above=-999 required=5 tests=[none]
	autolearn=disabled
Received: from olra.theworths.org ([127.0.0.1])
	by localhost (olra.theworths.org [127.0.0.1]) (amavisd-new, port 10024)
	with ESMTP id a6qh5T1ss-aH for <notmuch@notmuchmail.org>;
	Mon,  2 Jun 2014 07:17:38 -0700 (PDT)
Received: from guru.guru-group.fi (guru.guru-group.fi [46.183.73.34])
	by olra.theworths.org (Postfix) with ESMTP id 6B477431FAE
	for <notmuch@notmuchmail.org>; Mon,  2 Jun 2014 07:17:38 -0700 (PDT)
Received: from guru.guru-group.fi (localhost [IPv6:::1])
	by guru.guru-group.fi (Postfix) with ESMTP id D19C710005E;
	Mon,  2 Jun 2014 17:17:33 +0300 (EEST)
From: Tomi Ollila <tomi.ollila@iki.fi>
To: Mark Walters <markwalters1009@gmail.com>,
	Vladimir Marek <Vladimir.Marek@oracle.com>, notmuch@notmuchmail.org
Subject: Re: Deduplication ?
In-Reply-To: <87d2ers9mi.fsf@qmul.ac.uk>
References: <20140602123212.GA12639@virt.cz.oracle.com>
	<87d2ers9mi.fsf@qmul.ac.uk>
User-Agent: Notmuch/0.18+28~gcecaba1 (http://notmuchmail.org) Emacs/24.3.1
	(x86_64-unknown-linux-gnu)
X-Face: HhBM'cA~<r"^Xv\KRN0P{vn'Y"Kd;zg_y3S[4)KSN~s?O\"QPoL
	$[Xv_BD:i/F$WiEWax}R(MPS`^UaptOGD`*/=@\1lKoVa9tnrg0TW?"r7aRtgk[F
	!)g;OY^,BjTbr)Np:%c_o'jj,Z
Date: Mon, 02 Jun 2014 17:17:33 +0300
Message-ID: <m2ppirs8ea.fsf@guru.guru-group.fi>
MIME-Version: 1.0
Content-Type: text/plain
X-BeenThere: notmuch@notmuchmail.org
X-Mailman-Version: 2.1.13
Precedence: list
List-Id: "Use and development of the notmuch mail system."
	<notmuch.notmuchmail.org>
List-Unsubscribe: <http://notmuchmail.org/mailman/options/notmuch>,
	<mailto:notmuch-request@notmuchmail.org?subject=unsubscribe>
List-Archive: <http://notmuchmail.org/pipermail/notmuch>
List-Post: <mailto:notmuch@notmuchmail.org>
List-Help: <mailto:notmuch-request@notmuchmail.org?subject=help>
List-Subscribe: <http://notmuchmail.org/mailman/listinfo/notmuch>,
	<mailto:notmuch-request@notmuchmail.org?subject=subscribe>
X-List-Received-Date: Mon, 02 Jun 2014 14:17:47 -0000

On Mon, Jun 02 2014, Mark Walters <markwalters1009@gmail.com> wrote:

> Vladimir Marek <Vladimir.Marek@oracle.com> writes:
>
>> Hi,
>>
>> I want to import a bigger chunk of archived messages into my notmuch
>> database. It's about 100k messages. The problem is that I most probably
>> already have quite a lot of those messages in the DB. Basically I would
>> like to add only those I don't have already.
>>
>> There are two possibilities
>>
>> a) I will add all the 100k messages and then remove the duplicates.
>>
>> b) I will write a script which will parse the message IDs of the
>>    to-be-added messages and try to match them to the notmuch DB. Adding
>>    only files I can't find already.
>>
>> Ad b) might be the better option, but I started to play with the idea of
>> deduplication. I'm thinking about listing all the message IDs stored in
>> DB, listing all files belonging to the IDs and deleting all but one.
>> Also I'm thinking about implementing some simple algorithm telling me
>> whether the messages are really very similar. Just to be sure I don't
>> delete something I don't want to.
>>
>> Has anyone played with the idea?
>
> I am not sure what your use case is but notmuch automatically
> deduplicates: that is if the message-id is one it has already seen no
> further indexing takes place. The only thing that happens is the new
> filename gets added to the list of filenames for the message.
>
> Thus importing should be almost as fast as if the message were not
> there, and the database should be almost identical to what you would get
> if you only imported the genuine new messages.
>
> If you want to save disk space then you could delete the duplicates
> after with something like
>
> notmuch search --output=files --format=text0 --duplicate=2 '*' piped to
> xargs -0

What if there are 3 duplicates (or 4... ;)
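
A rough sketch of one way to cover that case, assuming --duplicate=N
lists the N-th filename of each matching message, so that asking for
N = 2, 3, 4, ... until nothing comes back collects every extra copy.
The helper names (notmuch_nth_copy, collect_extra_copies) and the max_n
safety cap are made up for illustration; which copy counts as the
"first" is whatever order notmuch reports.

```python
import os
import subprocess
from typing import Callable, List


def notmuch_nth_copy(n: int, query: str = "*") -> List[str]:
    """Return the n-th filename of every message matching query."""
    out = subprocess.run(
        ["notmuch", "search", "--output=files", "--format=text0",
         f"--duplicate={n}", query],
        check=True, stdout=subprocess.PIPE,
    ).stdout
    # text0 output is NUL-separated; skip the empty trailing field.
    return [os.fsdecode(p) for p in out.split(b"\0") if p]


def collect_extra_copies(nth_copy: Callable[[int], List[str]],
                         max_n: int = 1000) -> List[str]:
    """Gather the 2nd, 3rd, ... copies until a round comes back empty."""
    extras: List[str] = []
    for n in range(2, max_n + 1):
        batch = nth_copy(n)
        if not batch:
            break
        extras.extend(batch)
    return extras
```

collect_extra_copies(notmuch_nth_copy) only builds the list; nothing is
deleted, so the result can be reviewed before any file is removed.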

>
> (but please test it carefully first!)

One should also use some message-content heuristic to check that the
content is indeed a duplicate and not something totally different (not that
we can see the differing content anyway, since only the first file gets
indexed... but...)
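
One possible heuristic, purely as a sketch: ignore the headers (Received:
lines will differ between copies), pull out the text parts and compare
them with difflib. The function names and the 0.9 threshold are
arbitrary choices for illustration, not anything notmuch provides, and
SequenceMatcher gets slow on very large bodies.

```python
import difflib
import email
from email import policy


def body_text(raw: bytes) -> str:
    """Concatenate the text parts of a raw message, whitespace-normalised."""
    msg = email.message_from_bytes(raw, policy=policy.default)
    chunks = []
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue  # skip multipart containers and attachments
        payload = part.get_payload(decode=True) or b""
        charset = part.get_content_charset() or "utf-8"
        try:
            chunks.append(payload.decode(charset, errors="replace"))
        except LookupError:  # unknown charset name in the message
            chunks.append(payload.decode("utf-8", errors="replace"))
    return " ".join(" ".join(chunks).split())


def similarity(raw_a: bytes, raw_b: bytes) -> float:
    """Ratio in [0, 1]; 1.0 means the extracted bodies are identical."""
    return difflib.SequenceMatcher(
        None, body_text(raw_a), body_text(raw_b)).ratio()


def looks_like_duplicate(raw_a: bytes, raw_b: bytes,
                         threshold: float = 0.9) -> bool:
    """True when the two bodies are at least `threshold` similar."""
    return similarity(raw_a, raw_b) >= threshold
```

A threshold below 1.0 leaves room for list footers and the like that
get appended to one copy but not the other; files that fall under it
are the ones worth looking at by hand before deleting.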

>
> I would think something like this is better than trying to parse the
> message-ids yourself.


>
> Best wishes
>
> Mark
>

Tomi


>
>>
>> -- 
>> 	Vlad