From: Vladimir Marek <Vladimir.Marek@oracle.com>
To: notmuch@notmuchmail.org
Subject: Deduplication ?
Date: Mon, 2 Jun 2014 14:32:12 +0200
Message-ID: <20140602123212.GA12639@virt.cz.oracle.com>

Hi,

I want to import a bigger chunk of archived messages into my notmuch
database, about 100k messages. The problem is that I most probably
already have quite a lot of those messages in the DB. Basically, I would
like to add only those I don't already have.

There are two possibilities:

a) I add all 100k messages and then remove the duplicates.

b) I write a script that parses the message IDs of the
   to-be-added messages and tries to match them against the notmuch DB,
   adding only the files I can't find already.
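
Option b) could be sketched roughly as follows. This is only a minimal
sketch under some assumptions: the helper names (message_id_of,
known_ids_from_notmuch, new_files) are made up for illustration, the
known IDs are taken from "notmuch search --output=messages '*'" with
each output line assumed to look like "id:<message-id>", and files
without a Message-ID header are simply treated as new.

```python
import subprocess
from email.parser import BytesHeaderParser


def message_id_of(path):
    """Return the Message-ID of the mail file at `path` without the
    surrounding angle brackets, or None if the header is missing."""
    with open(path, "rb") as fh:
        headers = BytesHeaderParser().parse(fh)  # parses headers only
    raw = headers.get("Message-ID")
    if raw is None:
        return None
    return str(raw).strip().strip("<>")


def known_ids_from_notmuch():
    """Collect the message IDs already in the notmuch DB.
    Assumption: every output line has the form 'id:<message-id>'."""
    out = subprocess.run(
        ["notmuch", "search", "--output=messages", "*"],
        capture_output=True, text=True, check=True,
    ).stdout
    ids = set()
    for line in out.splitlines():
        line = line.strip()
        if line.startswith("id:"):
            ids.add(line[3:])
    return ids


def new_files(paths, known_ids):
    """Return the files whose Message-ID is neither in `known_ids` nor
    already seen earlier in `paths`. Files without a Message-ID are
    kept, since they cannot be matched by ID."""
    seen = set(known_ids)
    fresh = []
    for path in paths:
        mid = message_id_of(path)
        if mid is None:
            fresh.append(path)
        elif mid not in seen:
            seen.add(mid)
            fresh.append(path)
    return fresh
```

The files returned by new_files(archive_paths, known_ids_from_notmuch())
would then be copied into the mail store before running "notmuch new".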

Option b) might be the better one, but I started to play with the idea of
deduplication. I'm thinking about listing all the message IDs stored in
the DB, listing all files belonging to each ID, and deleting all but one.
I'm also thinking about implementing some simple algorithm that tells me
whether the messages are really very similar, just to be sure I don't
delete something I want to keep.
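
The deduplication idea, including the "are they really similar" check,
might look roughly like this. Again this is only a sketch under
assumptions: the function names and the 0.9 threshold are invented for
illustration, similarity is measured with Python's difflib on the first
text/plain part only, and nothing is deleted. The function just returns
two lists, one of files that look safe to remove and one of files that
share a Message-ID but differ enough to deserve a manual look.

```python
import difflib
import email
from collections import defaultdict


def plain_body(msg):
    """Return the first text/plain part of `msg` as a string ('' if none)."""
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True) or b""
            charset = part.get_content_charset() or "utf-8"
            try:
                return payload.decode(charset, errors="replace")
            except LookupError:  # unknown charset name in the header
                return payload.decode("utf-8", errors="replace")
    return ""


def similarity(a, b):
    """Ratio in [0, 1]; empty bodies are never considered similar."""
    if not a or not b:
        return 0.0
    return difflib.SequenceMatcher(None, a, b).ratio()


def plan_removals(paths, threshold=0.9):
    """Group files by Message-ID, keep the first file of each group, and
    sort the rest into (remove, review) depending on how similar their
    body is to the kept copy. Files without a Message-ID are ignored."""
    groups = defaultdict(list)
    for path in paths:
        with open(path, "rb") as fh:
            msg = email.message_from_binary_file(fh)
        mid = msg.get("Message-ID")
        if mid is None:
            continue
        groups[str(mid).strip().strip("<>")].append((path, plain_body(msg)))

    remove, review = [], []
    for copies in groups.values():
        _, kept_body = copies[0]
        for path, body in copies[1:]:
            if similarity(kept_body, body) >= threshold:
                remove.append(path)
            else:
                review.append(path)
    return remove, review
```

Copies that went through a mailing list often carry an extra footer, so
a threshold below 1.0 seems reasonable. Comparing very large bodies this
way can be slow, so for 100k messages it may be worth comparing only the
groups that actually contain more than one file.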

Has anyone played with this idea?

-- 
	Vlad

Thread overview: 13+ messages
2014-06-02 12:32 Vladimir Marek [this message]
2014-06-02 13:43 ` Deduplication ? David Edmondson
2014-06-02 13:54   ` Vladimir Marek
2014-06-02 14:10     ` Mark Walters
2014-06-02 14:15       ` Mark Walters
2014-06-02 13:51 ` Mark Walters
2014-06-02 14:17   ` Tomi Ollila
2014-06-02 14:26     ` Mark Walters
2014-06-02 17:06       ` Jani Nikula
2014-06-02 17:25         ` David Edmondson
2014-06-02 18:29           ` Jani Nikula
2014-06-06 10:40           ` Vladimir Marek
2014-06-07 13:37             ` Tomi Ollila
