unofficial mirror of notmuch@notmuchmail.org
* Deduplication ?
@ 2014-06-02 12:32 Vladimir Marek
  2014-06-02 13:43 ` David Edmondson
  2014-06-02 13:51 ` Mark Walters
  0 siblings, 2 replies; 13+ messages in thread
From: Vladimir Marek @ 2014-06-02 12:32 UTC (permalink / raw)
  To: notmuch

Hi,

I want to import a bigger chunk of archived messages into my notmuch
database. It's about 100k messages. The problem is that I most probably
already have quite a lot of those messages in the DB. Basically I would
like to add only those I don't have already.

There are two possibilities

a) I will add all the 100k messages and then remove the duplicates.

b) I will write a script which will parse the message IDs of the
   to-be-added messages and try to match them against the notmuch DB,
   adding only files I can't find already.
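
A minimal sketch of option b), assuming one message per file and an
unfolded Message-ID header. The helper names message_id and is_indexed
are made up for illustration, and IDs containing unusual characters may
need extra quoting in the query:

```shell
# Print the Message-ID of a mail file, without the angle brackets.
# Only the header block is scanned (sed quits at the first empty line).
message_id() {
    sed -n -e '/^$/q' \
        -e 's/^[Mm]essage-[Ii][Dd]:[[:space:]]*<\(.*\)>[[:space:]]*$/\1/p' "$1" |
        head -n 1
}

# Succeed if notmuch already knows a message with this ID.
is_indexed() {
    [ "$(notmuch count "id:$1")" -gt 0 ]
}
```

A file would then be copied into the mail store only when
is_indexed "$(message_id "$file")" fails.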

Option b) might be the better one, but I started to play with the idea
of deduplication. I'm thinking about listing all the message IDs stored
in the DB, listing all files belonging to each ID and deleting all but
one. I'm also thinking about implementing some simple algorithm telling
me whether the messages are really very similar, just to be sure I don't
delete something I shouldn't.

Has anyone played with this idea?

-- 
	Vlad

* Re: Deduplication ?
  2014-06-02 12:32 Deduplication ? Vladimir Marek
@ 2014-06-02 13:43 ` David Edmondson
  2014-06-02 13:54   ` Vladimir Marek
  2014-06-02 13:51 ` Mark Walters
  1 sibling, 1 reply; 13+ messages in thread
From: David Edmondson @ 2014-06-02 13:43 UTC (permalink / raw)
  To: Vladimir Marek, notmuch

On Mon, Jun 02 2014, Vladimir Marek wrote:
> Hi,
>
> I want to import bigger chunk of archived messages into my notmuch
> database. It's about 100k messages. The problem is, that I most probably
> have quite a lot of those messages in the DB. Basically I would like to
> add only those I don't have already.
>
> There are two possibilities
>
> a) I will add all the 100k messages and then remove the duplicities.
>
> b) I will write a script which will parse the message ID's of the
>    to-be-added messages and try to match them to the notmuch DB. Adding
>    only files I can't find already.
>
> Ad b) might be better option, but I started to play with the idea of
> deduplication. I'm thinking about listing all the message IDs stored in
> DB, listing all files belonging to the IDs and deleting all but one.
> Also I'm thinking about implementing some simple algorithm telling me
> whether the messages are really very similar. Just to be sure I don't
> delete something I don't want to.
>
> Was anyone playing with the idea?

notsync[1] used the (lack of) existence of a message id in the store to
decide whether to add something from an IMAP server, but it is old,
crufty, unused and unloved code.

> -- 
> 	Vlad
> _______________________________________________
> notmuch mailing list
> notmuch@notmuchmail.org
> http://notmuchmail.org/mailman/listinfo/notmuch

Footnotes: 
[1]  https://github.com/dme/notsync

* Re: Deduplication ?
  2014-06-02 12:32 Deduplication ? Vladimir Marek
  2014-06-02 13:43 ` David Edmondson
@ 2014-06-02 13:51 ` Mark Walters
  2014-06-02 14:17   ` Tomi Ollila
  1 sibling, 1 reply; 13+ messages in thread
From: Mark Walters @ 2014-06-02 13:51 UTC (permalink / raw)
  To: Vladimir Marek, notmuch


Vladimir Marek <Vladimir.Marek@oracle.com> writes:

> Hi,
>
> I want to import bigger chunk of archived messages into my notmuch
> database. It's about 100k messages. The problem is, that I most probably
> have quite a lot of those messages in the DB. Basically I would like to
> add only those I don't have already.
>
> There are two possibilities
>
> a) I will add all the 100k messages and then remove the duplicities.
>
> b) I will write a script which will parse the message ID's of the
>    to-be-added messages and try to match them to the notmuch DB. Adding
>    only files I can't find already.
>
> Ad b) might be better option, but I started to play with the idea of
> deduplication. I'm thinking about listing all the message IDs stored in
> DB, listing all files belonging to the IDs and deleting all but one.
> Also I'm thinking about implementing some simple algorithm telling me
> whether the messages are really very similar. Just to be sure I don't
> delete something I don't want to.
>
> Was anyone playing with the idea?

I am not sure what your use case is, but notmuch automatically
deduplicates: if the message-id is one it has already seen, no
further indexing takes place. The only thing that happens is that the
new filename gets added to the list of filenames for the message.

Thus importing should be almost as fast as if the message were not
there, and the database should be almost identical to what you would get
if you only imported the genuine new messages.

If you want to save disk space then you could delete the duplicates
after with something like

notmuch search --output=files --format=text0 --duplicate=2 '*' piped to
xargs -0

(but please test it carefully first!)
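
A dry-run sketch of that pipeline, assuming the intended xargs command
is rm. The function name is made up, and -r (do nothing on empty input)
is a GNU xargs option that the BSDs also accept:

```shell
# Print the rm command for every second copy instead of running it.
preview_duplicate_removal() {
    notmuch search --output=files --format=text0 --duplicate=2 '*' |
        xargs -0 -r -n 1 echo rm --
}
```

Once the printed list looks right, dropping "echo" performs the
deletion; running 'notmuch new' afterwards lets the database notice the
removed files.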

I would think something like this is better than trying to parse the
message-ids yourself.

Best wishes

Mark


>
> -- 
> 	Vlad

* Re: Deduplication ?
  2014-06-02 13:43 ` David Edmondson
@ 2014-06-02 13:54   ` Vladimir Marek
  2014-06-02 14:10     ` Mark Walters
  0 siblings, 1 reply; 13+ messages in thread
From: Vladimir Marek @ 2014-06-02 13:54 UTC (permalink / raw)
  To: David Edmondson; +Cc: notmuch

> > I want to import bigger chunk of archived messages into my notmuch
> > database. It's about 100k messages. The problem is, that I most probably
> > have quite a lot of those messages in the DB. Basically I would like to
> > add only those I don't have already.
> >
> > There are two possibilities
> >
> > a) I will add all the 100k messages and then remove the duplicities.
> >
> > b) I will write a script which will parse the message ID's of the
> >    to-be-added messages and try to match them to the notmuch DB. Adding
> >    only files I can't find already.
> >
> > Ad b) might be better option, but I started to play with the idea of
> > deduplication. I'm thinking about listing all the message IDs stored in
> > DB, listing all files belonging to the IDs and deleting all but one.
> > Also I'm thinking about implementing some simple algorithm telling me
> > whether the messages are really very similar. Just to be sure I don't
> > delete something I don't want to.
> >
> > Was anyone playing with the idea?
> 
> notsync[1] used the (lack of) existence of a message id in the store to
> decide whether to add something from an IMAP server, but it is old,
> crufty, unused and unloved code.

I see, that's close to my b) solution, thanks!
-- 
	Vlad

* Re: Deduplication ?
  2014-06-02 13:54   ` Vladimir Marek
@ 2014-06-02 14:10     ` Mark Walters
  2014-06-02 14:15       ` Mark Walters
  0 siblings, 1 reply; 13+ messages in thread
From: Mark Walters @ 2014-06-02 14:10 UTC (permalink / raw)
  To: Vladimir Marek, David Edmondson; +Cc: notmuch


Vladimir Marek <Vladimir.Marek@oracle.com> writes:

>> > I want to import bigger chunk of archived messages into my notmuch
>> > database. It's about 100k messages. The problem is, that I most probably
>> > have quite a lot of those messages in the DB. Basically I would like to
>> > add only those I don't have already.
>> >
>> > There are two possibilities
>> >
>> > a) I will add all the 100k messages and then remove the duplicities.
>> >
>> > b) I will write a script which will parse the message ID's of the
>> >    to-be-added messages and try to match them to the notmuch DB. Adding
>> >    only files I can't find already.
>> >
>> > Ad b) might be better option, but I started to play with the idea of
>> > deduplication. I'm thinking about listing all the message IDs stored in
>> > DB, listing all files belonging to the IDs and deleting all but one.
>> > Also I'm thinking about implementing some simple algorithm telling me
>> > whether the messages are really very similar. Just to be sure I don't
>> > delete something I don't want to.
>> >
>> > Was anyone playing with the idea?
>> 
>> notsync[1] used the (lack of) existence of a message id in the store to
>> decide whether to add something from an IMAP server, but it is old,
>> crufty, unused and unloved code.
>
> I see, that's close to my b) solution, thanks!

Did you mean a) here? The idea was to add them all first and then run
this script to delete the duplicates.

Best wishes

Mark

> -- 
> 	Vlad

* Re: Deduplication ?
  2014-06-02 14:10     ` Mark Walters
@ 2014-06-02 14:15       ` Mark Walters
  0 siblings, 0 replies; 13+ messages in thread
From: Mark Walters @ 2014-06-02 14:15 UTC (permalink / raw)
  To: Vladimir Marek, David Edmondson; +Cc: notmuch



Mark Walters <markwalters1009@gmail.com> writes:

> Vladimir Marek <Vladimir.Marek@oracle.com> writes:
>
>>> > I want to import bigger chunk of archived messages into my notmuch
>>> > database. It's about 100k messages. The problem is, that I most probably
>>> > have quite a lot of those messages in the DB. Basically I would like to
>>> > add only those I don't have already.
>>> >
>>> > There are two possibilities
>>> >
>>> > a) I will add all the 100k messages and then remove the duplicities.
>>> >
>>> > b) I will write a script which will parse the message ID's of the
>>> >    to-be-added messages and try to match them to the notmuch DB. Adding
>>> >    only files I can't find already.
>>> >
>>> > Ad b) might be better option, but I started to play with the idea of
>>> > deduplication. I'm thinking about listing all the message IDs stored in
>>> > DB, listing all files belonging to the IDs and deleting all but one.
>>> > Also I'm thinking about implementing some simple algorithm telling me
>>> > whether the messages are really very similar. Just to be sure I don't
>>> > delete something I don't want to.
>>> >
>>> > Was anyone playing with the idea?
>>> 
>>> notsync[1] used the (lack of) existence of a message id in the store to
>>> decide whether to add something from an IMAP server, but it is old,
>>> crufty, unused and unloved code.
>>
>> I see, that's close to my b) solution, thanks!
>
> Did you mean a) here? The idea was to add them all first and then run
> this script to delete the duplicates.
>

Sorry: out of order arrival times and lack of care on my part. Sorry!

MW

> Best wishes
>
> Mark
>
>> -- 
>> 	Vlad

* Re: Deduplication ?
  2014-06-02 13:51 ` Mark Walters
@ 2014-06-02 14:17   ` Tomi Ollila
  2014-06-02 14:26     ` Mark Walters
  0 siblings, 1 reply; 13+ messages in thread
From: Tomi Ollila @ 2014-06-02 14:17 UTC (permalink / raw)
  To: Mark Walters, Vladimir Marek, notmuch

On Mon, Jun 02 2014, Mark Walters <markwalters1009@gmail.com> wrote:

> Vladimir Marek <Vladimir.Marek@oracle.com> writes:
>
>> Hi,
>>
>> I want to import bigger chunk of archived messages into my notmuch
>> database. It's about 100k messages. The problem is, that I most probably
>> have quite a lot of those messages in the DB. Basically I would like to
>> add only those I don't have already.
>>
>> There are two possibilities
>>
>> a) I will add all the 100k messages and then remove the duplicities.
>>
>> b) I will write a script which will parse the message ID's of the
>>    to-be-added messages and try to match them to the notmuch DB. Adding
>>    only files I can't find already.
>>
>> Ad b) might be better option, but I started to play with the idea of
>> deduplication. I'm thinking about listing all the message IDs stored in
>> DB, listing all files belonging to the IDs and deleting all but one.
>> Also I'm thinking about implementing some simple algorithm telling me
>> whether the messages are really very similar. Just to be sure I don't
>> delete something I don't want to.
>>
>> Was anyone playing with the idea?
>
> I am not sure what your use case is but notmuch automatically
> deduplicates: that is if the message-id is one it has already seen no
> further indexing takes place. The only thing that happens is the new
> filename gets added to the list of filenames for the message.
>
> Thus importing should be almost as fast as if the message were not
> there, and the database should be almost identical to what you would get
> if you only imported the genuine new messages.
>
> If you want to save disk space then you could delete the duplicates
> after with something like
>
> notmuch search --output=files --format=text0 --duplicate=2 '*' piped to
> xargs -0

What if there are 3 duplicates (or 4... ;)

>
> (but please test it carefully first!)

One should also have some message content heuristics to determine that the
content is indeed duplicate and not something totally different (not that
we can see the different content anyway... but...)

>
> I would think something like this is better than trying to parse the
> message-ids yourself.


>
> Best wishes
>
> Mark
>

Tomi


>
>>
>> -- 
>> 	Vlad

* Re: Deduplication ?
  2014-06-02 14:17   ` Tomi Ollila
@ 2014-06-02 14:26     ` Mark Walters
  2014-06-02 17:06       ` Jani Nikula
  0 siblings, 1 reply; 13+ messages in thread
From: Mark Walters @ 2014-06-02 14:26 UTC (permalink / raw)
  To: Tomi Ollila, Vladimir Marek, notmuch


Tomi Ollila <tomi.ollila@iki.fi> writes:

> On Mon, Jun 02 2014, Mark Walters <markwalters1009@gmail.com> wrote:
>
>> Vladimir Marek <Vladimir.Marek@oracle.com> writes:
>>
>>> Hi,
>>>
>>> I want to import bigger chunk of archived messages into my notmuch
>>> database. It's about 100k messages. The problem is, that I most probably
>>> have quite a lot of those messages in the DB. Basically I would like to
>>> add only those I don't have already.
>>>
>>> There are two possibilities
>>>
>>> a) I will add all the 100k messages and then remove the duplicities.
>>>
>>> b) I will write a script which will parse the message ID's of the
>>>    to-be-added messages and try to match them to the notmuch DB. Adding
>>>    only files I can't find already.
>>>
>>> Ad b) might be better option, but I started to play with the idea of
>>> deduplication. I'm thinking about listing all the message IDs stored in
>>> DB, listing all files belonging to the IDs and deleting all but one.
>>> Also I'm thinking about implementing some simple algorithm telling me
>>> whether the messages are really very similar. Just to be sure I don't
>>> delete something I don't want to.
>>>
>>> Was anyone playing with the idea?
>>
>> I am not sure what your use case is but notmuch automatically
>> deduplicates: that is if the message-id is one it has already seen no
>> further indexing takes place. The only thing that happens is the new
>> filename gets added to the list of filenames for the message.
>>
>> Thus importing should be almost as fast as if the message were not
>> there, and the database should be almost identical to what you would get
>> if you only imported the genuine new messages.
>>
>> If you want to save disk space then you could delete the duplicates
>> after with something like
>>
>> notmuch search --output=files --format=text0 --duplicate=2 '*' piped to
>> xargs -0
>
> What if there are 3 duplicates (or 4... ;)

I was assuming that it was merging 2 duplicate-free bunches of messages,
but I guess the new 100000 might not be. In that case running the above
repeatedly (i.e. until it is a no-op) would be fine.

>
>>
>> (but please test it carefully first!)
>
> One should also have some message content heuristics to determine that the
> content is indeed duplicate and not something totally different (not that
> we can see the different content anyway... but...)

That would be nice.

Best wishes

Mark


>>
>> I would think something like this is better than trying to parse the
>> message-ids yourself.
>
>
>>
>> Best wishes
>>
>> Mark
>>
>
> Tomi
>
>
>>
>>>
>>> -- 
>>> 	Vlad

* Re: Deduplication ?
  2014-06-02 14:26     ` Mark Walters
@ 2014-06-02 17:06       ` Jani Nikula
  2014-06-02 17:25         ` David Edmondson
  0 siblings, 1 reply; 13+ messages in thread
From: Jani Nikula @ 2014-06-02 17:06 UTC (permalink / raw)
  To: Mark Walters, Tomi Ollila, Vladimir Marek, notmuch

On Mon, 02 Jun 2014, Mark Walters <markwalters1009@gmail.com> wrote:
> Tomi Ollila <tomi.ollila@iki.fi> writes:
>
>> On Mon, Jun 02 2014, Mark Walters <markwalters1009@gmail.com> wrote:
>>
>>> Vladimir Marek <Vladimir.Marek@oracle.com> writes:
>>> If you want to save disk space then you could delete the duplicates
>>> after with something like
>>>
>>> notmuch search --output=files --format=text0 --duplicate=2 '*' piped to
>>> xargs -0
>>
>> What if there are 3 duplicates (or 4... ;)
>
> I was assuming that it was merging 2 duplicate-free bunches of messages,
> but I guess the new 100000 might not be. In that case running the above
> repeatedly (ie until it is a no-op) would be fine. 

With 'notmuch new' in between the runs, obviously.

Alternatively, find the biggest --duplicate=N which still outputs
something, and run the command for each N...2.
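
A sketch of that second variant, assuming --duplicate=N prints nothing
once no message has N files. The function names are made up, the output
is newline-separated (so filenames containing newlines are not handled),
and nothing is deleted here, only listed:

```shell
# Largest N for which --duplicate=N still lists a file (1 if there
# are no duplicates at all).
max_duplicate() {
    n=1
    while [ -n "$(notmuch search --output=files \
                  --duplicate=$((n + 1)) '*' | head -n 1)" ]; do
        n=$((n + 1))
    done
    echo "$n"
}

# List every extra copy, from the Nth file down to the 2nd.
list_all_extra_copies() {
    m=$(max_duplicate)
    while [ "$m" -ge 2 ]; do
        notmuch search --output=files --duplicate="$m" '*'
        m=$((m - 1))
    done
}
```

Because the whole list is produced before anything is removed, no
'notmuch new' is needed between the individual queries.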


>> One should also have some message content heuristics to determine that the
>> content is indeed duplicate and not something totally different (not that
>> we can see the different content anyway... but...)
>
> That would be nice.

And quite hard.


BR,
Jani.

* Re: Deduplication ?
  2014-06-02 17:06       ` Jani Nikula
@ 2014-06-02 17:25         ` David Edmondson
  2014-06-02 18:29           ` Jani Nikula
  2014-06-06 10:40           ` Vladimir Marek
  0 siblings, 2 replies; 13+ messages in thread
From: David Edmondson @ 2014-06-02 17:25 UTC (permalink / raw)
  To: Jani Nikula, Mark Walters, Tomi Ollila, Vladimir Marek, notmuch

On Mon, Jun 02 2014, Jani Nikula wrote:
>>> One should also have some message content heuristics to determine that the
>>> content is indeed duplicate and not something totally different (not that
>>> we can see the different content anyway... but...)
>>
>> That would be nice.
>
> And quite hard.

Thinking about this a bit...

The headers are likely to be different, so you could remove them (get
rid of everything up to the first empty line).

Various mailing lists add footers, so you would need to remove them (a
regular expression based approach would catch most of them easily).

The remaining content should be the same for identical messages, so a
sensible hash (md5) could be used to compare.

Although, some MTAs modify the body of the message when manipulating
encoding. I don't know how to address this.
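
A minimal sketch of the header-stripping and hashing steps, assuming
md5sum from GNU coreutils is available. The function names are made up,
and footer removal and MTA re-encoding are deliberately not handled:

```shell
# Hash of everything after the first empty line (the message body).
body_hash() {
    sed '1,/^$/d' "$1" | md5sum | cut -d ' ' -f 1
}

# Succeed if two mail files have byte-identical bodies.
same_body() {
    [ "$(body_hash "$1")" = "$(body_hash "$2")" ]
}
```

For example, same_body "$f1" "$f2" succeeds for two copies that differ
only in their Received: headers.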

* Re: Deduplication ?
  2014-06-02 17:25         ` David Edmondson
@ 2014-06-02 18:29           ` Jani Nikula
  2014-06-06 10:40           ` Vladimir Marek
  1 sibling, 0 replies; 13+ messages in thread
From: Jani Nikula @ 2014-06-02 18:29 UTC (permalink / raw)
  To: David Edmondson, Mark Walters, Tomi Ollila, Vladimir Marek,
	notmuch

On Mon, 02 Jun 2014, David Edmondson <david.edmondson@oracle.com> wrote:
> On Mon, Jun 02 2014, Jani Nikula wrote:
>>>> One should also have some message content heuristics to determine that the
>>>> content is indeed duplicate and not something totally different (not that
>>>> we can see the different content anyway... but...)
>>>
>>> That would be nice.
>>
>> And quite hard.
>
> Thinking about this a bit...
>
> The headers are likely to be different, so you could remove them (get
> rid of everything up to the first empty line).
>
> Various mailing lists add footers, so you would need to remove them (a
> regular expression based approach would catch most of them easily).

This may work for text/plain messages, but for MIME messages (and I
think text/html too) an extra layer of MIME structure is usually
added. The problem becomes matching a subtree of the MIME structure, and
deciding that the non-matching layer is noise that can be ignored. The
mailing list manager adding the extra layer may also decode and
reconstruct the existing parts instead of using them as-is.

> The remaining content should be the same for identical messages, so a
> sensible hash (md5) could be used to compare.
>
> Although, some MTAs modify the body of the message when manipulating
> encoding. I don't know how to address this.

Let's assume we can figure it all out and find the duplicates. The
question remains, which one to save and which ones to remove? For list
mail, perhaps you'd like to save the copy you received through the list
so you know it's list mail (and you could search for it using list-id:
header *cough* if we indexed that *cough*). Or perhaps you'd like to
save the copy you received directly because some lists let people have
their addresses filtered from cc: header before distributing.

More useful would probably be raising some flags if the heuristics
detect messages with the same message-id that are clearly *different*
messages. (Perhaps that's what Tomi was after to begin with?)

Finally, I personally wouldn't want any duplicates removed; rather I'd
like notmuch to index information across all duplicates, and provide UI
features to see the alternatives if desired.

BR,
Jani.

* Re: Deduplication ?
  2014-06-02 17:25         ` David Edmondson
  2014-06-02 18:29           ` Jani Nikula
@ 2014-06-06 10:40           ` Vladimir Marek
  2014-06-07 13:37             ` Tomi Ollila
  1 sibling, 1 reply; 13+ messages in thread
From: Vladimir Marek @ 2014-06-06 10:40 UTC (permalink / raw)
  To: David Edmondson; +Cc: Tomi Ollila, notmuch

[-- Attachment #1: Type: text/plain, Size: 1766 bytes --]

Hi,


So I wrote some code which works well for me. I have erased ~40k
messages out of 500k. It does not try to be a complete solution; it only
detects and removes the obvious cases. The idea is to help me control
the number of duplicates when I import big mail archives, which surely
contain many duplicates, into my mail database.

> Thinking about this a bit...

> The headers are likely to be different, so you could remove them (get
> rid of everything up to the first empty line).

Yes, that's what I ended up doing. And I delete the file which has
fewer 'Received:' headers.
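
A sketch of that selection rule, counting only inside the header block
(the attached script counts matching lines in the whole file). The
function names are made up; on a tie the first file is chosen, as in the
script:

```shell
# Number of Received: headers before the first empty line.
received_count() {
    sed '/^$/q' "$1" | grep -c '^Received:'
}

# Print whichever of the two files should be deleted: the one with
# fewer Received: headers.
fewer_received() {
    if [ "$(received_count "$1")" -gt "$(received_count "$2")" ]; then
        echo "$2"
    else
        echo "$1"
    fi
}
```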


> Various mailing lists add footers, so you would need to remove them (a
> regular expression based approach would catch most of them easily).

I defined a list of known footers. Then I take the two mails with the
same message-id, create a diff between them and compare it to the list
of footers.


> The remaining content should be the same for identical messages, so a
> sensible hash (md5) could be used to compare.
> 
> Although, some MTAs modify the body of the message when manipulating
> encoding. I don't know how to address this.

I'm attaching my perl script if anyone is interested. It's in no way
a complete solution. It is supposed to be used as

notmuch search --output=files --duplicate=2 '*' > dups
./dedup # It opens the file 'dups'

The attached version does not remove anything (the 'unlink' command is
commented out).


Interestingly this does not work (it seems to return all messages):
notmuch search --output=messages --duplicate=2 '*'

Also I have found that if I run 'notmuch search' and 'notmuch new' at
the same time, the notmuch search crashes sometimes. That's why I don't
use

notmuch search ... | ./dedup

Use with care :)

Thank you for your help
-- 
	Vlad

[-- Attachment #2: dedup --]
[-- Type: text/plain, Size: 4901 bytes --]

#!/usr/bin/perl

use Data::Dumper;
use List::Util;


@TO_IGNORE= (

<<'EOT'
> _______________________________________________
> notmuch mailing list
> notmuch@notmuchmail.org
> http://notmuchmail.org/mailman/listinfo/notmuch
EOT

,

<<'EOT'
> _______________________________________________
> Userland-perl mailing list
> Userland-perl@userland.us.oracle.com
> http://userland.us.oracle.com/mailman/listinfo/userland-perl
EOT

,

<<'EOT'
> _______________________________________________
> Mercurial mailing list
> Mercurial@selenic.com
> http://selenic.com/mailman/listinfo/mercurial
EOT

,

<<'EOT'
> --    
> To unsubscribe from this list go to the following URL and read the
> instructions:  https://lists.samba.org/mailman/options/samba
EOT

,

<<'EOT'
> 
EOT

);

sub rm($$) {
	my ($file, $comment) = @_;
	print "-> $file\n";
	print $comment;
	# unlink $file;
}

sub check_mail_id($) {
	$ID = $_[0];

	unless (open ID, "-|", "./notmuch", "search", "--output=files", "id:$ID") {
		warn "Can not fork: $!";
		return;
	}
	chomp(@FILES = <ID>);
	close ID;

	if (scalar @FILES <= 1) {
		warn "Not enough files for ID:$ID\n";
		return;
	}

	my ($F1, $F2) = @FILES;

	unless (-r $F1) {
		warn "Can not read $F1 in ID:$ID\n";
		return;
	}
	unless (-r $F2) {
		warn "Can not read $F2 in ID:$ID\n";
		return;
	}
	if ($F1 eq $F2) {
		warn "Same filename $F1\n in ID:$ID\n";
		return;
	}

	unless (open DIFF_WHOLE, "-|", $diff, $F1, $F2) {
		warn "Can not fork $diff: $!\n";
		return;
	}
	$DIFF_WHOLE = join "", <DIFF_WHOLE>;
	close DIFF_WHOLE;

	if ( length($DIFF_WHOLE) == 0 ) {
		rm $F2, "deleting_1\nID:$ID\n\n";
		return;
	}

	# 35a36
	# > Content-Length: 893
	if (
		$DIFF_WHOLE =~ /^\d+a\d+\n> Content-Length: \d+$/
		or
		$DIFF_WHOLE =~ /^\d+d\d+\n< Content-Length: \d+$/
	) {
		rm $F2, "deleting_2\nID:$ID\n\n";
		return;
	}



	# $r="[a-zA-Z0-9 ()[\]\.\+:/=;,\t-]+";
	# if (
	# 	$DIFF_WHOLE =~ /1,7d0\n< Received:$r\n< \t$r\n< \t$r\n< Received:$r\n< \t$r\n< \t$r\n< \t$r\n\d+a\d+,\d+\n> Content-Length:$r\n> Lines:$r/
	# ) {
	# 	printf "deleting_3\nID:$ID\n$DIFF_WHOLE\n\n";
	# 	return;
	# }

	unless (open DIFF_BODY, "-|", "bash", "-c", "$diff <(sed -e 1,/^\$/d \"\$1\" ) <(sed -e 1,/^\$/d \"\$2\" )", "", $F1, $F2) {
		warn "Can not fork $diff (2): $!\n";
		return;
	}
	$DIFF_BODY = join "", <DIFF_BODY>;
	close DIFF_BODY;

	if ( length($DIFF_BODY) == 0 ) {
		# The bodies are the same - let's find which one has less
		# Received: headers and delete that
		unless (open F, $F1) 
		{
			warn "Can't open F1 '$F1': $!";
			return;
		}
		my $count1 = grep { /^Received: / } <F>;
		close F;
		unless (open F, $F2) 
		{
			warn "Can't open F2 '$F2': $!";
			return;
		}
		my $count2 = grep { /^Received: / } <F>;
		close F;

		if ($count1 > $count2) {
			rm $F2, "deleting_4a\nID:$ID\n\n";
		} else {
			rm $F1, "deleting_4b\nID:$ID\n\n";
		}
		return;
	}


	for (@TO_IGNORE) {
		next unless $DIFF_BODY =~ $_;
		# Remove the first one as the second is adding lines
		rm $F1, "deleting_5\nID:$ID\n\n";
		return;
	}

	for (@TO_IGNORE_REVERSE) {
		next unless $DIFF_BODY =~ $_;
		# Remove the second as it is removing some lines
		rm $F2, "deleting_6\nID:$ID\n\n";
		return;
	}

	#--------------------------------------------------
	# '2c2
	# < --Boundary_(ID_DK6KMxNlhttcScVv/QSi8A)
	# ---
	# > --Boundary_(ID_dlReFj9tgdNWy+1SUxwTeQ)
	# 39c39
	# < --Boundary_(ID_DK6KMxNlhttcScVv/QSi8A)
	# ---
	# > --Boundary_(ID_dlReFj9tgdNWy+1SUxwTeQ)
	# 55c55
	# < --Boundary_(ID_DK6KMxNlhttcScVv/QSi8A)--
	# ---
	# > --Boundary_(ID_dlReFj9tgdNWy+1SUxwTeQ)--
	#-------------------------------------------------- 
	$re = qr/(\d+)c\1\n< --Boundary_\(\S+\)(?:--)?\n---\n> --Boundary_\(\S+\)(?:--)?\n/;
	if ( $DIFF_BODY =~ m/^(?:$re+)$/ ) {
		# Change in boundary strings
		rm $F2, "deleting_7\nID:$ID\n\n";
		return;
	}

	print "DIFF_BODY (ID: $ID):\n'$DIFF_BODY'\n\n" if length $DIFF_BODY < 300;
}

$diff = 'diff';
$diff = 'gdiff' if -x '/usr/bin/gdiff'; # Solaris

# First create reverse regexps (removing lines from the mail) so that we don't
# overwrite the original @TO_IGNORE
@TO_IGNORE_REVERSE = map {
	$x = $_;                       # Make sure we don't change the @TO_IGNORE array
	$x =~ s/^>/</mg;               # Turn added lines ('>') into removed lines ('<')
	qr/^(?:\d+,)?\d+d\d+\n\Q$x\E$/ # 1,2d3 or 2d3
} @TO_IGNORE;

# Now map the positive regexp (adding lines to the mail)
@TO_IGNORE = map {
	s/^</>/mg;                      # Make sure all the lines are adding text ('>')
	qr/^\d+a\d+?(?:,\d+)?\n\Q$_\E$/ # 115a116,119 or 114a116
} @TO_IGNORE;

# File 'dups' is created via
# notmuch search --output=files --duplicate=2 '*' > dups

open INPUT, "dups" or die "Can't open dups: $!\n";
while (<INPUT>) {
	chomp;
	if (open FILE, $_) {
		$id = List::Util::first { s/^message-id:.*<(.*)>\n$/$1/i } <FILE>;
		close FILE;
		check_mail_id $id if defined $id;
	} else {
		print "Can't find '$_'\n";
	}
}
close INPUT;

* Re: Deduplication ?
  2014-06-06 10:40           ` Vladimir Marek
@ 2014-06-07 13:37             ` Tomi Ollila
  0 siblings, 0 replies; 13+ messages in thread
From: Tomi Ollila @ 2014-06-07 13:37 UTC (permalink / raw)
  To: Vladimir Marek; +Cc: notmuch

On Fri, Jun 06 2014, Vladimir Marek <Vladimir.Marek@oracle.com> wrote:

> Hi,
>

 // stuff deleted //

>
> I'm attaching my perl script if anyone is interested. It's in no way
> complete solution. It is supposed to be used as
>
> notmuch search --output=files --duplicate=2 '*' > dups
> ./dedup # It opens the file 'dups'
>
> The attached version does not remove anyting (the 'unlink' command is
> commented out).
>
>
> Interestingly this does not work (it seems to return all messages):
> notmuch search --output=messages --duplicate=2 '*'
>
> Also I have found that if I run 'notmuch search' and 'notmuch new' at
> the same time, the notmuch search crashes sometimes. That's why I don't
> use
>
> notmuch search ... | ./dedup
>
> Use with care :)

To me, any perl code that lacks use strict; use warnings; looks like a BIG
footgun ;/

>
> Thank you for your help
> -- 
> 	Vlad


Tomi

> #!/usr/bin/perl
>
> use Data::Dumper;
> use List::Util;
>
>
> @TO_IGNORE= (
>
