* Deduplication ? @ 2014-06-02 12:32 Vladimir Marek 2014-06-02 13:43 ` David Edmondson 2014-06-02 13:51 ` Mark Walters 0 siblings, 2 replies; 13+ messages in thread From: Vladimir Marek @ 2014-06-02 12:32 UTC (permalink / raw) To: notmuch Hi, I want to import a bigger chunk of archived messages into my notmuch database. It's about 100k messages. The problem is that I most probably already have quite a lot of those messages in the DB. Basically I would like to add only those I don't have already. There are two possibilities: a) I add all 100k messages and then remove the duplicates. b) I write a script which parses the Message-IDs of the to-be-added messages and tries to match them against the notmuch DB, adding only the files it can't find already. Option b) might be the better one, but I started to play with the idea of deduplication. I'm thinking about listing all the message IDs stored in the DB, listing all files belonging to each ID, and deleting all but one. I'm also thinking about implementing some simple algorithm to tell me whether the messages really are very similar, just to be sure I don't delete something I don't want to. Has anyone played with this idea? -- Vlad ^ permalink raw reply [flat|nested] 13+ messages in thread
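For what it's worth, option b) can be sketched as a small shell script. This is a hedged sketch, not tested against a real maildir: the archive/ and ~/mail/imported/ paths are placeholders, and folded (multi-line) Message-ID headers are not handled.

```shell
# Sketch of option b): copy only files whose Message-ID notmuch does not
# already know. Paths are placeholders for illustration.
extract_id() {
    # Print the Message-ID (without angle brackets) from the header block;
    # stop reading at the first blank line so the body is never scanned.
    sed -n -e 's/^[Mm]essage-[Ii][Dd]:[[:space:]]*<\(.*\)>.*/\1/p' -e '/^$/q' "$1"
}

import_new() {
    for f in archive/*; do
        id=$(extract_id "$f") || continue
        [ -n "$id" ] || continue                    # skip files without an ID
        if [ "$(notmuch count "id:$id")" -eq 0 ]; then
            cp "$f" "$HOME/mail/imported/"          # new to the DB: import it
        fi
    done
}
# import_new && notmuch new
```

The per-file `notmuch count` calls make this slow for 100k files, but it never touches messages the database already has.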
* Re: Deduplication ? 2014-06-02 12:32 Deduplication ? Vladimir Marek @ 2014-06-02 13:43 ` David Edmondson 2014-06-02 13:54 ` Vladimir Marek 2014-06-02 13:51 ` Mark Walters 1 sibling, 1 reply; 13+ messages in thread From: David Edmondson @ 2014-06-02 13:43 UTC (permalink / raw) To: Vladimir Marek, notmuch On Mon, Jun 02 2014, Vladimir Marek wrote: > Hi, > > I want to import bigger chunk of archived messages into my notmuch > database. It's about 100k messages. The problem is, that I most probably > have quite a lot of those messages in the DB. Basically I would like to > add only those I don't have already. > > There are two possibilities > > a) I will add all the 100k messages and then remove the duplicities. > > b) I will write a script which will parse the message ID's of the > to-be-added messages and try to match them to the notmuch DB. Adding > only files I can't find already. > > Ad b) might be better option, but I started to play with the idea of > deduplication. I'm thinking about listing all the message IDs stored in > DB, listing all files belonging to the IDs and deleting all but one. > Also I'm thinking about implementing some simple algorithm telling me > whether the messages are really very similar. Just to be sure I don't > delete something I don't want to. > > Was anyone playing with the idea? notsync[1] used the (lack of) existence of a message id in the store to decide whether to add something from an IMAP server, but it is old, crufty, unused and unloved code. > -- > Vlad > _______________________________________________ > notmuch mailing list > notmuch@notmuchmail.org > http://notmuchmail.org/mailman/listinfo/notmuch Footnotes: [1] https://github.com/dme/notsync ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Deduplication ? 2014-06-02 13:43 ` David Edmondson @ 2014-06-02 13:54 ` Vladimir Marek 2014-06-02 14:10 ` Mark Walters 0 siblings, 1 reply; 13+ messages in thread From: Vladimir Marek @ 2014-06-02 13:54 UTC (permalink / raw) To: David Edmondson; +Cc: notmuch > > I want to import bigger chunk of archived messages into my notmuch > > database. It's about 100k messages. The problem is, that I most probably > > have quite a lot of those messages in the DB. Basically I would like to > > add only those I don't have already. > > > > There are two possibilities > > > > a) I will add all the 100k messages and then remove the duplicities. > > > > b) I will write a script which will parse the message ID's of the > > to-be-added messages and try to match them to the notmuch DB. Adding > > only files I can't find already. > > > > Ad b) might be better option, but I started to play with the idea of > > deduplication. I'm thinking about listing all the message IDs stored in > > DB, listing all files belonging to the IDs and deleting all but one. > > Also I'm thinking about implementing some simple algorithm telling me > > whether the messages are really very similar. Just to be sure I don't > > delete something I don't want to. > > > > Was anyone playing with the idea? > > notsync[1] used the (lack of) existence of a message id in the store to > decide whether to add something from an IMAP server, but it is old, > crufty, unused and unloved code. I see, that's close to my b) solution, thanks! -- Vlad ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Deduplication ? 2014-06-02 13:54 ` Vladimir Marek @ 2014-06-02 14:10 ` Mark Walters 2014-06-02 14:15 ` Mark Walters 0 siblings, 1 reply; 13+ messages in thread From: Mark Walters @ 2014-06-02 14:10 UTC (permalink / raw) To: Vladimir Marek, David Edmondson; +Cc: notmuch Vladimir Marek <Vladimir.Marek@oracle.com> writes: >> > I want to import bigger chunk of archived messages into my notmuch >> > database. It's about 100k messages. The problem is, that I most probably >> > have quite a lot of those messages in the DB. Basically I would like to >> > add only those I don't have already. >> > >> > There are two possibilities >> > >> > a) I will add all the 100k messages and then remove the duplicities. >> > >> > b) I will write a script which will parse the message ID's of the >> > to-be-added messages and try to match them to the notmuch DB. Adding >> > only files I can't find already. >> > >> > Ad b) might be better option, but I started to play with the idea of >> > deduplication. I'm thinking about listing all the message IDs stored in >> > DB, listing all files belonging to the IDs and deleting all but one. >> > Also I'm thinking about implementing some simple algorithm telling me >> > whether the messages are really very similar. Just to be sure I don't >> > delete something I don't want to. >> > >> > Was anyone playing with the idea? >> >> notsync[1] used the (lack of) existence of a message id in the store to >> decide whether to add something from an IMAP server, but it is old, >> crufty, unused and unloved code. > > I see, that's close to my b) solution, thanks! Did you mean a) here? The idea was to add them all first and then run this script to delete the duplicates. Best wishes Mark > -- > Vlad > _______________________________________________ > notmuch mailing list > notmuch@notmuchmail.org > http://notmuchmail.org/mailman/listinfo/notmuch ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Deduplication ? 2014-06-02 14:10 ` Mark Walters @ 2014-06-02 14:15 ` Mark Walters 0 siblings, 0 replies; 13+ messages in thread From: Mark Walters @ 2014-06-02 14:15 UTC (permalink / raw) To: Vladimir Marek, David Edmondson; +Cc: notmuch Mark Walters <markwalters1009@gmail.com> writes: > Vladimir Marek <Vladimir.Marek@oracle.com> writes: > >>> > I want to import bigger chunk of archived messages into my notmuch >>> > database. It's about 100k messages. The problem is, that I most probably >>> > have quite a lot of those messages in the DB. Basically I would like to >>> > add only those I don't have already. >>> > >>> > There are two possibilities >>> > >>> > a) I will add all the 100k messages and then remove the duplicities. >>> > >>> > b) I will write a script which will parse the message ID's of the >>> > to-be-added messages and try to match them to the notmuch DB. Adding >>> > only files I can't find already. >>> > >>> > Ad b) might be better option, but I started to play with the idea of >>> > deduplication. I'm thinking about listing all the message IDs stored in >>> > DB, listing all files belonging to the IDs and deleting all but one. >>> > Also I'm thinking about implementing some simple algorithm telling me >>> > whether the messages are really very similar. Just to be sure I don't >>> > delete something I don't want to. >>> > >>> > Was anyone playing with the idea? >>> >>> notsync[1] used the (lack of) existence of a message id in the store to >>> decide whether to add something from an IMAP server, but it is old, >>> crufty, unused and unloved code. >> >> I see, that's close to my b) solution, thanks! > > Did you mean a) here? The idea was to add them all first and then run > this script to delete the duplicates. > Sorry: out of order arrival times and lack of care on my part. Sorry! 
MW > Best wishes > > Mark > >> -- >> Vlad >> _______________________________________________ >> notmuch mailing list >> notmuch@notmuchmail.org >> http://notmuchmail.org/mailman/listinfo/notmuch ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Deduplication ? 2014-06-02 12:32 Deduplication ? Vladimir Marek 2014-06-02 13:43 ` David Edmondson @ 2014-06-02 13:51 ` Mark Walters 2014-06-02 14:17 ` Tomi Ollila 1 sibling, 1 reply; 13+ messages in thread From: Mark Walters @ 2014-06-02 13:51 UTC (permalink / raw) To: Vladimir Marek, notmuch Vladimir Marek <Vladimir.Marek@oracle.com> writes: > Hi, > > I want to import bigger chunk of archived messages into my notmuch > database. It's about 100k messages. The problem is, that I most probably > have quite a lot of those messages in the DB. Basically I would like to > add only those I don't have already. > > There are two possibilities > > a) I will add all the 100k messages and then remove the duplicities. > > b) I will write a script which will parse the message ID's of the > to-be-added messages and try to match them to the notmuch DB. Adding > only files I can't find already. > > Ad b) might be better option, but I started to play with the idea of > deduplication. I'm thinking about listing all the message IDs stored in > DB, listing all files belonging to the IDs and deleting all but one. > Also I'm thinking about implementing some simple algorithm telling me > whether the messages are really very similar. Just to be sure I don't > delete something I don't want to. > > Was anyone playing with the idea? I am not sure what your use case is but notmuch automatically deduplicates: that is if the message-id is one it has already seen no further indexing takes place. The only thing that happens is the new filename gets added to the list of filenames for the message. Thus importing should be almost as fast as if the message were not there, and the database should be almost identical to what you would get if you only imported the genuine new messages. If you want to save disk space then you could delete the duplicates after with something like notmuch search --output=files --format=text0 --duplicate=2 '*' piped to xargs -0 (but please test it carefully first!) 
I would think something like this is better than trying to parse the message-ids yourself. Best wishes Mark > > -- > Vlad > _______________________________________________ > notmuch mailing list > notmuch@notmuchmail.org > http://notmuchmail.org/mailman/listinfo/notmuch ^ permalink raw reply [flat|nested] 13+ messages in thread
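A slightly more careful wrapping of Mark's command might look like the sketch below. The DELETE switch is a convention of this sketch, not a notmuch option, and `xargs -r` is a GNU extension.

```shell
# Dry run by default: print the would-be victims. Only delete when
# DELETE=1 is set explicitly.
dedup_pass() {
    if [ "${DELETE:-0}" = 1 ]; then
        notmuch search --output=files --format=text0 --duplicate=2 '*' |
            xargs -0 -r rm --   # -r (skip empty input) is a GNU extension
        notmuch new             # let the database forget the removed filenames
    else
        notmuch search --output=files --format=text0 --duplicate=2 '*' |
            tr '\0' '\n'        # dry run: one path per line, for inspection
    fi
}

# dedup_pass             # inspect the list first
# DELETE=1 dedup_pass    # then actually delete
```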
* Re: Deduplication ? 2014-06-02 13:51 ` Mark Walters @ 2014-06-02 14:17 ` Tomi Ollila 2014-06-02 14:26 ` Mark Walters 0 siblings, 1 reply; 13+ messages in thread From: Tomi Ollila @ 2014-06-02 14:17 UTC (permalink / raw) To: Mark Walters, Vladimir Marek, notmuch On Mon, Jun 02 2014, Mark Walters <markwalters1009@gmail.com> wrote: > Vladimir Marek <Vladimir.Marek@oracle.com> writes: > >> Hi, >> >> I want to import bigger chunk of archived messages into my notmuch >> database. It's about 100k messages. The problem is, that I most probably >> have quite a lot of those messages in the DB. Basically I would like to >> add only those I don't have already. >> >> There are two possibilities >> >> a) I will add all the 100k messages and then remove the duplicities. >> >> b) I will write a script which will parse the message ID's of the >> to-be-added messages and try to match them to the notmuch DB. Adding >> only files I can't find already. >> >> Ad b) might be better option, but I started to play with the idea of >> deduplication. I'm thinking about listing all the message IDs stored in >> DB, listing all files belonging to the IDs and deleting all but one. >> Also I'm thinking about implementing some simple algorithm telling me >> whether the messages are really very similar. Just to be sure I don't >> delete something I don't want to. >> >> Was anyone playing with the idea? > > I am not sure what your use case is but notmuch automatically > deduplicates: that is if the message-id is one it has already seen no > further indexing takes place. The only thing that happens is the new > filename gets added to the list of filenames for the message. > > Thus importing should be almost as fast as if the message were not > there, and the database should be almost identical to what you would get > if you only imported the genuine new messages. 
> > If you want to save disk space then you could delete the duplicates > after with something like > > notmuch search --output=files --format=text0 --duplicate=2 '*' piped to > xargs -0 What if there are 3 duplicates (or 4... ;) > > (but please test it carefully first!) One should also have some message content heuristics to determine that the content is indeed duplicate and not something totally different (not that we can see the different content anyway... but...) > > I would think something like this is better than trying to parse the > message-ids yourself. > > Best wishes > > Mark > Tomi > >> >> -- >> Vlad ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Deduplication ? 2014-06-02 14:17 ` Tomi Ollila @ 2014-06-02 14:26 ` Mark Walters 2014-06-02 17:06 ` Jani Nikula 0 siblings, 1 reply; 13+ messages in thread From: Mark Walters @ 2014-06-02 14:26 UTC (permalink / raw) To: Tomi Ollila, Vladimir Marek, notmuch Tomi Ollila <tomi.ollila@iki.fi> writes: > On Mon, Jun 02 2014, Mark Walters <markwalters1009@gmail.com> wrote: > >> Vladimir Marek <Vladimir.Marek@oracle.com> writes: >> >>> Hi, >>> >>> I want to import bigger chunk of archived messages into my notmuch >>> database. It's about 100k messages. The problem is, that I most probably >>> have quite a lot of those messages in the DB. Basically I would like to >>> add only those I don't have already. >>> >>> There are two possibilities >>> >>> a) I will add all the 100k messages and then remove the duplicities. >>> >>> b) I will write a script which will parse the message ID's of the >>> to-be-added messages and try to match them to the notmuch DB. Adding >>> only files I can't find already. >>> >>> Ad b) might be better option, but I started to play with the idea of >>> deduplication. I'm thinking about listing all the message IDs stored in >>> DB, listing all files belonging to the IDs and deleting all but one. >>> Also I'm thinking about implementing some simple algorithm telling me >>> whether the messages are really very similar. Just to be sure I don't >>> delete something I don't want to. >>> >>> Was anyone playing with the idea? >> >> I am not sure what your use case is but notmuch automatically >> deduplicates: that is if the message-id is one it has already seen no >> further indexing takes place. The only thing that happens is the new >> filename gets added to the list of filenames for the message. >> >> Thus importing should be almost as fast as if the message were not >> there, and the database should be almost identical to what you would get >> if you only imported the genuine new messages. 
>> >> If you want to save disk space then you could delete the duplicates >> after with something like >> >> notmuch search --output=files --format=text0 --duplicate=2 '*' piped to >> xargs -0 > > What if there are 3 duplicates (or 4... ;) I was assuming that it was merging 2 duplicate-free bunches of messages, but I guess the new 100000 might not be. In that case running the above repeatedly (ie until it is a no-op) would be fine. > >> >> (but please test it carefully first!) > > One should also have some message content heuristics to determine that the > content is indeed duplicate and not something totally different (not that > we can see the different content anyway... but...) That would be nice. Best wishes Mark >> >> I would think something like this is better than trying to parse the >> message-ids yourself. > > >> >> Best wishes >> >> Mark >> > > Tomi > > >> >>> >>> -- >>> Vlad ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Deduplication ? 2014-06-02 14:26 ` Mark Walters @ 2014-06-02 17:06 ` Jani Nikula 2014-06-02 17:25 ` David Edmondson 0 siblings, 1 reply; 13+ messages in thread From: Jani Nikula @ 2014-06-02 17:06 UTC (permalink / raw) To: Mark Walters, Tomi Ollila, Vladimir Marek, notmuch On Mon, 02 Jun 2014, Mark Walters <markwalters1009@gmail.com> wrote: > Tomi Ollila <tomi.ollila@iki.fi> writes: > >> On Mon, Jun 02 2014, Mark Walters <markwalters1009@gmail.com> wrote: >> >>> Vladimir Marek <Vladimir.Marek@oracle.com> writes: >>> If you want to save disk space then you could delete the duplicates >>> after with something like >>> >>> notmuch search --output=files --format=text0 --duplicate=2 '*' piped to >>> xargs -0 >> >> What if there are 3 duplicates (or 4... ;) > > I was assuming that it was merging 2 duplicate-free bunches of messages, > but I guess the new 100000 might not be. In that case running the above > repeatedly (ie until it is a no-op) would be fine. With 'notmuch new' in between the runs, obviously. Alternatively, find the biggest --duplicate=N which still outputs something, and run the command for each N...2. >> One should also have some message content heuristics to determine that the >> content is indeed duplicate and not something totally different (not that >> we can see the different content anyway... but...) > > That would be nice. And quite hard. BR, Jani. ^ permalink raw reply [flat|nested] 13+ messages in thread
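Jani's countdown could be sketched roughly as follows; N_MAX is a guessed upper bound on how many copies any one message can have, since notmuch does not report the maximum directly.

```shell
# One deletion pass per remaining copy, from the highest --duplicate=N
# down to 2, running 'notmuch new' between passes so the database drops
# the removed filenames before the next pass.
dedup_all_passes() {
    n=${N_MAX:-10}   # assumed upper bound on copies per message
    while [ "$n" -ge 2 ]; do
        notmuch search --output=files --format=text0 --duplicate="$n" '*' |
            xargs -0 -r rm --   # -r (skip empty input) is a GNU extension
        notmuch new             # re-index so the next pass sees fresh counts
        n=$((n - 1))
    done
}
# dedup_all_passes
```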
* Re: Deduplication ? 2014-06-02 17:06 ` Jani Nikula @ 2014-06-02 17:25 ` David Edmondson 2014-06-02 18:29 ` Jani Nikula 2014-06-06 10:40 ` Vladimir Marek 0 siblings, 2 replies; 13+ messages in thread From: David Edmondson @ 2014-06-02 17:25 UTC (permalink / raw) To: Jani Nikula, Mark Walters, Tomi Ollila, Vladimir Marek, notmuch On Mon, Jun 02 2014, Jani Nikula wrote: >>> One should also have some message content heuristics to determine that the >>> content is indeed duplicate and not something totally different (not that >>> we can see the different content anyway... but...) >> >> That would be nice. > > And quite hard. Thinking about this a bit... The headers are likely to be different, so you could remove them (get rid of everything up to the first empty line). Various mailing lists add footers, so you would need to remove them (a regular expression based approach would catch most of them easily). The remaining content should be the same for identical messages, so a sensible hash (md5) could be used to compare. Although, some MTAs modify the body of the message when manipulating encoding. I don't know how to address this. ^ permalink raw reply [flat|nested] 13+ messages in thread
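That heuristic can be sketched in a few lines of shell. The footer pattern below only covers Mailman-style footers introduced by a long line of underscores, and md5sum is assumed to be available; treat equal hashes as a hint, not proof.

```shell
# Hash a message body only: drop everything up to the first blank line
# (the headers), cut off a trailing Mailman-style footer (a run of 20+
# underscores through end of file), then hash what remains.
body_hash() {
    sed -e '1,/^$/d' "$1" |
        sed -e '/^_\{20,\}$/,$d' |
        md5sum | cut -d' ' -f1
}

# [ "$(body_hash copy1)" = "$(body_hash copy2)" ] && echo "probably duplicates"
```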
* Re: Deduplication ? 2014-06-02 17:25 ` David Edmondson @ 2014-06-02 18:29 ` Jani Nikula 2014-06-06 10:40 ` Vladimir Marek 1 sibling, 0 replies; 13+ messages in thread From: Jani Nikula @ 2014-06-02 18:29 UTC (permalink / raw) To: David Edmondson, Mark Walters, Tomi Ollila, Vladimir Marek, notmuch On Mon, 02 Jun 2014, David Edmondson <david.edmondson@oracle.com> wrote: > On Mon, Jun 02 2014, Jani Nikula wrote: >>>> One should also have some message content heuristics to determine that the >>>> content is indeed duplicate and not something totally different (not that >>>> we can see the different content anyway... but...) >>> >>> That would be nice. >> >> And quite hard. > > Thinking about this a bit... > > The headers are likely to be different, so you could remove them (get > rid of everything up to the first empty line). > > Various mailing lists add footers, so you would need to remove them (a > regular expression based approach would catch most of them easily). This may work for text/plain messages, but for mime messages (and I think text/html too) an extra layer of mime structure is usually added. The problem becomes matching a subtree of mime structure, and deciding the non-matching layer is noise that can be ignored. The mailing list manager adding the extra layer may also decode and reconstruct the existing parts instead of using them as-is. > The remaining content should be the same for identical messages, so a > sensible hash (md5) could be used to compare. > > Although, some MTAs modify the body of the message when manipulating > encoding. I don't know how to address this. Let's assume we can figure it all out and find the duplicates. The question remains, which one to save and which ones to remove? For list mail, perhaps you'd like to save the copy you received through the list so you know it's list mail (and you could search for it using list-id: header *cough* if we indexed that *cough*). 
Or perhaps you'd like to save the copy you received directly because some lists let people have their addresses filtered from cc: header before distributing. More useful would probably be raising some flags if the heuristics detect messages with the same message-id that are clearly *different* messages. (Perhaps that's what Tomi was after to begin with?) Finally, I personally wouldn't want any duplicates removed; rather I'd like notmuch to index information across all duplicates, and provide UI features to see the alternatives if desired. BR, Jani. ^ permalink raw reply [flat|nested] 13+ messages in thread
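The flag-raising idea might look like the following sketch; flag_divergent_copies is a made-up name, and the header stripping is deliberately crude (everything up to the first blank line), so a warning means "look by hand", not proof of a clash.

```shell
# For one message-id, hash the body of every file notmuch knows for it
# and warn when the copies disagree, instead of silently treating them
# as duplicates of each other.
flag_divergent_copies() {
    id=$1
    distinct=$(notmuch search --output=files "id:$id" |
        while IFS= read -r f; do
            sed -e '1,/^$/d' "$f" | md5sum | cut -d' ' -f1
        done | sort -u | wc -l)
    if [ "$distinct" -gt 1 ]; then
        echo "WARNING: id:$id has $distinct distinct bodies" >&2
    fi
}
```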
* Re: Deduplication ? 2014-06-02 17:25 ` David Edmondson 2014-06-02 18:29 ` Jani Nikula @ 2014-06-06 10:40 ` Vladimir Marek 2014-06-07 13:37 ` Tomi Ollila 1 sibling, 1 reply; 13+ messages in thread From: Vladimir Marek @ 2014-06-06 10:40 UTC (permalink / raw) To: David Edmondson; +Cc: Tomi Ollila, notmuch [-- Attachment #1: Type: text/plain, Size: 1766 bytes --] Hi, So I wrote some code which works well for me. I have erased ~40k messages out of 500k. It does not try to be a complete solution; it only detects and removes the obvious cases. The idea is to help me control the number of duplicates when I import big mail archives which surely contain many duplicates into my mail database. > Thinking about this a bit... > The headers are likely to be different, so you could remove them (get > rid of everything up to the first empty line). Yes, that's what I ended up doing. And I delete the files which have fewer 'Received:' headers. > Various mailing lists add footers, so you would need to remove them (a > regular expression based approach would catch most of them easily). I defined a list of known footers. Then I take the two mails with the same message-id, create a diff between them and compare it to the list of footers. > The remaining content should be the same for identical messages, so a > sensible hash (md5) could be used to compare. > > Although, some MTAs modify the body of the message when manipulating > encoding. I don't know how to address this. I'm attaching my perl script if anyone is interested. It's in no way a complete solution. It is supposed to be used as notmuch search --output=files --duplicate=2 '*' > dups ./dedup # It opens the file 'dups' The attached version does not remove anything (the 'unlink' command is commented out). 
Interestingly this does not work (it seems to return all messages): notmuch search --output=messages --duplicate=2 '*' Also I have found that if I run 'notmuch search' and 'notmuch new' at the same time, the notmuch search crashes sometimes. That's why I don't use notmuch search ... | ./dedup Use with care :) Thank you for your help -- Vlad [-- Attachment #2: dedup --] [-- Type: text/plain, Size: 4901 bytes --] #!/usr/bin/perl use Data::Dumper; use List::Util; @TO_IGNORE= ( <<'EOT' > _______________________________________________ > notmuch mailing list > notmuch@notmuchmail.org > http://notmuchmail.org/mailman/listinfo/notmuch EOT , <<'EOT' > _______________________________________________ > Userland-perl mailing list > Userland-perl@userland.us.oracle.com > http://userland.us.oracle.com/mailman/listinfo/userland-perl EOT , <<'EOT' > _______________________________________________ > Mercurial mailing list > Mercurial@selenic.com > http://selenic.com/mailman/listinfo/mercurial EOT , <<'EOT' > -- > To unsubscribe from this list go to the following URL and read the > instructions: https://lists.samba.org/mailman/options/samba EOT , <<'EOT' > EOT ); sub rm($$) { my ($file, $comment) = @_; print "-> $file\n"; print $comment; # unlink $file; } sub check_mail_id($) { $ID = $_[0]; unless (open ID, "-|", "./notmuch", "search", "--output=files", "id:$ID") { warn "Can not fork: $!"; return; } chomp(@FILES = <ID>); close ID; if (scalar @FILES <= 1) { warn "Not enough files for ID:$ID\n"; return; } my ($F1, $F2) = @FILES; unless (-r $F1) { warn "Can not read $F1 in ID:$ID\n"; return; } unless (-r $F2) { warn "Can not read $F2 in ID:$ID\n"; return; } if ($F1 eq $F2) { warn "Same filename $F1\n in ID:$ID\n"; return; } unless (open DIFF_WHOLE, "-|", $diff, $F1, $F2) { warn "Can not fork $diff: $!\n"; return; } $DIFF_WHOLE = join "", <DIFF_WHOLE>; close DIFF_WHOLE; if ( length($DIFF_WHOLE) == 0 ) { rm $F2, "deleting_1\nID:$ID\n\n"; return; } # 35a36 # > Content-Length: 893 if ( 
$DIFF_WHOLE =~ /^\d+a\d+\n> Content-Length: \d+$/ or $DIFF_WHOLE =~ /^\d+d\d+\n< Content-Length: \d+$/ ) { rm $F2, "deleting_2\nID:$ID\n\n"; return; } # $r="[a-zA-Z0-9 ()[\]\.\+:/=;,\t-]+"; # if ( # $DIFF_WHOLE =~ /1,7d0\n< Received:$r\n< \t$r\n< \t$r\n< Received:$r\n< \t$r\n< \t$r\n< \t$r\n\d+a\d+,\d+\n> Content-Length:$r\n> Lines:$r/ # ) { # printf "deleting_3\nID:$ID\n$DIFF_WHOLE\n\n"; # return; # } unless (open DIFF_BODY, "-|", "bash", "-c", "$diff <(sed -e 1,/^\$/d \"\$1\" ) <(sed -e 1,/^\$/d \"\$2\" )", "", $F1, $F2) { warn "Can not fork $diff (2): $!\n"; return; } $DIFF_BODY = join "", <DIFF_BODY>; close DIFF_BODY; if ( length($DIFF_BODY) == 0 ) { # The bodies are the same - let's find which one has less # Received: headers and delete that unless (open F, $F1) { warn "Can't open F1 '$F1': $!"; return; } my $count1 = grep { /^Received: / } <F>; close F; unless (open F, $F2) { warn "Can't open F2 '$F2': $!"; return; } my $count2 = grep { /^Received: / } <F>; close F; if ($count1 > $count2) { rm $F2, "deleting_4a\nID:$ID\n\n"; } else { rm $F1, "deleting_4b\nID:$ID\n\n"; } return; } for (@TO_IGNORE) { next unless $DIFF_BODY =~ $_; # Remove the first one as the second is adding lines rm $F1, "deleting_5\nID:$ID\n\n"; return; } for (@TO_IGNORE_REVERSE) { next unless $DIFF_BODY =~ $_; # Remove the second as it is removing some lines rm $F2, "deleting_6\nID:$ID\n\n"; return; } #-------------------------------------------------- # '2c2 # < --Boundary_(ID_DK6KMxNlhttcScVv/QSi8A) # --- # > --Boundary_(ID_dlReFj9tgdNWy+1SUxwTeQ) # 39c39 # < --Boundary_(ID_DK6KMxNlhttcScVv/QSi8A) # --- # > --Boundary_(ID_dlReFj9tgdNWy+1SUxwTeQ) # 55c55 # < --Boundary_(ID_DK6KMxNlhttcScVv/QSi8A)-- # --- # > --Boundary_(ID_dlReFj9tgdNWy+1SUxwTeQ)-- #-------------------------------------------------- $re = qr/(\d+)c\1\n< --Boundary_\(\S+\)(?:--)?\n---\n> --Boundary_\(\S+\)(?:--)?\n/; if ( $DIFF_BODY =~ m/^(?:$re+)$/ ) { # Change in boundary strings rm $F2, "deleting_7\nID:$ID\n\n"; return; 
} print "DIFF_BODY (ID: $ID):\n'$DIFF_BODY'\n\n" if length $DIFF_BODY < 300; } $diff = 'diff'; $diff = 'gdiff' if -x '/usr/bin/gdiff'; # Solaris # First create reverse regexps (removing lines from the mail) so that we don't # overwrite the original @TO_IGNORE @TO_IGNORE_REVERSE = map { $x = $_; # Make sure we don't change the @TO_IGNORE array $x =~ s/^>/</mg; # Make sure all the lines are adding a text qr/^(?:\d+,)?\d+d\d+\n\Q$x\E$/ # 1,2d3 or 2d3 } @TO_IGNORE; # Now map the positive regexp (adding lines to the mail) @TO_IGNORE = map { s/^</>/mg; # Make sure all the lines are removing text qr/^\d+a\d+?(?:,\d+)?\n\Q$_\E$/ # 115a116,119 or 114a116 } @TO_IGNORE; # File 'dups' is created via # notmuch search --output=files --duplicate=2 '*' > dups open INPUT, "dups" or die "Can't open dups: $!\n"; while (<INPUT>) { chomp; if (open FILE, $_) { $id = List::Util::first { s/^message-id:.*<(.*)>\n$/\1/i } <FILE>; close FILE; check_mail_id $id if defined $id; } else { print "Can't find '$_\n'"; } } close INPUT; ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Deduplication ? 2014-06-06 10:40 ` Vladimir Marek @ 2014-06-07 13:37 ` Tomi Ollila 0 siblings, 0 replies; 13+ messages in thread From: Tomi Ollila @ 2014-06-07 13:37 UTC (permalink / raw) To: Vladimir Marek; +Cc: notmuch On Fri, Jun 06 2014, Vladimir Marek <Vladimir.Marek@oracle.com> wrote: > Hi, > // stuff deleted // > > I'm attaching my perl script if anyone is interested. It's in no way > a complete solution. It is supposed to be used as > > notmuch search --output=files --duplicate=2 '*' > dups > ./dedup # It opens the file 'dups' > > The attached version does not remove anything (the 'unlink' command is > commented out). > > > Interestingly this does not work (it seems to return all messages): > notmuch search --output=messages --duplicate=2 '*' > > Also I have found that if I run 'notmuch search' and 'notmuch new' at > the same time, the notmuch search crashes sometimes. That's why I don't > use > > notmuch search ... | ./dedup > > Use with care :) To me, any perl code that lacks use strict; use warnings; looks like a BIG footgun ;/ > > Thank you for your help > -- > Vlad Tomi > #!/usr/bin/perl > > use Data::Dumper; > use List::Util; > > > @TO_IGNORE= ( > ^ permalink raw reply [flat|nested] 13+ messages in thread