Hi,

So I wrote some code which works well for me. I have erased ~40k
messages out of 500k. It does not try to be a complete solution; it
only detects and removes the obvious cases. The idea is to help me
keep the number of duplicates under control when I import big mail
archives, which are sure to contain many duplicates, into my mail
database.

> Thinking about this a bit...
> The headers are likely to be different, so you could remove them (get
> rid of everything up to the first empty line).

Yes, that's what I ended up doing. And I delete the files which have
fewer 'Received:' headers.

> Various mailing lists add footers, so you would need to remove them (a
> regular expression based approach would catch most of them easily).

I defined a list of known footers. Then I take the two mails with the
same message-id, create a diff between them and compare it to the list
of footers.

> The remaining content should be the same for identical messages, so a
> sensible hash (md5) could be used to compare.
>
> Although, some MTAs modify the body of the message when manipulating
> encoding. I don't know how to address this.

I'm attaching my perl script if anyone is interested. It is in no way
a complete solution. It is supposed to be used as

    notmuch search --output=files --duplicate=2 '*' > dups
    ./dedup    # It opens the file 'dups'

The attached version does not remove anything (the 'unlink' command is
commented out).

Interestingly, this does not work (it seems to return all messages):

    notmuch search --output=messages --duplicate=2 '*'

Also I have found that if I run 'notmuch search' and 'notmuch new' at
the same time, notmuch search sometimes crashes. That's why I don't
use

    notmuch search ... | ./dedup

Use with care :)

Thank you for your help

-- 
Vlad
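
P.S. For anyone who does not want to dig through the attachment, here
is a minimal sketch of the idea (this is not the attached script; the
helper names, the "would remove" message, and the naive shell quoting
of the message-id are mine): strip everything up to the first empty
line, md5 the body, and when two copies hash the same, keep the one
with more 'Received:' headers.

    #!/usr/bin/env perl
    # Sketch only: read 'dups' (one file path per line, produced by
    #   notmuch search --output=files --duplicate=2 '*' > dups),
    # look up every copy of each message, and compare body hashes.
    use strict;
    use warnings;
    use Digest::MD5 qw(md5_hex);

    # Split a mail file into headers and body at the first empty line.
    sub read_mail {
        my ($path) = @_;
        open my $fh, '<', $path or return;
        local $/;                            # slurp the whole file
        my $mail = <$fh>;
        close $fh;
        my ($hdr, $body) = split /\r?\n\r?\n/, $mail, 2;
        return ($hdr, $body // '');
    }

    # Count 'Received:' headers; more hops = the copy worth keeping.
    sub received_count {
        my ($hdr) = @_;
        my @r = $hdr =~ /^Received:/mig;
        return scalar @r;
    }

    open my $dups, '<', 'dups' or die "cannot open dups: $!";
    while (my $path = <$dups>) {
        chomp $path;
        my ($hdr, undef) = read_mail($path) or next;
        my ($msgid) = $hdr =~ /^Message-ID:\s*<([^>]+)>/mi or next;

        # Ask notmuch for all copies of this message.  Note the
        # quoting here is naive; ids with shell metacharacters will
        # need something more careful.
        chomp(my @copies = qx(notmuch search --output=files id:"$msgid"));
        next if @copies < 2;

        my %seen;                            # body md5 -> surviving file
        for my $copy (@copies) {
            my ($h, $b) = read_mail($copy) or next;
            my $key = md5_hex($b);
            if (my $prev = $seen{$key}) {
                # Identical bodies: drop the copy with fewer hops.
                my ($ph, undef) = read_mail($prev);
                my $victim = received_count($h) < received_count($ph)
                           ? $copy : $prev;
                print "would remove: $victim\n";
                # unlink $victim;            # commented out, as above
                $seen{$key} = ($victim eq $copy) ? $prev : $copy;
            } else {
                $seen{$key} = $copy;
            }
        }
    }
    close $dups;

Hashing only the body sidesteps the differing headers, but, as noted
above, it will still miss copies whose bodies were re-encoded by an
MTA or had a list footer appended; those need the diff-against-known-
footers step.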