>> On Tue, Sep 04 2012, Dmitry Kurochkin wrote:
>>> +class MailComparator:
>>> +    """Checks if mail files are duplicates."""
>>> +    def __init__(self, filename):
>>> +        self.filename = filename
>>> +        self.mail = self.readFile(self.filename)
>>> +
>>> +    def isDuplicate(self, filename):
>>> +        return self.mail == self.readFile(filename)
>>> +
>>> +    @staticmethod
>>> +    def readFile(filename):
>>> +        with open(filename) as f:
>>> +            data = ""
>>> +            while True:
>>> +                line = f.readline()
>>> +                for header in IGNORED_HEADERS:
>>> +                    if line.startswith(header):

> Michal Nazarewicz writes:
>> Case of headers should be ignored, but this does not ignore it.

On Tue, Sep 04 2012, Dmitry Kurochkin wrote:
> It does.

Wait, how?  If line is “received:”, how does it start with “Received:”?

>>> +            if os.path.realpath(comparator.filename) == os.path.realpath(filename):
>>> +                print "Message '%s' has filenames pointing to the
>>> same file: '%s' '%s'" % (msg.get_message_id(), comparator.filename,
>>> filename)
>>
>> So why aren't those removed?
>>
>
> Because it is the same file indexed twice (probably because of
> symlinks).  We do not want to remove the only message file.

Ah, right, with symlinks this is troublesome, but then again, we can
check if there is at least one non-symlink.  If there is, delete
everything else; if there is not, delete all but one arbitrarily chosen
symlink.

>>> +            elif comparator.isDuplicate(filename):
>>> +                os.remove(filename)
>>> +                duplicates_count += 1
>>> +            else:
>>> +                #print "Potential duplicates: %s" % msg.get_message_id()
>>> +                suspected_duplicates_count += 1
>>> +
>>> +    new_timestamp = time.time()
>>> +    if new_timestamp - timestamp > 1:
>>> +        timestamp = new_timestamp
>>> +        sys.stdout.write("\rProcessed %s messages, removed %s duplicates..." % (msg_count, duplicates_count))
>>> +        sys.stdout.flush()
>>> +
>>> +print "\rFinished. Processed %s messages, removed %s duplicates." % (msg_count, duplicates_count)
>>> +if duplicates_count > 0:
>>> +    print "You might want to run 'notmuch new' now."
>>> +
>>> +if suspected_duplicates_count > 0:
>>> +    print
>>> +    print "Found %s messages with duplicate IDs but different content." % suspected_duplicates_count
>>> +    print "Perhaps we should ignore more headers."
>>
>> Please consider the following instead (not tested):

> Thanks for reviewing my poor python code :) I am afraid I do not have
> enough interest in improving it.  I just implemented a simple solution
> for my problem.  Though it looks like you already took time to rewrite
> the script.  Would be great if you send it as a proper patch obsoleting
> this one.

Bah, I probably won't have time to properly test it.

-- 
Best regards,                                         _     _
.o. | Liege of Serenely Enlightened Majesty of      o' \,=./ `o
..o | Computer Science,  Michał “mina86” Nazarewicz    (o o)
ooo +------------------ooO--(_)--Ooo--
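
PS.  To make the two points above concrete, here is a rough sketch of a case-insensitive header check and of the symlink rule (keep a non-symlink if there is one, otherwise keep one arbitrarily chosen symlink).  It is not part of the patch: the names is_ignored_header and symlinks_to_remove and the contents of IGNORED_HEADERS below are made up for illustration.  It only ever proposes paths that are themselves symlinks for deletion, on the assumption that a non-symlink path with the same realpath (say, reached through a symlinked directory) may be the only real copy.

```python
import os

# Hypothetical example list; the real IGNORED_HEADERS from the patch
# is not shown in this thread.
IGNORED_HEADERS = ["Received:", "Delivered-To:"]


def is_ignored_header(line, ignored=IGNORED_HEADERS):
    """Return True if line starts with an ignored header, in any case."""
    lowered = line.lower()
    return any(lowered.startswith(header.lower()) for header in ignored)


def symlinks_to_remove(paths):
    """Pick which of paths (all resolving to the same file) to delete.

    Only paths that are themselves symlinks are ever returned, so the
    underlying message file is never removed.
    """
    links = sorted(p for p in paths if os.path.islink(p))
    has_regular = any(not os.path.islink(p) for p in paths)
    if has_regular:
        # A non-symlink path survives, so every symlink can go.
        return links
    # Only symlinks: keep one (the first after sorting), drop the rest.
    return links[1:]
```

The caller would os.remove() each returned path in place of the "same file" warning; with only symlinks left, which one survives is arbitrary (here simply the first in sorted order).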