From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on dcvr.yhbt.net
X-Spam-Level:
X-Spam-ASN:
X-Spam-Status: No, score=-4.0 required=3.0 tests=ALL_TRUSTED,BAYES_00
    shortcircuit=no autolearn=ham autolearn_force=no version=3.4.2
Received: from localhost (dcvr.yhbt.net [127.0.0.1])
    by dcvr.yhbt.net (Postfix) with ESMTP id B9DEF1F8C8
    for ; Tue,  5 Oct 2021 07:09:18 +0000 (UTC)
Date: Tue, 5 Oct 2021 07:09:18 +0000
From: Eric Wong
To: meta@public-inbox.org
Subject: Re: [PATCH 3/4] content_hash: normalize whitespace before hashing addresses
Message-ID: <20211005070918.GA13812@dcvr>
References: <20211002111835.19220-1-e@80x24.org>
    <20211002111835.19220-4-e@80x24.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To: <20211002111835.19220-4-e@80x24.org>
List-Id:

Eric Wong wrote:
> This should prevent some false duplicates.  I noticed this
> while implementing "lei mail-diff", and only noticed it when
> I implemented the ContentDigestDbg wrapper for mail-diff.

Btw, I completely forgot -extindex has a --dedupe switch for
dealing with situations like this:

    public-inbox-extindex --dedupe=MSGID [--dedupe=MSGID1]

    public-inbox-extindex --dedupe # everything!

It looks like there's even test cases for it in t/extsearch.t (!)

I'm running --dedupe on yhbt.net/lore/all, because apparently
--reindex doesn't deduplicate :x

And there's a lot of stuff I still need to document in the
pod/manpage :x