Date: Sat, 13 Apr 2024 07:58:16 +0000
From: Eric Wong
To: Jacob Keller
Cc: Konstantin Ryabitsev, meta@public-inbox.org
Subject: Re: downloading t.mbox.gz messages are not sorted in expected order
Message-ID: <20240413075816.M838152@dcvr>
References: <3c335d9a-f0dd-43c3-b1e1-ce94cc291ecb@intel.com> <20240411-dancing-pink-marmoset-f442d0@meerkat>

Jacob Keller wrote:
>
>
> On 4/11/2024 3:42 PM, Konstantin Ryabitsev wrote:
> > On Thu, Apr 11, 2024 at 03:32:43PM -0700, Jacob Keller wrote:
> >> I sometimes download patch series off of public inbox hosted servers to
> >> apply with git-am. Occasionally I have found that these do not apply
> >> cleanly because the thread is not sorted in patch order.
> >
> > It's more than just the order -- if there are replies in the thread, the mbox
> > file won't apply either.
> >
>
> If the order was correct, it is usually easy enough to just "git am
> --skip" the patches which have no content. However...
>
> > This is the reason why the b4 tool exists:
> > https://b4.docs.kernel.org/
> >
>
> This is extremely useful and I was unaware of its existence. Thanks!!

Good to know b4 works for you.

FWIW, t.mbox.gz uses NNTP article number ordering to ensure
batched fetches work and duplicates can't get served.

IOW, it fetches a batch of 1000 header rows at a time from a
single thread to avoid using too much memory for a single
request. The next batch (another 1K) only gets fetched once the
current batch is done.

So it must order by article number to deal with that, especially
since new messages may appear in the thread while the current
batch is being streamed.

Identical Date: headers can appear multiple times in the same
thread, so using a >= or > comparison on Date for retrieval
wouldn't work.

Of course, most threads are <1000 messages, so I did think about
sorting by Date for small threads (as we do for the HTML
output)...

However, with the current t.mbox.gz code, we expect (and can
handle) new messages appearing while a t.mbox.gz is being
served.

So if a thread has 10 messages, the first batch fetch would only
return those 10. However, while a client is slowly downloading
the first 10 messages, more messages show up. The current
retrieval scheme allows new messages in a thread to show up
without needing another request.

AFAIK, it's actually easier and takes fewer SQL statements to do
it the current way.
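
To illustrate the batching described above, here is a rough sketch in
Python with sqlite3. It is not the public-inbox code: the table "msgs",
its columns (num, tid, date) and the function names are all made up for
the example. It only shows why resuming from the last article number
served avoids duplicates and picks up late arrivals, while a ">"
comparison on Date drops messages that share a timestamp.

```python
import sqlite3

BATCH = 1000  # batch size mentioned in the message above


def make_db():
    """Hypothetical schema: ten messages in thread 1, all with the same Date."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE msgs (num INTEGER PRIMARY KEY, tid INTEGER, date INTEGER)"
    )
    db.executemany(
        "INSERT INTO msgs (num, tid, date) VALUES (?, 1, 100)",
        [(n,) for n in range(1, 11)],
    )
    return db


def iter_thread(db, tid, batch=BATCH):
    """Yield (num, date) rows of one thread in article-number order.

    Each batch resumes strictly after the last article number served,
    so no row repeats and rows added mid-download land in a later batch.
    """
    last = 0
    while True:
        rows = db.execute(
            "SELECT num, date FROM msgs WHERE tid = ? AND num > ? "
            "ORDER BY num LIMIT ?",
            (tid, last, batch),
        ).fetchall()
        if not rows:
            return
        yield from rows
        last = rows[-1][0]


def iter_by_date(db, tid, batch=BATCH):
    """Same loop keyed on Date with '>', shown only for contrast."""
    last = -1
    while True:
        rows = db.execute(
            "SELECT num, date FROM msgs WHERE tid = ? AND date > ? "
            "ORDER BY date, num LIMIT ?",
            (tid, last, batch),
        ).fetchall()
        if not rows:
            return
        yield from rows
        last = rows[-1][1]


def demo():
    """Serve thread 1 in batches of 4 while a new reply arrives."""
    db = make_db()
    served = []
    for num, _date in iter_thread(db, 1, batch=4):
        served.append(num)
        if num == 2:
            # a reply shows up while the first batch is still streaming
            db.execute("INSERT INTO msgs (num, tid, date) VALUES (11, 1, 100)")
    return served
```

In this toy setup demo() serves all eleven messages exactly once,
including the one inserted mid-download. iter_by_date() with the same
batch size stops after the first four rows, since every remaining
message has the same Date as the last one served.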