unofficial mirror of meta@public-inbox.org
* message bloat over time...
@ 2020-09-02 19:05 Eric Wong
  2020-09-02 21:38 ` Konstantin Ryabitsev
  0 siblings, 1 reply; 2+ messages in thread
From: Eric Wong @ 2020-09-02 19:05 UTC (permalink / raw)
  To: meta

I've been indexing and reindexing a local mirror of
https://lore.kernel.org/lkml a bit, and it's kinda depressing to
see newer messages being more and more bloated even on a
plain-text-only mailing list :<

The first column ("$X.git") is the epoch number; older epochs
are lower-numbered: "0.git" is oldest, "8.git" (not shown) is
the newest.  8.git is omitted since it's still in progress;
each epoch is capped at roughly 1.1G of packed git storage.

The last column is the number of messages in that epoch.
Fewer messages fit in each successive epoch:

  7.git counting 17d7e25e3e862d5d99182557bb723374230a8497 ... 312754
  6.git counting bc9b3c196d0fc92a520e9ad4f92c4d3c1db1943f ... 346017
  5.git counting 31ed379430c456f90bdd172b223020c0e6d7cb8d ... 379561
  4.git counting 88294f6d487193f5984791ee81213a25130d0559 ... 416015
  3.git counting 93d9eace2721494d8457c7f5f6de803c0d648172 ... 453851
  2.git counting d48078ceeec1f51313253a56ed3ba0eae7fde909 ... 455366
  1.git counting 6b67b9f5e0cd82d3c734e6cdc44c1f722ab6fb6a ... 475671
  0.git counting b67bf7f62c8125d67461cc6e7d1736ddc8844a18 ... 570488

So yeah, old epochs could fit more messages because messages
were smaller back then...

/me goes back to yelling at the sky...
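
A rough sketch of the arithmetic behind the listing above, assuming
every finished epoch holds about 1.1 GiB of packed data (the thread
only gives an approximate cap, so these are estimates of packed bytes
per message, not raw message sizes).  The names EPOCH_CAP_BYTES,
MESSAGES_PER_EPOCH, avg_packed_bytes and growth_ratio are made up
for this illustration:

```python
# Assumption: each finished epoch is about 1.1 GiB of packed git data.
# The thread only states a rough cap, so the results are estimates.
EPOCH_CAP_BYTES = 1.1 * 1024**3

# Message counts per epoch, copied from the listing in the message.
MESSAGES_PER_EPOCH = {
    0: 570488,
    1: 475671,
    2: 455366,
    3: 453851,
    4: 416015,
    5: 379561,
    6: 346017,
    7: 312754,
}


def avg_packed_bytes(count, cap=EPOCH_CAP_BYTES):
    """Approximate packed bytes per message for an epoch with `count` messages."""
    return cap / count


def growth_ratio(counts, old, new):
    """How many times larger the average message in epoch `new` is than in `old`.

    The cap cancels out, so this depends only on the message counts.
    """
    return counts[old] / counts[new]


if __name__ == "__main__":
    for epoch, count in sorted(MESSAGES_PER_EPOCH.items()):
        size = avg_packed_bytes(count)
        print(f"{epoch}.git  {count:>7} msgs  ~{size:,.0f} packed bytes/msg")
    ratio = growth_ratio(MESSAGES_PER_EPOCH, 0, 7)
    print(f"7.git vs 0.git: {ratio:.2f}x")
```

Under that assumption, the average message in 7.git takes roughly
1.8 times the packed space of one in 0.git.  The ratio itself does
not depend on the exact cap, only on the counts.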


* Re: message bloat over time...
  2020-09-02 19:05 message bloat over time Eric Wong
@ 2020-09-02 21:38 ` Konstantin Ryabitsev
  0 siblings, 0 replies; 2+ messages in thread
From: Konstantin Ryabitsev @ 2020-09-02 21:38 UTC (permalink / raw)
  To: Eric Wong; +Cc: meta

On Wed, Sep 02, 2020 at 07:05:25PM +0000, Eric Wong wrote:
> I've been indexing and reindexing a local mirror of
> https://lore.kernel.org/lkml a bit, and it's kinda depressing to
> see newer messages being more and more bloated even on a
> plain-text-only mailing list :<
> 
> The first column ("$X.git") is the epoch number; older epochs
> are lower-numbered: "0.git" is oldest, "8.git" (not shown) is
> the newest.  8.git is omitted since it's still in progress;
> each epoch is capped at roughly 1.1G of packed git storage.
> 
> The last column is the number of messages in that epoch.
> Fewer messages fit in each successive epoch:
> 
>   7.git counting 17d7e25e3e862d5d99182557bb723374230a8497 ... 312754
>   6.git counting bc9b3c196d0fc92a520e9ad4f92c4d3c1db1943f ... 346017
>   5.git counting 31ed379430c456f90bdd172b223020c0e6d7cb8d ... 379561

I'm not sure it's quite a fair comparison between 4 and 5, since the 
initial import was done from email sources that were heavily sanitized 
for headers -- both for privacy and for size. Everything we've been 
receiving since then carries untouched headers, which includes entire 
Received lines and all the DKIM/DMARC/SPF checking junk.

>   4.git counting 88294f6d487193f5984791ee81213a25130d0559 ... 416015
>   3.git counting 93d9eace2721494d8457c7f5f6de803c0d648172 ... 453851
>   2.git counting d48078ceeec1f51313253a56ed3ba0eae7fde909 ... 455366
>   1.git counting 6b67b9f5e0cd82d3c734e6cdc44c1f722ab6fb6a ... 475671
>   0.git counting b67bf7f62c8125d67461cc6e7d1736ddc8844a18 ... 570488
> 
> So yeah, old epochs could fit more messages because messages
> were smaller back then...

I've considered doing some header stripping, but I've opted to preserve 
them for provenance/authenticity reasons. I may still change my mind at 
some point. :)

-K
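
For anyone curious how much the headers mentioned above weigh, here
is a minimal sketch using Python's standard email module.  It is not
how lore or public-inbox handles messages; the header list in
STRIP_HEADERS is an assumption based on the "Received" and
"DKIM/DMARC/SPF" remarks, and strip_headers and bytes_saved are
hypothetical helper names:

```python
from email import message_from_bytes

# Assumption: these are the trace/authentication headers one might drop.
# The thread names only Received lines and DKIM/DMARC/SPF results in
# general terms; the exact list here is illustrative.
STRIP_HEADERS = (
    "Received",
    "Received-SPF",
    "DKIM-Signature",
    "Authentication-Results",
    "ARC-Seal",
    "ARC-Message-Signature",
    "ARC-Authentication-Results",
)


def strip_headers(raw, names=STRIP_HEADERS):
    """Return the message bytes with every occurrence of `names` removed."""
    msg = message_from_bytes(raw)
    for name in names:
        # Deleting by name removes all matching headers (case-insensitive)
        # and does nothing if the header is absent.
        del msg[name]
    return msg.as_bytes()


def bytes_saved(raw, names=STRIP_HEADERS):
    """Bytes removed by stripping, measured on re-serialized messages.

    Both sizes come from as_bytes() so that line-ending normalization
    does not skew the comparison.
    """
    before = len(message_from_bytes(raw).as_bytes())
    after = len(strip_headers(raw, names))
    return before - after


if __name__ == "__main__":
    sample = (
        b"Received: from a.example by b.example; Wed, 2 Sep 2020 19:05:25 +0000\n"
        b"DKIM-Signature: v=1; a=rsa-sha256; d=example.org; s=sel; b=abc\n"
        b"From: sender@example.org\n"
        b"Subject: test\n"
        b"\n"
        b"body\n"
    )
    print(bytes_saved(sample), "bytes of headers would be dropped")
```

This measures uncompressed bytes only.  The effect on packed git
storage would differ, since git compresses and deltifies objects,
and stripping these headers gives up the provenance information
they carry.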
