From: "Jonathan McHugh" <indieterminacy@libre.brussels>
To: "Maxim Cournoyer" <maxim.cournoyer@gmail.com>,
"Ludovic Courtès" <ludo@gnu.org>
Cc: guix-devel <guix-devel@gnu.org>
Subject: Re: Profiling of man-db database generation with zlib vs zstd
Date: Wed, 30 Mar 2022 16:16:18 +0000
Message-ID: <9d511563872e581d8a6fe5fc8bcb2532@libre.brussels>
In-Reply-To: <875ynvv6l9.fsf@gmail.com>
Hi Maxim,
Out of interest, will a zstd dictionary (eventually) be used as a strategy for further reducing compression and decompression times?
```
The compression library Zstandard (also known as "Zstd") has the ability to create an external "dictionary" from a set of training files which can be used to more efficiently (in terms of compression and decompression speed and also in terms of compression ratio) compress files of the same type as the training files. For example, if a dictionary is "trained" on an example set of email messages, anyone with access to the dictionary will be able to more efficiently compress another email file. The trick is that the commonalities are kept in the dictionary file, and, therefore, anyone wishing to decompress the email must have already had that same dictionary sent to them.[2]
```
http://fileformats.archiveteam.org/wiki/Zstandard_dictionary
I appreciate it may confuse your piecemeal benchmarking (certainly at this stage), but I would expect that creating a dictionary for man pages (or several dictionaries, say one per Guix package category, to exploit linguistic overlap) would further improve zstd speeds.
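To illustrate the idea, here is a minimal sketch in Python. It is not zstd: it uses zlib's preset-dictionary support from the standard library as a stand-in, with a hand-written dictionary instead of a trained one. The dictionary contents, the sample page and the helper names are all made up for illustration.

```python
import zlib

# Hand-written stand-in for a trained dictionary: boilerplate that many
# man pages share.  A real zstd dictionary would be trained from samples.
SHARED_DICT = (
    b'.SH NAME\n.SH SYNOPSIS\n.SH DESCRIPTION\n.SH OPTIONS\n'
    b'.SH AUTHOR\n.SH "REPORTING BUGS"\n.SH COPYRIGHT\n.SH "SEE ALSO"\n'
    b'This is free software: you are free to change and redistribute it.\n'
    b'There is NO WARRANTY, to the extent permitted by law.\n'
)

# Hypothetical man page source, used only for illustration.
SAMPLE_PAGE = (
    b'.TH EXAMPLE 1\n'
    b'.SH NAME\nexample \\- a hypothetical command\n'
    b'.SH SYNOPSIS\n.B example\n[OPTION]...\n'
    b'.SH DESCRIPTION\nDoes nothing in particular.\n'
    b'.SH COPYRIGHT\n'
    b'This is free software: you are free to change and redistribute it.\n'
    b'There is NO WARRANTY, to the extent permitted by law.\n'
    b'.SH "SEE ALSO"\nman(1)\n'
)


def compress_plain(data):
    """Compress data with zlib and no preset dictionary."""
    compressor = zlib.compressobj(level=9)
    return compressor.compress(data) + compressor.flush()


def compress_with_dict(data, dictionary):
    """Compress data with zlib, priming the compressor with dictionary."""
    compressor = zlib.compressobj(level=9, zdict=dictionary)
    return compressor.compress(data) + compressor.flush()


def decompress_with_dict(blob, dictionary):
    """Decompress blob; the same dictionary must be supplied again."""
    decompressor = zlib.decompressobj(zdict=dictionary)
    return decompressor.decompress(blob) + decompressor.flush()


if __name__ == "__main__":
    plain = compress_plain(SAMPLE_PAGE)
    primed = compress_with_dict(SAMPLE_PAGE, SHARED_DICT)
    # The round trip only works when the reader holds the same dictionary.
    assert decompress_with_dict(primed, SHARED_DICT) == SAMPLE_PAGE
    print("original:", len(SAMPLE_PAGE),
          "plain:", len(plain),
          "with dictionary:", len(primed))
```

The shared boilerplate lives in the dictionary rather than in each compressed page, which is why small, similar files benefit most. Whether this would help the man-db case would need measuring.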
HTH,
====================
Jonathan McHugh
indieterminacy@libre.brussels
March 30, 2022 4:49 PM, "Maxim Cournoyer" <maxim.cournoyer@gmail.com> wrote:
> Hi Ludovic,
>
> Ludovic Courtès <ludo@gnu.org> writes:
>
> [...]
>
>> To isolate the problem, you could allocate the 4 MiB buffer outside of
>> the loop and use ‘get-bytevector-n!’, and also remove code that writes
>> to ‘output’.
>
> I've adjusted the benchmark like so:
>
> --8<---------------cut here---------------start------------->8---
> (use-modules (ice-9 binary-ports)
>              (ice-9 match)
>              (rnrs bytevectors)
>              (zstd))
>
> (define MiB (expt 2 20))
> (define block-size (* 4 MiB))
> (define bv (make-bytevector block-size))
> (define input-file "/tmp/chromium-98.0.4758.102.tar.zst")
>
> (define (run)
>   (call-with-input-file input-file
>     (lambda (port)
>       (call-with-zstd-input-port port
>         (lambda (input)
>           (while (not (eof-object?
>                        (get-bytevector-n! input bv 0 block-size)))))))))
>
> (run)
> --8<---------------cut here---------------end--------------->8---
>
> It now runs much faster:
>
> --8<---------------cut here---------------start------------->8---
> $ time+ zstd -cdk /tmp/chromium-98.0.4758.102.tar.zst > /dev/null
> cpu: 98%, mem: 10560 KiB, wall: 0:09.56, sys: 0.37, usr: 9.06
> --8<---------------cut here---------------end--------------->8---
>
> --8<---------------cut here---------------start------------->8---
> $ time+ guile ~/src/guile-zstd/benchmark.scm
> cpu: 100%, mem: 25152 KiB, wall: 0:11.69, sys: 0.38, usr: 11.30
> --8<---------------cut here---------------end--------------->8---
>
> So guile-zstd was about 20% slower than the zstd command, which is
> not far off.
>
> For completeness, here's the same benchmark adjusted for guile-zlib:
>
> --8<---------------cut here---------------start------------->8---
> (use-modules (ice-9 binary-ports)
>              (ice-9 match)
>              (rnrs bytevectors)
>              (zlib))
>
> (define MiB (expt 2 20))
> (define block-size (* 4 MiB))
> (define bv (make-bytevector block-size))
> (define input-file "/tmp/chromium-98.0.4758.102.tar.gz")
>
> (define (run)
>   (call-with-input-file input-file
>     (lambda (port)
>       (call-with-gzip-input-port port
>         (lambda (input)
>           (while (not (eof-object?
>                        (get-bytevector-n! input bv 0 block-size)))))))))
>
> (run)
> --8<---------------cut here---------------end--------------->8---
>
> --8<---------------cut here---------------start------------->8---
> $ time+ guile ~/src/guile-zstd/benchmark-zlib.scm
> cpu: 86%, mem: 14552 KiB, wall: 0:23.50, sys: 1.09, usr: 19.15
> --8<---------------cut here---------------end--------------->8---
>
> --8<---------------cut here---------------start------------->8---
> $ time+ gunzip -ck /tmp/chromium-98.0.4758.102.tar.gz > /dev/null
> cpu: 98%, mem: 2304 KiB, wall: 0:35.99, sys: 0.60, usr: 34.99
> --8<---------------cut here---------------end--------------->8---
>
> Surprisingly, here guile-zlib appears to be faster than the 'gunzip'
> command; guile-zstd is about twice as fast as guile-zlib at
> decompressing this archive of a little over 4 GiB (compressed with
> zstd at level 19).
>
> So, it seems the foundation we're building on is sane after all. This
> suggests that decompression is not the bottleneck when generating the
> man pages database, probably because it only needs to read the first
> few bytes of each compressed man page to gather the information it
> needs, and that the rest of the work (such as string-tokenize'ing the
> lines read to parse the data) is more expensive by comparison.
>
> To be continued...
>
> Thanks!
>
> Maxim
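
For reference, the ratios behind the "about 20% slower" and "about twice as fast" remarks can be recomputed from the wall-clock times quoted above. This is a small sketch in Python. The helper names are illustrative only, and the figures come from a single run of each command, so treat them as rough.

```python
# Wall-clock seconds copied from the benchmark output quoted above.
WALL_SECONDS = {
    "zstd": 9.56,
    "guile-zstd": 11.69,
    "guile-zlib": 23.50,
    "gunzip": 35.99,
}


def slowdown_percent(candidate, baseline):
    """How much longer candidate took than baseline, in percent."""
    return (WALL_SECONDS[candidate] / WALL_SECONDS[baseline] - 1) * 100


def speedup(candidate, baseline):
    """How many times faster candidate was than baseline."""
    return WALL_SECONDS[baseline] / WALL_SECONDS[candidate]


if __name__ == "__main__":
    print("guile-zstd vs zstd: %.1f%% slower"
          % slowdown_percent("guile-zstd", "zstd"))
    print("guile-zstd vs guile-zlib: %.2fx faster"
          % speedup("guile-zstd", "guile-zlib"))
    print("guile-zlib vs gunzip: %.2fx faster"
          % speedup("guile-zlib", "gunzip"))
```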