From: "Jonathan McHugh"
To: "Maxim Cournoyer", "Ludovic Courtès"
Cc: guix-devel
Subject: Re: Profiling of man-db database generation with zlib vs zstd
Date: Wed, 30 Mar 2022 16:16:18 +0000
Message-ID: <9d511563872e581d8a6fe5fc8bcb2532@libre.brussels>
In-Reply-To: <875ynvv6l9.fsf@gmail.com>
References: <875ynvv6l9.fsf@gmail.com> <875yo53iuq.fsf@gmail.com> <87ee2r9gms.fsf@gnu.org> <87o81qviqg.fsf@gmail.com> <87czi5126h.fsf@gnu.org>

Hi Maxim,

Out of interest, will a zstd dictionary (eventually) be used as a
strategy for further reducing compression and decompression times?

```
The compression library Zstandard (also known as "Zstd") has the ability
to create an external "dictionary" from a set of training files which can
be used to more efficiently (in terms of compression and decompression
speed and also in terms of compression ratio) compress files of the same
type as the training files. For example, if a dictionary is "trained" on
an example set of email messages, anyone with access to the dictionary
will be able to more efficiently compress another email file. The trick is
that the commonalities are kept in the dictionary file, and, therefore,
anyone wishing to decompress the email must have already had that same
dictionary sent to them.[2]
```

http://fileformats.archiveteam.org/wiki/Zstandard_dictionary

I appreciate it may confuse your piecemeal benchmarking (certainly at
this stage), but I would assume that creating a dictionary for manpages
(or several dictionaries, say one per Guix package category, to exploit
linguistic overlaps) would further improve zstd speeds.

HTH,

====================
Jonathan McHugh
indieterminacy@libre.brussels

March 30, 2022 4:49 PM, "Maxim Cournoyer" wrote:

> Hi Ludovic,
>
> Ludovic Courtès writes:
>
> [...]
>
>> To isolate the problem, you could allocate the 4 MiB buffer outside of
>> the loop and use ‘get-bytevector-n!’, and also remove code that writes
>> to ‘output’.
>
> I've adjusted the benchmark like so:
>
> --8<---------------cut here---------------start------------->8---
> (use-modules (ice-9 binary-ports)
>              (ice-9 match)
>              (rnrs bytevectors)
>              (zstd))
>
> (define MiB (expt 2 20))
> (define block-size (* 4 MiB))
> (define bv (make-bytevector block-size))
> (define input-file "/tmp/chromium-98.0.4758.102.tar.zst")
>
> (define (run)
>   (call-with-input-file input-file
>     (lambda (port)
>       (call-with-zstd-input-port port
>         (lambda (input)
>           (while (not (eof-object?
>                        (get-bytevector-n! input bv 0 block-size)))))))))
>
> (run)
> --8<---------------cut here---------------end--------------->8---
>
> It now runs much faster:
>
> --8<---------------cut here---------------start------------->8---
> $ time+ zstd -cdk /tmp/chromium-98.0.4758.102.tar.zst > /dev/null
> cpu: 98%, mem: 10560 KiB, wall: 0:09.56, sys: 0.37, usr: 9.06
> --8<---------------cut here---------------end--------------->8---
>
> --8<---------------cut here---------------start------------->8---
> $ time+ guile ~/src/guile-zstd/benchmark.scm
> cpu: 100%, mem: 25152 KiB, wall: 0:11.69, sys: 0.38, usr: 11.30
> --8<---------------cut here---------------end--------------->8---
>
> So guile-zstd was about 20% slower, not too far.
>
> For completeness, here's the same benchmark adjusted for guile-zlib:
>
> --8<---------------cut here---------------start------------->8---
> (use-modules (ice-9 binary-ports)
>              (ice-9 match)
>              (rnrs bytevectors)
>              (zlib))
>
> (define MiB (expt 2 20))
> (define block-size (* 4 MiB))
> (define bv (make-bytevector block-size))
> (define input-file "/tmp/chromium-98.0.4758.102.tar.gz")
>
> (define (run)
>   (call-with-input-file input-file
>     (lambda (port)
>       (call-with-gzip-input-port port
>         (lambda (input)
>           (while (not (eof-object?
>                        (get-bytevector-n! input bv 0 block-size)))))))))
>
> (run)
> --8<---------------cut here---------------end--------------->8---
>
> --8<---------------cut here---------------start------------->8---
> $ time+ guile ~/src/guile-zstd/benchmark-zlib.scm
> cpu: 86%, mem: 14552 KiB, wall: 0:23.50, sys: 1.09, usr: 19.15
> --8<---------------cut here---------------end--------------->8---
>
> --8<---------------cut here---------------start------------->8---
> $ time+ gunzip -ck /tmp/chromium-98.0.4758.102.tar.gz > /dev/null
> cpu: 98%, mem: 2304 KiB, wall: 0:35.99, sys: 0.60, usr: 34.99
> --8<---------------cut here---------------end--------------->8---
>
> Surprisingly, here guile-zlib appears to be faster than the 'gunzip'
> command; guile-zstd is about twice as fast to decompress this 4 GiB
> something archive (compressed with zstd at level 19).
>
> So, it seems the foundation we're building on is sane after all. This
> suggests that compression is not the bottleneck when generating the man
> pages database, probably because it only needs to read the first few
> bytes of each compressed manpage to gather the information it needs, and
> that the rest is more expensive compared to that (such as
> string-tokenize'ing the lines read to parse the data).
>
> To be continued...
>
> Thanks!
>
> Maxim
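
PS: for anyone wanting to double-check the "about 20% slower" and "about
twice as fast" remarks in the quoted benchmarks, the ratios can be
recomputed from the wall-clock figures. This is only a rough sketch,
assuming a POSIX sh with awk available; the variable names and the
'ratio' helper are mine (hypothetical), not something from the thread
or from guile-zstd.

```shell
#!/bin/sh
# Wall-clock times in seconds, copied from the 'time+' output quoted above.
zstd_cli=9.56      # zstd -cdk (command-line tool)
guile_zstd=11.69   # guile benchmark.scm (guile-zstd)
guile_zlib=23.50   # guile benchmark-zlib.scm (guile-zlib)
gunzip_cli=35.99   # gunzip -ck (command-line tool)

# ratio A B: print A divided by B, rounded to two decimals.
# (Hypothetical helper, introduced for this illustration only.)
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

echo "guile-zstd vs zstd CLI:    $(ratio "$guile_zstd" "$zstd_cli")x the wall time"
echo "guile-zlib vs guile-zstd:  $(ratio "$guile_zlib" "$guile_zstd")x the wall time"
echo "gunzip CLI vs guile-zlib:  $(ratio "$gunzip_cli" "$guile_zlib")x the wall time"
```

With the numbers above this gives ratios of roughly 1.22, 2.01 and 1.53,
i.e. guile-zstd took a little over 20% longer than the zstd command,
guile-zlib took about twice as long as guile-zstd, and gunzip took about
half as long again as guile-zlib. These are single runs, so I would not
read much into the second decimal.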