unofficial mirror of guix-devel@gnu.org 
* Disarchive database synchronization
From: Ludovic Courtès @ 2023-03-14 15:55 UTC
  To: guix-devel; +Cc: guix-sysadmin, Timothy Sample, Simon Tournier

Hello Guix!

As you may know, there are currently two different Disarchive databases:
the one at <https://disarchive.ngyro.com/> that Timothy Sample set up a
few years back, and the one at <https://disarchive.guix.gnu.org> that we
set up later, with a continuous integration job to populate it¹.

The database at ngyro.com has more historical metadata (metadata about
tarballs that older Guix revisions referred to) because Timothy worked
hard to populate it with tarballs from all the packages Guix refers to
starting from 1.0—which is crucial for long-term reproducibility.

Thanks to Timothy, I have now copied over things from
disarchive.ngyro.com to disarchive.guix.gnu.org.  The stats are as
follows:

  disarchive.ngyro.com had 28,396 entries
  12,905 (45%) entries were missing from disarchive.guix.gnu.org
  15,491 (the rest: 55%) entries were present in both yet different.
  3,444 entries of disarchive.guix.gnu.org were missing from disarchive.ngyro.com²

I copied over the 12K entries that were missing from
disarchive.guix.gnu.org.  (Note that there are currently only two copies
of the database: one at/in [bB]erlin, and one at/in [Bb]ordeaux.)
disarchive.guix.gnu.org now weighs in at 1.8 GiB for 31,839 entries.
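
A classification like the one behind these numbers could be sketched as
follows.  This is only a minimal sketch, not the script that was actually
used: the function name and the layout (one file per entry in a flat
directory, both copies using the same file names) are assumptions.

```python
import filecmp
from pathlib import Path


def classify_entries(ours: Path, theirs: Path):
    """Split the entries of THEIRS into three lists: missing from OURS,
    byte-identical in both, and present in both yet different."""
    missing, identical, different = [], [], []
    for entry in sorted(theirs.iterdir()):
        counterpart = ours / entry.name
        if not counterpart.exists():
            missing.append(entry.name)
        elif filecmp.cmp(entry, counterpart, shallow=False):
            # shallow=False forces a byte-by-byte comparison.
            identical.append(entry.name)
        else:
            different.append(entry.name)
    return missing, identical, different
```

Only the "missing" list is safe to copy blindly; the "different" list
needs a closer look at what actually differs.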

For the remaining entries, it’s trickier.  Sometimes it’s just the
gzip compression parameters that differ, which could be addressed with a
little bit more work:

--8<---------------cut here---------------start------------->8---
$ file ffdc77f5e5cb2390b9309de63eb7be68d9fe631e898f4da6c04a8159daefc2c0.gz ../../disarchive/sha256/ffdc77f5e5cb2390b9309de63eb7be68d9fe631e898f4da6c04a8159daefc2c0.gz
ffdc77f5e5cb2390b9309de63eb7be68d9fe631e898f4da6c04a8159daefc2c0.gz:                         gzip compressed data, max compression, from Unix, original size modulo 2^32 446731
../../disarchive/sha256/ffdc77f5e5cb2390b9309de63eb7be68d9fe631e898f4da6c04a8159daefc2c0.gz: gzip compressed data, max speed, from Unix, original size modulo 2^32 446731
--8<---------------cut here---------------end--------------->8---
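
One way to look past such differences is to compare the decompressed
payloads instead of the compressed files.  The sketch below assumes each
entry is a single gzip stream; the function names are made up for
illustration.  The "max compression" and "max speed" labels printed by
file come from the XFL byte of the gzip header (RFC 1952), which the
first function reads.

```python
import gzip


def gzip_level_hint(data: bytes) -> str:
    """Return the compression hint stored in the XFL byte (offset 8)
    of a gzip header, as defined by RFC 1952."""
    if data[:2] != b"\x1f\x8b":
        raise ValueError("not a gzip stream")
    return {2: "max compression", 4: "max speed"}.get(data[8], "unspecified")


def same_payload(a: bytes, b: bytes) -> bool:
    """Return True when two gzip streams decompress to the same bytes,
    whatever compression parameters were used to produce them."""
    return gzip.decompress(a) == gzip.decompress(b)
```

Two entries for which same_payload returns True differ only in how they
were compressed.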

Sometimes it’s trickier:


# diff -u <(gunzip -d < 0001f025c1425ffe36270a81cb091eade87dd8d29ac773735ae47e1a8c8066c9.gz) <(gunzip -d < ../../disarchive/sha256/0001f025c1425ffe36270a81cb091eade87dd8d29ac773735ae47e1a8c8066c9.gz)
--- /dev/fd/63  2023-03-14 16:13:21.635733426 +0100
+++ /dev/fd/62  2023-03-14 16:13:21.635733426 +0100
@@ -1,7 +1,7 @@
 (disarchive
   (version 0)
   (gzip-member
-    (name "webview-sys-0.6.2.tar.gz")
+    (name "rust-webview-sys-0.6.2.tar.gz")
     (digest
       (sha256
         "0001f025c1425ffe36270a81cb091eade87dd8d29ac773735ae47e1a8c8066c9"))
@@ -13,7 +13,7 @@
     (footer (crc 1807070134) (isize 121344))
     (compressor zlib-best)
     (input (tarball
-             (name "webview-sys-0.6.2.tar")
+             (name "rust-webview-sys-0.6.2.tar")
              (digest
                (sha256
                  "4fb18f3206838e11f7f8caba6fad9e0f796109428b502793b9f2f0613fe0f275"))
@@ -78,7 +78,7 @@
              (padding 0)
              (input (directory-ref
                       (version 0)
-                      (name "webview-sys-0.6.2")
+                      (name "rust-webview-sys-0.6.2")
                       (addresses
                         (swhid "swh:1:dir:fa41df38bf639ada28c900b0915661e787fe6d15"))
                       (digest


As Tim pointed out, Disarchive disassembly is not fully deterministic
and/or might change a bit over time as Disarchive evolves, and that’s
prolly what we’re seeing here.

The admins among us can see the remaining files in
/gnu/disarchive.ngyro.com on berlin.  That directory also contains two
files: ‘files-present-in-both-yet-different.txt’ and
‘files-that-were-missing.txt’.

Kudos to Timothy for making it possible.

Feedback welcome!

Ludo’.

¹ https://lists.gnu.org/archive/html/guix-devel/2021-10/msg00060.html

² Some of these showed up at disarchive.ngyro.com since I copied the
  database ~16h ago.  Example missing entry is “samplv1-0.9.24.tar.gz”:
  <https://disarchive.ngyro.com/sha256/ff0bfbaacfb514cb1a0194b0a43ca121f7679640a293f907fb1bbb2640d373b0>.


* Re: Disarchive database synchronization
From: Timothy Sample @ 2023-03-18 19:49 UTC
  To: Ludovic Courtès; +Cc: guix-devel, guix-sysadmin, Simon Tournier

Hey Ludo,

Ludovic Courtès <ludovic.courtes@inria.fr> writes:

> I copied over the 12K entries that were missing from
> disarchive.guix.gnu.org.  (Note that there are currently only two copies
> of the database: one at/in [bB]erlin, and one at/in [Bb]ordeaux.)
> disarchive.guix.gnu.org now weighs in at 1.8 GiB for 31,839 entries.

Wow – 12K!  For some reason I thought it would be fewer.  It’s very good
that we (finally) sync’d up the databases.

Also, my set is now at 31,821 after collecting the runoff from the
latest Preservation of Guix Report.  That’s shockingly close to the
31,839 you have.

> For the remaining entries, it’s trickier.  Sometimes it’s just the
> gzip compression parameters that differ, which could be addressed with a
> little bit more work:
>
> $ file ffdc77f5e5cb2390b9309de63eb7be68d9fe631e898f4da6c04a8159daefc2c0.gz ../../disarchive/sha256/ffdc77f5e5cb2390b9309de63eb7be68d9fe631e898f4da6c04a8159daefc2c0.gz
> ffdc77f5e5cb2390b9309de63eb7be68d9fe631e898f4da6c04a8159daefc2c0.gz:                         gzip compressed data, max compression, from Unix, original size modulo 2^32 446731
> ../../disarchive/sha256/ffdc77f5e5cb2390b9309de63eb7be68d9fe631e898f4da6c04a8159daefc2c0.gz: gzip compressed data, max speed, from Unix, original size modulo 2^32 446731

I’m not sure getting the compressed files to match matters.  Disarchive
cares a lot about that when it comes to source code tarballs, because
everybody signs and computes checksums over the compressed versions.
However, for these files, the differences introduced by compression can
be ignored.

> Sometimes it’s trickier:
>
> # diff -u <(gunzip -d < 0001f025c1425ffe36270a81cb091eade87dd8d29ac773735ae47e1a8c8066c9.gz) <(gunzip -d < ../../disarchive/sha256/0001f025c1425ffe36270a81cb091eade87dd8d29ac773735ae47e1a8c8066c9.gz)
> --- /dev/fd/63  2023-03-14 16:13:21.635733426 +0100
> +++ /dev/fd/62  2023-03-14 16:13:21.635733426 +0100
> @@ -1,7 +1,7 @@
>  (disarchive
>    (version 0)
>    (gzip-member
> -    (name "webview-sys-0.6.2.tar.gz")
> +    (name "rust-webview-sys-0.6.2.tar.gz")
>      (digest
>        (sha256
>          "0001f025c1425ffe36270a81cb091eade87dd8d29ac773735ae47e1a8c8066c9"))
> @@ -13,7 +13,7 @@
>      (footer (crc 1807070134) (isize 121344))
>      (compressor zlib-best)
>      (input (tarball
> -             (name "webview-sys-0.6.2.tar")
> +             (name "rust-webview-sys-0.6.2.tar")
>               (digest
>                 (sha256
>                   "4fb18f3206838e11f7f8caba6fad9e0f796109428b502793b9f2f0613fe0f275"))
> @@ -78,7 +78,7 @@
>               (padding 0)
>               (input (directory-ref
>                        (version 0)
> -                      (name "webview-sys-0.6.2")
> +                      (name "rust-webview-sys-0.6.2")
>                        (addresses
>                          (swhid "swh:1:dir:fa41df38bf639ada28c900b0915661e787fe6d15"))
>                        (digest

The name field is not used for data reconstruction.  It’s for human
consumption (and it may have made some early examples of use at the
command line easier to explain).  Here, the difference is based on the
fact that Crate URIs are weird, and the Preservation of Guix code does
not keep the origin file name.  Hence, the PoG version extracts the
Crate name alone from the URI, and the Cuirass version uses the Guix
package name with the “rust-” prefix.
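
A comparison that ignores these names could be sketched as follows.  This
is a rough illustration rather than anything Disarchive provides: the
tiny s-expression reader, the function names, and the choice to drop
"name" only directly under gzip-member, tarball, and directory-ref (the
three forms shown in the diff above) are all assumptions.  Any "name"
found elsewhere is kept, since it might matter for reconstruction.

```python
import re

# Tokens: parentheses, double-quoted strings, or bare atoms.
TOKEN = re.compile(r'\(|\)|"(?:[^"\\]|\\.)*"|[^\s()"]+')

# Forms whose direct (name ...) child is treated as informational only.
NAMED_FORMS = {"gzip-member", "tarball", "directory-ref"}


def parse(text):
    """Parse one s-expression into nested lists of string tokens."""
    tokens = TOKEN.findall(text)
    pos = 0

    def read():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            items = []
            while tokens[pos] != ")":
                items.append(read())
            pos += 1  # skip the closing parenthesis
            return items
        return tok

    return read()


def strip_names(node):
    """Return a copy of NODE without the informational (name ...) forms."""
    if not isinstance(node, list):
        return node
    drop = bool(node) and isinstance(node[0], str) and node[0] in NAMED_FORMS
    return [strip_names(child) for child in node
            if not (drop and isinstance(child, list)
                    and child and child[0] == "name")]


def equivalent(spec_a, spec_b):
    """Return True when two specifications differ at most in those names."""
    return strip_names(parse(spec_a)) == strip_names(parse(spec_b))
```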

> As Tim pointed out, Disarchive disassembly is not fully deterministic
> and/or might change a bit over time as Disarchive evolves, and that’s
> prolly what we’re seeing here.

I honestly think this is a good thing.  My instincts tell me that we
should excise all sources of ambiguity, like we’re trying to do in the
big picture.  However, Disarchive will get better at describing things
over time.  For instance, it doesn’t handle tar extension headers
elegantly at the moment.  In the future, if I fix this, I might consider
creating a “migrate” feature that improves existing specifications
(e.g., converting the old, verbose representation of extension headers
into the new representation).  In particular, I’ve left some warts in
the software in order to ship it, and I would be sad to try and commit
to those for the rest of time!

We might also add other resolver addresses besides SWHIDs....

Maybe I’m missing some perspective, but I don’t think trying to commit
to reproducible outputs for Disarchive makes sense.


-- Tim

P.S., we’ll have to do this dance again shortly, as I just computed
2,023 historical bzip2 specifications.  They’re not online yet, but
they’ll be up when I publish the next PoG report – which should take less
than a year this time!  :p



* Re: Disarchive database synchronization
From: Ludovic Courtès @ 2023-03-20  9:14 UTC
  To: Timothy Sample; +Cc: guix-devel, guix-sysadmin, Simon Tournier

Howdy Timothy!

Timothy Sample <samplet@ngyro.com> writes:

> Ludovic Courtès <ludovic.courtes@inria.fr> writes:

[...]

>> For the remaining entries, it’s trickier.  Sometimes it’s just the
>> gzip compression parameters that differ, which could be addressed with a
>> little bit more work:
>>
>> $ file ffdc77f5e5cb2390b9309de63eb7be68d9fe631e898f4da6c04a8159daefc2c0.gz ../../disarchive/sha256/ffdc77f5e5cb2390b9309de63eb7be68d9fe631e898f4da6c04a8159daefc2c0.gz
>> ffdc77f5e5cb2390b9309de63eb7be68d9fe631e898f4da6c04a8159daefc2c0.gz:                         gzip compressed data, max compression, from Unix, original size modulo 2^32 446731
>> ../../disarchive/sha256/ffdc77f5e5cb2390b9309de63eb7be68d9fe631e898f4da6c04a8159daefc2c0.gz: gzip compressed data, max speed, from Unix, original size modulo 2^32 446731
>
> I’m not sure getting the compressed files to match matters.

No, it certainly doesn’t matter; it’s just that it would have made it
easier to check for relevant differences between the two Disarchive
databases.

>> Sometimes it’s trickier:
>>
>> # diff -u <(gunzip -d < 0001f025c1425ffe36270a81cb091eade87dd8d29ac773735ae47e1a8c8066c9.gz) <(gunzip -d < ../../disarchive/sha256/0001f025c1425ffe36270a81cb091eade87dd8d29ac773735ae47e1a8c8066c9.gz)
>> --- /dev/fd/63  2023-03-14 16:13:21.635733426 +0100
>> +++ /dev/fd/62  2023-03-14 16:13:21.635733426 +0100
>> @@ -1,7 +1,7 @@
>>  (disarchive
>>    (version 0)
>>    (gzip-member
>> -    (name "webview-sys-0.6.2.tar.gz")
>> +    (name "rust-webview-sys-0.6.2.tar.gz")

[...]

> The name field is not used for data reconstruction.  It’s for human
> consumption (and it may have made some early examples of use at the
> command line easier to explain).  Here, the difference is based on the
> fact that Crate URIs are weird, and the Preservation of Guix code does
> not keep the origin file name.  Hence, the PoG version extracts the
> Crate name alone from the URI, and the Cuirass version uses the Guix
> package name with the “rust-” prefix.

OK.  Again I was looking at this from the perspective of determining
whether there were “relevant” differences between the two Disarchive
databases.  Looks like it would be quite some work to determine that
automatically.

>> As Tim pointed out, Disarchive disassembly is not fully deterministic
>> and/or might change a bit over time as Disarchive evolves, and that’s
>> prolly what we’re seeing here.
>
> I honestly think this is a good thing.  My instincts tell me that we
> should excise all sources of ambiguity, like we’re trying to do in the
> big picture.  However, Disarchive will get better at describing things
> over time.  For instance, it doesn’t handle tar extension headers
> elegantly at the moment.  In the future, if I fix this, I might consider
> creating a “migrate” feature that improves existing specifications
> (e.g., converting the old, verbose representation of extension headers
> into the new representation).  In particular, I’ve left some warts in
> the software in order to ship it, and I would be sad to try and commit
> to those for the rest of time!

That makes a lot of sense!

> We might also add other resolver addresses besides SWHIDs....
>
> Maybe I’m missing some perspective, but I don’t think trying to commit
> to reproducible outputs for Disarchive makes sense.

Yes, I feel the same.

> P.S., we’ll have to do this dance again shortly, as I just computed
> 2,023 historical bzip2 specifications.  They’re not online yet, but
> they’ll be up when I publish the next PoG report – which should take less
> than a year this time!  :p

Woow, bzip2!  I was just now looking at a concrete disappearing-tarball
issue that involves bzip2:

  https://issues.guix.gnu.org/62071#8

Thank you!

Ludo’.



* Re: Disarchive database synchronization
From: Simon Tournier @ 2023-04-03 15:07 UTC
  To: Timothy Sample, Ludovic Courtès; +Cc: guix-devel, guix-sysadmin

Hi,

On Sat., 18 Mar 2023 at 13:49, Timothy Sample <samplet@ngyro.com> wrote:

>>               (input (directory-ref
>>                        (version 0)
>> -                      (name "webview-sys-0.6.2")
>> +                      (name "rust-webview-sys-0.6.2")

[...]

>> As Tim pointed out, Disarchive disassembly is not fully deterministic
>> and/or might change a bit over time as Disarchive evolves, and that’s
>> prolly what we’re seeing here.
>
> I honestly think this is a good thing.  My instincts tell me that we
> should excise all sources of ambiguity, like we’re trying to do in the
> big picture.  However, Disarchive will get better at describing things
> over time.  For instance, it doesn’t handle tar extension headers
> elegantly at the moment.  In the future, if I fix this, I might consider
> creating a “migrate” feature that improves existing specifications
> (e.g., converting the old, verbose representation of extension headers
> into the new representation).  In particular, I’ve left some warts in
> the software in order to ship it, and I would be sad to try and commit
> to those for the rest of time!

How do we know that this “disarchive disassemble” will work with that
“disarchive assemble”?  Is it tracked by the ’version’ field above?

Well, “disarchive disassemble” outputs a specification, and the format of
this specification can change for various improvements.  However, this
specification format should be clearly specified, and for one version of
that specification, the output must be deterministic.  Or am I missing a
point?

Cheers,
simon

