Deep vs Shallow trace: Removing the tradeoff?

unofficial mirror of guix-devel@gnu.org 
 help / color / mirror / code / Atom feed

* Deep vs Shallow trace: Removing the tradeoff?
@ 2021-03-27 16:56 ilmu
  2021-03-28  0:50 ` Julien Lepiller
  0 siblings, 1 reply; 8+ messages in thread
From: ilmu @ 2021-03-27 16:56 UTC (permalink / raw)
  To: guix-devel@gnu.org

Hi,

I had this idea while reading about authenticated datastructures, I would be very thankful if you could tell me why it doesn't work.

Bear in mind the following disclaimer as you read this:

My experience with these things is mostly theoretical, I have never used Bazel and although I was a user of Nix and am now moving to Guix I have not contributed to nixpkgs and only written simple expressions.

Without further ado.

The premise I am assuming is the framework introduced in Build Systems a la Carte. They talk about Bazel and Nix as representatives of two different dependency tracing strategies:

- Shallow tracing :: You hash immediate dependencies.
- Deep tracing :: You hash the whole transitive closure.

Now the tradeoff is basically the following:

- In Nix when you put a comment in curl you need to rebuild every single package in nixpkgs because they more or less all depend on curl in one way or another and therefore the curl source is in the transitive closure for almost every package.
- In Bazel when you put a comment in curl then the direct dependents need to be rebuilt but if they are the same as before after being rebuilt then the propagation is killed and nothing else needs to change.

However, in Bazel you will need to traverse the whole dependency tree all of the time to verify that everything is as it should be.

Now the idea I have is very simple:

We use recursive zero knowledge proofs with shallow traces, the rzkp caches the traversal and provides the same guarantee as the deep traces do (transitive closure is verified to be as it should be). Now if someone puts a comment in curl there is a small amount of packages that need to be rebuilt and then we redo only the proofs all the way up. This way we save ourselves a potentially massive amount of compilation.

As I said before I do not have much experience with the real implementations of these ideas so I am sure this is not as simple as it is in my head. However the distri experimental operating system (which implements a similar model to guix and nixos) does not put the hash in the store path but rather keeps a small metadata file for each path and then has a natural number suffix for the path of concurrent versions of the same package. This gives a better UX imho and is probably also easier to extend with more appropriate authenticated datastructures as they are discovered.

I hope I am not a raving madman and that this actually makes at least a slight amount of sense. Very much looking forward to takedowns :)

Kind regards,
- Ilmu

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Deep vs Shallow trace: Removing the tradeoff?
  2021-03-27 16:56 Deep vs Shallow trace: Removing the tradeoff? ilmu
@ 2021-03-28  0:50 ` Julien Lepiller
  2021-03-28 23:16   ` ilmu
  0 siblings, 1 reply; 8+ messages in thread
From: Julien Lepiller @ 2021-03-28  0:50 UTC (permalink / raw)
  To: ilmu, guix-devel@gnu.org

Le 27 mars 2021 12:56:08 GMT-04:00, ilmu <ilmu@rishi.is> a écrit :
>Hi,
>
>I had this idea while reading about authenticated datastructures, I
>would be very thankful if you could tell me why it doesn't work.
>
>Bear in mind the following disclaimer as you read this:
>
>My experience with these things is mostly theoretical, I have never
>used Bazel and although I was a user of Nix and am now moving to Guix I
>have not contributed to nixpkgs and only written simple expressions.
>
>Without further ado.
>
>
>The premise I am assuming is the framework introduced in Build Systems
>a la Carte. They talk about Bazel and Nix as representatives of two
>different dependency tracing strategies:
>
>- Shallow tracing :: You hash immediate dependencies.
>- Deep tracing :: You hash the whole transitive closure.
>
>Now the tradeoff is basically the following:
>
>- In Nix when you put a comment in curl you need to rebuild every
>single package in nixpkgs because they more or less all depend on curl
>in one way or another and therefore the curl source is in the
>transitive closure for almost every package.
>- In Bazel when you put a comment in curl then the direct dependents
>need to be rebuilt but if they are the same as before after being
>rebuilt then the propagation is killed and nothing else needs to
>change.
>
>However, in Bazel you will need to traverse the whole dependency tree
>all of the time to verify that everything is as it should be.

What I understand from the above is that their is not so much of a difference between Bazel and Nix/Guix. The "shallow" tracing *is* a deep tracing. It just records something different.

In Nix/Guix, we recursively compute the hash of packages from their inputs, receipe and other stuff. This hash is unrelated to the checksum of the result and changes as soon as one element changes. It essentially records the whole source dependency graph.

From what you write, Bazel records the checksum of inputs and receipe, so it essentially encodes the whole binary dependency graph. It's not "shallow", as a real change down the graph can have repercussions higher up.

Actually Eelco described something similar for Nix in his thesis, the intensionnal model. As you noticed, different receipes or inputs can lead to the same output (bitwise). The idea is to declare the expected checksum of the result, and use this to compute the store item name. As long as you apply changes that do not affect the output, different receipes (updates) can be used to generate the same item. So if you update a package, it might affect its output checksum, so you need to change it. Dependents might not be affected, in which case you can block update propagation there.

The question being how to monitor that packages need to be rebuilt or have the same hash? 

>
>
>Now the idea I have is very simple:
>
>We use recursive zero knowledge proofs with shallow traces, the rzkp
>caches the traversal and provides the same guarantee as the deep traces
>do (transitive closure is verified to be as it should be). Now if
>someone puts a comment in curl there is a small amount of packages that
>need to be rebuilt and then we redo only the proofs all the way up.
>This way we save ourselves a potentially massive amount of compilation.

I don't really understand why it must be zero-knowledge. My understanding is that to have a zero-knowledge proof, you first need a proof of the claim, then find a clever way to proove you have the proof, without disclosing it. Not sure it's helpful here. If you have a proof that you don't need to rebuild, why not share that proof to everyone in the first place?

>
>As I said before I do not have much experience with the real
>implementations of these ideas so I am sure this is not as simple as it
>is in my head. However the distri experimental operating system (which
>implements a similar model to guix and nixos) does not put the hash in
>the store path but rather keeps a small metadata file for each path and
>then has a natural number suffix for the path of concurrent versions of
>the same package. This gives a better UX imho and is probably also
>easier to extend with more appropriate authenticated datastructures as
>they are discovered.
>
>
>I hope I am not a raving madman and that this actually makes at least a
>slight amount of sense. Very much looking forward to takedowns :)
>
>
>Kind regards,
>- Ilmu

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Deep vs Shallow trace: Removing the tradeoff?
  2021-03-28  0:50 ` Julien Lepiller
@ 2021-03-28 23:16   ` ilmu
  2021-03-30 10:46     ` Ludovic Courtès
  2021-03-30 20:20     ` Bengt Richter
  0 siblings, 2 replies; 8+ messages in thread
From: ilmu @ 2021-03-28 23:16 UTC (permalink / raw)
  To: Julien Lepiller; +Cc: guix-devel@gnu.org

[-- Attachment #1.1: Type: text/plain, Size: 3800 bytes --]

First of all, thanks for engaging with me, sorry for not putting more work into the initial email. Note that I have attached inline images from the paper, the paper is also attached to this email.

> What I understand from the above is that there is not so much of a difference between Bazel and Nix/Guix. The "shallow" tracing is a deep tracing. It just records something different.

[classification.png]

The framework they present is that a build system can be classified based on how it decides whether a rebuild is necessary and how it schedules the necessary rebuilds (i.e. how it sorts the work to be done). They consider these orthogonal strategies a complete description of "what a build system is".

[constructive_traces.png]

The constructive trace allows you to keep intermediate artifacts but creates the extra overhead of managing these.

[deep_constructive_traces.png]

Since a deep constructive trace only looks at the terminal dependencies you have something of a merkle-dag. What I was suggesting is to be able to do "shallow builds" while still supporting "early cutoff". For reference:

[early_cutoff.png]

Early cutoff is a very desirable property that nix does not have (and I assume therefore that neither does guix).

[shallow_builds.png]

Basically, if we have a proof that the immediate dependencies are complete then we don't need to fetch anything deeper in the dependency tree.

> In Nix/Guix, we recursively compute the hash of packages from their inputs, receipe and other stuff. This hash is unrelated to the checksum of the result and changes as soon as one element changes. It essentially records the whole source dependency graph.
> From what you write, Bazel records the checksum of inputs and receipe, so it essentially encodes the whole binary dependency graph. It's not "shallow", as a real change down the graph can have repercussions higher up.
> Actually Eelco described something similar for Nix in his thesis, the intensionnal model. As you noticed, different receipes or inputs can lead to the same output (bitwise). The idea is to declare the expected checksum of the result, and use this to compute the store item name. As long as you apply changes that do not affect the output, different receipes (updates) can be used to generate the same item. So if you update a package, it might affect its output checksum, so you need to change it. Dependents might not be affected, in which case you can block update propagation there.
> The question being how to monitor that packages need to be rebuilt or have the same hash?

I think the pictures above should clarify, I should have looked at the paper again before sending my previous email, I read it quite a long time ago.. The property I was trying to recover in the nix-style build systems is this idea of an early cutoff. The strategy for achieving that is to use constructive traces with depth 1 (what I was calling "shallow traces") and then recursive succinct arguments of knowledge for retaining the shallow build property that you get from the "infinitely deep traces".

> I don't really understand why it must be zero-knowledge. My understanding is that to have a zero-knowledge proof, you first need a proof of the claim, then find a clever way to proove you have the proof, without disclosing it. Not sure it's helpful here. If you have a proof that you don't need to rebuild, why not share that proof to everyone in the first place?

Sure, you only need the succinctness property. I should have been more precise, what I am mainly after is resolving the tradeoff between shallow and deep traces.

I hope that this discussion can at least be helpful for people that are not familiar with these ideas, even if - as it seems from your response - the idea does not have merit.

Kind regards,
- ilmu

[-- Attachment #1.2.1: Type: text/html, Size: 4676 bytes --]

[-- Attachment #1.2.2: classification.png --]
[-- Type: image/png, Size: 22228 bytes --]

[-- Attachment #1.2.3: constructive_traces.png --]
[-- Type: image/png, Size: 72834 bytes --]

[-- Attachment #1.2.4: deep_constructive_traces.png --]
[-- Type: image/png, Size: 82885 bytes --]

[-- Attachment #1.2.5: early_cutoff.png --]
[-- Type: image/png, Size: 22288 bytes --]

[-- Attachment #1.2.6: shallow_builds.png --]
[-- Type: image/png, Size: 70537 bytes --]

[-- Attachment #2: build-systems.pdf --]
[-- Type: application/pdf, Size: 728559 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Deep vs Shallow trace: Removing the tradeoff?
  2021-03-28 23:16   ` ilmu
@ 2021-03-30 10:46     ` Ludovic Courtès
  2021-03-30 20:20     ` Bengt Richter
  1 sibling, 0 replies; 8+ messages in thread
From: Ludovic Courtès @ 2021-03-30 10:46 UTC (permalink / raw)
  To: ilmu; +Cc: guix-devel@gnu.org

Hi,

ilmu <ilmu@rishi.is> skribis:

> Early cutoff is a very desirable property that nix does not have (and I assume therefore that neither does guix).

The “intensional model” that Julien mentioned, or what the Nix folks now
refer to as “content-addressed derivations”, have this early cutoff
property.  It’s one of the main motivations for this whole endeavors.

This document shows how Nix folks are working on retrofitting
“content-addressed derivations” into Nix:

  https://github.com/tweag/rfcs/blob/cas-rfc/rfcs/0062-content-addressed-paths.md

Thanks,
Ludo’.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Deep vs Shallow trace: Removing the tradeoff?
  2021-03-28 23:16   ` ilmu
  2021-03-30 10:46     ` Ludovic Courtès
@ 2021-03-30 20:20     ` Bengt Richter
  2021-04-02 15:45       ` ilmu
  1 sibling, 1 reply; 8+ messages in thread
From: Bengt Richter @ 2021-03-30 20:20 UTC (permalink / raw)
  To: ilmu; +Cc: guix-devel@gnu.org

Hi ilmu,

On +2021-03-28 23:16:23 +0000, ilmu wrote:
> First of all, thanks for engaging with me, sorry for not putting more work into the initial email. Note that I have attached inline images from the paper, the paper is also attached to this email.
>

An interesting paper..

I have long been wondering if reproducibility is too exclusively focused
on the source transformation chain and the tools implementing that.

E.g., for programs that execute like side-effect-free functions,
why not an early-cutoff based on a functionally well-tested upgrade
written in a different language, with a totally different source?

If you source the following in a subshell

cat ./early-cutoff.txt
--8<---------------cut here---------------start------------->8---

export HELLO_TO=World

strace -qqxs120 -e write \
guile -c '(let* ((msg (string-append "Hello " (getenv "HELLO_TO") "\n"))) (display msg))' |& tr , $'\n'|grep '^ *"'

strace -qqxs120 -e write \
echo "Hello $HELLO_TO" |& tr , $'\n'|grep '^ *"'

--8<---------------cut here---------------end--------------->8---

Like,

(. early-cutoff.txt)
--8<---------------cut here---------------start------------->8---
 "Hello World\n"
 "Hello World\n"
--8<---------------cut here---------------end--------------->8---

It informally shows that in a particular context,

--8<---------------cut here---------------start------------->8---
guile -c '(let* ((msg (string-append "Hello " (getenv "HELLO_TO") "\n"))) (display msg))'
--8<---------------cut here---------------end--------------->8---

can substitute for
--8<---------------cut here---------------start------------->8---
echo "Hello $HELLO_TO"
--8<---------------cut here---------------end--------------->8---

So, what kinds of build sequences could be cut off early if you ignore how
you produced the code (i.e. just compile whatever source you like with whatever
tools you like) so long as the resultant code passed the functionality tests?

> > What I understand from the above is that there is not so much of a difference between Bazel and Nix/Guix. The "shallow" tracing is a deep tracing. It just records something different.
> 
> [classification.png]
> 
> The framework they present is that a build system can be classified based on how it decides whether a rebuild is necessary and how it schedules the necessary rebuilds (i.e. how it sorts the work to be done). They consider these orthogonal strategies a complete description of "what a build system is".
> 
> [constructive_traces.png]
> 
> The constructive trace allows you to keep intermediate artifacts but creates the extra overhead of managing these.
> 
> [deep_constructive_traces.png]
> 
> Since a deep constructive trace only looks at the terminal dependencies you have something of a merkle-dag. What I was suggesting is to be able to do "shallow builds" while still supporting "early cutoff". For reference:
> 
> [early_cutoff.png]
> 
> Early cutoff is a very desirable property that nix does not have (and I assume therefore that neither does guix).
> 
> [shallow_builds.png]
> 
> Basically, if we have a proof that the immediate dependencies are complete then we don't need to fetch anything deeper in the dependency tree.
> 
> > In Nix/Guix, we recursively compute the hash of packages from their inputs, receipe and other stuff. This hash is unrelated to the checksum of the result and changes as soon as one element changes. It essentially records the whole source dependency graph.
> > From what you write, Bazel records the checksum of inputs and receipe, so it essentially encodes the whole binary dependency graph. It's not "shallow", as a real change down the graph can have repercussions higher up.
> > Actually Eelco described something similar for Nix in his thesis, the intensionnal model. As you noticed, different receipes or inputs can lead to the same output (bitwise). The idea is to declare the expected checksum of the result, and use this to compute the store item name. As long as you apply changes that do not affect the output, different receipes (updates) can be used to generate the same item. So if you update a package, it might affect its output checksum, so you need to change it. Dependents might not be affected, in which case you can block update propagation there.
> > The question being how to monitor that packages need to be rebuilt or have the same hash?
> 
> I think the pictures above should clarify, I should have looked at the paper again before sending my previous email, I read it quite a long time ago.. The property I was trying to recover in the nix-style build systems is this idea of an early cutoff. The strategy for achieving that is to use constructive traces with depth 1 (what I was calling "shallow traces") and then recursive succinct arguments of knowledge for retaining the shallow build property that you get from the "infinitely deep traces".
> 
> > I don't really understand why it must be zero-knowledge. My understanding is that to have a zero-knowledge proof, you first need a proof of the claim, then find a clever way to proove you have the proof, without disclosing it. Not sure it's helpful here. If you have a proof that you don't need to rebuild, why not share that proof to everyone in the first place?
> 
> Sure, you only need the succinctness property. I should have been more precise, what I am mainly after is resolving the tradeoff between shallow and deep traces.
> 
> I hope that this discussion can at least be helpful for people that are not familiar with these ideas, even if - as it seems from your response - the idea does not have merit.
> 
> Kind regards,
> - ilmu

-- 
Regards,
Bengt Richter


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Deep vs Shallow trace: Removing the tradeoff?
  2021-03-30 20:20     ` Bengt Richter
@ 2021-04-02 15:45       ` ilmu
  2021-04-17 15:05         ` Ludovic Courtès
  0 siblings, 1 reply; 8+ messages in thread
From: ilmu @ 2021-04-02 15:45 UTC (permalink / raw)
  To: Bengt Richter; +Cc: guix-devel@gnu.org

> > Early cutoff is a very desirable property that nix does not have (and I assume therefore that neither does guix).

> The “intensional model” that Julien mentioned, or what the Nix folks now
> refer to as “content-addressed derivations”, have this early cutoff
> property.  It’s one of the main motivations for this whole endeavors.

> This document shows how Nix folks are working on retrofitting
> “content-addressed derivations” into Nix:

>   https://github.com/tweag/rfcs/blob/cas-rfc/rfcs/0062-content-addressed-paths.md

This looks to me to be massively more complicated and also needing stronger assumptions about how reproducible the builds are. However I am not even sure I understand the proposals that well...

Thank you for bringing this to my attention though, does guix have any plans to do something like this?

> I have long been wondering if reproducibility is too exclusively focused
> on the source transformation chain and the tools implementing that.
>
> E.g., for programs that execute like side-effect-free functions,
> why not an early-cutoff based on a functionally well-tested upgrade
> written in a different language, with a totally different source?
>
> If you source the following in a subshell
>
> cat ./early-cutoff.txt
> --8<---------------cut here---------------start------------->8---
>
> export HELLO_TO=World
>
> strace -qqxs120 -e write \
> guile -c '(let* ((msg (string-append "Hello " (getenv "HELLO_TO") "\n"))) (display msg))' |& tr , $'\n'|grep '^ *"'
>
> strace -qqxs120 -e write \
> echo "Hello $HELLO_TO" |& tr , $'\n'|grep '^ *"'
>
> --8<---------------cut here---------------end--------------->8---
>
> Like,
>
> (. early-cutoff.txt)
> --8<---------------cut here---------------start------------->8---
> "Hello World\n"
> "Hello World\n"
> --8<---------------cut here---------------end--------------->8---
>
> It informally shows that in a particular context,
>
> --8<---------------cut here---------------start------------->8---
> guile -c '(let* ((msg (string-append "Hello " (getenv "HELLO_TO") "\n"))) (display msg))'
> --8<---------------cut here---------------end--------------->8---
>
> can substitute for
> --8<---------------cut here---------------start------------->8---
> echo "Hello $HELLO_TO"
> --8<---------------cut here---------------end--------------->8---
>
> So, what kinds of build sequences could be cut off early if you ignore how
> you produced the code (i.e. just compile whatever source you like with whatever
> tools you like) so long as the resultant code passed the functionality tests?

Yes! Precisely, we should be able to switch out the implementations so long as we can prove that the result is the same. At the limit we are essentially saying that extensionally equal things can be swapped for one another. However, extensional equality is pretty much impossible to calculate in the general case (you'd need to prove it constructively) and of course your example does not have intensional equality (which is the whole point of switching out the dependency, to improve the implementation, i.e. break intensional equality while keeping extensional equality).

The equality that we can keep using the succinct arguments of knowledge is extensional because in the special case of build systems we can use a dependency at build time and then if the resulting artefact is unchanged then we can do the early cut-off. So we depend on the property that an artefact can be used independently once built, so for example two versions of a compiler may give the same binary and then updating a compiler will allow you to only rebuild transitive closures of programs that depended on changed behaviors between versions.

To drive the point home: If the artefact has the same hash as it did before then clearly the underlying change did not affect it. Early cut-off is therefore not possible in the case where the program being depended on is supposed to be used at runtime.

This is at least how I am thinking about it, I think this would be a good incremental improvement on the state of the art but it would still have these limitations.

Kind regards,
- ilmu

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Deep vs Shallow trace: Removing the tradeoff?
  2021-04-02 15:45       ` ilmu
@ 2021-04-17 15:05         ` Ludovic Courtès
  2021-04-18  5:51           ` ilmu
  0 siblings, 1 reply; 8+ messages in thread
From: Ludovic Courtès @ 2021-04-17 15:05 UTC (permalink / raw)
  To: ilmu; +Cc: guix-devel@gnu.org

Hi,

ilmu <ilmu@rishi.is> skribis:

> To drive the point home: If the artefact has the same hash as it did before then clearly the underlying change did not affect it. Early cut-off is therefore not possible in the case where the program being depended on is supposed to be used at runtime.
>
> This is at least how I am thinking about it, I think this would be a good incremental improvement on the state of the art but it would still have these limitations.

What’s currently implemented is memoization: the hash (of the .drv and
that of the output(s)) is computed over all the inputs.

Since the output file name (the one passed to “./configure --prefix” &
co.) appears in the build result 99% of the time, the content vary at
least due to occurrences of the output file name.

Thus, content comparison are not doable as is.  What would help is
content comparison _modulo_ output file name(s).

But again, “same hash as it did before” is a framing that doesn’t hold
in a purely functional, stateless context.

Ludo’.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Deep vs Shallow trace: Removing the tradeoff?
  2021-04-17 15:05         ` Ludovic Courtès
@ 2021-04-18  5:51           ` ilmu
  0 siblings, 0 replies; 8+ messages in thread
From: ilmu @ 2021-04-18  5:51 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: guix-devel@gnu.org

> What’s currently implemented is memoization: the hash (of the .drv and
> that of the output(s)) is computed over all the inputs.
>
> Since the output file name (the one passed to “./configure --prefix” &
> co.) appears in the build result 99% of the time, the content vary at
> least due to occurrences of the output file name.
>
> Thus, content comparison are not doable as is. What would help is
> content comparison modulo output file name(s).
>
> But again, “same hash as it did before” is a framing that doesn’t hold
> in a purely functional, stateless context.

Thank you for the clarification. Although, I must admit, I am not sure why the filename would change (except if it happens as a consequence of the hash being put in the filename).

I will need to spend the next few months actually reading the code and getting familiar with how things work "on the ground" before I can have anything useful to say. The reason I started this particular discussion was mostly to get a reality check whether the same methods apply in the build system context as in other authenticated datastructures (I don't see why not but I'm a layman, hence reality checks..)

In any case, glad to receive and answer. Keep up the good work! :)

Kind regards,
- Ilmu

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2021-04-18  5:51 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-27 16:56 Deep vs Shallow trace: Removing the tradeoff? ilmu
2021-03-28  0:50 ` Julien Lepiller
2021-03-28 23:16   ` ilmu
2021-03-30 10:46     ` Ludovic Courtès
2021-03-30 20:20     ` Bengt Richter
2021-04-02 15:45       ` ilmu
2021-04-17 15:05         ` Ludovic Courtès
2021-04-18  5:51           ` ilmu

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/guix.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).