unofficial mirror of guix-devel@gnu.org 
From: "Miguel Ángel Arruga Vivas" <rosen644835@gmail.com>
To: Julien Lepiller <julien@lepiller.eu>
Cc: guix-devel@gnu.org
Subject: Re: Identical files across subsequent package revisions
Date: Wed, 23 Dec 2020 20:07:40 +0100
Message-ID: <87v9cswdc3.fsf@gmail.com>
In-Reply-To: <077ECD6C-AB0D-4FEA-ABBA-82550834265E@lepiller.eu> (Julien Lepiller's message of "Wed, 23 Dec 2020 10:40:00 -0500")

Hi Julien and Simon,

Julien Lepiller <julien@lepiller.eu> writes:

> On 23 December 2020 09:07:23 GMT-05:00, zimoun <zimon.toutoune@gmail.com> wrote:
>>Hi,
>>
>>On Wed, 23 Dec 2020 at 14:10, Miguel Ángel Arruga Vivas
>><rosen644835@gmail.com> wrote:
>>> Another idea that might fit well into that kind of protocol---with
>>> a heavier impact on the design, and probably a high runtime
>>> cost---would be to "upgrade" the deduplication process towards a
>>> content-based file system, as Git does[2].  This way a description of
>>> the nar contents (size, hash) could trigger the retrieval of only the
>>> needed files not found in the current store.
>>
>>Is it not related to the Content-Addressed Store, i.e., the
>>«intensional model»?
>>
>>Chap. 6: <https://edolstra.github.io/pubs/phd-thesis.pdf>
>>Nix RFC:
>><https://github.com/tweag/rfcs/blob/cas-rfc/rfcs/0062-content-addressed-paths.md>
>
> I think this is different, because we're talking about sub-element
> content-addressing. The intensional model is about content-addressing
> whole store elements. I think the idea would be to save individual
> files in, say, /gnu/store/.links, and let nar or narinfo files
> describe the files to retrieve. If we are missing some, we'd download
> them, then create hardlinks. This could even help our deduplication I
> think :)

Exactly.  My first approach would be a tree %links-prefix/hash/size,
into which all the files inside each store item would be hard linked:
essentially Git's approach with a few adjustments here and there.  Some
of them are probably too optimistic; the first approach isn't usually
the best. :-)
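
To make the layout concrete, here is a minimal sketch of how a file
could be placed into such a tree.  Everything in it is an assumption
made for illustration: the function names, the choice of SHA-256, and
the hexadecimal encoding are not part of any existing Guix code, and
collision handling is deliberately left out.

```python
import hashlib
import os


def content_key(path, chunk_size=1 << 16):
    """Return (SHA-256 hex digest, size in bytes) for the file at `path`."""
    digest = hashlib.sha256()
    size = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return digest.hexdigest(), size


def link_into_tree(path, links_prefix):
    """Make `path` share an inode with <links_prefix>/<hash>/<size>.

    Returns the path of the entry in the links tree.  The layout is the
    hypothetical one from the message, not an existing Guix format.
    """
    file_hash, size = content_key(path)
    entry = os.path.join(links_prefix, file_hash, str(size))
    os.makedirs(os.path.dirname(entry), exist_ok=True)
    if not os.path.exists(entry):
        # First time this content is seen: the tree entry becomes
        # another name for the store file.
        os.link(path, entry)
    elif not os.path.samefile(path, entry):
        # Known content: swap the store file for a hard link to the
        # existing entry.  No collision check is done here.
        tmp = path + ".tmp-link"
        os.link(entry, tmp)
        os.replace(tmp, path)
    return entry
```

Calling link_into_tree on two files with identical contents leaves both
names and the tree entry pointing at a single inode.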

- The garbage collection process could check whether any other hard
  link to each of those files remains and remove the file otherwise,
  deleting the hash folder once it is empty.

- The deduplication process could be in charge of moving the files and
  placing hard links instead, but hash collisions between files of the
  same size are always a possibility, so some mechanism is needed to
  handle these cases[1] and the vulnerabilities associated with them.

- The substitution process could retrieve from the server the
  information about the needed files, check which contents are already
  available and which ones must be retrieved, and ensure that the
  "generated nar" is the same as the one from the server.  This is
  closely related to the deduplication process and the mechanism used
  there[2].
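
The garbage collection and substitution points above can be sketched as
follows.  This is only an illustration under stated assumptions: the
manifest format (a list of (relative path, hash, size) tuples) and both
function names are hypothetical, and the final check that the
regenerated nar matches the server's is not shown.

```python
import os


def collect_garbage(links_prefix):
    """Remove tree entries that no store item links to any more."""
    removed = []
    for file_hash in os.listdir(links_prefix):
        hash_dir = os.path.join(links_prefix, file_hash)
        for size in os.listdir(hash_dir):
            entry = os.path.join(hash_dir, size)
            # A link count of 1 means the tree holds the only name left.
            if os.stat(entry).st_nlink == 1:
                os.unlink(entry)
                removed.append(entry)
        # Delete the hash folder once nothing is left inside it.
        if not os.listdir(hash_dir):
            os.rmdir(hash_dir)
    return removed


def plan_substitution(manifest, links_prefix):
    """Split a manifest into files already present and files to fetch.

    `manifest` is a hypothetical list of (relative_path, hash, size)
    tuples describing the contents of one store item.
    """
    present, missing = [], []
    for rel_path, file_hash, size in manifest:
        entry = os.path.join(links_prefix, file_hash, str(size))
        (present if os.path.exists(entry) else missing).append(rel_path)
    return present, missing
```

Only the paths returned in the second list would need to be downloaded;
the others could be hard linked from the tree right away.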

Happy hacking!
Miguel

[1] Perhaps using two different hash algorithms instead of one, or
different salts, could be enough for the "common" error case, as a
collision under both at once is quite improbable.  Collisions remain
possible whenever the file size is bigger than the hash size, so a
final fallback to comparing the actual bytes is probably needed.
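
A rough sketch of that layered check, for two files that are both
available locally.  The pair of algorithms (SHA-256 and BLAKE2b) and
the function names are assumptions chosen for the example, not a
proposal for specific algorithms.

```python
import filecmp
import hashlib
import os


def digests(path, algorithms=("sha256", "blake2b"), chunk_size=1 << 16):
    """Hash the file once per algorithm while reading it a single time."""
    hashers = [hashlib.new(name) for name in algorithms]
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            for hasher in hashers:
                hasher.update(chunk)
    return tuple(hasher.hexdigest() for hasher in hashers)


def same_content(candidate, existing):
    """Decide whether two local files may share one hard-linked entry."""
    # Cheapest test first: different sizes can never match.
    if os.path.getsize(candidate) != os.path.getsize(existing):
        return False
    # Two independent digests make an accidental collision very unlikely.
    if digests(candidate) != digests(existing):
        return False
    # Final fallback: compare the actual bytes.
    return filecmp.cmp(candidate, existing, shallow=False)
```

Since the byte comparison alone already settles the question locally,
the digests act here as a cheap filter and as the keys of the tree.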

[2] The folder could be hashed, even with a temporary salt agreed with
the server, to perform an independent, real-time check.  Any issue
here has bigger consequences, though, as no byte-by-byte comparison is
possible before the actual transmission.
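
One possible reading of this check, sketched under several assumptions:
"the folder" is taken to be the directory of a store item, the salt is
simply fed to SHA-256 before the data, and only relative paths, sizes
and contents of regular files are covered (no permissions or symlink
targets, unlike a real nar).  The function name is hypothetical.

```python
import hashlib
import os


def salted_tree_digest(root, salt):
    """Hash the files under `root`, in a stable order, prefixed by `salt`."""
    digest = hashlib.sha256(salt)
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make the traversal order deterministic
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root).encode()
            size = os.path.getsize(path)
            # Length-prefix the name and the contents so that the
            # boundaries between fields are unambiguous.
            digest.update(len(rel).to_bytes(8, "big") + rel)
            digest.update(size.to_bytes(8, "big"))
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 16), b""):
                    digest.update(chunk)
    return digest.hexdigest()
```

Client and server would each compute the digest with the agreed salt
and compare the two results.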

