unofficial mirror of help-guix@gnu.org 
* Architecture to reduce download time when pulling multiple packages
@ 2023-10-12  3:27 Josh Marshall
  2023-10-12  8:04 ` Christopher Baines
  0 siblings, 1 reply; 10+ messages in thread
From: Josh Marshall @ 2023-10-12  3:27 UTC (permalink / raw)
  To: help-guix

Presently, I am waiting until the end of global warming to finish
pulling down texlive packages.  I see that there are a few servers
from which packages are provided.  Is the following feasible as a
feature to improve effective download speed?

List the base information about what packages exist and where they are
located.
1) For each package identifier, list the locations from which it may be
obtained.

List the top-level status of each location, to drive the first level of
scheduling packages across servers.  Something simple and sane: prioritize
assigning each package to a unique location, and more specifically fetch
the packages which are available from the fewest locations first, so as
not to bottleneck later on.
2) For each location, track which package it is presently downloading.

It would be simpler to stop at step 2 for downloading, but we can do
better.  If we run into a situation where a package can be sourced from
multiple locations and those locations cannot each be given a unique
package (typically, more locations than packages), then downloading a
single package can be interleaved across multiple locations.
3) For each actively downloading package, list the locations actively
assigned to obtain data for it, alongside the information needed to
interleave the data coming in from each location.  (A sketch of steps 1
and 2 follows below.)
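
A minimal sketch of steps 1 and 2, with hypothetical package names and
mirror URLs, and a toy load metric (assignment count rather than measured
bandwidth):

    from collections import defaultdict

    def schedule(packages_to_locations):
        """Assign each package to one location, rarest-first, so that
        packages carried by the fewest mirrors are claimed before they
        can become a bottleneck."""
        busy = defaultdict(list)  # location -> packages assigned to it
        for pkg, locations in sorted(packages_to_locations.items(),
                                     key=lambda kv: len(kv[1])):
            # Among the locations carrying this package, pick the least
            # loaded one.
            best = min(locations, key=lambda loc: len(busy[loc]))
            busy[best].append(pkg)
        return dict(busy)

    # Example: the package held by only one mirror is scheduled first.
    plan = schedule({
        "texlive-base":  {"https://mirror-a.example"},
        "texlive-fonts": {"https://mirror-a.example",
                          "https://mirror-b.example"},
    })

Step 3 would then extend this by splitting a single package across its
assigned locations, e.g. with HTTP range requests, instead of assigning
it whole.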

If someone is willing to do a bit of mentoring, this might be a good
project to work on.  Any thoughts on this?  Is this re-hashing old
ground?



* Re: Architecture to reduce download time when pulling multiple packages
  2023-10-12  3:27 Architecture to reduce download time when pulling multiple packages Josh Marshall
@ 2023-10-12  8:04 ` Christopher Baines
  2023-10-12 18:09   ` Josh Marshall
  0 siblings, 1 reply; 10+ messages in thread
From: Christopher Baines @ 2023-10-12  8:04 UTC (permalink / raw)
  To: Josh Marshall; +Cc: help-guix

Josh Marshall <joshua.r.marshall.1991@gmail.com> writes:

> Presently, I am waiting until the end of global warming to finish
> pulling down texlive packages.  I see that there are a few servers
> from which packages are provided.  Is the following feasible as a
> feature to improve effective download speed?

It really depends. Can you give some examples of what you're downloading
and what speeds you're getting, and what speed your internet connection
can do in an optimal case (some file you've seen download the
quickest)?


* Re: Architecture to reduce download time when pulling multiple packages
  2023-10-12  8:04 ` Christopher Baines
@ 2023-10-12 18:09   ` Josh Marshall
  2023-10-13 10:14     ` Christopher Baines
  0 siblings, 1 reply; 10+ messages in thread
From: Josh Marshall @ 2023-10-12 18:09 UTC (permalink / raw)
  To: Christopher Baines; +Cc: help-guix

I just went and installed everything with texlive in the name.  A few were
less than 1KB, a few over 1GB.  Download speed peaked at less than 3MB/s.
My Internet is a fiber connection with symmetric gigabit.  However, I think
picking at this info is the wrong tack.  In my opinion this should be
considered in abstract, general terms, so that it stays clear of my
particular situation and applies to all use cases.

On Thu, Oct 12, 2023, 4:06 AM Christopher Baines <mail@cbaines.net> wrote:

>
> [...]
>
> It really depends. Can you give some examples of what you're downloading
> and what speeds you're getting, and what speed your internet connection
> can do in an optimal case (some file you've seen download the
> quickest)?
>


* Re: Architecture to reduce download time when pulling multiple packages
  2023-10-12 18:09   ` Josh Marshall
@ 2023-10-13 10:14     ` Christopher Baines
  2023-10-13 16:36       ` Josh Marshall
  0 siblings, 1 reply; 10+ messages in thread
From: Christopher Baines @ 2023-10-13 10:14 UTC (permalink / raw)
  To: Josh Marshall; +Cc: help-guix

Josh Marshall <joshua.r.marshall.1991@gmail.com> writes:

> I just went and installed everything with texlive in the name.  A few
> were less than 1KB, a few over 1GB.  Download speed peaked at less than
> 3MB/s.  My Internet is a fiber connection with symmetric gigabit.
> However, I think picking at this info is the wrong tack.  In my opinion
> this should be considered in abstract, general terms, so that it stays
> clear of my particular situation and applies to all use cases.

The limited data I've seen suggests the download speeds people get from
the substitute servers vary a lot depending on their internet connection,
peering, and location, so it's quite hard to consider this in an abstract
sense. Some people already get substitute download speeds that saturate
a gigabit connection, while some don't.

There has been work on setting up mirrors though, e.g. see [1] for some
information and data on how these work for different people.

1: https://lists.gnu.org/archive/html/guix-devel/2023-05/msg00290.html


* Re: Architecture to reduce download time when pulling multiple packages
  2023-10-13 10:14     ` Christopher Baines
@ 2023-10-13 16:36       ` Josh Marshall
  2023-10-13 18:05         ` Architecture to reduce download time when pulling multiple packages – historic success with magnet URLs, BTIHs, & Aria2c! JRHaigh+ML.GNU.Guix--- via
  0 siblings, 1 reply; 10+ messages in thread
From: Josh Marshall @ 2023-10-13 16:36 UTC (permalink / raw)
  To: help-guix

This is to parallelize connections, which should never hurt downloading
but can help.  Mirroring parallelizes providing packages; what I want to
implement is parallelizing obtaining packages.  Server side vs. client
side.
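
A minimal sketch of the client-side idea: fetch several substitutes at
once with a thread pool (the nar URLs and destination paths below are
placeholders):

    from concurrent.futures import ThreadPoolExecutor
    from urllib.request import urlretrieve

    # (url, destination) pairs; placeholder values for illustration only.
    downloads = [
        ("https://ci.guix.gnu.org/nar/example-1", "/tmp/example-1.nar"),
        ("https://bordeaux.guix.gnu.org/nar/example-2",
         "/tmp/example-2.nar"),
    ]

    # Fetch up to four packages concurrently; this is the client-side
    # counterpart of server-side mirroring.
    with ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(lambda d: urlretrieve(d[0], d[1]), downloads))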

On Fri, Oct 13, 2023 at 6:18 AM Christopher Baines <mail@cbaines.net> wrote:
>
> [...]
>
> The limited data I've seen suggests the download speeds people get from
> the substitute servers vary a lot depending on their internet connection,
> peering, and location, so it's quite hard to consider this in an abstract
> sense. Some people already get substitute download speeds that saturate
> a gigabit connection, while some don't.
>
> There has been work on setting up mirrors though, e.g. see [1] for some
> information and data on how these work for different people.
>
> 1: https://lists.gnu.org/archive/html/guix-devel/2023-05/msg00290.html



* Re: Architecture to reduce download time when pulling multiple packages – historic success with magnet URLs, BTIHs, & Aria2c!
  2023-10-13 16:36       ` Josh Marshall
@ 2023-10-13 18:05         ` JRHaigh+ML.GNU.Guix--- via
  2023-10-15 18:21           ` Josh Marshall
  0 siblings, 1 reply; 10+ messages in thread
From: JRHaigh+ML.GNU.Guix--- via @ 2023-10-13 18:05 UTC (permalink / raw)
  To: Josh Marshall; +Cc: help-guix



Hi Josh,

At Z-0400=2023-10-13Fri12:36:01, Josh Marshall sent:
> This is to parallelize connections, which should never hurt downloading but can help.  Mirroring parallelizes providing packages; what I want to implement is parallelizing obtaining packages.  Server side vs. client side.

	Please, if you are going to do something like this, use a torrent architecture like BitTorrent or GNUnet – I suggest Aria2c as a very good CLI download backend that can be daemonised and sent instructions over a socket to add, pause, and remove downloads, etc., and it supports magnet URLs that can also name the existing non-torrent servers (via ‘as’ parameters, iirc.).
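
	For concreteness: aria2 exposes a JSON-RPC interface when started with e.g. “aria2c --enable-rpc --rpc-secret=s3cret”.  A minimal sketch of driving it from Python, where the secret and the download URL are placeholders:

    import json
    from urllib.request import Request, urlopen

    RPC = "http://localhost:6800/jsonrpc"  # aria2's default RPC port
    SECRET = "token:s3cret"                # must match --rpc-secret

    def rpc(method, *params):
        """Call one aria2 JSON-RPC method and return its result."""
        payload = json.dumps({"jsonrpc": "2.0", "id": "q",
                              "method": method,
                              "params": [SECRET, *params]}).encode()
        request = Request(RPC, payload,
                          {"Content-Type": "application/json"})
        with urlopen(request) as response:
            return json.load(response)["result"]

    # Queue a download, poll its progress, pause it, then remove it.
    gid = rpc("aria2.addUri", ["https://mirror-a.example/foo.tar.xz"])
    print(rpc("aria2.tellStatus", gid,
              ["status", "completedLength", "totalLength"]))
    rpc("aria2.pause", gid)
    rpc("aria2.remove", gid)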

	I actually implemented this in a local copy of APT Daemon many years ago (circa 2011), but the change was not accepted upstream on Launchpad (I was not on the bleeding edge; I was too slow to keep up with upstream development).  My fork got forgotten about, because to get the full benefit the server would have had to add a BitTorrent Info Hash (BTIH) to the metadata of each package, along with the MD5, SHA-256, etc. that it already carried (not a big ask, really).  That said, even without the metadata needed for the full benefit, it provided an immediate benefit, and I used it for many years, not upgrading the Ubuntu 11.04 Natty Narwhal that I was using back then until I really had to.

	The immediate benefit it provided was exactly as you described: it allowed parallelisation of non-torrent downloads, be it from the same server or from multiple mirrors.  Iirc., I achieved this by simply passing the download list to Aria2c in daemon mode; I think I also converted all the HTTP URLs to ‘as’ parameters in magnet links, so that multiple mirrors could be passed using multiple ‘as’ parameters in each magnet link.  Then I simply relied on Aria2c being amazing at parallelising everything I had given it!  I then also implemented progress updates, so that APT Daemon could reflect where Aria2c was up to.

	The way I implemented this, using Aria2c and magnet URLs, meant that if additional hashes were known, they could be used as well.  So if the server metadata made the simple addition of BTIHs, swarming could occur, which in turn would massively reduce load on the central servers and allow anyone who wants to be a mirror to be one simply by seeding indefinitely.  A default share ratio of 1.0 means that no user is a burden on the network, unless they deliberately change it.  Users can donate to the running costs of the project simply by increasing their share ratio, which adds another means of contribution that they may find easier than the others.

	Anyone keen to keep old packages online can simply seed them indefinitely, so this is also really great for archival purposes.  Even if the central project loses interest in the old packages and deletes them, anyone else can keep them up.  The hashes ensure that they have not been tampered with.

	There is also a really cool benefit that occurs, or can occur, on a LAN.  An entire network of computers can all swarm locally with each other, so that each package needs to be downloaded through the metered last-mile bottleneck from the WAN precisely once – provided that local broadcasting is supported.  I think this requires Avahi, and I seem to remember that Aria2c supports it, but I am not sure.  I never got this bit working, but I also did not try hard, because it would have required metadata that I did not have until after download; so even if I had got it working, it would not have been directly useful unless the APT repositories I was using included the BTIHs.

	So yeah, loads of great benefits to this architecture, and I highly recommend it: convert all existing URLs to magnet links (can be done client-side, as I did, or server-side); optionally add any additional mirrors as additional ‘as’ parameters (again, client-side or server-side); add ‘btih’ parameters to the magnet links (the BTIH must be included in the server metadata to get the full benefit of the swarming, but conversion to magnet-link format can be done client-side or server-side); then simply pass all of this to a really good parallelising backend such as Aria2c; then update any progress data and relay pause, resume, cancel, etc. to the backend.
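
	A sketch of that conversion, assuming the BTIH and the mirror list come from hypothetical repository metadata (the hash below is a placeholder):

    from urllib.parse import quote

    def to_magnet(name, btih, mirror_urls):
        """Build a magnet URI: 'xt' carries the BitTorrent info-hash so
        clients can swarm, and each 'as' (Acceptable Source) parameter
        carries a plain HTTP(S) URL usable as a direct source."""
        parts = [f"xt=urn:btih:{btih}", f"dn={quote(name)}"]
        parts += [f"as={quote(url, safe='')}" for url in mirror_urls]
        return "magnet:?" + "&".join(parts)

    print(to_magnet("texlive-base-2023.tar.xz",
                    "0123456789abcdef0123456789abcdef01234567",
                    ["https://mirror-a.example/texlive-base-2023.tar.xz",
                     "https://mirror-b.example/texlive-base-2023.tar.xz"]))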

	One final note, as I am sure there are a lot of GNUnet fans on this list: I would try Aria2c first to see how well it can work, and then try GNUnet or whatever else once you have a standard to benchmark against.  Both are Free Software, so no concern there.  Aria2c is an all-round download-manager CLI that works with or without swarming, i.e. it is just as good at HTTPS as at BitTorrent, and can do both at the same time.  GNUnet has the advantage of working from SHA-256, iirc., which is generally already included in the metadata of the repositories of various distributions, but I think it lacks the breadth of features, stability, and ecosystem of alternative backends that the BitTorrent network has.

	Of course, there is no harm in including other hashes along with the BTIH, to allow people to experiment with alternative backends while always ensuring that what works works well.  Another hash that may be useful to include is the Tiger Tree Hash, which is structurally very similar to the BTIH, but stronger, iirc.

	The first thing that the Guix project can do to signal interest in this architecture is simply to include the BTIH of each package in the repository metadata.  Whether it is in magnet-URL form or not does not matter, because the client can convert it later as needed.  The important thing is an authoritative statement in the metadata that this version of this package has this BTIH.  Once that metadata is available, the game is on to implement swarming support, be it with Aria2c as a backend (as I recommend at least starting with) or otherwise.
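
	Computing a BTIH server-side is cheap: it is the SHA-1 of the bencoded ‘info’ dictionary.  A self-contained sketch for a single-file torrent follows; the piece size here is an arbitrary choice:

    import hashlib
    import os

    def bencode(obj):
        """Minimal bencoder for the types an 'info' dict uses."""
        if isinstance(obj, int):
            return b"i%de" % obj
        if isinstance(obj, str):
            obj = obj.encode()
        if isinstance(obj, bytes):
            return b"%d:%s" % (len(obj), obj)
        if isinstance(obj, dict):  # keys must be bencoded in sorted order
            return (b"d"
                    + b"".join(bencode(k) + bencode(v)
                               for k, v in sorted(obj.items()))
                    + b"e")
        raise TypeError(obj)

    def btih(path, piece_length=256 * 1024):
        """BitTorrent Info Hash (v1) of a single file."""
        data = open(path, "rb").read()
        pieces = b"".join(
            hashlib.sha1(data[i:i + piece_length]).digest()
            for i in range(0, len(data), piece_length))
        info = {"name": os.path.basename(path), "length": len(data),
                "piece length": piece_length, "pieces": pieces}
        return hashlib.sha1(bencode(info)).hexdigest()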

	I know that this architecture works well out of first-hand experience with APT Daemon written in Python.  The only failure I had with it was lack of upstream support.  So I consider it important to first attain the upstream approval before really investing more time into this.  I seem to remember suggesting this to the Nix project many years ago and didn't get anywhere, and now I don't have the energy to try to improve upstream projects if they reject my ideas, so I'll be interested to see whether you have any success with your attempt to do the same.

	Good luck! ;-)

Kind regards,
James.
-- 
Wealth doesn't bring happiness, but poverty brings sadness.
Sent from Debian with Claws Mail, using email subaddressing as an alternative to error-prone heuristical spam filtering.
Postal: James R. Haigh, Middle Farm, Vennington, nr. Westbury, nr. Shrewsbury, Salop, SY5 9RG, Britain


* Re: Architecture to reduce download time when pulling multiple packages – historic success with magnet URLs, BTIHs, & Aria2c!
  2023-10-13 18:05         ` Architecture to reduce download time when pulling multiple packages – historic success with magnet URLs, BTIHs, & Aria2c! JRHaigh+ML.GNU.Guix--- via
@ 2023-10-15 18:21           ` Josh Marshall
  2023-10-18  1:44             ` Josh Marshall
  2023-10-18  8:11             ` Christopher Baines
  0 siblings, 2 replies; 10+ messages in thread
From: Josh Marshall @ 2023-10-15 18:21 UTC (permalink / raw)
  To: James R. Haigh (+ML.GNU.Guix subaddress); +Cc: help-guix

So it sounds like my first step is to re-implement the downloads using
aria2c.  This would affect the minimum base package, no?  Can I get some
buy-in from maintainers that such changes are acceptable?

On Fri, Oct 13, 2023 at 2:06 PM James R. Haigh (+ML.GNU.Guix
subaddress) <JRHaigh+ML.GNU.Guix@runbox.com> wrote:
> [...]



* Re: Architecture to reduce download time when pulling multiple packages – historic success with magnet URLs, BTIHs, & Aria2c!
  2023-10-15 18:21           ` Josh Marshall
@ 2023-10-18  1:44             ` Josh Marshall
  2023-10-18  8:11             ` Christopher Baines
  1 sibling, 0 replies; 10+ messages in thread
From: Josh Marshall @ 2023-10-18  1:44 UTC (permalink / raw)
  To: James R. Haigh (+ML.GNU.Guix subaddress); +Cc: help-guix

How long is it traditional to wait before bumping a thread?

On Sun, Oct 15, 2023 at 2:21 PM Josh Marshall
<joshua.r.marshall.1991@gmail.com> wrote:
>
> So it sounds like my first step is to re-implement the downloads using
> aria2c.  This would affect the minimum base package, no?  Can I get some
> buy-in from maintainers that such changes are acceptable?
>
> [...]



* Re: Architecture to reduce download time when pulling multiple packages – historic success with magnet URLs, BTIHs, & Aria2c!
  2023-10-15 18:21           ` Josh Marshall
  2023-10-18  1:44             ` Josh Marshall
@ 2023-10-18  8:11             ` Christopher Baines
  2023-11-07 17:29               ` JRHaigh+ML.GNU.Guix--- via
  1 sibling, 1 reply; 10+ messages in thread
From: Christopher Baines @ 2023-10-18  8:11 UTC (permalink / raw)
  To: Josh Marshall; +Cc: James R. Haigh (+ML.GNU.Guix subaddress), help-guix

Josh Marshall <joshua.r.marshall.1991@gmail.com> writes:
> what I want to implement is parallelizing obtaining packages.  Server
> side vs. client side.

This is already possible: if you enable parallel build jobs, you'll get
parallel downloads.

There's also been work to allow controlling these things separately.

> So it sounds like my first step is to re-implement the downloads using
> aria2c.  This would affect the minimum base package, no?  Can I get some
> buy-in from maintainers that such changes are acceptable?

Reimplementing fetching files/substitutes using a tool like aria2c would
probably greatly complicate the bootstrap path.


* Re: Architecture to reduce download time when pulling multiple packages – historic success with magnet URLs, BTIHs, & Aria2c!
  2023-10-18  8:11             ` Christopher Baines
@ 2023-11-07 17:29               ` JRHaigh+ML.GNU.Guix--- via
  0 siblings, 0 replies; 10+ messages in thread
From: JRHaigh+ML.GNU.Guix--- via @ 2023-11-07 17:29 UTC (permalink / raw)
  To: Christopher Baines, help-guix; +Cc: Josh Marshall


Hi Christopher,

At Z+0100=2023-10-18Wed09:11:21, Christopher Baines sent:
> […]
> Reimplementing fetching files/substitutes using a tool like aria2c would probably greatly complicate the bootstrap path.

Why not make Aria2c an optional dependency?  It could be implemented such
that it is not required for bootstrapping, but can optionally be used for
package swarming thereafter.
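
A sketch of how such an optional backend could be wired up (in Python for
brevity, though Guix itself would do this in Guile; the fallback fetcher
and the aria2c flags are chosen only for illustration):

    import shutil
    import subprocess
    import urllib.request

    def fetch(url, filename):
        """Use aria2c when it happens to be installed; otherwise fall
        back to a built-in single-stream fetch, so that bootstrapping
        never needs aria2c."""
        if shutil.which("aria2c"):
            # -x: max connections per server; -s: split the download
            # across that many connections.
            subprocess.run(["aria2c", "-x", "4", "-s", "4",
                            "-o", filename, url], check=True)
        else:
            urllib.request.urlretrieve(url, filename)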

Kind regards,
James.
-- 
Wealth doesn't bring happiness, but poverty brings sadness.
Sent from Debian with Claws Mail, using email subaddressing as an alternative to error-prone heuristical spam filtering.
Postal: James R. Haigh, Middle Farm, Vennington, nr. Westbury, nr. Shrewsbury, Salop, SY5 9RG, Britain

