From mboxrd@z Thu Jan 1 00:00:00 1970
From: Chris Marusich
Subject: Re: Using a CDN or some other mirror?
Date: Sat, 08 Dec 2018 19:33:17 -0800
Message-ID: <87ftv7l6gy.fsf@gmail.com>
References: <20181203154335.10366-1-ludo@gnu.org> <87tvju6145.fsf@gnu.org>
Mime-Version: 1.0
Content-Type: multipart/signed; boundary="=-=-="; micalg=pgp-sha256; protocol="application/pgp-signature"
Return-path:
Received: from eggs.gnu.org ([2001:4830:134:3::10]:50766) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1gVpqc-0002gB-Jt for guix-devel@gnu.org; Sat, 08 Dec 2018 22:33:40 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1gVpqZ-0002Zd-BK for guix-devel@gnu.org; Sat, 08 Dec 2018 22:33:38 -0500
List-Id: "Development of GNU Guix and the GNU System distribution."
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Errors-To: guix-devel-bounces+gcggd-guix-devel=m.gmane.org@gnu.org
Sender: "Guix-devel"
To: Ludovic Courtès, Ricardo Wurmus, Hartmut Goebel, "Thompson, David", Meiyo Peng
Cc: guix-devel@gnu.org, 33600@debbugs.gnu.org

--=-=-=
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit

Hi everyone,

ludo@gnu.org (Ludovic Courtès) writes:

> Ludovic Courtès wrote:
>
> [...] I’m thinking about using a similar setup, but hosting the mirror
> on some Big Corp CDN or similar. Chris Marusich came up with a setup
> along these lines a while back:
>
>   https://lists.gnu.org/archive/html/guix-devel/2016-03/msg00312.html
>
> Compared to Chris’s setup, given that ‘guix publish’ now provides
> ‘Cache-Control’ headers (that was not the case back then, see
> ),
> caching in the proxy should Just Work.
>
> I would like us to set up such a mirror for berlin and then have
> ci.guix.info point to that. The project should be able to pay the
> hosting fees.
>
> Thoughts?

Regarding DNS, it would be nice if we could use an official GNU
subdomain. If we can't use a GNU subdomain, we should at least make
sure we have some kind of DNS auto-renewal set up so that nobody can
poach our domain names. And the operators should take appropriate
precautions when sharing any credentials used for managing it all.

Regarding CDNs, I definitely think it's worth a try! Even Debian is
using CloudFront (cloudfront.debian.net). In fact, email
correspondence suggests that as of 2013, Amazon may even have been
paying for it!

  https://lists.debian.org/debian-cloud/2013/05/msg00071.html

I wonder if Amazon would be willing to pay for our CloudFront
distribution if we asked them nicely?

In any case, before deciding to use Amazon CloudFront for ci.guix.info,
it would be prudent to estimate the cost. CloudFront, like most Amazon
AWS services, is a "pay for what you use" model. The pricing is here:

  https://aws.amazon.com/cloudfront/pricing

To accurately estimate the cost, we need to know how many requests we
expect to receive, and how many bytes we expect to transfer out, during
a single month. Do we have information like this for berlin today?

Although I don't doubt that a CDN will perform better than what we have
now, I do think it would be good to measure the performance so that we
know for sure the money spent is actually providing a benefit. It would
be nice to have some data before and after to measure how availability
and performance have changed. Apart from anecdotes, what data do we
have to determine whether performance has improved after introducing a
CDN? For example, the following information could be useful:

  * Network load on the origin server(s)
  * Clients' latency to (the addresses pointed to by) ci.guix.info
  * Clients' throughput while downloading substitutes from ci.guix.info

We don't log or collect client metrics, and that's fine.
It could be useful to add code to Guix to measure things like this when
the user asks to do so, but perhaps it isn't necessary. It may be good
enough if people just volunteer to manually gather some information and
share it. For example, you can define a shell function like this:

--8<---------------cut here---------------start------------->8---
# Download the URL given as $1, discard the body, and print curl's
# transfer and timing metrics, one per line.
measure_get () {
    curl -L \
         -o /dev/null \
         -w "url_effective: %{url_effective}\\n\
http_code: %{http_code}\\n\
num_connects: %{num_connects}\\n\
num_redirects: %{num_redirects}\\n\
remote_ip: %{remote_ip}\\n\
remote_port: %{remote_port}\\n\
size_download: %{size_download} B\\n\
speed_download: %{speed_download} B/s\\n\
time_appconnect: %{time_appconnect} s\\n\
time_connect: %{time_connect} s\\n\
time_namelookup: %{time_namelookup} s\\n\
time_pretransfer: %{time_pretransfer} s\\n\
time_redirect: %{time_redirect} s\\n\
time_starttransfer: %{time_starttransfer} s\\n\
time_total: %{time_total} s\\n" \
         "$1"
}
--8<---------------cut here---------------end--------------->8---

See "man curl" for the meaning of each metric.

You can then use this function to measure a substitute download.
Here's an example in which I download a large substitute (linux-libre)
from one of my machines in Seattle:

--8<---------------cut here---------------start------------->8---
$ measure_get https://berlin.guixsd.org/nar/gzip/1bq783rbkzv9z9zdhivbvfzhsz2s5yac-linux-libre-4.19 2>/dev/null
url_effective: https://berlin.guixsd.org/nar/gzip/1bq783rbkzv9z9zdhivbvfzhsz2s5yac-linux-libre-4.19
http_code: 200
num_connects: 1
num_redirects: 0
remote_ip: 141.80.181.40
remote_port: 443
size_download: 69899433 B
speed_download: 4945831.000 B/s
time_appconnect: 0.885277 s
time_connect: 0.459667 s
time_namelookup: 0.254210 s
time_pretransfer: 0.885478 s
time_redirect: 0.000000 s
time_starttransfer: 1.273994 s
time_total: 14.133584 s
$ 
--8<---------------cut here---------------end--------------->8---

Here, it took 0.459667 - 0.254210 = 0.205457 seconds (about 205 ms) to
establish the TCP connection after the DNS lookup. The average
throughput was 4945831 bytes per second (about 40 megabits per second,
where 1 megabit = 10^6 bits). It seems my connection to berlin is
already pretty good!

We can get more information about latency by using a tool like mtr:

--8<---------------cut here---------------start------------->8---
$ sudo mtr -c 10 --report-wide --tcp -P 443 berlin.guixsd.org
Start: 2018-12-08T16:57:40-0800
HOST: localhost.localdomain  Loss%   Snt   Last   Avg  Best  Wrst StDev
[... I've omitted the intermediate hops because they aren't relevant ...]
 13.|-- 141.80.181.40         0.0%    10  205.0 201.9 194.9 212.8   5.6
--8<---------------cut here---------------end--------------->8---

My machine's latency to berlin is about 202 ms, which matches what we
calculated above.

For experimentation, I've set up a CloudFront distribution at
berlin-mirror.marusich.info that uses berlin.guixsd.org as its origin
server.
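
As an aside, the arithmetic applied to the berlin numbers above (TCP
connect time as time_connect minus time_namelookup, and throughput
converted to megabits per second) could be scripted so that volunteers
report comparable figures. The following is only a minimal sketch, not
part of Guix or curl: "summarize_metrics" is a hypothetical helper name,
and it assumes you pass it the time_namelookup, time_connect, and
speed_download values printed by measure_get, in that order. It uses
1 megabit = 10^6 bits, as above.

```shell
# Hypothetical helper (not part of Guix or curl).
# $1 = time_namelookup (s), $2 = time_connect (s), $3 = speed_download (B/s)
summarize_metrics () {
    LC_ALL=C awk -v lookup="$1" -v connect="$2" -v speed="$3" 'BEGIN {
        # TCP handshake time: connect timer minus DNS lookup timer, in ms.
        printf "tcp_connect_ms: %.0f\n", (connect - lookup) * 1000
        # Bytes per second to megabits per second (1 megabit = 10^6 bits).
        printf "throughput_mbps: %.1f\n", speed * 8 / 1000000
    }'
}
```

For the berlin download above, "summarize_metrics 0.254210 0.459667
4945831" reports a connect time of 205 ms and a throughput of 39.6
megabits per second.
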
Let's repeat these steps to measure the performance of the
distribution from my machine's perspective (before I did this, I made
sure the GET would result in a cache hit by downloading the substitute
once before and verifying that the same remote IP address was used):

--8<---------------cut here---------------start------------->8---
$ measure_get https://berlin-mirror.marusich.info/nar/gzip/1bq783rbkzv9z9zdhivbvfzhsz2s5yac-linux-libre-4.19 2>/dev/null
url_effective: https://berlin-mirror.marusich.info/nar/gzip/1bq783rbkzv9z9zdhivbvfzhsz2s5yac-linux-libre-4.19
http_code: 200
num_connects: 1
num_redirects: 0
remote_ip: 13.32.254.57
remote_port: 443
size_download: 69899433 B
speed_download: 9821474.000 B/s
time_appconnect: 0.607593 s
time_connect: 0.532417 s
time_namelookup: 0.511086 s
time_pretransfer: 0.608029 s
time_redirect: 0.000000 s
time_starttransfer: 0.663578 s
time_total: 7.117266 s
$ sudo mtr -c 10 --report-wide --tcp -P 443 berlin-mirror.marusich.info
Start: 2018-12-08T17:04:48-0800
HOST: localhost.localdomain                         Loss%   Snt   Last   Avg  Best  Wrst StDev
[... I've omitted the intermediate hops because they aren't relevant ...]
 14.|-- server-52-84-21-199.sea32.r.cloudfront.net  0.0%    10   19.8  20.3  14.3  28.9   4.9
--8<---------------cut here---------------end--------------->8---

Establishing the TCP connection took about 21 ms (which matches the mtr
output), and the throughput was about 79 megabits per second. (On this
machine, 100 Mbps is the current link speed, according to dmesg output.)
This means that in my case, when using CloudFront the latency is 10x
lower, and the throughput (for a cache hit) is 2x higher, than using
berlin.guixsd.org directly!

It would be interesting to see what the performance is for others.

Ricardo Wurmus writes:

> Large ISPs also provide CDN services. I already contacted Deutsche
> Telekom so that we can compare their CDN offer with the Amazon CloudFront
> setup that Chris has configured.

That's great!
There are many CDN services out there. I am unfamiliar with most of
them. It will be good to see how Deutsche Telekom's offering compares
to CloudFront.

FYI, CloudFront has edge locations in the following parts of the world:

  https://aws.amazon.com/cloudfront/features/

Hartmut Goebel writes:

> On 03.12.2018 at 17:12, Ludovic Courtès wrote:
>> Thus, I’m thinking about using a similar setup, but hosting the mirror
>> on some Big Corp CDN or similar.
>
> Isn't this a contradiction: Building a free infrastructure relying on
> servers from some Big Corporation? Let alone the privacy concerns
> arising when delivering data via some Big Corporation.
>
> If delivering "packages" works via static data without requiring any
> additional service, we could ask universities to host Guix, too. IMHO
> this is a much preferred solution since this is a decentralized publish
> infrastructure already in place for many GNU/Linux distributions.

I understand your concern about using a third-party service. However,
we wouldn't be using a CDN as a "software substitute", which is one of
the primary risks of using a web service today:

  https://www.gnu.org/philosophy/who-does-that-server-really-serve.html

Instead, we would be using a CDN as a performance optimization that is
transparent to a Guix user. You seem unsettled by the idea of
entrusting any part of substitute delivery to a third party, but
concretely what risks do you foresee?

Regarding your suggestion to ask universities to host mirrors (really,
caching proxies), I think it could be a good idea. As Leo mentioned,
the configuration to set up an NGINX caching proxy of Hydra (or berlin)
is freely available in maintenance.git. Do you think we could convince
some universities to host caching proxies that just run an NGINX web
server using those configurations?

If we can accomplish that, it may still be helpful. If there is
interest in going down this path, I can explore some possibilities in
the Seattle area.
If the university-owned caching proxies are easily discoverable (i.e.,
we list them on the website), then users might manually set their
substitute URL to point to one that's close by.

Going further, if our DNS provider supports something like "geolocation
routing" for DNS queries, we might even be able to create DNS records
for ci.guix.info that point to those universities' caching proxies. In
this way, when a user resolves ci.guix.info, they would get the address
of a university-owned caching proxy close by. This could have the
benefits of requiring less money than a full-fledged CDN like Amazon
CloudFront, and also decentralizing the substitute delivery, while
still remaining transparent to Guix users. However, it would still
require us to rely on a third-party DNS service.

For example, Amazon Route 53 provides this sort of geolocation routing:

  https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/routing-policy.html#routing-policy-geo

I wouldn't be surprised if there are other DNS providers out there who
offer something similar. However, I also wouldn't be surprised if the
overall performance of CloudFront turns out to be better.

"Thompson, David" writes:

> If AWS CloudFront is the path chosen, it may be worthwhile to follow
> the "infrastructure as code" practice and use CloudFormation to
> provision the CloudFront distribution and any other supporting
> resources. The benefit is that there would be a record of exactly
> *how* the project is using these commercial services and the setup
> could be easily reproduced. The timing is interesting here because I
> just attended the annual AWS conference on behalf of my employer and
> while I was there I felt inspired to write a Guile API for building
> CloudFormation "stacks". You can see a small sample of what it does
> here: https://gist.github.com/davexunit/db4b9d3e67902216fbdbc66cd9c6413e

Nice! That seems useful. I will have to play with it.
I created my distributions manually using the AWS Management Console,
since it's relatively easy to do. I agree it would be better to
practice "infrastructure as code." On that topic, I've also heard good
things about Terraform by HashiCorp, which is available under the
Mozilla Public License 2.0:

  https://github.com/hashicorp/terraform

Here is a comparison of Terraform and CloudFormation:

  https://www.terraform.io/intro/vs/cloudformation.html

I looked briefly into packaging Terraform for Guix. It's written in
Go. It seems possible, but I haven't invested enough time yet.

As a final option, since the AWS CLI is already packaged in Guix, we
could just drive CloudFormation or CloudFront directly from the CLI.

Meiyo Peng writes:

> I like the idea of IPFS. We should try it. It would be great if it works
> well.
>
> If at some point we need to set up traditional mirrors like other major
> GNU/Linux distros, I can contact my friends in China to set up mirrors in
> several universities. I was a member of LUG@USTC, which provides the
> largest FLOSS mirror in China.

IPFS would be neat. So would GNUnet. Heck, even a publication
mechanism using good old BitTorrent would be nice. All of these would
require changes to Guix, I suppose. A CDN would require no changes to
Guix, and that's part of why it's so appealing.

-- 
Chris

--=-=-=
Content-Type: application/pgp-signature; name="signature.asc"

-----BEGIN PGP SIGNATURE-----

iQIzBAEBCAAdFiEEy/WXVcvn5+/vGD+x3UCaFdgiRp0FAlwMjP0ACgkQ3UCaFdgi
Rp0CrQ//fg4wLWX9FsZ4N03FuXwKscGFwUjqHic1cxXv9bjygDsHLwzxVaKxYf9t
IZ93XHdBwSmsFPn+CUO4jhJ0MetQUumJmUr6XZRLEmS70XBuCh1iHibVj7YhPf3a
+jHCmbkjVIHTY2VOiN3tR/pFcmtehn4Uma6ZZ4PZgaRNoxDxLiFEM/Fmv0v+mslV
ZqGyQmiEC97I67XMFkLvPmB2p0yfJD/oFRtEd0saMLPuFsdrSk2g+OADiUrKIJr+
GRgBsBAP9VLqUfzlM5K6GqVQO9917bB93OnCMQz89ak47GQ27FiZZs3toK0QplyF
YhXYPofOvb/zyy1qdrlSlhuXUjQXs/v6sa3TMxi1d0XCLcOR551qHie/tu8nQiZU
cUxINBrRwkKngjaR0cEKfjLP8cAYoweuljyfe+LqIjFny+vQLEOipLOIzZBir3j3
qvl9rjy3McZkQoTy38+fS1ti8jj5TDzpwugpPxU3KpSW/HmZCE+O+IOGaDctWC4m
ElWnnQNX6+INA7jl4RbYUqSobOV+OHZw1GZXO4YelUMVUBLEftmWXY3Pw/CrSDbY
cz5v/WJNVS/Kzc0clsKyBG3OgvAPhXdvmKZi6cQfOgwGUfvCIJC9ONnT81qMIHnB
LN5ZSaMduWwkXDxYCyJBVRG+AZEHN0F46eLttsALTkEdrVTbCCA=
=H3rW
-----END PGP SIGNATURE-----
--=-=-=--