unofficial mirror of guix-devel@gnu.org 
* Packaging big generated data files?
@ 2022-12-07 10:33 Denis 'GNUtoo' Carikli
  2022-12-07 14:45 ` pelzflorian (Florian Pelz)
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Denis 'GNUtoo' Carikli @ 2022-12-07 10:33 UTC
  To: Guix-devel

Hi,

Are there any policies or past decisions of the Guix project on
packaging big generated data files?

I've added packages for software like kiwix-tools and navit that both
work offline but that also need data files to be useful.

Navit is (car) navigation software that needs maps. The maps can be
generated from OpenStreetMap dumps with maptool[1], a tool in the Navit
source code that is not packaged yet. Binary map files can also be
downloaded directly from various sources.

Right now the biggest file possible for such maps is about 47 GiB
(for the whole planet).

As for kiwix-tools, it can serve offline versions of websites like
Wikipedia, and it too needs data files to work. The biggest file seems
to be the complete English Wikipedia with scaled-down pictures[2], at
about 89 GiB. I haven't looked yet at how these files are generated,
but I guess that they can somehow be generated from Wikipedia dumps.

Packaging the binary files (without generating them) can be useful
because it simplifies maintenance a lot: updating a file only requires
updating the package version and checksum. It also keeps the
information (download URL, checksum, license) in one place and makes
the files easy to reuse in Guix services and/or configuration files.

If these files were generated in packages, it would also become
possible to tweak the data, for instance by adding height data to navit
maps. For kiwix-compatible files, it would probably allow deciding when
to make the snapshots, or packaging additional wikis (like the
Libreplanet Wiki) or websites.

The issue here is probably the size of the generated files: they are
huge, so if they are packaged, they will most likely take significant
resources in the Guix infrastructure.

So what would be the way to go here? Would Guix accept patches to add
packages for these files in Guix proper?  

If so, does it need to be done like with the ZFS (kernel module)
package, where "#:substitutable? #f" is used to avoid redistributing
package builds? Or are other ways better for such use cases?

Note that so far I've only packaged kiwix-compatible files for various
wikis locally, by just downloading already prepared files. I haven't
looked yet into navit maps or into generating any of these files, so I
might be missing some details about generating them.

References:
-----------
[1]https://navit.readthedocs.io/en/latest/maps.html#processing-osm-maps-yourself
[2]https://mirror.download.kiwix.org/zim/wikipedia/wikipedia_en_all_maxi_2022-05.zim

Denis.


* Re: Packaging big generated data files?
From: pelzflorian (Florian Pelz) @ 2022-12-07 14:45 UTC
  To: Denis 'GNUtoo' Carikli; +Cc: Guix-devel

Denis 'GNUtoo' Carikli <GNUtoo@cyberdimension.org> writes:
> Are there any policies or past decisions of the Guix project on
> packaging big generated data files?

commit 183db725a4e7ef6a0ae5170bfa0967bb2eafded7
Author: Ricardo Wurmus <rekado@elephly.net>
Date:   Tue May 15 12:55:27 2018 +0200

    gnu: Add r-bsgenome-dmelanogaster-ucsc-dm6.
    
    * gnu/packages/bioconductor.scm (r-bsgenome-dmelanogaster-ucsc-dm6): New variable.

HTH.

Regards,
Florian



* Re: Packaging big generated data files?
From: Csepp @ 2022-12-08 13:46 UTC
  To: Denis 'GNUtoo' Carikli; +Cc: guix-devel


Denis 'GNUtoo' Carikli <GNUtoo@cyberdimension.org> writes:

> As for kiwix-tools, it can serve offline versions of websites like
> Wikipedia, and it too needs data files to work. The biggest file seems
> to be the complete English Wikipedia with scaled-down pictures[2], at
> about 89 GiB. I haven't looked yet at how these files are generated,
> but I guess that they can somehow be generated from Wikipedia dumps.

Could ZIM files be downloaded over bittorrent as fixed output
derivations?  They can be pretty huge.  Also if the system started
seeding them as well, that would be pretty cool.



* Re: Packaging big generated data files?
From: Denis 'GNUtoo' Carikli @ 2022-12-10 17:08 UTC
  To: pelzflorian (Florian Pelz); +Cc: Guix-devel

On Wed, 07 Dec 2022 15:45:01 +0100
"pelzflorian (Florian Pelz)" <pelzflorian@pelzflorian.de> wrote:

> Denis 'GNUtoo' Carikli <GNUtoo@cyberdimension.org> writes:
> > Are there any policies or past decisions of the Guix project on
> > packaging big generated data files?
> 
> commit 183db725a4e7ef6a0ae5170bfa0967bb2eafded7
> Author: Ricardo Wurmus <rekado@elephly.net>
> Date:   Tue May 15 12:55:27 2018 +0200
> 
>     gnu: Add r-bsgenome-dmelanogaster-ucsc-dm6.
>     
>     * gnu/packages/bioconductor.scm
> (r-bsgenome-dmelanogaster-ucsc-dm6): New variable.
Thanks.

So I assume that we could do something like that for now and later on
see if it makes sense to generate the files.

Denis.


* Re: Packaging big generated data files?
From: Denis 'GNUtoo' Carikli @ 2022-12-10 17:19 UTC
  To: Csepp; +Cc: guix-devel

On Thu, 08 Dec 2022 14:46:51 +0100
Csepp <raingloom@riseup.net> wrote:
> Could ZIM files be downloaded over bittorrent as fixed output
> derivations?  They can be pretty huge.  Also if the system started
> seeding them as well, that would be pretty cool.
I've no idea how to generate fixed output derivations.

As for BitTorrent, the ZIM files provided by kiwix can be downloaded
over it. As for using that in packages, all I found in Guix (besides
packages) was a Transmission service and its associated test(s). So I
guess that support would need to be added.

Denis.


* Re: Packaging big generated data files?
From: Ludovic Courtès @ 2022-12-11 10:16 UTC
  To: Denis 'GNUtoo' Carikli; +Cc: Csepp, guix-devel

Hi,

Denis 'GNUtoo' Carikli <GNUtoo@cyberdimension.org> skribis:

> On Thu, 08 Dec 2022 14:46:51 +0100
> Csepp <raingloom@riseup.net> wrote:
>> Could ZIM files be downloaded over bittorrent as fixed output
>> derivations?  They can be pretty huge.  Also if the system started
>> seeding them as well, that would be pretty cool.
> I've no idea how to generate fixed output derivations.

Origins are lowered into “fixed-output derivations”.  They’re
“fixed-output” because their content hash is known in advance, and thus,
the method you used to produce them doesn’t matter (info "(guix)
Derivations").

So you could specify an origin with ‘bittorrent-fetch’ (to be written)
instead of ‘url-fetch’.
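
A minimal sketch of such an origin, assuming the ZIM URL given as
reference [2] in the first message. The variable name is hypothetical,
the hash is a placeholder rather than the real checksum, and since
'bittorrent-fetch' does not exist yet, 'url-fetch' is used:

```scheme
(use-modules (guix packages)    ; origin, base32
             (guix download))   ; url-fetch

;; Hypothetical variable name, for illustration only.
(define wikipedia-zim-origin
  (origin
    ;; A future 'bittorrent-fetch' method would be named here instead.
    (method url-fetch)
    (uri (string-append "https://mirror.download.kiwix.org/zim/wikipedia/"
                        "wikipedia_en_all_maxi_2022-05.zim"))
    ;; Placeholder hash: replace it with the value printed by
    ;; 'guix hash FILE' for the downloaded file.
    (sha256
     (base32 "0000000000000000000000000000000000000000000000000000"))))
```

Because the expected hash is part of the origin, the result is
identified by its content, whichever download method produced it.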

HTH,
Ludo’.



* Re: Packaging big generated data files?
From: zimoun @ 2022-12-12 13:51 UTC
  To: Denis 'GNUtoo' Carikli, Guix-devel

Hi,

On Wed, 07 Dec 2022 at 11:33, Denis 'GNUtoo' Carikli <GNUtoo@cyberdimension.org> wrote:

> The issue here is probably the size of the generated files: they are
> huge, so if they are packaged, they will most likely take significant
> resources in the Guix infrastructure.
>
> So what would be the way to go here? Would Guix accept patches to add
> packages for these files in Guix proper?  

From my point of view, the data and the code should be packaged
separately: the data package, built with copy-build-system, would be
an input of the code package.

> If so, does it need to be done like with the ZFS (kernel module)
> package, where "#:substitutable? #f" is used to avoid redistributing
> package builds? Or are other ways better for such use cases?

Yes, ’#:substitutable? #f’ seems the first way to go.
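
A rough sketch of what such a data package could look like, combining
copy-build-system with '#:substitutable? #f'. The module name, package
name, install location, home page, and license are assumptions made
for illustration, and the hash is a placeholder, not the real
checksum; only the URL comes from this thread:

```scheme
(define-module (zim-data)                  ; hypothetical module name
  #:use-module (guix packages)
  #:use-module (guix download)
  #:use-module (guix build-system copy)
  #:use-module ((guix licenses) #:prefix license:))

(define-public wikipedia-en-all-maxi-zim   ; hypothetical package name
  (let ((file "wikipedia_en_all_maxi_2022-05.zim"))
    (package
      (name "wikipedia-en-all-maxi-zim")
      (version "2022-05")
      (source
       (origin
         (method url-fetch)
         (uri (string-append
               "https://mirror.download.kiwix.org/zim/wikipedia/" file))
         ;; Placeholder, not the real checksum.
         (sha256
          (base32
           "0000000000000000000000000000000000000000000000000000"))))
      (build-system copy-build-system)
      (arguments
       ;; Mark the output as non-substitutable, as discussed above.
       `(#:substitutable? #f
         ;; Assumed install location for the data file.
         #:install-plan '((,file "share/zim/"))))
      (home-page "https://www.kiwix.org/")
      (synopsis "English Wikipedia snapshot in ZIM format")
      (description
       "This package provides an offline snapshot of the English
Wikipedia in ZIM format, for use with tools such as kiwix-tools.")
      ;; Assumption: check the actual terms of the snapshot.
      (license license:cc-by-sa3.0))))
```

A code package or a service could then list this package among its
inputs and read the file from its share/zim/ directory.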


Cheers,
simon



Thread overview: 7+ messages
2022-12-07 10:33 Packaging big generated data files? Denis 'GNUtoo' Carikli
2022-12-07 14:45 ` pelzflorian (Florian Pelz)
2022-12-10 17:08   ` Denis 'GNUtoo' Carikli
2022-12-08 13:46 ` Csepp
2022-12-10 17:19   ` Denis 'GNUtoo' Carikli
2022-12-11 10:16     ` Ludovic Courtès
2022-12-12 13:51 ` zimoun
