Date: Tue, 23 Feb 2021 01:11:45 +0000
From: raid5atemyhomework <raid5atemyhomework@protonmail.com>
To: Ludovic Courtès
Cc: guix-devel@gnu.org
Subject: Re: ZFS on Guix, again
In-Reply-To: <87mtvweazb.fsf@gnu.org>
References: <87mtvweazb.fsf@gnu.org>

Hi Ludo'

> Hi,
>
> Sorry for the delay; this isn’t as simple as it looks!
>
> I agree with 宋文武 regarding ‘file-system-service-type’.
>
> raid5atemyhomework <raid5atemyhomework@protonmail.com> skribis:
>
>> However, for the case where the user expects the "typical" ZFS style of
>> managing file systems, we need to mount all the ZFS file systems and
>> ensure that they are all already mounted by the time the `file-systems`
>> Shepherd service is started.  This means we need to be able to extend
>> the `requirement` of the `file-systems` Shepherd service.  And we need
>> to do that without putting any extra `/etc/fstab` entries, since in the
>> "typical" ZFS style of managing file systems they are required to not
>> be put in `/etc/fstab`.
>
> Looks like this fstab issue is the main reason why you felt the need to
> define an extra service type.  Why is it important that ZFS not be
> listed in /etc/fstab?

Because on all non-Guix operating systems, ZFS file systems aren't listed in
`/etc/fstab`:

* https://docs.oracle.com/cd/E19120-01/open.solaris/817-2271/gaztn/index.html

What ZFS users expect is that you just do something as simple as this:

    # zpool create mypool \
        raidz2 /dev/disk/by-id/ata-Generic_M0D3L_53R14LN0 \
               /dev/disk/by-id/ata-Generic_M0D3L_53R14LN1 \
               /dev/disk/by-id/ata-Generic_M0D3L_53R14LN2 \
               /dev/disk/by-id/ata-Generic_M0D3L_53R14LN3 \
        log mirror /dev/disk/by-id/ata-Generic_55DM0D3L_53R14LN0 \
                   /dev/disk/by-id/ata-Generic_55DM0D3L_53R14LN1 \
                   /dev/disk/by-id/ata-Generic_55DM0D3L_53R14LN2

And what happens is:

* The pool `mypool` is created, containing a RAIDZ-2 of the 4 HDDs listed,
  with a separate log device consisting of a mirror of 3 SSDs.
* A filesystem `mypool` is created on the pool `mypool`.
* The `mypool` filesystem is mounted on `/mypool`.
* On all subsequent bootups, the `mypool` filesystem is mounted on `/mypool`.

In ZFS you are expected to have dozens of filesystems.  If you have a new
application, the general expectation is that you create a new filesystem for
it.  You might have only one pool, or maybe two or three, but you host most
of your data in multiple filesystems on that same pool.

So, for example, you might want to create a filesystem for videos, which are
accessed sequentially and tend to be fairly large, so setting `recordsize=1M`
makes sense (good for sequential access, not so much for random access, and
good for very large files measured in dozens of megabytes):

    # zfs create -o recordsize=1M -o mountpoint=/home/raid5atemyhomework/Videos mypool/videos

The above command does the following:

* The filesystem `videos` is created on the pool `mypool`.
* The `mypool/videos` filesystem is mounted on `/home/raid5atemyhomework/Videos`.
* On all subsequent bootups, the filesystem is mounted on
  `/home/raid5atemyhomework/Videos`.

Now I might also want to run, say, a PostgreSQL service:

* PostgreSQL allocates in page sizes of 8k, so `recordsize=8k` is best.
* PostgreSQL uses a journal, which has a different access pattern from the
  rest of the data.  Journals are written and read sequentially, while the
  database itself is accessed randomly.
* The data should have `logbias=throughput` to reduce use of the ZIL SLOG
  and avoid "log on a log" slowdown effects.
* The journal itself should continue to use the default `logbias=latency`.
So I would do:

    # zfs create -o recordsize=8k -o logbias=throughput -o mountpoint=/postgresql mypool/postgresql
    # zfs create -o logbias=latency -o mountpoint=/postgresql/pg_wal mypool/postgresql/pg_wal

That means creating two filesystems for a single application: one for the
PostgreSQL data, the other for the PostgreSQL journal.

What the above examples show is:

* The habit of a ZFS user is to create many filesystems.  On my own homelab
  I have two filesystems (one for documents and code, one for videos and
  pictures) for data I manage myself, plus two more filesystems for two
  different applications I am running.
* Each filesystem has different tuning properties.

On a server you might have a dozen or so ZFS filesystems for the various
applications you need to run, and there are many other tuning parameters to
tweak.  Done via `/etc/fstab`, that would lead to a fairly large file.

The basic logic here is that `/etc/fstab` has to be stored on disk anyway,
and ZFS can just store the same information directly on the disks it is
managing.  ZFS then supports nice tabulated output of properties via
`zfs list`:

    # zfs list -o name,recordsize,logbias,atime,relatime
    NAME               RECSIZE  LOGBIAS  ATIME  RELATIME
    hddpool               128K  latency    off        on
    hddpool/bitcoin       128K  latency    off        on
    hddpool/common        128K  latency    off        on
    hddpool/lightning      64K  latency    off        on
    hddpool/media           1M  latency    off        on

And you can change parameters easily with `zfs set`; there are many dozens
of possible properties as well.

Thus, the general expectation among ZFS users is ***not*** to use any kind
of `/etc/fstab` at all, because such an `/etc/fstab` would be ludicrously
large with a dozen filesystems and several properties each.  And the
declarative `file-system` Guix syntax is really just `/etc/fstab` in another
format.  So the expectation of a ZFS user would be to keep using the classic
`zpool` and `zfs` commands to manage the filesystems and their parameters.

The main purpose of the `operating-system` declaration is to allow the
system to be brought back again, but the configuration file has to exist on
*some* permanent storage anyway, so the information might as well be managed
by ZFS directly on the disks it is managing.

ZFS also allows snapshotting of the pool configuration, so keeping the pool
configuration in the `operating-system` as well isn't really an advantage:
you can roll back changes to the pool just as well as you can roll back the
`operating-system`.

Since this is the expected behavior of ZFS, we should support it as much as
possible.

If the user really wants to manage ZFS via `file-system` declarations, they
can set `mountpoint=legacy`, and then those filesystems can be put in
`file-system` declarations that become `/etc/fstab` entries (rough sketch in
the P.S. below).  But if the user doesn't want to manage them via
`file-system` declarations, we should also support that use-case, because
that is how ZFS is meant to be used on other operating systems.  So we need
to have the `file-systems` Shepherd service also wait for non-`/etc/fstab`
filesystems like ZFS, not just those listed in `/etc/fstab` (see the P.P.S.
for the kind of service I mean).

Thanks
raid5atemyhomework
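
P.S.  To make the `mountpoint=legacy` route concrete, here is a rough,
untested sketch of what I mean, reusing the `mypool/postgresql` example from
above.  The exact `file-system` fields to use (in particular `check?` and
`needed-for-boot?`) are my assumption and would need checking against the
manual.  After `zfs set mountpoint=legacy mypool/postgresql`, the dataset
would be declared like any other filesystem:

    ;; Fragment of an operating-system declaration (sketch only).
    (file-systems
     (cons* (file-system
              (device "mypool/postgresql")  ; ZFS dataset name, not a /dev node
              (mount-point "/postgresql")
              (type "zfs")                  ; mounted as `mount -t zfs ...'
              (check? #f)                   ; there is no fsck for ZFS
              (needed-for-boot? #f))
            %base-file-systems))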
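
P.P.S.  And this is the kind of thing I mean for the non-fstab case: a
Shepherd service that imports the pools and runs `zfs mount -a`.  It is only
a sketch, not working code; the service name, its requirements, and the
`zfs` package variable are assumptions on my part, and it still leaves open
the hard part, namely getting the existing `file-systems` Shepherd service
to include this in its `requirement`.

    (shepherd-service
      (provision '(zfs-mount))
      (requirement '(root-file-system kernel-module-loader udev))
      (documentation "Import ZFS pools and mount all non-legacy ZFS file systems.")
      (respawn? #f)
      (start #~(lambda _
                 ;; Import every pool, then mount every dataset whose
                 ;; mountpoint property is not "legacy".
                 (and (zero? (system* #$(file-append zfs "/sbin/zpool")
                                      "import" "-a"))
                      (zero? (system* #$(file-append zfs "/sbin/zfs")
                                      "mount" "-a")))))
      (stop #~(const #f)))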