[SOLVED] Proxmox on ZFS: bpool vs proxmox-boot-tool

Javex
New Member
Nov 24, 2023
I currently boot Proxmox via GRUB using LVM on a single SSD:

Bash:
sdc                              8:32   0 232.9G  0 disk
├─sdc1                           8:33   0  1007K  0 part
├─sdc2                           8:34   0   512M  0 part  /boot/efi
├─sdc3                           8:35   0   500M  0 part  /boot
└─sdc4                           8:36   0 231.9G  0 part
  └─cryptlvm                   252:0    0 231.9G  0 crypt
    ├─pve-root                 252:1    0    30G  0 lvm   /
    ├─pve-swap                 252:2    0     8G  0 lvm   [SWAP]
    ├─pve-data_tmeta           252:3    0   100M  0 lvm  
    │ └─pve-data-tpool         252:5    0   100G  0 lvm  
    │   ├─pve-data             252:6    0   100G  1 lvm  
    │   ├─pve-vm--104--disk--0 252:7    0     8G  0 lvm  
    │   ├─pve-vm--303--disk--0 252:8    0    32G  0 lvm  
    │   └─pve-vm--304--disk--0 252:9    0    32G  0 lvm  
    └─pve-data_tdata           252:4    0   100G  0 lvm  
      └─pve-data-tpool         252:5    0   100G  0 lvm  
        ├─pve-data             252:6    0   100G  1 lvm  
        ├─pve-vm--104--disk--0 252:7    0     8G  0 lvm  
        ├─pve-vm--303--disk--0 252:8    0    32G  0 lvm  
        └─pve-vm--304--disk--0 252:9    0    32G  0 lvm

(Note: I have modified the default install to add LUKS encryption)

I recently bought two new NVMe drives which I would like to run as a ZFS mirror and use as the root pool for the Proxmox OS (eventually removing the old 256G SSD that currently hosts the OS, /dev/sdc above).

I've been looking around a bit and came across Debian Bookworm Root on ZFS, which seems like an excellent guide full of detailed instructions. I noticed that it creates two pools: the rpool (for root and all other datasets, equivalent to the pve LVM VG) and the bpool (for /boot, holding the initrd & co). By contrast, the Proxmox wiki pages ZFS on Linux and Host Bootloader describe using proxmox-boot-tool to keep the ESPs in sync.

Searching didn't turn up any explanation for choosing one way over the other. If one disk fails entirely, no amount of ZFS is going to make the machine boot: the UEFI firmware first needs to find a single drive's EFI partition, which then has to contain a bootloader that speaks ZFS. But at that point the bootloader may as well just read the initrd from its own partition.

My question is then: why use a boot pool at all? Is it just so that you don't have to sync the initrd & co to two partitions? And for doing this on Proxmox specifically: should I skip the boot-pool part of the guide and instead align with how Proxmox sets things up with proxmox-boot-tool? If anyone has a bit of insight on the pros & cons of the two approaches, I'd be keen to learn more. I'm sure I can make both of them work, but I'd like to understand the tradeoffs beforehand.
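For reference, here is roughly how I picture the two per-disk layouts (my own sketch with sgdisk, sizes and type codes approximate, not copied verbatim from either guide):

Bash:
# Approach A: OpenZFS-style root-on-ZFS guide -- separate bpool + rpool per disk
sgdisk -n1:1M:+512M -t1:EF00 "$DISK"   # ESP (vfat, bootloader only)
sgdisk -n2:0:+1G    -t2:BF01 "$DISK"   # bpool member (/boot, GRUB-compatible feature set)
sgdisk -n3:0:0      -t3:BF00 "$DISK"   # rpool member (everything else)

# Approach B: Proxmox-style -- no bpool, kernels/initrds live on the (synced) ESP
sgdisk -n1:1M:+1G   -t1:EF00 "$DISK"   # ESP, kept in sync by proxmox-boot-tool
sgdisk -n2:0:0      -t2:BF00 "$DISK"   # rpool member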
 
both approaches are an attempt at fixing the issue of Grub not (properly) supporting ZFS. proxmox-boot-tool also takes care of the problem of ESPs only supporting vfat (in most implementations) and thus not supporting redundancy. if you want to stay close to the default install, I'd recommend proxmox-boot-tool (but size your ESPs accordingly, since it will contain kernel and initrds and not just the bootloader files, 512M is not very much in that case ;))
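a rough sketch of the manual steps for a freshly partitioned disk (device names are just placeholders, and I'm assuming the ESP is the second partition):

Bash:
sgdisk -n2:1M:+1G -t2:EF00 /dev/nvme0n1   # a roomier ESP than 512M, since kernels+initrds live here
proxmox-boot-tool format /dev/nvme0n1p2   # create the vfat filesystem on the ESP
proxmox-boot-tool init /dev/nvme0n1p2     # register the ESP and install the bootloader
proxmox-boot-tool refresh                 # copy the configured kernels/initrds to all registered ESPs
proxmox-boot-tool status                  # check what is registered and how it will boot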
 
Incredible, thank you so much! I managed to migrate both hosts to use ZFS for the root device. I can report that using proxmox-boot-tool worked well. Cheers!
 
There's a third approach still, to avoid GRUB (and other potential issues): have the OS on a normal filesystem (sorry) and use ZFS for the VM/CT pool only. It's handy when one suddenly needs to e.g. live-boot off Debian and troubleshoot something, which would not be much fun with ZFS on root.

I understand there's the mirror argument there, and the mdadm ("unsupported") discussion, but then again: with ZFS on root, does proxmox-boot-tool replicate the ESP partitions too? So that with one drive dead it would actually boot the next time?
 
I understand there's the mirror argument there, and the mdadm ("unsupported") discussion, but then again: with ZFS on root, does proxmox-boot-tool replicate the ESP partitions too? So that with one drive dead it would actually boot the next time?
yes, that is one of its main purposes. it's not exactly replicating them, but manages the kernel+initrd+bootloader on each ESP that is registered with it so that the configured kernels are available on all of them. of course, booting still requires enough vdevs to exist to allow importing the rpool.
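to check what it currently manages, something along these lines works (paths as far as I recall them on current PVE):

Bash:
proxmox-boot-tool status              # shows each registered ESP and its bootloader setup
cat /etc/kernel/proxmox-boot-uuids    # the ESP filesystem UUIDs the tool keeps in sync
proxmox-boot-tool kernel list         # kernels that get copied onto every registered ESP
proxmox-boot-tool refresh             # re-sync kernels, initrds and bootloader to all of them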
 
manages the kernel+initrd+bootloader on each ESP that is registered with it so that the configured kernels are available on all of them

Does it also ensure on each run that the EFI boot entries include all of those potential boot drives in the boot sequence? E.g. after fencing, will it have a chance to auto-boot from the second drive if the first one is dead? Even Debian didn't solve this particularly elegantly with mdadm [1]. Ever since, I've wondered whether all the initrds are really identical even when only update-grub was called...

[1] https://wiki.debian.org/UEFI#RAID_for_the_EFI_System_Partition
 
that depends on which bootloader is used (and sometimes how buggy your UEFI implementation is). booting by selecting the drive as boot option (as opposed to an entry for the bootloader on the drive, which might or might not exist) should work in all cases.
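if your firmware does not already list each disk as a boot option, entries can usually be added by hand with efibootmgr (sketch only; the loader path depends on whether systemd-boot or grub ended up on that ESP):

Bash:
efibootmgr -v                                             # inspect the existing boot entries
efibootmgr -c -d /dev/nvme1n1 -p 2 -L "proxmox-disk2" \
    -l '\EFI\systemd\systemd-bootx64.efi'                 # add an entry for the second disk's ESP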
 
that depends on which bootloader is used (and sometimes how buggy your UEFI implementation is). booting by selecting the drive as boot option (as opposed to an entry for the bootloader on the drive, which might or might not exist) should work in all cases.

I just very briefly checked; there are no efibootmgr calls in it. What I meant was: say there's a mirror and one drive goes bad, the node fences, reboots, everything starts up, the pool is degraded, the hot spare kicks in and it rebuilds (sorry, resilvers). I don't think the ESP on that new drive will be populated, and the EFI won't have a boot entry for it. That all has to happen manually, correct?
 
if you want your hot-spare to also be a boot drive replacement, you'd need to partition it, format+init the ESP on it, and only give the "data" partition to ZFS as the spare. NB - I haven't tried that, there might be some roadblocks in practice!
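roughly like this (untested, device names are placeholders):

Bash:
sgdisk -n2:1M:+1G -t2:EF00 /dev/nvme2n1   # ESP on the spare
sgdisk -n3:0:0    -t3:BF00 /dev/nvme2n1   # data partition for ZFS
proxmox-boot-tool format /dev/nvme2n1p2   # vfat on the spare's ESP
proxmox-boot-tool init /dev/nvme2n1p2     # register it so kernels get synced there too
zpool add rpool spare /dev/nvme2n1p3      # hand only the data partition to ZFS as hot spare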
 
if you want your hot-spare to also be a boot drive replacement, you'd need to partition it, format+init the ESP on it, and only give the "data" partition to ZFS as the spare. NB - I haven't tried that, there might be some roadblocks in practice!

So e.g. shoving in a new drive and forgetting about it is also out of the question. It has to be pre-partitioned, and even for the hot spares proxmox-boot-tool init /dev/... would have had to be run consciously. No magic. :)
 
yes, proxmox-boot-tool only takes care of the ESPs that you register with it.
 
So the thing is ... ZFS for root, I suppose, is there to allow for a mirror, but it does not exactly work like a hardware RAID would ... and one cannot have e.g. a PERC there for the OS and HBA mode as well for the VM pool (on ZFS) ... it's a bit ... there are no good options left. Mdadm is "unsupported" (and suffers on EFI anyway).

Sorry, I know there's no question here, but the tool plus some hooks would have made pushing ZFS so much more of a no-brainer if they had gone all the way (auto-update EFI boot entries, auto-create the ESP).
 
