How exactly do you add the remaining disks/pools to the machine? (Do you add another vdev to rpool, or just more pools in the system?) What did the output of the above commands look like before adding the disks that break booting? (No need to reproduce yet; a short description might be enough.)
As mentioned in my later post, adding the additional disks and pools is not related. I could ultimately reproduce the problem with just the two disks of rpool attached. If you'd still like redacted copies of the files let me know, but I'm reasonably sure that nothing in the additional pools is needed to reproduce. Whatever is wrong is more about the format of the rpool, and not the other pools (see below).
The root cause of my failure was the missing /etc/default/grub.d/zfs.cfg. Adding that file immediately corrected the issue once proxmox-boot-tool was run again (either via kernel upgrade or manual refresh).
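For anyone who wants to script the fix, here is a minimal sketch that creates the file only when it is missing. The function name ensure_zfs_cfg is made up for illustration. The line it writes and the dataset name rpool/ROOT/pve-1 are assumptions based on a default install, so compare against a machine that has the file and check `zfs list` before using it.

```shell
#!/bin/sh
# Sketch: create the ZFS snippet for Grub only if it is missing.
# ASSUMPTION: the snippet content below and the dataset name
# rpool/ROOT/pve-1 match a default install; verify before use.
ensure_zfs_cfg() {
    dir="$1"        # normally /etc/default/grub.d
    dataset="$2"    # e.g. rpool/ROOT/pve-1 (assumed default)
    cfg="$dir/zfs.cfg"
    if [ -e "$cfg" ]; then
        echo "present: $cfg"
        return 0
    fi
    mkdir -p "$dir" || return 1
    # Single quotes keep $GRUB_CMDLINE_LINUX literal, so it is expanded
    # later when the Grub config is generated, not while writing the file.
    printf 'GRUB_CMDLINE_LINUX="$GRUB_CMDLINE_LINUX root=ZFS=%s boot=zfs"\n' \
        "$dataset" > "$cfg" || return 1
    echo "created: $cfg"
}
```

As root this would be called as `ensure_zfs_cfg /etc/default/grub.d rpool/ROOT/pve-1`, followed by `proxmox-boot-tool refresh`.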
All additional pools are just imported normally with "zpool import". There are no additional disks in rpool. Everything was reproducible on a 2-disk system.
EDIT: I tried reproducing this here locally with a VM (legacy boot, PVE-8.0 with ZFS RAID 1 as root) - upgrading to 8.1 did not break booting - or change the generated grub-config in any way apart from adding the newer kernels.
The PVE 8 installer appears to create /etc/default/grub.d/zfs.cfg somehow. If you remove that file and run proxmox-boot-tool refresh, you will be closer to the condition I started in. I expect that if you look at the Grub command line in your VM you will find, as I did, that it contains two root directives: one built by Grub and a correct one from /etc/default/grub.d/zfs.cfg.
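A quick way to see the duplicate root directives is to pull every root= argument out of a kernel command line. This is just a sketch: list_root_args is a hypothetical helper, and the sample command line in the comment is invented to show the shape of the output, not copied from a real machine.

```shell
#!/bin/sh
# Print every root= argument found on a kernel command line, one per line.
list_root_args() {
    # tr puts each whitespace-separated argument on its own line,
    # then grep keeps only the ones that start with root=
    printf '%s\n' "$1" | tr -s ' \t' '\n\n' | grep '^root='
}

# Invented sample for illustration:
#   list_root_args "ro root=ZFS=/ROOT/pve-1 root=ZFS=rpool/ROOT/pve-1 boot=zfs"
# would print both root= arguments, the first one lacking the pool name.
```

On a running system, `list_root_args "$(cat /proc/cmdline)"` shows what the kernel was actually booted with.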
I don't know why /etc/default/grub.d/zfs.cfg is missing. The machine in question started life as a Proxmox 5 system, and has been upgraded through the years. However, I have another machine that started as Proxmox 6 that is also missing that file. Luckily, since that second machine uses EFI boot, it was not affected. Prior to this discussion I had never seen or modified that file, so I don't believe I did anything to directly influence its creation.
I've confirmed the Proxmox 7.4 installer also creates /etc/default/grub.d/zfs.cfg. I have not done a binary search to determine which installer first introduced it.
However, I think the underlying change that caused the failure in my case was a change in grub-probe. Running grub-probe still (even today) produces an error:
> grub-probe --device /dev/sdb3 --target=fs_label
grub-probe: error: compression algorithm inherit not supported
I see a different error on the other machine, which uses ZFS raidz:
> grub-probe --device /dev/sda3 --target=fs_label
grub-probe: error: unknown filesystem.
In both cases, however, the effect is the same. As mentioned before, this error manifests as a silent failure in /etc/grub.d/10_linux line 92, producing a blank pool name that the Grub init scripts then cannot mount. If it is present, /etc/default/grub.d/zfs.cfg still forces the correct kernel commandline, but on a machine without that file the boot will fail.
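To illustrate why the failure is silent, here is a simplified sketch of the pattern as I understand it, not a copy of the real script. failing_probe is a stub standing in for grub-probe so this runs without touching any disks, and build_root_device is a made-up name. The point is that stderr is discarded and the exit status is swallowed, so the pool name simply comes back empty.

```shell
#!/bin/sh
# Hypothetical stub for a failing grub-probe: prints an error to stderr
# and returns non-zero, like the errors quoted above.
failing_probe() {
    echo "grub-probe: error: unknown filesystem." >&2
    return 1
}

# Simplified version of the pattern: the error output goes to /dev/null
# and "|| true" hides the failure, so rpool ends up as an empty string.
build_root_device() {
    probe="$1"      # command used in place of grub-probe
    bootfs="$2"     # dataset path relative to the pool, e.g. /ROOT/pve-1
    rpool=$("$probe" 2>/dev/null || true)
    echo "ZFS=${rpool}${bootfs%/}"
}

build_root_device failing_probe /ROOT/pve-1   # prints ZFS=/ROOT/pve-1
```

With a probe that succeeds and prints rpool, the same function yields ZFS=rpool/ROOT/pve-1, which is the form the boot actually needs.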
Do we know why grub-probe is failing in these cases? Other forum comments suggest this has been fixed, but it still fails for me on systems that are up to date. The common thread seems to be that these are older pools.
So, to sum up, the sequence of the failure is:
- Install an older version of Proxmox (I may try to find out how old in a VM later when I have more time). The key is to _not_ have /etc/default/grub.d/zfs.cfg.
- At this point you are likely still able to boot because (I'm guessing) the older grub-probe works. Perhaps this is because of an older ZFS version?
- Upgrade to Proxmox 8.1. This will still work and boot because the initramfs for this upgrade will be built with a grub-probe that is still working (because it's running on the older kernel or ZFS version?).
- Rebuild the initramfs without /etc/default/grub.d/zfs.cfg on Proxmox 8.1, either manually or via kernel upgrade. Because grub-probe no longer works, at this point your Grub kernel command line should contain the wrong root directive.