ZFS root booting to busybox, but no displayed command, message or error?

I'm not sure I understand this. If the file is created by the Proxmox installer, I wouldn't expect it to be included in the grub package. But since it seems clear it's required, I would expect it to be shipped in some Proxmox package?
No, it used to work, but now it doesn't (a regression) - the file was probably added to counteract this. Since we now ship our own grub packages again, we could add a snippet there that detects / on ZFS (via some other means than grub-probe) and adds the expected grub.cfg override.
 
Yeah, grub simply doesn't cope with ZFS, hence our workaround of
- forcing "root=.." via a grub config snippet
- putting grub/systemd-boot+kernels on the ESP, both for legacy and EFI boot

As long as the generated ESP contents, including grub.cfg, are good, that warning can be ignored.
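
For context, such a snippet just appends the correct root dataset to the kernel command line. A minimal sketch, assuming the default rpool/ROOT/pve-1 dataset created by the PVE installer (a concrete example from a real system appears further down in the thread):

Code:
# /etc/default/grub.d/zfs.cfg - sketch only, adjust the dataset name to your pool
GRUB_CMDLINE_LINUX="${GRUB_CMDLINE_LINUX} root=ZFS=rpool/ROOT/pve-1 boot=zfs"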

I will make only one comment so as not to hijack the thread, but is this (ZFS on root) all worth it? I was originally surprised that selecting ZFS in the installer puts everything on ZFS. Having had some exciting experiences with another Debian-based distro's attempt at the same thing, I would have expected the "ZFS" option in the installer to just create the ZFS pool for the VMs, while the OS goes onto a pretty small dedicated partition that any rescue tool can boot into and work with, with no worries about any future updates. What's the benefit of a hypervisor on ZFS? Having regular full snapshots (of a system that one cannot even get out of the initramfs when it breaks)? Because if that was the idea, a separate PVE settings backup tool (which is missing) would probably have been easier to maintain and more efficient.

(For clarity, I absolutely do use ZFS for everything but the hypervisor myself too.)
 
No, it used to work, but now it doesn't (a regression) - the file was probably added to counteract this. Since we now ship our own grub packages again, we could add a snippet there that detects / on ZFS (via some other means than grub-probe) and adds the expected grub.cfg override.

Great, thanks for clarifying. It seems we agree this is something that Proxmox needs to address via some mechanism outside the installer, which was my primary concern.
 
Great, thanks for clarifying. It seems we agree this is something that Proxmox needs to address via some mechanism outside the installer, which was my primary concern.

could you please share:
* `lsblk`
* `zpool list`
* `zpool status`
* `proxmox-boot-tool status`

from the affected machine - maybe that'll help in reproducing the issue.
7) System boots normally.
8) Add remaining disks and pools to the machine (HBA, some U2 disks). System fails to boot again, with the same message and no error.
9) Removing all additional hardware does not resolve the problem; back to the beginning.
How exactly do you add the remaining disks/pools to the machine? (Do you add another vdev to rpool, or just more pools in the system?) How would the output of the above commands look before adding the disks that break booting? (No need to reproduce yet - just a short description might be enough.)

Thanks!

EDIT: I tried reproducing this here locally with a VM (legacy boot, PVE-8.0 with ZFS RAID 1 as root) - upgrading to 8.1 did not break booting - or change the generated grub-config in any way apart from adding the newer kernels.
 
How exactly do you add the remaining disks/pools to the machine? (Do you add another vdev to rpool, or just more pools in the system?) How would the output of the above commands look before adding the disks that break booting? (No need to reproduce yet - just a short description might be enough.)

As mentioned in my later post, adding the additional disks and pools is not related. I could ultimately reproduce the problem with just the two disks of rpool attached. If you'd still like redacted copies of the files, let me know, but I'm reasonably sure that nothing in the additional pools is needed to reproduce. Whatever is wrong is more about the format of the rpool than about the other pools (see below).

The root cause of my failure was the missing /etc/default/grub.d/zfs.cfg. Adding that file immediately corrected the issue once proxmox-boot-tool was run again (either via kernel upgrade or manual refresh).
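
For anyone else hitting this, a minimal sketch of the repair steps described above, assuming the installer-default rpool/ROOT/pve-1 dataset (verify yours with zfs list -o name before copying this):

Code:
# recreate the missing override snippet (dataset name assumed - check with: zfs list -o name)
mkdir -p /etc/default/grub.d
echo 'GRUB_CMDLINE_LINUX="${GRUB_CMDLINE_LINUX} root=ZFS=rpool/ROOT/pve-1 boot=zfs"' > /etc/default/grub.d/zfs.cfg
# regenerate the boot configuration on the ESP(s)
proxmox-boot-tool refresh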

All additional pools are just imported normally with "zpool import"; no additional disks in rpool. Everything was reproducible on a 2-disk system.

EDIT: I tried reproducing this here locally with a VM (legacy boot, PVE-8.0 with ZFS RAID 1 as root) - upgrading to 8.1 did not break booting - or change the generated grub-config in any way apart from adding the newer kernels.

The PVE 8 installer appears to create /etc/default/grub.d/zfs.cfg somehow. If you remove that file and run proxmox-boot-tool refresh, you will be closer to the condition I started in. I expect that if you look at the Grub command line in your VM, you will find, as I did, that it contains two root directives: one built by Grub and a correct one from /etc/default/grub.d/zfs.cfg.
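
To reproduce that state in a test VM, something along these lines should do (a sketch of the removal/refresh steps just described - do not run this on a production machine):

Code:
# put the test VM into the broken condition
rm /etc/default/grub.d/zfs.cfg
proxmox-boot-tool refresh
# then inspect the generated kernel command line for duplicate or empty root= directives
grep 'root=' /boot/grub/grub.cfg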

I don't know why /etc/default/grub.d/zfs.cfg is missing. The machine in question started life as a Proxmox 5 system and has been upgraded through the years. However, I have another machine that started as Proxmox 6 and is also missing that file. Luckily, since that second machine uses EFI boot, it was not affected. Prior to this discussion I had never seen or modified that file, so I don't believe I did anything to directly influence its creation.

I've confirmed the Proxmox 7.4 installer also creates /etc/default/grub.d/zfs.cfg; I have not done a binary search to determine which installer first introduced it.

However, I think the underlying change that causes the failure in my case was a change in grub-probe. Running grub-probe still (even today) produces an error:

> grub-probe --device /dev/sdb3 --target=fs_label
grub-probe: error: compression algorithm inherit not supported

I see a different error on the other machine, as this one uses ZFS raidz:

> grub-probe --device /dev/sda3 --target=fs_label
grub-probe: error: unknown filesystem.

In both cases, however, the effect is the same. As mentioned before, this error manifests as a silent failure in /etc/grub.d/10_linux line 92, producing a blank pool name that the Grub init scripts then cannot mount. If it is present, /etc/default/grub.d/zfs.cfg still forces the correct kernel commandline, but on a machine without that file the boot will fail.
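
For reference, the ZFS branch of Debian's /etc/grub.d/10_linux looks approximately like this (quoted from memory, exact line numbers vary between versions); the 2>/dev/null || true is what turns a grub-probe failure into a silently empty pool name:

Code:
# excerpt from /etc/grub.d/10_linux (approximate)
case x"$GRUB_FS" in
    xzfs)
        # a failing grub-probe is swallowed here, leaving ${rpool} empty
        rpool=`${grub_probe} --device ${GRUB_DEVICE} --target=fs_label 2>/dev/null || true`
        bootfs="`make_system_path_relative_to_its_root / | sed -e "s,@$,,"`"
        # with an empty ${rpool} this yields a root=ZFS=... value that cannot be mounted
        LINUX_ROOT_DEVICE="ZFS=${rpool}${bootfs%/}"
        ;;
esac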

Do we know why grub-probe is failing in these cases? I've seen references in other forum comments suggesting this has been fixed, but again, it still fails for me on systems that are fully up to date. The common thread seems to be that these are older pools.


So, to sum up, the sequence of the failure is:

  • Install an older version of Proxmox (I may try to find out how old in a VM later when I have more time). The key is to _not_ have /etc/default/grub.d/zfs.cfg.
  • At this point you are likely still able to boot because (I'm guessing) the older grub-probe works. Perhaps this is because of an older ZFS version?
  • Upgrade to Proxmox 8.1. This will still work and boot, because the initramfs for this upgrade will be built with a grub-probe that is still working (because it's running on the older kernel or ZFS version?).
  • Rebuild the initramfs without /etc/default/grub.d/zfs.cfg on Proxmox 8.1, either manually or via a kernel upgrade. Because grub-probe no longer works, at this point your grub kernel command line should contain the wrong root directive.
 
OK, I've tried multiple installs and upgrade sequences in a VM, and I can't get to the same place. OTOH, based just on behavior I believe I can describe the sequence.

- Older versions of Proxmox do not install /etc/default/grub.d/zfs.cfg (I tested 5.4 and 6.4). Instead they generate /etc/default/grub with this content:
GRUB_CMDLINE_LINUX="root=ZFS=rpool/ROOT/pve-1 boot=zfs"
I do not have this content on my machine because my /etc/default/grub started on a non-ZFS system and was migrated to ZFS sometime around Proxmox 6. I copied the file over from the old machine to keep the customizations I needed (i.e. intel_iommu=on iommu=pt).

- The missing Grub option did not matter because "grub-probe --device /dev/sdb3 --target=fs_label" produced the correct pool label. I cannot prove this because I cannot roll back the affected machine, but this is the only way the machine could have remained bootable through all the upgrades between 6 and 8.

- Proxmox 8.1 has broken "grub-probe --device /dev/sdb3 --target=fs_label" for unclear reasons. I can still repro this error (see earlier post). I cannot repro in a VM. Since I had no other copy of the "root=ZFS=rpool/ROOT/pve-1" directive, my machine became unbootable.

Adding /etc/default/grub.d/zfs.cfg forces the magic missing string to reappear, allowing the machine to boot.

So basically, over several releases, Proxmox seems to have included different workarounds to force "root=ZFS=rpool/ROOT/pve-1". It's not clear to me whether these overrides were always meant to be the sole source of that setting, or whether "grub-probe" was ever expected to work. As I look at the various post-install states, it seems like it was always overridden intentionally.
 
Adding /etc/default/grub.d/zfs.cfg forces the magic missing string to reappear, allowing the machine to boot.

So basically, over several releases, Proxmox seems to have included different workarounds to force "root=ZFS=rpool/ROOT/pve-1". It's not clear to me whether these overrides were always meant to be the sole source of that setting, or whether "grub-probe" was ever expected to work. As I look at the various post-install states, it seems like it was always overridden intentionally.
I checked - the
"root=ZFS=rpool/ROOT/pve-1"

part has been there ever since installation with / on ZFS was supported - back then it was added to /etc/default/grub; the change to /etc/default/grub.d/zfs.cfg came sometime in 2021.
And grub-probe probably did not work (or only worked in the very early days of OpenZFS).

Anyway, we'll check what can be done to at least warn users that their system might be unbootable.
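
(Purely as an illustration of the kind of check that could produce such a warning - this is not an existing Proxmox tool, and the exact paths/heuristics are assumptions - something along these lines would flag the risky combination of a ZFS root, a failing grub-probe, and no root=ZFS= override:)

Code:
#!/bin/sh
# hypothetical sketch - warn if / is on ZFS, grub-probe cannot resolve the pool label,
# and no grub.d snippet forces root=ZFS=... onto the kernel command line
[ "$(findmnt -n -o FSTYPE /)" = "zfs" ] || exit 0
label="$(grub-probe --device "$(grub-probe --target=device / 2>/dev/null)" --target=fs_label 2>/dev/null || true)"
override="$(grep -rs 'root=ZFS=' /etc/default/grub /etc/default/grub.d/ || true)"
if [ -z "$label" ] && [ -z "$override" ]; then
    echo "WARNING: / is on ZFS, grub-probe cannot determine the pool label," >&2
    echo "and no root=ZFS=... override was found - the system might not boot." >&2
fi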
 
As I am also having this problem and found the workaround via Google, I am likewise searching for a permanent solution.
What do I need to do to fix this so that I can boot the system without the initramfs workaround? Manipulating the grub.cfg in /boot might also fix it until the next regeneration.
 
I was facing the same problem (which I usually solved by backing everything up remotely via zfs send | zfs receive, recreating the partition layout & zpool, and restoring remotely via zfs send | zfs receive in reverse).
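
(For the curious, that round-trip is roughly the following; host, pool, and snapshot names are placeholders:)

Code:
# back up the whole pool to a remote box
zfs snapshot -r rpool@migrate
zfs send -R rpool@migrate | ssh backuphost zfs receive -F backup/rpool
# ... recreate the partition layout and rpool from a live system, then restore:
ssh backuphost zfs send -R backup/rpool@migrate | zfs receive -F rpool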

I discovered that the wrong line in /boot/grub/grub.cfg is actually added due to some (mis)configuration issue in (one or more) /etc/grub.d/ files.

I backed up the old folder and scp'd over the files from a working system, and then it generated the correct entry. What pointed me in that direction was a very weird "GRUB2: error unknown command recordfail", which is probably due to a missing include of some functions in the boot generation script.

Anyhow, these are the files that solve the problem; I'm not sure if somebody can/wants to diff them against their own and find the root cause.
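
(If anyone wants to try that, a simple recursive diff against a backup of the old files should show what changed - assuming the originals were backed up to /etc/grub.d.bak:)

Code:
diff -ru /etc/grub.d.bak /etc/grub.d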

The 10_linux_zfs script was something I had copied from a (very specific?) Ubuntu package, as it does not normally ship with Debian. I'm not sure if it is really required, though.

It's still behaving a bit differently compared to the working system: when doing update-grub on pve16 (the problematic system) AFTER applying the same configuration files in /etc/grub.d/, I get many more warnings (attached log).

On pve15 (the working system), I get the other output (attached log).

It should be the same partition layout, the same boot flags set by parted, etc. It's difficult to understand why they still give different output.

But anyway, it boots now :).
 


Hi,

I have the exact same problem and no /etc/default/grub.d/zfs.cfg. What should be the content of that file?
 
Hi,

I have the exact same problem and no /etc/default/grub.d/zfs.cfg. What should be the content of that file?
I have this ...

cat /etc/default/grub.d/zfs.cfg

Code:
GRUB_CMDLINE_LINUX="${GRUB_CMDLINE_LINUX} root=ZFS=\"rpool/ROOT/debian\""
GRUB_CMDLINE_LINUX_DEFAULT="${GRUB_CMDLINE_LINUX_DEFAULT} root=ZFS=\"rpool/ROOT/debian\""

Optionally, you could also add these (I add them to each line) in the case of a headless remote server, to prevent ZFS from getting stuck importing the pool if the hostid does NOT match ...

Code:
zfs_force=1 boot=zfs
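
With those options added to each line, the whole file would look something like this (still with my rpool/ROOT/debian dataset - substitute your own):

Code:
GRUB_CMDLINE_LINUX="${GRUB_CMDLINE_LINUX} root=ZFS=\"rpool/ROOT/debian\" zfs_force=1 boot=zfs"
GRUB_CMDLINE_LINUX_DEFAULT="${GRUB_CMDLINE_LINUX_DEFAULT} root=ZFS=\"rpool/ROOT/debian\" zfs_force=1 boot=zfs"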
 
