ZFS root booting to busybox, but no displayed command, message or error?

AxxelH

Dell T30 booting from a mirrored rpool on SATA SSDs, BIOS legacy boot using Grub. I'm running unattended-upgrades, so it's likely I have recent packages from the unsupported repository, but I'm not sure.

When booting I'm dropped to busybox with the "No pool imported. Manually import the root pool..." message. However, the preceding "Command", "Message" and "Error" fields are all blank (no mention of any error or of rpool). The documented advice to add a rootdelay, which I've applied manually by editing the boot entry in Grub, has no effect. I suspect this is because the problem isn't with device timing; I've been booting from these disks for more than a year.

"zpool import -N rpool" imports the pool, and the pool is healthy per "zpool status". However an "exit" from busybox reports "No init found" and cycles back to busybox.
Attempting to choose an older kernel from Grub fails the same way (6.5.11-4-pve, 6.5.11-3-pve, 6.2.16-19-pve).

I'm guessing that I need to refresh something (?) using proxmox-boot-tool, but to do that I need to be able to boot, and I can't get out of busybox. All the forum references I find indicate that once rpool is imported the init should continue; what am I missing?
 
you also need to mount the root dataset (into the right directory) for it to proceed

IIRC it's something like "mount -o zfsutil -t zfs rpool/... root/"
 
Thank you for the hint. The following sequence:

Code:
zpool import -N rpool
mkdir -p ROOT/pve-1
mount -o zfsutil -t zfs rpool/ROOT/pve-1 ROOT/pve-1

allows the machine to boot into Proxmox. However, running "proxmox-boot-tool refresh" doesn't fix the issue; the same thing happens at the next boot. Any suggested repairs?
 
the main question is why doesn't the pool import automatically? is there anything else displayed that might give a hint?
 
Nothing seems amiss in dmesg, and the pool is healthy. Again, the busybox "Command" "Message" and "Error" fields are all blank.

Is it possible the zpool.cache is at issue? I'm unclear how that interacts with rpool at boot time specifically, but I do see it included in the initramfs.
 
it should not.. does dmesg contain any pointers about disk related hardware coming up late?
 
There's nothing in dmesg about disk hardware problems, late or otherwise. And again, adding rootdelay=60 to Grub (via "e" to edit the boot options at the Grub menu) had no effect, so I don't think it's a slow disk (these are SATA SSDs).
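
For completeness, the permanent equivalent of that temporary edit would be something like the sketch below; the 60-second value is just the one I tested, and on a system booted via proxmox-boot-tool you would run "proxmox-boot-tool refresh" instead of update-grub:

Code:
# in /etc/default/grub: append rootdelay to the default kernel command line
GRUB_CMDLINE_LINUX_DEFAULT="quiet rootdelay=60"
# then regenerate the Grub config
update-grub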

A different machine with EFI boot has no problems, and I've confirmed this machine will boot (from different disks) with EFI on a fresh 8.1 install, so I'm wondering if 8.1 has broken something for legacy boot or Grub configurations? Is there a way to force the installer to use Grub boot so I can test if that's the issue?
 
This gets weirder and weirder...

1) Install Proxmox 8.0 from ISO to a new drive using legacy boot. Using 8.0 instead of 8.1 so the rpool doesn't have ZFS features too new for my rescue disk (SysRescue with ZFS).
2) System boots normally with GRUB.
3) Use ZFS send/recv to replicate my old rpool to the new disk. Use proxmox-boot-tool to init and refresh the appropriate partitions (roughly as in the sketch after this list). System boots normally with GRUB.
4) Add the SSDs as a mirror to the new rpool, and use proxmox-boot-tool to update the partitions on the newly mirrored disks.
5) System boots normally with GRUB.
6) Remove the new disk from the rpool, leaving only the original SSDs mirroring the new pool.
7) System boots normally.
8) Add the remaining disks and pools back to the machine (HBA, some U.2 disks). System fails to boot again, with the same message and no error.
9) Removing all the additional hardware again doesn't help; the problem persists, so I'm back where I started.
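
For reference, step 3 was along these lines; the pool, snapshot and partition names below are placeholders rather than my exact commands:

Code:
# replicate the old root pool to the pool on the new disk
zfs snapshot -r srcpool@migrate
zfs send -R srcpool@migrate | zfs recv -F dstpool

# set up the boot partition on the new disk and rewrite the boot configs
proxmox-boot-tool init /dev/sdX2
proxmox-boot-tool refresh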

I can repeat the same process using UEFI boot (starting with a clean install of Proxmox 8.0) and get to step 5, where the machine hangs at "EFI stub: Loaded initrd from LINUX_EFI_INITRD_MEDIA_GUID device path". systemd-boot gives a lot less information, unfortunately.

There have been no hardware changes (aside from the temporary disk).

Did 8.1 change something about how zfs is loaded in initramfs?
 
the only real change is that it's now possible to use grub-via-proxmox-boot-tool for ZFS+EFI..

can you try removing the zpool.cache file and updating the initramfs?
 
I've removed the zpool.cache and rebuilt initramfs, no change.
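
For the record, the steps I ran were roughly:

Code:
rm /etc/zfs/zpool.cache
update-initramfs -u -k all
proxmox-boot-tool refresh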

I'm now trying to copy (rsync) the old pool content on top of a working 8.0-installer-created rpool, followed by a proxmox-boot-tool refresh. I'll post here how it goes.

It's increasingly looking to me like something about the ZFS import running in the initramfs doesn't like my other pools? All the pools are healthy, but something makes the new initramfs unhappy?
 
you could probably add more output or manually try to run the exact steps the initrd scripts take, if you really want to get to the bottom of it..
 
The rsync worked, though some of the containers won't start for unclear reasons; I expect I'll restore them from backup to fix that. Just to be thorough, I then replaced the working rsync copy with a zfs send of the same content, and that fails to boot. So whatever is going on seems specific to the actual dataset, and not its content. The dataset has gone through many Proxmox upgrades, but nothing stands out in the zfs properties.

you could probably add more output or manually try to run the exact steps the initrd scripts take, if you really want to get to the bottom of it..

Are there suggested steps? At this point my assumption is that the dataset (which probably started with Proxmox 5) is somehow broken. I'm considering a full rebuild of the machine, as it's an old homelab. A clean install of everything might be worthwhile.

But in the near term, if the rsync gets things running that gets me off the immediate problem.
 
the main zfs script is in /usr/share/initramfs-tools/scripts/zfs - but be careful when experimenting (or maybe do it in a VM first to test your changes).
 
/usr/share/initramfs-tools/scripts/zfs

Thank you, that was the hint I needed, in particular the parsing of the pool name and bootfs. I had previously failed to notice that proxmox-boot-tool refresh was producing this:
root=ZFS=/ROOT/pve-1

Instead of this:
root=ZFS=rpool/ROOT/pve-1

Without the pool name, the import doesn't run at all, which explains the lack of error messages.
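
For anyone hitting the same symptom, the quickest check is to look at the root= option the kernel actually received, or the one in the generated config. The grub.cfg path below assumes a classic legacy-Grub layout; with proxmox-boot-tool the generated configs end up on the ESPs instead:

Code:
# works from the busybox prompt or a booted system
cat /proc/cmdline

# on a booted system, inspect the generated Grub config
grep -o 'root=ZFS=[^ ]*' /boot/grub/grub.cfg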

The incorrect rpool name seems to be created in /etc/grub.d/10_linux line 92:
rpool=`${grub_probe} --device ${GRUB_DEVICE} --target=fs_label 2>/dev/null || true`

I'm not sure what GRUB_DEVICE is supposed to be, but assuming it's the ZFS partition, the probe fails with "/usr/sbin/grub-probe: error: compression algorithm inherit not supported", possibly because of the problem in this thread, which appears to still be present on an updated machine?
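
The failing probe is easy to reproduce by hand; /dev/sda3 below is just an example, substitute whichever partition carries your rpool vdev:

Code:
# the same call /etc/grub.d/10_linux makes; on my system it errors out with
# "compression algorithm inherit not supported" and prints no label
grub-probe --device /dev/sda3 --target=fs_label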

On a new install, however, this failure doesn't matter because something (the installer?) creates /etc/default/grub.d/zfs.cfg:
GRUB_CMDLINE_LINUX="$GRUB_CMDLINE_LINUX root=ZFS=rpool/ROOT/pve-1 boot=zfs"

Since this forcibly appends the correct root= directive, and the last occurrence of the option takes precedence, the resulting Grub command line is bootable and the bad option is ignored. Adding that file to my original installation fixes my boot issue.
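
So the workaround for an older installation like mine is simply to create that snippet by hand and regenerate the boot config; the dataset name below assumes the default rpool/ROOT/pve-1, adjust to your layout:

Code:
mkdir -p /etc/default/grub.d
cat > /etc/default/grub.d/zfs.cfg <<'EOF'
GRUB_CMDLINE_LINUX="$GRUB_CMDLINE_LINUX root=ZFS=rpool/ROOT/pve-1 boot=zfs"
EOF

# rewrite the boot configuration on the ESP(s)
proxmox-boot-tool refresh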

The various install scenarios I tried seem like red herrings. systemd-boot didn't work because I forgot this machine doesn't have an /etc/kernel/cmdline, which would be needed for it to boot. It's possible some of the initial rsync vs. zfs send sequences worked because of different ZFS options on the resulting filesystems; I did not check.
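
(For the systemd-boot path the equivalent would be a one-line /etc/kernel/cmdline followed by a proxmox-boot-tool refresh, something like the line below; I haven't re-tested that, so treat it as a sketch.)

Code:
root=ZFS=rpool/ROOT/pve-1 boot=zfs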

I did not test whether a fresh install eventually regenerates the wrong version of the root= option; I suspect it will, as long as grub-probe can fail silently. This may be a bug in upstream grub.

However, it seems like the post-install state expects /etc/default/grub.d/zfs.cfg to exist, which appears Proxmox specific, and it did not exist on my upgraded system. This seems like a Proxmox bug?
 
thanks for the additional information!

systemd-boot only works on EFI-booted systems.

I will try to reproduce your issue!
 
However, it seems like the post-install state expects /etc/default/grub.d/zfs.cfg to exist, which appears Proxmox specific, and it did not exist on my upgraded system. This seems like a Proxmox bug?

that file is set up by the installer, so if your system predates that it is expected to be missing.

I can reproduce the problem if I remove that file though, seems to be a regression in grub packaging.
 
I have a similar issue, with a twist. I didn't reboot after upgrading.
While upgrading to proxmox-kernel-6.5.11-6-pve-signed I received the error "/usr/sbin/grub-probe: error: compression algorithm inherit not supported"

I also have the following logs:

Code:
Nov 29 14:38:11 Proxmox systemd[1]: Reloading.
Nov 29 14:38:12 Proxmox systemd[1]: zfs-import-scan.service - Import ZFS pools by device scanning was skipped because of an unmet condition check (ConditionFileNotEmpty=!/etc/zfs/zpool.cache).
Nov 29 14:38:12 Proxmox systemd[1]: zfs-import-scan.service - Import ZFS pools by device scanning was skipped because of an unmet condition check (ConditionFileNotEmpty=!/etc/zfs/zpool.cache).
Nov 29 14:38:12 Proxmox systemd[1]: Reloading.
Nov 29 14:38:12 Proxmox zed[2693]: Exiting
Nov 29 14:38:12 Proxmox systemd[1]: Stopping zfs-zed.service - ZFS Event Daemon (zed)...
Nov 29 14:38:12 Proxmox systemd[1]: zfs-zed.service: Deactivated successfully.
Nov 29 14:38:12 Proxmox systemd[1]: Stopped zfs-zed.service - ZFS Event Daemon (zed).
Nov 29 14:38:12 Proxmox systemd[1]: zfs-zed.service: Consumed 12.796s CPU time.
Nov 29 14:38:12 Proxmox systemd[1]: Started zfs-zed.service - ZFS Event Daemon (zed).
Nov 29 14:38:12 Proxmox zed[2181935]: ZFS Event Daemon 2.2.0-pve4 (PID 2181935)
Nov 29 14:38:12 Proxmox zed[2181935]: Processing events since eid=4834
 
yeah, grub simply doesn't cope with ZFS, hence our workaround of
- forcing "root=.." via a grub config snippet
- putting grub/systemd-boot+kernels on the ESP, both for legacy and EFI boot

as long as the generated ESP contents including grub.cfg is good, that warning can be ignored.
 
yeah, grub simply doesn't cope with ZFS, hence our workaround of
- forcing "root=.." via a grub config snippet
- putting grub/systemd-boot+kernels on the ESP, both for legacy and EFI boot

as long as the generated ESP contents including grub.cfg is good, that warning can be ignored.
Thanks for the heads up!
 
systemd-boot only works on EFI-booted systems.

Understood. My system can boot either way; I was trying EFI boot to test whether the problem was specific to Grub. As it turns out, it is specific to Grub, but I mistakenly associated the bug with EFI boot in an earlier post because my system doesn't have the necessary /etc/kernel/cmdline. So under systemd-boot I had the same failure (no ZFS root) but for a different, and expected, reason.

that file is set up by the installer, so if your system predates that it is expected to be missing.

I can reproduce the problem if I remove that file though, seems to be a regression in grub packaging.

I'm not sure I understand this. If the file is created by the Proxmox installer I wouldn't expect it to be included in the grub package, but since it's clearly required I would expect some Proxmox package to ship it?
 
