Hello! I have a small homelab cluster which I installed updates on today and ran into trouble on the second of three nodes. I'm hoping to get some advice about how to fix this without completely rebuilding the node.
All three nodes started at pve-manager/8.4.1/2a5fa54a8503f96d (running kernel: 6.8.12-9-pve) and I updated to the latest available packages through the UI.
The first node completed its upgrades just fine and is now running pve-manager/8.4.6/c5b55b84d1f84ea6 (running kernel: 6.8.12-13-pve).
After upgrading and rebooting the second node, I noticed it was taking much longer than expected to come back up (~15 minutes) and wasn't responding over the network, so I went down to the basement, hard-powered the node off, and hooked up a display and keyboard to begin diagnostics.
On boot, the GRUB menu came up, and when the default Proxmox Virtual Environment option started, I received this error message:
Code:
Loading Linux 6.8.12-13-pve ...
error: invalid cluster 0.
Loading initial ramdisk ...
error: you need to load the kernel first.
Press any key to continue...
Initially I was able to boot back into the previous kernel from the Advanced options menu in GRUB, and then did some googling on how to reinstall/repair the new kernel version, which led me to a combination of update-initramfs and update-grub, and later on to proxmox-boot-tool refresh.
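For reference, the repair attempt looked roughly like this (reconstructed from memory, so the exact invocations may have differed slightly):
Code:
update-initramfs -u -k 6.8.12-13-pve
update-grub
proxmox-boot-tool refresh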
I was able to boot into the new kernel once, but after rebooting to verify that the fix would stick, it failed to load again, and this time the old kernel was ALSO missing. That led me down a path of troubleshooting from the Proxmox installer in debug mode: fsck on the boot partition, importing the ZFS rpool and chrooting into it to run additional commands, and so on.
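In case the details matter, this is roughly what I did from the installer's debug shell (the ESP device name is from memory and may not be exact):
Code:
# check the boot/EFI partition on the first NVMe drive
fsck.vfat -a /dev/nvme0n1p2
# import the root pool under /mnt and chroot into it
zpool import -f -R /mnt rpool
mount --rbind /dev  /mnt/dev
mount --rbind /proc /mnt/proc
mount --rbind /sys  /mnt/sys
chroot /mnt /bin/bash
# inside the chroot I re-ran update-initramfs -u, update-grub and
# proxmox-boot-tool refresh, then exited, exported the pool and rebooted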
At any rate, I am now stuck and would appreciate any advice for getting unstuck.
When I boot the node, I now receive this message:
Code:
Loading Linux 6.8.12-13-pve ...
Loading initial ramdisk ...
error: premature end of file /initrd.img-6.8.12-13-pve.
Press any key to continue...
I suspect something is off about my boot partition, but I'm a little out of my depth here, so any advice is appreciated.
A bit more about my setup, in case it proves helpful:
- Three-node cluster of HP Z2 G4 desktops with identical hardware (aside from some Ceph OSD differences, but the boot process isn't getting that far, so I doubt they're relevant)
- ZFS mirror boot pool on NVMe drives
- UEFI with Secure Boot has been in use since the cluster was installed; no migration of boot methods has been performed, as far as I remember.