Critical failure after update — GRUB broken despite active enterprise subscription

Additional note (regarding Proxmox boot reliability)

Since Proxmox is frequently used in production environments and critical infrastructures, I think it would be very helpful if future releases strengthened the robustness of the boot handling system even further.

The boot process (ESP discovery, proxmox-boot-tool synchronization, systemd-boot/grub updates, shim handling, etc.) is one of the most sensitive parts of a hypervisor upgrade. When it works, upgrades are smooth — but when ESP synchronization is skipped for any reason, the result is a non-bootable system, which has major impact in mission-critical installations.

It would be great if Proxmox could:

  • perform stricter validation of ESP presence/UUIDs before kernel installation
  • warn the user more clearly when ESP sync is being skipped
  • provide a pre-upgrade check specifically for bootloader/ESP health
  • optionally block reboot when the ESP was not successfully updated
This would give administrators much more confidence when performing upgrades, especially on hosts that cannot afford downtime due to bootloader inconsistencies.
Again, this is meant as constructive feedback — Proxmox is excellent, and improving the safety of the boot chain would make it even stronger for critical environments.
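
As a rough illustration of what a pre-upgrade ESP check could look like, here is a minimal Python sketch. It is not part of any Proxmox tooling. The function names are hypothetical, and it assumes a default Grub layout in which the ESP is mounted at /boot/efi as a vfat filesystem; systems whose ESPs are managed by proxmox-boot-tool are outside its scope.

```python
from typing import List, Optional, Tuple


def find_esp_mount(mounts_text: str,
                   mountpoint: str = "/boot/efi") -> Optional[Tuple[str, str]]:
    """Return (device, fstype) for `mountpoint` from /proc/mounts-style text.

    Returns None if nothing is mounted there. The default mountpoint is an
    assumption about the default Grub layout, not something Proxmox guarantees.
    """
    found = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[1] == mountpoint:
            # A later entry for the same mountpoint shadows an earlier one.
            found = (fields[0], fields[2])
    return found


def esp_precheck(mounts_text: str, mountpoint: str = "/boot/efi") -> List[str]:
    """Return a list of warnings; an empty list means nothing looked wrong."""
    warnings = []
    esp = find_esp_mount(mounts_text, mountpoint)
    if esp is None:
        warnings.append(f"nothing is mounted at {mountpoint}")
    elif esp[1] != "vfat":
        warnings.append(f"{mountpoint} is {esp[1]}, expected vfat")
    return warnings
```

Feeding it the contents of /proc/mounts before an upgrade would print nothing on a host where the ESP is mounted as expected, and one warning per problem otherwise.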
 
I don’t think this issue is caused by faulty motherboard firmware or broken UEFI variable handling.
The reason is that multiple systems — from different vendors, with different chipsets, with Secure Boot both enabled and disabled, using ZFS and LVM — have shown the same behavior during the 8.4 → 8.5/9 upgrade path:
No /etc/kernel/proxmox-boot-uuids found, skipping ESP sync.
This message is the key.
It indicates that proxmox-boot-tool did not synchronize the ESP, meaning the kernel was updated but the bootloader was not. After reboot, the machine tries to start using an outdated or incomplete ESP, leading to a boot failure, even though the OS upgrade itself completed successfully.
This is a boot chain state/configuration issue, not a firmware reliability issue.

you are completely misunderstanding what that message means. there are two broad categories of PVE systems w.r.t. booting:
- those where proxmox-boot-tool is used for managing (/syncing) the ESPs
- those where it is not

that message is printed for the second category and is completely benign. all it means is that your system is using a default Grub setup, with Grub directly installed onto the mounted ESP.
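
The distinction can be checked with a few lines of Python. This is only a sketch based on the message quoted in this thread: it assumes that the presence of a non-empty /etc/kernel/proxmox-boot-uuids is what separates the two categories, and the function names are made up for illustration.

```python
from pathlib import Path

# Path taken from the "skipping ESP sync" message quoted in this thread.
UUID_FILE = Path("/etc/kernel/proxmox-boot-uuids")


def uses_proxmox_boot_tool(uuid_file: Path = UUID_FILE) -> bool:
    """True if the UUID list exists and names at least one ESP.

    Treating an empty file like a missing one is an assumption of this sketch.
    """
    try:
        lines = uuid_file.read_text().splitlines()
    except FileNotFoundError:
        return False
    return any(line.strip() for line in lines)


def describe_boot_setup(uuid_file: Path = UUID_FILE) -> str:
    """Return a one-line description of which category the host falls into."""
    if uses_proxmox_boot_tool(uuid_file):
        return "ESPs managed by proxmox-boot-tool"
    return "not managed by proxmox-boot-tool (e.g. default Grub on the mounted ESP)"
```

On a host in the second category, describe_boot_setup() returns the "not managed" string, which matches the case where the skip message is expected and harmless.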

A genuine UEFI firmware bug typically shows signs like:
  • efibootmgr hanging or returning errors
  • NVRAM boot entries disappearing
  • Boards refusing to store new BootXXXX variables
However, this upgrade issue occurs even on systems that have always handled UEFI variables correctly. This is why the “broken firmware” explanation doesn’t seem to match the pattern observed across multiple independent reports.
In short, the common factor in these cases appears to be:
  • The ESP was not listed in /etc/kernel/proxmox-boot-uuids,
    → therefore proxmox-boot-tool skipped synchronization,
    → and the system rebooted into an outdated ESP.

the log you posted indicates that the system crashed during the system upgrade, during handling of the Grub update. ESPs or bootloaders running out of sync cannot manifest that way; those would only break the next reboot! previous systems with those exact symptoms that were analyzed in detail showed that the crash occurred when grub-install wanted to write to the EFI variables (the actual handling of that write is done by the system firmware). there were instances where upgrading the system firmware made those symptoms go away.
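
One of the firmware symptoms mentioned in this thread is boot entries disappearing from NVRAM. A read-only way to look for that is to snapshot the BootXXXX variable names before and after an upgrade and compare them. The Python sketch below assumes efivarfs is mounted at /sys/firmware/efi/efivars; the helper names are hypothetical. It does not write to any EFI variable, so it does not exercise the firmware write path that grub-install uses.

```python
import os
import re

EFIVARS_DIR = "/sys/firmware/efi/efivars"
# GUID of the EFI global variable namespace, which holds the Boot#### entries.
EFI_GLOBAL_GUID = "8be4df61-93ca-11d2-aa0d-00e098032b8c"
BOOT_ENTRY = re.compile(
    r"^(Boot[0-9A-Fa-f]{4})-" + re.escape(EFI_GLOBAL_GUID) + r"$"
)


def boot_entries(efivars_dir=EFIVARS_DIR):
    """Return the set of Boot#### variable names found in `efivars_dir`.

    Returns None if the directory does not exist, e.g. on a legacy-BIOS boot.
    BootOrder, BootCurrent and BootNext are skipped because they do not match
    the four-hex-digit pattern.
    """
    try:
        names = os.listdir(efivars_dir)
    except FileNotFoundError:
        return None
    return {m.group(1) for m in map(BOOT_ENTRY.match, names) if m}


def missing_after(before, after):
    """Return the sorted entries that were in `before` but not in `after`."""
    return sorted(before - after)
```

Calling boot_entries() once before the upgrade and once after, then passing both sets to missing_after(), gives the list of entries that vanished in between.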

This explains why different users, on different hardware, hit the same problem during the same part of the kernel/grub/systemd-boot update sequence.
To be clear: this is not a criticism of the Proxmox developers — the boot process is complex (ZFS + systemd-boot, LVM + grub-efi, shim-signed loaders, multiple ESP layouts, etc.). Even small differences in ESP setup can cause proxmox-boot-tool to skip syncing.
But based on the consistent pattern observed across different machines, this behavior does not appear related to defective motherboard firmware.
see above
 
Thanks for the detailed clarification — that helps separate two issues that were being conflated in several reports.
I understand now the distinction you’re drawing:
  • Systems using proxmox-boot-tool to manage/sync ESPs, where /etc/kernel/proxmox-boot-uuids is relevant.
  • Systems using a traditional GRUB installation, where that message is benign and expected, because GRUB is installed directly on the mounted ESP, not managed via proxmox-boot-tool.
So in GRUB-based setups, the “skipping ESP sync” message indeed does not indicate anything wrong by itself.
However, the pattern that caused confusion — and that many users (myself included) were trying to make sense of — is that the crash during the upgrade consistently happens in the same stage of the grub/EFI handling process across multiple machines and vendors. That’s what initially made people suspect an ESP/bootloader state issue instead of a firmware-level problem.

Your explanation clarifies that:
  • These crashes cannot be caused by an out-of-sync ESP, since that would only affect the next reboot, not the upgrade process itself.
  • The actual failure happens when grub-install attempts to write EFI variables, and the firmware is responsible for processing that call.
  • In some previously analyzed cases, upgrading the system firmware made the symptoms go away, which points to the UEFI implementation as the likely cause.
This distinction finally makes sense and clears up the misconception.
Thanks for taking the time to explain this in depth.