PVE 8.4.17 Recover from Crashed Upgrade

triumphtruth

New Member
Jan 3, 2025
Guys, I am in a real mess. I am not sure how I got into this situation, and I need to recover from it. Please help.

Proxmox Setup Details:

Proxmox Cluster: 2 nodes total

Node 1 - Primary
Node 2 - Secondary

Versions:
Node 1: 8.4.17 (Broke after this)
Node 2: 8.4.14 (healthy, didn’t update)

How it all started:
Everything was working fine. I logged in to my cluster and saw question marks on Node 1 and its workloads. (This issue is known to me, and I wanted to fix it later by adding a QDevice. It happens because Node 1 hosts my firewall: after updates, if Node 2 starts up before the firewall has loaded, quorum is lost.)
Reading the forums and elsewhere, I found that it was a quorum issue. I restarted the corosync service to fix that and verified that the cluster was quorate again, but the issue persisted.
Then I found that the pvestatd, pvedaemon and pve-cluster services should be restarted as well. I restarted them, but nothing changed and the issue persisted.
Then I thought: let's restart the node and things will be fine.
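The quorum problem mentioned above comes down to majority arithmetic. A minimal sketch, assuming corosync's votequorum defaults (one vote per node, one extra vote from a QDevice, no two_node or wait_for_all options); quorum_needed and quorum_state are hypothetical helpers, not Proxmox tools:

Code:
```shell
#!/usr/bin/env bash
# Majority-quorum arithmetic, ASSUMING votequorum defaults:
# one vote per node (or QDevice), no two_node / wait_for_all options.

# Votes required for quorum: floor(expected_votes / 2) + 1
quorum_needed() {
  local expected_votes=$1
  echo $(( expected_votes / 2 + 1 ))
}

# Prints "quorate" if the votes currently present reach that threshold.
quorum_state() {
  local expected_votes=$1 present_votes=$2
  if (( present_votes >= $(quorum_needed "$expected_votes") )); then
    echo "quorate"
  else
    echo "not quorate"
  fi
}
```

Under these assumptions, quorum_state 2 1 (two nodes, one unreachable) prints "not quorate", while quorum_state 3 2 (two nodes plus a QDevice, one node down) prints "quorate", which is why a QDevice should help a two-node cluster survive one node being unavailable.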

I saw there were some updates pending on the node, so I thought I would install them and then restart cleanly.

This broke the node. The workloads are still running fine, but the node now appears as offline, and I have several issues on it.

During the update, apt tried to install the packages, but several of them, particularly the zfs and pve-manager packages, started throwing the following errors:
Code:
Reload daemon failed: Transport endpoint is not connected

Failed to get unit file state for pvedaemon.service: Transport endpoint is not connected

Failed to get unit file state for pveproxy.service: Transport endpoint is not connected

Failed to get unit file state for spiceproxy.service: Transport endpoint is not connected

Failed to get unit file state for pvestatd.service: Transport endpoint is not connected

Failed to get unit file state for pvebanner.service: Transport endpoint is not connected

Failed to get unit file state for pvescheduler.service: Transport endpoint is not connected

Failed to get unit file state for pve-daily-update.timer: Transport endpoint is not connected

as well as some zfs errors.

Now everything is stuck here. What I have tried:
  • dpkg --configure -a (the first time, it resolved the zfs errors, but it can't fix the pve-manager installation)
  • systemctl restart pve-cluster (this returns the following error: Failed to get properties: Transport endpoint is not connected)
  • apt clean
  • apt upgrade (it tells me to run dpkg --configure -a again)
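One way to see which packages dpkg --configure -a keeps stopping at is to list every package whose dpkg state is not "ii". This is a hedged sketch: the function names are hypothetical, it relies on dpkg-query's ${db:Status-Abbrev} field, and dpkg --audit reports much the same information.

Code:
```shell
#!/usr/bin/env bash
# Filters lines of the form produced by
#   dpkg-query -W -f='${db:Status-Abbrev} ${Package}\n'
# and prints packages that are neither "ii" (installed and configured)
# nor "rc" (removed, config files left), e.g. "iF" = half-configured.
not_fully_installed() {
  awk '$1 != "ii" && $1 != "rc" { print $1, $2 }'
}

# Hypothetical helper: run the query against the local dpkg database.
list_broken_packages() {
  dpkg-query -W -f='${db:Status-Abbrev} ${Package}\n' | not_fully_installed
}
```

A state such as iF (half-configured) or iU (unpacked but not configured) next to pve-manager would match the loop described above.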

So I am essentially stuck in a loop. I have not restarted the system and am wary of doing so. I can still SSH into the system, but the GUI is not usable. I can connect to the secondary node, and everything is working fine there.

Please let me know what to do. Please note that Node 1 is running very important workloads (OPNsense, HomeAssistant MQTT), and I really don't want to reinstall this node.

Any help will be appreciated.

Best Regards,
Muhammad Ayub.
 
apt upgrade(it returns me to run the dpkg --configure -a again)
I don't know if this is the reason, but in Proxmox one should NOT use apt upgrade (NOR apt-get upgrade).

Only apt full-upgrade (or apt-get dist-upgrade).

Disclaimer: I can't guarantee whether, in the current state of your system, running the proper command now will make it better or worse.

If you have proper backups and all else fails, you can try it.
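The advice above as a minimal sketch, assuming it is run as root on a Proxmox VE host. pve_update is a hypothetical wrapper name, and given the disclaimer it illustrates the correct command order rather than promising a fix for a system already in this state.

Code:
```shell
#!/usr/bin/env bash
# Hypothetical wrapper: finish interrupted package configuration, refresh
# the package lists, then use full-upgrade (never plain "apt upgrade").
pve_update() {
  dpkg --configure -a || return 1   # complete any half-installed packages
  apt update || return 1            # refresh repository metadata
  apt full-upgrade                  # may remove packages when dependencies change; "apt upgrade" never does
}
```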
 
To be honest, I have tried both apt full-upgrade and apt-get dist-upgrade, and both get stuck on the same issues.
 
I just tried apt full-upgrade again, and it gave me the following output:

Code:
root@home:/tmp# apt full-upgrade
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Calculating upgrade... Done
The following packages were automatically installed and are no longer required:
  proxmox-kernel-6.8.12-14-pve-signed proxmox-kernel-6.8.12-5-pve-signed proxmox-kernel-6.8.12-7-pve-signed
  proxmox-kernel-6.8.12-9-pve-signed
Use 'apt autoremove' to remove them.
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.


Then I ran apt autoremove and got the following output:

Code:
root@home:/tmp# apt autoremove
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following packages will be REMOVED:
  proxmox-kernel-6.8.12-14-pve-signed proxmox-kernel-6.8.12-5-pve-signed proxmox-kernel-6.8.12-7-pve-signed
  proxmox-kernel-6.8.12-9-pve-signed
0 upgraded, 0 newly installed, 4 to remove and 0 not upgraded.
After this operation, 2,308 MB disk space will be freed.
Do you want to continue? [Y/n] y
(Reading database ... 105733 files and directories currently installed.)
Removing proxmox-kernel-6.8.12-14-pve-signed (6.8.12-14) ...
Examining /etc/kernel/postrm.d.
run-parts: executing /etc/kernel/postrm.d/initramfs-tools 6.8.12-14-pve /boot/vmlinuz-6.8.12-14-pve
update-initramfs: Deleting /boot/initrd.img-6.8.12-14-pve
run-parts: executing /etc/kernel/postrm.d/proxmox-auto-removal 6.8.12-14-pve /boot/vmlinuz-6.8.12-14-pve
run-parts: executing /etc/kernel/postrm.d/zz-proxmox-boot 6.8.12-14-pve /boot/vmlinuz-6.8.12-14-pve
Re-executing '/etc/kernel/postrm.d/zz-proxmox-boot' in new private mount namespace..
Copying and configuring kernels on /dev/disk/by-uuid/EC9E-BEB8
        Copying kernel and creating boot-entry for 6.8.12-15-pve
        Copying kernel and creating boot-entry for 6.8.12-19-pve
run-parts: executing /etc/kernel/postrm.d/zz-update-grub 6.8.12-14-pve /boot/vmlinuz-6.8.12-14-pve
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-6.8.12-19-pve
Found initrd image: /boot/initrd.img-6.8.12-19-pve
/usr/sbin/grub-probe: error: unknown filesystem.
Found linux image: /boot/vmlinuz-6.8.12-15-pve
Found initrd image: /boot/initrd.img-6.8.12-15-pve
Found linux image: /boot/vmlinuz-6.8.12-9-pve
Found initrd image: /boot/initrd.img-6.8.12-9-pve
Found linux image: /boot/vmlinuz-6.8.12-7-pve
Found initrd image: /boot/initrd.img-6.8.12-7-pve
Found linux image: /boot/vmlinuz-6.8.12-5-pve
Found initrd image: /boot/initrd.img-6.8.12-5-pve
Found linux image: /boot/vmlinuz-6.8.12-4-pve
Found initrd image: /boot/initrd.img-6.8.12-4-pve
/usr/sbin/grub-probe: error: unknown filesystem.
Adding boot menu entry for UEFI Firmware Settings ...
done
Removing proxmox-kernel-6.8.12-5-pve-signed (6.8.12-5) ...
Examining /etc/kernel/postrm.d.
run-parts: executing /etc/kernel/postrm.d/initramfs-tools 6.8.12-5-pve /boot/vmlinuz-6.8.12-5-pve
update-initramfs: Deleting /boot/initrd.img-6.8.12-5-pve
run-parts: executing /etc/kernel/postrm.d/proxmox-auto-removal 6.8.12-5-pve /boot/vmlinuz-6.8.12-5-pve
run-parts: executing /etc/kernel/postrm.d/zz-proxmox-boot 6.8.12-5-pve /boot/vmlinuz-6.8.12-5-pve
Re-executing '/etc/kernel/postrm.d/zz-proxmox-boot' in new private mount namespace..
Copying and configuring kernels on /dev/disk/by-uuid/EC9E-BEB8
        Copying kernel and creating boot-entry for 6.8.12-15-pve
        Copying kernel and creating boot-entry for 6.8.12-19-pve
run-parts: executing /etc/kernel/postrm.d/zz-update-grub 6.8.12-5-pve /boot/vmlinuz-6.8.12-5-pve
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-6.8.12-19-pve
Found initrd image: /boot/initrd.img-6.8.12-19-pve
/usr/sbin/grub-probe: error: unknown filesystem.
Found linux image: /boot/vmlinuz-6.8.12-15-pve
Found initrd image: /boot/initrd.img-6.8.12-15-pve
Found linux image: /boot/vmlinuz-6.8.12-9-pve
Found initrd image: /boot/initrd.img-6.8.12-9-pve
Found linux image: /boot/vmlinuz-6.8.12-7-pve
Found initrd image: /boot/initrd.img-6.8.12-7-pve
Found linux image: /boot/vmlinuz-6.8.12-4-pve
Found initrd image: /boot/initrd.img-6.8.12-4-pve
/usr/sbin/grub-probe: error: unknown filesystem.
Adding boot menu entry for UEFI Firmware Settings ...
done
Removing proxmox-kernel-6.8.12-7-pve-signed (6.8.12-7) ...
Examining /etc/kernel/postrm.d.
run-parts: executing /etc/kernel/postrm.d/initramfs-tools 6.8.12-7-pve /boot/vmlinuz-6.8.12-7-pve
update-initramfs: Deleting /boot/initrd.img-6.8.12-7-pve
run-parts: executing /etc/kernel/postrm.d/proxmox-auto-removal 6.8.12-7-pve /boot/vmlinuz-6.8.12-7-pve
run-parts: executing /etc/kernel/postrm.d/zz-proxmox-boot 6.8.12-7-pve /boot/vmlinuz-6.8.12-7-pve
Re-executing '/etc/kernel/postrm.d/zz-proxmox-boot' in new private mount namespace..
Copying and configuring kernels on /dev/disk/by-uuid/EC9E-BEB8
        Copying kernel and creating boot-entry for 6.8.12-15-pve
        Copying kernel and creating boot-entry for 6.8.12-19-pve
run-parts: executing /etc/kernel/postrm.d/zz-update-grub 6.8.12-7-pve /boot/vmlinuz-6.8.12-7-pve
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-6.8.12-19-pve
Found initrd image: /boot/initrd.img-6.8.12-19-pve
/usr/sbin/grub-probe: error: unknown filesystem.
Found linux image: /boot/vmlinuz-6.8.12-15-pve
Found initrd image: /boot/initrd.img-6.8.12-15-pve
Found linux image: /boot/vmlinuz-6.8.12-9-pve
Found initrd image: /boot/initrd.img-6.8.12-9-pve
Found linux image: /boot/vmlinuz-6.8.12-4-pve
Found initrd image: /boot/initrd.img-6.8.12-4-pve
/usr/sbin/grub-probe: error: unknown filesystem.
Adding boot menu entry for UEFI Firmware Settings ...
done
Removing proxmox-kernel-6.8.12-9-pve-signed (6.8.12-9) ...
Examining /etc/kernel/postrm.d.
run-parts: executing /etc/kernel/postrm.d/initramfs-tools 6.8.12-9-pve /boot/vmlinuz-6.8.12-9-pve
update-initramfs: Deleting /boot/initrd.img-6.8.12-9-pve
run-parts: executing /etc/kernel/postrm.d/proxmox-auto-removal 6.8.12-9-pve /boot/vmlinuz-6.8.12-9-pve
run-parts: executing /etc/kernel/postrm.d/zz-proxmox-boot 6.8.12-9-pve /boot/vmlinuz-6.8.12-9-pve
Re-executing '/etc/kernel/postrm.d/zz-proxmox-boot' in new private mount namespace..
Copying and configuring kernels on /dev/disk/by-uuid/EC9E-BEB8
        Copying kernel and creating boot-entry for 6.8.12-15-pve
        Copying kernel and creating boot-entry for 6.8.12-19-pve
run-parts: executing /etc/kernel/postrm.d/zz-update-grub 6.8.12-9-pve /boot/vmlinuz-6.8.12-9-pve
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-6.8.12-19-pve
Found initrd image: /boot/initrd.img-6.8.12-19-pve
/usr/sbin/grub-probe: error: unknown filesystem.
Found linux image: /boot/vmlinuz-6.8.12-15-pve
Found initrd image: /boot/initrd.img-6.8.12-15-pve
Found linux image: /boot/vmlinuz-6.8.12-4-pve
Found initrd image: /boot/initrd.img-6.8.12-4-pve
/usr/sbin/grub-probe: error: unknown filesystem.
Adding boot menu entry for UEFI Firmware Settings ...
done
Removing subscription nag from UI...


Then I ran apt full-upgrade and apt-get dist-upgrade, and both returned the following output:

Code:
root@home:/tmp# apt full-upgrade
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Calculating upgrade... Done
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
root@home:/tmp# apt-get dist-upgrade
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Calculating upgrade... Done
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.

But the issue is still there. What should I do?
 
Removing subscription nag from UI...

Oh, removing subscription nag. You had (or have) some unofficial add-on, which is known to cause problems, especially during upgrades.

My advice is to get rid of that thing entirely.
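A line like "Removing subscription nag from UI..." at the end of an apt run suggests the add-on registered an apt hook. A hedged sketch for finding such hooks, assuming they live under /etc/apt/apt.conf.d and use DPkg::Pre-Invoke or DPkg::Post-Invoke; find_dpkg_hooks is a hypothetical helper:

Code:
```shell
#!/usr/bin/env bash
# Hypothetical helper: list apt config snippets that register dpkg hooks
# (DPkg::Pre-Invoke / DPkg::Post-Invoke). apt option names are not
# case-sensitive, hence grep -i. Prints matching file paths only.
find_dpkg_hooks() {
  local dir=${1:-/etc/apt/apt.conf.d}
  grep -rliE 'dpkg::(pre|post)-invoke' "$dir" 2>/dev/null
}
```

Not every hook is third-party, so review each file it prints before deleting anything.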


P.S. See e.g. these threads:


 

I removed the subscription nag script, but still no luck. The problem is that the whole system is stuck, and I am not sure what to do.

I tried to check the service status using systemctl status pve-cluster, but nothing comes back. After a long time, I just get Failed to get properties: Transport endpoint is not connected.
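When systemctl hangs like this, a time limit at least keeps the shell usable. A minimal sketch using coreutils timeout, which exits with status 124 when the limit is hit; unit_status and its 10-second default are assumptions for illustration:

Code:
```shell
#!/usr/bin/env bash
# Hypothetical helper: query a unit but give up after N seconds instead of
# hanging when systemctl cannot get an answer from systemd.
unit_status() {
  local unit=$1 secs=${2:-10} rc
  timeout "$secs" systemctl status --no-pager "$unit"
  rc=$?
  if [ "$rc" -eq 124 ]; then   # 124 = timeout killed the command
    echo "systemctl status $unit timed out after ${secs}s" >&2
  fi
  return "$rc"
}
```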
 
Have you studied the threads I linked to, and possibly other similar ones? Have you tried the actions described there?
 
I tried several other things as well to identify the issue, but nothing is helping. At one point I thought of upgrading things, but that is also not possible. The problem is that I can't even find logs for the services. systemctl start pvestatd, systemctl start pve-cluster, etc.: nothing is working. journalctl -xeu pvestatd -f and journalctl -xeu pve-cluster -f only show old logs; I can't see anything new.
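Because journalctl -xeu ... -f starts from the most recent entries ever logged for a unit, old entries can look current. A hedged sketch that restricts output to the current boot (-b) and a recent time window; pve_recent_logs and its one-hour default are hypothetical choices:

Code:
```shell
#!/usr/bin/env bash
# Hypothetical helper: show only this boot's entries within a recent window
# for the PVE-related units. Several -u options can be combined.
pve_recent_logs() {
  local since=${1:-"1h ago"}
  journalctl -b --since "$since" --no-pager \
    -u pve-cluster -u pvestatd -u pvedaemon -u pveproxy -u corosync
}
```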
 
You mean upgrading is not possible?
What commands do you execute, in detail?
And what are their results?
I meant that I tried to upgrade things using apt dist-upgrade, but it says there is nothing to do and everything is up to date.

BTW, the other day I was discussing this on the Proxmox Discord server, and someone shared this link with me:
https://www.reddit.com/r/Proxmox/comments/1h4o4pp/upgrade_from_827_to_83_failed_i_think/

The poster in that thread had the same issue. So he suggested that I restart the PC, since I also faced a similar situation during/after the update.

I have not done this yet, for fear that my whole home lab will become unusable, as my OPNsense runs on it. I will restart it after taking backups of the VMs and moving them off this node, probably today. I will update the thread with the outcome.