Proxmox 9 HA Issue: Node stuck in "old timestamp - dead" after apt dist-upgrade

dmembibre

Hello,

Yesterday, one of the nodes in our cluster got stuck in the "old timestamp - dead" HA status. This happened immediately after I ran apt dist-upgrade, which updated the following packages:

1759919264687.png

1759919331190.png

The only solution I found to bring the node back online was to perform a physical reboot, which caused significant inconvenience, as the node was running close to 70 virtual machines that had to be evacuated.

1759919791899.png

I suspect this might be a bug in the new Proxmox 9 packages.

I performed a test on another node today:

  1. I put the node into maintenance mode.
  2. I ran the package update (apt dist-upgrade).
  3. Upon exiting maintenance mode, the node immediately reverted to the "old timestamp - dead" status.

This reproducible issue is causing major stability concerns with HA (High Availability). Has anyone else experienced this behavior after updating to Proxmox 9?
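
For anyone who wants to check this from the CLI instead of the GUI, here is a rough Python sketch that filters the text printed by ha-manager status for entries reporting an old timestamp. The line layout it expects ("lrm <node> (<state>, <timestamp>)") is an assumption based on the examples in the ha-manager documentation, and the sample text, node names, and the helper name find_stale_nodes are all made up for illustration, not taken from the affected cluster.

```python
import re

# Matches lines such as "lrm pve2 (old timestamp - dead?, Wed Oct  8 11:20:44 2025)".
# This layout is an assumption; adjust the pattern if your output differs.
STATUS_LINE = re.compile(r"^(master|lrm)\s+(\S+)\s+\(([^,]+),\s*(.+)\)$")


def find_stale_nodes(status_text):
    """Return (role, node, state) tuples whose state mentions 'old timestamp'."""
    stale = []
    for line in status_text.splitlines():
        match = STATUS_LINE.match(line.strip())
        if match and "old timestamp" in match.group(3):
            role, node, state, _timestamp = match.groups()
            stale.append((role, node, state))
    return stale


# Hypothetical sample output; node names and timestamps are invented.
SAMPLE = """\
quorum OK
master pve1 (active, Wed Oct  8 12:00:03 2025)
lrm pve1 (active, Wed Oct  8 12:00:01 2025)
lrm pve2 (old timestamp - dead?, Wed Oct  8 11:20:44 2025)
service vm:100 (pve1, started)
"""

if __name__ == "__main__":
    for role, node, state in find_stale_nodes(SAMPLE):
        print(f"{role} {node}: {state}")
```

On a real node, the output of ha-manager status could be passed in place of SAMPLE, for example before and after leaving maintenance mode, to see exactly when the entry turns stale.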

Any insight or assistance would be greatly appreciated. Thank you.

Code:
pveversion -v
proxmox-ve: 9.0.0 (running kernel: 6.14.11-2-pve)
pve-manager: 9.0.10 (running version: 9.0.10/deb1ca707ec72a89)
proxmox-kernel-helper: 9.0.4
proxmox-kernel-6.14.11-3-pve-signed: 6.14.11-3
proxmox-kernel-6.14: 6.14.11-3
proxmox-kernel-6.14.11-2-pve-signed: 6.14.11-2
proxmox-kernel-6.14.11-1-pve-signed: 6.14.11-1
amd64-microcode: 3.20250311.1
ceph: 19.2.3-pve2
ceph-fuse: 19.2.3-pve2
corosync: 3.1.9-pve2
criu: 4.1.1-1
dnsmasq: 2.91-1
frr-pythontools: 10.3.1-1+pve4
ifupdown2: 3.3.0-1+pmx10
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-5
libproxmox-acme-perl: 1.7.0
libproxmox-backup-qemu0: 2.0.1
libproxmox-rs-perl: 0.4.1
libpve-access-control: 9.0.3
libpve-apiclient-perl: 3.4.0
libpve-cluster-api-perl: 9.0.6
libpve-cluster-perl: 9.0.6
libpve-common-perl: 9.0.11
libpve-guest-common-perl: 6.0.2
libpve-http-server-perl: 6.0.4
libpve-network-perl: 1.1.8
libpve-rs-perl: 0.10.10
libpve-storage-perl: 9.0.13
libspice-server1: 0.15.2-1+b1
lvm2: 2.03.31-2+pmx1
lxc-pve: 6.0.5-1
lxcfs: 6.0.4-pve1
novnc-pve: 1.6.0-3
openvswitch-switch: 3.5.0-1+b1
proxmox-backup-client: 4.0.15-1
proxmox-backup-file-restore: 4.0.15-1
proxmox-backup-restore-image: 1.0.0
proxmox-firewall: 1.1.2
proxmox-kernel-helper: 9.0.4
proxmox-mail-forward: 1.0.2
proxmox-mini-journalreader: 1.6
proxmox-offline-mirror-helper: 0.7.2
proxmox-widget-toolkit: 5.0.6
pve-cluster: 9.0.6
pve-container: 6.0.13
pve-docs: 9.0.8
pve-edk2-firmware: 4.2025.02-4
pve-esxi-import-tools: 1.0.1
pve-firewall: 6.0.3
pve-firmware: 3.17-1
pve-ha-manager: 5.0.4
pve-i18n: 3.6.0
pve-qemu-kvm: 10.0.2-4
pve-xtermjs: 5.5.0-2
pve-zsync: 2.4.0
qemu-server: 9.0.22
smartmontools: 7.4-pve1
spiceterm: 3.4.1
swtpm: 0.8.0+pve2
vncterm: 1.9.1
zfsutils-linux: 2.3.4-pve1
 
Hi!

I haven't tried to reproduce it yet. When was the update performed? Was the watchdog already inactive before the update? Any recent entries in journalctl -u watchdog-mux would help answer that.
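
In case it helps with collecting that, here is a small Python sketch that wraps the journalctl call for the watchdog-mux unit. The seven-day window and the function names are arbitrary choices for illustration; only the unit name comes from this thread. It returns an empty list on machines without journalctl.

```python
import shutil
import subprocess


def watchdog_journal_cmd(since="7 days ago"):
    """Build the journalctl argument list for the watchdog-mux unit."""
    return ["journalctl", "-u", "watchdog-mux", "--since", since, "--no-pager"]


def recent_watchdog_entries(since="7 days ago"):
    """Return journal lines for watchdog-mux, or [] if journalctl is unavailable."""
    if shutil.which("journalctl") is None:
        return []
    result = subprocess.run(
        watchdog_journal_cmd(since),
        capture_output=True,
        text=True,
        check=False,  # a non-zero exit should not raise here; inspect the output instead
    )
    return result.stdout.splitlines()


if __name__ == "__main__":
    for entry in recent_watchdog_entries():
        print(entry)
```

Running it on the affected node and posting the output around the time of the upgrade should show whether the watchdog was active before the packages were updated.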
 
Hi!

Last week we removed all HA resources and updated the packages. A moment ago I started adding the HA resources again and the master was listed as 'old timestamp - dead'. This is the status of its watchdog.

1760535975489.png

The master has changed and other nodes are now listed as 'old timestamp - dead', but the node that was previously the master is now 'idle'.


1760536091600.png