[SOLVED] node left the cluster after upgrading it to debian-13

good morning,

We are running a subscribed Proxmox cluster. All Proxmox installations are on self-installed Debian 12 machines.

Following https://linuxconfig.org/how-to-upgrade-debian-to-latest-version, I upgraded one node to Debian 13 and rebooted the machine.

The machine booted without any warnings, but it has effectively left the cluster.

`service corosync status` on the upgraded machine prints:

corosync.service - Corosync Cluster Engine
Loaded: loaded (/usr/lib/systemd/system/corosync.service; enabled; preset: enabled)
Active: active (running) since Tue 2025-10-14 14:09:15 CEST; 2s ago
Invocation: adc013e80fa346cb8148f7677eb66ced
Docs: man:corosync
man:corosync.conf
man:corosync_overview
Main PID: 11412 (corosync)
Tasks: 9 (limit: 309106)
Memory: 156.2M (peak: 156.3M)
CPU: 108ms
CGroup: /system.slice/corosync.service
└─11412 /usr/sbin/corosync -f

Oct 14 14:09:15 ms-pm07 corosync[11412]: [KNET ] host: host: 6 (passive) best link: 0 (pri: 1)
Oct 14 14:09:15 ms-pm07 corosync[11412]: [KNET ] host: host: 6 has no active links
Oct 14 14:09:15 ms-pm07 corosync[11412]: [KNET ] host: host: 6 (passive) best link: 0 (pri: 1)
Oct 14 14:09:15 ms-pm07 corosync[11412]: [KNET ] host: host: 6 has no active links
Oct 14 14:09:15 ms-pm07 corosync[11412]: [KNET ] host: host: 6 (passive) best link: 0 (pri: 1)
Oct 14 14:09:15 ms-pm07 corosync[11412]: [KNET ] host: host: 6 has no active links
Oct 14 14:09:15 ms-pm07 corosync[11412]: [KNET ] link: Resetting MTU for link 0 because host 7 joined
Oct 14 14:09:15 ms-pm07 corosync[11412]: [QUORUM] Members[1]: 7
Oct 14 14:09:15 ms-pm07 corosync[11412]: [MAIN ] Completed service synchronization, ready to provide service.
Oct 14 14:09:15 ms-pm07 systemd[1]: Started corosync.service - Corosync Cluster Engine.

On the other machines in the cluster, `service corosync status` complains:

Oct 14 14:00:00 ms-pm01 corosync[1446]: [KNET ] rx: Packet rejected from 192.168.30.39:5405
Oct 14 14:00:02 ms-pm01 corosync[1446]: [KNET ] rx: Packet rejected from 192.168.30.39:5405
Oct 14 14:00:04 ms-pm01 corosync[1446]: [KNET ] rx: Packet rejected from 192.168.30.39:5405
Oct 14 14:00:05 ms-pm01 corosync[1446]: [KNET ] rx: Packet rejected from 192.168.30.39:5405
Oct 14 14:00:07 ms-pm01 corosync[1446]: [KNET ] rx: Packet rejected from 192.168.30.39:5405
Oct 14 14:00:09 ms-pm01 corosync[1446]: [KNET ] rx: Packet rejected from 192.168.30.39:5405
Oct 14 14:00:10 ms-pm01 corosync[1446]: [KNET ] rx: Packet rejected from 192.168.30.39:5405
Oct 14 14:00:12 ms-pm01 corosync[1446]: [KNET ] rx: Packet rejected from 192.168.30.39:5405
Oct 14 14:00:13 ms-pm01 corosync[1446]: [KNET ] rx: Packet rejected from 192.168.30.39:5405
Oct 14 14:00:15 ms-pm01 corosync[1446]: [KNET ] rx: Packet rejected from 192.168.30.39:5405
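I guess membership and link state on the healthy nodes could also be inspected with the usual tools, e.g.:
Code:
# quorum and membership as PVE sees it
pvecm status

# per-node knet link status on the local node
corosync-cfgtool -s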

Is there anything I can do to convince the upgraded node to (re)join the cluster?

Thanks for any hints.
 
I'm afraid I have accidentally upgraded to PVE 9.

I followed https://linuxconfig.org/how-to-upgrade-debian-to-latest-version, and a file named
pve-enterprise.sources:

Types: deb
URIs: https://enterprise.proxmox.com/debian/pve
Suites: trixie
Components: pve-enterprise
Signed-By: /usr/share/keyrings/proxmox-archive-keyring.gpg

was placed in /etc/apt/sources.list.d. As this is a new server which was working fine with the Bookworm installation and which does not yet have a subscription, I created a file proxmox.sources:

Types: deb
URIs: http://download.proxmox.com/debian/pve
Suites: trixie
Components: pve-no-subscription
Signed-By: /usr/share/keyrings/proxmox-archive-keyring.gpg

in /etc/apt/sources.list.d and commented out everything in pve-enterprise.sources.
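(In hindsight, instead of commenting out every line, a .sources file can apparently also be disabled with a single extra deb822 field, e.g.:)
Code:
Types: deb
URIs: https://enterprise.proxmox.com/debian/pve
Suites: trixie
Components: pve-enterprise
Signed-By: /usr/share/keyrings/proxmox-archive-keyring.gpg
Enabled: no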

`apt update && apt upgrade && apt full-upgrade` then resulted in the broken cluster.
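In hindsight, the documented path for the PVE major upgrade would have been something like this instead (a sketch based on the upgrade guide, not what I actually ran):
Code:
# on PVE 8 / Bookworm, before touching the repositories:
pve8to9 --full

# after switching all repositories from bookworm to trixie:
apt update
apt dist-upgrade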
 
Yes, I saw error messages:

Processing triggers for pve-manager (8.4.14) ...
user config - ignore invalid privilege 'VM.Monitor'
Job for pvedaemon.service failed.
See "systemctl status pvedaemon.service" and "journalctl -xeu pvedaemon.service" for details.
Job for pvestatd.service failed.
See "systemctl status pvestatd.service" and "journalctl -xeu pvestatd.service" for details.
Job for pveproxy.service failed.
See "systemctl status pveproxy.service" and "journalctl -xeu pveproxy.service" for details.
Job for pvescheduler.service failed.
See "systemctl status pvescheduler.service" and "journalctl -xeu pvescheduler.service" for details.
Processing triggers for man-db (2.11.2-2) ...
Processing triggers for ca-certificates (20250419) ...
Updating certificates in /etc/ssl/certs...
0 added, 0 removed; done.
Running hooks in /etc/ca-certificates/update.d...
done.
Processing triggers for dictionaries-common (1.30.10) ...
ispell-autobuildhash: Processing 'american' dict.
ispell-autobuildhash: Processing 'british' dict.
Processing triggers for pve-ha-manager (5.0.5) ...

and the output of `systemctl status pvestatd.service` was:

* pvestatd.service - PVE Status Daemon
Loaded: loaded (/lib/systemd/system/pvestatd.service; enabled; preset: enabled)
Active: active (running) since Mon 2025-10-13 13:44:11 CEST; 23h ago
Process: 507976 ExecReload=/usr/bin/pvestatd restart (code=exited, status=255/EXCEPTION)
Main PID: 1880 (pvestatd)
Tasks: 1 (limit: 309260)
Memory: 151.0M
CPU: 15min 55.095s
CGroup: /system.slice/pvestatd.service
└─1880 pvestatd

Oct 14 12:47:37 ms-pm07 systemd[1]: Reloading pvestatd.service - PVE Status Daemon...
Oct 14 12:47:37 ms-pm07 pvestatd[507976]: unknown file 'ha/rules.cfg' at /usr/share/perl5/PVE/Cluster.pm line 524.
Oct 14 12:47:37 ms-pm07 pvestatd[507976]: Compilation failed in require at /usr/share/perl5/PVE/QemuServer.pm line 36.
Oct 14 12:47:37 ms-pm07 pvestatd[507976]: BEGIN failed--compilation aborted at /usr/share/perl5/PVE/QemuServer.pm line 36.
Oct 14 12:47:37 ms-pm07 pvestatd[507976]: Compilation failed in require at /usr/share/perl5/PVE/Service/pvestatd.pm line 21.
Oct 14 12:47:37 ms-pm07 pvestatd[507976]: BEGIN failed--compilation aborted at /usr/share/perl5/PVE/Service/pvestatd.pm line 21.
Oct 14 12:47:37 ms-pm07 pvestatd[507976]: Compilation failed in require at /usr/bin/pvestatd line 9.
Oct 14 12:47:37 ms-pm07 pvestatd[507976]: BEGIN failed--compilation aborted at /usr/bin/pvestatd line 9.
Oct 14 12:47:37 ms-pm07 systemd[1]: pvestatd.service: Control process exited, code=exited, status=255/EXCEPTION
Oct 14 12:47:37 ms-pm07 systemd[1]: Reload failed for pvestatd.service - PVE Status Daemon.
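That "unknown file" error looks like mixed package versions to me; I suppose this could be checked with something like:
Code:
# compare versions of the packages involved in the failure
dpkg -l libpve-cluster-perl pve-ha-manager qemu-server pve-manager

# anything apt still wants to upgrade
apt list --upgradable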

I found another message:

Re-executing '/etc/kernel/postinst.d/zz-proxmox-boot' in new private mount namespace..
No /etc/kernel/proxmox-boot-uuids found, skipping ESP sync.
Processing triggers for libc-bin (2.41-12) ...
Processing triggers for pve-manager (9.0.11) ...
user config - ignore invalid privilege 'VM.Monitor'
got timeout when trying to ensure cluster certificates and base file hierarchy is set up - no quorum (yet) or hung pmxcfs?
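To see whether pmxcfs was hung or just lacking quorum, I assume one would check:
Code:
# pmxcfs runs as part of pve-cluster.service
systemctl status pve-cluster

# quorum state as corosync sees it
corosync-quorumtool -s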
 
The error messages look a lot like this thread, which suggests that you are using Bookworm (Debian 12) repositories (or no repositories) for Debian 13. Make sure you also have the correct Debian base repositories: https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_debian_base_repositories
Sad to state, the machine is using Debian 13 repositories, and apt update reports that all packages are up to date:

apt update
Hit:1 http://security.debian.org/debian-security trixie-security InRelease
Hit:2 http://download.proxmox.com/debian/pve trixie InRelease
Hit:3 http://ftp.gwdg.de/debian trixie InRelease
Hit:4 http://ftp.gwdg.de/debian trixie-updates InRelease
All packages are up to date.

Meanwhile I am considering a Bookworm reinstallation of the questionable host.
 
Things are even worse: the existing cluster was affected by the broken host. Nobody was able to log in to the Proxmox web UI, and users who were already logged in were unable to connect to their VMs. After I had shut down the questionable host, users could log in and connect to their VMs again.

Currently the questionable machine is shut down, and as I do not think it can be healed, I shall do a fresh Bookworm installation in the afternoon and refrain from upgrading to Trixie.
 
Hi,
please share the output of
Code:
pveversion -v
grep '' /etc/apt/sources.list.d/* /etc/apt/sources.list.d/
apt dist-upgrade

You don't have to actually confirm the upgrade; this is just to see what it says.

EDIT: Sorry, I was thinking about components, but they're not shown in the output; removed my wrong initial message.
 
pveversion -v :
proxmox-ve: 9.0.0 (running kernel: 6.14.11-4-pve)
pve-manager: 9.0.11 (running version: 9.0.11/3bf5476b8a4699e2)
proxmox-kernel-helper: 9.0.4
proxmox-kernel-6.14.11-4-pve-signed: 6.14.11-4
proxmox-kernel-6.14: 6.14.11-4
proxmox-kernel-6.8: 6.8.12-15
proxmox-kernel-6.8.12-15-pve-signed: 6.8.12-15
amd64-microcode: 3.20250311.1
ceph-fuse: 19.2.3-pve1
corosync: 3.1.9-pve2
criu: 4.1.1-1
frr-pythontools: 10.3.1-1+pve4
ifupdown: residual config
ifupdown2: 3.3.0-1+pmx10
libjs-extjs: 7.0.0-5
libproxmox-acme-perl: 1.7.0
libproxmox-backup-qemu0: 2.0.1
libproxmox-rs-perl: 0.4.1
libpve-access-control: 9.0.3
libpve-apiclient-perl: 3.4.0
libpve-cluster-api-perl: 9.0.6
libpve-cluster-perl: 9.0.6
libpve-common-perl: 9.0.11
libpve-guest-common-perl: 6.0.2
libpve-http-server-perl: 6.0.4
libpve-network-perl: 1.1.8
libpve-rs-perl: 0.10.10
libpve-storage-perl: 9.0.13
libspice-server1: 0.15.2-1+b1
lvm2: 2.03.31-2+pmx1
lxc-pve: 6.0.5-1
lxcfs: 6.0.4-pve1
novnc-pve: 1.6.0-3
proxmox-backup-client: 4.0.16-1
proxmox-backup-file-restore: 4.0.16-1
proxmox-backup-restore-image: 1.0.0
proxmox-firewall: 1.2.0
proxmox-kernel-helper: 9.0.4
proxmox-mail-forward: 1.0.2
proxmox-mini-journalreader: 1.6
proxmox-offline-mirror-helper: 0.7.2
proxmox-widget-toolkit: 5.0.6
pve-cluster: 9.0.6
pve-container: 6.0.13
pve-docs: 9.0.8
pve-edk2-firmware: not correctly installed
pve-esxi-import-tools: 1.0.1
pve-firewall: 6.0.3
pve-firmware: 3.17-2
pve-ha-manager: 5.0.5
pve-i18n: 3.6.1
pve-qemu-kvm: 10.0.2-4
pve-xtermjs: 5.5.0-2
qemu-server: 9.0.23
smartmontools: 7.4-pve1
spiceterm: 3.4.1
swtpm: 0.8.0+pve2
vncterm: 1.9.1
zfsutils-linux: 2.3.4-pve1

grep '' /etc/apt/sources.list.d/* /etc/apt/sources.list.d/ :
/etc/apt/sources.list.d/debian.sources:# Modernized from /etc/apt/sources.list
/etc/apt/sources.list.d/debian.sources:Types: deb deb-src
/etc/apt/sources.list.d/debian.sources:URIs: http://ftp.gwdg.de/debian/
/etc/apt/sources.list.d/debian.sources:Suites: trixie
/etc/apt/sources.list.d/debian.sources:Components: main non-free-firmware
/etc/apt/sources.list.d/debian.sources:Signed-By: /usr/share/keyrings/debian-archive-keyring.gpg
/etc/apt/sources.list.d/debian.sources:
/etc/apt/sources.list.d/debian.sources:# Modernized from /etc/apt/sources.list
/etc/apt/sources.list.d/debian.sources:Types: deb deb-src
/etc/apt/sources.list.d/debian.sources:URIs: http://security.debian.org/debian-security/
/etc/apt/sources.list.d/debian.sources:Suites: trixie-security
/etc/apt/sources.list.d/debian.sources:Components: main non-free-firmware
/etc/apt/sources.list.d/debian.sources:Signed-By: /usr/share/keyrings/debian-archive-keyring.gpg
/etc/apt/sources.list.d/debian.sources:
/etc/apt/sources.list.d/debian.sources:# Modernized from /etc/apt/sources.list
/etc/apt/sources.list.d/debian.sources:Types: deb deb-src
/etc/apt/sources.list.d/debian.sources:URIs: http://ftp.gwdg.de/debian/
/etc/apt/sources.list.d/debian.sources:Suites: trixie-updates
/etc/apt/sources.list.d/debian.sources:Components: main non-free-firmware
/etc/apt/sources.list.d/debian.sources:Signed-By: /usr/share/keyrings/debian-archive-keyring.gpg
/etc/apt/sources.list.d/debian.sources:
/etc/apt/sources.list.d/debian.sources:
/etc/apt/sources.list.d/proxmox.sources:Types: deb
/etc/apt/sources.list.d/proxmox.sources:URIs: http://download.proxmox.com/debian/pve
/etc/apt/sources.list.d/proxmox.sources:Suites: trixie
/etc/apt/sources.list.d/proxmox.sources:Components: pve-no-subscription
/etc/apt/sources.list.d/proxmox.sources:Signed-By: /usr/share/keyrings/proxmox-archive-keyring.gpg
/etc/apt/sources.list.d/proxmox.sources:
/etc/apt/sources.list.d/pve-enterprise.sources:#Types: deb
/etc/apt/sources.list.d/pve-enterprise.sources:#URIs: https://enterprise.proxmox.com/debian/pve
/etc/apt/sources.list.d/pve-enterprise.sources:#Suites: trixie
/etc/apt/sources.list.d/pve-enterprise.sources:#Components: pve-enterprise
/etc/apt/sources.list.d/pve-enterprise.sources:#Signed-By: /usr/share/keyrings/proxmox-archive-keyring.gpg
grep: /etc/apt/sources.list.d/: Is a directory

apt dist-upgrade:
I cannot execute this because the machine is offline (in order not to influence our existing cluster; see my last post), but I remember that its last output stated that everything was up to date.
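By the way, the line "pve-edk2-firmware: not correctly installed" in the pveversion output above also stands out; once the machine is online again, I would probably first try something like (untested):
Code:
# finish any interrupted package configuration
dpkg --configure -a

# reinstall the package that is flagged as broken
apt install --reinstall pve-edk2-firmware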
 
Could you please save /var/log/apt/* before reinstalling, and provide the contents? The symptoms all look like you did some sort of partial upgrade because of a repository misconfiguration. Maybe you also had pending network changes that got activated by the reboot? Those would explain why corosync suddenly rejected the node.
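Something like this would do (archive name is just an example):
Code:
# archive the apt logs before wiping the machine
tar czf apt-logs.tar.gz /var/log/apt/

# history.log* records the exact package transactions
zcat -f /var/log/apt/history.log* | less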
 
Those look okay to me. What about corosync.conf and the network config? "journalctl -b" from the first boot after the full-upgrade would also be interesting.
 
Those look okay to me. What about corosync.conf and the network config? "journalctl -b" from the first boot after the full-upgrade would also be interesting.
Is this of academic interest, or do you think we could heal the node?

Regarding corosync.conf and the network config: what exactly do you need?
Regarding journalctl -b: I'm afraid that this is lost now.
 
I think it is likely possible to "heal" the node, but if you want to proceed with reinstallation, that's of course also fine!
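Roughly: fix the repositories, then complete the interrupted upgrade. As a sketch (not a guaranteed recipe):
Code:
# with the correct trixie + pve repositories in place:
apt update
apt full-upgrade

# repair any half-configured or broken packages
apt -f install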

Regarding corosync.conf and the network config: what exactly do you need?

The log message by corosync indicates that the other nodes reject the upgraded node's traffic because it is not originating from the expected address. This usually indicates some sort of network setup change that is either wrong or not reflected in corosync.conf.
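To compare the two, something along these lines should work (paths as on a standard PVE node):
Code:
# the addresses corosync expects for each node
grep -E 'name:|ring0_addr' /etc/corosync/corosync.conf

# the addresses actually configured on the upgraded node
ip -4 addr show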