Standby host not reachable in the cluster after update

jacotec

Member
Nov 19, 2024
Kerpen, DE
Hi,

I have three live hosts in my homelab and two standby servers. From time to time I fire up the standby servers and install the Proxmox updates, as I did today.

One of the standby hosts updated fine, but the second one just shows up as a "red cross" in the cluster. I did the updates via SSH and rebooted the host, but it still does not come up in the cluster. Connecting to its web UI works fine. Logs after restarting corosync and pve-cluster:

Code:
Aug 03 21:25:56 pmx5 corosync[11248]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 03 21:25:56 pmx5 corosync[11248]:   [KNET  ] host: host: 2 has no active links
Aug 03 21:25:56 pmx5 corosync[11248]:   [KNET  ] link: Resetting MTU for link 0 because host 3 joined
Aug 03 21:25:56 pmx5 corosync[11248]:   [QUORUM] Sync members[1]: 3
Aug 03 21:25:56 pmx5 corosync[11248]:   [QUORUM] Sync joined[1]: 3
Aug 03 21:25:56 pmx5 corosync[11248]:   [TOTEM ] A new membership (3.232) was formed. Members joined: 3
Aug 03 21:25:56 pmx5 corosync[11248]:   [QUORUM] Members[1]: 3
Aug 03 21:25:56 pmx5 corosync[11248]:   [MAIN  ] Completed service synchronization, ready to provide service.
Aug 03 21:25:56 pmx5 systemd[1]: Started corosync.service - Corosync Cluster Engine.
Aug 03 21:25:57 pmx5 pveproxy[4701]: Cluster not quorate - extending auth key lifetime!
Aug 03 21:25:58 pmx5 pveproxy[4702]: Cluster not quorate - extending auth key lifetime!
Aug 03 21:25:58 pmx5 corosync[11248]:   [KNET  ] rx: host: 1 link: 0 is up
Aug 03 21:25:58 pmx5 corosync[11248]:   [KNET  ] link: Resetting MTU for link 0 because host 1 joined
Aug 03 21:25:58 pmx5 corosync[11248]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 03 21:25:58 pmx5 corosync[11248]:   [KNET  ] rx: host: 5 link: 0 is up
Aug 03 21:25:58 pmx5 corosync[11248]:   [KNET  ] link: Resetting MTU for link 0 because host 5 joined
Aug 03 21:25:58 pmx5 corosync[11248]:   [KNET  ] rx: host: 4 link: 0 is up
Aug 03 21:25:58 pmx5 corosync[11248]:   [KNET  ] link: Resetting MTU for link 0 because host 4 joined
Aug 03 21:25:58 pmx5 corosync[11248]:   [KNET  ] host: host: 5 (passive) best link: 0 (pri: 1)
Aug 03 21:25:58 pmx5 corosync[11248]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Aug 03 21:25:59 pmx5 corosync[11248]:   [QUORUM] Sync members[4]: 1 3 4 5
Aug 03 21:25:59 pmx5 corosync[11248]:   [QUORUM] Sync joined[3]: 1 4 5
Aug 03 21:25:59 pmx5 corosync[11248]:   [TOTEM ] A new membership (1.236) was formed. Members joined: 1 4 5
Aug 03 21:25:59 pmx5 corosync[11248]:   [CMAP  ] Received config version (11) is different than my config version (9)! Exiting
Aug 03 21:25:59 pmx5 corosync[11248]:   [SERV  ] Unloading all Corosync service engines.
Aug 03 21:25:59 pmx5 corosync[11248]:   [QB    ] withdrawing server sockets
Aug 03 21:25:59 pmx5 corosync[11248]:   [SERV  ] Service engine unloaded: corosync vote quorum service v1.0
Aug 03 21:25:59 pmx5 corosync[11248]:   [QB    ] withdrawing server sockets
Aug 03 21:25:59 pmx5 corosync[11248]:   [SERV  ] Service engine unloaded: corosync configuration map access
Aug 03 21:25:59 pmx5 corosync[11248]:   [QB    ] withdrawing server sockets
Aug 03 21:25:59 pmx5 corosync[11248]:   [SERV  ] Service engine unloaded: corosync configuration service
Aug 03 21:25:59 pmx5 corosync[11248]:   [QB    ] withdrawing server sockets
Aug 03 21:25:59 pmx5 corosync[11248]:   [SERV  ] Service engine unloaded: corosync cluster closed process group service v1.01
Aug 03 21:25:59 pmx5 corosync[11248]:   [QB    ] withdrawing server sockets
Aug 03 21:25:59 pmx5 corosync[11248]:   [SERV  ] Service engine unloaded: corosync cluster quorum service v0.1
Aug 03 21:25:59 pmx5 corosync[11248]:   [SERV  ] Service engine unloaded: corosync profile loading service
Aug 03 21:25:59 pmx5 corosync[11248]:   [SERV  ] Service engine unloaded: corosync resource monitoring service
Aug 03 21:25:59 pmx5 corosync[11248]:   [SERV  ] Service engine unloaded: corosync watchdog service
Aug 03 21:25:59 pmx5 snmpd[2856]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
Aug 03 21:25:59 pmx5 corosync[11248]:   [KNET  ] pmtud: PMTUD link change for host: 4 link: 0 from 469 to 1397
Aug 03 21:25:59 pmx5 corosync[11248]:   [KNET  ] pmtud: PMTUD link change for host: 5 link: 0 from 469 to 1397
Aug 03 21:25:59 pmx5 corosync[11248]:   [KNET  ] pmtud: PMTUD link change for host: 1 link: 0 from 469 to 1397
Aug 03 21:25:59 pmx5 corosync[11248]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Aug 03 21:25:59 pmx5 corosync[11248]:   [KNET  ] link: Resetting MTU for link 0 because host 3 joined
Aug 03 21:25:59 pmx5 corosync[11248]:   [KNET  ] link: Resetting MTU for link 0 because host 4 joined
Aug 03 21:25:59 pmx5 corosync[11248]:   [KNET  ] link: Resetting MTU for link 0 because host 5 joined
Aug 03 21:25:59 pmx5 corosync[11248]:   [KNET  ] link: Resetting MTU for link 0 because host 1 joined
Aug 03 21:25:59 pmx5 corosync[11248]:   [MAIN  ] Corosync Cluster Engine exiting normally
Aug 03 21:25:59 pmx5 systemd[1]: corosync.service: Control process exited, code=exited, status=1/FAILURE
Aug 03 21:25:59 pmx5 systemd[1]: corosync.service: Failed with result 'exit-code'.
Aug 03 21:26:00 pmx5 pmxcfs[11242]: [quorum] crit: quorum_initialize failed: 2
Aug 03 21:26:00 pmx5 pmxcfs[11242]: [confdb] crit: cmap_initialize failed: 2
Aug 03 21:26:00 pmx5 pmxcfs[11242]: [dcdb] crit: cpg_initialize failed: 2
Aug 03 21:26:00 pmx5 pmxcfs[11242]: [status] crit: cpg_initialize failed: 2
Aug 03 21:26:01 pmx5 pveproxy[4701]: Cluster not quorate - extending auth key lifetime!
Aug 03 21:26:01 pmx5 pveproxy[4702]: Cluster not quorate - extending auth key lifetime!
Aug 03 21:26:04 pmx5 pveproxy[4701]: Cluster not quorate - extending auth key lifetime!
Aug 03 21:26:05 pmx5 pveproxy[4703]: Cluster not quorate - extending auth key lifetime!
Aug 03 21:26:06 pmx5 pmxcfs[11242]: [quorum] crit: quorum_initialize failed: 2
Aug 03 21:26:06 pmx5 pmxcfs[11242]: [confdb] crit: cmap_initialize failed: 2
Aug 03 21:26:06 pmx5 pmxcfs[11242]: [dcdb] crit: cpg_initialize failed: 2
Aug 03 21:26:06 pmx5 pmxcfs[11242]: [status] crit: cpg_initialize failed: 2
Aug 03 21:26:07 pmx5 pveproxy[4702]: Cluster not quorate - extending auth key lifetime!
Aug 03 21:26:08 pmx5 pveproxy[4701]: Cluster not quorate - extending auth key lifetime!

I have no clue what's the problem. Any ideas?

Thanks,
Marco
 
Verify that /etc/corosync/corosync.conf and /etc/pve/corosync.conf have exactly the same content on all nodes. In particular, check the "config_version: 123" line (123 is just a placeholder): the number has to be the same on every node.

(( Anecdotal: I once had a node turned off (to save power in my homelab) and changed the structure of the cluster by removing another node. This of course implied changes to corosync.conf. After I turned that spare node back on, it was not able to re-synchronize these settings without manual intervention. I am not sure whether that should have been possible; there were enough live nodes to make sure quorum was always reached. Since then I make absolutely sure that all nodes are up and running when I add/remove nodes or modify corosync settings. ))
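
If you would rather not compare the files by eye, here is a minimal Python sketch of the check described above. It only reads the two paths named in this post and pulls out the config_version number; the function names (extract_config_version, local_config_versions) are made up for this example and are not part of any Proxmox or corosync tooling. It assumes the usual "config_version: N" line in the file and would have to be run on each node in turn.

```python
import re
from pathlib import Path
from typing import Dict, Optional, Tuple

# The two locations named in the post above.
DEFAULT_PATHS: Tuple[str, ...] = (
    "/etc/corosync/corosync.conf",
    "/etc/pve/corosync.conf",
)

# Matches a line such as "  config_version: 11".
CONFIG_VERSION_RE = re.compile(r"^\s*config_version\s*:\s*(\d+)\s*$", re.MULTILINE)


def extract_config_version(conf_text: str) -> Optional[int]:
    """Return the config_version found in the given text, or None if absent."""
    match = CONFIG_VERSION_RE.search(conf_text)
    return int(match.group(1)) if match else None


def local_config_versions(paths: Tuple[str, ...] = DEFAULT_PATHS) -> Dict[str, Optional[int]]:
    """Map each path to its config_version (None if unreadable or not found)."""
    versions: Dict[str, Optional[int]] = {}
    for path in paths:
        try:
            text = Path(path).read_text()
        except OSError:
            # File missing or not readable on this machine.
            versions[path] = None
            continue
        versions[path] = extract_config_version(text)
    return versions


if __name__ == "__main__":
    versions = local_config_versions()
    for path, version in versions.items():
        print(f"{path}: config_version = {version}")
    if len(set(versions.values())) > 1:
        print("The local copies disagree; compare them with a healthy node.")
```

Run it on every node and compare the printed numbers; a node that reports a lower number than the rest is the one that missed a configuration change.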
 
Hi @UdoB ,

awesome, thanks! That was the issue: the files were almost identical, but on the live hosts it was config_version 11 and on the "abandoned" host it was 9.

I've changed the files on the bad host to version 11 and restarted Corosync - and here it is again! :cool:
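
For anyone who lands here with the same symptom: in hindsight the [CMAP] line in my log above already says exactly this. Below is a small Python sketch that picks that line out of a log dump and returns both numbers. The regex is written against the message wording as it appears in my log, so it is an assumption that other corosync versions word it the same way; the function name find_version_mismatch is just made up for the example.

```python
import re
from typing import Optional, Tuple

# Wording taken from the [CMAP] line in the log posted above.
CMAP_MISMATCH_RE = re.compile(
    r"Received config version \((\d+)\) is different than my config version \((\d+)\)"
)


def find_version_mismatch(log_text: str) -> Optional[Tuple[int, int]]:
    """Return (cluster_version, local_version) from the first mismatch line.

    Returns None if the log contains no such line.
    """
    match = CMAP_MISMATCH_RE.search(log_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


if __name__ == "__main__":
    sample = (
        "Aug 03 21:25:59 pmx5 corosync[11248]:   [CMAP  ] Received config "
        "version (11) is different than my config version (9)! Exiting"
    )
    result = find_version_mismatch(sample)
    if result is not None:
        cluster_version, local_version = result
        print(f"cluster has {cluster_version}, this node has {local_version}")
```

To use it on a real host you could feed it the output of journalctl -u corosync, saved to a file or piped in; the sample string here is simply the line from my own log.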
 