I have a 4-node cluster where one of the nodes has dropped out. The contents of /etc/pve/corosync.conf on the failed node differ from the other nodes, pve-cluster.service is reporting errors, and corosync.service fails to start.
Any ideas on how to resolve this?
Code:
root@pve5:~# systemctl status pve-cluster.service
● pve-cluster.service - The Proxmox VE cluster filesystem
Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; vendor preset: enabled)
Active: active (running) since Thu 2022-03-03 15:04:02 GMT; 3min 12s ago
Process: 4344 ExecStart=/usr/bin/pmxcfs (code=exited, status=0/SUCCESS)
Main PID: 4380 (pmxcfs)
Tasks: 5 (limit: 9324)
Memory: 35.5M
CPU: 154ms
CGroup: /system.slice/pve-cluster.service
└─4380 /usr/bin/pmxcfs
Mar 03 15:07:01 pve5 pmxcfs[4380]: [dcdb] crit: cpg_initialize failed: 2
Mar 03 15:07:01 pve5 pmxcfs[4380]: [status] crit: cpg_initialize failed: 2
Mar 03 15:07:07 pve5 pmxcfs[4380]: [quorum] crit: quorum_initialize failed: 2
Mar 03 15:07:07 pve5 pmxcfs[4380]: [confdb] crit: cmap_initialize failed: 2
Mar 03 15:07:07 pve5 pmxcfs[4380]: [dcdb] crit: cpg_initialize failed: 2
Mar 03 15:07:07 pve5 pmxcfs[4380]: [status] crit: cpg_initialize failed: 2
Mar 03 15:07:13 pve5 pmxcfs[4380]: [quorum] crit: quorum_initialize failed: 2
Mar 03 15:07:13 pve5 pmxcfs[4380]: [confdb] crit: cmap_initialize failed: 2
Mar 03 15:07:13 pve5 pmxcfs[4380]: [dcdb] crit: cpg_initialize failed: 2
Mar 03 15:07:13 pve5 pmxcfs[4380]: [status] crit: cpg_initialize failed: 2
root@pve5:~# systemctl status corosync
● corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Thu 2022-03-03 15:04:07 GMT; 3min 31s ago
Docs: man:corosync
man:corosync.conf
man:corosync_overview
Process: 4446 ExecStart=/usr/sbin/corosync -f $COROSYNC_OPTIONS (code=exited, status=0/SUCCESS)
Process: 4583 ExecStop=/usr/sbin/corosync-cfgtool -H --force (code=exited, status=1/FAILURE)
Main PID: 4446 (code=exited, status=0/SUCCESS)
CPU: 148ms
Mar 03 15:04:06 pve5 corosync[4446]: [QB ] withdrawing server sockets
Mar 03 15:04:06 pve5 corosync[4446]: [SERV ] Service engine unloaded: corosync cluster closed process group service v1.01
Mar 03 15:04:06 pve5 corosync[4446]: [QB ] withdrawing server sockets
Mar 03 15:04:06 pve5 corosync[4446]: [SERV ] Service engine unloaded: corosync cluster quorum service v0.1
Mar 03 15:04:06 pve5 corosync[4446]: [SERV ] Service engine unloaded: corosync profile loading service
Mar 03 15:04:06 pve5 corosync[4446]: [SERV ] Service engine unloaded: corosync resource monitoring service
Mar 03 15:04:06 pve5 corosync[4446]: [SERV ] Service engine unloaded: corosync watchdog service
Mar 03 15:04:07 pve5 corosync[4446]: [MAIN ] Corosync Cluster Engine exiting normally
Mar 03 15:04:07 pve5 systemd[1]: corosync.service: Control process exited, code=exited, status=1/FAILURE
Mar 03 15:04:07 pve5 systemd[1]: corosync.service: Failed with result 'exit-code'.
journalctl -u corosync shows
Code:
Mar 03 15:04:03 pve5 corosync[4446]: [KNET ] host: host: 5 has no active links
Mar 03 15:04:03 pve5 corosync[4446]: [QUORUM] Members[1]: 3
Mar 03 15:04:03 pve5 corosync[4446]: [MAIN ] Completed service synchronization, ready to provide service.
Mar 03 15:04:05 pve5 corosync[4446]: [KNET ] rx: host: 5 link: 0 is up
Mar 03 15:04:05 pve5 corosync[4446]: [KNET ] host: host: 5 (passive) best link: 0 (pri: 1)
Mar 03 15:04:05 pve5 corosync[4446]: [KNET ] pmtud: PMTUD link change for host: 5 link: 0 from 469 to 1397
Mar 03 15:04:05 pve5 corosync[4446]: [KNET ] pmtud: Global data MTU changed to: 1397
Mar 03 15:04:06 pve5 corosync[4446]: [KNET ] rx: host: 2 link: 0 is up
Mar 03 15:04:06 pve5 corosync[4446]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Mar 03 15:04:06 pve5 corosync[4446]: [KNET ] rx: host: 1 link: 0 is up
Mar 03 15:04:06 pve5 corosync[4446]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Mar 03 15:04:06 pve5 corosync[4446]: [KNET ] pmtud: PMTUD link change for host: 2 link: 0 from 469 to 1397
Mar 03 15:04:06 pve5 corosync[4446]: [KNET ] pmtud: PMTUD link change for host: 1 link: 0 from 469 to 1397
Mar 03 15:04:06 pve5 corosync[4446]: [QUORUM] Sync members[4]: 1 2 3 5
Mar 03 15:04:06 pve5 corosync[4446]: [QUORUM] Sync joined[3]: 1 2 5
Mar 03 15:04:06 pve5 corosync[4446]: [TOTEM ] A new membership (1.1117) was formed. Members joined: 1 2 5
Mar 03 15:04:06 pve5 corosync[4446]: [CMAP ] Received config version (9) is different than my config version (8)! Exiting
Mar 03 15:04:06 pve5 corosync[4446]: [SERV ] Unloading all Corosync service engines.
Mar 03 15:04:06 pve5 corosync[4446]: [QB ] withdrawing server sockets
Mar 03 15:04:06 pve5 corosync[4446]: [SERV ] Service engine unloaded: corosync vote quorum service v1.0
Mar 03 15:04:06 pve5 corosync[4446]: [QB ] withdrawing server sockets
Mar 03 15:04:06 pve5 corosync[4446]: [SERV ] Service engine unloaded: corosync configuration map access
Mar 03 15:04:06 pve5 corosync[4446]: [QB ] withdrawing server sockets
Mar 03 15:04:06 pve5 corosync[4446]: [SERV ] Service engine unloaded: corosync configuration service
Mar 03 15:04:06 pve5 corosync[4446]: [QB ] withdrawing server sockets
Mar 03 15:04:06 pve5 corosync[4446]: [SERV ] Service engine unloaded: corosync cluster closed process group service v1.01
Mar 03 15:04:06 pve5 corosync[4446]: [QB ] withdrawing server sockets
Mar 03 15:04:06 pve5 corosync[4446]: [SERV ] Service engine unloaded: corosync cluster quorum service v0.1
Mar 03 15:04:06 pve5 corosync[4446]: [SERV ] Service engine unloaded: corosync profile loading service
Mar 03 15:04:06 pve5 corosync[4446]: [SERV ] Service engine unloaded: corosync resource monitoring service
Mar 03 15:04:06 pve5 corosync[4446]: [SERV ] Service engine unloaded: corosync watchdog service
Mar 03 15:04:07 pve5 corosync[4446]: [MAIN ] Corosync Cluster Engine exiting normally
Mar 03 15:04:07 pve5 systemd[1]: corosync.service: Control process exited, code=exited, status=1/FAILURE
Mar 03 15:04:07 pve5 systemd[1]: corosync.service: Failed with result 'exit-code'.
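The key line seems to be "Received config version (9) is different than my config version (8)! Exiting". In case it helps, this is how I understand the mismatch can be confirmed on each node (assuming the default Proxmox paths; the awk one-liner below is just a sketch run against a sample config snippet so it can be tested standalone):

```shell
#!/bin/sh
# On a live node the check would be:
#   grep config_version /etc/corosync/corosync.conf /etc/pve/corosync.conf
# Sample totem section standing in for a real corosync.conf (assumption):
conf_sample='totem {
  cluster_name: mycluster
  config_version: 8
}'
# Extract the config_version value (second whitespace-separated field).
printf '%s\n' "$conf_sample" | awk '/config_version/ {print $2}'
```

On the healthy nodes /etc/pve/corosync.conf should report version 9, while the failed node is still on 8, which matches the CMAP error above.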