Resolving a problem where corosync.conf differs between nodes

guff666

I have a 4-node cluster where one of the nodes has dropped out. The contents of /etc/pve/corosync.conf on the failed node differ from the other nodes, pve-cluster.service is reporting errors, and corosync.service fails.

Any ideas on how to resolve this?

Code:
root@pve5:~# systemctl status pve-cluster.service
● pve-cluster.service - The Proxmox VE cluster filesystem
     Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; vendor preset: enabled)
     Active: active (running) since Thu 2022-03-03 15:04:02 GMT; 3min 12s ago
    Process: 4344 ExecStart=/usr/bin/pmxcfs (code=exited, status=0/SUCCESS)
   Main PID: 4380 (pmxcfs)
      Tasks: 5 (limit: 9324)
     Memory: 35.5M
        CPU: 154ms
     CGroup: /system.slice/pve-cluster.service
             └─4380 /usr/bin/pmxcfs

Mar 03 15:07:01 pve5 pmxcfs[4380]: [dcdb] crit: cpg_initialize failed: 2
Mar 03 15:07:01 pve5 pmxcfs[4380]: [status] crit: cpg_initialize failed: 2
Mar 03 15:07:07 pve5 pmxcfs[4380]: [quorum] crit: quorum_initialize failed: 2
Mar 03 15:07:07 pve5 pmxcfs[4380]: [confdb] crit: cmap_initialize failed: 2
Mar 03 15:07:07 pve5 pmxcfs[4380]: [dcdb] crit: cpg_initialize failed: 2
Mar 03 15:07:07 pve5 pmxcfs[4380]: [status] crit: cpg_initialize failed: 2
Mar 03 15:07:13 pve5 pmxcfs[4380]: [quorum] crit: quorum_initialize failed: 2
Mar 03 15:07:13 pve5 pmxcfs[4380]: [confdb] crit: cmap_initialize failed: 2
Mar 03 15:07:13 pve5 pmxcfs[4380]: [dcdb] crit: cpg_initialize failed: 2
Mar 03 15:07:13 pve5 pmxcfs[4380]: [status] crit: cpg_initialize failed: 2
root@pve5:~# systemctl status corosync
● corosync.service - Corosync Cluster Engine
     Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Thu 2022-03-03 15:04:07 GMT; 3min 31s ago
       Docs: man:corosync
             man:corosync.conf
             man:corosync_overview
    Process: 4446 ExecStart=/usr/sbin/corosync -f $COROSYNC_OPTIONS (code=exited, status=0/SUCCESS)
    Process: 4583 ExecStop=/usr/sbin/corosync-cfgtool -H --force (code=exited, status=1/FAILURE)
   Main PID: 4446 (code=exited, status=0/SUCCESS)
        CPU: 148ms

Mar 03 15:04:06 pve5 corosync[4446]:   [QB    ] withdrawing server sockets
Mar 03 15:04:06 pve5 corosync[4446]:   [SERV  ] Service engine unloaded: corosync cluster closed process group service v1.01
Mar 03 15:04:06 pve5 corosync[4446]:   [QB    ] withdrawing server sockets
Mar 03 15:04:06 pve5 corosync[4446]:   [SERV  ] Service engine unloaded: corosync cluster quorum service v0.1
Mar 03 15:04:06 pve5 corosync[4446]:   [SERV  ] Service engine unloaded: corosync profile loading service
Mar 03 15:04:06 pve5 corosync[4446]:   [SERV  ] Service engine unloaded: corosync resource monitoring service
Mar 03 15:04:06 pve5 corosync[4446]:   [SERV  ] Service engine unloaded: corosync watchdog service
Mar 03 15:04:07 pve5 corosync[4446]:   [MAIN  ] Corosync Cluster Engine exiting normally
Mar 03 15:04:07 pve5 systemd[1]: corosync.service: Control process exited, code=exited, status=1/FAILURE
Mar 03 15:04:07 pve5 systemd[1]: corosync.service: Failed with result 'exit-code'.

journalctl -u corosync shows:

Code:
Mar 03 15:04:03 pve5 corosync[4446]:   [KNET  ] host: host: 5 has no active links
Mar 03 15:04:03 pve5 corosync[4446]:   [QUORUM] Members[1]: 3
Mar 03 15:04:03 pve5 corosync[4446]:   [MAIN  ] Completed service synchronization, ready to provide service.
Mar 03 15:04:05 pve5 corosync[4446]:   [KNET  ] rx: host: 5 link: 0 is up
Mar 03 15:04:05 pve5 corosync[4446]:   [KNET  ] host: host: 5 (passive) best link: 0 (pri: 1)
Mar 03 15:04:05 pve5 corosync[4446]:   [KNET  ] pmtud: PMTUD link change for host: 5 link: 0 from 469 to 1397
Mar 03 15:04:05 pve5 corosync[4446]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Mar 03 15:04:06 pve5 corosync[4446]:   [KNET  ] rx: host: 2 link: 0 is up
Mar 03 15:04:06 pve5 corosync[4446]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Mar 03 15:04:06 pve5 corosync[4446]:   [KNET  ] rx: host: 1 link: 0 is up
Mar 03 15:04:06 pve5 corosync[4446]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Mar 03 15:04:06 pve5 corosync[4446]:   [KNET  ] pmtud: PMTUD link change for host: 2 link: 0 from 469 to 1397
Mar 03 15:04:06 pve5 corosync[4446]:   [KNET  ] pmtud: PMTUD link change for host: 1 link: 0 from 469 to 1397
Mar 03 15:04:06 pve5 corosync[4446]:   [QUORUM] Sync members[4]: 1 2 3 5
Mar 03 15:04:06 pve5 corosync[4446]:   [QUORUM] Sync joined[3]: 1 2 5
Mar 03 15:04:06 pve5 corosync[4446]:   [TOTEM ] A new membership (1.1117) was formed. Members joined: 1 2 5
Mar 03 15:04:06 pve5 corosync[4446]:   [CMAP  ] Received config version (9) is different than my config version (8)! Exiting
Mar 03 15:04:06 pve5 corosync[4446]:   [SERV  ] Unloading all Corosync service engines.
Mar 03 15:04:06 pve5 corosync[4446]:   [QB    ] withdrawing server sockets
Mar 03 15:04:06 pve5 corosync[4446]:   [SERV  ] Service engine unloaded: corosync vote quorum service v1.0
Mar 03 15:04:06 pve5 corosync[4446]:   [QB    ] withdrawing server sockets
Mar 03 15:04:06 pve5 corosync[4446]:   [SERV  ] Service engine unloaded: corosync configuration map access
Mar 03 15:04:06 pve5 corosync[4446]:   [QB    ] withdrawing server sockets
Mar 03 15:04:06 pve5 corosync[4446]:   [SERV  ] Service engine unloaded: corosync configuration service
Mar 03 15:04:06 pve5 corosync[4446]:   [QB    ] withdrawing server sockets
Mar 03 15:04:06 pve5 corosync[4446]:   [SERV  ] Service engine unloaded: corosync cluster closed process group service v1.01
Mar 03 15:04:06 pve5 corosync[4446]:   [QB    ] withdrawing server sockets
Mar 03 15:04:06 pve5 corosync[4446]:   [SERV  ] Service engine unloaded: corosync cluster quorum service v0.1
Mar 03 15:04:06 pve5 corosync[4446]:   [SERV  ] Service engine unloaded: corosync profile loading service
Mar 03 15:04:06 pve5 corosync[4446]:   [SERV  ] Service engine unloaded: corosync resource monitoring service
Mar 03 15:04:06 pve5 corosync[4446]:   [SERV  ] Service engine unloaded: corosync watchdog service
Mar 03 15:04:07 pve5 corosync[4446]:   [MAIN  ] Corosync Cluster Engine exiting normally
Mar 03 15:04:07 pve5 systemd[1]: corosync.service: Control process exited, code=exited, status=1/FAILURE
Mar 03 15:04:07 pve5 systemd[1]: corosync.service: Failed with result 'exit-code'.
 
Received config version (9) is different than my config version (8)! Exiting
It looks like there is a version mismatch. The configuration on the failing node is too old. See https://www.systutorials.com/docs/linux/man/5-corosync.conf/ and search for "config_version".
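For reference, config_version sits in the totem section of corosync.conf. Comparing it on a healthy node and on pve5 (the paths below are the Proxmox defaults; /etc/corosync/corosync.conf is the local copy corosync actually reads) should show something like the mismatch from the log:

Code:
# run on a healthy node and on pve5
grep config_version /etc/pve/corosync.conf
grep config_version /etc/corosync/corosync.conf

# expected, based on the log above:
#   config_version: 9   on the healthy nodes
#   config_version: 8   on pve5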

If there is no VM or container running on that node, remove the failed node from the cluster, re-install Proxmox on it from scratch, and then join the node to the cluster again.

https://pve.proxmox.com/wiki/Cluster_Manager#_remove_a_cluster_node
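In rough outline (see the wiki page above for the details and caveats, and adapt the node name and IP to your setup), the procedure would be something like:

Code:
# on one of the healthy, quorate nodes: remove the dead node from the cluster
pvecm delnode pve5

# re-install Proxmox VE on pve5 from scratch, then on the fresh pve5 join it back:
pvecm add <IP-of-an-existing-cluster-node>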
 
Thanks. I did that, but a different problem has occurred.
I'll open a new thread as the symptoms are different.
 
