Resolving problem where corosync.conf differs between nodes

guff666

I have a 4-node cluster where one of the nodes has dropped out. The contents of /etc/pve/corosync.conf on the failed node differ from the other nodes, pve-cluster.service is reporting errors, and corosync.service fails.

Any ideas on how to resolve this?

Code:
root@pve5:~# systemctl status pve-cluster.service
● pve-cluster.service - The Proxmox VE cluster filesystem
     Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; vendor preset: enabled)
     Active: active (running) since Thu 2022-03-03 15:04:02 GMT; 3min 12s ago
    Process: 4344 ExecStart=/usr/bin/pmxcfs (code=exited, status=0/SUCCESS)
   Main PID: 4380 (pmxcfs)
      Tasks: 5 (limit: 9324)
     Memory: 35.5M
        CPU: 154ms
     CGroup: /system.slice/pve-cluster.service
             └─4380 /usr/bin/pmxcfs

Mar 03 15:07:01 pve5 pmxcfs[4380]: [dcdb] crit: cpg_initialize failed: 2
Mar 03 15:07:01 pve5 pmxcfs[4380]: [status] crit: cpg_initialize failed: 2
Mar 03 15:07:07 pve5 pmxcfs[4380]: [quorum] crit: quorum_initialize failed: 2
Mar 03 15:07:07 pve5 pmxcfs[4380]: [confdb] crit: cmap_initialize failed: 2
Mar 03 15:07:07 pve5 pmxcfs[4380]: [dcdb] crit: cpg_initialize failed: 2
Mar 03 15:07:07 pve5 pmxcfs[4380]: [status] crit: cpg_initialize failed: 2
Mar 03 15:07:13 pve5 pmxcfs[4380]: [quorum] crit: quorum_initialize failed: 2
Mar 03 15:07:13 pve5 pmxcfs[4380]: [confdb] crit: cmap_initialize failed: 2
Mar 03 15:07:13 pve5 pmxcfs[4380]: [dcdb] crit: cpg_initialize failed: 2
Mar 03 15:07:13 pve5 pmxcfs[4380]: [status] crit: cpg_initialize failed: 2
root@pve5:~# systemctl status corosync
● corosync.service - Corosync Cluster Engine
     Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Thu 2022-03-03 15:04:07 GMT; 3min 31s ago
       Docs: man:corosync
             man:corosync.conf
             man:corosync_overview
    Process: 4446 ExecStart=/usr/sbin/corosync -f $COROSYNC_OPTIONS (code=exited, status=0/SUCCESS)
    Process: 4583 ExecStop=/usr/sbin/corosync-cfgtool -H --force (code=exited, status=1/FAILURE)
   Main PID: 4446 (code=exited, status=0/SUCCESS)
        CPU: 148ms

Mar 03 15:04:06 pve5 corosync[4446]:   [QB    ] withdrawing server sockets
Mar 03 15:04:06 pve5 corosync[4446]:   [SERV  ] Service engine unloaded: corosync cluster closed process group service v1.01
Mar 03 15:04:06 pve5 corosync[4446]:   [QB    ] withdrawing server sockets
Mar 03 15:04:06 pve5 corosync[4446]:   [SERV  ] Service engine unloaded: corosync cluster quorum service v0.1
Mar 03 15:04:06 pve5 corosync[4446]:   [SERV  ] Service engine unloaded: corosync profile loading service
Mar 03 15:04:06 pve5 corosync[4446]:   [SERV  ] Service engine unloaded: corosync resource monitoring service
Mar 03 15:04:06 pve5 corosync[4446]:   [SERV  ] Service engine unloaded: corosync watchdog service
Mar 03 15:04:07 pve5 corosync[4446]:   [MAIN  ] Corosync Cluster Engine exiting normally
Mar 03 15:04:07 pve5 systemd[1]: corosync.service: Control process exited, code=exited, status=1/FAILURE
Mar 03 15:04:07 pve5 systemd[1]: corosync.service: Failed with result 'exit-code'.

journalctl -u corosync shows

Code:
Mar 03 15:04:03 pve5 corosync[4446]:   [KNET  ] host: host: 5 has no active links
Mar 03 15:04:03 pve5 corosync[4446]:   [QUORUM] Members[1]: 3
Mar 03 15:04:03 pve5 corosync[4446]:   [MAIN  ] Completed service synchronization, ready to provide service.
Mar 03 15:04:05 pve5 corosync[4446]:   [KNET  ] rx: host: 5 link: 0 is up
Mar 03 15:04:05 pve5 corosync[4446]:   [KNET  ] host: host: 5 (passive) best link: 0 (pri: 1)
Mar 03 15:04:05 pve5 corosync[4446]:   [KNET  ] pmtud: PMTUD link change for host: 5 link: 0 from 469 to 1397
Mar 03 15:04:05 pve5 corosync[4446]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Mar 03 15:04:06 pve5 corosync[4446]:   [KNET  ] rx: host: 2 link: 0 is up
Mar 03 15:04:06 pve5 corosync[4446]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Mar 03 15:04:06 pve5 corosync[4446]:   [KNET  ] rx: host: 1 link: 0 is up
Mar 03 15:04:06 pve5 corosync[4446]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Mar 03 15:04:06 pve5 corosync[4446]:   [KNET  ] pmtud: PMTUD link change for host: 2 link: 0 from 469 to 1397
Mar 03 15:04:06 pve5 corosync[4446]:   [KNET  ] pmtud: PMTUD link change for host: 1 link: 0 from 469 to 1397
Mar 03 15:04:06 pve5 corosync[4446]:   [QUORUM] Sync members[4]: 1 2 3 5
Mar 03 15:04:06 pve5 corosync[4446]:   [QUORUM] Sync joined[3]: 1 2 5
Mar 03 15:04:06 pve5 corosync[4446]:   [TOTEM ] A new membership (1.1117) was formed. Members joined: 1 2 5
Mar 03 15:04:06 pve5 corosync[4446]:   [CMAP  ] Received config version (9) is different than my config version (8)! Exiting
Mar 03 15:04:06 pve5 corosync[4446]:   [SERV  ] Unloading all Corosync service engines.
Mar 03 15:04:06 pve5 corosync[4446]:   [QB    ] withdrawing server sockets
Mar 03 15:04:06 pve5 corosync[4446]:   [SERV  ] Service engine unloaded: corosync vote quorum service v1.0
Mar 03 15:04:06 pve5 corosync[4446]:   [QB    ] withdrawing server sockets
Mar 03 15:04:06 pve5 corosync[4446]:   [SERV  ] Service engine unloaded: corosync configuration map access
Mar 03 15:04:06 pve5 corosync[4446]:   [QB    ] withdrawing server sockets
Mar 03 15:04:06 pve5 corosync[4446]:   [SERV  ] Service engine unloaded: corosync configuration service
Mar 03 15:04:06 pve5 corosync[4446]:   [QB    ] withdrawing server sockets
Mar 03 15:04:06 pve5 corosync[4446]:   [SERV  ] Service engine unloaded: corosync cluster closed process group service v1.01
Mar 03 15:04:06 pve5 corosync[4446]:   [QB    ] withdrawing server sockets
Mar 03 15:04:06 pve5 corosync[4446]:   [SERV  ] Service engine unloaded: corosync cluster quorum service v0.1
Mar 03 15:04:06 pve5 corosync[4446]:   [SERV  ] Service engine unloaded: corosync profile loading service
Mar 03 15:04:06 pve5 corosync[4446]:   [SERV  ] Service engine unloaded: corosync resource monitoring service
Mar 03 15:04:06 pve5 corosync[4446]:   [SERV  ] Service engine unloaded: corosync watchdog service
Mar 03 15:04:07 pve5 corosync[4446]:   [MAIN  ] Corosync Cluster Engine exiting normally
Mar 03 15:04:07 pve5 systemd[1]: corosync.service: Control process exited, code=exited, status=1/FAILURE
Mar 03 15:04:07 pve5 systemd[1]: corosync.service: Failed with result 'exit-code'.
 
Received config version (9) is different than my config version (8)! Exiting
It looks like there is a version mismatch: the corosync configuration on the failing node (config version 8) is older than the one the rest of the cluster is running (config version 9). See https://www.systutorials.com/docs/linux/man/5-corosync.conf/ and search for "config_version".
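A quick way to confirm this is to compare the config_version lines on the failed node and a healthy node directly (a minimal check, assuming the standard Proxmox paths; adjust to your setup):

Code:
# On the failed node: the local corosync copy and the pmxcfs copy (if readable)
grep config_version /etc/corosync/corosync.conf /etc/pve/corosync.conf

# On a healthy, quorate node for comparison
grep config_version /etc/pve/corosync.conf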

If there is no VM or container running on that node, the cleanest fix is to remove the failed node from the cluster, re-install Proxmox on it from scratch, and then join it to the cluster again.

https://pve.proxmox.com/wiki/Cluster_Manager#_remove_a_cluster_node
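Roughly, the flow looks like this (a sketch only; run the removal from a quorate node, keep the failed node powered off until it has been re-installed, and see the wiki link above for the full caveats):

Code:
# On one of the remaining, quorate nodes: remove the failed node
pvecm delnode pve5

# On the freshly re-installed node: join it back to the cluster,
# pointing at the IP of an existing cluster member (placeholder below)
pvecm add <IP-of-existing-cluster-node>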
 
Thanks. I did that, but a different problem has occurred.
I'll open a new thread as the symptoms are different.