corosync out of sync

richieman

Member
Apr 16, 2021
Hello. I had a problem with a node, so it was turned off. While it was off, I added a new node to the cluster to replace it. The original node is now fixed and I have turned it on again, but corosync is out of sync because the new node was added while it was off. I have important VMs on there. How can I get it back in sync?
Thanks for any help!
Richard

journalctl -u corosync.service:


Code:
Oct 21 12:35:30 ripr corosync[8417]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Oct 21 12:35:30 ripr corosync[8417]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Oct 21 12:35:30 ripr corosync[8417]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Oct 21 12:35:30 ripr corosync[8417]:   [KNET  ] pmtud: PMTUD link change for host: 6 link: 0 from 469 to 1397
Oct 21 12:35:30 ripr corosync[8417]:   [KNET  ] pmtud: PMTUD link change for host: 1 link: 0 from 469 to 1397
Oct 21 12:35:30 ripr corosync[8417]:   [KNET  ] pmtud: PMTUD link change for host: 5 link: 0 from 469 to 1397
Oct 21 12:35:30 ripr corosync[8417]:   [KNET  ] pmtud: PMTUD link change for host: 3 link: 0 from 469 to 1397
Oct 21 12:35:30 ripr corosync[8417]:   [KNET  ] pmtud: PMTUD link change for host: 4 link: 0 from 469 to 1397
Oct 21 12:35:30 ripr corosync[8417]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Oct 21 12:35:30 ripr corosync[8417]:   [QUORUM] Sync members[6]: 1 2 3 4 5 6
Oct 21 12:35:30 ripr corosync[8417]:   [QUORUM] Sync joined[5]: 1 3 4 5 6
Oct 21 12:35:30 ripr corosync[8417]:   [TOTEM ] A new membership (1.61921) was formed. Members joined: 1 3 4 5 6
Oct 21 12:35:30 ripr corosync[8417]:   [CMAP  ] Received config version (44) is different than my config version (43)! Exiting
Oct 21 12:35:30 ripr corosync[8417]:   [SERV  ] Unloading all Corosync service engines.
Oct 21 12:35:30 ripr corosync[8417]:   [QB    ] withdrawing server sockets
Oct 21 12:35:30 ripr corosync[8417]:   [SERV  ] Service engine unloaded: corosync vote quorum service v1.0
Oct 21 12:35:30 ripr corosync[8417]:   [QB    ] withdrawing server sockets
Oct 21 12:35:30 ripr corosync[8417]:   [SERV  ] Service engine unloaded: corosync configuration map access
Oct 21 12:35:30 ripr corosync[8417]:   [QB    ] withdrawing server sockets
Oct 21 12:35:30 ripr corosync[8417]:   [SERV  ] Service engine unloaded: corosync configuration service
Oct 21 12:35:30 ripr corosync[8417]:   [QB    ] withdrawing server sockets
Oct 21 12:35:30 ripr corosync[8417]:   [SERV  ] Service engine unloaded: corosync cluster closed process group service v1.01
Oct 21 12:35:30 ripr corosync[8417]:   [QB    ] withdrawing server sockets
Oct 21 12:35:30 ripr corosync[8417]:   [SERV  ] Service engine unloaded: corosync cluster quorum service v0.1
Oct 21 12:35:30 ripr corosync[8417]:   [SERV  ] Service engine unloaded: corosync profile loading service
Oct 21 12:35:30 ripr corosync[8417]:   [SERV  ] Service engine unloaded: corosync resource monitoring service
Oct 21 12:35:30 ripr corosync[8417]:   [SERV  ] Service engine unloaded: corosync watchdog service
Oct 21 12:35:31 ripr corosync[8417]:   [KNET  ] link: Resetting MTU for link 0 because host 6 joined
Oct 21 12:35:31 ripr corosync[8417]:   [KNET  ] link: Resetting MTU for link 0 because host 2 joined
Oct 21 12:35:31 ripr corosync[8417]:   [KNET  ] link: Resetting MTU for link 0 because host 1 joined
Oct 21 12:35:31 ripr corosync[8417]:   [KNET  ] link: Resetting MTU for link 0 because host 5 joined
Oct 21 12:35:31 ripr corosync[8417]:   [KNET  ] link: Resetting MTU for link 0 because host 3 joined
Oct 21 12:35:31 ripr corosync[8417]:   [KNET  ] link: Resetting MTU for link 0 because host 4 joined
Oct 21 12:35:31 ripr corosync[8417]:   [MAIN  ] Corosync Cluster Engine exiting normally
 
Could you please send us the output of

Code:
corosync-cfgtool -n

from the node with issues and from one without issues?

Please also send us the contents of the corosync config files at

Code:
/etc/pve/corosync.conf
/etc/corosync/corosync.conf

from both nodes.
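
The log above shows corosync exiting because the received config version (44) differs from the node's own (43). As a rough sketch, the version can be read out of both files on each node and compared. The helper name `config_version` is made up for illustration, and the sketch assumes the usual `config_version: N` line in the `totem` section.

```shell
#!/bin/sh
# Hypothetical helper: print the config_version value from a corosync.conf file.
config_version() {
    # Match a line such as "  config_version: 44" and print only the number.
    awk -F': *' '/^[ \t]*config_version[ \t]*:/ { gsub(/[ \t\r]/, "", $2); print $2; exit }' "$1"
}

# Report the version in the cluster-wide copy and in the local copy, if readable.
for f in /etc/pve/corosync.conf /etc/corosync/corosync.conf; do
    if [ -r "$f" ]; then
        printf '%s: config_version %s\n' "$f" "$(config_version "$f")"
    fi
done
```

If the numbers differ between the two files, or between the broken node and a healthy one, that matches the CMAP message in the log.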
 
In the meantime I seem to have resolved the issue. Here is what I did. First I copied /etc/corosync/corosync.conf from a working machine, but corosync still would not start and failed with the same error.

I figured I had to edit /etc/pve/corosync.conf but it was read-only. So I tried:
Bash:
pvecm expected 1
But it gave an error: Cannot initialize CMAP service

Then I restarted "pve-cluster" and to my surprise it was working again and /etc/pve/corosync.conf was back in sync. After that I had to fix a lot of other unrelated issues on the node, but it seems to work now.
 
OK, good to hear, but make sure `pvecm expected 1` is not left in effect; this is a sure way to run into an out-of-sync state. Make sure your corosync files match byte for byte on all hosts; you can use `sha256sum` to verify that they match.
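
A minimal sketch of that check on a single node, comparing the two copies of the config by digest. The function name `same_sha256` is hypothetical; only `sha256sum` and the two paths come from this thread.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if both files exist and have the same SHA-256 digest.
same_sha256() {
    a=$(sha256sum "$1" 2>/dev/null | cut -d' ' -f1)
    b=$(sha256sum "$2" 2>/dev/null | cut -d' ' -f1)
    # An empty digest means the file could not be read, which counts as a mismatch.
    [ -n "$a" ] && [ "$a" = "$b" ]
}

# Compare the cluster-wide copy with the local copy when both are readable.
if [ -r /etc/pve/corosync.conf ] && [ -r /etc/corosync/corosync.conf ]; then
    if same_sha256 /etc/pve/corosync.conf /etc/corosync/corosync.conf; then
        echo "corosync.conf copies match"
    else
        echo "corosync.conf copies differ"
    fi
fi
```

To compare across hosts, run `sha256sum /etc/corosync/corosync.conf` on each node and check that every node prints the same digest.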

You can use `corosync-cfgtool -n` to check that all the connections are OK (they should report both 'enabled' and 'connected'). You can also use `pvecm status` to see the current corosync status.