split brain recovery

adambialy

Jan 24, 2019
Hi all,
I think I have a split-brain problem. I can't find much on the Internet about it or about how to recover from it, so I'm asking you for help.
What happened is that I tried to remove one node (it was wrongly named), and after that (I think after following https://pve.proxmox.com/wiki/Cluster_Manager#_remove_a_cluster_node and running `pvecm expected 1`) I ended up in a weird situation:
[Screenshot: split.png]
In the first screenshot I'm logged into pve-b16-c2 and in the second into pve-b16-c1 (c1/c2 correspond to the two Dell chassis).
So from c1 I can see all the c1 servers, and from c2 I can see all the c2 servers.
Could anybody point me in the right direction on how to recover from this, please?
Any advice is much appreciated.
Thanks
 
It seems the cluster network is flaky or got botched. You need to make sure that all nodes (in both chassis) can see each other via corosync.
Check the output of:
* `pvecm status`
* `pvecm members`
* `journalctl -r` - general logs
* `journalctl -r -u corosync.service` - corosync logs
* `journalctl -r -u pve-cluster.service` - pmxcfs logs
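
A minimal sketch for collecting the outputs listed above into one file per node, so the two chassis can be compared side by side. It assumes a POSIX shell running as root on a PVE node; the function names `run_and_log` and `collect_cluster_diagnostics`, the default output path and the `-n 200` line limit are illustrative choices, not part of any Proxmox tool (`--no-pager` and `-n` are standard `journalctl` options).

```shell
# Hypothetical helper: append a header plus the command's stdout/stderr to a file,
# and record the exit status if the command fails (e.g. pvecm without quorum).
run_and_log() {
  out=$1
  shift
  printf '===== %s =====\n' "$*" >> "$out"
  "$@" >> "$out" 2>&1 || printf '(exit status %s)\n' "$?" >> "$out"
}

# Hypothetical wrapper: run the checks suggested above on the current node.
# --no-pager stops journalctl from waiting for input; -n 200 keeps only the
# newest 200 entries of each log (printed newest first because of -r).
collect_cluster_diagnostics() {
  out=${1:-/tmp/cluster-diag-$(hostname).txt}
  : > "$out"   # start from an empty file
  run_and_log "$out" pvecm status
  run_and_log "$out" pvecm members
  run_and_log "$out" journalctl -r --no-pager -n 200
  run_and_log "$out" journalctl -r --no-pager -n 200 -u corosync.service
  run_and_log "$out" journalctl -r --no-pager -n 200 -u pve-cluster.service
  echo "$out"
}
```

For example, running `collect_cluster_diagnostics` on pve-b16-c1 and on pve-b16-c2 gives two files whose `pvecm members` sections should list the same nodes once the cluster is healthy again.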

With `pvecm status` (and, digging deeper, `/etc/corosync/corosync.conf`) you can see which node IPs are used and where the connection might have been lost.
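
To see which addresses corosync is configured to use, the `nodelist` section of `/etc/corosync/corosync.conf` can also be read directly. The sketch below assumes the usual PVE layout, with one `key: value` pair per line and a `name:` and `ring0_addr:` entry in every `node { ... }` block. `list_corosync_nodes` and `ping_corosync_nodes` are hypothetical helpers, and a successful ping only shows basic IP reachability, not that corosync traffic (multicast or otherwise) actually gets through.

```shell
# Hypothetical helper: print "name ring0_addr" for every node block in a corosync.conf.
# Defaults to /etc/corosync/corosync.conf if no file is given.
list_corosync_nodes() {
  awk '
    $1 == "name:"       { name = $2 }
    $1 == "ring0_addr:" { addr = $2 }
    # A closing brace ends a node block; print it if both fields were seen.
    $1 == "}" && name != "" && addr != "" { print name, addr; name = ""; addr = "" }
  ' "${1:-/etc/corosync/corosync.conf}"
}

# Hypothetical helper: ping each ring0 address once (1 second timeout) and report it.
ping_corosync_nodes() {
  list_corosync_nodes "$@" | while read -r name addr; do
    if ping -c 1 -W 1 "$addr" >/dev/null 2>&1; then
      echo "$name ($addr): reachable"
    else
      echo "$name ($addr): NOT reachable"
    fi
  done
}
```

Running `ping_corosync_nodes` on a node in each chassis should quickly show whether the other chassis' ring0 addresses are reachable at all. On some setups `ring0_addr` holds a hostname instead of an IP, which `ping` also accepts.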

Keep in mind that corosync usually runs over multicast; maybe the switch between the two chassis currently has a problem?
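
Whether multicast really works between the chassis can be tested with `omping`, which the Proxmox multicast notes suggest for this purpose. It has to be installed separately and started on all listed nodes at roughly the same time. The wrapper below is a hypothetical convenience function; the flags mirror the short burst test from those notes (`-c 10000 -i 0.001 -F -q`) and are worth double-checking against `man omping` on your system.

```shell
# Hypothetical wrapper around omping. Start it with the same node list on every
# node at about the same time; each instance reports loss per peer.
check_multicast() {
  if [ "$#" -lt 2 ]; then
    echo "usage: check_multicast NODE1 NODE2 [NODE...]" >&2
    return 2
  fi
  if ! command -v omping >/dev/null 2>&1; then
    echo "omping not found (on Debian/PVE: apt install omping)" >&2
    return 127
  fi
  # Short burst test: -c sets the packet count, -i the interval in seconds;
  # -F and -q are passed as given in the PVE multicast notes.
  omping -c 10000 -i 0.001 -F -q "$@"
}
```

For example, run `check_multicast pve-b16-c1 pve-b16-c2` followed by the remaining node names (the same list, including the local node) on every node. If loss shows up only for peers in the other chassis, the switch or its IGMP snooping settings between them are a likely suspect.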

Hope this helps!