Hey all,
I'm currently testing Ceph and Proxmox HA in a multi-datacenter configuration (over dark fiber, so sub-millisecond latency). I tried to recreate the worst possible scenario: what if all nodes lose the corosync layer at once? This should never happen, but that's what the lab is for.
I'm using Proxmox 8.2.2 and the latest version of Ceph.
After I removed the corosync link, all nodes rebooted as expected and were left without quorum. When I restored the corosync link, one of my nodes flat-out refused to reconnect to the cluster. That's not a problem in itself, since I can remove it from the cluster and rejoin it. However, this led me to discover another issue: one of the two remaining nodes is flooding the other with corosync packets, causing packet loss and major cluster instability (I'm seeing about 16,000 packets in a 10-second tcpdump session).
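To put that capture in perspective, here is a rough sketch of the rate it works out to. The only inputs are the two figures above (about 16,000 packets over 10 seconds); the function name is hypothetical and just for illustration.

```python
def packets_per_second(packet_count: int, capture_seconds: float) -> float:
    """Average packet rate over a capture window."""
    if capture_seconds <= 0:
        raise ValueError("capture_seconds must be positive")
    return packet_count / capture_seconds


# Figures from the tcpdump session described above (approximate).
rate = packets_per_second(16_000, 10)
print(f"about {rate:.0f} corosync packets per second")
```

This is only an average over the capture window, so it says nothing about whether the traffic is steady or arrives in bursts.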
Any idea what exactly is happening here?