We've spent the past 6 months trying to make a 50-node PVE cluster stable. While we managed to identify and report multiple Corosync issues that cause instability, the cluster is still far from stable. The problem isn't just keeping the Corosync cluster alive. Whenever Corosync has one of its "episodes", all the nodes in the cluster start flooding each other with UDP traffic. In other words, the nodes effectively DDoS each other. If Corosync and your VMs/LXCs share a NIC, the flood makes those guests unreachable. Sometimes it gets bad enough that you can't even SSH in.

Given this risk of catastrophic failure at large scale, is anyone at Proxmox looking into the viability of alternatives, such as using an external ZooKeeper cluster for larger PVE clusters? The only prior mention of this I could find was this mailing list thread from 2016:
https://lists.proxmox.com/pipermail/pve-devel/2016-September/022909.html