1) So you mean this problem is caused by turning on HA, right?
Yes, the watchdog is exactly what ensures that a node that is not part of the cluster quorum shuts itself down so that another node can take over its guests. Check the logs of the "corosync" unit; they will tell you when each node lost contact with the others.
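If it helps, here is a rough sketch for narrowing the corosync log down to the lines that usually matter when nodes lose contact. The function name `corosync_events` and the keyword list (link, membership, quorum, token) are my own assumptions about what is worth looking for, not an official or complete filter.

```shell
# Hypothetical helper: keep only log lines that mention links, membership,
# quorum, or token problems. Reads log text on stdin.
corosync_events() {
  grep -Ei 'link|membership|quorum|token'
}

# Usage on a cluster node (not executed here):
#   journalctl -b -u corosync | corosync_events
```

Comparing the timestamps of the matching lines across nodes should show which node dropped out first.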
So how can I fix this and continue to use HA, sir?
Yes, this is HA doing its job. corosync notices when nodes are not up or not connected to each other, and every node that is not part of the majority/quorum will "kill itself".
"journalctl -b -u corosync" will give you the log since bootup.
1) How do I know my cluster network is stable enough, sir? I'm using Mellanox 40 Gbps on every node.
You need to ensure your cluster network is stable enough; that is a requirement for HA.
Do you use just one network card for the cluster network? In that case the latencies may be the reason why corosync is losing the connection.
The Proxmox VE cluster stack requires a reliable network with latencies under 5 milliseconds (LAN performance) between all nodes to operate stably.
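One way to check the latency requirement is to ping the other nodes over the cluster network and compare the worst round-trip time against the 5 ms limit. The sketch below is only an illustration: `check_rtt` is a made-up helper name, it assumes the Linux `ping` summary line format (`rtt min/avg/max/mdev = ...`), and the node address in the usage comment is a placeholder.

```shell
# Hypothetical helper: read `ping` output on stdin and compare the maximum
# round-trip time against a limit in milliseconds (default 5).
# Exit status: 0 = below limit, 1 = at or above limit, 2 = no summary line found.
check_rtt() {
  limit_ms="${1:-5}"
  awk -v limit="$limit_ms" '
    /min\/avg\/max/ {
      split($0, halves, " = ")     # text after " = " holds the numbers
      split(halves[2], v, "/")     # v[1]=min, v[2]=avg, v[3]=max
      max = v[3] + 0
      found = 1
    }
    END {
      if (!found) { print "no RTT summary found"; exit 2 }
      if (max < limit) { print "OK: max RTT " max " ms"; exit 0 }
      print "TOO SLOW: max RTT " max " ms"
      exit 1
    }'
}

# Usage from one node to another (placeholder address, not executed here):
#   ping -c 100 -i 0.2 <other-node-cluster-ip> | check_rtt 5
```

A short ping run on an idle network only gives a first impression; latency spikes under load (for example when storage or migration traffic shares the same link) are what tends to trip corosync, so it is worth repeating the test while the network is busy.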