[High Availability] Does enabling HA cause all nodes to reboot?

kengrass

New Member
Mar 14, 2024
Before I turned on HA for all 10 of my nodes (and all the VMs on them), everything stayed up without any problems. Since turning it on, all 10 nodes reboot frequently (once every few days). Has anyone else had this problem?
 
I'm attaching the logs. After I turned on HA, the nodes rebooted 3 times on 3 different days (19, 24, 27). Please help me check whether this is related to turning on HA or not!

I'm running version 8.2.7.
 

Attachments: 3.jpg (167.2 KB), 2.jpg (949.4 KB), 0.jpg (612.4 KB), 1.jpg (810.9 KB)
Yes, the watchdog is exactly what ensures that a node that is not part of the cluster quorum shuts itself down so that another node can take over its guests. Check the logs of the "corosync" unit; they will tell you when each node lost contact with the others.
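
For anyone who wants the quorum rule in concrete numbers, here is a minimal Python sketch of the majority arithmetic. It assumes every node has exactly one vote, and the function names are made up for illustration; it is not how corosync is implemented internally.

```python
def votes_needed(total_nodes: int) -> int:
    """Smallest number of votes that is a strict majority of total_nodes.

    Assumes one vote per node.
    """
    return total_nodes // 2 + 1


def is_quorate(reachable_nodes: int, total_nodes: int) -> bool:
    """True if a partition that sees reachable_nodes (itself included) holds a majority."""
    return reachable_nodes >= votes_needed(total_nodes)


if __name__ == "__main__":
    total = 10  # cluster size from the first post
    print("votes needed:", votes_needed(total))
    for partition_size in (10, 6, 5, 1):
        print(partition_size, "nodes reachable ->", is_quorate(partition_size, total))
```

With 10 nodes a partition needs 6 votes, so if a network problem splits the cluster 5/5, neither half is quorate.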
 
1) So you mean this problem is definitely caused by turning on HA, right?
2) How do I check the logs of the "corosync" unit?

Sorry, I'm totally new to Proxmox.
 
Yes, this is HA doing its job. Corosync notices when nodes are not up or not connected to each other, and every node that is not part of the majority (the quorum) will "kill itself".

"journalctl -b -u corosync" will give you the log since bootup.
 
So how can I fix this and keep using HA?
 
You need to ensure your cluster network is stable enough. That is a requirement for HA.
 
1) How do I know whether my cluster network is stable enough? I'm using Mellanox 40 Gbps networking on every node.
2) Do I need to configure anything else besides turning on HA (creating an HA group and enabling HA for every VM)?
 
Do you use just one network card for the cluster network? In that case, latency may be the reason corosync is losing the connection.

From the documentation [1]:
"The Proxmox VE cluster stack requires a reliable network with latencies under 5 milliseconds (LAN performance) between all nodes to operate stably."
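
To compare your own measurements against that 5 ms figure, a small Python sketch like the one below could help. It assumes you have saved the output of a Linux `ping` run between two nodes and that each reply line contains a `time=... ms` field; the helper names are hypothetical.

```python
import re
import statistics

LATENCY_LIMIT_MS = 5.0  # threshold quoted from the documentation above

# Matches the "time=0.215 ms" field in a ping reply line (assumed output format).
_RTT_PATTERN = re.compile(r"time[=<]([0-9.]+) ms")


def parse_rtts(ping_output: str) -> list[float]:
    """Extract round-trip times in milliseconds from saved ping output."""
    return [float(value) for value in _RTT_PATTERN.findall(ping_output)]


def summarize(rtts_ms, limit_ms: float = LATENCY_LIMIT_MS) -> dict:
    """Summarize the samples and flag whether all of them stay under limit_ms."""
    if not rtts_ms:
        raise ValueError("no round-trip samples given")
    worst = max(rtts_ms)
    return {
        "samples": len(rtts_ms),
        "mean_ms": statistics.mean(rtts_ms),
        "max_ms": worst,
        "over_limit": sum(1 for rtt in rtts_ms if rtt >= limit_ms),
        "ok": worst < limit_ms,
    }
```

Feed it a long ping run taken while the network is under its normal load, for example `summarize(parse_rtts(text))`. The sketch looks at the maximum and not only the mean because occasional spikes are what you are hunting for here.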

Maybe add a second physical network that is completely reserved for corosync; it can be a slow one (1 Gbit/s). See the chapter "Separate Cluster Network" in [1] for this.

[1] https://pve.proxmox.com/wiki/Cluster_Manager#_cluster_network
 
Yeah, please read the documentation about the requirements for clustering and HA!
 