[High Availability] Does enabling HA cause all nodes to reboot?

kengrass · Oct 28, 2024

Before I turned on HA for all 10 of my nodes (and for all the VMs on them), everything ran fine with no problems. After I turned it on, all 10 nodes started rebooting frequently (once every few days). Has anyone else run into this?

fabian · Oct 28, 2024

if you want to enable HA, you need to ensure that your network is stable enough for corosync to not lose quorum:

https://pve.proxmox.com/pve-docs/chapter-pvecm.html#_cluster_network

otherwise, it can happen that some or all nodes in the cluster get fenced:

https://pve.proxmox.com/pve-docs/chapter-ha-manager.html#ha_manager_fencing

kengrass · Oct 28, 2024

I've attached logs. After I turned on HA, the nodes rebooted 3 times on 3 different days (the 19th, 24th, and 27th). Please help me check whether this is related to turning on HA!

I'm using version 8.2.7.

fabian · Oct 28, 2024

yes, the watchdog is exactly what ensures that a node that is not part of the cluster quorum shuts itself down so that another node can take over its guests. check the logs of the "corosync" unit; they will tell you when each node lost contact with the others.

kengrass · Oct 28, 2024

fabian said:
yes, the watchdog is exactly what ensures that a node that is not part of the cluster quorum shuts itself down so that another node can take over its guests. check the logs of the "corosync" unit; they will tell you when each node lost contact with the others.

1) So you mean this problem is definitely caused by turning on HA, right?
2) How do I check the logs of the "corosync" unit?

Sorry, I'm totally new to Proxmox.

fabian · Oct 28, 2024

yes, this is HA doing its job. corosync notices nodes not being up/connected to each other, and every node that is not part of the majority/quorum will "kill itself".
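To make the "majority/quorum" part concrete, here is a minimal sketch of the arithmetic. It assumes every node has exactly one vote (the corosync default) and that no QDevice or custom vote weighting is configured; the function names are made up for illustration.

```shell
#!/bin/sh
# Sketch only: assumes one vote per node and no QDevice.

# Votes needed for quorum: a strict majority of the total.
quorum_votes() {
    total=$1
    echo $(( total / 2 + 1 ))
}

# Succeeds (exit 0) if a partition that can see `reachable` nodes
# (counting itself) out of `total` still holds quorum.
has_quorum() {
    reachable=$1
    total=$2
    [ "$reachable" -ge "$(quorum_votes "$total")" ]
}

# Example: quorum_votes 10 prints 6
```

Under these assumptions a 10-node cluster needs 6 votes, so if the network splits it 5/5, neither half has a majority and every node ends up fencing itself. That matches the "some or all nodes" wording earlier in the thread.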

"journalctl -b -u corosync" will give you the log since bootup.
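Since the node has rebooted in the meantime, the interesting messages are likely in the boot before the current one. A small sketch for narrowing the log down follows; the keyword list is an assumption about what corosync usually logs around connectivity loss, and filter_cluster_events is a made-up helper name.

```shell
#!/bin/sh
# Sketch only: the keyword list is an assumption, adjust it to your output.

# Reads log lines on stdin and keeps those mentioning links, membership,
# quorum, or token problems (case-insensitive).
filter_cluster_events() {
    grep -i -E 'link|membership|quorum|token'
}

# Usage on a node, looking at the previous boot (the one that ended in
# the reboot); this needs a persistent journal:
#   journalctl -b -1 -u corosync | filter_cluster_events
```

Comparing the timestamps of the remaining lines across nodes should show which links dropped first and whether it lines up with the reboot times.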

kengrass · Oct 28, 2024

fabian said:
yes, this is HA doing its job. corosync notices nodes not being up/connected to each other, and every node that is not part of the majority/quorum will "kill itself".

"journalctl -b -u corosync" will give you the log since bootup.

So how can I fix this and continue to use HA?

fabian · Oct 28, 2024

you need to ensure your cluster network is stable enough. that is a requirement for HA.

kengrass · Oct 28, 2024

fabian said:
you need to ensure your cluster network is stable enough. that is a requirement for HA.

1) How do I know whether my cluster network is stable enough? I'm using Mellanox 40 Gbps NICs on every node.
2) Do I need to configure anything else besides turning on HA (creating an HA group and enabling HA for every VM)?

Azunai333 · Oct 28, 2024

kengrass said:
How do I know whether my cluster network is stable enough? I'm using Mellanox 40 Gbps NICs on every node.

Do you use just one network card for the cluster network? In that case the latencies may be the reason why corosync is losing the connection.

From the documentation [1]:

The Proxmox VE cluster stack requires a reliable network with latencies under 5 milliseconds (LAN performance) between all nodes to operate stably.

Maybe add a second physical network that is completely reserved for corosync. It can be a slow one (1 Gbit/s). See the chapter "Separate Cluster Network" in [1] for this.

[1] https://pve.proxmox.com/wiki/Cluster_Manager#_cluster_network
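As a rough way to compare your network against that 5 ms figure, you could measure round-trip times between nodes. The sketch below assumes Linux iputils ping (summary line of the form "rtt min/avg/max/mdev = ... ms"); the helper names and the 10.0.0.2 address are placeholders, not anything from your setup.

```shell
#!/bin/sh
# Sketch only: assumes iputils ping, whose summary line looks like
#   rtt min/avg/max/mdev = 0.045/0.052/0.061/0.006 ms

# Extract the average round-trip time (ms) from ping output on stdin.
avg_rtt_ms() {
    awk -F'/' '/^rtt / { print $5 }'
}

# Succeeds (exit 0) if the given average is below the limit (default 5 ms).
latency_ok() {
    awk -v avg="$1" -v limit="${2:-5}" 'BEGIN { exit !(avg + 0 < limit + 0) }'
}

# Usage, with 10.0.0.2 standing in for another node's corosync address:
#   avg=$(ping -c 20 -q 10.0.0.2 | avg_rtt_ms)
#   latency_ok "$avg" && echo "avg ${avg} ms: ok" || echo "avg ${avg} ms: too high"
```

An idle average only tells part of the story. If corosync shares the link with storage, backup, or migration traffic, it is worth repeating the measurement while that traffic is running and looking at the max value as well. "pvecm status" and "corosync-cfgtool -s" show the current quorum and link state on a node.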

fabian · Oct 28, 2024

yeah, please read the documentation about the requirements for clustering and HA!
