Many unattended reboots of all nodes in a short time span

de Thysebaert · Jul 1, 2021

Hi,
We have 3 nodes running as physical servers at OVH (PVE 5.4-13).
The 3 nodes are in the same cluster. One NIC is connected to the public network (internet) and a second NIC to the OVH vRack virtual infrastructure, which carries several VLANs for private networks. One VLAN is dedicated to the cluster traffic (corosync) and uses a private IP address.
The infrastructure has been running without any problems for 3 years, and the uptime of each node was around 1 year at that point.
There are 22 VMs running (Windows and Linux).
The cluster has 44 CPUs and 314 GB of RAM in total, and resource usage is low.

Yesterday, all 3 nodes suddenly rebooted within a span of 3 minutes. That happened 5 times. The whole infrastructure came back online and the cluster became operational again after each reboot.
There are no updates, maintenance, or jobs scheduled that would cause a reboot.
Now the whole infrastructure is running well, but I don't have any explanation for these reboots. It is strange that all nodes rebooted within such a short time span.

What might have happened? Brief network errors on the VLAN used by the cluster? Why would a short network interruption cause these reboots? How can I track this incident and investigate it?

Do you have any ideas?

fabian · Jul 1, 2021

if you have HA enabled, a (long enough) loss of corosync connectivity will result in fencing of non-quorate nodes. this should be visible in the logs though (corosync, pve-cluster, pve-ha-crm and pve-ha-lrm units).
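
As a rough sketch of how those logs could be pulled for the incident window: the snippet below only builds a `journalctl` command line for the four units named above. The helper name `journalctl_command` and the time window are hypothetical (the thread only says the reboots happened "yesterday"), so adjust them to the actual incident.

```python
import shlex

# Units named in the post above.
HA_UNITS = ["corosync", "pve-cluster", "pve-ha-crm", "pve-ha-lrm"]


def journalctl_command(since, until, units=HA_UNITS):
    """Build a journalctl argument list restricted to the given units and time window."""
    cmd = ["journalctl", "--no-pager"]
    for unit in units:
        cmd += ["-u", unit]  # -u may be repeated to select several units
    cmd += ["--since", since, "--until", until]
    return cmd


# Assumed window: the day before the first post. Replace with the real times.
cmd = journalctl_command("2021-06-30 00:00:00", "2021-07-01 00:00:00")

# Print a copy-pasteable shell command to run on each node.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Running the printed command on every node and comparing timestamps should show whether corosync lost its links shortly before each reboot.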

de Thysebaert · Jul 2, 2021

Thanks, it seems that this was the issue. After the 5 reboots of all nodes I removed the HA configuration, and indeed no new reboots have occurred.
But why does fencing of non-quorate nodes reboot all nodes?
I am now also investigating with the provider why connectivity was lost or degraded at that time.
Thanks

fabian · Jul 2, 2021

fencing requires the node to go down..
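
To illustrate why all three nodes can go down together, here is a minimal sketch of the quorum arithmetic. It assumes one vote per node and the usual majority rule (more than half of all votes), and it models the network failure as a set of partitions. The function names and node names are made up for the example, and it is not how the HA stack is implemented.

```python
def quorum_threshold(total_votes):
    """Smallest number of votes that is a strict majority."""
    return total_votes // 2 + 1


def is_quorate(visible_votes, total_votes):
    """A partition is quorate if it holds a strict majority of all votes."""
    return visible_votes >= quorum_threshold(total_votes)


def nodes_to_fence(partitions, total_votes):
    """Return the nodes sitting in partitions that lost quorum."""
    fenced = []
    for partition in partitions:
        if not is_quorate(len(partition), total_votes):
            fenced.extend(partition)
    return sorted(fenced)


# Cluster VLAN completely down: every node only sees itself (1 of 3 votes).
all_isolated = [["node1"], ["node2"], ["node3"]]
print(nodes_to_fence(all_isolated, 3))

# Only one node cut off: the other two still hold 2 of 3 votes.
one_cut_off = [["node1", "node2"], ["node3"]]
print(nodes_to_fence(one_cut_off, 3))
```

With 3 nodes the threshold is 2 votes. If the corosync VLAN fails for everyone at once, no node can see a second vote, so every node is non-quorate and, with HA active, every node fences itself.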

de Thysebaert · Jul 2, 2021

OK, I understand. Is there a way to tweak this parameter?

fabian · Jul 2, 2021

what parameter? if you want HA, you need fencing to ensure guests are not running on multiple nodes.. this is a hard requirement.

de Thysebaert · Jul 2, 2021

OK, I understand. Thanks for the explanation.
