Half of the hosts in the cluster restarted unexpectedly due to an abnormality

sky_me

In particular, I want to know what protection mechanism in the PVE cluster can cause a host to restart automatically.
Environment:
There are 13 hosts in the cluster: node1-13
Version: pve-manager/6.4-4/337d6701 (running kernel: 5.4.106-1-pve)
Network environment:
There are two switches, A and B. On each node, ens8f0 and ens8f1 are connected to port 2 of switch A and port 2 of switch B respectively. The switches support LACP.
Description:
Node3 is bonded with LACP using the layer3+4 hash policy, and its ens8f0 link was down (port LED not lit). I asked the data-center staff to test the ens8f0 cable: node3's ens8f0 was originally on port 2 of switch A (down) and was moved to port 5 of switch A (up), while ens8f1 stayed on port 2 of switch B. While ens8f0 was being moved from port 2 to port 5, I found that half of the hosts in my cluster restarted abnormally.
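
For reference, the bond configuration on the nodes is roughly like the sketch below; this is only an illustration of the setup described above, and the bridge name and address are placeholders rather than my actual config:
Code:
# /etc/network/interfaces (sketch only; address and bridge are placeholders)
auto bond0
iface bond0 inet manual
        bond-slaves ens8f0 ens8f1
        bond-mode 802.3ad                 # LACP (mode 4)
        bond-xmit-hash-policy layer3+4    # hash policy currently in use
        bond-miimon 100                   # MII link monitoring every 100 ms

auto vmbr0
iface vmbr0 inet static
        address 192.0.2.3/24              # placeholder address
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0
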
Key logs:
node3 (the host whose cable was being moved):
[log screenshot]
node5 and node7-10 (hosts that restarted):
[log screenshots]
 

You have a large and somewhat complex environment. Properly troubleshooting it requires a time and knowledge commitment that likely exceeds volunteer time. If this is business-critical infrastructure, you should have a subscription.

That said, you seem to have had network issues over time, based on your post history. Some related to suspected VLAN loops in the network.

I took a quick glance at your logs; it seems that corosync started complaining about members leaving before the system detected the port DOWN event. This suggests to me that something is not optimally configured.

Off the top of my head, you should check:
- the MLAG connection between switch A and switch B
- that fast LACP is enabled on every channel and on both sides (see the sketch after this list for the Linux-side settings)
- that port distribution is correct
- that you allocate a maintenance window and test cable pulls and connectivity of each node/port. Something that should _always_ be done before going into production
- that your switch can properly handle L3/4 hashing and is compatible with your client. Do keep in mind that the layer3+4 policy is not fully LACP/802.3ad compliant. We don't recommend it to our customers.
- whether the default Mellanox firmware configuration needs modifications; some settings affect link-state detection.
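
As a rough illustration only (the bond and interface names are taken from your post; the rest is a sketch, not a drop-in config), the Linux side of those checks looks something like this:
Code:
# Show negotiated LACP state: LACP rate, hash policy, partner system MAC,
# and per-slave MII status for each member port
cat /proc/net/bonding/bond0

# Show corosync link status from the cluster's point of view
corosync-cfgtool -s

# Bond options to review in /etc/network/interfaces (sketch)
iface bond0 inet manual
        bond-slaves ens8f0 ens8f1
        bond-mode 802.3ad
        bond-lacp-rate 1                  # fast LACPDUs (1s) instead of slow (30s)
        bond-xmit-hash-policy layer2+3    # more conservative than layer3+4
        bond-miimon 100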

Good luck in your troubleshooting.

PS: We have seen an issue where a port update on a switch with a large number of VLANs would take the entire switch out to lunch, affecting everything.


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Thank you very much for your answer. This problem has really been bothering me. I also asked my network colleagues to help investigate, and no other unusual problems were found. In my previous test environment as well, all hosts restarted inexplicably while I was simply adding nodes.
I really don't know why the hosts restart on their own. Is this caused by some mechanism?
Thank you very much!
 
I really don't know why the hosts restart on their own. Is this caused by some mechanism?
In the log of node5, line 97, the watchdog triggered and restarted the node.
This happens if:
  1. You have HA (high availability) configured and enabled.
    Look at "Datacenter > HA". If there is an HA VM on a node, the watchdog is armed on that node. Additionally, there is a master node; the watchdog is armed there, too.
  2. Such a node loses quorum. Losing quorum blocks write access to "/etc/pve", so the node can no longer refresh the watchdog. After 1 minute the watchdog triggers and restarts the node. This is part of HA fencing, as PVE expects the VMs to fail over to another node.
See [1] for details on HA.
As @bbgeek17 already wrote, you probably have a network issue. That's why you are losing quorum.

[1] https://pve.proxmox.com/wiki/High_Availability#ha_manager_fencing
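
To check whether this applies to you, the standard PVE command line gives a quick overview (run on any node; these are just the commands, not output from your cluster):
Code:
# List configured HA resources and the state of the CRM/LRM services
ha-manager status

# Show quorum information: expected votes, total votes, quorate yes/no
pvecm status

# Check whether the watchdog multiplexer service is active (armed by HA)
systemctl status watchdog-mux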
 
Thanks!
This is my HA configuration. I did add virtual machines earlier to get failover, but because resources were over-allocated the failover failed and caused restart problems, so I removed all the virtual machines from HA.
[screenshots: HA configuration]
In addition, I would like to ask: is the watchdog mechanism unavoidable? The bonding mode (mode 4, layer3+4) seems to cause this problem. Which mode should I use in a production environment?
Thanks
 
