Half of the hosts in the cluster automatically restart due to abnormality

sky_me

I particularly want to know which protection mechanism in the PVE cluster can cause a host to restart automatically.
Environment:
There are 13 hosts in the cluster: node1-13
Version: pve-manager/6.4-4/337d6701 (running kernel: 5.4.106-1-pve)
Network environment:
There are two switches, A and B. ens8f0 and ens8f1 are connected to port 2 of switch A and port 2 of switch B respectively. The switches support LACP.
Description:
The node3 host is bonded with LACP using the layer3+4 hash, and the ens8f0 port's link light was off. I asked the data-center staff to test the ens8f0 cable (node3 was originally down on port 2 of switch A and was moved to port 5 of switch A, which came up; ens8f1 stayed on port 2 of switch B). While ens8f0 was being moved from port 2 to port 5, I found that half of the hosts in my cluster restarted abnormally.
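For context, the bond on node3 presumably looks roughly like the sketch below in /etc/network/interfaces. Only the interface names and the 802.3ad / layer3+4 policy come from my setup above; the bridge name and address are placeholders.

    auto bond0
    iface bond0 inet manual
        bond-slaves ens8f0 ens8f1
        bond-mode 802.3ad                # LACP ("mode 4")
        bond-xmit-hash-policy layer3+4
        bond-miimon 100

    auto vmbr0
    iface vmbr0 inet static
        address 192.0.2.3/24             # placeholder address
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0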
Key log entries:
node3 (the host whose cabling was being changed):
[screenshot of node3 syslog]
node5 and node7-10 (the hosts that restarted):
[screenshots of node5 syslog]
 

Attachments

  • syslog-node3.txt (107.9 KB)
  • syslog-node5.txt (37.9 KB)
You have a large and somewhat complex environment. Properly troubleshooting it requires a time and knowledge commitment that likely exceeds volunteer time. If this is business-critical infrastructure, you should have a subscription.

That said, you seem to have had network issues over time, based on your post history, some related to suspected VLAN loops in the network.

I took a quick glance at your logs; it seems that corosync started complaining about members leaving before the system detected the port DOWN event. This suggests to me that something is not optimally configured.
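For reference, the corosync membership and per-link state can be inspected directly on any node; this is a generic sketch, not something taken from the attached logs:

    # Per-link status of the local corosync/knet links
    corosync-cfgtool -s

    # Quorum and membership as corosync currently sees it
    corosync-quorumtool -s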

Off the top of my head, you should check:
- the MLAG connection between switches A and B
- that fast LACP is enabled on every channel and on both sides (see the sketch after this list)
- that the port distribution is correct
- you should allocate a maintenance window and test cable pulls and connectivity of each node/port. Something that should _always_ be done before going into production
- confirm that your switch can properly handle L3/4 hashing and that it is compatible with your client. Do keep in mind that the layer 3+4 policy is not fully LACP / 802.3ad compliant. We don't recommend it to our customers.
- there may be modifications needed to the default Mellanox firmware configuration that affect link-state detection.
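A minimal way to check the negotiated bond state and to request the fast LACP rate on the Linux side, assuming the bond is named bond0 (the name is an assumption on my part):

    # Shows per-slave link state, LACP partner details, and the active hash policy
    cat /proc/net/bonding/bond0

    # In the bond stanza of /etc/network/interfaces the LACPDU rate is set with
    #   bond-lacp-rate fast
    # and applied with ifupdown2's ifreload (if installed):
    ifreload -a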

Good luck in your troubleshooting.

PS: We have seen an issue where a port update on a switch with a large number of VLANs would take the entire switch out to lunch, affecting everything.


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Thank you very much for your answer. This problem really bothers me. I also asked my networking colleagues to help investigate, and no other unusual problems were found. In my previous test environment as well, all hosts restarted inexplicably while I was simply adding nodes.
I really don't know why the host restarts on its own. Is this caused by some mechanism?
Thank you very much for your answer!
 
I really don't know why the host restarts on its own. Is this caused by some mechanism?
In the log of node5, line 97, the watchdog triggered and restarted the node.
This happens if:
  1. You have HA (high availability) configured and enabled.
    Look at "Datacenter > HA". If there is an HA VM on a node, it will arm the watchdog on that node. Additionally, there is a master node; the watchdog is armed there, too.
  2. If such a node loses quorum, this prevents write access to "/etc/pve" and the node can't refresh the watchdog anymore. After 1 minute the watchdog triggers and restarts the node. This is part of HA fencing, as PVE expects the VMs to fail over to another node.
See [1] for details on HA.
As @bbgeek17 already wrote, you probably have a network issue. That's why you are losing quorum.

[1] https://pve.proxmox.com/wiki/High_Availability#ha_manager_fencing
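To check whether this fencing path applies to a given node, a few standard PVE commands help; the exact output varies by version, this is just a sketch:

    # Cluster membership and quorum state (look for "Quorate: Yes")
    pvecm status

    # HA resources and the current manager status; a node only arms the watchdog
    # if it runs HA resources or holds the master role
    ha-manager status

    # The watchdog multiplexer that the HA services use to arm/refresh the watchdog
    systemctl status watchdog-mux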
 
Thanks, bro.
This is my HA configuration. I did add virtual machines to it before to get failover, but because resources were over-allocated the failover failed and caused restart problems, so I removed all the virtual machines again.
[screenshots of the Datacenter > HA view]
In addition, I would like to ask: is the watchdog mechanism unavoidable? The bond mode 4 with layer3+4 hashing seems to cause this problem. Which mode should I use in a production environment?
Thanks.
 