DELL, EEE, Proxmox, HA and Restarting servers

Nov 2, 2023
Good afternoon everyone.

We currently have three DELL R740 servers in the company running Proxmox 8.3.2. They are configured as a cluster with high availability, and we use CEPH.
For some time now we have had unexpected restarts on all three servers whenever there is congestion or a failure in the network.
Searching the Internet, we found a possible cause for the restarts: the Energy-Efficient Ethernet (EEE) option is enabled in the BIOS of each server. We ran some tests by disconnecting the network cables from a server, and it turned off after a short time. When we disabled the EEE option on each network card in the BIOS and repeated the tests, the server no longer restarted; that is, it stayed on.
So far so good, but when I checked in Proxmox, a virtual machine that we have configured on the server we tested did not go through the high availability process and did not move to another server in the cluster.
Seeing this, we re-enabled the EEE option in the BIOS, and after that high availability did work for that virtual machine.
Could someone help me solve this: how can we keep the servers from restarting after a network failure while still having high availability work?

https://www.dell.com/support/kbdoc/...icient-ethernet-eee-or-green-ethernet?lang=en

José Fermín Francisco Ferreras
 
 
Wrong Sub-Forum.

For Migration set this:
1741341040726.png
 
So far so good, but when I checked in Proxmox, a virtual machine that we have configured on the server we tested did not go through the high availability process and did not move to another server in the cluster.
What exactly happened on the network and on the PVE hosts when the VM did not move to another host in the cluster?

Could someone help me solve this case of the servers not restarting after a network failure, but also having high availability work?
Remember that PVE HA will only try to restart VMs configured for HA if the host loses quorum, and for that all corosync links have to fail. It will not move a VM if just "a network" fails.

EEE should be disabled in a PVE cluster. I suppose that EEE takes down your corosync interfaces, so the host loses quorum, HA fences the host (reboots it), and the VMs are moved to other hosts in the cluster.
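
To make the quorum point concrete, here is a minimal sketch of the majority arithmetic for a three-node cluster. It assumes the default of one vote per node and no extra quorum device; the function names are made up for illustration and are not part of any Proxmox or corosync API.

```python
def votes_needed(total_votes: int) -> int:
    """Smallest strict majority of the expected votes."""
    return total_votes // 2 + 1


def is_quorate(reachable_votes: int, total_votes: int) -> bool:
    """True if the partition this node sees holds a majority of the votes."""
    return reachable_votes >= votes_needed(total_votes)


TOTAL_VOTES = 3  # three R740 nodes, one vote each (assumed default)

# A node whose corosync links are all down only sees its own vote.
isolated_node_quorate = is_quorate(1, TOTAL_VOTES)

# The two remaining nodes still see each other.
remaining_nodes_quorate = is_quorate(2, TOTAL_VOTES)

print(votes_needed(TOTAL_VOTES), isolated_node_quorate, remaining_nodes_quorate)
```

Under these assumptions the isolated node falls below the two votes it needs, while the other two nodes keep quorum and can recover the HA-managed VM.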
 
Hi smueller
Today we made the changes you recommended in Proxmox. We first tested on a node without virtual machines: the network cables were disconnected and the node did not restart. But when the node has at least one virtual machine and the same test is performed, the node waits for the virtual machine on it to be migrated and then restarts. Why does this happen?