Proxmox VE: whole cluster went down, NIC/HA issue?

Fear37
Member
Dec 10, 2019
Hi guys,

On May 3rd, we noticed all our VMs went down. We have a cluster of 3 PVE nodes with HA and Ceph enabled. All of them showed:

Code:
May  3 10:26:24 pve1 kernel: [   21.134110] e1000e: eno1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
May  3 10:26:24 pve1 kernel: [   21.445181] e1000e: eno2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
May  3 10:34:14 pve1 kernel: [  491.306825] e1000e: eno2 NIC Link is Down
May  3 10:34:14 pve1 kernel: [  491.311122] e1000e: eno1 NIC Link is Down
May  3 10:34:16 pve1 kernel: [  493.209536] e1000e 0000:00:19.0 eno1: Detected Hardware Unit Hang:
May  3 10:34:18 pve1 kernel: [  495.187927] e1000e 0000:00:19.0 eno1: Detected Hardware Unit Hang:
May  3 10:34:20 pve1 kernel: [  497.203920] e1000e 0000:00:19.0 eno1: Detected Hardware Unit Hang:
May  3 10:34:22 pve1 kernel: [  499.187796] e1000e 0000:00:19.0 eno1: Detected Hardware Unit Hang:
May  3 10:34:24 pve1 kernel: [  501.203776] e1000e 0000:00:19.0 eno1: Detected Hardware Unit Hang:
May  3 10:34:25 pve1 kernel: [  501.499415] e1000e: eno2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
May  3 10:34:25 pve1 kernel: [  501.868500] e1000e: eno1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
May  3 10:34:30 pve1 kernel: [  507.218234] NETDEV WATCHDOG: eno1 (e1000e): transmit queue 0 timed out
May  3 10:34:30 pve1 kernel: [  507.218281]  libcrc32c psmouse ahci isci i2c_i801 libahci lpc_ich libsas e1000e scsi_transport_sas wmi
May  3 10:34:30 pve1 kernel: [  507.218471] e1000e 0000:00:19.0 eno1: Reset adapter unexpectedly
May  3 10:34:34 pve1 kernel: [  511.076104] e1000e: eno1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx

Nothing could be entered on the console. I pressed Ctrl-Alt-Del and it eventually got stuck (see attached screenshot, 1620330291823.png).

Any idea what happened? A hard reboot fixed everything.
 
Might be checksum offloading. What card do you have exactly? What is the output of the following commands?
Code:
ethtool -k eno1
ethtool -k eno2
 
Code:
Features for eno1:
rx-checksumming: on
tx-checksumming: on
        tx-checksum-ipv4: off [fixed]
        tx-checksum-ip-generic: on
        tx-checksum-ipv6: off [fixed]
        tx-checksum-fcoe-crc: off [fixed]
        tx-checksum-sctp: off [fixed]
scatter-gather: on
        tx-scatter-gather: on
        tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: on
        tx-tcp-segmentation: on
        tx-tcp-ecn-segmentation: off [fixed]
        tx-tcp-mangleid-segmentation: off
        tx-tcp6-segmentation: on
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: off [fixed]
receive-hashing: on
highdma: on [fixed]
rx-vlan-filter: off [fixed]
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: off [fixed]
tx-gre-csum-segmentation: off [fixed]
tx-ipxip4-segmentation: off [fixed]
tx-ipxip6-segmentation: off [fixed]
tx-udp_tnl-segmentation: off [fixed]
tx-udp_tnl-csum-segmentation: off [fixed]
tx-gso-partial: off [fixed]
tx-sctp-segmentation: off [fixed]
tx-esp-segmentation: off [fixed]
tx-udp-segmentation: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off
rx-all: off
tx-vlan-stag-hw-insert: off [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: off [fixed]
esp-hw-offload: off [fixed]
esp-tx-csum-hw-offload: off [fixed]
rx-udp_tunnel-port-offload: off [fixed]
tls-hw-tx-offload: off [fixed]
tls-hw-rx-offload: off [fixed]
rx-gro-hw: off [fixed]
tls-hw-record: off [fixed]

The only difference for eno2 is:

Code:
rx-vlan-filter: on [fixed]
 
Turning off settings like TSO, GSO, and RX checksum offloading seemed to help a couple of users on the forum who had similar problems, so I'd suggest trying that.
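For reference, a minimal sketch of how those offloads can be disabled with ethtool, assuming the interface names eno1/eno2 from the logs above (the `-K` settings take effect immediately but do not survive a reboot, so a post-up hook in /etc/network/interfaces can be used to make them persistent):

Code:
# disable TCP segmentation offload, generic segmentation/receive offload
# and RX checksumming on both ports (immediate, but not persistent)
ethtool -K eno1 tso off gso off gro off rx off
ethtool -K eno2 tso off gso off gro off rx off

# to persist across reboots, add a post-up hook per interface
# in /etc/network/interfaces, e.g.:
#   iface eno1 inet manual
#       post-up /sbin/ethtool -K eno1 tso off gso off gro off rx off

After changing the settings, you can verify the current state with `ethtool -k eno1` again.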
 
