Very flaky network with Intel X710

danb35

Renowned Member
tl;dr: I'm experiencing a very flaky network connection on a Dell PowerEdge R630 with the Intel X710/i350 network daughter card. The node itself drops offline and comes back online seemingly at random, and VMs and containers on the node are also very flaky.

Background: I've had a Proxmox cluster running on three nodes of a PowerEdge C6220 II for a while. It's run reasonably well, but for a variety of reasons I want to move those nodes to R630s. So I ordered one from eBay configured as desired, moved the boot drives (ZFS mirror boot pool) to the new hardware, edited /etc/network/interfaces to reflect the new interface names, and figured I'd be good to go.

Well, not so much. The first problem I encountered was that the primary interface was down and stayed down. Replacing the SFP+ optic with another one (both Intel-compatible units from fs.com) brought the link up, mostly. But it still drops, and VMs/containers on that system are very flaky.

I'm not sure where I should be looking. I don't see anything untoward in /var/log/syslog, but I'm not confident I know what to look for. Output of pveversion -v, lspci -v -s 01:00.0, and the contents of /etc/network/interfaces are below:
Code:
root@pve3 ➜  ~ pveversion -v
proxmox-ve: 7.4-1 (running kernel: 5.15.107-1-pve)
pve-manager: 7.4-3 (running version: 7.4-3/9002ab8a)
pve-kernel-5.15: 7.4-2
pve-kernel-5.13: 7.1-9
pve-kernel-5.15.107-1-pve: 5.15.107-1
pve-kernel-5.15.104-1-pve: 5.15.104-2
pve-kernel-5.13.19-6-pve: 5.13.19-15
ceph: 17.2.5-pve1
ceph-fuse: 17.2.5-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4-2
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.3-4
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-3
libpve-rs-perl: 0.7.5
libpve-storage-perl: 7.4-2
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
proxmox-backup-client: 2.4.1-1
proxmox-backup-file-restore: 2.4.1-1
proxmox-kernel-helper: 7.4-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-offline-mirror-helper: 0.5.1-1
proxmox-widget-toolkit: 3.6.5
pve-cluster: 7.3-3
pve-container: 4.4-3
pve-docs: 7.4-2
pve-edk2-firmware: 3.20230228-2
pve-firewall: 4.3-1
pve-firmware: 3.6-5
pve-ha-manager: 3.6.1
pve-i18n: 2.12-1
pve-qemu-kvm: 7.2.0-8
pve-xtermjs: 4.16.0-1
qemu-server: 7.4-3
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.11-pve1
root@pve3 ➜  ~ lspci -v -s 01:00.0
01:00.0 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02)
    DeviceName: NIC1
    Subsystem: Dell Ethernet 10G 4P X710/I350 rNDC
    Flags: bus master, fast devsel, latency 0, IRQ 48, NUMA node 0
    Memory at 91000000 (64-bit, prefetchable) [size=16M]
    Memory at 92008000 (64-bit, prefetchable) [size=32K]
    Expansion ROM at 92100000 [disabled] [size=512K]
    Capabilities: [40] Power Management version 3
    Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
    Capabilities: [70] MSI-X: Enable+ Count=129 Masked-
    Capabilities: [a0] Express Endpoint, MSI 00
    Capabilities: [e0] Vital Product Data
    Capabilities: [100] Advanced Error Reporting
    Capabilities: [140] Device Serial Number a8-56-b8-ff-ff-4b-43-e4
    Capabilities: [1a0] Transaction Processing Hints
    Capabilities: [1b0] Access Control Services
    Capabilities: [1d0] Secondary PCI Express
    Kernel driver in use: i40e
    Kernel modules: i40e

root@pve3 ➜  ~ cat /etc/network/interfaces
# network interface settings; autogenerated
# Please do NOT modify this file directly, unless you know what
# you're doing.
#
# If you want to manage parts of the network configuration manually,
# please utilize the 'source' or 'source-directory' directives to do
# so.
# PVE will preserve these directives, but will NOT read its network
# configuration from sourced files, so do not attempt to move any of
# the PVE managed interfaces into external files!

auto lo
iface lo inet loopback

iface eno1 inet manual # aka enp1s0f0

auto eno2
iface eno2 inet static
    address 192.168.5.103/24

auto #
iface # inet manual

iface eno1 inet manual

iface eno3 inet manual

iface eno4 inet manual

iface eno3 inet manual # aka enp8s0f0
iface eno4 inet manual # aka enp8s0f1

auto vmbr0
iface vmbr0 inet static
    address 192.168.1.5/24
    gateway 192.168.1.1
    bridge-ports eno1
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 2-4094
 
I've tried swapping the network daughter card for an X520-based unit; that seems to be working better with the same optics, cables, network configuration, etc. Seems odd. It'd still be nice to know why the X710 is flaky.
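For anyone chasing similar symptoms, a few standard places to look for link flaps and driver-level errors (a sketch, assuming the uplink is eno1 as in the config above; ethtool counter names vary between driver versions):
Code:
# i40e logs link up/down transitions and errors to the kernel ring buffer
journalctl -k | grep -i i40e

# driver statistics; climbing error/drop counters point at the NIC or
# optic rather than at the bridge configuration
ethtool -S eno1 | grep -iE 'err|drop'

# negotiated speed, link state, and detected module type
ethtool eno1

# interface-level RX/TX error and drop counters
ip -s link show eno1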
 
Are the network cards connected to a 10G switch?
Try disabling flow control on the switch's SFP+ ports, if possible.
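If the switch side can't be changed, pause frames can also be checked and turned off from the host with ethtool (a sketch, assuming the uplink is eno1; this is generic ethtool usage, nothing X710-specific):
Code:
# show current pause-frame (flow control) settings
ethtool -a eno1

# disable pause autonegotiation and RX/TX pause frames
ethtool -A eno1 autoneg off rx off tx off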
 
I read, probably in the man page, that you should not put comments on the same line:
iface eno1 inet manual # aka enp1s0f0
should be:
iface eno1 inet manual
# aka enp1s0f0
There are a couple of other instances of that in your interfaces file. I don't know if that's what's causing the issue, but it's worth a try.
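For illustration, the affected stanzas with the comments moved onto their own lines (dropping the stray "auto #" / "iface # inet manual" lines as well, which look like fallout from those inline comments being parsed) would read:
Code:
# aka enp1s0f0
iface eno1 inet manual

# aka enp8s0f0
iface eno3 inet manual

# aka enp8s0f1
iface eno4 inet manual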
 
We've had issues with LLDP and these cards; it turned out the card had an LLDP client in its firmware. Disabling that solved a lot of weird issues like packet loss or, in some cases, the link simply not working.
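On the i40e driver this is usually exposed as a private flag, so it can be toggled without touching the firmware directly (a sketch, assuming the interface is eno1; older driver/firmware combinations may not expose the flag):
Code:
# check whether the driver exposes the firmware-LLDP flag
ethtool --show-priv-flags eno1

# stop the on-NIC LLDP agent so LLDP frames reach the host instead
ethtool --set-priv-flags eno1 disable-fw-lldp on
To make it persist across reboots, one common approach is a post-up line on the interface in /etc/network/interfaces.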
 
