NIC Link is Down every few minutes

I'm trying to get to the bottom of a problem that started around the middle of last week. I have a node that is almost two years old and it had worked flawlessly until now. I did rebuild it about 4-5 months back because I wanted the OS drive to be a mirrored ZFS pool, but nothing else changed and I imported the original VM pool as well.

One of my VMs is an OpenVPN server. Last week I started having issues where, every few minutes, the connection would completely stall for about 5 seconds and then come back. I wasn't disconnected from the server, but no traffic moved in either direction. It happens regularly, though not like clockwork at exactly every 5 minutes. I noticed similar behavior when streaming on the internal network from a Windows gaming VM. I started trying to figure out whether the problem was the node, the switch, etc. The logs at the time were so filled with the following error that I didn't notice the NIC error, but I have since found it by looking for it specifically.

Code:
Jun 17 00:00:06 zeus kernel: kvm [97149]: ignored rdmsr: 0xc0010293 data 0x0
Jun 17 00:00:11 zeus kernel: kvm_msr_ignored_check: 104 callbacks suppressed

Yesterday, before finding the NIC error, I finally stopped putting off clustering my three nodes. Since then I've been able to watch this node go offline for a few seconds and return. I've also activated a second physical NIC specifically for corosync and replication, and that NIC goes down as well. I'm fairly sure the two ports share a controller, so I'm wondering if it is starting to go bad or if I'm missing something. Here is the error I regularly get in the log.

Code:
Jun 20 10:15:28 zeus kernel: ixgbe 0000:09:00.1 enp9s0f1: NIC Link is Down
Jun 20 10:15:28 zeus kernel: ixgbe 0000:09:00.0 enp9s0f0: NIC Link is Down
Jun 20 10:15:28 zeus kernel: vmbr0: port 1(enp9s0f0) entered disabled state
Jun 20 10:15:29 zeus corosync[201510]:   [KNET  ] link: host: 3 link: 0 is down
Jun 20 10:15:29 zeus corosync[201510]:   [KNET  ] link: host: 2 link: 0 is down
Jun 20 10:15:29 zeus corosync[201510]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Jun 20 10:15:29 zeus corosync[201510]:   [KNET  ] host: host: 3 has no active links
Jun 20 10:15:29 zeus corosync[201510]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Jun 20 10:15:29 zeus corosync[201510]:   [KNET  ] host: host: 2 has no active links
Jun 20 10:15:30 zeus corosync[201510]:   [TOTEM ] Token has not been received in 2737 ms
Jun 20 10:15:31 zeus corosync[201510]:   [TOTEM ] A processor failed, forming new configuration: token timed out (3650ms), waiting 4380ms for consensus.
Jun 20 10:15:32 zeus kernel: ixgbe 0000:09:00.0 enp9s0f0: NIC Link is Up 10 Gbps, Flow Control: None
Jun 20 10:15:32 zeus kernel: vmbr0: port 1(enp9s0f0) entered blocking state
Jun 20 10:15:32 zeus kernel: vmbr0: port 1(enp9s0f0) entered forwarding state
Jun 20 10:15:32 zeus kernel: ixgbe 0000:09:00.1 enp9s0f1: NIC Link is Up 10 Gbps, Flow Control: None
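
For what it's worth, the drops are easy to catch in real time from the PVE shell with something along these lines (plain journalctl plus GNU grep):

Code:
# follow kernel messages and show only ixgbe / link state events
journalctl -kf | grep -iE 'ixgbe|link is (up|down)'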

I hadn't updated Proxmox prior to the issue arising. I then updated it last week, and again today, and the problem persists through all of it. Is this likely a symptom of failing hardware, or am I missing something bigger here? I have room to add a standalone network card to the server and transition to it. Do you think that might solve it? I'm open to any suggestions that I should be trying.

For now I've migrated my more critical VMs to the other nodes so they keep functioning properly. As a note, no VMs or services have ever complained about the outage. I even ran an indefinite ping against the main server from a Windows machine yesterday, and while I felt I could visually see the hang, the ping never reported any drops or even longer response times for any packets. I'm at a loss and would really appreciate any help or insight. Thank you.
 
What device is the PVE host connected to, probably a network switch?
Is there a LAGG defined (bonding of two ports)?
Hi, yes, it is connected to a network switch, the same one as the other nodes. There is no bonding of two ports. I haven't changed anything in the setup, and it worked perfectly for the 4-5 months since I rebuilt it with the mirrored OS drives.

This all just started being an issue last week.
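
For reference, the network config is just the default single-port bridge plus the second port I brought up for corosync; roughly like the sketch below (the addresses are placeholders, not my real ones):

Code:
# /etc/network/interfaces (simplified; addresses are placeholders)
auto lo
iface lo inet loopback

iface enp9s0f0 inet manual

auto vmbr0
iface vmbr0 inet static
        address 192.0.2.10/24
        gateway 192.0.2.1
        bridge-ports enp9s0f0
        bridge-stp off
        bridge-fd 0

# second port, used only for corosync/replication
auto enp9s0f1
iface enp9s0f1 inet static
        address 198.51.100.10/24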
 
You said you ran an infinite ping command to the PVE host.
Can you do an infinite ping from the PVE host to, for example, your router (or a similar always-on device)?
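
From the PVE shell something like this works well for that, with your router's address in place of the example one; -D timestamps each line and -O reports replies that never arrive:

Code:
ping -D -O 192.168.1.1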
 
I went ahead and pinged a completely separate node that isn't part of the cluster. It took a while until the NIC dropped. While I can't say exactly which sequence number corresponds to the drop, it will be one of the ones near where I stopped. The times stayed pretty consistent throughout the whole run, but I do see the 4% packet loss at the bottom, which likely represents the failure.

I've also noticed that, now that I've migrated many of the guests off this node, the two NICs don't always go offline together. Sometimes it's just the management/guest bridge, leaving the corosync NIC up.

Someone mentioned it might be an overheating issue, though the board's management interface claims the onboard LAN is only at 57 °C. So either it's reading the IPMI LAN, reading it wrong, or it's not a temperature issue at all, since that reading should be fine.
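
The 57 °C figure comes from the board's IPMI interface; if it helps, I can also cross-check from the OS side with something like this (assuming ipmitool and lm-sensors are installed):

Code:
# temperatures as reported by the BMC
ipmitool sdr type temperature
# temperatures from the kernel's hwmon drivers
sensors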
 

Attachments

  • Screenshot 2023-06-20 155605.jpg
