ixgbe port starts flapping infinitely after an AMD IOMMU error.

rahman

Hi,

We have a 3-node Proxmox cluster: 2x SuperMicro and 1x Gigabyte servers, all with dual-socket AMD EPYC 7742 64-core processors. The SuperMicro servers have onboard dual-port X540-AT2 (rev 01) 10G copper ports.

The problem is that at random intervals (sometimes once or twice a month, sometimes once a week), one of the ixgbe ports starts to fail and resets continuously. The problem always starts with an AMD IOMMU error. When this happens, only rebooting the host fixes it; ifdown/ifup does not. Here is a sample of the logs:

Code:
Jun 16 14:41:32 sm2 kernel: ixgbe 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0xed51d180 flags=0x0000]
Jun 16 14:41:33 sm2 corosync[4327]:   [KNET  ] host: host: 1 has no active links
Jun 16 14:41:35 sm2 corosync[4327]:   [TOTEM ] Token has not been received in 2737 ms
Jun 16 14:41:35 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: initiating reset due to tx timeout
Jun 16 14:41:35 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: initiating reset due to tx timeout
Jun 16 14:41:35 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: initiating reset due to tx timeout
Jun 16 14:41:35 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: initiating reset due to tx timeout
Jun 16 14:41:35 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: initiating reset due to tx timeout
Jun 16 14:41:35 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: initiating reset due to tx timeout
Jun 16 14:41:35 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: initiating reset due to tx timeout
Jun 16 14:41:35 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: initiating reset due to tx timeout
Jun 16 14:41:35 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: initiating reset due to tx timeout
Jun 16 14:41:35 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: initiating reset due to tx timeout
Jun 16 14:41:35 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: Reset adapter
Jun 16 14:41:35 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: TXDCTL.ENABLE for one or more queues not cleared within the polling period
Jun 16 14:41:36 sm2 kernel: ixgbe 0000:41:00.0: primary disable timed out
Jun 16 14:41:36 sm2 corosync[4327]:   [TOTEM ] A processor failed, forming new configuration: token timed out (3650ms), waiting 4380ms for consensus.
Jun 16 14:41:37 sm2 corosync[4327]:   [KNET  ] host: host: 2 has no active links
Jun 16 14:41:40 sm2 corosync[4327]:   [QUORUM] Sync members[3]: 1 2 3
Jun 16 14:41:40 sm2 corosync[4327]:   [TOTEM ] A new membership (1.785) was formed. Members
Jun 16 14:41:40 sm2 corosync[4327]:   [QUORUM] Members[3]: 1 2 3
Jun 16 14:41:40 sm2 corosync[4327]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jun 16 14:41:44 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: initiating reset due to tx timeout
Jun 16 14:41:44 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: initiating reset due to tx timeout
Jun 16 14:41:44 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: initiating reset due to tx timeout
Jun 16 14:41:44 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: Reset adapter
Jun 16 14:41:44 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: initiating reset due to tx timeout
Jun 16 14:41:44 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: RXDCTL.ENABLE for one or more queues not cleared within the polling period
Jun 16 14:41:44 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: TXDCTL.ENABLE for one or more queues not cleared within the polling period
Jun 16 14:41:44 sm2 kernel: ixgbe 0000:41:00.0: primary disable timed out
Jun 16 14:41:53 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: initiating reset due to tx timeout
Jun 16 14:41:53 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: initiating reset due to tx timeout
Jun 16 14:41:53 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: initiating reset due to tx timeout
Jun 16 14:41:53 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: Reset adapter
Jun 16 14:41:53 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: RXDCTL.ENABLE for one or more queues not cleared within the polling period
Jun 16 14:41:53 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: TXDCTL.ENABLE for one or more queues not cleared within the polling period
Jun 16 14:41:53 sm2 kernel: ixgbe 0000:41:00.0: primary disable timed out
Jun 16 14:42:02 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: initiating reset due to tx timeout
Jun 16 14:42:02 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: Reset adapter
Jun 16 14:42:02 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: RXDCTL.ENABLE for one or more queues not cleared within the polling period
Jun 16 14:42:02 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: TXDCTL.ENABLE for one or more queues not cleared within the polling period
Jun 16 14:42:02 sm2 kernel: ixgbe 0000:41:00.0: primary disable timed out
Jun 16 14:42:11 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: initiating reset due to tx timeout
Jun 16 14:42:11 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: initiating reset due to tx timeout
Jun 16 14:42:11 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: Reset adapter
Jun 16 14:42:11 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: RXDCTL.ENABLE for one or more queues not cleared within the polling period
Jun 16 14:42:11 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: TXDCTL.ENABLE for one or more queues not cleared within the polling period
Jun 16 14:42:11 sm2 kernel: ixgbe 0000:41:00.0: primary disable timed out
Jun 16 14:42:20 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: initiating reset due to tx timeout
Jun 16 14:42:20 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: initiating reset due to tx timeout
Jun 16 14:42:20 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: initiating reset due to tx timeout
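
For anyone triaging a similar log, a rough sketch along these lines could pull out the initial IOMMU fault and count the tx-timeout resets per second. It is only an illustration: the regular expressions are assumptions based on the message format seen in this excerpt, and the names `FAULT_RE`, `RESET_RE`, and `summarize` are made up for this example.

```python
import re
from collections import Counter

# Assumed format of the AMD-Vi fault message, taken from the log excerpt.
FAULT_RE = re.compile(
    r"(?P<driver>\S+) (?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"AMD-Vi: Event logged \[IO_PAGE_FAULT domain=(?P<domain>0x[0-9a-f]+) "
    r"address=(?P<address>0x[0-9a-f]+) flags=(?P<flags>0x[0-9a-f]+)\]"
)
# Assumed syslog-style timestamp prefix ("Jun 16 14:41:35").
RESET_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d\d:\d\d:\d\d) .*initiating reset due to tx timeout"
)


def summarize(lines):
    """Return (list of fault dicts, Counter of tx-timeout resets per timestamp)."""
    faults = []
    resets = Counter()
    for line in lines:
        fault = FAULT_RE.search(line)
        if fault:
            faults.append(fault.groupdict())
            continue
        reset = RESET_RE.match(line)
        if reset:
            resets[reset.group("ts")] += 1
    return faults, resets


# A few lines copied from the excerpt, used here as sample input.
SAMPLE = [
    "Jun 16 14:41:32 sm2 kernel: ixgbe 0000:41:00.0: AMD-Vi: Event logged "
    "[IO_PAGE_FAULT domain=0x000f address=0xed51d180 flags=0x0000]",
    "Jun 16 14:41:35 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: initiating reset due to tx timeout",
    "Jun 16 14:41:35 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: initiating reset due to tx timeout",
    "Jun 16 14:41:35 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: Reset adapter",
    "Jun 16 14:41:44 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: initiating reset due to tx timeout",
]

faults, resets = summarize(SAMPLE)
for fault in faults:
    print("fault:", fault["bdf"], "address", fault["address"])
for ts, count in resets.items():
    print(ts, "resets:", count)
```

Running the same function over the output of `journalctl -k` across several incidents might show whether the faulting address or device changes from one occurrence to the next.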

This was happening only on the "sm1" host. We do not pass through any PCIe devices on any host. Only a single USB device (a license dongle, not a USB controller) is passed through to a Windows Server VM. To see whether the problem is somehow related to USB pass-through, I moved the VM and the USB dongle to the "sm2" host a few weeks ago. Yesterday the same error occurred again, but this time on "sm2".

We have been having this problem for a couple of years. Sometimes, after a while, the second port also starts resetting and we lose all connections to the host. Sometimes it causes a kernel panic if we leave it in this state for days.

So, any idea what the problem could be? How should we proceed? As we do not pass through any PCIe cards or controllers, should we disable all IOMMU-related settings in the server BIOS? Or use kernel parameters to prevent this?

Regards,

Rahman
 
As we do not pass through any PCIe cards or controllers, should we disable all IOMMU-related settings in the server BIOS? Or use kernel parameters to prevent this?
There is usually an IOMMU setting in the BIOS; setting it to Disabled will turn off IOMMU protection and features. You can also add the amd_iommu=off kernel parameter.

I doubt that disabling IOMMU will prevent the issue as the IOMMU only detects a problematic memory access. I have no idea why the device does such a thing. Maybe it's a driver bug, which might be fixed in a newer Linux kernel (or not exist in an older kernel version). Or maybe it's triggered by some other event.

Anyway, it looks like you can disable IOMMU without losing any functionality to see if it has any influence on the issue.
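
After changing the boot parameters and rebooting, it can be worth confirming that the running kernel actually picked them up. The following is a minimal sketch, assuming a Linux host where `/proc/cmdline` is readable; the helper name `iommu_params` and the example command line are hypothetical and only serve the illustration.

```python
from pathlib import Path


def iommu_params(cmdline: str) -> dict:
    """Return the IOMMU-related parameters found on a kernel command line."""
    found = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key in ("amd_iommu", "iommu"):
            # Parameters given without "=value" are recorded as an empty string.
            found[key] = value if sep else ""
    return found


# Hypothetical command line, for illustration only.
example = "BOOT_IMAGE=/boot/vmlinuz root=/dev/mapper/pve-root ro quiet amd_iommu=off"
print(iommu_params(example))

# On a real host, inspect the command line the running kernel was booted with.
proc_cmdline = Path("/proc/cmdline")
if proc_cmdline.exists():
    print(iommu_params(proc_cmdline.read_text()))
```

An empty result on the real host would suggest the parameter was not applied, for example because the bootloader configuration was edited but not regenerated.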