Hi,
We have a 3-node Proxmox cluster: 2x SuperMicro and 1x Gigabyte servers, all with dual-socket AMD EPYC 7742 64-Core Processors. The SuperMicro servers have onboard dual-port X540-AT2 (rev 01) 10G copper ports.
The problem is that at random intervals (sometimes once or twice a month, sometimes once a week), one of the ixgbe ports starts to fail and resets continuously. It always starts with an AMD IOMMU error. When this happens, only rebooting the host solves it; ifdown/ifup does not. Here is a sample of the logs:
Code:
Jun 16 14:41:32 sm2 kernel: ixgbe 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0xed51d180 flags=0x0000]
Jun 16 14:41:33 sm2 corosync[4327]: [KNET ] host: host: 1 has no active links
Jun 16 14:41:35 sm2 corosync[4327]: [TOTEM ] Token has not been received in 2737 ms
Jun 16 14:41:35 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: initiating reset due to tx timeout
Jun 16 14:41:35 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: initiating reset due to tx timeout
Jun 16 14:41:35 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: initiating reset due to tx timeout
Jun 16 14:41:35 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: initiating reset due to tx timeout
Jun 16 14:41:35 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: initiating reset due to tx timeout
Jun 16 14:41:35 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: initiating reset due to tx timeout
Jun 16 14:41:35 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: initiating reset due to tx timeout
Jun 16 14:41:35 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: initiating reset due to tx timeout
Jun 16 14:41:35 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: initiating reset due to tx timeout
Jun 16 14:41:35 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: initiating reset due to tx timeout
Jun 16 14:41:35 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: Reset adapter
Jun 16 14:41:35 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: TXDCTL.ENABLE for one or more queues not cleared within the polling period
Jun 16 14:41:36 sm2 kernel: ixgbe 0000:41:00.0: primary disable timed out
Jun 16 14:41:36 sm2 corosync[4327]: [TOTEM ] A processor failed, forming new configuration: token timed out (3650ms), waiting 4380ms for consensus.
Jun 16 14:41:37 sm2 corosync[4327]: [KNET ] host: host: 2 has no active links
Jun 16 14:41:40 sm2 corosync[4327]: [QUORUM] Sync members[3]: 1 2 3
Jun 16 14:41:40 sm2 corosync[4327]: [TOTEM ] A new membership (1.785) was formed. Members
Jun 16 14:41:40 sm2 corosync[4327]: [QUORUM] Members[3]: 1 2 3
Jun 16 14:41:40 sm2 corosync[4327]: [MAIN ] Completed service synchronization, ready to provide service.
Jun 16 14:41:44 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: initiating reset due to tx timeout
Jun 16 14:41:44 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: initiating reset due to tx timeout
Jun 16 14:41:44 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: initiating reset due to tx timeout
Jun 16 14:41:44 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: Reset adapter
Jun 16 14:41:44 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: initiating reset due to tx timeout
Jun 16 14:41:44 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: RXDCTL.ENABLE for one or more queues not cleared within the polling period
Jun 16 14:41:44 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: TXDCTL.ENABLE for one or more queues not cleared within the polling period
Jun 16 14:41:44 sm2 kernel: ixgbe 0000:41:00.0: primary disable timed out
Jun 16 14:41:53 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: initiating reset due to tx timeout
Jun 16 14:41:53 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: initiating reset due to tx timeout
Jun 16 14:41:53 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: initiating reset due to tx timeout
Jun 16 14:41:53 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: Reset adapter
Jun 16 14:41:53 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: RXDCTL.ENABLE for one or more queues not cleared within the polling period
Jun 16 14:41:53 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: TXDCTL.ENABLE for one or more queues not cleared within the polling period
Jun 16 14:41:53 sm2 kernel: ixgbe 0000:41:00.0: primary disable timed out
Jun 16 14:42:02 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: initiating reset due to tx timeout
Jun 16 14:42:02 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: Reset adapter
Jun 16 14:42:02 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: RXDCTL.ENABLE for one or more queues not cleared within the polling period
Jun 16 14:42:02 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: TXDCTL.ENABLE for one or more queues not cleared within the polling period
Jun 16 14:42:02 sm2 kernel: ixgbe 0000:41:00.0: primary disable timed out
Jun 16 14:42:11 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: initiating reset due to tx timeout
Jun 16 14:42:11 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: initiating reset due to tx timeout
Jun 16 14:42:11 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: Reset adapter
Jun 16 14:42:11 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: RXDCTL.ENABLE for one or more queues not cleared within the polling period
Jun 16 14:42:11 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: TXDCTL.ENABLE for one or more queues not cleared within the polling period
Jun 16 14:42:11 sm2 kernel: ixgbe 0000:41:00.0: primary disable timed out
Jun 16 14:42:20 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: initiating reset due to tx timeout
Jun 16 14:42:20 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: initiating reset due to tx timeout
Jun 16 14:42:20 sm2 kernel: ixgbe 0000:41:00.0 enp65s0f0: initiating reset due to tx timeout
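To check whether every burst of resets in a longer log really follows an IO_PAGE_FAULT on the same device, a small script along these lines could help. It is only a sketch: the function name `summarize` and the two regular expressions are illustrative, written against the exact message formats in the sample above, and other kernel versions may word these messages differently.

```python
import re
from collections import Counter

# Patterns modelled on the two message types in the log sample above.
# Group 1 captures the PCI address, e.g. "0000:41:00.0".
FAULT_RE = re.compile(r"kernel: ixgbe (\S+): AMD-Vi: Event logged \[IO_PAGE_FAULT")
RESET_RE = re.compile(r"kernel: ixgbe (\S+) \S+: initiating reset due to tx timeout")


def summarize(lines):
    """Scan log lines in order.

    Returns a pair:
      first_fault: PCI address -> first IO_PAGE_FAULT line seen for it
      resets_after_fault: PCI address -> number of tx-timeout resets
                          logged after that first fault
    """
    first_fault = {}
    resets_after_fault = Counter()
    for line in lines:
        fault = FAULT_RE.search(line)
        if fault:
            # Remember only the first fault per device.
            first_fault.setdefault(fault.group(1), line.strip())
            continue
        reset = RESET_RE.search(line)
        if reset and reset.group(1) in first_fault:
            resets_after_fault[reset.group(1)] += 1
    return first_fault, resets_after_fault
```

The input can be any iterable of lines, for example an open file holding the saved output of `journalctl -k`. A device that shows resets but never appears in `first_fault` would be a case where the resets were not preceded by an IOMMU fault.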
Until now this was happening only on the "sm1" host. We do not pass through any PCIe devices on any host. Only a single USB device (a license dongle, not a USB controller) is passed through to a Windows Server VM. So, to see if it is somehow related to USB pass-through, I moved the VM and the USB dongle to the "sm2" host a week ago. Yesterday the same error occurred again, but this time on "sm2".
We have been having this problem for a couple of years. Sometimes, after a period of time, the second port also starts resetting and we lose all connections to the host. Sometimes it causes a kernel panic if we leave it in this state for days.
So, any idea what the problem could be? How should we proceed? As we do not pass through any PCIe cards or controllers, should we disable all IOMMU-related settings in the server BIOS? Or use kernel parameters to prevent this?
Regards,
Rahman