[SOLVED] ixgbe hang

Jan 15, 2025
Hi folks,

I was running into an issue where an X550 NIC using the ixgbe driver in my PVE host (R740xd) would hang and drop traffic whenever I pushed even a modest amount of traffic through it. Note that I did not have pass-through enabled, nor was I attempting to pass any devices through. The hang was preceded by a burst of DMAR faults, as seen below:
Code:
Feb 12 21:22:54 cocytus kernel: DMAR: DRHD: handling fault status reg 2
Feb 12 21:22:54 cocytus kernel: DMAR: [DMA Write NO_PASID] Request device [86:00.0] fault addr 0x0 [fault reason 0x05] P>
Feb 12 21:22:54 cocytus kernel: DMAR: DRHD: handling fault status reg 102
Feb 12 21:22:54 cocytus kernel: DMAR: [DMA Read NO_PASID] Request device [86:00.0] fault addr 0xe3e78000 [fault reason 0>
Feb 12 21:22:54 cocytus kernel: DMAR: [DMA Read NO_PASID] Request device [86:00.0] fault addr 0xe3e7b000 [fault reason 0>
Feb 12 21:22:54 cocytus kernel: DMAR: [DMA Read NO_PASID] Request device [86:00.0] fault addr 0xe3e7c000 [fault reason 0>
Feb 12 21:22:54 cocytus kernel: DMAR: [DMA Read NO_PASID] Request device [86:00.0] fault addr 0xe3e71000 [fault reason 0>
Feb 12 21:22:54 cocytus kernel: DMAR: [DMA Read NO_PASID] Request device [86:00.0] fault addr 0xe3e73000 [fault reason 0>
Feb 12 21:22:54 cocytus kernel: DMAR: [DMA Read NO_PASID] Request device [86:00.0] fault addr 0xe3e75000 [fault reason 0>
Feb 12 21:23:00 cocytus kernel: ixgbe 0000:86:00.0 ens5f0: Detected Tx Unit Hang
                                  Tx Queue             <9>
                                  TDH, TDT             <74>, <7a>
                                  next_to_use          <7a>
                                  next_to_clean        <4c>
                                tx_buffer_info[next_to_clean]
                                  time_stamp           <115bcc4b4>
                                  jiffies              <115bcdb80>
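
If you want to check whether your own logs show the same pattern, here is a rough Python sketch that tallies DMAR fault lines per device and DMA direction. The regex is only modeled on the log lines above (other kernel versions may format these messages differently), and `count_dmar_faults` is a name I made up for illustration.

```python
import re
from collections import Counter

# Pattern modeled on the lines in this post, e.g.:
#   DMAR: [DMA Read NO_PASID] Request device [86:00.0] fault addr 0xe3e78000 [fault reason 0>
# Assumption: other kernels may word these messages differently.
DMAR_FAULT = re.compile(
    r"DMAR: \[DMA (?P<op>Read|Write)\b[^\]]*\] "
    r"Request device \[(?P<dev>[0-9a-fA-F:.]+)\] "
    r"fault addr (?P<addr>0x[0-9a-fA-F]+)"
)


def count_dmar_faults(lines):
    """Count DMAR faults keyed by (PCI device, DMA direction)."""
    counts = Counter()
    for line in lines:
        match = DMAR_FAULT.search(line)
        if match:
            counts[(match.group("dev"), match.group("op"))] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Reads kernel log text from stdin and prints one line per device/direction.
    for (dev, op), n in sorted(count_dmar_faults(sys.stdin).items()):
        print(f"{dev} DMA {op}: {n} fault(s)")
```

Saved as a script, it can be fed kernel log output on stdin (for example from `journalctl -k`).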

I looked everywhere but wasn't able to find anything conclusive, so I had to do some experimentation. I updated the NIC firmware, fiddled with the device offload settings, enabled intel_iommu in the boot params, and enabled SR-IOV in the BIOS, but none of it resolved the issue. What finally solved my problem was adding iommu=pt to my boot command line: no more DMAR faults and no more ixgbe hangs.
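
For anyone trying the same fix: on a GRUB-booted PVE host the parameter normally goes into GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub followed by update-grub, and on a systemd-boot install into /etc/kernel/cmdline followed by proxmox-boot-tool refresh, but check which one your host actually uses. After rebooting, it is worth confirming the running kernel really picked it up. A minimal sketch in Python, assuming a Linux host where /proc/cmdline is readable (the helper names are just for illustration):

```python
def has_kernel_param(cmdline, param):
    """Return True if `param` appears as a whole token on the kernel command line."""
    # Splitting on whitespace avoids false positives from substring matches.
    return param in cmdline.split()


def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with."""
    with open(path) as f:
        return f.read().strip()


if __name__ == "__main__":
    if has_kernel_param(read_cmdline(), "iommu=pt"):
        print("iommu=pt is active on the running kernel")
    else:
        print("iommu=pt not found; the boot config may not have been applied")
```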

I realize that this does not solve the underlying issue, which would likely still manifest if you attempted to pass this device through, but if you don't need pass-through and this issue pops up, I hope this info helps.