[SOLVED] [Proxmox 8] [Kernel 6.2.16-4-pve]: ixgbe driver fails to load due to PCI device probing failure

bjorn-helgaas · Feb 14, 2024

Thanks for this, Tim. Since the pastebin will expire soon, I opened this bugzilla report to preserve it indefinitely.

I don't think this is the ECAM/MCFG issue that I first suspected.

I don't have a good theory, but if you have a chance, you might see whether booting with "amd_iommu=off" makes any difference. More details in the bugzilla I opened.

fortechitsolutions · Feb 14, 2024

Hi, thank you for the reply and suggestion! I have just tried this, ie, added "amd_iommu=off" to my boot flags and rebooted. As far as I can tell there is no change: dmesg shows roughly the same messages about the ixgbe probe failing with error -5, my public-facing NIC is absent, and there is no network.

For now it feels like my best option is to reinstall with Proxmox 7, leave it alone there, and not touch a Proxmox 8 upgrade on this host for a while, alas.

but if you have any other suggestion or thoughts please do let me know!
Thanks,

Tim

fortechitsolutions · Feb 14, 2024

A footnote for clarity: I am happy to keep testing config changes on this host for another few days. Ideally I need to get this server into production early next week, but I have buffer in my schedule right now, so I can keep going with "try this, does anything change?" experiments. I do appreciate your help. I just found, via a Google search, your detailed post on bugzilla about all this, and I am glad you are involved in looking at it.

It would be nice if the problem can be identified and resolved, but I realize that may not happen in the short term. We shall see.

bjorn-helgaas · Feb 14, 2024

I don't really have any ideas here. If possible, collect the output of "sudo lspci -vvxxxx" before loading the ixgbe module. I assume that will show good data for ixgbe, since Linux was able to enumerate it. Then load the ixgbe module and collect the same output again. I assume this will show a lot of bogus (~0) data for ixgbe, since it seems like the device isn't responding when the driver probes it. If you can attach both sets of output to the bugzilla, that would be awesome.
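For anyone repeating this before/after comparison, here is a minimal Python sketch (not something posted in this thread) that scans a saved "lspci -vvxxxx" dump and lists the devices whose config space reads back as all 0xff, which is what bogus (~0) data looks like. The function names are made up for illustration, and the parsing assumes the usual lspci layout: a device line that starts with the bus address, followed by unindented hex-dump rows.

```python
import re

# Device header, e.g. "05:00.0 Ethernet controller: ..." (optional "0000:" domain).
DEVICE_RE = re.compile(r"^((?:[0-9a-f]{4}:)?[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]) ")
# Hex-dump row, e.g. "00: 86 80 63 15 ..." (offsets run up to "ff0:" with -xxxx).
HEXDUMP_RE = re.compile(r"^[0-9a-f]{2,3}: ((?:[0-9a-f]{2} ?)+)$")


def config_bytes_by_device(lspci_text):
    """Map each device address to the config-space bytes dumped for it."""
    devices = {}
    current = None
    for line in lspci_text.splitlines():
        header = DEVICE_RE.match(line)
        if header:
            current = header.group(1)
            devices[current] = []
            continue
        row = HEXDUMP_RE.match(line.rstrip())
        if row and current is not None:
            devices[current].extend(int(b, 16) for b in row.group(1).split())
    return devices


def unresponsive_devices(lspci_text):
    """Return addresses whose dumped config space is entirely 0xff."""
    return [
        addr
        for addr, data in config_bytes_by_device(lspci_text).items()
        if data and all(b == 0xFF for b in data)
    ]
```

Running unresponsive_devices() on the "before" and "after" captures and comparing the two lists would show whether a device such as 05:00.0 stopped responding once the driver was loaded.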

fortechitsolutions · Feb 15, 2024

Hi, thank you for the follow-up. lspci gave a fair mass of output, so I am pasting below a series of 6 screenshots capturing the ixgbe-relevant chunk of the lspci -vvxxxx output.

A bit further below is the output from removing and reloading the module, and below that a dmesg capture taken after doing so.

I am not sure this gives you any more info that is of use.

Tim

bjorn-helgaas · Feb 15, 2024

Hmm, you're right, the screenshots are hard to deal with. Is there any way to capture the output in files and get those off the machine? I don't see any lspci output from after loading the ixgbe driver. I hoped that (especially for the 00:03.1 root port and the 05:00.0 ixgbe device) might have clues about why ixgbe isn't responding.

fortechitsolutions · Feb 15, 2024

I will see what I can do. It means rebooting the box again temporarily in rescue mode to get the captures off that are generated when it has no network. Will do so and update after that.

Tim

fortechitsolutions · Feb 15, 2024

Hi, OK, I just got the content. It is in pastebin, split across 3 separate pastes due to size:

before lspci > https://pastebin.com/eZuddMG0
after lspci > https://pastebin.com/RRDsDYAy
dmesg after > https://pastebin.com/z07BJ8mY

please let me know if this is more useful / and possibly if you see anything of interest?

Thanks,

Tim

jesse.brandeburg · Feb 16, 2024

Tim, any chance you could try booting with the pci=noaer kernel option?

I've also noticed that your OVH system only enables 32-bit BARs; fixing that requires changing a BIOS option. I didn't see anything wrong with the address layouts, such as the BIOS overlapping some resources, but I also didn't look at every detail. 64-bit BARs are better for sure, as you're not trying to cram all devices into 4 GB of mappings.
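One rough way to check from a saved "lspci -vv" dump whether every memory BAR has been placed below the 4 GB boundary is sketched here in Python. This is an illustration only: the function names are hypothetical, and the regular expression assumes the usual "Region N: Memory at <hex> (32-bit, ...)" layout that lspci prints for assigned BARs.

```python
import re

# Matches e.g. "Region 0: Memory at fb000000 (32-bit, non-prefetchable) [size=512K]".
REGION_RE = re.compile(r"Region \d+: Memory at ([0-9a-f]+) \((32|64)-bit")
FOUR_GIB = 1 << 32


def memory_bars(lspci_text):
    """Return (address, width_in_bits) for each assigned memory BAR found."""
    return [(int(addr, 16), int(width)) for addr, width in REGION_RE.findall(lspci_text)]


def all_below_4g(lspci_text):
    """True if no memory BAR in the dump is mapped at or above 4 GiB."""
    return all(addr < FOUR_GIB for addr, _ in memory_bars(lspci_text))
```

I/O port regions and unassigned BARs are skipped by the pattern, so treat the result as a quick hint rather than a full audit of the address map.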

All this said, these would only be workarounds for your setup; I'd defer to Bjorn, the PCI expert, on what might have broken things in 6.1.x.

jesse.brandeburg · Feb 16, 2024

Tim, I uploaded your three pastebin captures (before lspci, after lspci, dmesg after) to bugzilla.

fortechitsolutions · Feb 16, 2024

Hi, I just tried this: I added pci=noaer to the kernel boot parameters, and now it boots and it works. I am using the pinned 'older' kernel; I will try without pinning and see if it still works.

below are pastebin links for lspci and dmesg output from the successful boot in case of interest

Thank you for this suggestion - seems like progress!

Tim

LSPCI verbose output > https://pastebin.com/8dZAU1aE
dmesg output > https://pastebin.com/vmVbtPWA

fortechitsolutions · Feb 16, 2024

Footnote,

I unpinned and was able to boot, with the revised boot stanza thus:

Code:

dmesg hint tells me:

[    0.000000] Command line: BOOT_IMAGE=/vmlinuz-6.5.11-8-pve root=UUID=5b7d97ba-41f2-4b1f-8b38-ba885925c617 ro nomodeset iommu=pt console=tty0 console=ttyS0,115200n8 pci=noaer


root@ns5025232:~# uname -a
Linux ns5025232 6.5.11-8-pve #1 SMP PREEMPT_DYNAMIC PMX 6.5.11-8 (2024-01-30T12:27Z) x86_64 GNU/Linux

So it seems all is well now with the one boot flag set, pci=noaer. Maybe I now leave it to the experts to ponder why the 10G driver and PCIe are happy with this adjustment, and whether it should be mentioned somewhere (bugzilla, elsewhere?) for the benefit of others.
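To double-check that the flag really reached the running kernel, a small Python sketch like the following could parse the command line. The helper names are made up for illustration; the only assumption about the kernel is that pci= takes a comma-separated option list, as described in the kernel-parameters documentation.

```python
def pci_options(cmdline):
    """Collect every comma-separated option passed via pci=... on a command line."""
    options = []
    for token in cmdline.split():
        if token.startswith("pci="):
            options.extend(opt for opt in token[len("pci="):].split(",") if opt)
    return options


def aer_disabled(cmdline):
    """True if pci=noaer (alone or combined with other pci= options) is present."""
    return "noaer" in pci_options(cmdline)
```

Feeding it the contents of /proc/cmdline (or the "Command line:" text from dmesg above) should return True on a boot where the workaround is active. Note that a misspelling such as pcie=noaer would not be picked up here.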

Thank you!

Tim

bjorn-helgaas · Feb 17, 2024

Thanks very much for checking this out. "pci=noaer" is an interim workaround, but there must be a PCI core problem that needs to be fixed because I don't want users to have to find and use that flag.

fortechitsolutions · Feb 18, 2024

OK, yes, I agree: this flag seems to be a good workaround, but ultimately there is some underlying problem that should be addressed if possible. I am not sure, but it seems like the next steps would be:

(a) someone familiar with PCIe debugging reviews the logs I captured;
(b) they might spot something;
(c) possibly this leads to a change upstream (to the kernel, the PCIe driver, or the Intel 10-gig driver?);
(d) eventually I get a different kernel/module combo to try;
(e) then I can install the latest and try it without the pci=noaer flag, to see whether the workaround is still required.

I'm OK to be involved with testing on this in future. The server is not in production yet and probably won't be for a week. After that I can probably do 'maintenance' at booked times after hours (ie, outside regular M-F office hours, more or less), and a ~1 hour reboot-cycle debug session should be doable just to test things out, if that will be helpful. Please just post back to this thread to ping me if you want me to do any testing. Thank you!

GraceAboundz · Mar 2, 2024

I believe I have the same issue as the OP, but with a Chelsio T420-CR SFP+ card. Am I correct that the patch mentioned at the bottom of this thread is the solution for kernel 6.5? If so, can anyone provide clarity on how to apply the patch? I attached a screenshot of my error below.

I would just roll back to kernel 6.1 until a new kernel with the fix comes along (I think I read somewhere that it will be kernel 6.8?), but unfortunately I upgraded my ZFS pool and it now has a feature flag that isn't supported by 6.1. So this patch may save me from having to rebuild my two clustered nodes, which would be awesome!

Thanks for anyone's time and help!

fortechitsolutions · Mar 3, 2024

Hi, just to follow up, for clarity: there is no patch and no kernel pinning involved in the workaround I am using. Simply add pci=noaer to the kernel boot parameters in /etc/default/grub, then rebuild your grub boot config (update grub) and reboot, and you should be golden.
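As a sketch of that edit (an illustration only, with a made-up helper name), the Python function below appends a parameter to the GRUB_CMDLINE_LINUX_DEFAULT line of a grub defaults file if it is not already there. It assumes the parameter belongs in GRUB_CMDLINE_LINUX_DEFAULT and that the value is double-quoted; some setups keep their parameters in GRUB_CMDLINE_LINUX instead.

```python
import re

# Assumes the variable is written as GRUB_CMDLINE_LINUX_DEFAULT="..." on one line.
CMDLINE_RE = re.compile(r'^(GRUB_CMDLINE_LINUX_DEFAULT=")([^"]*)(")', re.MULTILINE)


def add_kernel_param(grub_text, param):
    """Return grub_text with param appended to GRUB_CMDLINE_LINUX_DEFAULT if missing."""

    def _append(match):
        params = match.group(2).split()
        if param not in params:
            params.append(param)
        return match.group(1) + " ".join(params) + match.group(3)

    new_text, count = CMDLINE_RE.subn(_append, grub_text)
    if count == 0:
        raise ValueError("GRUB_CMDLINE_LINUX_DEFAULT not found")
    return new_text
```

The function only transforms text, so the caller would still need to write the result back to /etc/default/grub, run update-grub, and reboot. Hosts that boot through systemd-boot rather than GRUB keep their command line elsewhere, so this sketch would not apply to them as written.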

Tim

GraceAboundz · Mar 4, 2024

Thanks for the suggestion, @fortechitsolutions. I tried adding the pcie=noaer boot parameter to no avail. I re-read this thread and I think we are experiencing different issues. As @bjorn-helgaas pointed out, the OP's issue ended up being the ECAM/MCFG issue, whereas yours is something different.

@bjorn-helgaas, it appears that your patch is the solution I need. Thank you for your work on this. Is it possible to apply your patch to my PVE 8 install running the current 6.5 kernel? Or do I need to either pin 6.1 or wait for the 6.8 kernel to hit the PVE repos? Thanks, again!

GraceAboundz · Mar 7, 2024

I have an update for anyone who comes across this thread in the future. Shortly after my last post, I came across this thread.

It has exactly the solution I needed! Here is the important part:

Workarounds

There are two workarounds: either tell the kernel to reserve the blocks it is reclaiming early on, OR disable mmconfig so that it doesn't go looking for "unused" blocks.

You can do the first by reading your dmesg, finding the "can't reserve" failure message emitted while the mpt3sas driver loads, and checking whether the address range is within a previously reclaimed block. If so, simply add this block to a reserve kernel parameter.

The second workaround is simply adding either pci=nommconf OR pci=realloc=off, but to be honest I would stay away from those two, as I'm not sure how they might impact other devices you may have.

Personally, I used the "pci=realloc=off" kernel boot parameter and it's been solid. I also tested the "pci=nommconf" parameter and it worked, as well. I couldn't get the "reserve" parameter to work, but I'm pretty sure that's because I wasn't doing it right.
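For anyone who wants to attempt the "reserve" route described in the quote, the range check could be sketched in Python as follows. This is an assumption-laden illustration: the helper names are hypothetical, the pattern relies on the kernel printing failed ranges as "[mem 0xSTART-0xEND ...]", and the reclaimed block has to be read out of dmesg by hand and passed in.

```python
import re

# Pulls the address range out of lines like
# "...: BAR 1: can't reserve [mem 0xf7d40000-0xf7d4ffff 64bit]".
RANGE_RE = re.compile(r"can't reserve \[mem (0x[0-9a-f]+)-(0x[0-9a-f]+)")


def failed_reservations(dmesg_text):
    """Return (start, end) for every "can't reserve" memory range in dmesg."""
    return [(int(start, 16), int(end, 16)) for start, end in RANGE_RE.findall(dmesg_text)]


def inside(block, candidate):
    """True if candidate (start, end) lies entirely within block (start, end)."""
    return block[0] <= candidate[0] and candidate[1] <= block[1]
```

If inside() returns True for a failed range and the reclaimed block, that range is a candidate for the reserve parameter mentioned above.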

From what I've read, this problem will be fixed in kernel 6.8, and the Proxmox devs may even be working on backporting it to an earlier pve-kernel. At that point, you would be able to remove whichever kernel boot parameter you decided on.

I hope this helps!

dff · Mar 24, 2024

I'm running PVE 8.1.5 and I can only get the X550 controller to function with the 6.1.10-1-pve kernel pinned. I haven't tried custom kernel parameters, as I'd rather just pin the older kernel.
