[SOLVED] [Proxmox 8] [Kernel 6.2.16-4-pve]: ixgbe driver fails to load due to PCI device probing failure

Thanks for this, Tim. Since the pastebin will expire soon, I opened this bugzilla report to preserve it indefinitely.

I don't think this is the ECAM/MCFG issue that I first suspected.

I don't have a good theory, but if you have a chance, you might see whether booting with "amd_iommu=off" makes any difference. More details in the bugzilla I opened.
 
Hi, thank you for the reply and suggestion! I have just tried this, i.e., added the "amd_iommu=off" stanza to my boot flags and rebooted. So far as I can tell there is no change: we see the same messages in dmesg about the ixgbe probe failing with error -5, my public-facing NIC is absent, and there is no network.

For now it feels like my best solution is to reinstall onto Proxmox 7, leave it alone there, and not touch a Proxmox 8 upgrade on this host for a while. Alas.

But if you have any other suggestions or thoughts, please do let me know!
Thanks,

Tim
 
Footnote for clarity: I am happy to continue to poke at test/config changes on this host for another few days. Ideally I need to get this thing into production early next week, but I have buffer in my schedule right now, so I can keep mucking about with "try this, does anything change?" poking and prodding on this server. I do appreciate your help. I also just found, via a Google search, your detailed bugzilla post talking about all this, and I am happy you are involved in looking at it :-) It would be nice if the problem can be identified and resolved; I realize that may not happen in the short term, but we shall see.
 
I don't really have any ideas here. If possible, collect the output of "sudo lspci -vvxxxx" before loading the ixgbe module. I assume that will show good data for ixgbe, since Linux was able to enumerate it. Then load the ixgbe module and collect the same output again. I assume this will show a lot of bogus (~0) data for ixgbe, since it seems like the device isn't responding when the driver probes it. If you can attach both sets of output to the bugzilla, that would be awesome.
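The capture procedure requested above could be scripted roughly like this (a sketch only: the output paths under /tmp are illustrative, the ixgbe module name is from this thread, and it must run as root on the affected host):

```shell
# Capture PCI config space before and after the ixgbe probe attempt.
# Output directory is illustrative; run as root on the affected host.
OUT=/tmp/ixgbe-debug
mkdir -p "$OUT"

# 1. Config space while the device is still untouched by the driver:
lspci -vvxxxx > "$OUT/lspci-before.txt" 2>&1 || true

# 2. Trigger the (failing) probe:
modprobe ixgbe 2>&1 || true

# 3. Config space again, plus the kernel log around the probe:
lspci -vvxxxx > "$OUT/lspci-after.txt" 2>&1 || true
dmesg | tail -n 200 > "$OUT/dmesg-after.txt" 2>&1 || true

# A device that has stopped responding reads back as all-ones
# (rows of "ff") in the second dump, which is what the
# before/after comparison should reveal.
ls -l "$OUT"
```

The plain-text files can then be attached to the bugzilla directly, avoiding the screenshot problem discussed below.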
 
Hi, thank you for the added follow-up. lspci gave a fair mass of output; I am pasting below a series of six screenshots capturing the ixgbe-relevant chunk of the lspci -vvxxxx output.

Then, a bit further below, is output from removing and reinstalling the module, and below that, a dmesg capture taken after doing so.

I am not sure this gives you any more info that is of use.


Tim
 

Attachments

  • 1-lspci-part-one.png
  • 2-lspci-part-two.png
  • 3-lspci-part-three.png
  • 5-lspci-part-five.png
  • 6-lspci-part-six.png
  • 7-insmod ref cap.png
  • 8-dmesg capture after insmod ref cap.png
Hmm, you're right, the screenshots are hard to deal with. Is there any way to capture the output in files and get those off the machine? I don't see any lspci output from after loading the ixgbe driver. I hoped that (especially for the 00:03.1 root port and the 05:00.0 ixgbe device) might have clues about why ixgbe isn't responding.
 
I will see what I can do. It means temporarily rebooting the box into rescue mode to get the captures off, since they are generated while the machine has no network. Will do so and update after that.


Tim
 
Tim, any chance you could try booting with the pci=noaer kernel option?

I've also noticed that your OVH system only enables 32-bit BARs, which is a BIOS option you could change. That said, I didn't see anything wrong with the address layouts where the BIOS might be overlapping some resources, though I also didn't look at every detail. 64-bit BARs are better for sure, since you're not trying to cram all devices into 4 GB of mappings.

All this said, these would only be workarounds for your setup; I'd defer to Bjorn, the PCI expert, on what might have broken things in 6.1.x.
 
Hi, I only just tried this now: I added the pci=noaer kernel boot stanza, and it now boots and works. I am currently using the pinned 'older' kernel; I will try without pinning and see if it still works.

Below are pastebin links to the lspci and dmesg output from the successful boot, in case they are of interest.

Thank you for this suggestion; this seems like progress!

Tim

LSPCI verbose output > https://pastebin.com/8dZAU1aE
dmesg output > https://pastebin.com/vmVbtPWA
 
Footnote:

I unpinned and was able to boot with the revised boot stanza; the dmesg command line confirms it:

Code:
[    0.000000] Command line: BOOT_IMAGE=/vmlinuz-6.5.11-8-pve root=UUID=5b7d97ba-41f2-4b1f-8b38-ba885925c617 ro nomodeset iommu=pt console=tty0 console=ttyS0,115200n8 pci=noaer


root@ns5025232:~# uname -a
Linux ns5025232 6.5.11-8-pve #1 SMP PREEMPT_DYNAMIC PMX 6.5.11-8 (2024-01-30T12:27Z) x86_64 GNU/Linux

So: it seems this is all well and good now with the one boot flag set, pci=noaer.
Maybe now I leave it to the experts to ponder what this actually means, i.e., why the 10G driver and PCIe are happy with this adjustment, and whether something should be mentioned somewhere (bugzilla, or elsewhere?) for the benefit of the universe, etc.

Thank you!

Tim
 
Thanks very much for checking this out. "pci=noaer" is an interim workaround, but there must be a PCI core problem that needs to be fixed, because I don't want users to have to find and use that flag.
 
OK! Yes, I agree: this flag is a good workaround, but ultimately there is some underlying problem that should be addressed if possible. I am not sure, but it seems like the next steps would be:

(a) someone familiar with PCIe debugging reviews the logs I captured;
(b) they might spot something;
(c) possibly this leads to an upstream change (kernel, PCIe core, or the Intel 10G driver);
(d) eventually I get a different kernel/module combo to try;
(e) I install the latest version and test without the pci=noaer flag, to see whether the workaround is still required for this to function properly (or not?).

I'm OK to be involved with testing on this as needed in future. The server is not yet in production and probably won't go into production for a week. After that I can probably do maintenance at booked times after hours (i.e., not during regular Mon-Fri office hours, more or less), and a ~1 hour reboot-cycle debug should be doable just to test things out, if that will be helpful. So I guess please just post back to this thread to ping me if you want me to do anything around testing? Thank you!
 
I believe I have the same issue as the OP, but with a Chelsio T420-CR SFP+ card. Am I correct that the patch mentioned at the bottom of this thread is the solution for kernel 6.5? If so, can anyone provide clarity on how to apply the patch? I attached a screenshot of my error below.

I would just roll back to kernel 6.1 until a new kernel with the fix comes along (I think I read somewhere that will be kernel 6.8?), but unfortunately I upgraded my ZFS pool and it now has a feature flag that isn't supported by 6.1. So this patch may save me from having to rebuild my two clustered nodes. Which would be awesome! :)

Thanks for anyone's time and help!

Screenshot 2024-03-02 at 1.02.08 PM.png
 
Hi, just to follow up for clarity: there is no patch and no pinning in the workaround I am using. Simply add pci=noaer to the kernel boot parameters set in /etc/default/grub, then rebuild your GRUB boot config (update-grub) and reboot, and you should be golden.
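A minimal sketch of that edit, shown against a throwaway sample file so it can be dry-run safely (the sample content is an assumption mirroring a typical Debian/Proxmox default; on the real host you would edit /etc/default/grub itself):

```shell
# Demonstrate the edit against a throwaway copy first.
cat > /tmp/grub.sample <<'EOF'
GRUB_CMDLINE_LINUX_DEFAULT="quiet"
EOF

# Append pci=noaer to the existing default kernel command line:
sed -i 's/^GRUB_CMDLINE_LINUX_DEFAULT="\(.*\)"$/GRUB_CMDLINE_LINUX_DEFAULT="\1 pci=noaer"/' /tmp/grub.sample
grep GRUB_CMDLINE_LINUX_DEFAULT /tmp/grub.sample

# On the real host, make the same edit to /etc/default/grub, then:
#   update-grub
#   reboot
```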

Tim
 
Thanks for the suggestion, @fortechitsolutions. I tried adding the pci=noaer boot parameter, to no avail. I re-read this thread and I think we are experiencing different issues. As @bjorn-helgaas pointed out, the OP's issue ended up being the ECAM/MCFG issue, whereas yours is something different.

@bjorn-helgaas, it appears that your patch is the solution I need. Thank you for your work on this. Is it possible to apply your patch to my PVE 8 install running the current 6.5 kernel? Or do I need to either pin 6.1 or wait for the 6.8 kernel to hit the PVE repos? Thanks, again!
 
I have an update for anyone who comes across this thread in the future. Shortly after my last post, I came across this thread.

It has exactly the solution I needed! Here is the important part:

Workarounds
The workaround is either one of these: telling the kernel early on to reserve the blocks it's reclaiming, OR preventing mmconfig from happening so it doesn't attempt to seek out "unused" blocks.

You can do the first one by reading your dmesg and looking for the "can't reserve" failure message while loading the mpt3sas driver, then checking whether it falls within a previously reclaimed block. If so, simply add that block to a "reserve" kernel parameter.

The second workaround is simply adding either pci=nommconf OR pci=realloc=off, but to be honest I would stay away from those two, as I'm not quite sure whether they can impact other devices you may have.

Personally, I used the "pci=realloc=off" kernel boot parameter and it's been solid. I also tested the "pci=nommconf" parameter and it worked as well. I couldn't get the "reserve" parameter to work, but I'm pretty sure that's because I wasn't doing it right.
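For reference, the alternatives described above would land in /etc/default/grub roughly like the lines below (pick ONE; the "quiet" option and the reserve placeholders are illustrative, not values from this thread):

```
# Option A: disable BAR reallocation
GRUB_CMDLINE_LINUX_DEFAULT="quiet pci=realloc=off"
# Option B: disable mmconfig entirely
GRUB_CMDLINE_LINUX_DEFAULT="quiet pci=nommconf"
# Option C: reserve the reclaimed block (fill in start/length from your dmesg)
GRUB_CMDLINE_LINUX_DEFAULT="quiet reserve=<start>,<length>"
```

As with the pci=noaer workaround earlier in this thread, run update-grub and reboot after changing the file.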

From what I've read, this problem will be fixed in kernel 6.8, and the Proxmox devs may even be working on backporting it to an earlier pve-kernel. At that point, you would be able to remove whichever kernel boot parameter you decided on.

I hope this helps!
 
I'm running PVE 8.1.5 and I can only get the X550 controller to function with the 6.1.10-1-pve kernel pinned. I haven't tried custom kernel parameters, as I'd rather just pin the older kernel.
 
