igb AMD-Vi IO_PAGE_FAULT after PVE 8.4 upgrade

brainsoft

Member
Jun 25, 2024
Hi Everyone,
Apologies, this is going to be a bit rough, and hopefully someone can point me to the proper logs for additional information as required. I've learned my lesson not to try to "fix" things before reaching out for help, so I don't make it worse. Unfortunately, I still have not learned my lesson about upgrading PVE, because I have problems every time.

I wish the GUI had a way to select a couple of upgrades at a time, or some way to install less than everything at once, because as it stands I have no idea which upgrade may have caused the problems. Ideally I would have done all the package updates separately before doing the 8.4-related upgrade, but things don't really appear to be broken down that way, other than the two groups: upstream/Debian and Proxmox.
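For narrowing down which packages changed in a given run, Debian-based systems normally record each apt transaction in /var/log/apt/history.log. The following is a rough sketch, assuming that file uses the usual "Start-Date:" and "Upgrade:" line layout; the function name `upgrades_by_run` is made up for illustration and the package names in the usage note are hypothetical.

```python
import re
from pathlib import Path

# One "name:arch (old, new)" entry inside an "Upgrade:" line.
ENTRY_RE = re.compile(r"([^\s,()]+) \(([^,()]+), ([^,()]+)\)")


def upgrades_by_run(history_text):
    """Return a list of (start_date, [(package, old, new), ...]) per apt run."""
    runs = []
    start = None
    for line in history_text.splitlines():
        if line.startswith("Start-Date:"):
            # Remember when the current transaction began.
            start = line.split(":", 1)[1].strip()
        elif line.startswith("Upgrade:") and start is not None:
            body = line.split(":", 1)[1]
            runs.append((start, ENTRY_RE.findall(body)))
    return runs


if __name__ == "__main__":
    # Assumed default location; older entries may sit in rotated, compressed files.
    log = Path("/var/log/apt/history.log")
    if log.is_file():
        for start, packages in upgrades_by_run(log.read_text(errors="replace")):
            print(start)
            for name, old, new in packages:
                print(f"  {name}: {old} -> {new}")
```

Each run comes back with its start time and a list of (package, old version, new version) tuples, which makes it easier to see which kernel or driver packages moved in the same transaction.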

Anyway, this is a non-production 3-node cluster, no HA or Ceph. It turns on and runs fine, but at some point over the next few hours this node will crash again, for the 3rd time since upgrading last night. Actually, it lasted about 10 minutes before crashing as I am typing this.

[ 592.660802] igb 0000:04:0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0xbe16a0c0 flags=0x0020]
[ 592.660802] igb 0000:04:0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0xbe16a1c0 flags=0x0020]
[ 592.660802] igb 0000:04:0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0xbe16a8c0 flags=0x0020]

Lots of lines like this, the only difference seems to be the 3 different addresses.
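To check how many distinct addresses are really involved, the fault lines can be tallied with a small script. This is only a sketch: the regular expression is written against the lines exactly as pasted above, other kernel versions may format the message differently, and the name `summarize_faults` is made up for illustration.

```python
import re
from collections import Counter

# Pattern based on the AMD-Vi lines as they appear in this post.
FAULT_RE = re.compile(
    r"(?P<driver>\S+) (?P<device>[0-9a-fA-F:.]+): AMD-Vi: Event logged "
    r"\[IO_PAGE_FAULT domain=(?P<domain>0x[0-9a-fA-F]+) "
    r"address=(?P<address>0x[0-9a-fA-F]+) flags=(?P<flags>0x[0-9a-fA-F]+)\]"
)


def summarize_faults(lines):
    """Count IO_PAGE_FAULT events per (device, address) pair."""
    counts = Counter()
    for line in lines:
        match = FAULT_RE.search(line)
        if match:
            counts[(match["device"], match["address"])] += 1
    return counts


if __name__ == "__main__":
    # The three lines quoted in the post, used here as sample input.
    sample = [
        "[ 592.660802] igb 0000:04:0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0xbe16a0c0 flags=0x0020]",
        "[ 592.660802] igb 0000:04:0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0xbe16a1c0 flags=0x0020]",
        "[ 592.660802] igb 0000:04:0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0xbe16a8c0 flags=0x0020]",
    ]
    for (device, address), n in summarize_faults(sample).most_common():
        print(f"{device} {address} x{n}")
```

Feeding it the full kernel log instead of the sample list would show whether the same few addresses keep repeating or whether new ones appear over time.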

That specific device on 04: is the i211 NIC, and it is in IOMMU group 15 along with the other 2 NICs (PCIe gigabit, PCIe x4 10GbE Ethernet AQC107, and onboard Intel). I assume that whole group hangs off the main controller in the CPU, or maybe the northbridge, I guess.
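For reference, the group membership can be read from sysfs, where Linux exposes each IOMMU group as a directory under /sys/kernel/iommu_groups with a devices subdirectory. A minimal sketch follows; the function name `iommu_groups` and the `root` parameter are just illustrative, and it returns nothing if the IOMMU is not enabled.

```python
from pathlib import Path


def iommu_groups(root="/sys/kernel/iommu_groups"):
    """Map each IOMMU group number to a sorted list of PCI addresses."""
    groups = {}
    base = Path(root)
    if not base.is_dir():
        # No IOMMU groups exposed (IOMMU disabled or unsupported).
        return groups
    for group_dir in base.iterdir():
        devices_dir = group_dir / "devices"
        if group_dir.name.isdigit() and devices_dir.is_dir():
            groups[int(group_dir.name)] = sorted(
                entry.name for entry in devices_dir.iterdir()
            )
    return groups


if __name__ == "__main__":
    for group, devices in sorted(iommu_groups().items()):
        print(f"group {group}: {', '.join(devices)}")
```

The output lists every PCI address per group, so it is easy to confirm which devices share a group with the NIC that is logging the faults.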

Any help or guidance is appreciated. 2 of the NICs are bridged and have worked flawlessly until upgrading last night, so I think maybe a driver issue? None of them are using IOMMU passthrough; they are all used directly by this host/node.

And just now the TrueNAS guest has crashed out; connecting to the console, it looks like a crashed GPU driver. The machine has an old GTX750 that the Plex LXC has unprivileged access to, which had also been running flawlessly for a couple of weeks before the upgrade. I'm turning off most of the guests and restarting the node, and I'll see if it is still online when I get home from work. That may point to an LXC breaking the kernel or something, but I'm still pretty early in my Linux troubleshooting journey.
 
Nope, all guests turned off, still locked up. Different address block this time, but still 3 unique addresses, if that means anything. Next I'll disable the bridge with the two add-on cards and see if it stabilizes.
 
Well this is what I came home to, locked up and offline in PVE. Very different output, not sure if the kernel is still spitting out messages after it locks up but I don't love hitting that reset button. Only solace is hoping that it already crashed before I kill it like this.

Any thoughts?
 

Attachments

  • 20250415_143942~2.jpg
With limited networking and no guests I still got the issue, but the single onboard nic is still in use so it may still be that.

I also have mismatched RAM with custom timings (instead of overclocking one set, I underclocked the other and matched the timings, to great success). Again, it had been running fine for months, and through many days of high-stress testing before that, but I removed one set and will see if it keeps happening. Unbuffered ECC RAM for the Ryzen platform is too expensive and there is no used market for it, so everything will remain consumer hardware.

Nothing new has appeared in the upgrade list since yesterday so no hot fixes. Maybe an updated driver?

Any help is appreciated
 
Well, I think it may have been the memory. I forgot that I had removed half the set when trying to sell it, then put it back in. I thought I had the timings right, but I guess not, and the problem seems to have disappeared after removing that RAM.

If anyone has a good source of unbuffered ECC DDR4 UDIMMs, I'd love to get as much RAM as this board can handle. 64GB would be great, but f* me, the prices are 10x.