Proxmox 9.1.7 RTX 6000 Pro drops off PCI bus

kur1j

New Member
Jan 18, 2025
The system boots up fine and Proxmox detects all of the RTX 6000 Pro devices without issue; they all show up in lspci. After an arbitrary amount of time (a few days), the devices that are NOT passed through to a VM (i.e. unassigned) drop off the bus, accompanied by some odd kernel errors. Rebooting the server brings them back so lspci detects them again. Any thoughts on where to go from here?

If the devices are assigned to a VM, they seem to never drop off the PCI bus.

Code:
uname -a
Linux prox 6.17.13-2-pve #1 SMP PREEMPT_DYNAMIC PMX 6.17.13-2 (2026-03-13T08:06Z) x86_64 GNU/Linux


Code:
[428093.296377] vfio-pci 0000:bb:00.0: Unable to change power state from D3hot to D0, device inaccessible

[428093.296596] pcieport 0000:b9:01.0: pciehp: Slot(4012): Link Down

[428093.297480] pcieport 0000:b9:01.0: pciehp: Slot(4012): Card not present

[428093.357952] ------------[ cut here ]------------

[428093.358187] UBSAN: shift-out-of-bounds in ./include/linux/log2.h:57:13

[428093.358365] shift exponent 64 is too large for 64-bit type 'long unsigned int'

[428093.358549] CPU: 177 UID: 0 PID: 3612174 Comm: kworker/177:5 Tainted: P           O        6.17.13-2-pve #1 PREEMPT(voluntary)

[428093.358553] Tainted: [P]=PROPRIETARY_MODULE, [O]=OOT_MODULE

[428093.358554] Hardware name: Supermicro SYS-522GA-NRT/X14DBG-AP, BIOS 1.4 07/15/2025

[428093.358557] Workqueue: pm pm_runtime_work

[428093.358574] Call Trace:

[428093.358579]  <TASK>

[428093.358585]  dump_stack_lvl+0x5f/0x90

[428093.358592]  dump_stack+0x10/0x18

[428093.358593]  ubsan_epilogue+0x9/0x39

[428093.358601]  __ubsan_handle_shift_out_of_bounds.cold+0x5e/0x113

[428093.358603]  pci_restore_iov_state.cold+0x16/0x21

[428093.358607]  ? pci_enable_acs+0xfa/0x190

[428093.358612]  pci_restore_state.part.0+0x1fb/0x3a0

[428093.358623]  pci_restore_state+0x1e/0x30

[428093.358624]  pci_pm_runtime_resume+0x3b/0xf0

[428093.358627]  ? __pfx_pci_pm_runtime_resume+0x10/0x10

[428093.358628]  __rpm_callback+0x48/0x1f0

[428093.358629]  ? ktime_get_mono_fast_ns+0x39/0xd0

[428093.358636]  ? __pfx_pci_pm_runtime_resume+0x10/0x10

[428093.358637]  rpm_callback+0x6e/0x80

[428093.358638]  ? __pfx_pci_pm_runtime_resume+0x10/0x10

[428093.358639]  rpm_resume+0x4cc/0x6f0

[428093.358640]  ? queue_delayed_work_on+0x81/0x90

[428093.358646]  pm_runtime_work+0x80/0xe0

[428093.358647]  process_one_work+0x188/0x370

[428093.358649]  worker_thread+0x33a/0x480

[428093.358650]  ? __pfx_worker_thread+0x10/0x10

[428093.358651]  kthread+0x108/0x220

[428093.358654]  ? __pfx_kthread+0x10/0x10

[428093.358655]  ret_from_fork+0x205/0x240

[428093.358661]  ? __pfx_kthread+0x10/0x10

[428093.358663]  ret_from_fork_asm+0x1a/0x30

[428093.358668]  </TASK>

[428093.363654] ---[ end trace ]---

[428093.374924] pcieport 0000:b9:01.0: pciehp: Slot(4012): Card present

[428094.395760] pcieport 0000:b9:01.0: pciehp: Slot(4012): No link

[434447.216757] vfio-pci 0000:cc:00.0: Unable to change power state from D3hot to D0, device inaccessible

[434447.216787] pcieport 0000:ca:01.0: pciehp: Slot(5009): Link Down

[434447.218010] pcieport 0000:ca:01.0: pciehp: Slot(5009): Card not present

[434447.291404] pcieport 0000:ca:01.0: pciehp: Slot(5009): Card present

[434448.094883] pci 0000:cc:00.0: [10de:2bb5] type 00 class 0x030200 PCIe Legacy Endpoint

[434448.095466] pci 0000:cc:00.0: BAR 0 [mem 0x00000000-0x03ffffff 64bit pref]

[434448.095748] pci 0000:cc:00.0: BAR 2 [mem 0x00000000-0x1fffffffff 64bit pref]

[434448.095957] pci 0000:cc:00.0: BAR 4 [mem 0x00000000-0x01ffffff 64bit pref]

[434448.096184] pci 0000:cc:00.0: Max Payload Size set to 256 (was 128, max 256)

[434448.096745] pci 0000:cc:00.0: Enabling HDA controller

[434448.099572] pci 0000:cc:00.0: PME# supported from D0 D3hot

[434448.100314] pci 0000:cc:00.0: VF BAR 0 [mem 0x00000000-0x0003ffff 64bit pref]

[434448.100438] pci 0000:cc:00.0: VF BAR 0 [mem 0x00000000-0x00bfffff 64bit pref]: contains BAR 0 for 48 VFs

[434448.100583] pci 0000:cc:00.0: VF BAR 2 [mem 0x00000000-0xffffffff 64bit pref]

[434448.100704] pci 0000:cc:00.0: VF BAR 2 [mem 0x00000000-0x2fffffffff 64bit pref]: contains BAR 2 for 48 VFs

[434448.100837] pci 0000:cc:00.0: VF BAR 4 [mem 0x00000000-0x01ffffff 64bit pref]

[434448.100959] pci 0000:cc:00.0: VF BAR 4 [mem 0x00000000-0x5fffffff 64bit pref]: contains BAR 4 for 48 VFs

[434448.105230] pci 0000:cc:00.0: Adding to iommu group 136

[434448.111434] pcieport 0000:ca:01.0: bridge window [io  0x1000-0x0fff] to [bus cc] add_size 1000

[434448.111634] pcieport 0000:c9:00.0: Assigned bridge window [mem 0xde000000-0xde4fffff] to [bus ca-d0] cannot fit 0x200000 required for 0000:ca:01.0 bridging to [bus cc]

[434448.111931] pcieport 0000:ca:01.0: bridge window [mem 0x00000000] to [bus cc] requires relaxed alignment rules

[434448.112078] pcieport 0000:ca:01.0: bridge window [mem 0x00100000-0x000fffff] to [bus cc] add_size 200000 add_align 100000

[434448.112243] pcieport 0000:ca:01.0: bridge window [mem size 0x00200000]: can't assign; no space

[434448.112392] pcieport 0000:ca:01.0: bridge window [mem size 0x00200000]: failed to assign

[434448.112557] pcieport 0000:ca:01.0: bridge window [io  size 0x1000]: can't assign; no space

[434448.112718] pcieport 0000:ca:01.0: bridge window [io  size 0x1000]: failed to assign

[434448.112895] pcieport 0000:ca:01.0: bridge window [mem size 0x00200000]: can't assign; no space

[434448.113065] pcieport 0000:ca:01.0: bridge window [mem size 0x00200000]: failed to assign

[434448.113235] pcieport 0000:ca:01.0: bridge window [io  size 0x1000]: can't assign; no space

[434448.113408] pcieport 0000:ca:01.0: bridge window [io  size 0x1000]: failed to assign

[434448.113601] pci 0000:cc:00.0: BAR 2 [mem 0x2fa000000000-0x2fbfffffffff 64bit pref]: assigned

[434448.113862] pci 0000:cc:00.0: VF BAR 2 [mem 0x2fc000000000-0x2fefffffffff 64bit pref]: assigned

[434448.114080] pci 0000:cc:00.0: BAR 0 [mem 0x2ff000000000-0x2ff003ffffff 64bit pref]: assigned

[434448.114348] pci 0000:cc:00.0: BAR 4 [mem 0x2ff004000000-0x2ff005ffffff 64bit pref]: assigned

[434448.114643] pci 0000:cc:00.0: VF BAR 4 [mem 0x2ff006000000-0x2ff065ffffff 64bit pref]: assigned

[434448.114882] pci 0000:cc:00.0: VF BAR 0 [mem 0x2ff066000000-0x2ff066bfffff 64bit pref]: assigned

[434448.115127] pcieport 0000:ca:01.0: PCI bridge to [bus cc]

[434448.115367] pcieport 0000:ca:01.0:   bridge window [mem 0x2fa000000000-0x2ff066bfffff 64bit pref]

[434448.115629] PCI: No. 2 try to assign unassigned res

[434448.115843] pcieport 0000:ca:03.0: resource 14 [mem 0xde200000-0xde3fffff] released

[434448.116005] pcieport 0000:ca:03.0: PCI bridge to [bus ce]

[434448.116173] pcieport 0000:ca:04.0: resource 14 [mem 0xde000000-0xde1fffff] released

[434448.116330] pcieport 0000:ca:04.0: PCI bridge to [bus cf]

[434448.116506] pcieport 0000:c9:00.0: resource 14 [mem 0xde000000-0xde4fffff] released

[434448.116696] pcieport 0000:c9:00.0: PCI bridge to [bus ca-d0]

[434448.116877] pcieport 0000:ca:01.0: bridge window [mem size 0x00200000]: can't assign; no space

[434448.117045] pcieport 0000:ca:01.0: bridge window [mem size 0x00200000]: failed to assign

[434448.117202] pcieport 0000:ca:01.0: bridge window [io  size 0x1000]: can't assign; no space

[434448.117351] pcieport 0000:ca:01.0: bridge window [io  size 0x1000]: failed to assign

[434448.117503] pcieport 0000:ca:01.0: bridge window [mem size 0x00200000]: can't assign; no space

[434448.117680] pcieport 0000:ca:01.0: bridge window [mem size 0x00200000]: failed to assign

[434448.117832] pcieport 0000:ca:01.0: bridge window [io  size 0x1000]: can't assign; no space

[434448.117977] pcieport 0000:ca:01.0: bridge window [io  size 0x1000]: failed to assign

[434448.118125] pcieport 0000:ca:01.0: PCI bridge to [bus cc]

[434448.118304] pcieport 0000:ca:01.0:   bridge window [mem 0x2fa000000000-0x2ff066bfffff 64bit pref]

[439265.014985] mlx5_core 0000:a7:00.1: Using 56-bit DMA addresses

[462942.054093] INFO: NMI handler (ghes_notify_nmi) took too long to run: 1.002 msecs

[485846.881453] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 1.002 msecs
 
> Unable to change power state from D3hot to D0, device inaccessible
it seems something wants to change the power state of the device and is unable to

do you passthrough any other device on that server? do you use the acs override kernel commandline?

more from the log would also be interesting (e.g. 'journalctl -b' outputs the log since the last boot)
 
> it seems something wants to change the power state of the device and is unable to
>
> do you passthrough any other device on that server? do you use the acs override kernel commandline?
>
> more from the log would also be interesting (e.g. 'journalctl -b' outputs the log since the last boot)
Well, I think the power state issue is potentially a red herring, because the devices have already dropped off the PCI bus and I can't get them back without restarting the whole system. Maybe the power state WAS changed initially and it bugged out? But right now nothing works to get them back on the bus, not even a PCI rescan, which is why the power state change can't find them. Granted, I'm not an expert in this.

This system has 8x RTX 6000 Pro units; 4 of them are passed through to VMs and are operating completely normally. The 4 that were unassigned (i.e. not passed through) dropped off the PCI bus after roughly 7 days. I am not doing anything special for PCI passthrough: I simply went to my "cluster" in Proxmox, added the PCI devices under "Resource Mappings", and then added the mapped PCI device to the VMs. 4 were assigned to a VM and 4 were left unassigned.

The devices below were assigned through that method. The others were NEVER assigned to a VM, and they just disappeared off the PCI bus.
Code:
0000:3d:00.0 - Working; Assigned to VM
0000:3e:00.0 - Working; Assigned to VM
0000:4e:00.0 - Working; Assigned to VM
0000:4f:00.0 - Working; Assigned to VM
0000:ba:00.0 - no longer on PCI bus; not assigned
0000:bb:00.0 - no longer on PCI bus; not assigned
0000:cc:00.0 - no longer on PCI bus; not assigned
0000:cd:00.0 - no longer on PCI bus; not assigned
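For reference, this is what I tried for the rescan: the standard sysfs remove/rescan knobs (a sketch; the addresses are from the log above, and none of this brought the cards back).

```shell
# Sketch: the standard sysfs interface for dropping a dead device node and
# asking the kernel to rescan. Addresses are taken from the log above.
DEV=0000:bb:00.0
BRIDGE=0000:b9:01.0

# Remove the stale device node first, if it is still registered
if [ -e "/sys/bus/pci/devices/$DEV" ]; then
    echo 1 > "/sys/bus/pci/devices/$DEV/remove"
fi

# Rescan from the downstream bridge (or echo 1 > /sys/bus/pci/rescan
# to rescan everything)
if [ -e "/sys/bus/pci/devices/$BRIDGE" ]; then
    echo 1 > "/sys/bus/pci/devices/$BRIDGE/rescan"
fi
```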


The file was too large, so I had to trim it down.

Looking at the ba:00 device during boot, it shows up at 11:49:05:

Code:
Apr 03 11:49:05 host-XXXX kernel: pci 0000:ba:00.0: [10de:2bb5] type 00 class 0x030200 PCIe Legacy Endpoint
Apr 03 11:49:05 host-XXXX kernel: pci 0000:ba:00.0: BAR 0 [mem 0x2ef000000000-0x2ef003ffffff 64bit pref]
Apr 03 11:49:05 host-XXXX kernel: pci 0000:ba:00.0: BAR 2 [mem 0x2ea000000000-0x2ebfffffffff 64bit pref]
Apr 03 11:49:05 host-XXXX kernel: pci 0000:ba:00.0: BAR 4 [mem 0x2ef064000000-0x2ef065ffffff 64bit pref]
Apr 03 11:49:05 host-XXXX kernel: pci 0000:ba:00.0: enabling Extended Tags
Apr 03 11:49:05 host-XXXX kernel: pci 0000:ba:00.0: Enabling HDA controller
Apr 03 11:49:05 host-XXXX kernel: pci 0000:ba:00.0: PME# supported from D0 D3hot
Apr 03 11:49:05 host-XXXX kernel: pci 0000:ba:00.0: VF BAR 0 [mem 0x2ef066000000-0x2ef06603ffff 64bit pref]
Apr 03 11:49:05 host-XXXX kernel: pci 0000:ba:00.0: VF BAR 0 [mem 0x2ef066000000-0x2ef066bfffff 64bit pref]: contains BAR 0 for 48 VFs
Apr 03 11:49:05 host-XXXX kernel: pci 0000:ba:00.0: VF BAR 2 [mem 0x2ec000000000-0x2ec0ffffffff 64bit pref]
Apr 03 11:49:05 host-XXXX kernel: pci 0000:ba:00.0: VF BAR 2 [mem 0x2ec000000000-0x2eefffffffff 64bit pref]: contains BAR 2 for 48 VFs
Apr 03 11:49:05 host-XXXX kernel: pci 0000:ba:00.0: VF BAR 4 [mem 0x2ef004000000-0x2ef005ffffff 64bit pref]
Apr 03 11:49:05 host-XXXX kernel: pci 0000:ba:00.0: VF BAR 4 [mem 0x2ef004000000-0x2ef063ffffff 64bit pref]: contains BAR 4 for 48 VFs

Slightly later during boot

Code:
Apr 03 11:49:05 host-XXXX kernel: pci 0000:ba:00.0: Adding to iommu group 150

Then 7 days later

Code:
Apr 10 11:29:12 host-XXXX kernel: pci 0000:ba:00.0: [10de:2bb5] type 00 class 0x030200 PCIe Legacy Endpoint
Apr 10 11:29:12 host-XXXX kernel: pci 0000:ba:00.0: BAR 0 [mem 0x00000000-0x03ffffff 64bit pref]
Apr 10 11:29:12 host-XXXX kernel: pci 0000:ba:00.0: BAR 2 [mem 0x00000000-0x1fffffffff 64bit pref]
Apr 10 11:29:12 host-XXXX kernel: pci 0000:ba:00.0: BAR 4 [mem 0x00000000-0x01ffffff 64bit pref]
Apr 10 11:29:12 host-XXXX kernel: pci 0000:ba:00.0: Max Payload Size set to 256 (was 128, max 256)
Apr 10 11:29:12 host-XXXX kernel: pci 0000:ba:00.0: Enabling HDA controller
Apr 10 11:29:12 host-XXXX kernel: pci 0000:ba:00.0: PME# supported from D0 D3hot
Apr 10 11:29:12 host-XXXX kernel: pci 0000:ba:00.0: VF BAR 0 [mem 0x00000000-0x0003ffff 64bit pref]
Apr 10 11:29:12 host-XXXX kernel: pci 0000:ba:00.0: VF BAR 0 [mem 0x00000000-0x00bfffff 64bit pref]: contains BAR 0 for 48 VFs
Apr 10 11:29:12 host-XXXX kernel: pci 0000:ba:00.0: VF BAR 2 [mem 0x00000000-0xffffffff 64bit pref]
Apr 10 11:29:12 host-XXXX kernel: pci 0000:ba:00.0: VF BAR 2 [mem 0x00000000-0x2fffffffff 64bit pref]: contains BAR 2 for 48 VFs
Apr 10 11:29:12 host-XXXX kernel: pci 0000:ba:00.0: VF BAR 4 [mem 0x00000000-0x01ffffff 64bit pref]
Apr 10 11:29:12 host-XXXX kernel: pci 0000:ba:00.0: VF BAR 4 [mem 0x00000000-0x5fffffff 64bit pref]: contains BAR 4 for 48 VFs
Apr 10 11:29:12 host-XXXX kernel: pci 0000:ba:00.0: Adding to iommu group 150
Apr 10 11:29:12 host-XXXX kernel: pcieport 0000:b9:00.0: bridge window [io  0x1000-0x0fff] to [bus ba] add_size 1000
Apr 10 11:29:12 host-XXXX kernel: pcieport 0000:b8:00.0: Assigned bridge window [mem 0xd6000000-0xd64fffff] to [bus b9-be] cannot fit 0x200000 required for 0000:b9:00.0 bridging to [bus ba]
Apr 10 11:29:12 host-XXXX kernel: pcieport 0000:b9:00.0: bridge window [mem 0x00000000] to [bus ba] requires relaxed alignment rules
Apr 10 11:29:12 host-XXXX kernel: pcieport 0000:b9:00.0: bridge window [mem 0x00100000-0x000fffff] to [bus ba] add_size 200000 add_align 100000
Apr 10 11:29:12 host-XXXX kernel: pcieport 0000:b9:00.0: bridge window [mem size 0x00200000]: can't assign; no space
Apr 10 11:29:12 host-XXXX kernel: pcieport 0000:b9:00.0: bridge window [mem size 0x00200000]: failed to assign
Apr 10 11:29:12 host-XXXX kernel: pcieport 0000:b9:00.0: bridge window [io  size 0x1000]: can't assign; no space
Apr 10 11:29:12 host-XXXX kernel: pcieport 0000:b9:00.0: bridge window [io  size 0x1000]: failed to assign
Apr 10 11:29:12 host-XXXX kernel: pcieport 0000:b9:00.0: bridge window [mem size 0x00200000]: can't assign; no space
Apr 10 11:29:12 host-XXXX kernel: pcieport 0000:b9:00.0: bridge window [mem size 0x00200000]: failed to assign
Apr 10 11:29:12 host-XXXX kernel: pcieport 0000:b9:00.0: bridge window [io  size 0x1000]: can't assign; no space
Apr 10 11:29:12 host-XXXX kernel: pcieport 0000:b9:00.0: bridge window [io  size 0x1000]: failed to assign
Apr 10 11:29:12 host-XXXX kernel: pci 0000:ba:00.0: BAR 2 [mem 0x2ea000000000-0x2ebfffffffff 64bit pref]: assigned
Apr 10 11:29:12 host-XXXX kernel: pci 0000:ba:00.0: VF BAR 2 [mem 0x2ec000000000-0x2eefffffffff 64bit pref]: assigned
Apr 10 11:29:12 host-XXXX kernel: pci 0000:ba:00.0: BAR 0 [mem 0x2ef000000000-0x2ef003ffffff 64bit pref]: assigned
Apr 10 11:29:12 host-XXXX kernel: pci 0000:ba:00.0: BAR 4 [mem 0x2ef004000000-0x2ef005ffffff 64bit pref]: assigned
Apr 10 11:29:12 host-XXXX kernel: pci 0000:ba:00.0: VF BAR 4 [mem 0x2ef006000000-0x2ef065ffffff 64bit pref]: assigned
Apr 10 11:29:12 host-XXXX kernel: pci 0000:ba:00.0: VF BAR 0 [mem 0x2ef066000000-0x2ef066bfffff 64bit pref]: assigned
Apr 10 11:29:12 host-XXXX kernel: pcieport 0000:b9:00.0: PCI bridge to [bus ba]
Apr 10 11:29:12 host-XXXX kernel: pcieport 0000:b9:00.0:   bridge window [mem 0x2ea000000000-0x2ef066bfffff 64bit pref]
Apr 10 11:29:12 host-XXXX kernel: PCI: No. 2 try to assign unassigned res
Apr 10 11:29:12 host-XXXX kernel: pcieport 0000:b9:03.0: resource 14 [mem 0xd6200000-0xd63fffff] released
Apr 10 11:29:12 host-XXXX kernel: pcieport 0000:b9:03.0: PCI bridge to [bus bd]
Apr 10 11:29:12 host-XXXX kernel: pcieport 0000:b9:04.0: resource 14 [mem 0xd6000000-0xd61fffff] released
Apr 10 11:29:12 host-XXXX kernel: pcieport 0000:b9:04.0: PCI bridge to [bus be]
Apr 10 11:29:12 host-XXXX kernel: pcieport 0000:b8:00.0: resource 14 [mem 0xd6000000-0xd64fffff] released
Apr 10 11:29:12 host-XXXX kernel: pcieport 0000:b8:00.0: PCI bridge to [bus b9-be]
Apr 10 11:29:12 host-XXXX kernel: pcieport 0000:b9:00.0: bridge window [mem size 0x00200000]: can't assign; no space
Apr 10 11:29:12 host-XXXX kernel: pcieport 0000:b9:00.0: bridge window [mem size 0x00200000]: failed to assign
Apr 10 11:29:12 host-XXXX kernel: pcieport 0000:b9:00.0: bridge window [io  size 0x1000]: can't assign; no space
Apr 10 11:29:12 host-XXXX kernel: pcieport 0000:b9:00.0: bridge window [io  size 0x1000]: failed to assign
Apr 10 11:29:12 host-XXXX kernel: pcieport 0000:b9:00.0: bridge window [mem size 0x00200000]: can't assign; no space
Apr 10 11:29:12 host-XXXX kernel: pcieport 0000:b9:00.0: bridge window [mem size 0x00200000]: failed to assign
Apr 10 11:29:12 host-XXXX kernel: pcieport 0000:b9:00.0: bridge window [io  size 0x1000]: can't assign; no space
Apr 10 11:29:12 host-XXXX kernel: pcieport 0000:b9:00.0: bridge window [io  size 0x1000]: failed to assign
Apr 10 11:29:12 host-XXXX kernel: pcieport 0000:b9:00.0: PCI bridge to [bus ba]
Apr 10 11:29:12 host-XXXX kernel: pcieport 0000:b9:00.0:   bridge window [mem 0x2ea000000000-0x2ef066bfffff 64bit pref]

and then a few hours later you see

Code:
Apr 10 13:06:41 host-XXXX kernel: vfio-pci 0000:ba:00.0: Unable to change power state from D3hot to D0, device inaccessible

But looking at bb:00.0, you can see it booted around April 3 at 11:49 and the card was accessible:
Code:
Apr 03 11:49:05 host-XXXX kernel: pci 0000:bb:00.0: [10de:2bb5] type 00 class 0x030200 PCIe Legacy Endpoint
Apr 03 11:49:05 host-XXXX kernel: pci 0000:bb:00.0: BAR 0 [mem 0x2e9000000000-0x2e9003ffffff 64bit pref]
Apr 03 11:49:05 host-XXXX kernel: pci 0000:bb:00.0: BAR 2 [mem 0x2e4000000000-0x2e5fffffffff 64bit pref]
Apr 03 11:49:05 host-XXXX kernel: pci 0000:bb:00.0: BAR 4 [mem 0x2e9064000000-0x2e9065ffffff 64bit pref]
Apr 03 11:49:05 host-XXXX kernel: pci 0000:bb:00.0: enabling Extended Tags
Apr 03 11:49:05 host-XXXX kernel: pci 0000:bb:00.0: Enabling HDA controller
Apr 03 11:49:05 host-XXXX kernel: pci 0000:bb:00.0: PME# supported from D0 D3hot
Apr 03 11:49:05 host-XXXX kernel: pci 0000:bb:00.0: VF BAR 0 [mem 0x2e9066000000-0x2e906603ffff 64bit pref]
Apr 03 11:49:05 host-XXXX kernel: pci 0000:bb:00.0: VF BAR 0 [mem 0x2e9066000000-0x2e9066bfffff 64bit pref]: contains BAR 0 for 48 VFs
Apr 03 11:49:05 host-XXXX kernel: pci 0000:bb:00.0: VF BAR 2 [mem 0x2e6000000000-0x2e60ffffffff 64bit pref]
Apr 03 11:49:05 host-XXXX kernel: pci 0000:bb:00.0: VF BAR 2 [mem 0x2e6000000000-0x2e8fffffffff 64bit pref]: contains BAR 2 for 48 VFs
Apr 03 11:49:05 host-XXXX kernel: pci 0000:bb:00.0: VF BAR 4 [mem 0x2e9004000000-0x2e9005ffffff 64bit pref]
Apr 03 11:49:05 host-XXXX kernel: pci 0000:bb:00.0: VF BAR 4 [mem 0x2e9004000000-0x2e9063ffffff 64bit pref]: contains BAR 4 for 48 VFs

Still booting...but later in the file
Code:
Apr 03 11:49:05 host-XXXX kernel: pci 0000:bb:00.0: Adding to iommu group 151

The next thing you see in the logs is on April 8th at 10:43:26, when it disappeared and became inaccessible:

Code:
Apr 08 10:43:26 host-XXXX kernel: vfio-pci 0000:bb:00.0: Unable to change power state from D3hot to D0, device inaccessible
Apr 08 10:43:26 host-XXXX kernel: pcieport 0000:b9:01.0: pciehp: Slot(4012): Link Down
Apr 08 10:43:26 host-XXXX kernel: pcieport 0000:b9:01.0: pciehp: Slot(4012): Card not present
Apr 08 10:43:26 host-XXXX kernel: ------------[ cut here ]------------
Apr 08 10:43:26 host-XXXX kernel: UBSAN: shift-out-of-bounds in ./include/linux/log2.h:57:13
Apr 08 10:43:26 host-XXXX kernel: shift exponent 64 is too large for 64-bit type 'long unsigned int'
Apr 08 10:43:26 host-XXXX kernel: CPU: 177 UID: 0 PID: 3612174 Comm: kworker/177:5 Tainted: P           O        6.17.13-2-pve #1 PREEMPT(voluntary)
Apr 08 10:43:26 host-XXXX kernel: Tainted: [P]=PROPRIETARY_MODULE, [O]=OOT_MODULE
Apr 08 10:43:26 host-XXXX kernel: Hardware name: Supermicro SYS-522GA-NRT/X14DBG-AP, BIOS 1.4 07/15/2025
Apr 08 10:43:26 host-XXXX kernel: Workqueue: pm pm_runtime_work
Apr 08 10:43:26 host-XXXX kernel: Call Trace:
Apr 08 10:43:26 host-XXXX kernel:  <TASK>
Apr 08 10:43:26 host-XXXX kernel:  dump_stack_lvl+0x5f/0x90
Apr 08 10:43:26 host-XXXX kernel:  dump_stack+0x10/0x18
Apr 08 10:43:26 host-XXXX kernel:  ubsan_epilogue+0x9/0x39
Apr 08 10:43:26 host-XXXX kernel:  __ubsan_handle_shift_out_of_bounds.cold+0x5e/0x113
Apr 08 10:43:26 host-XXXX kernel:  pci_restore_iov_state.cold+0x16/0x21
Apr 08 10:43:26 host-XXXX kernel:  ? pci_enable_acs+0xfa/0x190
Apr 08 10:43:26 host-XXXX kernel:  pci_restore_state.part.0+0x1fb/0x3a0
Apr 08 10:43:26 host-XXXX kernel:  pci_restore_state+0x1e/0x30
Apr 08 10:43:26 host-XXXX kernel:  pci_pm_runtime_resume+0x3b/0xf0
Apr 08 10:43:26 host-XXXX kernel:  ? __pfx_pci_pm_runtime_resume+0x10/0x10
Apr 08 10:43:26 host-XXXX kernel:  __rpm_callback+0x48/0x1f0
Apr 08 10:43:26 host-XXXX kernel:  ? ktime_get_mono_fast_ns+0x39/0xd0
Apr 08 10:43:26 host-XXXX kernel:  ? __pfx_pci_pm_runtime_resume+0x10/0x10
Apr 08 10:43:26 host-XXXX kernel:  rpm_callback+0x6e/0x80
Apr 08 10:43:26 host-XXXX kernel:  ? __pfx_pci_pm_runtime_resume+0x10/0x10
Apr 08 10:43:26 host-XXXX kernel:  rpm_resume+0x4cc/0x6f0
Apr 08 10:43:26 host-XXXX kernel:  ? queue_delayed_work_on+0x81/0x90
Apr 08 10:43:26 host-XXXX kernel:  pm_runtime_work+0x80/0xe0
Apr 08 10:43:26 host-XXXX kernel:  process_one_work+0x188/0x370
Apr 08 10:43:26 host-XXXX kernel:  worker_thread+0x33a/0x480
Apr 08 10:43:26 host-XXXX kernel:  ? __pfx_worker_thread+0x10/0x10
Apr 08 10:43:26 host-XXXX kernel:  kthread+0x108/0x220
Apr 08 10:43:26 host-XXXX kernel:  ? __pfx_kthread+0x10/0x10
Apr 08 10:43:26 host-XXXX kernel:  ret_from_fork+0x205/0x240
Apr 08 10:43:26 host-XXXX kernel:  ? __pfx_kthread+0x10/0x10
Apr 08 10:43:26 host-XXXX kernel:  ret_from_fork_asm+0x1a/0x30
Apr 08 10:43:26 host-XXXX kernel:  </TASK>
Apr 08 10:43:26 host-XXXX kernel: ---[ end trace ]---
Apr 08 10:43:26 host-XXXX kernel: pcieport 0000:b9:01.0: pciehp: Slot(4012): Card present
 

ok thanks for the logs.

while the ubsan line is something somebody should take a look at, i think it's also just a symptom of the underlying issue.

if the card is not in use, i doubt this is a software bug since the kernel does not just randomly disconnect pci devices if they're not doing anything...
a pcie device not responding after it's not in use sounds like a hardware issue (e.g. mainboard bios or gpu firmware) to me.

can you check if there are any bios updates?

I get that each testing cycle takes a week, but sadly there is not much to go on here. could you test assigning one of the 4 unassigned cards to a vm that just loads the driver (does not have to do anything) just to rule out e.g. the physical cards and placement on the mainboard vs assigned/unassigned state.
 
> ok thanks for the logs.
>
> while the ubsan line is something somebody should take a look at, i think it's also just a symptom of the underlying issue.
>
> if the card is not in use, i doubt this is a software bug since the kernel does not just randomly disconnect pci devices if they're not doing anything...
> a pcie device not responding after it's not in use sounds like a hardware issue (e.g. mainboard bios or gpu firmware) to me.
>
> can you check if there are any bios updates?
>
> I get that each testing cycle takes a week, but sadly there is not much to go on here. could you test assigning one of the 4 unassigned cards to a vm that just loads the driver (does not have to do anything) just to rule out e.g. the physical cards and placement on the mainboard vs assigned/unassigned state.
Yeah, this one is rough....

No BIOS updates available; the latest BIOS for these systems is already applied.

>I get that each testing cycle takes a week, but sadly there is not much to go on here. could you test assigning one of the 4 unassigned cards to a vm that just loads the driver (does not have to do anything) just to rule out e.g. the physical cards and placement on the mainboard vs assigned/unassigned state.


That is exactly what is going on here: the bottom 4 are NOT assigned to a VM (just sitting there unattached) and they dropped off.

Code:
0000:3d:00.0 - Working; Assigned to VM
0000:3e:00.0 - Working; Assigned to VM
0000:4e:00.0 - Working; Assigned to VM
0000:4f:00.0 - Working; Assigned to VM
0000:ba:00.0 - no longer on PCI bus; not assigned
0000:bb:00.0 - no longer on PCI bus; not assigned
0000:cc:00.0 - no longer on PCI bus; not assigned
0000:cd:00.0 - no longer on PCI bus; not assigned


I rebooted the system and all PCI devices came back and were detected by the OS. Within 4 hours, one of the devices had dropped off the PCI bus again. So it seems random.

Code:
0000:3d:00.0 - Working; Assigned to VM
0000:3e:00.0 - Working; Assigned to VM
0000:4e:00.0 - Working; Assigned to VM
0000:4f:00.0 - Working; Assigned to VM
0000:ba:00.0 - detectable by OS; not assigned to VM
0000:bb:00.0 - detectable by OS; not assigned to VM
0000:cd:00.0 - no longer on PCI bus; not assigned
0000:ce:00.0 - detectable by OS; not assigned to VM


Again, it was one of the units not assigned to a VM.

I'm wondering if there is a power state issue: maybe the idle ones get put into a suspend state and can't be woken up?
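If it is runtime power management, the per-device state should be visible in sysfs. Something like this (a sketch, using one of the affected addresses from above) would show whether an idle card is being suspended, and pin it awake as a test:

```shell
# Check whether runtime PM is suspending an idle card, and pin it awake.
# These are the standard kernel runtime-PM knobs in sysfs; the address is
# one of the affected cards from the list above.
DEV=0000:ba:00.0
PMDIR=/sys/bus/pci/devices/$DEV/power

if [ -d "$PMDIR" ]; then
    cat "$PMDIR/control"          # "auto" = runtime suspend allowed, "on" = kept awake
    cat "$PMDIR/runtime_status"   # "active" vs "suspended"
    echo on > "$PMDIR/control"    # forbid runtime suspend for this device
fi
```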
 
> I'm wondering if there is a power state issue: maybe the idle ones get put into a suspend state and can't be woken up?
possibly, but hard to say with the data here. you could try disabling power state management (e.g. in the BIOS, or with pcie_aspm=off on the kernel commandline)

sorry if my last post was badly worded; what i actually wanted you to test (if possible) is to leave one of the currently assigned cards unassigned, and assign one of the currently unassigned ones (e.g. leave 3e:00.0 unassigned, and assign bb:00.0 instead). this would rule out the physical placement of the cards, the cards themselves, and their power connections. the last post has the same cards unassigned as the first post AFAICT

sorry that i can't give any more hints.
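for completeness, a sketch of what that commandline change looks like on a proxmox host (assuming a systemd-boot install where the commandline lives in /etc/kernel/cmdline; legacy grub installs edit GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub and run update-grub instead):

```shell
# sketch: append pcie_aspm=off to the kernel commandline (systemd-boot layout)
echo "$(cat /etc/kernel/cmdline) pcie_aspm=off" > /etc/kernel/cmdline
proxmox-boot-tool refresh   # sync the new commandline to the boot partitions
# reboot afterwards for it to take effect
```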
 
> possibly, but hard to say with the data here. you could try disabling power state management (e.g. in the BIOS, or with pcie_aspm=off on the kernel commandline)
>
> sorry if my last post was badly worded; what i actually wanted you to test (if possible) is to leave one of the currently assigned cards unassigned, and assign one of the currently unassigned ones (e.g. leave 3e:00.0 unassigned, and assign bb:00.0 instead). this would rule out the physical placement of the cards, the cards themselves, and their power connections. the last post has the same cards unassigned as the first post AFAICT
>
> sorry that i can't give any more hints.

Gotcha, thanks. I'll give that a shot; see below.

Current state first:

I rebooted this server on Friday, and by the end of the weekend one GPU had dropped off the PCI bus and another was complaining about its iommu group not being correct.


Code:
0000:ba:00.0 - not assigned to VM - Configuration for iommugroup not correct ('152 != '138')
0000:cd:00.0 - not assigned to VM - Cannot find PCI id 0000:cd:00.0

lspci | grep -i nvidia
3d:00.0 3D controller: NVIDIA Corporation GB202GL [RTX PRO 6000 Blackwell Server Edition] (rev a1)
3e:00.0 3D controller: NVIDIA Corporation GB202GL [RTX PRO 6000 Blackwell Server Edition] (rev a1)
4e:00.0 3D controller: NVIDIA Corporation GB202GL [RTX PRO 6000 Blackwell Server Edition] (rev a1)
4f:00.0 3D controller: NVIDIA Corporation GB202GL [RTX PRO 6000 Blackwell Server Edition] (rev a1)
ba:00.0 3D controller: NVIDIA Corporation GB202GL [RTX PRO 6000 Blackwell Server Edition] (rev a1)
bb:00.0 3D controller: NVIDIA Corporation GB202GL [RTX PRO 6000 Blackwell Server Edition] (rev a1)
ce:00.0 3D controller: NVIDIA Corporation GB202GL [RTX PRO 6000 Blackwell Server Edition] (rev a1)

To do the test you mentioned, I first had to reboot the server to get the missing GPU detected again.

I have now reversed which GPUs are assigned, to run the test you suggested.

State as of my most recent reboot, 2026-05-11 9 AM CST:

Code:
0000:3d:00.0 - not assigned; detected
0000:3e:00.0 - not assigned; detected
0000:4e:00.0 - not assigned; detected
0000:4f:00.0 - not assigned; detected
0000:ba:00.0 - Working; Assigned to VM
0000:bb:00.0 - Working; Assigned to VM
0000:cc:00.0 - Working; Assigned to VM
0000:cd:00.0 - Working; Assigned to VM

So, assuming the GPUs assigned to a VM keep working properly (in this case ba:00.0, bb:00.0, cc:00.0, cd:00.0) and the other devices (3d:00.0, 3e:00.0, 4e:00.0, 4f:00.0) start to fall off the PCI bus: what could assigning a PCI device to a VM be doing that prevents it from falling off the bus?
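My working assumption (unconfirmed) is that a card bound to vfio-pci and in use by a running VM is held awake in D0, while an idle, unassigned card can be runtime-suspended to D3 and this platform then fails to wake it. The standard sysfs attributes make it easy to compare the two groups the next time one drops; a sketch (addresses are the current layout from above):

```shell
# Compare driver binding and runtime-PM state across all eight GPUs
# (standard sysfs attributes; addresses are the cards listed above).
for DEV in 0000:3d:00.0 0000:3e:00.0 0000:4e:00.0 0000:4f:00.0 \
           0000:ba:00.0 0000:bb:00.0 0000:cc:00.0 0000:cd:00.0; do
    D=/sys/bus/pci/devices/$DEV
    [ -d "$D" ] || { echo "$DEV: gone from the bus"; continue; }
    printf '%s driver=%s runtime=%s\n' "$DEV" \
        "$(basename "$(readlink -f "$D/driver" 2>/dev/null)")" \
        "$(cat "$D/power/runtime_status" 2>/dev/null)"
done
```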
 
Ironically, it happened quicker than I thought. This particular system seems to hit the issue more often than our other systems.

As of this morning, all cards were being detected:
Code:
0000:3d:00.0 - not assigned; detected
0000:3e:00.0 - not assigned; detected
0000:4e:00.0 - not assigned; detected
0000:4f:00.0 - not assigned; detected
0000:ba:00.0 - Working; Assigned to VM
0000:bb:00.0 - Working; Assigned to VM
0000:cc:00.0 - Working; Assigned to VM
0000:cd:00.0 - Working; Assigned to VM

lspci -tv

Code:
+-[0000:3a]-+-00.0  Intel Corporation Ice Lake Memory Map/VT-d
|           \-02.0-[3b-41]----00.0-[3c-41]--+-00.0-[3d]----00.0  NVIDIA Corporation GB202GL [RTX PRO 6000 Blackwell Server Edition]
|                                           +-01.0-[3e]----00.0  NVIDIA Corporation GB202GL [RTX PRO 6000 Blackwell Server Edition]
|                                           +-02.0-[3f]--+-00.0  Broadcom Inc. and subsidiaries BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller
|                                           |            \-00.1  Broadcom Inc. and subsidiaries BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller
|                                           +-03.0-[40]--
|                                           \-04.0-[41]--
+-[0000:4b]-+-00.0  Intel Corporation Ice Lake Memory Map/VT-d
|           \-02.0-[4c-53]----00.0-[4d-53]--+-00.0-[4e]----00.0  NVIDIA Corporation GB202GL [RTX PRO 6000 Blackwell Server Edition]
|                                           +-01.0-[4f]----00.0  NVIDIA Corporation GB202GL [RTX PRO 6000 Blackwell Server Edition]
|                                           +-02.0-[50]--+-00.0  Broadcom Inc. and subsidiaries BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller
|                                           |            \-00.1  Broadcom Inc. and subsidiaries BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller
|                                           +-03.0-[51]--
|                                           +-04.0-[52]--
|                                           \-1f.0-[53]----00.0  Broadcom / LSI PCIe Switch management endpoint

So within 4 hours, one of the previously assigned GPUs, which hadn't had any problems for several weeks, has now dropped off the PCI bus.

Code:
# date
Mon May 11 04:05:03 PM CDT 2026
0000:3d:00.0 - not assigned; detected
0000:3e:00.0 - not assigned; detected
0000:4e:00.0 - not assigned; no longer detected by the OS
0000:4f:00.0 - not assigned; detected
0000:ba:00.0 - Working; Assigned to VM
0000:bb:00.0 - Working; Assigned to VM
0000:cc:00.0 - Working; Assigned to VM
0000:cd:00.0 - Working; Assigned to VM



Code:
+-[0000:3a]-+-00.0  Intel Corporation Ice Lake Memory Map/VT-d
|           \-02.0-[3b-41]----00.0-[3c-41]--+-00.0-[3d]----00.0  NVIDIA Corporation GB202GL [RTX PRO 6000 Blackwell Server Edition]
|                                           +-01.0-[3e]----00.0  NVIDIA Corporation GB202GL [RTX PRO 6000 Blackwell Server Edition]
|                                           +-02.0-[3f]--+-00.0  Broadcom Inc. and subsidiaries BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller
|                                           |            \-00.1  Broadcom Inc. and subsidiaries BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller
|                                           +-03.0-[40]--
|                                           \-04.0-[41]--
+-[0000:4b]-+-00.0  Intel Corporation Ice Lake Memory Map/VT-d
|           \-02.0-[4c-53]----00.0-[4d-53]--+-00.0-[4e]--
| ----------SEE ABOVE FROM THE LAST TREE IT WAS HERE 6 HOURS AGO----
|                                           +-01.0-[4f]----00.0  NVIDIA Corporation GB202GL [RTX PRO 6000 Blackwell Server Edition]
|                                           +-02.0-[50]--+-00.0  Broadcom Inc. and subsidiaries BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller
|                                           |            \-00.1  Broadcom Inc. and subsidiaries BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller
|                                           +-03.0-[51]--
|                                           +-04.0-[52]--
|                                           \-1f.0-[53]----00.0  Broadcom / LSI PCIe Switch management endpoint


This is all consistent with the dmesg output:

System booted up around 9AM CST. By 11:24AM the 0000:4e GPU had dropped off the PCI bus.


Code:
[Mon May 11 08:58:59 2026] fwbr123i0: port 2(tap123i0) entered forwarding state
[Mon May 11 08:59:15 2026] vfio-pci 0000:cd:00.0: Enabling HDA controller
[Mon May 11 08:59:15 2026] vfio-pci 0000:cd:00.0: resetting
[Mon May 11 08:59:15 2026] vfio-pci 0000:cd:00.0: reset done
[Mon May 11 08:59:15 2026] vfio-pci 0000:ba:00.0: Enabling HDA controller
[Mon May 11 08:59:15 2026] vfio-pci 0000:ba:00.0: resetting
[Mon May 11 08:59:15 2026] vfio-pci 0000:ba:00.0: reset done
[Mon May 11 08:59:16 2026] vfio-pci 0000:ba:00.0: resetting
[Mon May 11 08:59:16 2026] vfio-pci 0000:ba:00.0: reset done
[Mon May 11 08:59:16 2026] vfio-pci 0000:cd:00.0: resetting
[Mon May 11 08:59:16 2026] vfio-pci 0000:cd:00.0: reset done
[Mon May 11 08:59:18 2026] vfio-pci 0000:bb:00.0: Enabling HDA controller
[Mon May 11 08:59:18 2026] vfio-pci 0000:bb:00.0: resetting
[Mon May 11 08:59:18 2026] vfio-pci 0000:bb:00.0: reset done
[Mon May 11 08:59:18 2026] vfio-pci 0000:ce:00.0: Enabling HDA controller
[Mon May 11 08:59:18 2026] vfio-pci 0000:ce:00.0: resetting
[Mon May 11 08:59:19 2026] vfio-pci 0000:ce:00.0: reset done
[Mon May 11 08:59:19 2026] vfio-pci 0000:ce:00.0: resetting
[Mon May 11 08:59:19 2026] vfio-pci 0000:ce:00.0: reset done
[Mon May 11 08:59:19 2026] vfio-pci 0000:bb:00.0: resetting
[Mon May 11 08:59:19 2026] vfio-pci 0000:bb:00.0: reset done
[Mon May 11 09:00:19 2026] ice 0000:cb:00.0: Using 56-bit DMA addresses
[Mon May 11 09:45:09 2026] ice 0000:2a:00.0: Using 56-bit DMA addresses
[Mon May 11 09:45:14 2026] perf: interrupt took too long (2509 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
[Mon May 11 10:05:32 2026] perf: interrupt took too long (3163 > 3136), lowering kernel.perf_event_max_sample_rate to 63000
[Mon May 11 10:36:36 2026] perf: interrupt took too long (3960 > 3953), lowering kernel.perf_event_max_sample_rate to 50000
[Mon May 11 11:17:16 2026] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 1.002 msecs
[Mon May 11 11:17:16 2026] perf: interrupt took too long (7943 > 4950), lowering kernel.perf_event_max_sample_rate to 25000
[Mon May 11 11:24:25 2026] pcieport 0000:4d:00.0: pciehp: Slot(2001): Link Down
[Mon May 11 11:24:25 2026] pcieport 0000:4d:00.0: pciehp: Slot(2001): Card not present
[Mon May 11 11:24:25 2026] ------------[ cut here ]------------
[Mon May 11 11:24:25 2026] UBSAN: shift-out-of-bounds in ./include/linux/log2.h:57:13
[Mon May 11 11:24:25 2026] shift exponent 64 is too large for 64-bit type 'long unsigned int'
[Mon May 11 11:24:25 2026] CPU: 43 UID: 0 PID: 2812 Comm: irq/105-pciehp Tainted: P S         O        7.0.2-2-pve #1 PREEMPT(lazy)
[Mon May 11 11:24:25 2026] Tainted: [P]=PROPRIETARY_MODULE, [S]=CPU_OUT_OF_SPEC, [O]=OOT_MODULE
[Mon May 11 11:24:25 2026] Hardware name: Supermicro SYS-522GA-NRT/X14DBG-AP, BIOS 1.4 07/15/2025
[Mon May 11 11:24:25 2026] Call Trace:
[Mon May 11 11:24:25 2026]  <TASK>
[Mon May 11 11:24:25 2026]  dump_stack_lvl+0x5f/0x90
[Mon May 11 11:24:25 2026]  dump_stack+0x10/0x18
[Mon May 11 11:24:25 2026]  ubsan_epilogue+0x9/0x39
[Mon May 11 11:24:25 2026]  __ubsan_handle_shift_out_of_bounds.cold+0x5e/0x113
[Mon May 11 11:24:25 2026]  pci_rebar_bytes_to_size.cold+0x16/0x1e
[Mon May 11 11:24:25 2026]  pci_restore_iov_state+0x1a8/0x1e0
[Mon May 11 11:24:25 2026]  ? pci_enable_acs+0xe1/0x170
[Mon May 11 11:24:25 2026]  pci_restore_state+0x1c3/0x280
[Mon May 11 11:24:25 2026]  pci_pm_runtime_resume+0x3b/0xf0
[Mon May 11 11:24:25 2026]  ? __pfx_pci_pm_runtime_resume+0x10/0x10
[Mon May 11 11:24:25 2026]  __rpm_callback+0x4b/0x1f0
[Mon May 11 11:24:25 2026]  ? ktime_get_mono_fast_ns+0x3c/0xd0
[Mon May 11 11:24:25 2026]  ? __pfx_pci_pm_runtime_resume+0x10/0x10
[Mon May 11 11:24:25 2026]  rpm_callback+0x6e/0x80
[Mon May 11 11:24:25 2026]  ? __pfx_pci_pm_runtime_resume+0x10/0x10
[Mon May 11 11:24:25 2026]  rpm_resume+0x4d6/0x700
[Mon May 11 11:24:25 2026]  ? xa_find+0x94/0x110
[Mon May 11 11:24:25 2026]  __pm_runtime_resume+0x4e/0x80
[Mon May 11 11:24:25 2026]  device_release_driver_internal+0xfe/0x270
[Mon May 11 11:24:25 2026]  device_release_driver+0x12/0x20
[Mon May 11 11:24:25 2026]  pci_stop_bus_device+0x69/0x90
[Mon May 11 11:24:25 2026]  pci_stop_and_remove_bus_device+0x12/0x30
[Mon May 11 11:24:25 2026]  pciehp_unconfigure_device+0x97/0x1a0
[Mon May 11 11:24:25 2026]  pciehp_disable_slot+0x68/0x110
[Mon May 11 11:24:25 2026]  pciehp_handle_presence_or_link_change+0x76/0x370
[Mon May 11 11:24:25 2026]  pciehp_ist+0x15b/0x1e0
[Mon May 11 11:24:25 2026]  irq_thread_fn+0x24/0x70
[Mon May 11 11:24:25 2026]  irq_thread+0x1c6/0x330
[Mon May 11 11:24:25 2026]  ? __pfx_irq_thread_fn+0x10/0x10
[Mon May 11 11:24:25 2026]  ? __pfx_irq_thread_dtor+0x10/0x10
[Mon May 11 11:24:25 2026]  ? __pfx_irq_thread+0x10/0x10
[Mon May 11 11:24:25 2026]  kthread+0xf7/0x130
[Mon May 11 11:24:25 2026]  ? __pfx_kthread+0x10/0x10
[Mon May 11 11:24:25 2026]  ret_from_fork+0x2dc/0x3a0
[Mon May 11 11:24:25 2026]  ? __pfx_kthread+0x10/0x10
[Mon May 11 11:24:25 2026]  ret_from_fork_asm+0x1a/0x30
[Mon May 11 11:24:25 2026]  </TASK>
[Mon May 11 11:24:25 2026] ---[ end trace ]---
[Mon May 11 11:24:25 2026] pcieport 0000:4d:00.0: pciehp: Slot(2001): Card present
[Mon May 11 11:24:25 2026] pci 0000:4e:00.0: [10de:2bb5] type 00 class 0x030200 PCIe Legacy Endpoint
[Mon May 11 11:24:25 2026] pci 0000:4e:00.0: BAR 0 [mem 0x00000000-0x03ffffff 64bit pref]
[Mon May 11 11:24:25 2026] pci 0000:4e:00.0: BAR 2 [mem 0x00000000-0x1fffffffff 64bit pref]
[Mon May 11 11:24:25 2026] pci 0000:4e:00.0: BAR 4 [mem 0x00000000-0x01ffffff 64bit pref]
[Mon May 11 11:24:25 2026] pci 0000:4e:00.0: Max Payload Size set to 256 (was 128, max 256)
[Mon May 11 11:24:25 2026] pci 0000:4e:00.0: Enabling HDA controller
[Mon May 11 11:24:25 2026] pci 0000:4e:00.0: PME# supported from D0 D3hot
[Mon May 11 11:24:25 2026] pci 0000:4e:00.0: VF BAR 0 [mem 0x00000000-0x0003ffff 64bit pref]
[Mon May 11 11:24:25 2026] pci 0000:4e:00.0: VF BAR 0 [mem 0x00000000-0x00bfffff 64bit pref]: contains BAR 0 for 48 VFs
[Mon May 11 11:24:25 2026] pci 0000:4e:00.0: VF BAR 2 [mem 0x00000000-0xffffffff 64bit pref]
[Mon May 11 11:24:25 2026] pci 0000:4e:00.0: VF BAR 2 [mem 0x00000000-0x2fffffffff 64bit pref]: contains BAR 2 for 48 VFs
[Mon May 11 11:24:25 2026] pci 0000:4e:00.0: VF BAR 4 [mem 0x00000000-0x01ffffff 64bit pref]
[Mon May 11 11:24:25 2026] pci 0000:4e:00.0: VF BAR 4 [mem 0x00000000-0x5fffffff 64bit pref]: contains BAR 4 for 48 VFs
[Mon May 11 11:24:25 2026] pci 0000:4e:00.0: Adding to iommu group 44
[Mon May 11 11:24:25 2026] pcieport 0000:4d:00.0: bridge window [mem size 0x00000000] to [bus 4e] add_size 200000 add_align 100000
[Mon May 11 11:24:25 2026] pcieport 0000:4d:00.0: bridge window [mem size 0x00200000]: can't assign; no space
[Mon May 11 11:24:25 2026] pcieport 0000:4d:00.0: bridge window [mem size 0x00200000]: failed to assign
[Mon May 11 11:24:25 2026] pcieport 0000:4d:00.0: bridge window [io  size 0x1000]: can't assign; no space
[Mon May 11 11:24:25 2026] pcieport 0000:4d:00.0: bridge window [io  size 0x1000]: failed to assign
[Mon May 11 11:24:25 2026] pcieport 0000:4d:00.0: bridge window [io  size 0x1000]: can't assign; no space
[Mon May 11 11:24:25 2026] pcieport 0000:4d:00.0: bridge window [io  size 0x1000]: failed to assign
[Mon May 11 11:24:25 2026] pcieport 0000:4d:00.0: bridge window [mem size 0x00200000]: can't assign; no space
[Mon May 11 11:24:25 2026] pcieport 0000:4d:00.0: bridge window [mem size 0x00200000]: failed to assign
[Mon May 11 11:24:25 2026] pci 0000:4e:00.0: BAR 2 [mem 0x25a000000000-0x25bfffffffff 64bit pref]: assigned
[Mon May 11 11:24:25 2026] pci 0000:4e:00.0: VF BAR 2 [mem 0x25c000000000-0x25efffffffff 64bit pref]: assigned
[Mon May 11 11:24:25 2026] pci 0000:4e:00.0: BAR 0 [mem 0x25f000000000-0x25f003ffffff 64bit pref]: assigned
[Mon May 11 11:24:25 2026] pci 0000:4e:00.0: BAR 4 [mem 0x25f004000000-0x25f005ffffff 64bit pref]: assigned
[Mon May 11 11:24:25 2026] pci 0000:4e:00.0: VF BAR 4 [mem 0x25f006000000-0x25f065ffffff 64bit pref]: assigned
[Mon May 11 11:24:25 2026] pci 0000:4e:00.0: VF BAR 0 [mem 0x25f066000000-0x25f066bfffff 64bit pref]: assigned
[Mon May 11 11:24:25 2026] pcieport 0000:4d:00.0: PCI bridge to [bus 4e]
[Mon May 11 11:24:25 2026] pcieport 0000:4d:00.0:   bridge window [mem 0x25a000000000-0x25f066bfffff 64bit pref]
[Mon May 11 11:24:25 2026] PCI: No. 2 try to assign unassigned res
[Mon May 11 11:24:25 2026] release child resource [mem 0xb2e00000-0xb2e7ffff pref]
[Mon May 11 11:24:25 2026] release child resource [mem 0xb2e80000-0xb2efffff pref]
[Mon May 11 11:24:25 2026] pcieport 0000:4d:02.0: bridge window [mem 0xb2e00000-0xb2efffff]: releasing
[Mon May 11 11:24:25 2026] pcieport 0000:4c:00.0: bridge window [mem 0xb2e00000-0xb2efffff]: releasing
[Mon May 11 11:24:25 2026] pcieport 0000:4d:00.0: bridge window [mem size 0x00000000] to [bus 4e] add_size 200000 add_align 100000
[Mon May 11 11:24:25 2026] pcieport 0000:4d:00.0: bridge window [mem size 0x00200000]: can't assign; no space
[Mon May 11 11:24:25 2026] pcieport 0000:4d:00.0: bridge window [mem size 0x00200000]: failed to assign
[Mon May 11 11:24:25 2026] pcieport 0000:4d:00.0: bridge window [io  size 0x1000]: can't assign; no space
[Mon May 11 11:24:25 2026] pcieport 0000:4d:00.0: bridge window [io  size 0x1000]: failed to assign
[Mon May 11 11:24:25 2026] pcieport 0000:4d:00.0: bridge window [io  size 0x1000]: can't assign; no space
[Mon May 11 11:24:25 2026] pcieport 0000:4d:00.0: bridge window [io  size 0x1000]: failed to assign
[Mon May 11 11:24:25 2026] pcieport 0000:4d:00.0: bridge window [mem size 0x00200000]: can't assign; no space
[Mon May 11 11:24:25 2026] pcieport 0000:4d:00.0: bridge window [mem size 0x00200000]: failed to assign
[Mon May 11 11:24:25 2026] pcieport 0000:4d:00.0: PCI bridge to [bus 4e]
[Mon May 11 11:24:25 2026] pcieport 0000:4d:00.0:   bridge window [mem 0x25a000000000-0x25f066bfffff 64bit pref]
[Mon May 11 11:24:25 2026] NovaCore 0000:4e:00.0: enabling device (0000 -> 0002)
[Mon May 11 11:24:25 2026] NovaCore 0000:4e:00.0: Unsupported chipset: boot42 = 0x1b2a1000 (architecture 0x1b, implementation 0x2)
[Mon May 11 13:31:59 2026] vfio-pci 0000:4e:00.0: Unable to change power state from D3hot to D0, device inaccessible
[Mon May 11 13:31:59 2026] pcieport 0000:4d:00.0: pciehp: Slot(2001): Link Down
[Mon May 11 13:31:59 2026] pcieport 0000:4d:00.0: pciehp: Slot(2001): Card not present
[Mon May 11 13:31:59 2026] pcieport 0000:4d:00.0: pciehp: Slot(2001): Card present
[Mon May 11 13:32:00 2026] pcieport 0000:4d:00.0: pciehp: Slot(2001): No link

So with different GPUs failing, I don't think it's a particular GPU that is bad. Notably, the failing devices are always the idle, unassigned ones, and the trace fires in pci_pm_runtime_resume after a failed D3hot -> D0 transition, so runtime power management looks like the trigger.