Proxmox 9.1.7 RTX 6000 Pro drops off PCI bus

kur1j

New Member
Jan 18, 2025
The system boots fine and Proxmox detects all the RTX 6000 Pro devices without issue; they all show up in "lspci". After an arbitrary amount of time (a few days), the devices that are NOT passed through to a VM (i.e. unassigned) drop off the bus, and the kernel logs some odd errors. Rebooting the server brings them back in lspci. Any thoughts on where to go from here?

If the devices are assigned to a VM, they never seem to drop off the PCI bus.

Code:
uname -a
Linux prox 6.17.13-2-pve #1 SMP PREEMPT_DYNAMIC PMX 6.17.13-2 (2026-03-13T08:06Z) x86_64 GNU/Linux


Code:
[428093.296377] vfio-pci 0000:bb:00.0: Unable to change power state from D3hot to D0, device inaccessible

[428093.296596] pcieport 0000:b9:01.0: pciehp: Slot(4012): Link Down

[428093.297480] pcieport 0000:b9:01.0: pciehp: Slot(4012): Card not present

[428093.357952] ------------[ cut here ]------------

[428093.358187] UBSAN: shift-out-of-bounds in ./include/linux/log2.h:57:13

[428093.358365] shift exponent 64 is too large for 64-bit type 'long unsigned int'

[428093.358549] CPU: 177 UID: 0 PID: 3612174 Comm: kworker/177:5 Tainted: P           O        6.17.13-2-pve #1 PREEMPT(voluntary)

[428093.358553] Tainted: [P]=PROPRIETARY_MODULE, [O]=OOT_MODULE

[428093.358554] Hardware name: Supermicro SYS-522GA-NRT/X14DBG-AP, BIOS 1.4 07/15/2025

[428093.358557] Workqueue: pm pm_runtime_work

[428093.358574] Call Trace:

[428093.358579]  <TASK>

[428093.358585]  dump_stack_lvl+0x5f/0x90

[428093.358592]  dump_stack+0x10/0x18

[428093.358593]  ubsan_epilogue+0x9/0x39

[428093.358601]  __ubsan_handle_shift_out_of_bounds.cold+0x5e/0x113

[428093.358603]  pci_restore_iov_state.cold+0x16/0x21

[428093.358607]  ? pci_enable_acs+0xfa/0x190

[428093.358612]  pci_restore_state.part.0+0x1fb/0x3a0

[428093.358623]  pci_restore_state+0x1e/0x30

[428093.358624]  pci_pm_runtime_resume+0x3b/0xf0

[428093.358627]  ? __pfx_pci_pm_runtime_resume+0x10/0x10

[428093.358628]  __rpm_callback+0x48/0x1f0

[428093.358629]  ? ktime_get_mono_fast_ns+0x39/0xd0

[428093.358636]  ? __pfx_pci_pm_runtime_resume+0x10/0x10

[428093.358637]  rpm_callback+0x6e/0x80

[428093.358638]  ? __pfx_pci_pm_runtime_resume+0x10/0x10

[428093.358639]  rpm_resume+0x4cc/0x6f0

[428093.358640]  ? queue_delayed_work_on+0x81/0x90

[428093.358646]  pm_runtime_work+0x80/0xe0

[428093.358647]  process_one_work+0x188/0x370

[428093.358649]  worker_thread+0x33a/0x480

[428093.358650]  ? __pfx_worker_thread+0x10/0x10

[428093.358651]  kthread+0x108/0x220

[428093.358654]  ? __pfx_kthread+0x10/0x10

[428093.358655]  ret_from_fork+0x205/0x240

[428093.358661]  ? __pfx_kthread+0x10/0x10

[428093.358663]  ret_from_fork_asm+0x1a/0x30

[428093.358668]  </TASK>

[428093.363654] ---[ end trace ]---

[428093.374924] pcieport 0000:b9:01.0: pciehp: Slot(4012): Card present

[428094.395760] pcieport 0000:b9:01.0: pciehp: Slot(4012): No link

[434447.216757] vfio-pci 0000:cc:00.0: Unable to change power state from D3hot to D0, device inaccessible

[434447.216787] pcieport 0000:ca:01.0: pciehp: Slot(5009): Link Down

[434447.218010] pcieport 0000:ca:01.0: pciehp: Slot(5009): Card not present

[434447.291404] pcieport 0000:ca:01.0: pciehp: Slot(5009): Card present

[434448.094883] pci 0000:cc:00.0: [10de:2bb5] type 00 class 0x030200 PCIe Legacy Endpoint

[434448.095466] pci 0000:cc:00.0: BAR 0 [mem 0x00000000-0x03ffffff 64bit pref]

[434448.095748] pci 0000:cc:00.0: BAR 2 [mem 0x00000000-0x1fffffffff 64bit pref]

[434448.095957] pci 0000:cc:00.0: BAR 4 [mem 0x00000000-0x01ffffff 64bit pref]

[434448.096184] pci 0000:cc:00.0: Max Payload Size set to 256 (was 128, max 256)

[434448.096745] pci 0000:cc:00.0: Enabling HDA controller

[434448.099572] pci 0000:cc:00.0: PME# supported from D0 D3hot

[434448.100314] pci 0000:cc:00.0: VF BAR 0 [mem 0x00000000-0x0003ffff 64bit pref]

[434448.100438] pci 0000:cc:00.0: VF BAR 0 [mem 0x00000000-0x00bfffff 64bit pref]: contains BAR 0 for 48 VFs

[434448.100583] pci 0000:cc:00.0: VF BAR 2 [mem 0x00000000-0xffffffff 64bit pref]

[434448.100704] pci 0000:cc:00.0: VF BAR 2 [mem 0x00000000-0x2fffffffff 64bit pref]: contains BAR 2 for 48 VFs

[434448.100837] pci 0000:cc:00.0: VF BAR 4 [mem 0x00000000-0x01ffffff 64bit pref]

[434448.100959] pci 0000:cc:00.0: VF BAR 4 [mem 0x00000000-0x5fffffff 64bit pref]: contains BAR 4 for 48 VFs

[434448.105230] pci 0000:cc:00.0: Adding to iommu group 136

[434448.111434] pcieport 0000:ca:01.0: bridge window [io  0x1000-0x0fff] to [bus cc] add_size 1000

[434448.111634] pcieport 0000:c9:00.0: Assigned bridge window [mem 0xde000000-0xde4fffff] to [bus ca-d0] cannot fit 0x200000 required for 0000:ca:01.0 bridging to [bus cc]

[434448.111931] pcieport 0000:ca:01.0: bridge window [mem 0x00000000] to [bus cc] requires relaxed alignment rules

[434448.112078] pcieport 0000:ca:01.0: bridge window [mem 0x00100000-0x000fffff] to [bus cc] add_size 200000 add_align 100000

[434448.112243] pcieport 0000:ca:01.0: bridge window [mem size 0x00200000]: can't assign; no space

[434448.112392] pcieport 0000:ca:01.0: bridge window [mem size 0x00200000]: failed to assign

[434448.112557] pcieport 0000:ca:01.0: bridge window [io  size 0x1000]: can't assign; no space

[434448.112718] pcieport 0000:ca:01.0: bridge window [io  size 0x1000]: failed to assign

[434448.112895] pcieport 0000:ca:01.0: bridge window [mem size 0x00200000]: can't assign; no space

[434448.113065] pcieport 0000:ca:01.0: bridge window [mem size 0x00200000]: failed to assign

[434448.113235] pcieport 0000:ca:01.0: bridge window [io  size 0x1000]: can't assign; no space

[434448.113408] pcieport 0000:ca:01.0: bridge window [io  size 0x1000]: failed to assign

[434448.113601] pci 0000:cc:00.0: BAR 2 [mem 0x2fa000000000-0x2fbfffffffff 64bit pref]: assigned

[434448.113862] pci 0000:cc:00.0: VF BAR 2 [mem 0x2fc000000000-0x2fefffffffff 64bit pref]: assigned

[434448.114080] pci 0000:cc:00.0: BAR 0 [mem 0x2ff000000000-0x2ff003ffffff 64bit pref]: assigned

[434448.114348] pci 0000:cc:00.0: BAR 4 [mem 0x2ff004000000-0x2ff005ffffff 64bit pref]: assigned

[434448.114643] pci 0000:cc:00.0: VF BAR 4 [mem 0x2ff006000000-0x2ff065ffffff 64bit pref]: assigned

[434448.114882] pci 0000:cc:00.0: VF BAR 0 [mem 0x2ff066000000-0x2ff066bfffff 64bit pref]: assigned

[434448.115127] pcieport 0000:ca:01.0: PCI bridge to [bus cc]

[434448.115367] pcieport 0000:ca:01.0:   bridge window [mem 0x2fa000000000-0x2ff066bfffff 64bit pref]

[434448.115629] PCI: No. 2 try to assign unassigned res

[434448.115843] pcieport 0000:ca:03.0: resource 14 [mem 0xde200000-0xde3fffff] released

[434448.116005] pcieport 0000:ca:03.0: PCI bridge to [bus ce]

[434448.116173] pcieport 0000:ca:04.0: resource 14 [mem 0xde000000-0xde1fffff] released

[434448.116330] pcieport 0000:ca:04.0: PCI bridge to [bus cf]

[434448.116506] pcieport 0000:c9:00.0: resource 14 [mem 0xde000000-0xde4fffff] released

[434448.116696] pcieport 0000:c9:00.0: PCI bridge to [bus ca-d0]

[434448.116877] pcieport 0000:ca:01.0: bridge window [mem size 0x00200000]: can't assign; no space

[434448.117045] pcieport 0000:ca:01.0: bridge window [mem size 0x00200000]: failed to assign

[434448.117202] pcieport 0000:ca:01.0: bridge window [io  size 0x1000]: can't assign; no space

[434448.117351] pcieport 0000:ca:01.0: bridge window [io  size 0x1000]: failed to assign

[434448.117503] pcieport 0000:ca:01.0: bridge window [mem size 0x00200000]: can't assign; no space

[434448.117680] pcieport 0000:ca:01.0: bridge window [mem size 0x00200000]: failed to assign

[434448.117832] pcieport 0000:ca:01.0: bridge window [io  size 0x1000]: can't assign; no space

[434448.117977] pcieport 0000:ca:01.0: bridge window [io  size 0x1000]: failed to assign

[434448.118125] pcieport 0000:ca:01.0: PCI bridge to [bus cc]

[434448.118304] pcieport 0000:ca:01.0:   bridge window [mem 0x2fa000000000-0x2ff066bfffff 64bit pref]

[439265.014985] mlx5_core 0000:a7:00.1: Using 56-bit DMA addresses

[462942.054093] INFO: NMI handler (ghes_notify_nmi) took too long to run: 1.002 msecs

[485846.881453] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 1.002 msecs
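
The UBSAN report in the trace above can be reproduced on paper. In mainline kernels, `include/linux/log2.h` defines `__roundup_pow_of_two(n)` as `1UL << fls_long(n - 1)`, and the reported location appears to be that expression (this mapping is an assumption; it was not checked against the exact 6.17.13-2-pve source). With `n == 0`, `n - 1` wraps to all ones, `fls_long` returns 64, and the shift exponent becomes 64. That would fit a size of zero being computed for a device that is no longer reachable. The Python model below only reproduces the arithmetic; it is not kernel code, and the function names are illustrative.

```python
BITS = 64
MASK = (1 << BITS) - 1

def fls_long(x):
    """Model of fls_long(): 1-based index of the highest set bit, 0 for x == 0."""
    return (x & MASK).bit_length()

def shift_exponent_for_roundup(n):
    """Exponent used by 1UL << fls_long(n - 1) for a 64-bit unsigned long n."""
    return fls_long((n - 1) & MASK)

for n in (0, 1, 0x200000, 0x2000000):
    exp = shift_exponent_for_roundup(n)
    status = "out of bounds" if exp >= BITS else "ok"
    print(f"n={n:#x}: shift exponent {exp} ({status})")
```

If that reading is right, the warning is a symptom of restoring state on a device that has already gone away rather than the cause of the drop.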
 
Unable to change power state from D3hot to D0, device inaccessible
It seems something wants to change the power state of the device and is unable to.

Do you pass through any other device on that server? Do you use the ACS override kernel command-line option?

More of the log would also be interesting (e.g. 'journalctl -b' outputs the log since the last boot).
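
To pull the relevant events out of a larger log, a small filter along these lines may help. It is only a sketch: it assumes the kernel log was saved as text (for example from 'journalctl -b'), and the function name `summarize_pci_events` and the two patterns are illustrative, written against the message formats quoted in this thread.

```python
import re
from collections import defaultdict

# Patterns follow the messages quoted in this thread; other kernels may word them differently.
POWER_RE = re.compile(
    r"vfio-pci (\S+): Unable to change power state from (\S+) to (\S+), device inaccessible"
)
HOTPLUG_RE = re.compile(r"pcieport (\S+): pciehp: Slot\((\d+)\): (.+)$")

def summarize_pci_events(lines):
    """Group power-state failures and pciehp slot events by PCI address."""
    events = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        m = POWER_RE.search(line)
        if m:
            events[m.group(1)].append(f"power state {m.group(2)} -> {m.group(3)} failed")
            continue
        m = HOTPLUG_RE.search(line)
        if m:
            events[m.group(1)].append(f"slot {m.group(2)}: {m.group(3)}")
    return dict(events)

sample = [
    "[428093.296377] vfio-pci 0000:bb:00.0: Unable to change power state from D3hot to D0, device inaccessible",
    "[428093.296596] pcieport 0000:b9:01.0: pciehp: Slot(4012): Link Down",
    "[428093.297480] pcieport 0000:b9:01.0: pciehp: Slot(4012): Card not present",
]
for addr, evs in summarize_pci_events(sample).items():
    print(addr, evs)
```

Feeding it the full log (for example `summarize_pci_events(open("journal.txt"))`, where `journal.txt` is a hypothetical file name) gives one list of events per GPU and per upstream port.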
 
Well, I think the power state issue is potentially a red herring: the devices have dropped off the PCI bus, and I can't get them back without restarting the whole system. Maybe the power state WAS changed initially and that is what bugged out? Right now, though, no PCI rescan or similar command gets them back on the bus, which is why a power state change wouldn't be able to find them. Granted, I'm not an expert in that.

This system has 8x RTX 6000 Pro units. 4 of them are passed to VMs and are operating completely normally. The 4 that were unassigned (i.e. not passed through) dropped off the PCI bus after roughly 7 days. I am not doing anything special for PCI passthrough: in Proxmox I went to my "cluster", added the PCI devices under "Resource Mappings", and then added the PCI device on each VM. 4 were assigned to VMs and 4 were left "unassigned".

Below are the devices. The first four were assigned through the method above. The others were NEVER assigned to a VM, and they just disappeared off the PCI bus.
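
One way to test whether runtime power management is involved is to record what the kernel reports for each GPU before it drops. The sketch below reads the standard sysfs runtime-PM attributes (`power/control` and `power/runtime_status`) for a PCI address. The helper name `read_power_info` and the `sysfs_root` parameter are made up for illustration, and missing attributes are reported as `None` rather than guessed. The call trace does go through `pci_pm_runtime_resume`, but whether runtime PM is the cause is an assumption to check, not a conclusion.

```python
from pathlib import Path

def read_power_info(bdf, sysfs_root="/sys/bus/pci/devices"):
    """Return runtime-PM attributes for one PCI device, or None if it is not on the bus."""
    dev = Path(sysfs_root) / bdf
    if not dev.exists():
        return None
    info = {}
    for name in ("power/control", "power/runtime_status"):
        try:
            info[name] = (dev / name).read_text().strip()
        except OSError:
            # Attribute missing or unreadable; report that instead of guessing
            info[name] = None
    return info

if __name__ == "__main__":
    for bdf in ("0000:ba:00.0", "0000:bb:00.0", "0000:cc:00.0", "0000:cd:00.0"):
        print(bdf, read_power_info(bdf))
```

A `power/control` value of `auto` means the kernel may runtime-suspend the device, while `on` keeps it active. vfio-pci also has a `disable_idle_d3` module parameter that stops it from parking unused devices in D3hot; whether either setting avoids the drop on this system is untested.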
Code:
0000:3d:00.0 - Working; Assigned to VM
0000:3e:00.0 - Working; Assigned to VM
0000:4e:00.0 - Working; Assigned to VM
0000:4f:00.0 - Working; Assigned to VM
0000:ba:00.0 - no longer on PCI bus; not assigned
0000:bb:00.0 - no longer on PCI bus; not assigned
0000:cc:00.0 - no longer on PCI bus; not assigned
0000:cd:00.0 - no longer on PCI bus; not assigned


The file was too large, so I had to trim it.

Looking at the ba:00 device, it shows up during boot at 11:49:05:

Code:
Apr 03 11:49:05 host-XXXX kernel: pci 0000:ba:00.0: [10de:2bb5] type 00 class 0x030200 PCIe Legacy Endpoint
Apr 03 11:49:05 host-XXXX kernel: pci 0000:ba:00.0: BAR 0 [mem 0x2ef000000000-0x2ef003ffffff 64bit pref]
Apr 03 11:49:05 host-XXXX kernel: pci 0000:ba:00.0: BAR 2 [mem 0x2ea000000000-0x2ebfffffffff 64bit pref]
Apr 03 11:49:05 host-XXXX kernel: pci 0000:ba:00.0: BAR 4 [mem 0x2ef064000000-0x2ef065ffffff 64bit pref]
Apr 03 11:49:05 host-XXXX kernel: pci 0000:ba:00.0: enabling Extended Tags
Apr 03 11:49:05 host-XXXX kernel: pci 0000:ba:00.0: Enabling HDA controller
Apr 03 11:49:05 host-XXXX kernel: pci 0000:ba:00.0: PME# supported from D0 D3hot
Apr 03 11:49:05 host-XXXX kernel: pci 0000:ba:00.0: VF BAR 0 [mem 0x2ef066000000-0x2ef06603ffff 64bit pref]
Apr 03 11:49:05 host-XXXX kernel: pci 0000:ba:00.0: VF BAR 0 [mem 0x2ef066000000-0x2ef066bfffff 64bit pref]: contains BAR 0 for 48 VFs
Apr 03 11:49:05 host-XXXX kernel: pci 0000:ba:00.0: VF BAR 2 [mem 0x2ec000000000-0x2ec0ffffffff 64bit pref]
Apr 03 11:49:05 host-XXXX kernel: pci 0000:ba:00.0: VF BAR 2 [mem 0x2ec000000000-0x2eefffffffff 64bit pref]: contains BAR 2 for 48 VFs
Apr 03 11:49:05 host-XXXX kernel: pci 0000:ba:00.0: VF BAR 4 [mem 0x2ef004000000-0x2ef005ffffff 64bit pref]
Apr 03 11:49:05 host-XXXX kernel: pci 0000:ba:00.0: VF BAR 4 [mem 0x2ef004000000-0x2ef063ffffff 64bit pref]: contains BAR 4 for 48 VFs
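
The "contains BAR n for 48 VFs" lines are plain arithmetic: the total SR-IOV window is the per-VF BAR size multiplied by the VF count. A quick check against the values logged above (the helper name `total_vf_window` is just for illustration):

```python
def total_vf_window(start, end, num_vfs):
    """Given one VF's BAR range (inclusive end), return the range covering num_vfs VFs."""
    size = end - start + 1
    return start, start + size * num_vfs - 1

# Per-VF ranges for 0000:ba:00.0 as logged at boot
per_vf = {
    "VF BAR 0": (0x2ef066000000, 0x2ef06603ffff),
    "VF BAR 2": (0x2ec000000000, 0x2ec0ffffffff),
    "VF BAR 4": (0x2ef004000000, 0x2ef005ffffff),
}
for name, (start, end) in per_vf.items():
    lo, hi = total_vf_window(start, end, 48)
    print(f"{name}: {lo:#x}-{hi:#x} ({(hi - lo + 1) >> 20} MiB)")
```

The computed ranges match the logged windows, which come to 12 MiB, 192 GiB and 1.5 GiB of address space per GPU for the VF BARs alone.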

Slightly later during boot

Code:
Apr 03 11:49:05 host-XXXX kernel: pci 0000:ba:00.0: Adding to iommu group 150

Then 7 days later

Code:
Apr 10 11:29:12 host-XXXX kernel: pci 0000:ba:00.0: [10de:2bb5] type 00 class 0x030200 PCIe Legacy Endpoint
Apr 10 11:29:12 host-XXXX kernel: pci 0000:ba:00.0: BAR 0 [mem 0x00000000-0x03ffffff 64bit pref]
Apr 10 11:29:12 host-XXXX kernel: pci 0000:ba:00.0: BAR 2 [mem 0x00000000-0x1fffffffff 64bit pref]
Apr 10 11:29:12 host-XXXX kernel: pci 0000:ba:00.0: BAR 4 [mem 0x00000000-0x01ffffff 64bit pref]
Apr 10 11:29:12 host-XXXX kernel: pci 0000:ba:00.0: Max Payload Size set to 256 (was 128, max 256)
Apr 10 11:29:12 host-XXXX kernel: pci 0000:ba:00.0: Enabling HDA controller
Apr 10 11:29:12 host-XXXX kernel: pci 0000:ba:00.0: PME# supported from D0 D3hot
Apr 10 11:29:12 host-XXXX kernel: pci 0000:ba:00.0: VF BAR 0 [mem 0x00000000-0x0003ffff 64bit pref]
Apr 10 11:29:12 host-XXXX kernel: pci 0000:ba:00.0: VF BAR 0 [mem 0x00000000-0x00bfffff 64bit pref]: contains BAR 0 for 48 VFs
Apr 10 11:29:12 host-XXXX kernel: pci 0000:ba:00.0: VF BAR 2 [mem 0x00000000-0xffffffff 64bit pref]
Apr 10 11:29:12 host-XXXX kernel: pci 0000:ba:00.0: VF BAR 2 [mem 0x00000000-0x2fffffffff 64bit pref]: contains BAR 2 for 48 VFs
Apr 10 11:29:12 host-XXXX kernel: pci 0000:ba:00.0: VF BAR 4 [mem 0x00000000-0x01ffffff 64bit pref]
Apr 10 11:29:12 host-XXXX kernel: pci 0000:ba:00.0: VF BAR 4 [mem 0x00000000-0x5fffffff 64bit pref]: contains BAR 4 for 48 VFs
Apr 10 11:29:12 host-XXXX kernel: pci 0000:ba:00.0: Adding to iommu group 150
Apr 10 11:29:12 host-XXXX kernel: pcieport 0000:b9:00.0: bridge window [io  0x1000-0x0fff] to [bus ba] add_size 1000
Apr 10 11:29:12 host-XXXX kernel: pcieport 0000:b8:00.0: Assigned bridge window [mem 0xd6000000-0xd64fffff] to [bus b9-be] cannot fit 0x200000 required for 0000:b9:00.0 bridging to [bus ba]
Apr 10 11:29:12 host-XXXX kernel: pcieport 0000:b9:00.0: bridge window [mem 0x00000000] to [bus ba] requires relaxed alignment rules
Apr 10 11:29:12 host-XXXX kernel: pcieport 0000:b9:00.0: bridge window [mem 0x00100000-0x000fffff] to [bus ba] add_size 200000 add_align 100000
Apr 10 11:29:12 host-XXXX kernel: pcieport 0000:b9:00.0: bridge window [mem size 0x00200000]: can't assign; no space
Apr 10 11:29:12 host-XXXX kernel: pcieport 0000:b9:00.0: bridge window [mem size 0x00200000]: failed to assign
Apr 10 11:29:12 host-XXXX kernel: pcieport 0000:b9:00.0: bridge window [io  size 0x1000]: can't assign; no space
Apr 10 11:29:12 host-XXXX kernel: pcieport 0000:b9:00.0: bridge window [io  size 0x1000]: failed to assign
Apr 10 11:29:12 host-XXXX kernel: pcieport 0000:b9:00.0: bridge window [mem size 0x00200000]: can't assign; no space
Apr 10 11:29:12 host-XXXX kernel: pcieport 0000:b9:00.0: bridge window [mem size 0x00200000]: failed to assign
Apr 10 11:29:12 host-XXXX kernel: pcieport 0000:b9:00.0: bridge window [io  size 0x1000]: can't assign; no space
Apr 10 11:29:12 host-XXXX kernel: pcieport 0000:b9:00.0: bridge window [io  size 0x1000]: failed to assign
Apr 10 11:29:12 host-XXXX kernel: pci 0000:ba:00.0: BAR 2 [mem 0x2ea000000000-0x2ebfffffffff 64bit pref]: assigned
Apr 10 11:29:12 host-XXXX kernel: pci 0000:ba:00.0: VF BAR 2 [mem 0x2ec000000000-0x2eefffffffff 64bit pref]: assigned
Apr 10 11:29:12 host-XXXX kernel: pci 0000:ba:00.0: BAR 0 [mem 0x2ef000000000-0x2ef003ffffff 64bit pref]: assigned
Apr 10 11:29:12 host-XXXX kernel: pci 0000:ba:00.0: BAR 4 [mem 0x2ef004000000-0x2ef005ffffff 64bit pref]: assigned
Apr 10 11:29:12 host-XXXX kernel: pci 0000:ba:00.0: VF BAR 4 [mem 0x2ef006000000-0x2ef065ffffff 64bit pref]: assigned
Apr 10 11:29:12 host-XXXX kernel: pci 0000:ba:00.0: VF BAR 0 [mem 0x2ef066000000-0x2ef066bfffff 64bit pref]: assigned
Apr 10 11:29:12 host-XXXX kernel: pcieport 0000:b9:00.0: PCI bridge to [bus ba]
Apr 10 11:29:12 host-XXXX kernel: pcieport 0000:b9:00.0:   bridge window [mem 0x2ea000000000-0x2ef066bfffff 64bit pref]
Apr 10 11:29:12 host-XXXX kernel: PCI: No. 2 try to assign unassigned res
Apr 10 11:29:12 host-XXXX kernel: pcieport 0000:b9:03.0: resource 14 [mem 0xd6200000-0xd63fffff] released
Apr 10 11:29:12 host-XXXX kernel: pcieport 0000:b9:03.0: PCI bridge to [bus bd]
Apr 10 11:29:12 host-XXXX kernel: pcieport 0000:b9:04.0: resource 14 [mem 0xd6000000-0xd61fffff] released
Apr 10 11:29:12 host-XXXX kernel: pcieport 0000:b9:04.0: PCI bridge to [bus be]
Apr 10 11:29:12 host-XXXX kernel: pcieport 0000:b8:00.0: resource 14 [mem 0xd6000000-0xd64fffff] released
Apr 10 11:29:12 host-XXXX kernel: pcieport 0000:b8:00.0: PCI bridge to [bus b9-be]
Apr 10 11:29:12 host-XXXX kernel: pcieport 0000:b9:00.0: bridge window [mem size 0x00200000]: can't assign; no space
Apr 10 11:29:12 host-XXXX kernel: pcieport 0000:b9:00.0: bridge window [mem size 0x00200000]: failed to assign
Apr 10 11:29:12 host-XXXX kernel: pcieport 0000:b9:00.0: bridge window [io  size 0x1000]: can't assign; no space
Apr 10 11:29:12 host-XXXX kernel: pcieport 0000:b9:00.0: bridge window [io  size 0x1000]: failed to assign
Apr 10 11:29:12 host-XXXX kernel: pcieport 0000:b9:00.0: bridge window [mem size 0x00200000]: can't assign; no space
Apr 10 11:29:12 host-XXXX kernel: pcieport 0000:b9:00.0: bridge window [mem size 0x00200000]: failed to assign
Apr 10 11:29:12 host-XXXX kernel: pcieport 0000:b9:00.0: bridge window [io  size 0x1000]: can't assign; no space
Apr 10 11:29:12 host-XXXX kernel: pcieport 0000:b9:00.0: bridge window [io  size 0x1000]: failed to assign
Apr 10 11:29:12 host-XXXX kernel: pcieport 0000:b9:00.0: PCI bridge to [bus ba]
Apr 10 11:29:12 host-XXXX kernel: pcieport 0000:b9:00.0:   bridge window [mem 0x2ea000000000-0x2ef066bfffff 64bit pref]

and then a few hours later you see

Code:
Apr 10 13:06:41 host-XXXX kernel: vfio-pci 0000:ba:00.0: Unable to change power state from D3hot to D0, device inaccessible

Looking at bb:00.0, you can see the system booted around April 3 11:49 and the card was accessible.
Code:
Apr 03 11:49:05 host-XXXX kernel: pci 0000:bb:00.0: [10de:2bb5] type 00 class 0x030200 PCIe Legacy Endpoint
Apr 03 11:49:05 host-XXXX kernel: pci 0000:bb:00.0: BAR 0 [mem 0x2e9000000000-0x2e9003ffffff 64bit pref]
Apr 03 11:49:05 host-XXXX kernel: pci 0000:bb:00.0: BAR 2 [mem 0x2e4000000000-0x2e5fffffffff 64bit pref]
Apr 03 11:49:05 host-XXXX kernel: pci 0000:bb:00.0: BAR 4 [mem 0x2e9064000000-0x2e9065ffffff 64bit pref]
Apr 03 11:49:05 host-XXXX kernel: pci 0000:bb:00.0: enabling Extended Tags
Apr 03 11:49:05 host-XXXX kernel: pci 0000:bb:00.0: Enabling HDA controller
Apr 03 11:49:05 host-XXXX kernel: pci 0000:bb:00.0: PME# supported from D0 D3hot
Apr 03 11:49:05 host-XXXX kernel: pci 0000:bb:00.0: VF BAR 0 [mem 0x2e9066000000-0x2e906603ffff 64bit pref]
Apr 03 11:49:05 host-XXXX kernel: pci 0000:bb:00.0: VF BAR 0 [mem 0x2e9066000000-0x2e9066bfffff 64bit pref]: contains BAR 0 for 48 VFs
Apr 03 11:49:05 host-XXXX kernel: pci 0000:bb:00.0: VF BAR 2 [mem 0x2e6000000000-0x2e60ffffffff 64bit pref]
Apr 03 11:49:05 host-XXXX kernel: pci 0000:bb:00.0: VF BAR 2 [mem 0x2e6000000000-0x2e8fffffffff 64bit pref]: contains BAR 2 for 48 VFs
Apr 03 11:49:05 host-XXXX kernel: pci 0000:bb:00.0: VF BAR 4 [mem 0x2e9004000000-0x2e9005ffffff 64bit pref]
Apr 03 11:49:05 host-XXXX kernel: pci 0000:bb:00.0: VF BAR 4 [mem 0x2e9004000000-0x2e9063ffffff 64bit pref]: contains BAR 4 for 48 VFs

Still booting...but later in the file
Code:
Apr 03 11:49:05 host-XXXX kernel: pci 0000:bb:00.0: Adding to iommu group 151

The next thing you see in the logs is on April 8th at 10:43:26, when it disappeared and became inaccessible.

Code:
Apr 08 10:43:26 host-XXXX kernel: vfio-pci 0000:bb:00.0: Unable to change power state from D3hot to D0, device inaccessible
Apr 08 10:43:26 host-XXXX kernel: pcieport 0000:b9:01.0: pciehp: Slot(4012): Link Down
Apr 08 10:43:26 host-XXXX kernel: pcieport 0000:b9:01.0: pciehp: Slot(4012): Card not present
Apr 08 10:43:26 host-XXXX kernel: ------------[ cut here ]------------
Apr 08 10:43:26 host-XXXX kernel: UBSAN: shift-out-of-bounds in ./include/linux/log2.h:57:13
Apr 08 10:43:26 host-XXXX kernel: shift exponent 64 is too large for 64-bit type 'long unsigned int'
Apr 08 10:43:26 host-XXXX kernel: CPU: 177 UID: 0 PID: 3612174 Comm: kworker/177:5 Tainted: P           O        6.17.13-2-pve #1 PREEMPT(voluntary)
Apr 08 10:43:26 host-XXXX kernel: Tainted: [P]=PROPRIETARY_MODULE, [O]=OOT_MODULE
Apr 08 10:43:26 host-XXXX kernel: Hardware name: Supermicro SYS-522GA-NRT/X14DBG-AP, BIOS 1.4 07/15/2025
Apr 08 10:43:26 host-XXXX kernel: Workqueue: pm pm_runtime_work
Apr 08 10:43:26 host-XXXX kernel: Call Trace:
Apr 08 10:43:26 host-XXXX kernel:  <TASK>
Apr 08 10:43:26 host-XXXX kernel:  dump_stack_lvl+0x5f/0x90
Apr 08 10:43:26 host-XXXX kernel:  dump_stack+0x10/0x18
Apr 08 10:43:26 host-XXXX kernel:  ubsan_epilogue+0x9/0x39
Apr 08 10:43:26 host-XXXX kernel:  __ubsan_handle_shift_out_of_bounds.cold+0x5e/0x113
Apr 08 10:43:26 host-XXXX kernel:  pci_restore_iov_state.cold+0x16/0x21
Apr 08 10:43:26 host-XXXX kernel:  ? pci_enable_acs+0xfa/0x190
Apr 08 10:43:26 host-XXXX kernel:  pci_restore_state.part.0+0x1fb/0x3a0
Apr 08 10:43:26 host-XXXX kernel:  pci_restore_state+0x1e/0x30
Apr 08 10:43:26 host-XXXX kernel:  pci_pm_runtime_resume+0x3b/0xf0
Apr 08 10:43:26 host-XXXX kernel:  ? __pfx_pci_pm_runtime_resume+0x10/0x10
Apr 08 10:43:26 host-XXXX kernel:  __rpm_callback+0x48/0x1f0
Apr 08 10:43:26 host-XXXX kernel:  ? ktime_get_mono_fast_ns+0x39/0xd0
Apr 08 10:43:26 host-XXXX kernel:  ? __pfx_pci_pm_runtime_resume+0x10/0x10
Apr 08 10:43:26 host-XXXX kernel:  rpm_callback+0x6e/0x80
Apr 08 10:43:26 host-XXXX kernel:  ? __pfx_pci_pm_runtime_resume+0x10/0x10
Apr 08 10:43:26 host-XXXX kernel:  rpm_resume+0x4cc/0x6f0
Apr 08 10:43:26 host-XXXX kernel:  ? queue_delayed_work_on+0x81/0x90
Apr 08 10:43:26 host-XXXX kernel:  pm_runtime_work+0x80/0xe0
Apr 08 10:43:26 host-XXXX kernel:  process_one_work+0x188/0x370
Apr 08 10:43:26 host-XXXX kernel:  worker_thread+0x33a/0x480
Apr 08 10:43:26 host-XXXX kernel:  ? __pfx_worker_thread+0x10/0x10
Apr 08 10:43:26 host-XXXX kernel:  kthread+0x108/0x220
Apr 08 10:43:26 host-XXXX kernel:  ? __pfx_kthread+0x10/0x10
Apr 08 10:43:26 host-XXXX kernel:  ret_from_fork+0x205/0x240
Apr 08 10:43:26 host-XXXX kernel:  ? __pfx_kthread+0x10/0x10
Apr 08 10:43:26 host-XXXX kernel:  ret_from_fork_asm+0x1a/0x30
Apr 08 10:43:26 host-XXXX kernel:  </TASK>
Apr 08 10:43:26 host-XXXX kernel: ---[ end trace ]---
Apr 08 10:43:26 host-XXXX kernel: pcieport 0000:b9:01.0: pciehp: Slot(4012): Card present
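
The first log in this thread uses seconds-since-boot timestamps, while the journal excerpts use wall-clock time. Since the same 0000:bb:00.0 event appears in both (428093.296377 and Apr 08 10:43:26), it can serve as an anchor to date the other events. A rough sketch, assuming the year is 2026 (the journal lines do not show it; the kernel build date only suggests it) and ignoring clock drift:

```python
from datetime import datetime, timedelta

# Anchor: the same event seen in both timestamp formats (the year is an assumption)
ANCHOR_UPTIME = 428093.296377
ANCHOR_WALL = datetime(2026, 4, 8, 10, 43, 26)

def uptime_to_wall(uptime_seconds):
    """Convert a seconds-since-boot value to approximate wall-clock time."""
    return ANCHOR_WALL + timedelta(seconds=uptime_seconds - ANCHOR_UPTIME)

# 0000:cc:00.0 "Unable to change power state" from the first log
print(uptime_to_wall(434447.216757))
```

That places the 0000:cc:00.0 failure at roughly 12:29 on April 8, about an hour and three quarters after 0000:bb:00.0.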
 
