The system boots fine and Proxmox/the OS detects the RTX 6000 Pro devices without issue; they all show up in "lspci". After an arbitrary amount of time (a few days), the devices that are NOT passed through to a VM (i.e. unassigned) drop off the bus, and some odd kernel errors appear in the log (see below). Rebooting the server brings them back so that lspci detects them again. Any thoughts on where to go from here?
If the devices are assigned to a VM, they never seem to drop off the PCI bus.
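Since the trace below goes through pm_runtime_work and pci_pm_runtime_resume, it may be worth checking what runtime power management state the unassigned GPUs are in before they drop. The following is only a minimal sketch of how that could be checked, assuming the standard sysfs layout under /sys/bus/pci/drivers/vfio-pci with the usual power/control and power/runtime_status attributes; the function name runtime_pm_report and the sysfs_root parameter are made up for illustration.

```python
from pathlib import Path


def runtime_pm_report(driver="vfio-pci", sysfs_root="/sys"):
    """Return {BDF: (control, runtime_status)} for PCI devices bound to `driver`.

    Assumes the usual sysfs layout; `sysfs_root` is only a parameter so the
    function can be pointed at a copy of the tree.
    """
    driver_dir = Path(sysfs_root) / "bus" / "pci" / "drivers" / driver
    report = {}
    if not driver_dir.is_dir():
        return report  # driver not loaded, or not a Linux sysfs tree

    def read_attr(power_dir, name):
        try:
            return (power_dir / name).read_text().strip()
        except OSError:
            return "unknown"

    for entry in sorted(driver_dir.iterdir()):
        power_dir = entry / "power"
        # Device entries are named by address (e.g. 0000:bb:00.0) and have a
        # power/ directory; bind, unbind, new_id and similar entries do not.
        if ":" not in entry.name or not power_dir.is_dir():
            continue
        report[entry.name] = (
            read_attr(power_dir, "control"),
            read_attr(power_dir, "runtime_status"),
        )
    return report


if __name__ == "__main__":
    for bdf, (control, status) in runtime_pm_report().items():
        print(f"{bdf}: control={control} runtime_status={status}")
```

A control value of "auto" together with a runtime_status of "suspended" would suggest the idle device has been runtime-suspended, which would fit the D3hot to D0 resume failure in the log, though this alone does not prove the cause.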
Code:
uname -a
Linux prox 6.17.13-2-pve #1 SMP PREEMPT_DYNAMIC PMX 6.17.13-2 (2026-03-13T08:06Z) x86_64 GNU/Linux
Code:
[428093.296377] vfio-pci 0000:bb:00.0: Unable to change power state from D3hot to D0, device inaccessible
[428093.296596] pcieport 0000:b9:01.0: pciehp: Slot(4012): Link Down
[428093.297480] pcieport 0000:b9:01.0: pciehp: Slot(4012): Card not present
[428093.357952] ------------[ cut here ]------------
[428093.358187] UBSAN: shift-out-of-bounds in ./include/linux/log2.h:57:13
[428093.358365] shift exponent 64 is too large for 64-bit type 'long unsigned int'
[428093.358549] CPU: 177 UID: 0 PID: 3612174 Comm: kworker/177:5 Tainted: P O 6.17.13-2-pve #1 PREEMPT(voluntary)
[428093.358553] Tainted: [P]=PROPRIETARY_MODULE, [O]=OOT_MODULE
[428093.358554] Hardware name: Supermicro SYS-522GA-NRT/X14DBG-AP, BIOS 1.4 07/15/2025
[428093.358557] Workqueue: pm pm_runtime_work
[428093.358574] Call Trace:
[428093.358579] <TASK>
[428093.358585] dump_stack_lvl+0x5f/0x90
[428093.358592] dump_stack+0x10/0x18
[428093.358593] ubsan_epilogue+0x9/0x39
[428093.358601] __ubsan_handle_shift_out_of_bounds.cold+0x5e/0x113
[428093.358603] pci_restore_iov_state.cold+0x16/0x21
[428093.358607] ? pci_enable_acs+0xfa/0x190
[428093.358612] pci_restore_state.part.0+0x1fb/0x3a0
[428093.358623] pci_restore_state+0x1e/0x30
[428093.358624] pci_pm_runtime_resume+0x3b/0xf0
[428093.358627] ? __pfx_pci_pm_runtime_resume+0x10/0x10
[428093.358628] __rpm_callback+0x48/0x1f0
[428093.358629] ? ktime_get_mono_fast_ns+0x39/0xd0
[428093.358636] ? __pfx_pci_pm_runtime_resume+0x10/0x10
[428093.358637] rpm_callback+0x6e/0x80
[428093.358638] ? __pfx_pci_pm_runtime_resume+0x10/0x10
[428093.358639] rpm_resume+0x4cc/0x6f0
[428093.358640] ? queue_delayed_work_on+0x81/0x90
[428093.358646] pm_runtime_work+0x80/0xe0
[428093.358647] process_one_work+0x188/0x370
[428093.358649] worker_thread+0x33a/0x480
[428093.358650] ? __pfx_worker_thread+0x10/0x10
[428093.358651] kthread+0x108/0x220
[428093.358654] ? __pfx_kthread+0x10/0x10
[428093.358655] ret_from_fork+0x205/0x240
[428093.358661] ? __pfx_kthread+0x10/0x10
[428093.358663] ret_from_fork_asm+0x1a/0x30
[428093.358668] </TASK>
[428093.363654] ---[ end trace ]---
[428093.374924] pcieport 0000:b9:01.0: pciehp: Slot(4012): Card present
[428094.395760] pcieport 0000:b9:01.0: pciehp: Slot(4012): No link
[434447.216757] vfio-pci 0000:cc:00.0: Unable to change power state from D3hot to D0, device inaccessible
[434447.216787] pcieport 0000:ca:01.0: pciehp: Slot(5009): Link Down
[434447.218010] pcieport 0000:ca:01.0: pciehp: Slot(5009): Card not present
[434447.291404] pcieport 0000:ca:01.0: pciehp: Slot(5009): Card present
[434448.094883] pci 0000:cc:00.0: [10de:2bb5] type 00 class 0x030200 PCIe Legacy Endpoint
[434448.095466] pci 0000:cc:00.0: BAR 0 [mem 0x00000000-0x03ffffff 64bit pref]
[434448.095748] pci 0000:cc:00.0: BAR 2 [mem 0x00000000-0x1fffffffff 64bit pref]
[434448.095957] pci 0000:cc:00.0: BAR 4 [mem 0x00000000-0x01ffffff 64bit pref]
[434448.096184] pci 0000:cc:00.0: Max Payload Size set to 256 (was 128, max 256)
[434448.096745] pci 0000:cc:00.0: Enabling HDA controller
[434448.099572] pci 0000:cc:00.0: PME# supported from D0 D3hot
[434448.100314] pci 0000:cc:00.0: VF BAR 0 [mem 0x00000000-0x0003ffff 64bit pref]
[434448.100438] pci 0000:cc:00.0: VF BAR 0 [mem 0x00000000-0x00bfffff 64bit pref]: contains BAR 0 for 48 VFs
[434448.100583] pci 0000:cc:00.0: VF BAR 2 [mem 0x00000000-0xffffffff 64bit pref]
[434448.100704] pci 0000:cc:00.0: VF BAR 2 [mem 0x00000000-0x2fffffffff 64bit pref]: contains BAR 2 for 48 VFs
[434448.100837] pci 0000:cc:00.0: VF BAR 4 [mem 0x00000000-0x01ffffff 64bit pref]
[434448.100959] pci 0000:cc:00.0: VF BAR 4 [mem 0x00000000-0x5fffffff 64bit pref]: contains BAR 4 for 48 VFs
[434448.105230] pci 0000:cc:00.0: Adding to iommu group 136
[434448.111434] pcieport 0000:ca:01.0: bridge window [io 0x1000-0x0fff] to [bus cc] add_size 1000
[434448.111634] pcieport 0000:c9:00.0: Assigned bridge window [mem 0xde000000-0xde4fffff] to [bus ca-d0] cannot fit 0x200000 required for 0000:ca:01.0 bridging to [bus cc]
[434448.111931] pcieport 0000:ca:01.0: bridge window [mem 0x00000000] to [bus cc] requires relaxed alignment rules
[434448.112078] pcieport 0000:ca:01.0: bridge window [mem 0x00100000-0x000fffff] to [bus cc] add_size 200000 add_align 100000
[434448.112243] pcieport 0000:ca:01.0: bridge window [mem size 0x00200000]: can't assign; no space
[434448.112392] pcieport 0000:ca:01.0: bridge window [mem size 0x00200000]: failed to assign
[434448.112557] pcieport 0000:ca:01.0: bridge window [io size 0x1000]: can't assign; no space
[434448.112718] pcieport 0000:ca:01.0: bridge window [io size 0x1000]: failed to assign
[434448.112895] pcieport 0000:ca:01.0: bridge window [mem size 0x00200000]: can't assign; no space
[434448.113065] pcieport 0000:ca:01.0: bridge window [mem size 0x00200000]: failed to assign
[434448.113235] pcieport 0000:ca:01.0: bridge window [io size 0x1000]: can't assign; no space
[434448.113408] pcieport 0000:ca:01.0: bridge window [io size 0x1000]: failed to assign
[434448.113601] pci 0000:cc:00.0: BAR 2 [mem 0x2fa000000000-0x2fbfffffffff 64bit pref]: assigned
[434448.113862] pci 0000:cc:00.0: VF BAR 2 [mem 0x2fc000000000-0x2fefffffffff 64bit pref]: assigned
[434448.114080] pci 0000:cc:00.0: BAR 0 [mem 0x2ff000000000-0x2ff003ffffff 64bit pref]: assigned
[434448.114348] pci 0000:cc:00.0: BAR 4 [mem 0x2ff004000000-0x2ff005ffffff 64bit pref]: assigned
[434448.114643] pci 0000:cc:00.0: VF BAR 4 [mem 0x2ff006000000-0x2ff065ffffff 64bit pref]: assigned
[434448.114882] pci 0000:cc:00.0: VF BAR 0 [mem 0x2ff066000000-0x2ff066bfffff 64bit pref]: assigned
[434448.115127] pcieport 0000:ca:01.0: PCI bridge to [bus cc]
[434448.115367] pcieport 0000:ca:01.0: bridge window [mem 0x2fa000000000-0x2ff066bfffff 64bit pref]
[434448.115629] PCI: No. 2 try to assign unassigned res
[434448.115843] pcieport 0000:ca:03.0: resource 14 [mem 0xde200000-0xde3fffff] released
[434448.116005] pcieport 0000:ca:03.0: PCI bridge to [bus ce]
[434448.116173] pcieport 0000:ca:04.0: resource 14 [mem 0xde000000-0xde1fffff] released
[434448.116330] pcieport 0000:ca:04.0: PCI bridge to [bus cf]
[434448.116506] pcieport 0000:c9:00.0: resource 14 [mem 0xde000000-0xde4fffff] released
[434448.116696] pcieport 0000:c9:00.0: PCI bridge to [bus ca-d0]
[434448.116877] pcieport 0000:ca:01.0: bridge window [mem size 0x00200000]: can't assign; no space
[434448.117045] pcieport 0000:ca:01.0: bridge window [mem size 0x00200000]: failed to assign
[434448.117202] pcieport 0000:ca:01.0: bridge window [io size 0x1000]: can't assign; no space
[434448.117351] pcieport 0000:ca:01.0: bridge window [io size 0x1000]: failed to assign
[434448.117503] pcieport 0000:ca:01.0: bridge window [mem size 0x00200000]: can't assign; no space
[434448.117680] pcieport 0000:ca:01.0: bridge window [mem size 0x00200000]: failed to assign
[434448.117832] pcieport 0000:ca:01.0: bridge window [io size 0x1000]: can't assign; no space
[434448.117977] pcieport 0000:ca:01.0: bridge window [io size 0x1000]: failed to assign
[434448.118125] pcieport 0000:ca:01.0: PCI bridge to [bus cc]
[434448.118304] pcieport 0000:ca:01.0: bridge window [mem 0x2fa000000000-0x2ff066bfffff 64bit pref]
[439265.014985] mlx5_core 0000:a7:00.1: Using 56-bit DMA addresses
[462942.054093] INFO: NMI handler (ghes_notify_nmi) took too long to run: 1.002 msecs
[485846.881453] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 1.002 msecs
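To see whether every drop starts with the same power state failure, the log can be filtered for those lines. This is a rough sketch rather than a finished tool: the regular expression is written against the exact message format shown above, and the names POWER_FAIL and find_power_failures are made up for illustration.

```python
import re

# Matches lines such as:
# [428093.296377] vfio-pci 0000:bb:00.0: Unable to change power state from D3hot to D0, device inaccessible
POWER_FAIL = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+(?P<driver>\S+) "
    r"(?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"Unable to change power state from (?P<src>[^\s,]+) to (?P<dst>[^\s,]+), "
    r"device inaccessible"
)


def find_power_failures(log_text):
    """Return (timestamp, bdf, driver, from_state, to_state) per failed transition."""
    failures = []
    for line in log_text.splitlines():
        match = POWER_FAIL.search(line)
        if match:
            failures.append(
                (
                    float(match["ts"]),
                    match["bdf"],
                    match["driver"],
                    match["src"],
                    match["dst"],
                )
            )
    return failures
```

Passing the text of "dmesg" to find_power_failures should return one tuple per affected device; for the log above that would be 0000:bb:00.0 and 0000:cc:00.0, both bound to vfio-pci and both failing to go from D3hot to D0.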