Adding 6600XT GPU causes VM to run one core at 100% and never boot

brakels

New Member
Sep 8, 2023
Boston, MA USA
I've got a Supermicro X11SPA-TF motherboard with an Intel Xeon 6226R running Proxmox 8.1.1, and I've got a Sapphire 6600XT Nitro+ 8GB in CPU Slot 3 (x16, PCIe 3.0) and an NVIDIA 1660 Ti in CPU Slot 7 (x16, PCIe 3.0).

When I assign the NVIDIA card to a VM, it boots and works fine (tested with Windows 11 so far). If I assign the 6600XT to a VM, the VM never boots, and instead it sits with a single core at 100% and does nothing (no errors, no problems with the host, and I'm free to start/stop the VM as many times as I want). I tested this with the client running Windows 11 and macOS 14; both VMs boot and work fine with a virtual VGA card, but do nothing when the 6600XT is assigned to them. The Windows 11 VM boots fine with the NVIDIA card, and I was able to install drivers and Parsec and run FurMark, so that appears to be fine.

I've rebooted the node countless times, and tried a lot of permutations of settings, all with the same result. One thing I haven't tried yet is removing the NVIDIA GPU from the system. Should I?

I did try dumping the 6600XT's BIOS, but I was only able to get ~119KB. However, I used identifiers in that ROM to find the full ROM on TechPowerUp (it matched byte-for-byte with what I was able to dump, but continued on for a full 1MB). I used that full 1MB ROM with the VM, which resulted in no change in the observed behavior.
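
For anyone following along, the usual sysfs ROM dump looks roughly like this (a sketch; 0000:1b:00.0 is the 6600XT's address from the listing below, and the file name is just an example):

Bash:
# enable ROM read-back, copy it out, then disable it again
echo 1 > /sys/bus/pci/devices/0000:1b:00.0/rom
cat /sys/bus/pci/devices/0000:1b:00.0/rom > /usr/share/kvm/navi23.rom
echo 0 > /sys/bus/pci/devices/0000:1b:00.0/rom
# a ROM placed in /usr/share/kvm/ can then be referenced from the VM config, e.g.
#   hostpci0: 0000:1b:00.0,pcie=1,romfile=navi23.rom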

Here's a dump of information, and I'm happy to provide more:

Bash:
$ cat /proc/cmdline
initrd=\EFI\proxmox\6.5.11-7-pve\initrd.img-6.5.11-7-pve root=ZFS=rpool/ROOT/pve-1 boot=zfs intel_iommu=on iommu=pt

$ cat /etc/modules
vfio
vfio_iommu_type1
vfio_pci

$ cat /etc/modprobe.d/kvm.conf
options kvm ignore_msrs=1 report_ignored_msrs=0

$ cat /etc/modprobe.d/vfio.conf
options vfio-pci ids=10de:2182,10de:1aeb,10de:1aec,10de:1aed,1002:73ff,1002:ab28,1002:1478,1002:1479

$ cat /etc/modprobe.d/blacklist.conf
blacklist radeon
blacklist amdgpu
blacklist nouveau
blacklist nvidia
blacklist nvidiafb
blacklist nvidia_drm

$ dmesg | grep "IOMMU enabled"
[    0.270400] DMAR: IOMMU enabled

$ dmesg | grep remapping
[    0.751347] DMAR-IR: Queued invalidation will be enabled to support x2apic and Intr-remapping.
[    0.752247] DMAR-IR: Enabled IRQ remapping in x2apic mode

$ dmesg | grep -i vfio
[    6.463302] VFIO - User Level meta-driver version: 0.3
[    6.468926] vfio-pci 0000:65:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none
[    6.469051] vfio_pci: add [10de:2182[ffffffff:ffffffff]] class 0x000000/00000000
[    6.516107] vfio_pci: add [10de:1aeb[ffffffff:ffffffff]] class 0x000000/00000000
[    6.516145] vfio_pci: add [10de:1aec[ffffffff:ffffffff]] class 0x000000/00000000
[    6.516162] vfio_pci: add [10de:1aed[ffffffff:ffffffff]] class 0x000000/00000000
[    6.516183] vfio-pci 0000:1b:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none
[    6.516330] vfio_pci: add [1002:73ff[ffffffff:ffffffff]] class 0x000000/00000000
[    6.616182] vfio_pci: add [1002:ab28[ffffffff:ffffffff]] class 0x000000/00000000
[    6.616243] vfio_pci: add [1002:1478[ffffffff:ffffffff]] class 0x000000/00000000
[    6.616285] vfio_pci: add [1002:1479[ffffffff:ffffffff]] class 0x000000/00000000

$ pvesh get /nodes/nodename/hardware/pci --pci-class-blacklist ""
# output trimmed for relevance, happy to provide full output if that is helpful
┌──────────┬────────┬──────────────┬────────────┬────────┬──────────────────────────────────────────────────────────────┬──────┬──────────────────┬───────────────────────┬──────────────────┬──────────
│ class    │ device │ id           │ iommugroup │ vendor │ device_name                                                  │ mdev │ subsystem_device │ subsystem_device_name │ subsystem_vendor │ subsystem
╞══════════╪════════╪══════════════╪════════════╪════════╪══════════════════════════════════════════════════════════════╪══════╪══════════════════╪═══════════════════════╪══════════════════╪══════════
│ 0x030000 │ 0x2000 │ 0000:04:00.0 │         44 │ 0x1a03 │ ASPEED Graphics Family                                       │      │ 0x1b28           │                       │ 0x15d9           │ Super Mic
│ 0x030000 │ 0x73ff │ 0000:1b:00.0 │         12 │ 0x1002 │ Navi 23 [Radeon RX 6600/6600 XT/6600M]                       │      │ 0xe448           │                       │ 0x1da2           │ Sapphire
│ 0x030000 │ 0x2182 │ 0000:65:00.0 │          5 │ 0x10de │ TU116 [GeForce GTX 1660 Ti]                                  │      │ 0x1333           │                       │ 0x196e           │ PNY
│ 0x040300 │ 0xab28 │ 0000:1b:00.1 │         13 │ 0x1002 │ Navi 21/23 HDMI/DP Audio Controller                          │      │ 0xab28           │                       │ 0x1002           │ Advanced
│ 0x040300 │ 0x1aeb │ 0000:65:00.1 │          5 │ 0x10de │ TU116 High Definition Audio Controller                       │      │ 0x1333           │                       │ 0x196e           │ PNY
│ 0x060400 │ 0x1478 │ 0000:19:00.0 │         10 │ 0x1002 │ Navi 10 XL Upstream Port of PCI Express Switch               │      │ 0x0000           │                       │ 0x0000           │
│ 0x060400 │ 0x1479 │ 0000:1a:00.0 │         11 │ 0x1002 │ Navi 10 XL Downstream Port of PCI Express Switch             │      │ 0x1479           │                       │ 0x1002           │ Advanced
│ 0x0c0330 │ 0x1aec │ 0000:65:00.2 │          5 │ 0x10de │ TU116 USB 3.1 Host Controller                                │      │ 0x1333           │                       │ 0x196e           │ PNY
│ 0x0c8000 │ 0x1aed │ 0000:65:00.3 │          5 │ 0x10de │ TU116 USB Type-C UCSI Controller                             │      │ 0x1333           │                       │ 0x196e           │ PNY
└──────────┴────────┴──────────────┴────────────┴────────┴──────────────────────────────────────────────────────────────┴──────┴──────────────────┴───────────────────────┴──────────────────┴──────────

Any suggestions on what to try next would be greatly appreciated! Thanks in advance!
 
Is the 6600XT used (showing output) when the system POSTs and Proxmox boots? If so, you might need this work-around (since you don't let amdgpu take the GPU): https://forum.proxmox.com/posts/478351/
There is no point in blacklisting radeon, as it is not used. But you are not blacklisting snd_hda_intel, which is used by the GPU for audio (nor other drivers for other functions of the GPU)? In my experience, there is no need to blacklist amdgpu etc., nor to early bind AMD GPUs; just let amdgpu take the GPU (and release it from the boot screen) before starting the VM.

EDIT: That last part does depend on the 6600XT resetting properly (FLR), which I don't know for sure it does.
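
For what it's worth, "releasing" the GPU usually means detaching the console/framebuffer and the driver before the VM starts. A rough sketch of the manual steps (assuming the 6600XT at 0000:1b:00.0 and an efi-framebuffer console; newer kernels may use simpledrm instead, and Proxmox normally unbinds the driver itself at VM start):

Bash:
# stop the kernel console from drawing on the GPU
echo 0 > /sys/class/vtconsole/vtcon0/bind
echo 0 > /sys/class/vtconsole/vtcon1/bind
# release the generic EFI framebuffer, if it is bound
echo efi-framebuffer.0 > /sys/bus/platform/drivers/efi-framebuffer/unbind
# unbind amdgpu from the card so vfio-pci can take it
echo 0000:1b:00.0 > /sys/bus/pci/devices/0000:1b:00.0/driver/unbind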
 
Is the 6600XT used (showing output) when the system POSTs and Proxmox boots? If so, you might need this work-around (since you don't let amdgpu take the GPU): https://forum.proxmox.com/posts/478351/
There is no point in blacklisting radeon, as it is not used. But you are not blacklisting snd_hda_intel, which is used by the GPU for audio (nor other drivers for other functions of the GPU)? In my experience, there is no need to blacklist amdgpu etc., nor to early bind AMD GPUs; just let amdgpu take the GPU (and release it from the boot screen) before starting the VM.

The 6600XT is NOT used during boot. I forgot to mention that I was able to disable that in the motherboard's BIOS. Each PCIe slot had a setting that could be set to EFI or disable, and I set them all to disable.

OK, so I should add snd_hda_intel to the list of blacklists, but remove radeon and amdgpu? I'll go try that.
 
The 6600XT is NOT used during boot. I forgot to mention that I was able to disable that in the motherboard's BIOS. Each PCIe slot had a setting that could be set to EFI or disable, and I set them all to disable.
I'm not convinced. You might want to try the initcall_blacklist=sysfb_init work-around and enable EFI for the 6600XT.
OK, so I should add snd_hda_intel to the list of blacklists, but remove radeon and amdgpu? I'll go try that.
If you're going for blacklisting (snd_hda_intel and maybe others), then keep blacklisting amdgpu! If you're going for letting the drivers load for the GPU, then don't blacklist any necessary drivers (or early bind them to vfio-pci). Best not to mix different approaches; that will most likely not work.

EDIT: Either blacklist everything, or early bind to vfio-pci (but use softdep to make sure vfio-pci loads before the actual drivers), or let the drivers load (and free the GPU from the boot process).
EDIT2: Your reply just disappeared, probably because of the spam filter. Please give it some time (maybe read my edits of my posts).
 
I see. So let me make sure I understand what is going on:

The goal is that the booting host OS doesn't take ownership of the devices we want to make available for VMs? So the two approaches are: 1) blacklist the drivers so they can't be loaded, and thus the host can't take ownership of the devices, or 2) bind them to vfio-pci, in which case it will stop (?) the host OS from claiming sole ownership, and thus the device is still available for use by VMs despite the host drivers loading?

I'm new to a lot of this, and have been trying to follow a number of different guides, many of which I've since learned are outdated, so I appreciate you helping me lock down the basics!

I did try initcall_blacklist=sysfb_init at one point, but I will add that to the queue of things to try and report back.
 
So my previous message says "This message is awaiting moderator approval, and is invisible to normal visitors". I edited it 2 times in quick succession because I didn't know how to format the inline code. I now see the "Preview" button, which I will try to remember exists going forward.

OK, so wrapping my head around vfio-pci: so softdep tells modprobe(?) that the graphics drivers depend on vfio being loaded first, and once vfio loads, it can claim the devices before the drivers get a chance? Is that the right idea? And is vfio only claiming the devices listed in the vfio.conf file, or is that used for something else?
 
OK, so wrapping my head around vfio-pci: so softdep tells modprobe(?) that the graphics drivers depend on vfio being loaded first, and once vfio loads, it can claim the devices before the drivers get a chance? Is that the right idea? And is vfio only claiming the devices listed in the vfio.conf file, or is that used for something else?
Yes, I think you've got it. If you go that route, you'll want to add softdep amdgpu pre: vfio-pci and similar for snd_hda_intel (and maybe others?).
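
To make that concrete, a sketch of what the early-bind config could look like, reusing the ids line you already have (any file in /etc/modprobe.d/ works; rebuild the initramfs with update-initramfs -u -k all and reboot afterwards):

Code:
# /etc/modprobe.d/vfio.conf (sketch)
softdep amdgpu pre: vfio-pci
softdep snd_hda_intel pre: vfio-pci
# ...and similar softdep lines for any other drivers that could grab these devices
options vfio-pci ids=10de:2182,10de:1aeb,10de:1aec,10de:1aed,1002:73ff,1002:ab28,1002:1478,1002:1479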
 
In your current situation, I would first try the initcall_blacklist=sysfb_init work-around and enable EFI for the 6600XT. But you won't have a Proxmox host console anymore, which can make troubleshooting more difficult. What GPU do you use for Proxmox itself (in case remote access fails)?
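
On a ZFS/systemd-boot install like yours (your /proc/cmdline suggests proxmox-boot-tool is in use), that work-around would look roughly like this, as a sketch:

Bash:
# append the option to the single line in /etc/kernel/cmdline, e.g.:
#   root=ZFS=rpool/ROOT/pve-1 boot=zfs intel_iommu=on iommu=pt initcall_blacklist=sysfb_init
# then write the updated cmdline to the boot partition(s) and reboot
proxmox-boot-tool refresh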
 
What GPU do you use for Proxmox itself (in case remote access fails)?

The motherboard has a built-in video card with VGA out. In my original post, it is the "ASPEED Graphics Family" device.

OK, I'll stick with the blacklists for now, which currently look like this:
Code:
blacklist nouveau
blacklist nvidia
blacklist nvidiafb
blacklist nvidia_drm
blacklist amdgpu
blacklist snd_hda_intel

I've added initcall_blacklist=sysfb_init to /etc/kernel/cmdline, reran proxmox-boot-tool refresh, and will now enable EFI for the PCIe slot in the BIOS.
 
So I changed the BIOS to EFI for all the PCIe slots, and once it booted, I ssh'd in and confirmed my cmdline updated (via cat /proc/cmdline).

The boot itself went like this:
  • The BIOS POST/splash screen appears on the Onboard VGA output.
  • The BIOS POST/splash switches to the 6600XT HDMI output (VGA goes dark)
  • Proxmox boots and BOTH outputs show the Proxmox boot menu.
  • Proxmox continues booting, and the HDMI output stops updating, and just shows a single line "EFI stub: Loaded initrd from LINUX_EFI_INITRD_MEDIA_GUID device path".
  • After a couple seconds the VGA output lights up and I see the boot logs with the yellow [OK] scrolling by.
  • The VGA output eventually completes the boot and shows the text login prompt.
I logged into the Proxmox web GUI, swapped the PCI device to the 6600XT, and started it. Aaaaaaaaaannnd, same behavior as before (1 core at 100%, VM does not ever boot).
 
So I changed the BIOS to EFI for all the PCIe slots
I'm not sure what that does exactly, sorry.
The boot itself went like this:
  • The BIOS POST/splash screen appears on the Onboard VGA output.
  • The BIOS POST/splash switches to the 6600XT HDMI output (VGA goes dark)
So the 6600XT is used during boot.
  • Proxmox boots and BOTH outputs show the Proxmox boot menu.
  • Proxmox continues booting, and the HDMI output stops updating, and just shows a single line "EFI stub: Loaded initrd from LINUX_EFI_INITRD_MEDIA_GUID device path".
  • After a couple seconds the VGA output lights up and I see the boot logs with the yellow [OK] scrolling by.
  • The VGA output eventually completes the boot and shows the text login prompt.
I logged into the Proxmox web GUI, swapped the PCI device to the 6600XT, and started it. Aaaaaaaaaannnd, same behavior as before (1 core at 100%, VM does not ever boot).
Maybe your 6600XT does not reset properly? If so, you can only use it once per host boot (you need to reboot Proxmox after stopping the VM before the VM can be started again). And you need to make sure it is not used during boot. Disabling EFI for the PCIe slot of the 6600XT might not be enough for this.
Do you know of anyone on the internet that has gotten your exact 6600XT make and model working with passthrough?
Maybe try swapping the NVidia GPU with the 6600XT (which will require changing both VMs, since the PCI IDs will swap as well), assuming the NVidia GPU resets properly; that way the 6600XT is not touched before starting the VM.
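
As a quick check, you can at least see whether the card advertises Function Level Reset (a sketch; FLReset+ in the output does not guarantee a clean reset in practice, but FLReset- would be a bad sign):

Bash:
# look for FLReset+ / FLReset- in the 6600XT's device capabilities
lspci -vvs 1b:00.0 | grep -i flreset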
 
Maybe your 6600XT does not reset properly?

I was assuming if this happened, it would have more obvious behavior (like bringing down the host). But that is not based on fact, just assumptions from stuff I've been reading...

But I think removing the NVIDIA GPU and trying different slots isn't a terrible idea. I can't physically fit the 6600XT in the slot where the NVIDIA is currently, but I have other slots to choose from.

I'm also tempted to just swap the 6600XT with an NVIDIA 2070 I have in my desktop, and then if the 2070 and the 1660Ti both work together in separate VMs, I would just be left with the problem of getting macOS to play nice with the 2070. Maybe that is less of a hassle than continuing to try to get this 6600XT to work?

The specific card is the Sapphire RX 6600 XT Nitro+, and according to the partial rom dump I got, it is a 113-1E448MU-O6Q.
 
I was assuming if this happened, it would have more obvious behavior (like bringing down the host). But that is not based on fact, just assumptions from stuff I've been reading...
No, it just does not reset properly and does not come back up. Check journalctl from around the time of starting the VM.
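
Something along these lines (just a sketch; adjust the filter terms as needed):

Bash:
# kernel messages from the current boot, filtered for passthrough-related lines
journalctl -k | grep -iE 'vfio|vgaarb|amdgpu|reset|BAR'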
But I think removing the NVIDIA GPU and trying different slots isn't a terrible idea. I can't physically fit the 6600XT in the slot where the NVIDIA is currently, but I have other slots to choose from.
Make sure to check the IOMMU groups when using different slots.
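
For example with the pvesh command from your first post, or with a quick loop over sysfs (a generic sketch):

Bash:
# list every PCI device together with its IOMMU group
for d in /sys/kernel/iommu_groups/*/devices/*; do
    g=$(basename "$(dirname "$(dirname "$d")")")
    printf 'group %s: ' "$g"
    lspci -nns "$(basename "$d")"
done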
I'm also tempted to just swap the 6600XT with an NVIDIA 2070 I have in my desktop, and then if the 2070 and the 1660Ti both work together in separate VMs, I would just be left with the problem of getting macOS to play nice with the 2070. Maybe that is less of a hassle than continuing to try to get this 6600XT to work?
I cannot comment on macOS, sorry. I have no personal experience with NVidia because they purposefully hindered passthrough in the past and I switched to AMD GPUs that are known to work (sometimes only with vendor-reset).
The specific card is the Sapphire RX 6600 XT Nitro+, and according to the partial rom dump I got, it is a 113-1E448MU-O6Q.
That one appears to work with passthrough without any issues: https://www.reddit.com/r/VFIO/comments/tq9j5v/need_help_compiling_a_list_of_amd_6000_series/ Maybe ask that reddit-user for tips?
 
OK, I swapped cards, and with the 1660Ti and 2070 both plugged in, the BIOS locks up during POST and never boots. It at least told me in plain English that it was trying to enumerate the PCI devices.

I did some experiments with the cards in different slots, and got confusing results, so then I dug into the BIOS and manual, and realized I'd never looked into how the PCIe lanes were routed on this MB. So I looked up the CPU, and it has 48 PCIe lanes, and then I searched around for a block diagram of the MB's lane routing, and lo and behold, 16 lanes go to the 4 m.2 slots, leaving only 32 lanes. That would be fine if I just had the two GPUs, but I also have an Intel 2x 10Gb Ethernet card eating up 8 lanes.

So I now wonder if that's the problem. Maybe the 6600XT booted with 8 lanes, but then freaked out when the VM thought it had 16 to work with? And maybe NVIDIA cards are just more sensitive to that kind of thing at boot?
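
(I suppose I could also check how many lanes the card actually negotiated on the host, with something like this using the 1b:00.0 address from before:)

Bash:
# compare the card's maximum link width (LnkCap) with what it negotiated (LnkSta)
lspci -vvs 1b:00.0 | grep -E 'LnkCap|LnkSta'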

So tomorrow I think I'll try removing the network card and seeing if I can get it to boot with 2 NVIDIA cards, and if so, then try swapping back in the 6600XT and see if I can finally get that to work. If that fixes it, then I'll have to decide: 2 GPUs, or 1 GPU and 2x 10Gb Ethernet... Or is there some way to split the 4x m.2 into 2x m.2 and one 8x slot? The block diagram leads me to believe that is not possible, but maybe it is!

Not sure if I'm allowed to post links, but here's the block diagram I found for an X11SPA-T.
 
OK, I swapped cards, and with the 1660Ti and 2070 both plugged in, the BIOS locks up during POST and never boots. It at least told me in plain English that it was trying to enumerate the PCI devices.
Why swap when you are not using the 6600XT? Or maybe this was about fitting again.
I did some experiments with the cards in different slots, and got confusing results, so then I dug into the BIOS and manual, and realized I'd never looked into how the PCIe lanes were routed on this MB. So I looked up the CPU, and it has 48 PCIe lanes, and then I searched around for a block diagram of the MB's lane routing, and lo and behold, 16 lanes go to the 4 m.2 slots, leaving only 32 lanes. That would be fine if I just had the two GPUs, but I also have an Intel 2x 10Gb Ethernet card eating up 8 lanes.
Some slots share lanes via the southbridge (called PCH by Intel) or multiplexer chips on the motherboard (in general).
So I now wonder if that's the problem. Maybe the 6600XT booted with 8 lanes, but then freaked out when the VM thought it had 16 to work with? And maybe NVIDIA cards are just more sensitive to that kind of thing at boot?
Passthrough handles this fine and PCIe is supposed to be compatible both ways.
So tomorrow I think I'll try removing the network card and seeing if I can get it to boot with 2 NVIDIA cards, and if so, then try swapping back in the 6600XT and see if I can finally get that to work. If that fixes it, then I'll have to decide: 2 GPUs, or 1 GPU and 2x 10Gb Ethernet... Or is there some way to split the 4x m.2 into 2x m.2 and one 8x slot? The block diagram leads me to believe that is not possible, but maybe it is!

Not sure if I'm allowed to post links, but here's the block diagram I found for an X11SPA-T.
The number of lanes is usually not the problem and only limits (maximum) performance. The PCH will share the bandwidth over the devices connected to it (and they can be a bit slower that way). Passthrough often works best with the PCIe lanes from the CPU (with integrated northbridge/IOMMU), but it all depends on the capabilities of the southbridge etc.
 
So, good news! If I pull the Intel NIC, then my original problem goes away and I can boot into a VM! I'm installing the AMD drivers in my Windows VM right now, and will try the macOS VM next.

And to respond to your last post:
Why swap when you are not using the 6600XT? Or maybe this was about fitting again.

So I pulled the 6600XT and put it in my desktop to make sure it worked fine. I booted Windows, installed all the drivers, and ran some benchmarks, and it seemed completely normal. So while that was happening, I took the 2070 that I pulled from my desktop and tested it in my Proxmox server. So that's what I meant by "swapped": I swapped the 6600XT for the 2070 I hadn't previously tried using with Proxmox.

Some slots share lanes via the southbridge (called PCH by Intel) or multiplexer chips on the motherboard (in general).

Yeah, looking at the block diagram link I posted, none of the MB slots go through the south bridge, but here's my understanding:

1. Slot 1 seems to be either x16 or nothing, depending on if you need any of the m.2 slots. I am using 2 of the m.2 slots, so I assume there's no way for me to get anything out of Slot 1.
2. Slots 2, 3, 4, and 5 share 16 lanes, but with a PEX8747 (48-lane switch) and a handful of IT8898 muxes. I don't fully grok the limitations of this setup. The block diagram doesn't explain to me when you can expect which slots to co-exist.
3. Slots 6 and 7 share 16 lanes, and I think it can work as either 0/16, or 8/8.

For all my previous tests, I'd put the 1660Ti in Slot 7 (the 6600XT doesn't quite fit here), my Intel NIC in Slot 5, and the 6600XT/2070 in Slot 3.

Passthrough handles this fine and PCIe is supposed to be compatible both ways.
There is a newer BIOS available from Supermicro, but they warn not to install it without a specific reason, and I can't find release notes or a version history or anything to let me know what actually changed between versions. Still, there's always a chance I've hit a bug that they fixed later on. I suppose I should reach out to them.
 
Yeah, looking at the block diagram link I posted, none of the MB slots go through the south bridge, but here's my understanding:

1. Slot 1 seems to be either x16 or nothing, depending on if you need any of the m.2 slots. I am using 2 of the m.2 slots, so I assume there's no way for me to get anything out of Slot 1.
Either it's 4x4 or 1x16, or it's both but they share the bandwidth. Doesn't the motherboard manual explain which slots are mutually exclusive?
2. Slots 2, 3, 4, and 5 share 16 lanes, but with a PEX8747 (48-lane switch) and a handful of IT8898 muxes. I don't fully grok the limitations of this setup. The block diagram doesn't explain to me when you can expect which slots to co-exist.
Either it's sharing bandwidth or the lanes are switched depending on the devices plugged in. Or maybe you get to choose in the motherboard BIOS.
3. Slots 6 and 7 share 16 lanes, and I think it can work as either 0/16, or 8/8.
Probably, which might indicate that slot 1 is unusable when you use the M.2 slots.
For all my previous tests, I'd put the 1660Ti in Slot 7 (the 6600XT doesn't quite fit here), my Intel NIC in Slot 5, and the 6600XT/2070 in Slot 3.

There is a newer BIOS available from Supermicro, but they warn not to install it without a specific reason, and I can't find release notes or a version history or anything to let me know what actually changed between versions. Still, there's always a chance I've hit a bug that they fixed later on. I suppose I should reach out to them.
At least they can explain which slots are mutually exclusive.
 
