Proxmox 7.2 AMD FirePro s7150 MxGPU vGPU passthrough

rmartin7793

Aug 29, 2022
Hello, I am finally posting after months of passively trying to get AMD vGPU to work. I recently had a significant breakthrough and can now actually see the virtual GPUs on my test system. That is huge, as I spent weeks messing around with the original AMD GIM repo trying to get it to compile on my system, only to discover that the old code does not compile on kernel 5.0+. At the beginning of this journey I knew nearly nothing about Linux kernel modules, compiling or writing C++, or making sure the correct BIOS settings were enabled. I tried to read every thread I could find on the S7150, GIM, and MxGPU. This has felt like trying to beat one of those seemingly insurmountable boss fights in a game, yet I have learned a ton so far.

Where I am at now, as stated above, is that I can see the virtual GPUs when I run lspci, yet I get the following error at the bottom of the output:
gim error:(init_register_init_state:3641) Failed to INIT PF for initial register 'init-state'

I am wondering if anyone has managed to navigate past this point?

Here are the system details:
Huananzhi ch8-x99
Intel(R) Xeon(R) CPU E5-2650 v3
Aspeed AST2400 system graphics
AMD s7150 FirePro

The verbose details of my error are as follows:

Code:
- - - - AMD PCI-e devices - - - -

02:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Tonga XT GL [FirePro S7150]
02:02.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Tonga XTV GL [FirePro S7150V]
02:02.1 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Tonga XTV GL [FirePro S7150V]
02:02.2 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Tonga XTV GL [FirePro S7150V]
02:02.3 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Tonga XTV GL [FirePro S7150V]
02:02.4 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Tonga XTV GL [FirePro S7150V]
02:02.5 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Tonga XTV GL [FirePro S7150V]
02:02.6 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Tonga XTV GL [FirePro S7150V]
02:02.7 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Tonga XTV GL [FirePro S7150V]
[    7.583546] gim error:(wait_cmd_complete:1683)  wait_cmd_complete -- time out after 3.000075060 sec
[    7.583579] gim error:(wait_cmd_complete:1692)   Cmd = 0x17, Status = 0x0
[    7.583600] gim error:(dump_gpu_status:1417) **** dump gpu status begin for struct adapter 2:00.00
[    7.583652] gim error:(dump_gpu_status:1457)  mmGRBM_STATUS = 0x3028
[    7.583676] gim error:(dump_gpu_status:1460)  mmGRBM_STATUS2 = 0x8
[    7.583716] gim error:(dump_gpu_status:1463)  mmSRBM_STATUS = 0x20000040
[    7.583734] gim error:(dump_gpu_status:1466)  mmSRBM_STATUS2 = 0x0
[    7.583751] gim error:(dump_gpu_status:1469)  mmSDMA0_STATUS_REG = 0x46dee557
[    7.583769] gim error:(dump_gpu_status:1472)  mmSDMA1_STATUS_REG = 0x46dee557
[    7.583793] gim error:(check_me_cntl:1388)   ME HALTED!
[    7.583807] gim error:(check_me_cntl:1391)   PFP HALTED!
[    7.583821] gim error:(check_me_cntl:1394)   CE HALTED!
[    7.583835] gim error:(dump_gpu_status:1588) **** dump gpu status end
[    7.583852] gim error:(init_register_init_state:3641) Failed to INIT PF for initial register 'init-state'
AMDVBFLASH version 4.69, Copyright (c) 2020 Advanced Micro Devices, Inc.


adapter seg  bn dn dID       asic           flash      romsize test    bios p/n 
======= ==== == == ==== =============== ============== ======= ==== ================
   0    0000 02 00 6929 Tonga           GD25Q41B         80000 pass 113-C76720NOSRIOV

- - - - - - - - - - - - - - - - -
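For anyone comparing their own boot log against the output above, here is a minimal sketch of a filter that counts the `gim error` lines and flags the two messages seen in this log. The function name `gim_summary` is made up for illustration; it only matches strings that appear in the output above and knows nothing else about GIM.

```shell
# Summarise GIM kernel messages read from stdin.
# gim_summary is a hypothetical helper name, not part of GIM.
gim_summary() {
    local log errors
    log=$(cat)

    # grep -c exits non-zero when the count is 0, hence the "|| true".
    errors=$(printf '%s\n' "$log" | grep -c 'gim error' || true)
    echo "gim error lines: ${errors}"

    # The message from init_register_init_state in the log above.
    if printf '%s\n' "$log" | grep -q 'Failed to INIT PF'; then
        echo "PF init: FAILED"
    else
        echo "PF init: no failure logged"
    fi

    # The wait_cmd_complete timeout that precedes the GPU status dump.
    if printf '%s\n' "$log" | grep -q 'wait_cmd_complete -- time out'; then
        echo "command timeout: yes"
    fi
}
```

Typical use would be `dmesg | gim_summary` on the host after the gim module has loaded.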
 
Thank you for your quick response. Am I correct in understanding that if my BIOS only supports UEFI and not legacy, I cannot use this motherboard with an S7150 and need to get a motherboard that has legacy boot?
 
no, that's not what i meant. in the mainboard i tested, there is an option for the 'option rom' for each pci device, which has a setting of 'legacy' and 'uefi'. this setting was independent of the overall uefi/bios setting
the thing is, that the card seems to need to be initialized in bios mode, and with our board that is supported even in efi boot mode
 
Thank you for clarifying. The options in this BIOS are oddly organized; I will look to see if I can find such an option and report back.
 
Is it possible to manually trigger the legacy option ROM with GRUB commands or from within PVE after boot? Two years ago we did some work with a Xeon Phi CPU, and for some reason the BIOS did not correctly set the CPU MSRs to enable AVX512. I recognize that this was a different scenario, but I am wondering whether it is possible to set the OpROM to legacy outside of the BIOS, before the PCIe devices initialize. Or perhaps set the OpROM to legacy from within the system post-boot and then reinitialize the card?

I've checked my bios thoroughly, and did not find a pci-e oprom setting. I did find a setting that mentioned OpRom, however, and it makes me wonder if this board is incapable of using legacy oprom with a UEFI install.

here is what it said:
x99OpRom.jpg
 
what are the options you can select there? it sounds like you should be able to select 'legacy' there?
 
I believe the whole idea of Fast Boot is to not do any (legacy) BIOS ROM initialization for devices (and boot more quickly that way). Maybe disable Fast Boot first and you might have more options?
 
I will try again to see if it is possible to fully disable Fast Boot, as I so desperately want this to be a problem of me overlooking something.

Here are photos of the options I get. When I disable Fast Boot, I still get the error. If disabling Fast Boot does not work, is it possible to reinitialize the card either after full boot or with GRUB options?

fastBootOptions.jpg

IMG_20221017_140849.jpg
 
i don't really know, you'd have to ask amd...
 
So I have spoken to a friend of mine, and we tried resetting the device from within the system, only to discover that the reset method appears to be bouncing the entire PCI-e bus. I am migrating the VMs on this host and will try tomorrow and report results here. My theory is that this BIOS may not support legacy option ROMs, and if we can manually reset the PCI-e bus after boot but before the VMs start, this may serve as a workaround.

I know this card is old and the effort required to get this combo working may prove to have diminishing returns, but if legacy features begin to be phased out of BIOSes, it may prove useful to the community to have a functional workaround. If we win I'll do a write-up with details, but I'll report our findings here either way.

Code:
root@ch8:/sys/devices/pci0000:00/0000:00:02.0/0000:02:00.0# ls
aer_dev_correctable       driver_override  modalias      resource2                subsystem_device
aer_dev_fatal             enable           msi_bus       resource2_wc             subsystem_vendor
aer_dev_nonfatal          firmware_node    msi_irqs      resource4                uevent
ari_enabled               getvf            numa_node     resource5                vendor
boot_vga                  gpuvf            power         revision                 virtfn0
broken_parity_status      hotlink_reset    power_state   rom                      virtfn1
class                     iommu            relvf         sriov_drivers_autoprobe  virtfn2
config                    iommu_group      remove        sriov_numvfs             virtfn3
consistent_dma_mask_bits  irq              rescan        sriov_offset             virtfn4
current_link_speed        link             reset         sriov_stride             virtfn5
current_link_width        local_cpulist    reset_method  sriov_totalvfs           virtfn6
d3cold_allowed            local_cpus       resource      sriov_vf_device          virtfn7
device                    max_link_speed   resource0     sriov_vf_total_msix      waiting_for_supplier
dma_mask_bits             max_link_width   resource0_wc  subsystem
root@ch8:/sys/devices/pci0000:00/0000:00:02.0/0000:02:00.0# echo "1" > reset
-bash: echo: write error: Inappropriate ioctl for device
root@ch8:/sys/devices/pci0000:00/0000:00:02.0/0000:02:00.0# echo 1 > reset
-bash: echo: write error: Inappropriate ioctl for device
root@ch8:/sys/devices/pci0000:00/0000:00:02.0/0000:02:00.0# cat reset_method
bus

Regardless this is forcing me to learn a ton about PCI-e and drivers, so like a demonstrator aircraft that gets canceled, worst case scenario I walk away with new insights and knowledge in my inventory.
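The sysfs listing above also exposes `remove` and `rescan` next to `reset`. Below is a minimal sketch of the remove-and-rescan idea, assuming the PF address `0000:02:00.0` from the lspci output. The function names and the `SYSFS_ROOT` override are hypothetical helpers, not part of any tool. "Inappropriate ioctl for device" corresponds to ENOTTY, which most likely means the kernel could not carry out the advertised bus reset. A remove and rescan only makes the kernel re-enumerate the device; it does not execute the card's option ROM, so it may well leave the GIM init error unchanged.

```shell
# SYSFS_ROOT is an illustrative override so the functions can be pointed
# at a test directory; on a real host it stays at /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Print the reset methods the kernel advertises for a device (e.g. "bus").
pci_reset_methods() {
    local dev="$SYSFS_ROOT/bus/pci/devices/$1"
    if [ -r "$dev/reset_method" ]; then
        cat "$dev/reset_method"
    else
        echo "none"
    fi
}

# Detach the device from the PCI core, then ask the kernel to rescan.
# Assumption: the gim module is unloaded and no VM is using the card.
pci_remove_and_rescan() {
    local dev="$SYSFS_ROOT/bus/pci/devices/$1"
    if [ ! -w "$dev/remove" ]; then
        echo "cannot remove $1" >&2
        return 1
    fi
    echo 1 > "$dev/remove"
    sleep 1
    echo 1 > "$SYSFS_ROOT/bus/pci/rescan"
}
```

As root on the host this would be called as `pci_reset_methods 0000:02:00.0` followed by `pci_remove_and_rescan 0000:02:00.0`.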
 
So I have found an OLD forum post that explains all hardware is initialized by Linux and that BIOS settings can be overridden by the OS. Is this true only of Legacy BIOS and not with UEFI? If this is possible with UEFI then the concept of completely reinitializing the FirePro after boot may not be out of the question.

Code:
How does Linux override BIOS settings
 whho   07-11-2009 12:12 PM
Hello,

In the process of configuring of my server, I come to a question: How does Linux override BIOS settings?

For example, if I disable a HD in BIOS, windows will not see it. It just disappears. However, Linux will see them regardless of how many drives are disabled in BIOS. Does windows depend on certain BIOS api, while Linux doesn't depend on it? Does Linux kernel overwrite the loaded BIOS programs/data, or Linux just ignore them? How does it all work?

Thanks!

whho

Code:
kilgoretrout    07-11-2009 03:04 PM
Once the linux kernel loads, linux kisses the bios good bye. All hardware detection is then done through the linux kernel. The only thing the bios does for linux is identify the boot device and load the kernel via the bootloader. Linux doesn't overwrite anything in the bios so much as completely ignore it once the kernel loads.

In windows, hardware detection is much more tightly bound to the bios. For example, older bioses had limitations on the maximum size of the hard drives it would recognize. These over sized drives would be undetected by the bios and windows but be plainly visible in linux if you had linux on another bootable drive within the size limits of the bios. It's more a matter of the history of the PC than anything else. The PC bios was designed to work with the old DOS based windows and the quirks of the windows kernel. For the development of the linux kernel, you didn't want or need to have hardware detection tied to the bios. Once the kernel loads, linux accesses the hardware directly.

Code:
whho    07-11-2009 11:08 PM
Hi Kilgo,

Thanks for your explanation! I really like the history bit!

Do you have any information or links explaining in more detail what address bios is loaded into and how a boot loader access the information in the bios?

Does a boot loader call a function or access a certain register or read a certain block of memory or what?

Thanks again! Any idea is welcome!

whho

Code:
Erik_FL    07-12-2009 12:19 AM
It's basically a choice for an operating system to pay attention to the configuration information kept by the ACPI (Advanced Configuration and Power Interface). The BIOS creates a database of the detected hardware, disks, etc. It can also configure some devices based on the information the hardware provides as part of Plug and Play.

Newer versions of Windows require an ACPI BIOS and will not work without it unless you install a special HAL (Hardware Abstraction Layer).

Linux does not require ACPI although you can load modules that support the power management features of ACPI and hot plugging.

ACPI provides assistance for power management. The BIOS still has to be involved in placing hardware in low power modes so Linux does use the BIOS ACPI interface. On laptops it may be necessary to load special ACPI modules for Linux to support all the power management features provided by the BIOS.




https://www.linuxquestions.org/ques...es-linux-override-bios-settings-739404-print/
 
i don't fully understand how these posts relate to your problem. AFAIU the card has an embedded OPROM which seems to initialize the card. These OPROMS can come in different variants (BIOS vs UEFI) and the BIOS chooses which (if any) is executed on boot.
AFAIK this must happen before anything else is loaded (especially the legacy BIOS ones). There is nothing the linux kernel can do to execute the OPROM. I guess the driver could also initialize the card, but that would have to be implemented there. As i said, i'd contact and ask AMD for help (if they give any)
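To see which option ROM variants a card actually carries, the ROM can be dumped through the sysfs `rom` attribute (visible in the listing earlier in the thread) and its image headers walked. The following is a rough sketch based on the PCI expansion ROM layout: a 0x55 0xAA signature, a pointer to the PCIR structure at offset 0x18, and a code type of 0x00 for legacy x86 or 0x03 for EFI. The function names are illustrative, the dump path in the usage comment is an assumption, and the PCIR signature itself is not validated.

```shell
# Read NBYTES at OFFSET from FILE as an unsigned little-endian integer.
rom_read_le() {  # file offset nbytes
    local file=$1 off=$2 n=$3 val=0 bits=0 b
    for b in $(od -An -v -tu1 -j "$off" -N "$n" "$file" 2>/dev/null); do
        val=$(( val | (b << bits) ))
        bits=$(( bits + 8 ))
    done
    echo "$val"
}

# Walk the chain of images in a dumped option ROM and print each code type.
rom_list_images() {  # file
    local file=$1 base=0 size sig pcir type len ind
    size=$(( $(wc -c < "$file") ))
    while [ "$base" -lt "$size" ]; do
        sig=$(rom_read_le "$file" "$base" 2)
        # Bytes 0x55 0xAA read little-endian give 0xAA55.
        if [ "$sig" -ne $((0xAA55)) ]; then break; fi
        pcir=$(( base + $(rom_read_le "$file" $((base + 0x18)) 2) ))
        len=$(rom_read_le "$file" $((pcir + 0x10)) 2)    # units of 512 bytes
        type=$(rom_read_le "$file" $((pcir + 0x14)) 1)   # code type
        ind=$(rom_read_le "$file" $((pcir + 0x15)) 1)    # bit 7 = last image
        case $type in
            0) echo "image at $base: legacy x86 (BIOS)" ;;
            3) echo "image at $base: EFI" ;;
            *) echo "image at $base: code type $type" ;;
        esac
        if [ $(( ind & 0x80 )) -ne 0 ]; then break; fi
        if [ "$len" -eq 0 ]; then break; fi
        base=$(( base + len * 512 ))
    done
}

# Assumed usage on the host, as root:
#   cd /sys/bus/pci/devices/0000:02:00.0
#   echo 1 > rom && cat rom > /tmp/s7150.rom && echo 0 > rom
#   rom_list_images /tmp/s7150.rom
```

A card that lists only an EFI image, or only a legacy one, would at least show which firmware mode is able to run its OpROM, independent of what the mainboard menu offers.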
 

Sorry for the late reply. You seem to be correct: we found no way to reinitialize the card. AMD seems to have moved on from this card, for understandable reasons. The BIOS we had flashed on the board was confirmed by its developer to have no legacy support of any kind, but we had used this BIOS to add PCI-e bifurcation, which we need for our NVMe arrays.


Where we are at now is that I have had to hire him to develop a variant of this BIOS, at an honest but significant price. If this solves our issue, I will at least post an update confirming it for the record.
 
Sorry to dig this one up again,

But I just forked a fork of the original gim repo by AMD and updated it to support the 6.8.x.x Linux kernel.

It took a fair amount of fiddling and updating, but I modified the code to be backwards compatible, and after some issues with the actual module installation, I can now see the virtual GPUs in the `lspci` output, without the errors described above.

1722216328025.png

I will be testing this against Windows 11 and Ubuntu 20.04/22.04 LTS guests and will report back here with the results.

Technical Details:
Host PC
  • ASUS SAGE WRx8 SE Motherboard
  • AMD Ryzen Threadripper Pro 3955WX
  • AMD FirePro S7150
Repo
GIM Fork: at the time of writing, the code is on the kernel5.11 branch; contrary to the branch name, it supports kernels up to 6.8.
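The first post in this thread showed the VFs listed by lspci even though PF initialisation had failed, so seeing the VFs is not proof of a working setup on its own. Here is a small sketch that reads the `sriov_totalvfs` and `sriov_numvfs` attributes and counts the `virtfn*` links that appear in the sysfs listing earlier in the thread. The function name, the `SYSFS_ROOT` override and the example PCI address are assumptions for illustration; the address on the Threadripper host will differ.

```shell
# SYSFS_ROOT is an illustrative override for testing; normally /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Report SR-IOV state for a physical function given its PCI address.
vf_report() {  # e.g. vf_report 0000:02:00.0
    local dev="$SYSFS_ROOT/bus/pci/devices/$1" total num n=0 link
    if [ ! -d "$dev" ]; then
        echo "no such device: $1" >&2
        return 1
    fi
    total=$(cat "$dev/sriov_totalvfs" 2>/dev/null || echo "?")
    num=$(cat "$dev/sriov_numvfs" 2>/dev/null || echo "?")
    # Each enabled VF shows up as a virtfnN link under the PF.
    for link in "$dev"/virtfn*; do
        [ -e "$link" ] || continue
        n=$(( n + 1 ))
    done
    echo "total=$total enabled=$num virtfn_links=$n"
}
```

Pairing this with a check of `dmesg` for `gim error` lines gives a fuller picture than the lspci listing alone.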
 
