Grid K2 crashes my machine

Ejo2001

New Member
Feb 5, 2021

Grid K2 problems in Proxmox


Hello! I'm having a few problems with my Nvidia Grid K2 cards. Help and thoughts appreciated! (Also, this is my first Proxmox post, sorry if it isn't very well made.)

Background
So I ordered 2x Nvidia Grid K2 from eBay and put them inside a Proxmox machine. The first card ran perfectly fine on the first vGPU, but when I assigned the second vGPU to another VM, that VM would get Error code 43. I replaced it with the second card that I bought, and this time the whole Proxmox host restarts when I try to boot the VM that has been assigned the second vGPU. Am I doing something wrong, or are the cards I bought simply defective?

Host specs
CPU: AMD Ryzen 3600X
GPU: Nvidia Grid K2, Nvidia GT 720 (for display output)
Storage: 2 TB HDD, 500 GB SSD
Motherboard: Gigabyte B450M DS3H
RAM: 16 GB Corsair Vengeance DDR4

Thanks in advance!

/Ejo
 
Is there any error or warning in dmesg or in the syslog/journal? (You may have to enable a persistent journal so the log is not lost after the crash.)
Do you pass through vGPUs (if yes, which driver do you use), or the whole cards?

What do the VM configs look like?
Are the IOMMU groups OK?
 
syslog and dmesg

I am quite new to Linux/Proxmox, so I don't know exactly how that works. I opened the syslog and can't see anything that looks like an error. In dmesg these lines seem to be repeated, though I don't think they are related to my issue?

[ 8.468487] EDAC amd64: Node 0: DRAM ECC disabled.
[ 8.468489] EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load. Either enable ECC checking or force module loading by setting 'ecc_enable_override'. (Note that use of the override may cause unknown side effects.)
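For anyone else digging through a long dmesg, here is a rough sketch of how the kernel log could be narrowed down to lines that are more likely to matter for passthrough. The keyword list is only an assumption about what is worth looking at (it is not from this thread and is not exhaustive), and `relevant_lines` is a made-up helper name.

```python
import re

# Assumed keywords for passthrough-related messages; adjust as needed.
KEYWORDS = ("vfio", "iommu", "amd-vi", "dmar", "nvrm")


def relevant_lines(log_text, keywords=KEYWORDS):
    """Return the log lines that mention any keyword (case-insensitive)."""
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    return [line for line in log_text.splitlines() if pattern.search(line)]
```

The text to filter could come from `dmesg`, or from `journalctl -k -b -1` for the boot before the crash (the latter only has data if the journal is persistent). The EDAC lines quoted above would be dropped by this filter, since they are about ECC memory and not about the GPU.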


Passthrough


I am trying to split the 2 GPUs on the card so that each VM can have its own GPU (trying to make it work like 2x GTX 690). I had a problem with this motherboard in the beginning: when I activated IOMMU in the BIOS, Proxmox wouldn't boot and would show this weird error:

1613047565491.png

However, a friend of mine helped me boot the computer with IOMMU enabled by changing the GRUB file to this:

1613047673333.png

I don't know if any of this could be the issue.


Which driver do you use?

I'm not sure if you mean a driver in Proxmox or for the GPU, but what I use is the standard Grid K2 driver for Windows 10 from Nvidia's site:
1613047830932.png

It works well on the first vGPU on both cards, but not on the second (at least not on the first card I tried, where I got Error code 43).

IOMMU groups

I might be wrong, but I believe the IOMMU groups are okay? I can't see any devices that share the same group number.
1613046465244.png
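As a way to double-check the groups without reading a screenshot, here is a minimal sketch that walks the sysfs tree and reports which groups hold more than one device. It assumes the usual Linux layout under `/sys/kernel/iommu_groups/<group>/devices/`; the function names are made up for illustration.

```python
from pathlib import Path


def iommu_groups(root="/sys/kernel/iommu_groups"):
    """Map each IOMMU group number to the sorted device names in it."""
    groups = {}
    base = Path(root)
    if not base.is_dir():
        # IOMMU disabled or not a Linux host: nothing to report.
        return groups
    for group_dir in base.iterdir():
        devices_dir = group_dir / "devices"
        if group_dir.name.isdigit() and devices_dir.is_dir():
            groups[int(group_dir.name)] = sorted(p.name for p in devices_dir.iterdir())
    return groups


def shared_groups(groups):
    """Return only the groups that contain more than one device."""
    return {g: devs for g, devs in groups.items() if len(devs) > 1}
```

Calling `shared_groups(iommu_groups())` on the host lists the groups worth a closer look. Keep in mind that with an ACS override on the kernel command line the groups can look cleanly separated even when the hardware does not actually isolate the devices, so a clean result here is not proof that passthrough is safe.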
 
OK, the part of the kernel command line that says 'pcie_acs_override' is potentially the cause.

It splits the devices into different IOMMU groups, even if the hardware would put them together. While that works sometimes, it can lead to instability and crashes if the hardware is not meant to be used that way, so that may be the cause here.

The errors that you had when enabling IOMMU can also indicate that the hardware does not properly support or implement those features.

Maybe there is a BIOS upgrade that can help, but it may well be that this is not possible with this hardware.
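To confirm whether the override is active on a running host, the kernel command line can be inspected. A small sketch, assuming the text comes from `/proc/cmdline`; `acs_override_value` is a hypothetical helper name.

```python
def acs_override_value(cmdline):
    """Return the pcie_acs_override value from a kernel command line, or None."""
    for token in cmdline.split():
        key, _, value = token.partition("=")
        if key == "pcie_acs_override":
            return value
    return None
```

On the host this could be called as `acs_override_value(open("/proc/cmdline").read())`. If it returns something other than `None`, the IOMMU groups shown by the kernel are being split artificially, and removing the option from the GRUB file (followed by `update-grub` and a reboot) shows the grouping the hardware really provides.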
 
The BIOS is up to date; I updated it when I ran into that boot problem before. I have some old spare computers, so I will see if any of them can run the cards. I will post again in the thread when I have tested it :)
 
@dcsapak Okay, so I have tried the cards in 2 separate computers and the results are... interesting?

When I tried the first card, it worked, but the second vGPU still had Error code 43. Both vGPUs also displayed nothing but a green screen whenever I tried to connect using Parsec. After reinstalling the drivers on the first GPU and then rebooting, both vGPUs started to work fine without any driver issues and no green screens.

The second Grid card was similar. Its vGPUs didn't have Error 43, but they only displayed a green screen in Parsec. Then I reinstalled the drivers on them, and all of a sudden both vGPUs worked. Then I tried to combine both vGPUs in one machine, and it broke down again. I restarted that machine, and then it worked.

I'm not sure what to think of it at this point. I don't know if they work properly, or if I'm just lucky (or clumsy). I will do some more testing and see what happens, but this seems a bit more promising. Thanks for your help! (I might add more later as I explore these cards further.)
 
Hey :)

Did you have any success with your K2 cards?
I'm really interested too, for my home lab :D
 
