Two identical GPUs passed through to two different VMs

Thatoo

Member
Jun 11, 2021
Hello,

I have two identical GPUs (NVIDIA GTX 1050 Ti).
I managed to pass each GPU through to a VM by following these guides:
https://pve.proxmox.com/wiki/PCI(e)_Passthrough
https://pve.proxmox.com/wiki/Pci_passthrough
https://gist.github.com/qubidt/64f617e959725e934992b080e677656f

# cat /etc/modprobe.d/vfio.conf
options vfio-pci ids=10de:1c82,10de:0fb9 disable_vga=1

and I added the PCI devices to each VM like this:

[screenshots: the PCI device entry in each VM's hardware configuration]

Both VMs work nicely with the GPU I assign to them, but not simultaneously.
If one VM is running and I start the second one, the first one stops immediately.
Is it because both GPUs have the same vendor and device IDs, and I created only one file, /etc/modprobe.d/vfio.conf, with only this content:
options vfio-pci ids=10de:1c82,10de:0fb9 disable_vga=1
?
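For what it's worth, one way to check that both cards (and their HDMI audio functions) really ended up on vfio-pci is lspci. Below is a sketch with an illustrative sample line for one GPU (the device IDs are the ones from the vfio.conf above; the exact lspci formatting may differ on your system):

```shell
# Real check on the host:
#   lspci -nnk -d 10de:1c82   # both GPUs
#   lspci -nnk -d 10de:0fb9   # both HDMI audio functions
# Each device should report "Kernel driver in use: vfio-pci".
# Sample of what a correctly bound GPU looks like:
sample='01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP107 [GeForce GTX 1050 Ti] [10de:1c82]
	Kernel driver in use: vfio-pci'
echo "$sample" | grep -q 'Kernel driver in use: vfio-pci' && echo 'bound to vfio-pci'
```

If one of the four functions shows nouveau or nvidia instead, that device was claimed by the host driver before vfio-pci.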

I have modified GRUB as little as possible for now:
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt pci=noaer"

The GPUs are isolated in different IOMMU groups, group 14 and group 18.

root@dsqf:~# find /sys/kernel/iommu_groups/ -type l
/sys/kernel/iommu_groups/17/devices/0000:04:00.0
/sys/kernel/iommu_groups/7/devices/0000:00:16.0
/sys/kernel/iommu_groups/15/devices/0000:02:00.0
/sys/kernel/iommu_groups/5/devices/10000:e0:17.0
/sys/kernel/iommu_groups/5/devices/0000:00:0e.0
/sys/kernel/iommu_groups/13/devices/0000:00:1f.0
/sys/kernel/iommu_groups/13/devices/0000:00:1f.5
/sys/kernel/iommu_groups/13/devices/0000:00:1f.3
/sys/kernel/iommu_groups/13/devices/0000:00:1f.4
/sys/kernel/iommu_groups/3/devices/0000:00:08.0
/sys/kernel/iommu_groups/11/devices/0000:00:1c.3
/sys/kernel/iommu_groups/1/devices/0000:00:01.0
/sys/kernel/iommu_groups/18/devices/0000:05:00.1
/sys/kernel/iommu_groups/18/devices/0000:05:00.0
/sys/kernel/iommu_groups/8/devices/0000:00:17.0
/sys/kernel/iommu_groups/16/devices/0000:03:00.0
/sys/kernel/iommu_groups/6/devices/0000:00:14.2
/sys/kernel/iommu_groups/6/devices/0000:00:14.0
/sys/kernel/iommu_groups/14/devices/0000:01:00.0
/sys/kernel/iommu_groups/14/devices/0000:01:00.1
/sys/kernel/iommu_groups/4/devices/0000:00:0a.0
/sys/kernel/iommu_groups/12/devices/0000:00:1c.4
/sys/kernel/iommu_groups/2/devices/0000:00:02.0
/sys/kernel/iommu_groups/10/devices/0000:00:1c.2
/sys/kernel/iommu_groups/0/devices/0000:00:00.0
/sys/kernel/iommu_groups/9/devices/0000:00:1c.0

Thank you in advance for your help.
Best regards,
Thatoo
 
It's OK, the problem was not coming from the GPU passthrough but from the amount of RAM dedicated to each VM...
 
Because PCI(e) devices can DMA to any part of the VM memory at any time, all of it must be pinned into actual RAM when using passthrough. Therefore, ballooning and KSM also won't reclaim memory. I guess you noticed messages in journalctl that the OOM killer killed the process with the most memory in use, which was the other VM?
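A quick way to look for those messages (a sketch; the journalctl line is the actual check, and the sample line only illustrates what a hit looks like, with made-up numbers):

```shell
# On the host, the real check would be:
#   journalctl -k | grep -iE 'out of memory|oom-killer'
# Sample kernel line illustrating an OOM kill of a VM process (values are invented):
sample='Out of memory: Killed process 1234 (kvm) total-vm:20971520kB, anon-rss:19922944kB'
echo "$sample" | grep -iE 'out of memory|oom-killer'
```

A `(kvm)` process being killed is the QEMU process backing one of the VMs, which matches the "first VM stops immediately" symptom.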
 
I'm not sure I understand all the fine technical details you wrote, but I think I got the overall meaning, and I thank you for this explanation.
 
Sorry to come back to you, @leesteken, but even if I believe I understood the problem, I could not find any solution...

I ended up starting only a single VM with 20 GB of RAM out of 32 GB (to make sure Proxmox would not conflict with the VM), but the VM still stops suddenly from time to time (I stopped all the other VMs and CTs, but still)...

I can forget about the idea of having two VMs with GPU passthrough, but I really need at least one VM with GPU passthrough and 20 GB that stays alive all the time, while still being able to allocate some RAM to other CTs.

Is there any solution?
I don't know what information I could give to be clearer.
 
Do you see OOM errors in the syslog (or via the journalctl command)? If so, you are still using too much memory in total.
How many other VMs and CTs are running? How much memory do they use (and what is their maximum memory)? Are you also using ZFS (whose ARC can take up to 50% of your memory by default)?
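To gather those numbers, something like the following could be run on the Proxmox host (a sketch; the arcstats file only exists when the ZFS module is loaded):

```shell
# Configured memory and status of all VMs (qm list includes a MEM column)
qm list
# Status of all containers
pct list
# Overall host memory usage
free -h
# Current ZFS ARC size in bytes (only meaningful when ZFS is in use)
awk '/^size/ {print "ARC size (bytes):", $3}' /proc/spl/kstat/zfs/arcstats
```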
 
Thank you for your quick answer and for trying to help me.

The computer has 32 GB of memory.
The single VM (with GPU passthrough, running Windows 11) has 20 GB. No other VMs or CTs are currently running. However, Proxmox displays 95% RAM usage, and it uses 8.9 GB when all VMs are stopped. That sounds like a lot to me; I hadn't realized that.
In the Proxmox shell, grep OoM /var/log/syslog doesn't give me any result.
Proxmox uses ZFS, and I didn't know about this 50%-of-memory issue. I thought ZFS was the recommended filesystem for Proxmox. What is better?
 
ZFS has very nice features, and its CoW nature helps with snapshots and backups. You might want to read the manual about setting limits for the ZFS ARC. I don't know what the perfect numbers would be, but I suggest trying a minimum of 1073741824 bytes and a maximum of 3221225472 bytes for your system (the maximum is about 10% of your memory).
If you want less overhead and fewer features/protections, you could reinstall Proxmox with LVM. But setting limits might be good enough.
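A sketch of how those limits could be applied (the file name is conventional; the values are the ones suggested above):

```shell
# /etc/modprobe.d/zfs.conf -- cap the ZFS ARC at module load
# zfs_arc_min = 1 GiB, zfs_arc_max = 3 GiB (about 10% of 32 GB)
options zfs zfs_arc_min=1073741824 zfs_arc_max=3221225472
```

After editing the file, run `update-initramfs -u` and reboot; the new maximum can also be applied to the running system by writing the value to /sys/module/zfs/parameters/zfs_arc_max.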
I can understand that you want to run one VM for the nice backup and snapshot features of Proxmox, but in general, using a server virtualization platform and giving 63% of the memory to a single VM is a bit uncommon. Maybe get more memory and run more VMs? Or install the VM on bare metal and skip Proxmox in between?
 
Well, the current setup is for testing this GPU passthrough system.
The idea is, of course, to add CTs alongside this VM.
Here is what we want to have in the end:
- 1 VM with Windows and GPU passthrough: this VM is for us (we would connect through RDP) to use some specific software that runs only on Windows and requires significant resources (RAM and GPU), such as CAD software
- several Debian CTs for our website and collaborative tools such as Nextcloud and Matrix

We plan to increase the amount of RAM when possible, but for now we only have 32 GB.

We would need at least 16 GB for the Windows VM, so I wanted to test how much remains for the CTs I want to install (before installing them).

For that I started by giving 24 GB to the Windows VM (8 GB free) and then reduced it little by little, but even at 20 GB (12 GB free) it was still unstable (better, but still). That means (at least as I understand it) that I have at most 4 GB for CTs (20 − 16), and that Proxmox needs up to 12 GB (32 − 20), which really surprised me...

Ideally, until we can increase the amount of RAM, what I would like is to tell Proxmox that 16 GB (a bit more would be nice) are dedicated to this Windows VM and that the rest of the RAM is free for Proxmox and the CTs, but it doesn't seem to be possible. Unfortunately, the whole amount of RAM is managed by Proxmox, and this VM might find itself short of RAM at some point (if Proxmox uses a bit of it for itself or for other CTs) and thus get turned off. Am I right?

Could it be an improvement for Proxmox to offer the choice of reserving an amount of RAM for a specific VM that needs it? Of course, the best would be for such a VM (with GPU passthrough) not to be turned off like other VMs :) but I'm not sure how feasible that is.


I hope I managed to explain clearly what I'm trying to do and what I hope is possible.
 
