Proxmox host crashes under heavy guest load

autumnlight

New Member
Jan 25, 2025
EDIT: It seems that excessive VRAM usage is what crashes everything; I managed to bring both GPUs to max utilization with no issues.

Basically, I have a single VM set up at the moment with dual 3090 passthrough. If both GPUs are under max load for a few seconds, the host crashes. (The upper image is the guest; the lower one, in black, is Proxmox.)
1738526951133.png
1738526985539.png

GPU temperature seems to spike at 80°C?
Guest is Fedora Kinoite, with the actual NVIDIA drivers.

The only errors I receive (guest):
1738527280822.png
 
Hello autumnlight! Could you please post:
  1. The hardware configuration of the host server, including power supply.
  2. The output of journalctl --since <TIME> beginning with a previous boot until the crash. Please include the journal for both the host and the VM in question. Please include the full journal and not only the errors.
  3. The configuration of the VM in question.
Also, please make sure that you followed the documentation on PCI(e) passthrough. Also, make sure to check out the wiki page on PCI(e) passthrough.
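For item 2, a minimal sketch of how the full journal of one boot could be saved to a file for attaching. The function names and the output file name are illustrative assumptions, not part of any Proxmox tooling; run it once on the host and once inside the VM.

```shell
#!/bin/sh
# Sketch: save the complete journal of one boot to a file.
# Assumption: run as root (or as a user allowed to read the system journal).

# Build an output file name such as "journal-host-boot-1.log".
# $1 = label for the machine (hypothetical, e.g. "host" or "guest")
# $2 = boot offset as understood by `journalctl -b` (0 = current, -1 = previous)
journal_filename() {
    printf 'journal-%s-boot%s.log\n' "$1" "$2"
}

# Write the whole journal of the selected boot, not only the errors.
save_boot_journal() {
    label=$1
    offset=${2:--1}
    out=$(journal_filename "$label" "$offset")
    # --no-pager prints everything at once instead of opening a pager.
    journalctl -b "$offset" --no-pager > "$out" && echo "wrote $out"
}

# Example usage (commented out so that sourcing this file has no side effects):
# journalctl --list-boots      # shows which boot offsets exist
# save_boot_journal host -1    # journal of the boot before the current one
```

If the crash happened during the current boot instead, `journalctl --since <TIME> --no-pager` with a time shortly before the crash narrows the output in the same way.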
 
Motherboard: TRX40 Creator, BIOS version 1.86 (latest stable release)
CPU: Threadripper 3960X
GPU: 2x 3090 (one by EVGA, one by some other brand)
PSU: Thermaltake Toughpower GF3 1650W
Memory: 256GB of DDR4
Disk configuration (nvme0n1 is my old OS, everything else is Proxmox):
1738637247547.png

Logs:
Host (previous boot): I caused the crash at the end by logging into the VM and starting a tool with high VRAM usage. Before that, I successfully ran the Blender benchmark with each GPU. (I added the logs as file attachments due to limits.)
3. VM configuration:
```
agent: 1
args: -virtfs local,path=/aibox-standalone-pool/shared,security_model=none,mount_tag=aibox-standalone-pool
bios: ovmf
boot: order=scsi0;ide2;net0
cores: 20
cpu: x86-64-v2
efidisk0: aibox-replicate-pool:vm-100-disk-0,efitype=4m,pre-enrolled-keys=1,size=1M
hostpci0: mapping=gpu-3090-4,pcie=1
hostpci1: mapping=gpu-3090-3,pcie=1
ide2: local:iso/Fedora-Kinoite-ostree-x86_64-41-1.4.iso,media=cdrom,size=3664930K
machine: q35
memory: 131072
meta: creation-qemu=9.0.2,ctime=1738545049
name: ai-discord-vm
net0: virtio=BC:24:11:10:92:74,bridge=vmbr0,firewall=1,rate=7.5
numa: 0
onboot: 1
ostype: l26
scsi0: aibox-replicate-pool:vm-100-disk-1,iothread=1,size=120G
scsihw: virtio-scsi-single
smbios1: uuid=8b53dca3-4d5a-4f65-a7ed-e28b9913bac0
sockets: 1
vmgenid: a7f82373-ce62-47a7-ad5b-188179bb5f7e
```

I am going through the wiki right now to validate that I have actually done everything, but I should have. (I followed some other guides and asked a bunch of people for help.)

The benchmarks: 100% GPU utilization is no issue. The only problem is that going above roughly 60-70% VRAM usage causes an instant crash.
1738639099810.png
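To pin down the VRAM level at which the crash happens, something like the following could be run inside the guest while reproducing it. This is only a sketch: it assumes `nvidia-smi` from the NVIDIA driver is available in the guest, and the function names and log file name are made up for illustration.

```shell
#!/bin/sh
# Sketch: log VRAM usage as a percentage, once per second, for every GPU.

# Convert "used, total" pairs (in MiB, as printed by nvidia-smi with
# --format=csv,noheader,nounits) into whole-number percentages.
vram_percent() {
    awk -F', *' '$2 > 0 { printf "%d\n", ($1 * 100) / $2 }'
}

# Append one percentage per GPU to a log file every second.
# $1 = log file (hypothetical default: vram.log)
log_vram() {
    logfile=${1:-vram.log}
    while :; do
        nvidia-smi --query-gpu=memory.used,memory.total \
                   --format=csv,noheader,nounits | vram_percent >> "$logfile"
        # Flush to disk so the last samples are more likely to survive a crash.
        sync
        sleep 1
    done
}

# Example usage (commented out so that sourcing this file has no side effects):
# log_vram /var/tmp/vram.log
```

The last lines of the log before the host goes down then give a rough upper bound on the VRAM usage that was still stable.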

Scores:
https://opendata.blender.org/benchmarks/25c93ec2-fafe-4b92-ab21-588148b7cf1d/
https://opendata.blender.org/benchmarks/5ffacb60-1d78-4526-aba5-ac0e3b35d195/
 

Attachments

(Also, something I just learned: to get the logs of a specific boot, we can run `journalctl -b -1 --no-pager`, where `-1` selects the previous boot.)
 
Here is the output of the following commands:
dmesg | grep -e DMAR -e IOMMU -e AMD-Vi
pvesh get /nodes/{nodename}/hardware/pci --pci-class-blacklist ""
lspci -nnk
cat /etc/default/grub
cat /etc/modprobe.d/vfio.conf
 

Attachments