SSD disconnects when powering on 2 GPU passthrough VMs

Hiroi

First of all, this thread is basically the aftermath of this one. It's not necessary to read it, but it paints a good picture of how this issue behaves.
So I have 2 GPUs that I am passing through to 2 VMs (usually one Windows and one Linux or macOS). Everything works fine when I start them separately, but when I start the 2 VMs with both GPUs passed through, the PC feels sluggish for about 10 seconds, freezes for a bit, and then usually works fine. But sometimes it doesn't, and I have to hard-reset my whole PC (this usually results in corruption on my Windows machine; for example, programs like Notepad don't work anymore).

However, the other day I managed to grab some logs of the 2 VMs from Proxmox before it killed itself. What I found was pretty shocking.
For context, I had my Windows VM running and then started a Linux VM, both with GPUs passed through. The Windows VM froze for a bit and then became mostly unresponsive; I was only able to move my mouse. I then stopped the Windows VM and tried to start it again. These are the logs for both VMs:

Windows VM:
Code:
WARNING: Couldn't find device with uuid lQfHF9-hw7e-bQz7-j359-Uf9V-r0Pp-rx03QY.
  WARNING: VG pve is missing PV lQfHF9-hw7e-bQz7-j359-Uf9V-r0Pp-rx03QY (last written to /dev/nvme0n1).
  WARNING: Couldn't find all devices for LV pve/data_tdata while checking used and assumed devices.
  WARNING: Couldn't find all devices for LV pve/win_tmeta while checking used and assumed devices.
  WARNING: Couldn't find all devices for LV pve/win_tdata while checking used and assumed devices.
  WARNING: Couldn't find device with uuid lQfHF9-hw7e-bQz7-j359-Uf9V-r0Pp-rx03QY.
  WARNING: VG pve is missing PV lQfHF9-hw7e-bQz7-j359-Uf9V-r0Pp-rx03QY (last written to /dev/nvme0n1).
  WARNING: Couldn't find all devices for LV pve/data_tdata while checking used and assumed devices.
  WARNING: Couldn't find all devices for LV pve/win_tmeta while checking used and assumed devices.
  WARNING: Couldn't find all devices for LV pve/win_tdata while checking used and assumed devices.
/bin/swtpm exit with status 7:
TASK ERROR: start failed: command 'swtpm_setup --tpmstate file:///dev/pve/vm-100-disk-2 --createek --create-ek-cert --create-platform-cert --lock-nvram --config /etc/swtpm_setup.conf --runas 0 --not-overwrite --tpm2 --ecc' failed: exit code 1
Linux VM:
Code:
TASK ERROR: start failed: command '/usr/bin/kvm -id 103 -name 'aclinux,debug-threads=on' -no-shutdown -chardev 'socket,id=qmp,path=/var/run/qemu-server/103.qmp,server=on,wait=off' -mon 'chardev=qmp,mode=control' -chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5' -mon 'chardev=qmp-event,mode=control' -pidfile /var/run/qemu-server/103.pid -daemonize -smbios 'type=1,uuid=5d1b1873-8c6a-4553-b029-124420dffdf4' -drive 'if=pflash,unit=0,format=raw,readonly=on,file=/usr/share/pve-edk2-firmware//OVMF_CODE_4M.secboot.fd' -drive 'if=pflash,unit=1,id=drive-efidisk0,format=raw,file=/dev/pve/vm-103-disk-0,size=540672' -smp '8,sockets=1,cores=8,maxcpus=8' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vga none -nographic -cpu 'host,kvm=off,+kvm_pv_eoi,+kvm_pv_unhalt' -m 8124 -object 'iothread,id=iothread-virtioscsi0' -object 'iothread,id=iothread-virtioscsi2' -readconfig /usr/share/qemu-server/pve-q35-4.0.cfg -device 'vmgenid,guid=fd72db77-b9c1-4576-b409-d55b03b7f41c' -device 'nec-usb-xhci,id=xhci,bus=pci.1,addr=0x1b' -device 'usb-tablet,id=tablet,bus=ehci.0,port=1' -device 'vfio-pci,host=0000:23:00.0,id=hostpci1.0,bus=ich9-pcie-port-2,addr=0x0.0,multifunction=on' -device 'vfio-pci,host=0000:23:00.1,id=hostpci1.1,bus=ich9-pcie-port-2,addr=0x0.1' -device 'usb-host,bus=xhci.0,hostbus=5,hostport=1,id=usb0' -device 'usb-host,bus=xhci.0,hostbus=5,hostport=2,id=usb1' -chardev 'socket,path=/var/run/qemu-server/103.qga,server=on,wait=off,id=qga0' -device 'virtio-serial,id=qga0,bus=pci.0,addr=0x8' -device 'virtserialport,chardev=qga0,name=org.qemu.guest_agent.0' -device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:acca3b9b5162' -device 'virtio-scsi-pci,id=virtioscsi0,bus=pci.3,addr=0x1,iothread=iothread-virtioscsi0' -drive 'file=/dev/pve/vm-103-disk-1,if=none,id=drive-scsi0,format=raw,cache=none,aio=io_uring,detect-zeroes=on' -device 'scsi-hd,bus=virtioscsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,rotation_rate=1,bootindex=100' -device 'virtio-scsi-pci,id=virtioscsi2,bus=pci.3,addr=0x3,iothread=iothread-virtioscsi2' -drive 'file=/dev/pve/vm-103-disk-2,if=none,id=drive-scsi2,format=raw,cache=none,aio=io_uring,detect-zeroes=on' -device 'scsi-hd,bus=virtioscsi2.0,channel=0,scsi-id=0,lun=2,drive=drive-scsi2,id=scsi2,rotation_rate=1' -netdev 'type=tap,id=net0,ifname=tap103i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac=BC:24:11:38:F9:F1,netdev=net0,bus=pci.0,addr=0x12,id=net0' -machine 'type=pc-q35-6.1+pve0'' failed: got timeout

The UUID lQfHF9-hw7e-bQz7-j359-Uf9V-r0Pp-rx03QY belongs to my 1TB ADATA SWORDFISH SSD, which is in an LVM configuration together with another 500GB SSD. It looks like it somehow got disconnected. That would explain the corruption in the Windows VM and the fact that my whole PC can't function after that; even the swtpm error makes sense, since the TPM state disk vm-100-disk-2 lives on the thin pool backed by that missing PV (see the lsblk below). This looks more like a hardware problem than a problem with Proxmox itself.
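
For anyone who wants to cross-check this kind of thing, mapping a PV UUID back to a physical device and looking for the disconnect in the kernel log is straightforward (the grep pattern is just my guess at relevant keywords):

Code:
# Map PV UUIDs to devices and volume groups (run while the disk is still attached)
pvs -o pv_name,pv_uuid,vg_name

# Search the current boot's kernel log for NVMe drops/resets
journalctl -k -b | grep -iE 'nvme|reset|i/o error'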

So based on all that I have 2 theories, though sadly neither is cheap or fast to fix. My motherboard isn't the best; I just picked something that has its PCIe lanes connected to the CPU instead of one slot going through the chipset. It's an MSI MPG X570 GAMING PLUS. So I think maybe the PCIe lanes are somehow being oversubscribed, and the SSD is the victim of that, thus disconnecting.
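
A quick way to sanity-check that theory is to dump the PCIe tree and see which devices actually sit behind the X570 chipset (everything nested under the chipset's upstream port shares its single PCIe 4.0 x4 uplink to the CPU):

Code:
# Print the PCI(e) topology as a tree with device names
lspci -tv

# Check the negotiated link speed/width of a specific device
# (assuming 0000:2f:00.0 is still the RX 6800, per the VM config below)
lspci -vv -s 0000:2f:00.0 | grep -i lnksta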

The other theory is that my power supply somehow can't keep up with all the parts being powered up like this? But it seems like a pretty stupid idea to be honest, because my components don't really come close to the power supply's capacity even when they are running at max TDP (which they are not).
For reference, my PC specs are:
Code:
CPU: AMD Ryzen 7 5800X
1st GPU: AMD RX 6800
2nd GPU: AMD RX 540
Storage: 2 NVMe SSDs (one 1TB, the other 500GB), plus a bunch of other HDDs
Motherboard: MSI MPG X570 GAMING PLUS
Power supply: Corsair CX750, 80+ Bronze, 750W
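
And to put rough numbers behind why I think the PSU theory is a stretch (nominal spec-sheet figures, not measurements, and ignoring the millisecond transient spikes modern GPUs can produce):

Code:
# Ballpark sustained draw vs. the 750 W rating:
#   Ryzen 7 5800X       ~105 W TDP (up to ~142 W PPT)
#   RX 6800             ~250 W total board power
#   RX 540              ~ 50 W
#   2x NVMe + HDDs      ~ 30 W
#   Motherboard + fans  ~ 50 W
#   ---------------------------
#   Total               ~ 485-520 W, well under 750 W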

Also, if it's needed, here are my 2 VM configs:
Windows VM:
Code:
agent: 1
bios: ovmf
boot: order=scsi0;ide0;ide2;net0
cores: 16
cpu: host
efidisk0: local-win:vm-100-disk-0,efitype=4m,pre-enrolled-keys=1,size=4M
hostpci0: 0000:2f:00,pcie=1,x-vga=1
hostpci1: 0000:25:00,pcie=1
ide0: local:iso/virtio-win-0.1.240.iso,media=cdrom,size=612812K
ide2: local:iso/Win11_23H2_English_x64v2.iso,media=cdrom,size=6653034K
machine: pc-q35-8.1
memory: 16000
meta: creation-qemu=8.1.2,ctime=1705771704
name: winmox
net0: virtio=BC:24:11:F5:76:FA,bridge=vmbr0,firewall=1
numa: 0
onboot: 1
ostype: win11
scsi0: local-win:vm-100-disk-1,iothread=1,size=740G,ssd=1
scsi1: /dev/disk/by-id/ata-WDC_WD10EZEX-22MFCA0_WD-WCC6Y7VSS6SE-part1,backup=0,size=953867M
scsi2: /dev/disk/by-id/ata-WDC_WD5000AAKX-00ERMA0_WD-WCC2EKX98642-part1,backup=0,size=476938M
scsihw: virtio-scsi-single
smbios1: uuid=d852b726-c2ff-42aa-8830-9cae8dfd2acd
sockets: 1
tpmstate0: local-win:vm-100-disk-2,size=4M,version=v2.0
vmgenid: d22d03bd-575c-4c39-8e89-8e36dc167ad2
Linux VM:
Code:
agent: 1
bios: ovmf
boot: order=scsi0
cores: 8
cpu: host
efidisk0: local-lvm:vm-103-disk-0,efitype=4m,pre-enrolled-keys=1,size=4M
hostpci0: 0000:23:00,pcie=1,x-vga=1
machine: pc-q35-6.1
memory: 8124
meta: creation-qemu=8.1.2,ctime=1706486036
name: aclinux
net0: virtio=BC:24:11:38:F9:F1,bridge=vmbr0,firewall=1
numa: 0
ostype: l26
scsi0: local-lvm:vm-103-disk-1,iothread=1,size=100G,ssd=1
scsi2: local-lvm:vm-103-disk-2,iothread=1,size=100G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=5d1b1873-8c6a-4553-b029-124420dffdf4
sockets: 1
usb0: host=5-1,usb3=1
usb1: host=5-2,usb3=1
vmgenid: fd72db77-b9c1-4576-b409-d55b03b7f41c
And the lsblk output:
Code:
NAME                         MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
sda                            8:0    0 931.5G  0 disk
└─sda1                         8:1    0 931.5G  0 part
sdb                            8:16   0 465.8G  0 disk
└─sdb1                         8:17   0 465.8G  0 part
sdc                            8:32   0 232.9G  0 disk
├─sdc1                         8:33   0   100M  0 part
├─sdc2                         8:34   0    16M  0 part
├─sdc3                         8:35   0 232.2G  0 part
└─sdc4                         8:36   0   591M  0 part
nvme1n1                      259:0    0 476.9G  0 disk
├─nvme1n1p1                  259:1    0  1007K  0 part
├─nvme1n1p2                  259:2    0     1G  0 part /boot/efi
└─nvme1n1p3                  259:3    0 475.9G  0 part
  ├─pve-swap                 252:0    0    14G  0 lvm  [SWAP]
  ├─pve-root                 252:1    0    96G  0 lvm  /
  ├─pve-data_tmeta           252:2    0   3.5G  0 lvm 
  │ └─pve-data-tpool         252:4    0 520.5G  0 lvm 
  │   ├─pve-data             252:5    0 520.5G  1 lvm 
  │   ├─pve-vm--101--disk--0 252:13   0     4M  0 lvm 
  │   ├─pve-vm--101--disk--1 252:14   0   300G  0 lvm 
  │   ├─pve-vm--103--disk--0 252:15   0     4M  0 lvm 
  │   ├─pve-vm--103--disk--1 252:16   0   100G  0 lvm 
  │   ├─pve-vm--103--disk--2 252:17   0   100G  0 lvm 
  │   └─pve-vm--104--disk--0 252:18   0    30G  0 lvm 
  └─pve-data_tdata           252:3    0 520.5G  0 lvm 
    └─pve-data-tpool         252:4    0 520.5G  0 lvm 
      ├─pve-data             252:5    0 520.5G  1 lvm 
      ├─pve-vm--101--disk--0 252:13   0     4M  0 lvm 
      ├─pve-vm--101--disk--1 252:14   0   300G  0 lvm 
      ├─pve-vm--103--disk--0 252:15   0     4M  0 lvm 
      ├─pve-vm--103--disk--1 252:16   0   100G  0 lvm 
      ├─pve-vm--103--disk--2 252:17   0   100G  0 lvm 
      └─pve-vm--104--disk--0 252:18   0    30G  0 lvm 
nvme0n1                      259:4    0 931.5G  0 disk
├─pve-data_tdata             252:3    0 520.5G  0 lvm 
│ └─pve-data-tpool           252:4    0 520.5G  0 lvm 
│   ├─pve-data               252:5    0 520.5G  1 lvm 
│   ├─pve-vm--101--disk--0   252:13   0     4M  0 lvm 
│   ├─pve-vm--101--disk--1   252:14   0   300G  0 lvm 
│   ├─pve-vm--103--disk--0   252:15   0     4M  0 lvm 
│   ├─pve-vm--103--disk--1   252:16   0   100G  0 lvm 
│   ├─pve-vm--103--disk--2   252:17   0   100G  0 lvm 
│   └─pve-vm--104--disk--0   252:18   0    30G  0 lvm 
├─pve-win_tmeta              252:6    0   3.1G  0 lvm 
│ └─pve-win-tpool            252:8    0   760G  0 lvm 
│   ├─pve-win                252:9    0   760G  1 lvm 
│   ├─pve-vm--100--disk--0   252:10   0     4M  0 lvm 
│   ├─pve-vm--100--disk--2   252:11   0     4M  0 lvm 
│   └─pve-vm--100--disk--1   252:12   0   740G  0 lvm 
└─pve-win_tdata              252:7    0   760G  0 lvm 
  └─pve-win-tpool            252:8    0   760G  0 lvm 
    ├─pve-win                252:9    0   760G  1 lvm 
    ├─pve-vm--100--disk--0   252:10   0     4M  0 lvm 
    ├─pve-vm--100--disk--2   252:11   0     4M  0 lvm 
    └─pve-vm--100--disk--1   252:12   0   740G  0 lvm
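
One more thing that's probably worth ruling out before blaming hardware (just a hunch on my part): if the NVMe controller landed in the same IOMMU group as one of the passed-through GPUs, or behind a bridge that gets reset at VM start, the host could lose the SSD exactly like this. The groups can be listed straight from sysfs:

Code:
#!/bin/bash
# List every IOMMU group and the PCI devices it contains
for g in /sys/kernel/iommu_groups/*; do
  echo "IOMMU group ${g##*/}:"
  for d in "$g"/devices/*; do
    echo -n "  "
    lspci -nns "${d##*/}"
  done
done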


So what do you think the problem is? Should I get another motherboard and basically rebuild the whole PC to see how it works? Thanks!
 
