First of all, this thread is basically the aftermath of this one. Reading it isn't necessary, but it paints a good picture of how this issue behaves.
So I have two GPUs that I'm passing through to two VMs (usually one Windows and one Linux or macOS). Everything works fine when I start them separately, but when I start both VMs with both GPUs passed through, the PC feels sluggish for about 10 seconds, freezes for a bit, and then usually works fine afterwards. Sometimes it doesn't, though, and I have to hard-reset the whole PC (this usually results in corruption on my Windows machine; for example, programs like Notepad stop working).
However, the other day I managed to grab some logs for the two VMs from Proxmox before the host killed itself. What I found was pretty shocking.
For context, my Windows VM was already running when I started a Linux VM, both with GPUs passed through. The Windows VM froze for a bit, then became unresponsive except for the mouse cursor. I then stopped the Windows VM and tried to start it again. These are the logs for both VMs:
Windows VM:
Code:
WARNING: Couldn't find device with uuid lQfHF9-hw7e-bQz7-j359-Uf9V-r0Pp-rx03QY.
WARNING: VG pve is missing PV lQfHF9-hw7e-bQz7-j359-Uf9V-r0Pp-rx03QY (last written to /dev/nvme0n1).
WARNING: Couldn't find all devices for LV pve/data_tdata while checking used and assumed devices.
WARNING: Couldn't find all devices for LV pve/win_tmeta while checking used and assumed devices.
WARNING: Couldn't find all devices for LV pve/win_tdata while checking used and assumed devices.
WARNING: Couldn't find device with uuid lQfHF9-hw7e-bQz7-j359-Uf9V-r0Pp-rx03QY.
WARNING: VG pve is missing PV lQfHF9-hw7e-bQz7-j359-Uf9V-r0Pp-rx03QY (last written to /dev/nvme0n1).
WARNING: Couldn't find all devices for LV pve/data_tdata while checking used and assumed devices.
WARNING: Couldn't find all devices for LV pve/win_tmeta while checking used and assumed devices.
WARNING: Couldn't find all devices for LV pve/win_tdata while checking used and assumed devices.
/bin/swtpm exit with status 7:
TASK ERROR: start failed: command 'swtpm_setup --tpmstate file:///dev/pve/vm-100-disk-2 --createek --create-ek-cert --create-platform-cert --lock-nvram --config /etc/swtpm_setup.conf --runas 0 --not-overwrite --tpm2 --ecc' failed: exit code 1
Linux VM:
Code:
TASK ERROR: start failed: command '/usr/bin/kvm -id 103 -name 'aclinux,debug-threads=on' -no-shutdown -chardev 'socket,id=qmp,path=/var/run/qemu-server/103.qmp,server=on,wait=off' -mon 'chardev=qmp,mode=control' -chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5' -mon 'chardev=qmp-event,mode=control' -pidfile /var/run/qemu-server/103.pid -daemonize -smbios 'type=1,uuid=5d1b1873-8c6a-4553-b029-124420dffdf4' -drive 'if=pflash,unit=0,format=raw,readonly=on,file=/usr/share/pve-edk2-firmware//OVMF_CODE_4M.secboot.fd' -drive 'if=pflash,unit=1,id=drive-efidisk0,format=raw,file=/dev/pve/vm-103-disk-0,size=540672' -smp '8,sockets=1,cores=8,maxcpus=8' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vga none -nographic -cpu 'host,kvm=off,+kvm_pv_eoi,+kvm_pv_unhalt' -m 8124 -object 'iothread,id=iothread-virtioscsi0' -object 'iothread,id=iothread-virtioscsi2' -readconfig /usr/share/qemu-server/pve-q35-4.0.cfg -device 'vmgenid,guid=fd72db77-b9c1-4576-b409-d55b03b7f41c' -device 'nec-usb-xhci,id=xhci,bus=pci.1,addr=0x1b' -device 'usb-tablet,id=tablet,bus=ehci.0,port=1' -device 'vfio-pci,host=0000:23:00.0,id=hostpci1.0,bus=ich9-pcie-port-2,addr=0x0.0,multifunction=on' -device 'vfio-pci,host=0000:23:00.1,id=hostpci1.1,bus=ich9-pcie-port-2,addr=0x0.1' -device 'usb-host,bus=xhci.0,hostbus=5,hostport=1,id=usb0' -device 'usb-host,bus=xhci.0,hostbus=5,hostport=2,id=usb1' -chardev 'socket,path=/var/run/qemu-server/103.qga,server=on,wait=off,id=qga0' -device 'virtio-serial,id=qga0,bus=pci.0,addr=0x8' -device 'virtserialport,chardev=qga0,name=org.qemu.guest_agent.0' -device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:acca3b9b5162' -device 'virtio-scsi-pci,id=virtioscsi0,bus=pci.3,addr=0x1,iothread=iothread-virtioscsi0' -drive 'file=/dev/pve/vm-103-disk-1,if=none,id=drive-scsi0,format=raw,cache=none,aio=io_uring,detect-zeroes=on' -device 'scsi-hd,bus=virtioscsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,rotation_rate=1,bootindex=100' -device 'virtio-scsi-pci,id=virtioscsi2,bus=pci.3,addr=0x3,iothread=iothread-virtioscsi2' -drive 'file=/dev/pve/vm-103-disk-2,if=none,id=drive-scsi2,format=raw,cache=none,aio=io_uring,detect-zeroes=on' -device 'scsi-hd,bus=virtioscsi2.0,channel=0,scsi-id=0,lun=2,drive=drive-scsi2,id=scsi2,rotation_rate=1' -netdev 'type=tap,id=net0,ifname=tap103i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac=BC:24:11:38:F9:F1,netdev=net0,bus=pci.0,addr=0x12,id=net0' -machine 'type=pc-q35-6.1+pve0'' failed: got timeout
The UUID lQfHF9-hw7e-bQz7-j359-Uf9V-r0Pp-rx03QY is my 1TB ADATA SWORDFISH SSD, which is in an LVM setup together with another 500GB SSD. It looks like the drive somehow got disconnected. That would explain the corruption in the Windows VM and the fact that my whole PC can't function after that. This looks to be a hardware problem more than a problem with Proxmox itself.
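For anyone who wants to double-check this on their own setup, here is roughly how you can confirm which device that UUID belongs to and whether the kernel actually saw the drive drop off the bus (standard LVM and kernel-log commands; the grep patterns are just a starting point):
Code:
# Map PV UUIDs to device nodes (the missing PV shows up as [unknown]):
pvs -o pv_name,pv_uuid,vg_name

# Look for NVMe controller resets/timeouts around the time of the freeze:
journalctl -k -b | grep -iE 'nvme|reset|timeout'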
So based on all that I have two theories, but sadly neither is cheap or fast to fix. My motherboard isn't the best; it's an MSI MPG X570 GAMING PLUS, where only the primary PCIe x16 slot is wired directly to the CPU and the second one runs through the chipset. So I think it might have to do with the PCIe lanes somehow being oversubscribed, with the SSD becoming the victim of that and disconnecting.
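A cheap first step before swapping hardware would be to map out which devices hang off the CPU versus the chipset, and whether the GPUs and the NVMe drives share a bridge or an IOMMU group (the loop below is the usual generic snippet, nothing specific to my board):
Code:
# Show the PCIe tree: which root port each GPU and NVMe drive sits behind
lspci -tv

# List IOMMU groups; a GPU sharing a group with an NVMe controller is a red flag
for g in /sys/kernel/iommu_groups/*; do
    echo "IOMMU group ${g##*/}:"
    for d in "$g"/devices/*; do
        lspci -nns "${d##*/}"
    done
done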
The other theory is that my power supply somehow can't keep up with all the parts powering up like this. But to be honest that seems pretty unlikely, because my components don't really come close to the power supply's capacity even when running at their maximum TDP (which they are not).
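Back-of-the-envelope, that checks out. Using the usual stock power limits for these parts (the figures below are assumptions from memory, so treat this as a rough sanity check, not a measurement):
Code:
# 5800X PPT ~142 W + RX 6800 TBP ~250 W + RX 540 ~50 W
# + drives, fans and motherboard overhead ~60 W (assumed):
echo $((142 + 250 + 50 + 60))   # -> 502, well under the 750 W rating

That said, modern GPUs can have transient spikes well above their rated board power, so a marginal PSU isn't completely off the table even with headroom on paper.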
For reference, my PC specs are:
Code:
CPU: Ryzen 7 5800X
1st GPU: AMD RX 6800
2nd GPU: AMD RX 540
2 NVMe SSDs (one 1TB, the other 500GB)
Motherboard: MSI MPG X570 GAMING PLUS
Power Supply: Corsair CX750, 80+ Bronze, 750W
And a bunch of other HDDs
Also, in case it's needed, here are my two VM configs:
Windows VM:
Code:
agent: 1
bios: ovmf
boot: order=scsi0;ide0;ide2;net0
cores: 16
cpu: host
efidisk0: local-win:vm-100-disk-0,efitype=4m,pre-enrolled-keys=1,size=4M
hostpci0: 0000:2f:00,pcie=1,x-vga=1
hostpci1: 0000:25:00,pcie=1
ide0: local:iso/virtio-win-0.1.240.iso,media=cdrom,size=612812K
ide2: local:iso/Win11_23H2_English_x64v2.iso,media=cdrom,size=6653034K
machine: pc-q35-8.1
memory: 16000
meta: creation-qemu=8.1.2,ctime=1705771704
name: winmox
net0: virtio=BC:24:11:F5:76:FA,bridge=vmbr0,firewall=1
numa: 0
onboot: 1
ostype: win11
scsi0: local-win:vm-100-disk-1,iothread=1,size=740G,ssd=1
scsi1: /dev/disk/by-id/ata-WDC_WD10EZEX-22MFCA0_WD-WCC6Y7VSS6SE-part1,backup=0,size=953867M
scsi2: /dev/disk/by-id/ata-WDC_WD5000AAKX-00ERMA0_WD-WCC2EKX98642-part1,backup=0,size=476938M
scsihw: virtio-scsi-single
smbios1: uuid=d852b726-c2ff-42aa-8830-9cae8dfd2acd
sockets: 1
tpmstate0: local-win:vm-100-disk-2,size=4M,version=v2.0
vmgenid: d22d03bd-575c-4c39-8e89-8e36dc167ad2
Linux VM:
Code:
agent: 1
bios: ovmf
boot: order=scsi0
cores: 8
cpu: host
efidisk0: local-lvm:vm-103-disk-0,efitype=4m,pre-enrolled-keys=1,size=4M
hostpci0: 0000:23:00,pcie=1,x-vga=1
machine: pc-q35-6.1
memory: 8124
meta: creation-qemu=8.1.2,ctime=1706486036
name: aclinux
net0: virtio=BC:24:11:38:F9:F1,bridge=vmbr0,firewall=1
numa: 0
ostype: l26
scsi0: local-lvm:vm-103-disk-1,iothread=1,size=100G,ssd=1
scsi2: local-lvm:vm-103-disk-2,iothread=1,size=100G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=5d1b1873-8c6a-4553-b029-124420dffdf4
sockets: 1
usb0: host=5-1,usb3=1
usb1: host=5-2,usb3=1
vmgenid: fd72db77-b9c1-4576-b409-d55b03b7f41c
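In case it helps with sanity-checking the passthrough layout, the hostpci addresses from those configs can be resolved to the actual devices and their bound drivers like this (bus addresses taken from the configs above):
Code:
lspci -nnk -s 2f:00   # Windows VM GPU (hostpci0)
lspci -nnk -s 25:00   # Windows VM second passthrough device (hostpci1)
lspci -nnk -s 23:00   # Linux VM GPU (hostpci0)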
And the lsblk:
Code:
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
sda 8:0 0 931.5G 0 disk
└─sda1 8:1 0 931.5G 0 part
sdb 8:16 0 465.8G 0 disk
└─sdb1 8:17 0 465.8G 0 part
sdc 8:32 0 232.9G 0 disk
├─sdc1 8:33 0 100M 0 part
├─sdc2 8:34 0 16M 0 part
├─sdc3 8:35 0 232.2G 0 part
└─sdc4 8:36 0 591M 0 part
nvme1n1 259:0 0 476.9G 0 disk
├─nvme1n1p1 259:1 0 1007K 0 part
├─nvme1n1p2 259:2 0 1G 0 part /boot/efi
└─nvme1n1p3 259:3 0 475.9G 0 part
├─pve-swap 252:0 0 14G 0 lvm [SWAP]
├─pve-root 252:1 0 96G 0 lvm /
├─pve-data_tmeta 252:2 0 3.5G 0 lvm
│ └─pve-data-tpool 252:4 0 520.5G 0 lvm
│ ├─pve-data 252:5 0 520.5G 1 lvm
│ ├─pve-vm--101--disk--0 252:13 0 4M 0 lvm
│ ├─pve-vm--101--disk--1 252:14 0 300G 0 lvm
│ ├─pve-vm--103--disk--0 252:15 0 4M 0 lvm
│ ├─pve-vm--103--disk--1 252:16 0 100G 0 lvm
│ ├─pve-vm--103--disk--2 252:17 0 100G 0 lvm
│ └─pve-vm--104--disk--0 252:18 0 30G 0 lvm
└─pve-data_tdata 252:3 0 520.5G 0 lvm
└─pve-data-tpool 252:4 0 520.5G 0 lvm
├─pve-data 252:5 0 520.5G 1 lvm
├─pve-vm--101--disk--0 252:13 0 4M 0 lvm
├─pve-vm--101--disk--1 252:14 0 300G 0 lvm
├─pve-vm--103--disk--0 252:15 0 4M 0 lvm
├─pve-vm--103--disk--1 252:16 0 100G 0 lvm
├─pve-vm--103--disk--2 252:17 0 100G 0 lvm
└─pve-vm--104--disk--0 252:18 0 30G 0 lvm
nvme0n1 259:4 0 931.5G 0 disk
├─pve-data_tdata 252:3 0 520.5G 0 lvm
│ └─pve-data-tpool 252:4 0 520.5G 0 lvm
│ ├─pve-data 252:5 0 520.5G 1 lvm
│ ├─pve-vm--101--disk--0 252:13 0 4M 0 lvm
│ ├─pve-vm--101--disk--1 252:14 0 300G 0 lvm
│ ├─pve-vm--103--disk--0 252:15 0 4M 0 lvm
│ ├─pve-vm--103--disk--1 252:16 0 100G 0 lvm
│ ├─pve-vm--103--disk--2 252:17 0 100G 0 lvm
│ └─pve-vm--104--disk--0 252:18 0 30G 0 lvm
├─pve-win_tmeta 252:6 0 3.1G 0 lvm
│ └─pve-win-tpool 252:8 0 760G 0 lvm
│ ├─pve-win 252:9 0 760G 1 lvm
│ ├─pve-vm--100--disk--0 252:10 0 4M 0 lvm
│ ├─pve-vm--100--disk--2 252:11 0 4M 0 lvm
│ └─pve-vm--100--disk--1 252:12 0 740G 0 lvm
└─pve-win_tdata 252:7 0 760G 0 lvm
└─pve-win-tpool 252:8 0 760G 0 lvm
├─pve-win 252:9 0 760G 1 lvm
├─pve-vm--100--disk--0 252:10 0 4M 0 lvm
├─pve-vm--100--disk--2 252:11 0 4M 0 lvm
└─pve-vm--100--disk--1 252:12 0 740G 0 lvm
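One thing worth pointing out in that lsblk: the pve-data thin pool (data_tdata) spans both NVMe drives, while the pve-win pool lives entirely on nvme0n1. So when nvme0n1 (the 1TB SWORDFISH) drops off the bus, it takes the whole Windows pool with it and leaves the shared data pool with a missing leg, which matches the "VG pve is missing PV" warnings above. Which physical devices back each logical volume can be confirmed with a standard LVM report:
Code:
# Show which devices back each LV (the 'devices' column is the interesting one):
lvs -a -o lv_name,vg_name,lv_size,devices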
So, what do you think the problem is? Should I get another motherboard and basically rebuild the whole PC to see how it works? Thanks!