[SOLVED] Soft lockup after GPU passthrough

atylv

New Member
Apr 30, 2020
Dear All,

I am a new user of PVE 6.1-7 and also new to this forum.
I am using an Intel Xeon E-2244G CPU on a Supermicro X11SCL-IF motherboard. My only Win10 VM is configured with 8 CPU cores, 16 GB of memory, and a passed-through GTX 1660 Super GPU.

In short, once I start my VM and then shut it down, I get the following bug (it does not happen if I never start the VM, or start it and never shut it down):
watchdog: BUG: soft lockup - CPU#5 stuck for 22s [kvm:1295]
rcu: INFO: rcu_sched self-detected stall on CPU
rcu: 5-....: (15000 ticks this GP) idle=46a/1/0x4000000000000002 softirq=48625/48625 fqs=6276

I suspect it could be related to my GPU drivers, but I am not sure. The reason is that this bug only started occurring after I successfully passed through the GPU and installed the GPU driver in the Win10 VM. However, I don't know what to do next. Has anyone had the same problem and found a solution?

Thank you all!

Yours,
Harold
 
Passthrough related bugs are often tricky to debug, and just based on the log snippet you posted I don't think I've seen this specific issue before.

In general, here are some tips that can help with stability:
  • Use a specific 'romfile' for your GPU (see the config sketch below this list)
  • Boot in UEFI mode, disable legacy support in BIOS
  • Put the GPU into a different PCIe slot
  • Make sure your BIOS is up-to-date - the BIOS actually plays quite a big role in passthrough, so this is usually a good idea in general
Make sure to thoroughly read through our guide as well: https://pve.proxmox.com/wiki/Pci_passthrough
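
To make the 'romfile' tip concrete, here is a minimal sketch of what it could look like, assuming the vBIOS has already been dumped or downloaded (the 01:00 address and the file name are examples only; the ROM file has to live in /usr/share/kvm/):

Code:
# copy the ROM to the directory qemu-server expects (example file name)
cp gtx1660s.rom /usr/share/kvm/
# then reference it on the hostpci line in /etc/pve/qemu-server/<vmid>.conf
hostpci0: 01:00,pcie=1,romfile=gtx1660s.rom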

More logs and general information would be necessary (e.g. 'pveversion -v', 'dmesg', 'journalctl -e', '/etc/pve/qemu-server/<vmid>.conf', etc.) to say anything specific about your issue.
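
As a rough sketch, that information could be gathered along these lines on the host (replace <vmid> with the VM in question):

Code:
pveversion -v                           # package and kernel versions
dmesg                                   # kernel ring buffer (lockup traces land here)
journalctl -e                           # end of the system journal
cat /etc/pve/qemu-server/<vmid>.conf    # the VM configuration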
 
I believe I may be seeing a similar issue. Upon rebooting a guest Windows VM with GPU passthrough, I sometimes see the host lock up with soft lockup and hard lockup messages. Nothing is recorded in any log file that I can find, though. Each time, I have rebooted the host to restore operation. Beyond noting that it has happened both times while I was rebooting VM1 with VM2 running, I haven't been able to reproduce the issue. Only VM1 and VM2 were running the last time this happened.

I'm a bit in over my head on this one. We obviously need more information on what exactly is failing, and since the logs are useless, I think I'd need to enable a backtrace or crash dump of some sort and then wait for the issue to reoccur. Or maybe I'm rebooting the host too soon, before the console spits out the info we need?
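
As a sketch of one possible approach (not something I have verified on this exact setup): the kernel can be told to panic when it detects a lockup, and kdump-tools, a standard Debian package, can then write a crash dump to analyse after the reboot:

Code:
# install the crash dump tooling
apt install kdump-tools

# panic on soft/hard lockups so the state is captured instead of the host limping on
cat > /etc/sysctl.d/90-lockup-debug.conf <<'EOF'
kernel.softlockup_panic = 1
kernel.hardlockup_panic = 1
EOF
sysctl --system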

I've been using GPU passthrough for a while and haven't had this issue before, so my best (uneducated) guess is that these recent changes could be relevant:
  • Now using virtio-win-0.1.173 for guest drivers and guest agent
  • Now using additional CPU flags: +spec-ctrl;+ssbd;+hv-tlbflush;+aes
  • Now using Windows 10 1909 vs Windows 10 1809
  • Probably using a newer kernel version after running updates recently? (a quick way to check is sketched below)
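
To check that last point from the host, assuming the standard apt logs are still in place, something like this should do:

Code:
uname -r                                      # kernel currently running
dpkg -l 'pve-kernel-*' | grep ^ii             # kernels installed
grep -i pve-kernel /var/log/apt/history.log   # recent kernel package changes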

Screenshot of the host console before I reset it to restore operation (attachment: crash 1 crop.png)

Code:
root@proxmox-8x2080Ti:~# pveversion -v
proxmox-ve: 6.1-2 (running kernel: 5.3.18-3-pve)
pve-manager: 6.1-11 (running version: 6.1-11/f2f18736)
pve-kernel-helper: 6.1-9
pve-kernel-5.3: 6.1-6
pve-kernel-5.0: 6.0-11
pve-kernel-4.15: 5.4-9
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.18-2-pve: 5.3.18-2
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-4.15.18-21-pve: 4.15.18-48
pve-kernel-4.15.18-12-pve: 4.15.18-36
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.15-pve1
libproxmox-acme-perl: 1.0.2
libpve-access-control: 6.0-7
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.1-1
libpve-guest-common-perl: 3.0-10
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-7
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.2-1
lxcfs: 4.0.3-pve2
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.1-6
pve-cluster: 6.1-8
pve-container: 3.1-4
pve-docs: 6.1-6
pve-edk2-firmware: 2.20200229-1
pve-firewall: 4.1-2
pve-firmware: 3.0-7
pve-ha-manager: 3.0-9
pve-i18n: 2.1-1
pve-qemu-kvm: 4.1.1-4
pve-xtermjs: 4.3.0-1
qemu-server: 6.1-20
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.3-pve1

Code:
May 10 11:31:00 proxmox-8x2080Ti systemd[1]: Starting Proxmox VE replication runner...
May 10 11:31:01 proxmox-8x2080Ti systemd[1]: pvesr.service: Succeeded.
May 10 11:31:01 proxmox-8x2080Ti systemd[1]: Started Proxmox VE replication runner.
May 10 11:32:00 proxmox-8x2080Ti systemd[1]: Starting Proxmox VE replication runner...
May 10 11:32:01 proxmox-8x2080Ti systemd[1]: pvesr.service: Succeeded.
May 10 11:32:01 proxmox-8x2080Ti systemd[1]: Started Proxmox VE replication runner.
May 10 11:33:00 proxmox-8x2080Ti systemd[1]: Starting Proxmox VE replication runner...
May 10 11:33:01 proxmox-8x2080Ti systemd[1]: pvesr.service: Succeeded.
May 10 11:33:01 proxmox-8x2080Ti systemd[1]: Started Proxmox VE replication runner.
May 10 11:33:13 proxmox-8x2080Ti pvedaemon[45689]: <root@pam> successful auth for user 'root@pam'
May 10 11:34:00 proxmox-8x2080Ti systemd[1]: Starting Proxmox VE replication runner...
May 10 11:34:01 proxmox-8x2080Ti systemd[1]: pvesr.service: Succeeded.
May 10 11:34:01 proxmox-8x2080Ti systemd[1]: Started Proxmox VE replication runner.
May 10 11:35:00 proxmox-8x2080Ti systemd[1]: Starting Proxmox VE replication runner...
May 10 11:35:01 proxmox-8x2080Ti systemd[1]: pvesr.service: Succeeded.
May 10 11:35:01 proxmox-8x2080Ti systemd[1]: Started Proxmox VE replication runner.
May 10 11:35:29 proxmox-8x2080Ti pvedaemon[45689]: <root@pam> successful auth for user 'root@pam'
May 10 11:36:00 proxmox-8x2080Ti systemd[1]: Starting Proxmox VE replication runner...
May 10 11:36:01 proxmox-8x2080Ti systemd[1]: pvesr.service: Succeeded.
May 10 11:36:01 proxmox-8x2080Ti systemd[1]: Started Proxmox VE replication runner.
May 10 11:36:11 proxmox-8x2080Ti pveproxy[20084]: worker exit
May 10 11:36:11 proxmox-8x2080Ti pvedaemon[23445]: starting vnc proxy UPID:proxmox-8x2080Ti:00005B95:00062924:5EB83B8B:vncproxy:105:root@pam:
May 10 11:36:11 proxmox-8x2080Ti pvedaemon[14189]: <root@pam> starting task UPID:proxmox-8x2080Ti:00005B95:00062924:5EB83B8B:vncproxy:105:root@pam:
May 10 11:36:44 proxmox-8x2080Ti pveproxy[6785]: worker exit
May 10 11:36:44 proxmox-8x2080Ti pveproxy[2498]: worker 6785 finished
May 10 11:36:44 proxmox-8x2080Ti pveproxy[2498]: starting 1 worker(s)
May 10 11:36:44 proxmox-8x2080Ti pveproxy[2498]: worker 24038 started
May 10 11:36:46 proxmox-8x2080Ti pveproxy[2498]: worker 2476 finished
May 10 11:36:46 proxmox-8x2080Ti pveproxy[2498]: starting 1 worker(s)
May 10 11:36:46 proxmox-8x2080Ti pveproxy[2498]: worker 24052 started
May 10 11:36:51 proxmox-8x2080Ti pveproxy[24051]: got inotify poll request in wrong process - disabling inotify
May 10 11:37:00 proxmox-8x2080Ti systemd[1]: Starting Proxmox VE replication runner...
May 10 11:37:01 proxmox-8x2080Ti systemd[1]: pvesr.service: Succeeded.
May 10 11:37:01 proxmox-8x2080Ti systemd[1]: Started Proxmox VE replication runner.
May 10 11:37:25 proxmox-8x2080Ti sshd[24776]: Accepted password for root from 192.168.31.101 port 64639 ssh2
May 10 11:37:25 proxmox-8x2080Ti sshd[24776]: pam_unix(sshd:session): session opened for user root by (uid=0)
May 10 11:37:25 proxmox-8x2080Ti systemd-logind[1998]: New session 5 of user root.
May 10 11:37:25 proxmox-8x2080Ti systemd[1]: Started Session 5 of user root.
May 10 11:37:26 proxmox-8x2080Ti sshd[24776]: pam_unix(sshd:session): session closed for user root
May 10 11:37:26 proxmox-8x2080Ti systemd[1]: session-5.scope: Succeeded.
May 10 11:37:26 proxmox-8x2080Ti systemd-logind[1998]: Session 5 logged out. Waiting for processes to exit.
May 10 11:37:26 proxmox-8x2080Ti systemd-logind[1998]: Removed session 5.
May 10 11:38:00 proxmox-8x2080Ti systemd[1]: Starting Proxmox VE replication runner...
May 10 11:38:01 proxmox-8x2080Ti systemd[1]: pvesr.service: Succeeded.
May 10 11:38:01 proxmox-8x2080Ti systemd[1]: Started Proxmox VE replication runner.
May 10 11:39:00 proxmox-8x2080Ti systemd[1]: Starting Proxmox VE replication runner...
May 10 11:39:01 proxmox-8x2080Ti systemd[1]: pvesr.service: Succeeded.
May 10 11:39:01 proxmox-8x2080Ti systemd[1]: Started Proxmox VE replication runner.
May 10 11:40:00 proxmox-8x2080Ti systemd[1]: Starting Proxmox VE replication runner...
May 10 11:40:01 proxmox-8x2080Ti systemd[1]: pvesr.service: Succeeded.
May 10 11:40:01 proxmox-8x2080Ti systemd[1]: Started Proxmox VE replication runner.
May 10 11:40:31 proxmox-8x2080Ti pveproxy[2498]: worker 10940 finished
May 10 11:40:31 proxmox-8x2080Ti pveproxy[2498]: starting 1 worker(s)
May 10 11:40:31 proxmox-8x2080Ti pveproxy[2498]: worker 28019 started
May 10 11:40:32 proxmox-8x2080Ti pveproxy[28018]: got inotify poll request in wrong process - disabling inotify
May 10 11:40:32 proxmox-8x2080Ti pveproxy[28018]: worker exit
May 10 11:41:00 proxmox-8x2080Ti systemd[1]: Starting Proxmox VE replication runner...
May 10 11:41:01 proxmox-8x2080Ti systemd[1]: pvesr.service: Succeeded.
May 10 11:41:01 proxmox-8x2080Ti systemd[1]: Started Proxmox VE replication runner.

Code:
root@proxmox-8x2080Ti:~# cat /etc/pve/qemu-server/105.conf
agent: 1
balloon: 0
bios: ovmf
boot: cdn
bootdisk: scsi1
cores: 8
cpu: host,hidden=1,flags=+pcid;+spec-ctrl;+ssbd;+pdpe1gb;+hv-tlbflush;+aes,hv-vendor-id=whatever
efidisk0: NVME-thin:vm-105-disk-0,size=4M
hostpci0: 05:00,pcie=1
hugepages: 1024
ide0: none,media=cdrom
machine: q35
memory: 16384
name: VM1
net0: e1000=82:A5:9F:A4:C6:81,bridge=vmbr0,firewall=1
numa: 1
ostype: win10
scsi1: NVME-thin:vm-105-disk-1,size=200G
scsihw: virtio-scsi-pci
smbios1: uuid=128a8ccc-124d-4f64-a15b-1bee3ee6c3af
sockets: 1
vmgenid: 8677e93d-6886-4668-88e5-4165279908fa

Code:
root@proxmox-8x2080Ti:~# cat /etc/pve/qemu-server/106.conf
agent: 1
balloon: 0
bios: ovmf
boot: cdn
bootdisk: scsi1
cores: 8
cpu: host,hidden=1,flags=+pcid;+spec-ctrl;+ssbd;+pdpe1gb;+hv-tlbflush;+aes,hv-vendor-id=whatever
efidisk0: NVME-thin:vm-106-disk-1,size=4M
hostpci0: 08:00,pcie=1
hugepages: 1024
ide0: none,media=cdrom
machine: q35
memory: 16384
name: VM2
net0: e1000=C6:57:96:FC:09:A6,bridge=vmbr0,firewall=1
numa: 1
ostype: win10
scsi1: NVME-thin:vm-106-disk-0,size=200G
scsihw: virtio-scsi-pci
smbios1: uuid=7818e0a8-f244-45bb-8107-a16ca32da51a
sockets: 1
vmgenid: cb8b427f-2db7-489c-a53a-e0930209b27b
 

Attachments

  • dmesg.txt (157.1 KB)
  • apt_history.log (2.1 KB)
Thank you all. It turns out my problem may not be GPU related, so I have opened a new thread with more details.
 
I am having the same issue. I am running PVE 7.0 with a single Windows 10 guest. I have two graphics cards to test: a GTX 1660 and an RX 5600 XT. I was able to pass through the 1660 fine, but when I try to do the same with the RX 5600 XT, almost every time I shut down/restart the VM, or start it again after a successful shutdown, a couple of CPUs get the soft lockup message and basically the entire host is stuck: I can't stop the machine from the UI, can't stop it by killing the process, etc. I have to physically restart the PVE host machine. The problem goes away once I remove the RX 5600 XT from the VM.
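
Not a fix, but if the host still accepts an SSH session when this happens, a few generic checks like the ones below can show whether the card ever comes back from its reset (the 08:00 address is only an example borrowed from the configs earlier in the thread; substitute your own):

Code:
# which driver currently owns the GPU, and its IOMMU group
lspci -nnk -s 08:00
find /sys/kernel/iommu_groups/ -type l | grep '08:00'

# follow kernel messages for vfio / reset errors while shutting the guest down
dmesg -w | grep -Ei 'vfio|reset'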
 
Hi Stefan_r,

Could you elaborate on what you mean by "Use a specific 'romfile' for your GPU"?

Is this something I have to find and download, and if so, how do I use it within the Proxmox config?

Thanks
 
