Ubuntu 20.04 VM with GPU Passthrough Crashes Server on Shutdown

skytrooper09

New Member
Jul 20, 2020
Hey all,

This is my first time posting, so apologies up front if content or protocol is found lacking.

I recently upgraded (via a fresh install) from Proxmox VE 5.4 to 6.2 and have been trying to configure an Ubuntu 20.04 VM with GPU passthrough. That part works: I can use the VM directly through the GPU and a passed-through USB device, and everything behaves as expected during normal operation. I was even able to enable nested virtualization for Android emulation.

However, I am running into a (big) problem where stopping, restarting, or shutting down the Ubuntu VM causes the entire server to crash! I'm talking full freeze up, fans to 100%, followed by the horrible motherboard chime that tells you something went terribly wrong. My only guess is that this is related to the GPU, but that is truly just a guess, and I'm not sure how it could be working so well for normal operation and only blowing up when stopping. I haven't been able to find any concrete issues in the logs.

I've included some logs and configurations below. Can anyone point me in the right direction on how to fix or even troubleshoot this issue? I'm no sysadmin, and I only have a novice understanding of system-level Linux and kernel mechanics.

Machine Details
Base: HP Z420
CPU: Xeon E5-1650 (6C/12T @ 3.2GHz)
Memory: 32GB Unregistered ECC DDR3
PVE Install Drive: 120GB Kingston SSD
VM Install Drive: 500GB Crucial SSD
GPU (passthrough): Gigabyte RX 480
GPU (unused): Nvidia Quadro NVS 450

VM Configuration
OS: Ubuntu 20.04
AMD Driver: Default (built-in)
agent: 1
args: -cpu 'host,+kvm_pv_unhalt,+kvm_pv_eoi,hv_vendor_id=NV43FIX,kvm=off'
balloon: 4096
bios: ovmf
boot: cdn
bootdisk: scsi0
cores: 8
cpu: host,hidden=1,flags=+pcid
efidisk0: vm_storage:vm-111-disk-1,size=128K
hostpci0: 08:00,pcie=1,x-vga=1
ide2: none,media=cdrom
machine: q35
memory: 16384
name: Ubuntu-20.04
net0: virtio=66:B7:75:4D:B4:77,bridge=vmbr0,firewall=1
numa: 1
onboot: 1
ostype: l26
scsi0: vm_storage:vm-111-disk-0,size=128G
scsihw: virtio-scsi-single
smbios1: uuid=aff5d1c4-8802-4497-8f7d-b8cf61ad7cf8
sockets: 1
usb0: host=046d:c52b
vmgenid: 32dd9ed3-6969-4bdc-a600-6cc051bdbeaf

/var/log/syslog
This is the log of simply starting up the server, logging into the web interface, and trying to shut down the VM. So the last successful log you'll see before the crash is the successful login event, then a barf of symbols. What happens after that is when I boot the machine back up.
Jul 20 17:36:40 pve pve-guests[1792]: <root@pam> end task UPID:pve:00000715:00000560:5F160E28:startall::root@pam: OK
Jul 20 17:36:40 pve systemd[1]: Started PVE guests.
Jul 20 17:36:40 pve systemd[1]: Reached target Multi-User System.
Jul 20 17:36:40 pve systemd[1]: Reached target Graphical Interface.
Jul 20 17:36:40 pve systemd[1]: Starting Update UTMP about System Runlevel Changes...
Jul 20 17:36:40 pve systemd[1]: systemd-update-utmp-runlevel.service: Succeeded.
Jul 20 17:36:40 pve systemd[1]: Started Update UTMP about System Runlevel Changes.
Jul 20 17:36:40 pve systemd[1]: Startup finished in 23.384s (firmware) + 6.275s (loader) + 3.288s (kernel) + 1min 14.601s (userspace) = 1min 47.550s.
Jul 20 17:36:41 pve systemd-timesyncd[1345]: Timed out waiting for reply from [2001:418:8405:4002::3]:123 (2.debian.pool.ntp.org).
Jul 20 17:36:41 pve systemd-timesyncd[1345]: Synchronized to time server for the first time 64.22.253.155:123 (2.debian.pool.ntp.org).
Jul 20 17:37:00 pve systemd[1]: Starting Proxmox VE replication runner...
Jul 20 17:37:01 pve systemd[1]: pvesr.service: Succeeded.
Jul 20 17:37:01 pve systemd[1]: Started Proxmox VE replication runner.
Jul 20 17:38:00 pve systemd[1]: Starting Proxmox VE replication runner...
Jul 20 17:38:01 pve systemd[1]: pvesr.service: Succeeded.
Jul 20 17:38:01 pve systemd[1]: Started Proxmox VE replication runner.
Jul 20 17:38:27 pve pvedaemon[1690]: <root@pam> successful auth for user 'root@pam'
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@Jul 20 17:57:36 pve kernel: [ 0.000000] Linux version 5.4.34-1-pve (build@pve) (gcc version 8.3.0 (Debian 8.3.0-6)) #1 SMP PVE 5.4.34-2 (Thu, 07 May 2020 10:02:02 +0200) ()
Jul 20 17:57:36 pve kernel: [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.4.34-1-pve root=/dev/mapper/pve-root ro quiet intel_iommu=on
Jul 20 17:57:36 pve kernel: [ 0.000000] KERNEL supported cpus:
Jul 20 17:57:36 pve kernel: [ 0.000000] Intel GenuineIntel
Jul 20 17:57:36 pve kernel: [ 0.000000] AMD AuthenticAMD
Jul 20 17:57:36 pve systemd-modules-load[468]: Module 'vfio' is builtin
Jul 20 17:57:36 pve systemd-modules-load[468]: Module 'vfio_iommu_type1' is builtin
Jul 20 17:57:36 pve kernel: [ 0.000000] Hygon HygonGenuine
Jul 20 17:57:36 pve systemd-modules-load[468]: Module 'vfio_pci' is builtin
Jul 20 17:57:36 pve kernel: [ 0.000000] Centaur CentaurHauls
Jul 20 17:57:36 pve systemd-modules-load[468]: Module 'vfio_virqfd' is builtin
Jul 20 17:57:36 pve kernel: [ 0.000000] zhaoxin Shanghai
Jul 20 17:57:36 pve lvm[466]: 1 logical volume(s) in volume group "pve" monitored
Jul 20 17:57:36 pve kernel: [ 0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
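
Since the crash clobbers whatever syslog was writing at the time (the run of ^@ characters above), one way to see what was logged just before each reboot is to strip the NUL bytes and print the lines leading up to the kernel boot banner. This is only a minimal sketch, assuming GNU grep and a log laid out like the excerpt above; the function name lines_before_boot is made up for illustration.

```shell
# Print the log lines that come right before each kernel boot banner.
# $1 = log file, $2 = number of preceding lines to show (default 5).
lines_before_boot() {
    file="$1"
    count="${2:-5}"
    # Drop the NUL padding left behind by the hard crash, then show context
    # before each "Linux version" line that the kernel prints at time 0.
    tr -d '\000' < "$file" |
        grep -a -B "$count" 'kernel: \[ *0\.000000\] Linux version'
}

# Example (run as root on the host):
#   lines_before_boot /var/log/syslog 10
```

If the systemd journal is kept on disk (it is only persistent when /var/log/journal exists, which is an assumption about the setup), `journalctl -k -b -1` should also show the kernel messages from the previous boot.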

/etc/modprobe.d/vfio.conf
options vfio-pci ids=1002:67df,1002:aaf0 disable_vga=1
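
To double-check that the ids= list matches the card, the vendor:device pairs can be pulled out of `lspci -nn` output. A rough sketch: pci_ids_csv is a hypothetical helper name, and the 08:00 address in the example comes from the hostpci0 line in the VM config above.

```shell
# Read `lspci -nn` lines on stdin and print the [vendor:device] IDs
# as one comma-separated list, the same shape as the ids= option.
pci_ids_csv() {
    sed -n 's/.*\[\([0-9a-f]\{4\}:[0-9a-f]\{4\}\)\].*/\1/p' | paste -sd, -
}

# Example (on the host); compare the result with ids= in vfio.conf:
#   lspci -nn -s 08:00 | pci_ids_csv
```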

/etc/modprobe.d/blacklist.conf
blacklist radeon
blacklist nouveau
blacklist nvidia
 
So I guess I'm kinda dumb, but hopefully this helps someone else.

I was following this guide from the Proxmox wiki when setting up my GPU passthrough, and it lists the drivers to blacklist that I had included. However, my system was using the amdgpu driver, not the radeon driver. I added the amdgpu driver to the blacklist and was able to shut down the VM without a crash.
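
For anyone checking the same thing: the driver bound to the card shows up on the "Kernel driver in use" line of `lspci -nnk`. Below is a small sketch that reads that line and appends a blacklist entry only if it is missing. The helper names driver_in_use and ensure_blacklisted are invented here, and it assumes the blacklist lives in /etc/modprobe.d/blacklist.conf as in the first post.

```shell
# Print the driver named on the "Kernel driver in use" line(s) of lspci -k output.
driver_in_use() {
    sed -n 's/^[[:space:]]*Kernel driver in use: //p'
}

# Append "blacklist <module>" to a modprobe config file unless already present.
ensure_blacklisted() {
    module="$1"
    conf="$2"
    grep -qx "blacklist $module" "$conf" 2>/dev/null ||
        printf 'blacklist %s\n' "$module" >> "$conf"
}

# Example (as root on the host; 08:00 is the GPU address from the VM config):
#   lspci -nnk -s 08:00 | driver_in_use
#   ensure_blacklisted amdgpu /etc/modprobe.d/blacklist.conf
#   update-initramfs -u -k all   # rebuild the initramfs, then reboot the host
```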
 
I added the amdgpu driver to the blacklist and was able to shut down the VM without a crash.

Shut down, yes.
Reboot, no.

The adventure continues.

Edit:
Shutdown/stop is successful only some of the time. For example, I was able to stop the VM and re-start it 2 times while debugging settings in the web interface, but the 3rd time I stopped it the server crashed again.
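
Because the crash is intermittent, it may help to count how many stop/start cycles the host survives. Here is a hedged sketch that drives `qm shutdown` and `qm start` in a loop and syncs a small progress log to disk before every step, so the last entry should survive a hard freeze. cycle_vm is a made-up name, 111 is the VM ID implied by the disk names in the config above, and the sketch assumes both qm commands return only once the operation has finished.

```shell
# Stop and start a VM repeatedly, recording each step before it runs.
# $1 = VM ID, $2 = number of rounds, $3 = progress log file.
cycle_vm() {
    vmid="$1"
    rounds="$2"
    log="$3"
    i=1
    while [ "$i" -le "$rounds" ]; do
        printf 'round %s: shutdown\n' "$i" >> "$log"
        sync    # flush to disk so the entry survives a host crash
        qm shutdown "$vmid" || return 1
        printf 'round %s: start\n' "$i" >> "$log"
        sync
        qm start "$vmid" || return 1
        i=$((i + 1))
    done
}

# Example (as root on the host):
#   cycle_vm 111 5 /root/vm-cycle.log
```

After a crash and reboot, the last line of the log file shows which round and which step the host was on when it froze.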
 
Have you managed to solve this issue? I'm having similar issues with Nvidia passthrough.
Hope you can find these resources helpful:
https://www.reddit.com/r/Proxmox/comments/i3sblh/pci_gpu_passthrough_causes_host_crash/
https://forum.proxmox.com/threads/nvidia-gtx-1660-super-gpu-passthrough-to-win10-vm.65019/

Also, don't forget to try updating your BIOS. I'm now trying to troubleshoot my crashes (which might happen even after 10hrs of uptime).

Try to blacklist all video-related drivers:
Code:
blacklist radeon

blacklist nouveau
blacklist lbm-nouveau
options nouveau modeset=0
alias nouveau off
alias lbm-nouveau off

blacklist nvidia
blacklist i2c-nvidia-gpu
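
After changing the blacklist (and, typically, rebuilding the initramfs and rebooting the host), it is worth confirming that none of these drivers got loaded anyway. A minimal sketch that filters `lsmod` output; the function name and the exact list of module prefixes are assumptions.

```shell
# Read lsmod output on stdin and print any GPU driver modules that are loaded.
# Matches by prefix, so related modules such as nvidia_drm are listed too.
loaded_gpu_modules() {
    awk '$1 ~ /^(radeon|amdgpu|nouveau|nvidia)/ { print $1 }'
}

# Example (on the host); no output means none of them are loaded:
#   lsmod | loaded_gpu_modules
```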


My thread: https://forum.proxmox.com/threads/proxmox-node-hangs-after-a-while.80317/
 
It has been a while now, but no, I never did solve this issue. I did try blacklisting and made sure the drivers and BIOS were up to date, but I ended up just migrating the VM that needed the GPU to a separate

I'm now trying to troubleshoot my crashes (which might happen even after 10hrs of uptime).

This sounds very similar to some other behavior I was seeing on my server, but I don't know how related they are. Like you, I noticed that my server would randomly freeze up and require a hard reboot, and it would happen randomly without any interaction from me. I never figured that out either.
 
So sad, this is the exact same issue I'm having right now.... Luckily it's been almost 4hrs and my server is still not crashing (crossing all fingers I have).
What's even stranger is that I'm using Nvidia and you are using AMD, so it's probably not specific to one vendor's driver; it might be something deeper in the host OS. Hope the Proxmox staff can read this thread and help us debug.

My specs:
MOBO: Gigabyte x99 GA UD3
CPU: i7 5820k
GPUs: Gigabyte Nvidia 960 and Asus 750Ti + random GeForce 6200 for shell
RAM: memtester said everything was fine and I have ~3GB spare RAM
 
