Kernel Null Pointer Exception when starting a VM with a specific GPU on VE 7.1

SloppyJalopy

New Member
Aug 10, 2021
Hi,

I've just upgraded my homelab's core virtualization server's Proxmox instance from 6.4 to 7.1. The method I used was that I backed up all of my VMs to my local Proxmox Backup Server instance running on a separate server, wiped the server completely, cleanly installed 7.1, and restored the VMs. Everything seems to have been upgraded just fine, with one exception:

The server is a Dell PowerEdge R730, and it contains two NVIDIA Quadro P4000 GPUs. Starting any VM with the first GPU works just fine, and utilizing the GPU within the VM works successfully. Starting any VM with the second GPU, however, does a few things:

- It instantly makes the Linux kernel report a null pointer exception to do with Nouveau.
- It spawns a Proxmox 'start VM' task that cannot be stopped in any way, and the VM never actually starts or even attempts to start; there is no VM process for me to find with ps aux | grep <VM ID> and kill with kill -9 <VM PID>. Viewing the task's logs just displays 'no content'.
- After a short while, sometimes a few seconds but sometimes a few minutes, the system becomes completely unresponsive. Shutting down Proxmox results in all of the VMs shutting down, but Proxmox itself never shuts down. If I view its console over Dell's iDRAC, it appears to be frozen: there are no messages about the VMs shutting down, and I can't input a username to log in (or anything at all, for that matter). At this point, I must forcibly restart the server from within iDRAC. This only happens after starting a VM with this GPU; Proxmox was able to run overnight without issue as long as that VM remained offline.

I have no reason to believe that something has gone wrong with the GPU itself; it was just working perfectly fine within Proxmox 6.4.



The kernel's null pointer exception is as follows, pulled from the server with journalctl -xe just after starting a VM with the second GPU, before the system becomes unresponsive:
Code:
Nov 26 10:15:40 fc kernel: BUG: kernel NULL pointer dereference, address: 00000000000002c8
Nov 26 10:15:40 fc kernel: #PF: supervisor read access in kernel mode
Nov 26 10:15:40 fc kernel: #PF: error_code(0x0000) - not-present page
Nov 26 10:15:40 fc kernel: PGD 0 P4D 0
Nov 26 10:15:40 fc kernel: Oops: 0000 [#1] SMP PTI
Nov 26 10:15:40 fc kernel: CPU: 60 PID: 663 Comm: kworker/60:1 Tainted: P           O      5.13.19-1-pve #1
Nov 26 10:15:40 fc kernel: Hardware name: Dell Inc. PowerEdge R730/0599V5, BIOS 2.13.0 05/14/2021
Nov 26 10:15:40 fc kernel: Workqueue: events drm_connector_free_work_fn [drm]
Nov 26 10:15:40 fc kernel: RIP: 0010:nouveau_connector_aux_xfer+0x2f/0x120 [nouveau]
Nov 26 10:15:40 fc kernel: Code: 55 48 89 e5 41 55 41 54 53 48 89 f3 48 83 ec 10 4c 8b 4e 10 48 8b b7 e8 fa ff ff 65 48 8b 04 25 28 00 00 00 48 89 45 e0 31 c0 <48> 8b 96 c8 02 00 00 48 81 c6 c8 02 00 00 44 88 4d df 48 39 d6 74
Nov 26 10:15:40 fc kernel: RSP: 0018:ffffa2fd8efebcb8 EFLAGS: 00010246
Nov 26 10:15:40 fc kernel: RAX: 0000000000000000 RBX: ffffa2fd8efebcf8 RCX: ffffa2fd8efebd87
Nov 26 10:15:40 fc kernel: RDX: ffff9733e74a9840 RSI: 0000000000000000 RDI: ffff97536b5e6518
Nov 26 10:15:40 fc kernel: RBP: ffffa2fd8efebce0 R08: 0000000000000001 R09: 0000000000000001
Nov 26 10:15:40 fc kernel: R10: 0000000000000058 R11: ffff97534ea8ab48 R12: 0000000000000001
Nov 26 10:15:40 fc kernel: R13: 0000000000000000 R14: ffffa2fd8efebd87 R15: ffff97536b5e6518
Nov 26 10:15:40 fc kernel: FS:  0000000000000000(0000) GS:ffff9752bf980000(0000) knlGS:0000000000000000
Nov 26 10:15:40 fc kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 26 10:15:40 fc kernel: CR2: 00000000000002c8 CR3: 000000014d3b2002 CR4: 00000000003726e0
Nov 26 10:15:40 fc kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Nov 26 10:15:40 fc kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Nov 26 10:15:40 fc kernel: Call Trace:
Nov 26 10:15:40 fc kernel:  ? __cond_resched+0x1a/0x50
Nov 26 10:15:40 fc kernel:  drm_dp_dpcd_access+0x72/0x110 [drm_kms_helper]
Nov 26 10:15:40 fc kernel:  drm_dp_dpcd_write+0x81/0xc0 [drm_kms_helper]
Nov 26 10:15:40 fc kernel:  drm_dp_cec_adap_enable+0x3d/0x70 [drm_kms_helper]
Nov 26 10:15:40 fc kernel:  __cec_s_phys_addr.part.0+0xbb/0x250 [cec]
Nov 26 10:15:40 fc kernel:  __cec_s_phys_addr+0x2c/0x30 [cec]
Nov 26 10:15:40 fc kernel:  cec_unregister_adapter+0xe8/0x140 [cec]
Nov 26 10:15:40 fc kernel:  drm_dp_cec_unregister_connector+0x2f/0x50 [drm_kms_helper]
Nov 26 10:15:40 fc kernel:  nouveau_connector_destroy+0x54/0x80 [nouveau]
Nov 26 10:15:40 fc kernel:  drm_connector_free_work_fn+0x77/0x90 [drm]
Nov 26 10:15:40 fc kernel:  process_one_work+0x220/0x3c0
Nov 26 10:15:40 fc kernel:  worker_thread+0x53/0x420
Nov 26 10:15:40 fc kernel:  ? process_one_work+0x3c0/0x3c0
Nov 26 10:15:40 fc kernel:  kthread+0x12b/0x150
Nov 26 10:15:40 fc kernel:  ? set_kthread_struct+0x50/0x50
Nov 26 10:15:40 fc kernel:  ret_from_fork+0x22/0x30
Nov 26 10:15:40 fc kernel: Modules linked in: vfio_pci vfio_virqfd vfio_iommu_type1 vfio veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables sctp ip6_udp_tunnel udp_tunnel iptable_filter bpfilter bonding tls softdog nfnetlink_log nfnetlink ipmi_ssif intel_rapl_msr intel_rapl_common sb_edac x8>
Nov 26 10:15:40 fc kernel:  scsi_transport_iscsi drm sunrpc ip_tables x_tables autofs4 zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) btrfs blake2b_generic xor zstd_compress raid6_pq libcrc32c hid_generic usbkbd usbmouse usbhid hid crc32_pclmul igb xhci_pci ixgbe mpt3sas ehci_pci i2c_algo_>
Nov 26 10:15:40 fc kernel: CR2: 00000000000002c8
Nov 26 10:15:40 fc kernel: ---[ end trace 5413aa93ef7f6728 ]---
Nov 26 10:15:40 fc kernel: RIP: 0010:nouveau_connector_aux_xfer+0x2f/0x120 [nouveau]
Nov 26 10:15:40 fc kernel: Code: 55 48 89 e5 41 55 41 54 53 48 89 f3 48 83 ec 10 4c 8b 4e 10 48 8b b7 e8 fa ff ff 65 48 8b 04 25 28 00 00 00 48 89 45 e0 31 c0 <48> 8b 96 c8 02 00 00 48 81 c6 c8 02 00 00 44 88 4d df 48 39 d6 74
Nov 26 10:15:40 fc kernel: RSP: 0018:ffffa2fd8efebcb8 EFLAGS: 00010246
Nov 26 10:15:40 fc kernel: RAX: 0000000000000000 RBX: ffffa2fd8efebcf8 RCX: ffffa2fd8efebd87
Nov 26 10:15:40 fc kernel: RDX: ffff9733e74a9840 RSI: 0000000000000000 RDI: ffff97536b5e6518
Nov 26 10:15:40 fc kernel: RBP: ffffa2fd8efebce0 R08: 0000000000000001 R09: 0000000000000001
Nov 26 10:15:40 fc kernel: R10: 0000000000000058 R11: ffff97534ea8ab48 R12: 0000000000000001
Nov 26 10:15:40 fc kernel: R13: 0000000000000000 R14: ffffa2fd8efebd87 R15: ffff97536b5e6518
Nov 26 10:15:40 fc kernel: FS:  0000000000000000(0000) GS:ffff9752bf980000(0000) knlGS:0000000000000000
Nov 26 10:15:40 fc kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 26 10:15:40 fc kernel: CR2: 00000000000002c8 CR3: 000000014d3b2002 CR4: 00000000003726e0
Nov 26 10:15:40 fc kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Nov 26 10:15:40 fc kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

This is with kernel version 5.13.19-1-pve, with the following Proxmox package versions (again, from a fresh 7.1 install using the Proxmox ISO from the website, with the No Subscription repo activated and no available updates):


Code:
root@fc:~# pveversion -v
proxmox-ve: 7.1-1 (running kernel: 5.13.19-1-pve)
pve-manager: not correctly installed (running version: 7.1-6/4e61e21c)
pve-kernel-5.13: 7.1-4
pve-kernel-helper: 7.1-4
ceph-fuse: 15.2.15-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.0
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: not correctly installed
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-14
libpve-guest-common-perl: 4.0-3
libpve-http-server-perl: 4.0-3
libpve-storage-perl: 7.0-15
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.9-4
lxcfs: 4.0.8-pve2
novnc-pve: 1.2.0-3
proxmox-backup-client: not correctly installed
proxmox-backup-file-restore: not correctly installed
proxmox-mini-journalreader: 1.2-1
proxmox-widget-toolkit: 3.4-3
pve-cluster: 7.1-2
pve-container: 4.1-2
pve-docs: 7.1-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.3-3
pve-ha-manager: not correctly installed
pve-i18n: 2.6-2
pve-qemu-kvm: 6.1.0-2
pve-xtermjs: 4.12.0-1
qemu-server: 7.1-4
smartmontools: 7.2-1
spiceterm: 3.2-2
swtpm: 0.7.0~rc1+2
vncterm: 1.7-1
zfsutils-linux: 2.1.1-pve3
 
Looks like I fixed it!

After creating this thread and pulling up those version numbers with pveversion -v, I found it odd that it was saying that a few packages were not correctly installed; after all, I installed Proxmox, and then immediately set up its No Subscription repository and ran the updater from within the GUI as the very first step, before continuing on and setting up the cluster configuration.

These updates seemed to go well; however, I just ran apt update followed by apt dist-upgrade within the GUI's shell, and after running dist-upgrade, dpkg said that it had been interrupted and that I needed to run dpkg --configure -a to fix it. I did, and it applied some updates to the packages listed above, then regenerated the initramfs and ran update-grub.
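For anyone wanting to check for the same half-configured state, here is a minimal sketch that filters pveversion -v output for the "not correctly installed" marker seen above. The function name list_broken_packages is made up for illustration; it only reads text on stdin and changes nothing on the system.

```shell
# Print the name of every package that pveversion -v reports as
# "not correctly installed". Reads the pveversion output on stdin.
# (list_broken_packages is a hypothetical helper name, not a Proxmox tool.)
list_broken_packages() {
    awk -F': ' '/not correctly installed/ { print $1 }'
}

# Typical use on the host:
#   pveversion -v | list_broken_packages
# If anything is listed, finishing the interrupted run is the usual next step:
#   dpkg --configure -a
#   apt update && apt dist-upgrade
```

An empty result after dpkg --configure -a suggests that the interrupted configuration step has completed.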

Alongside this, I realized that after reinstalling Proxmox, I hadn't added the GPU to any vfio-pci config within modprobe.d, nor had I blacklisted nouveau. lspci -nn showed that both GPUs were using the Nouveau driver. I would have thought that Proxmox's console uses the server's on-motherboard Matrox video chip (which I believe is what Dell uses to display Proxmox's console within iDRAC's virtual console, though I'm not sure), and that despite the Nouveau driver being loaded, there wouldn't be issues, as Proxmox isn't technically using either GPU. Regardless, I checked the backup I made of the /etc folder before wiping Proxmox 6.4, and I had set up the vfio.conf and blacklist.conf files previously, so I recreated those files in 7.1.
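The exact contents of those two files aren't shown in this post. As a rough sketch of what such a pair commonly looks like, the helpers below print the lines to stdout so they can be reviewed before being written anywhere. The function names are made up for illustration, and the device IDs in the usage comment are example values only; they should be replaced with whatever lspci -nn reports for the GPU and its audio function.

```shell
# Build the contents of a vfio.conf from one or more vendor:device IDs.
# (vfio_conf and blacklist_conf are hypothetical helper names.)
vfio_conf() {
    # Join all arguments with commas, e.g. "10de:1bb1,10de:10f0".
    ids=$(IFS=,; printf '%s' "$*")
    printf 'options vfio-pci ids=%s\n' "$ids"
    # Ask modprobe to load vfio-pci first if nouveau is ever loaded.
    printf 'softdep nouveau pre: vfio-pci\n'
}

# Build the contents of a blacklist.conf, one "blacklist" line per module.
blacklist_conf() {
    printf 'blacklist %s\n' "$@"
}

# Example use as root (the IDs below are placeholders, check lspci -nn):
#   vfio_conf 10de:1bb1 10de:10f0 > /etc/modprobe.d/vfio.conf
#   blacklist_conf nouveau        > /etc/modprobe.d/blacklist.conf
#   update-initramfs -u -k all    # so the change applies at early boot
#   reboot
```

Blacklisting nouveau affects every NVIDIA card in the host, which is only reasonable when, as here, none of them is used by the host itself.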

I then rebooted Proxmox, and tried starting a VM with the GPU that was causing issues, and the VM started without issue!
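A quick way to confirm after the reboot that the cards are no longer claimed by nouveau is to look at the "Kernel driver in use" line that lspci prints when given the -k flag. The small filter below is a sketch under that assumption; gpu_drivers is a made-up name, and it treats any PCI class 03xx device as a GPU.

```shell
# Read `lspci -nnk` output on stdin and print "<slot> <driver>" for every
# display-class device (PCI class 03xx) that has a driver bound.
# (gpu_drivers is a hypothetical helper name.)
gpu_drivers() {
    awk '
        # Device header lines start with the slot address; detail lines are indented.
        /^[0-9a-f]/ {
            slot = $1
            is_gpu = ($0 ~ /\[03[0-9a-f][0-9a-f]\]:/)
        }
        /Kernel driver in use:/ && is_gpu { print slot, $NF }
    '
}

# Typical use on the host:
#   lspci -nnk | gpu_drivers
# With passthrough set up, each GPU should be listed with vfio-pci.
```

If a card still shows nouveau here, the modprobe.d changes probably did not make it into the initramfs.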
 
Sounds similar to an amdgpu issue I have, where kernel 5.11.22-5 (PVE 7.0) did not need blacklisting or early binding to vfio-pci, but 5.11.22-7 and later (PVE 7.1) did. It is as if both nouveau and amdgpu used to unbind nicely for vfio-pci, but now crash with newer kernels. This has also happened to others. The workaround for me was the same as yours: bind early or blacklist. I'm still not sure whether this is a bug, or whether the drivers changed and the old behavior was never really supported.
 
