Hi,
I've just upgraded the Proxmox instance on my homelab's core virtualization server from 6.4 to 7.1. To do so, I backed up all of my VMs to my local Proxmox Backup Server instance running on a separate server, wiped the server completely, cleanly installed 7.1, and restored the VMs. Everything seems to have upgraded just fine, with one exception:
The server is a Dell PowerEdge R730, and it contains two NVIDIA Quadro P4000 GPUs. Starting any VM with the first GPU works just fine, and utilizing the GPU within the VM works successfully. Starting any VM with the second GPU, however, does a few things:
- It instantly makes the Linux kernel report a NULL pointer dereference in Nouveau.
- It spawns a Proxmox 'start VM' task that cannot be stopped in any way, and the VM never actually starts or even attempts to start; there is no VM process for me to find with ps aux | grep <VM ID> and kill with kill -9 <VM PID>. Viewing the task's logs just displays 'no content'.
- After a short while, sometimes a few seconds but also sometimes a few minutes, the system becomes completely unresponsive. Shutting down Proxmox results in all of the VMs shutting down, but Proxmox itself never shuts down. If I view its console over Dell's iDRAC, it appears to be frozen: there are no messages regarding the VMs shutting down and I can't input a username to log in (or anything at all, for that matter). At this point, I must forcibly restart the server from within iDRAC. This only happens after starting a VM with this GPU; Proxmox was able to run overnight without issue as long as that VM remained offline.
I have no reason to believe that something has gone wrong with the GPU itself; it was just working perfectly fine within Proxmox 6.4.
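Since the oops is in Nouveau, it may be worth checking which kernel driver is bound to each card. Below is a minimal sketch, assuming lspci from pciutils is installed. The helper name gpu_drivers is hypothetical, and 10de is NVIDIA's PCI vendor ID:

```shell
# Hypothetical helper: reads `lspci -nnk` output on stdin and prints
# "<PCI address> <driver>" for every device that has a driver bound.
gpu_drivers() {
    awk '
        /^[0-9a-f]/             { addr = $1 }        # header line begins with the PCI address
        /Kernel driver in use:/ { print addr, $NF }  # last field is the bound driver
    '
}

# Assumed usage on the host, limited to NVIDIA devices:
#   lspci -nnk -d 10de: | gpu_drivers
```

A card intended for passthrough would normally be expected to show vfio-pci here rather than nouveau, though I have not confirmed that this is the cause.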
The kernel's NULL pointer dereference is as follows, pulled with journalctl -xe just after starting a VM with the second GPU, before the server becomes unresponsive:
Code:
Nov 26 10:15:40 fc kernel: BUG: kernel NULL pointer dereference, address: 00000000000002c8
Nov 26 10:15:40 fc kernel: #PF: supervisor read access in kernel mode
Nov 26 10:15:40 fc kernel: #PF: error_code(0x0000) - not-present page
Nov 26 10:15:40 fc kernel: PGD 0 P4D 0
Nov 26 10:15:40 fc kernel: Oops: 0000 [#1] SMP PTI
Nov 26 10:15:40 fc kernel: CPU: 60 PID: 663 Comm: kworker/60:1 Tainted: P O 5.13.19-1-pve #1
Nov 26 10:15:40 fc kernel: Hardware name: Dell Inc. PowerEdge R730/0599V5, BIOS 2.13.0 05/14/2021
Nov 26 10:15:40 fc kernel: Workqueue: events drm_connector_free_work_fn [drm]
Nov 26 10:15:40 fc kernel: RIP: 0010:nouveau_connector_aux_xfer+0x2f/0x120 [nouveau]
Nov 26 10:15:40 fc kernel: Code: 55 48 89 e5 41 55 41 54 53 48 89 f3 48 83 ec 10 4c 8b 4e 10 48 8b b7 e8 fa ff ff 65 48 8b 04 25 28 00 00 00 48 89 45 e0 31 c0 <48> 8b 96 c8 02 00 00 48 81 c6 c8 02 00 00 44 88 4d df 48 39 d6 74
Nov 26 10:15:40 fc kernel: RSP: 0018:ffffa2fd8efebcb8 EFLAGS: 00010246
Nov 26 10:15:40 fc kernel: RAX: 0000000000000000 RBX: ffffa2fd8efebcf8 RCX: ffffa2fd8efebd87
Nov 26 10:15:40 fc kernel: RDX: ffff9733e74a9840 RSI: 0000000000000000 RDI: ffff97536b5e6518
Nov 26 10:15:40 fc kernel: RBP: ffffa2fd8efebce0 R08: 0000000000000001 R09: 0000000000000001
Nov 26 10:15:40 fc kernel: R10: 0000000000000058 R11: ffff97534ea8ab48 R12: 0000000000000001
Nov 26 10:15:40 fc kernel: R13: 0000000000000000 R14: ffffa2fd8efebd87 R15: ffff97536b5e6518
Nov 26 10:15:40 fc kernel: FS: 0000000000000000(0000) GS:ffff9752bf980000(0000) knlGS:0000000000000000
Nov 26 10:15:40 fc kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 26 10:15:40 fc kernel: CR2: 00000000000002c8 CR3: 000000014d3b2002 CR4: 00000000003726e0
Nov 26 10:15:40 fc kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Nov 26 10:15:40 fc kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Nov 26 10:15:40 fc kernel: Call Trace:
Nov 26 10:15:40 fc kernel: ? __cond_resched+0x1a/0x50
Nov 26 10:15:40 fc kernel: drm_dp_dpcd_access+0x72/0x110 [drm_kms_helper]
Nov 26 10:15:40 fc kernel: drm_dp_dpcd_write+0x81/0xc0 [drm_kms_helper]
Nov 26 10:15:40 fc kernel: drm_dp_cec_adap_enable+0x3d/0x70 [drm_kms_helper]
Nov 26 10:15:40 fc kernel: __cec_s_phys_addr.part.0+0xbb/0x250 [cec]
Nov 26 10:15:40 fc kernel: __cec_s_phys_addr+0x2c/0x30 [cec]
Nov 26 10:15:40 fc kernel: cec_unregister_adapter+0xe8/0x140 [cec]
Nov 26 10:15:40 fc kernel: drm_dp_cec_unregister_connector+0x2f/0x50 [drm_kms_helper]
Nov 26 10:15:40 fc kernel: nouveau_connector_destroy+0x54/0x80 [nouveau]
Nov 26 10:15:40 fc kernel: drm_connector_free_work_fn+0x77/0x90 [drm]
Nov 26 10:15:40 fc kernel: process_one_work+0x220/0x3c0
Nov 26 10:15:40 fc kernel: worker_thread+0x53/0x420
Nov 26 10:15:40 fc kernel: ? process_one_work+0x3c0/0x3c0
Nov 26 10:15:40 fc kernel: kthread+0x12b/0x150
Nov 26 10:15:40 fc kernel: ? set_kthread_struct+0x50/0x50
Nov 26 10:15:40 fc kernel: ret_from_fork+0x22/0x30
Nov 26 10:15:40 fc kernel: Modules linked in: vfio_pci vfio_virqfd vfio_iommu_type1 vfio veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables sctp ip6_udp_tunnel udp_tunnel iptable_filter bpfilter bonding tls softdog nfnetlink_log nfnetlink ipmi_ssif intel_rapl_msr intel_rapl_common sb_edac x8>
Nov 26 10:15:40 fc kernel: scsi_transport_iscsi drm sunrpc ip_tables x_tables autofs4 zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) btrfs blake2b_generic xor zstd_compress raid6_pq libcrc32c hid_generic usbkbd usbmouse usbhid hid crc32_pclmul igb xhci_pci ixgbe mpt3sas ehci_pci i2c_algo_>
Nov 26 10:15:40 fc kernel: CR2: 00000000000002c8
Nov 26 10:15:40 fc kernel: ---[ end trace 5413aa93ef7f6728 ]---
Nov 26 10:15:40 fc kernel: RIP: 0010:nouveau_connector_aux_xfer+0x2f/0x120 [nouveau]
Nov 26 10:15:40 fc kernel: Code: 55 48 89 e5 41 55 41 54 53 48 89 f3 48 83 ec 10 4c 8b 4e 10 48 8b b7 e8 fa ff ff 65 48 8b 04 25 28 00 00 00 48 89 45 e0 31 c0 <48> 8b 96 c8 02 00 00 48 81 c6 c8 02 00 00 44 88 4d df 48 39 d6 74
Nov 26 10:15:40 fc kernel: RSP: 0018:ffffa2fd8efebcb8 EFLAGS: 00010246
Nov 26 10:15:40 fc kernel: RAX: 0000000000000000 RBX: ffffa2fd8efebcf8 RCX: ffffa2fd8efebd87
Nov 26 10:15:40 fc kernel: RDX: ffff9733e74a9840 RSI: 0000000000000000 RDI: ffff97536b5e6518
Nov 26 10:15:40 fc kernel: RBP: ffffa2fd8efebce0 R08: 0000000000000001 R09: 0000000000000001
Nov 26 10:15:40 fc kernel: R10: 0000000000000058 R11: ffff97534ea8ab48 R12: 0000000000000001
Nov 26 10:15:40 fc kernel: R13: 0000000000000000 R14: ffffa2fd8efebd87 R15: ffff97536b5e6518
Nov 26 10:15:40 fc kernel: FS: 0000000000000000(0000) GS:ffff9752bf980000(0000) knlGS:0000000000000000
Nov 26 10:15:40 fc kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 26 10:15:40 fc kernel: CR2: 00000000000002c8 CR3: 000000014d3b2002 CR4: 00000000003726e0
Nov 26 10:15:40 fc kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Nov 26 10:15:40 fc kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
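For anyone who only wants the fault line and the faulting function from a trace like the one above, here is a minimal sketch. The helper name oops_summary is hypothetical; it filters journal text on stdin:

```shell
# Hypothetical helper: keep only the "BUG:" and "RIP:" parts of kernel
# log lines read from stdin, with duplicates removed.
oops_summary() {
    grep -oE '(BUG|RIP): .*' | sort -u
}

# Assumed usage on the host, reading kernel messages from the current boot:
#   journalctl -k -b --no-pager | oops_summary
```

The kernel prints the RIP line again after the end of the trace, which is why sort -u is used to collapse the repeats.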
This is with kernel version 5.13.19-1-pve, with the following Proxmox package versions (again, from a fresh 7.1 install using the Proxmox ISO from the website, with the No Subscription repo activated and no available updates):
Code:
root@fc:~# pveversion -v
proxmox-ve: 7.1-1 (running kernel: 5.13.19-1-pve)
pve-manager: not correctly installed (running version: 7.1-6/4e61e21c)
pve-kernel-5.13: 7.1-4
pve-kernel-helper: 7.1-4
ceph-fuse: 15.2.15-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.0
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: not correctly installed
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-14
libpve-guest-common-perl: 4.0-3
libpve-http-server-perl: 4.0-3
libpve-storage-perl: 7.0-15
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.9-4
lxcfs: 4.0.8-pve2
novnc-pve: 1.2.0-3
proxmox-backup-client: not correctly installed
proxmox-backup-file-restore: not correctly installed
proxmox-mini-journalreader: 1.2-1
proxmox-widget-toolkit: 3.4-3
pve-cluster: 7.1-2
pve-container: 4.1-2
pve-docs: 7.1-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.3-3
pve-ha-manager: not correctly installed
pve-i18n: 2.6-2
pve-qemu-kvm: 6.1.0-2
pve-xtermjs: 4.12.0-1
qemu-server: 7.1-4
smartmontools: 7.2-1
spiceterm: 3.2-2
swtpm: 0.7.0~rc1+2
vncterm: 1.7-1
zfsutils-linux: 2.1.1-pve3