Kernel Crash in amdgpu driver when using MST

karypid

Hello,

I have a Dell UltraSharp U4025QW monitor with a feature that causes the host kernel to crash: iMST (internal Multi-Stream Transport).

MST is the DisplayPort feature behind "daisy chaining", which lets you connect two separate monitors to the same port. In this particular model, there is an internal "splitter" that lets the monitor split its screen into two parts and present itself as two separate daisy-chained monitors instead of one. In other words, the host Linux system sees two monitors daisy-chained on the same GPU port.

I have a Radeon 6800XT set up for GPU passthrough, and the host kernel crashes when I try to start the VM with this feature on. This is the kernel output as soon as the VM starts:

Code:
Apr 07 21:02:11 pve pvesh[2310]: Starting VM 100
Apr 07 21:02:11 pve pve-guests[4755]: start VM 100: UPID:pve:00001293:00000F91:6612FBC3:qmstart:100:root@pam:
Apr 07 21:02:11 pve pve-guests[2311]: <root@pam> starting task UPID:pve:00001293:00000F91:6612FBC3:qmstart:100:root@pam:
Apr 07 21:02:11 pve kernel: Console: switching to colour dummy device 80x25
Apr 07 21:02:11 pve pvedaemon[2293]: <root@pam> successful auth for user 'root@pam'
Apr 07 21:02:12 pve kernel: amdgpu 0000:0a:00.0: amdgpu: amdgpu: finishing device.

Apr 07 21:02:13 pve kernel: [drm] amdgpu: ttm finalized
Apr 07 21:02:13 pve kernel: BUG: kernel NULL pointer dereference, address: 0000000000000120
Apr 07 21:02:13 pve kernel: #PF: supervisor read access in kernel mode
Apr 07 21:02:13 pve kernel: #PF: error_code(0x0000) - not-present page
Apr 07 21:02:13 pve kernel: PGD 0 P4D 0
Apr 07 21:02:13 pve kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
Apr 07 21:02:13 pve kernel: CPU: 12 PID: 251 Comm: kworker/12:1 Tainted: P           O       6.5.13-3-pve #1
Apr 07 21:02:13 pve kernel: Hardware name: Gigabyte Technology Co., Ltd. X570S AERO G/X570S AERO G, BIOS F5g 09/20/2023
Apr 07 21:02:13 pve kernel: Workqueue: events drm_connector_free_work_fn [drm]
Apr 07 21:02:13 pve kernel: RIP: 0010:dc_link_aux_transfer_raw+0x1b/0x40 [amdgpu]
Apr 07 21:02:13 pve kernel: Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 48 8b 47 20 55 48 8b 80 c8 01 00 00 48 89 e5 48 8b 80 b0 04 00 00 <48> 8b 80 20 01 00 00 e>
Apr 07 21:02:13 pve kernel: RSP: 0018:ffff963b00c27ca8 EFLAGS: 00010206
Apr 07 21:02:13 pve kernel: RAX: 0000000000000000 RBX: ffff963b00c27d28 RCX: 0000000000000000
Apr 07 21:02:13 pve kernel: RDX: ffff963b00c27cc4 RSI: ffff963b00c27cc8 RDI: ffff8a9ce3677000
Apr 07 21:02:13 pve kernel: RBP: ffff963b00c27ca8 R08: 0000000000000001 R09: 0000000000000000
Apr 07 21:02:13 pve kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff8a9cd18daa30
Apr 07 21:02:13 pve kernel: R13: ffff8a9cd18dae58 R14: 0000000000000000 R15: ffff8a9cd18daa30
Apr 07 21:02:13 pve kernel: FS:  0000000000000000(0000) GS:ffff8ab3aed00000(0000) knlGS:0000000000000000
Apr 07 21:02:13 pve kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 07 21:02:13 pve kernel: CR2: 0000000000000120 CR3: 00000001e143e000 CR4: 0000000000750ee0
Apr 07 21:02:13 pve kernel: PKRU: 55555554
Apr 07 21:02:13 pve kernel: Call Trace:
Apr 07 21:02:13 pve kernel:  <TASK>
Apr 07 21:02:13 pve kernel:  ? show_regs+0x6d/0x80
Apr 07 21:02:13 pve kernel:  ? __die+0x24/0x80
Apr 07 21:02:13 pve kernel:  ? page_fault_oops+0x176/0x500
Apr 07 21:02:13 pve kernel:  ? do_user_addr_fault+0x31d/0x6a0
Apr 07 21:02:13 pve kernel:  ? exc_page_fault+0x83/0x1b0
Apr 07 21:02:13 pve kernel:  ? asm_exc_page_fault+0x27/0x30
Apr 07 21:02:13 pve kernel:  ? dc_link_aux_transfer_raw+0x1b/0x40 [amdgpu]
Apr 07 21:02:13 pve kernel:  dm_dp_aux_transfer+0xd0/0x1d0 [amdgpu]
Apr 07 21:02:13 pve kernel:  drm_dp_dpcd_access+0xa8/0x140 [drm_display_helper]
Apr 07 21:02:13 pve kernel:  drm_dp_dpcd_write+0xc4/0x120 [drm_display_helper]
Apr 07 21:02:13 pve kernel:  drm_dp_mst_topology_mgr_set_mst+0x1f4/0x2e0 [drm_display_helper]
Apr 07 21:02:13 pve kernel:  drm_dp_mst_topology_mgr_destroy+0x14/0x70 [drm_display_helper]
Apr 07 21:02:13 pve kernel:  amdgpu_dm_connector_destroy+0x28/0xf0 [amdgpu]
Apr 07 21:02:13 pve kernel:  drm_connector_free_work_fn+0x77/0xa0 [drm]
Apr 07 21:02:13 pve kernel:  process_one_work+0x23e/0x450
Apr 07 21:02:13 pve kernel:  worker_thread+0x50/0x3f0
Apr 07 21:02:13 pve kernel:  ? __pfx_worker_thread+0x10/0x10
Apr 07 21:02:13 pve kernel:  kthread+0xf2/0x120
Apr 07 21:02:13 pve kernel:  ? __pfx_kthread+0x10/0x10
Apr 07 21:02:13 pve kernel:  ret_from_fork+0x47/0x70
Apr 07 21:02:13 pve kernel:  ? __pfx_kthread+0x10/0x10
Apr 07 21:02:13 pve kernel:  ret_from_fork_asm+0x1b/0x30
Apr 07 21:02:13 pve kernel:  </TASK>
Apr 07 21:02:13 pve kernel: Modules linked in: nf_conntrack_netlink xt_nat xt_tcpudp xt_conntrack xt_MASQUERADE xfrm_user xfrm_algo iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_de>
Apr 07 21:02:13 pve kernel:  snd_hwdep snd_seq_device crypto_simd cryptd snd_pcm i2c_algo_bit cfg80211 rapl videodev video gigabyte_wmi wmi_bmof ccp k10temp pcspkr snd_timer videobuf2_>
Apr 07 21:02:13 pve kernel: CR2: 0000000000000120
Apr 07 21:02:13 pve kernel: ---[ end trace 0000000000000000 ]---
Apr 07 21:02:13 pve kernel: RIP: 0010:dc_link_aux_transfer_raw+0x1b/0x40 [amdgpu]
Apr 07 21:02:13 pve kernel: Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 48 8b 47 20 55 48 8b 80 c8 01 00 00 48 89 e5 48 8b 80 b0 04 00 00 <48> 8b 80 20 01 00 00 e>
Apr 07 21:02:13 pve kernel: RSP: 0018:ffff963b00c27ca8 EFLAGS: 00010206
Apr 07 21:02:13 pve kernel: RAX: 0000000000000000 RBX: ffff963b00c27d28 RCX: 0000000000000000
Apr 07 21:02:13 pve kernel: RDX: ffff963b00c27cc4 RSI: ffff963b00c27cc8 RDI: ffff8a9ce3677000
Apr 07 21:02:13 pve kernel: RBP: ffff963b00c27ca8 R08: 0000000000000001 R09: 0000000000000000
Apr 07 21:02:13 pve kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff8a9cd18daa30
Apr 07 21:02:13 pve kernel: R13: ffff8a9cd18dae58 R14: 0000000000000000 R15: ffff8a9cd18daa30
Apr 07 21:02:13 pve kernel: FS:  0000000000000000(0000) GS:ffff8ab3aed00000(0000) knlGS:0000000000000000
Apr 07 21:02:13 pve kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 07 21:02:13 pve kernel: CR2: 0000000000000120 CR3: 00000001e143e000 CR4: 0000000000750ee0
Apr 07 21:02:13 pve kernel: PKRU: 55555554
Apr 07 21:02:13 pve kernel: note: kworker/12:1[251] exited with irqs disabled

This hangs the process for the start task, and it cannot be killed (not even with kill -9). As a result, shutting down Proxmox cleanly becomes impossible and I have to power off abruptly.
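
For anyone who wants to pull the key facts out of an oops like the one above, here is a minimal Python sketch. It is not a Proxmox or kernel tool: the function name `summarize_oops` and the `SAMPLE` excerpt are made up for illustration, and the regular expressions assume the journal line format shown in this post. It extracts the fault address, the function named in RIP, and the call-trace frames that are not marked with "?" (frames marked "?" are ones the unwinder could not confirm).

```python
import re

# journald prefix such as "Apr 07 21:02:13 pve kernel: " (assumed format, taken from this post).
PREFIX = re.compile(r"^.*? kernel: ")
# A call-trace frame: optional "? " marker, symbol+offset/size, optional [module].
FRAME = re.compile(r"^\s*(\?\s+)?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?")
FAULT = re.compile(r"NULL pointer dereference, address: ([0-9a-f]+)")
RIP = re.compile(r"^RIP: [0-9a-f]+:([\w.]+)\+0x")

# Shortened excerpt of the log in this post, used only as sample input.
SAMPLE = """\
Apr 07 21:02:13 pve kernel: BUG: kernel NULL pointer dereference, address: 0000000000000120
Apr 07 21:02:13 pve kernel: RIP: 0010:dc_link_aux_transfer_raw+0x1b/0x40 [amdgpu]
Apr 07 21:02:13 pve kernel: Call Trace:
Apr 07 21:02:13 pve kernel:  <TASK>
Apr 07 21:02:13 pve kernel:  ? asm_exc_page_fault+0x27/0x30
Apr 07 21:02:13 pve kernel:  ? dc_link_aux_transfer_raw+0x1b/0x40 [amdgpu]
Apr 07 21:02:13 pve kernel:  dm_dp_aux_transfer+0xd0/0x1d0 [amdgpu]
Apr 07 21:02:13 pve kernel:  drm_dp_dpcd_access+0xa8/0x140 [drm_display_helper]
Apr 07 21:02:13 pve kernel:  drm_dp_dpcd_write+0xc4/0x120 [drm_display_helper]
Apr 07 21:02:13 pve kernel:  drm_dp_mst_topology_mgr_set_mst+0x1f4/0x2e0 [drm_display_helper]
Apr 07 21:02:13 pve kernel:  drm_dp_mst_topology_mgr_destroy+0x14/0x70 [drm_display_helper]
Apr 07 21:02:13 pve kernel:  amdgpu_dm_connector_destroy+0x28/0xf0 [amdgpu]
Apr 07 21:02:13 pve kernel:  drm_connector_free_work_fn+0x77/0xa0 [drm]
Apr 07 21:02:13 pve kernel:  process_one_work+0x23e/0x450
Apr 07 21:02:13 pve kernel:  </TASK>
"""


def summarize_oops(log_text):
    """Return the fault address, RIP symbol and confirmed call-trace frames."""
    summary = {"fault_addr": None, "rip": None, "frames": []}
    in_trace = False
    for raw in log_text.splitlines():
        # Drop the timestamp/host prefix; lines without it are left unchanged.
        line = PREFIX.sub("", raw, count=1)
        stripped = line.strip()
        if stripped == "<TASK>":
            in_trace = True
        elif stripped == "</TASK>":
            in_trace = False
        elif in_trace:
            m = FRAME.match(line)
            # Keep only frames without the "?" marker.
            if m and m.group(1) is None:
                summary["frames"].append((m.group(2), m.group(3)))
        elif summary["fault_addr"] is None and FAULT.search(line):
            summary["fault_addr"] = int(FAULT.search(line).group(1), 16)
        elif summary["rip"] is None and RIP.match(line):
            summary["rip"] = RIP.match(line).group(1)
    return summary


if __name__ == "__main__":
    info = summarize_oops(SAMPLE)
    print("fault address:", hex(info["fault_addr"]))
    print("faulting function:", info["rip"])
    for symbol, module in info["frames"]:
        print("  called from:", symbol, "[%s]" % module if module else "")
```

Read innermost-first, the confirmed frames run from dm_dp_aux_transfer up through drm_dp_mst_topology_mgr_destroy and amdgpu_dm_connector_destroy to drm_connector_free_work_fn. Together with RAX being 0 and the fault address being 0x120, this looks like a read at offset 0x120 from a NULL pointer while the MST topology manager is being torn down, shortly after the "finishing device" and "ttm finalized" messages. That is only my reading of the trace, not a confirmed root cause.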

If I turn off the feature from the monitor's OSD, everything works fine. In fact, once the VM has booted and started using the screen, I can re-enable the feature with no issues (though at that point the amdgpu driver in use is the one from the guest kernel, Fedora 39, version 6.8.4-200).

The host system is:

Code:
root@pve:~# pveversion
pve-manager/8.1.10/4b06efb5db453f29 (running kernel: 6.5.13-3-pve)
 