[SOLVED] amdgpu unbind/unload errors with pve-kernel-6.1

leesteken

With pve-kernel-6.1, unbinding amdgpu and handing the GPU off to vfio-pci still appears to work, but it produces lots of errors in Syslog/journalctl. It worked without any errors with pve-kernel-5.19.
https://forum.proxmox.com/threads/o...r-proxmox-ve-7-x-available.119483/post-518711
https://forum.proxmox.com/threads/o...r-proxmox-ve-7-x-available.119483/post-519158
I think it can be reproduced with an AMD GPU (supported by vendor-reset) by running modprobe amdgpu followed by either rmmod amdgpu or echo "$PCIID" >"/sys/bus/pci/devices/$PCIID/driver/unbind".
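The unbind step above can be wrapped in a small helper. This is only a sketch: the function name unbind_pci_device and the SYSFS_ROOT override are made up for illustration (the override exists only so the helper can be tried against a fake directory tree), and the PCI address in the usage comment is a placeholder.

```shell
#!/bin/sh
# Sketch: unbind a PCI device from whatever driver currently owns it.
# unbind_pci_device and SYSFS_ROOT are hypothetical names, not part of any tool.
unbind_pci_device() {
    pciid="$1"
    # Defaults to the real sysfs; override only for dry runs on a fake tree.
    sysfs_root="${SYSFS_ROOT:-/sys}"
    unbind="$sysfs_root/bus/pci/devices/$pciid/driver/unbind"

    # The driver link (and so the unbind file) only exists while a driver is bound.
    if [ ! -e "$unbind" ]; then
        echo "no driver bound to $pciid" >&2
        return 1
    fi

    # Same effect as: echo "$PCIID" > /sys/bus/pci/devices/$PCIID/driver/unbind
    printf '%s\n' "$pciid" > "$unbind"
}

# Usage as root (0000:0b:00.0 is a placeholder address, look yours up with lspci):
#   modprobe amdgpu
#   unbind_pci_device 0000:0b:00.0
#   journalctl -k -b | grep 'WARNING'
```

Checking for the unbind file first avoids a confusing shell redirect error when the device is already unbound or already owned by vfio-pci.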

Does anybody happen to know how to fix this? Early binding to vfio-pci or blacklisting amdgpu probably works, but it would be really nice if unbinding/unloading amdgpu worked like it did with pve-kernel-5.19, without any errors.
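For reference, the early-binding workaround mentioned above usually comes down to a modprobe.d fragment. The sketch below only prints such a fragment. The function name vfio_early_bind_conf, the file name in the usage comment, and the device IDs are placeholders and not taken from this thread; real vendor:device IDs come from lspci -nn.

```shell
#!/bin/sh
# Sketch: print a modprobe.d fragment that lets vfio-pci claim the GPU
# before amdgpu gets a chance to bind it.
# vfio_early_bind_conf is a hypothetical helper name.
vfio_early_bind_conf() {
    # $1: comma-separated vendor:device IDs (GPU plus its audio function)
    printf 'options vfio-pci ids=%s\n' "$1"
    # Make sure vfio-pci is loaded before amdgpu.
    printf 'softdep amdgpu pre: vfio-pci\n'
}

# Usage as root (IDs and file name are placeholders):
#   vfio_early_bind_conf 1002:aaaa,1002:bbbb > /etc/modprobe.d/vfio.conf
#   update-initramfs -u -k all
#   reboot
#
# Blacklisting instead would be a single line in a modprobe.d file:
#   blacklist amdgpu
```

Either variant avoids the unbind path entirely, so it sidesteps the warnings rather than fixing them, and the host can no longer use the GPU itself.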
 
+1, also interested in this (I never got unbinding to work, but I can work with one-time passthrough, which is good enough for me).
 
This still happens with Linux sentry 6.1.2-1-pve #1 SMP PREEMPT_DYNAMIC PVE 6.1.2-1 (2023-01-10T00:00Z) x86_64 GNU/Linux.
Code:
WARNING: CPU: 2 PID: 73328 at drivers/gpu/drm/drm_mode_object.c:107 drm_mode_object_unregister+0x8c/0x90 [drm]
jan 19 07:40:06 sentry kernel: Modules linked in: binfmt_misc veth ebt_arp ebtable_filter ebtables ip6table_raw ip6t_REJECT nf_reject_ipv6 ip6table_filter ip>
jan 19 07:40:06 sentry kernel:  libarc4 k10temp mac_hid em28xx tveeprom videodev mc ledtrig_heartbeat it87 hwmon_vid nf_conntrack nf_defrag_ipv6 nf_defrag_ip>
jan 19 07:40:06 sentry kernel: CPU: 2 PID: 73328 Comm: task UPID:sentr Tainted: P           O       6.1.2-1-pve #1
jan 19 07:40:06 sentry kernel: Hardware name: Gigabyte Technology Co., Ltd. X570S AERO G/X570S AERO G, BIOS F4 12/27/2022
jan 19 07:40:06 sentry kernel: RIP: 0010:drm_mode_object_unregister+0x8c/0x90 [drm]
jan 19 07:40:06 sentry kernel: Code: af fa 5b 41 5c 41 5d 5d c3 cc cc cc cc 44 0f b6 6f 50 41 80 fd 01 0f 87 ed 84 01 00 41 83 e5 01 74 9a 49 83 7c 24 18 00 >
jan 19 07:40:06 sentry kernel: RSP: 0018:ffffa86604bfbb18 EFLAGS: 00010246
jan 19 07:40:06 sentry kernel: RAX: ffffffffc5547600 RBX: ffff96e173e60010 RCX: 0000000000000000
jan 19 07:40:06 sentry kernel: RDX: ffff96e173e76660 RSI: ffff96e173e76620 RDI: ffff96e173e60010
jan 19 07:40:06 sentry kernel: RBP: ffffa86604bfbb30 R08: 0000000000000001 R09: 0000000000b71b00
jan 19 07:40:06 sentry kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff96e173e76620
jan 19 07:40:06 sentry kernel: R13: 0000000000000001 R14: ffff96e173e60010 R15: 0000000000000001
jan 19 07:40:06 sentry kernel: FS:  00007fdfad181280(0000) GS:ffff96f04ea80000(0000) knlGS:0000000000000000
jan 19 07:40:06 sentry kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
jan 19 07:40:06 sentry kernel: CR2: 00005565a985b268 CR3: 0000000353b50000 CR4: 0000000000750ee0
jan 19 07:40:06 sentry kernel: PKRU: 55555554
jan 19 07:40:06 sentry kernel: Call Trace:
jan 19 07:40:06 sentry kernel:  <TASK>
jan 19 07:40:06 sentry kernel:  drm_encoder_cleanup+0x54/0xd0 [drm]
jan 19 07:40:06 sentry kernel:  amdgpu_dm_fini+0x61/0x240 [amdgpu]
jan 19 07:40:06 sentry kernel:  dm_hw_fini+0x23/0x30 [amdgpu]
jan 19 07:40:06 sentry kernel:  amdgpu_device_fini_hw+0x2e1/0x3c0 [amdgpu]
jan 19 07:40:06 sentry kernel:  amdgpu_driver_unload_kms+0x51/0x60 [amdgpu]
jan 19 07:40:06 sentry kernel:  amdgpu_pci_remove+0x52/0x140 [amdgpu]
jan 19 07:40:06 sentry kernel:  ? __pm_runtime_resume+0x60/0x90
jan 19 07:40:06 sentry kernel:  pci_device_remove+0x39/0xb0
jan 19 07:40:06 sentry kernel:  device_remove+0x46/0x70
jan 19 07:40:06 sentry kernel:  device_release_driver_internal+0x1fa/0x280
jan 19 07:40:06 sentry kernel:  device_driver_detach+0x14/0x20
jan 19 07:40:06 sentry kernel:  unbind_store+0x12a/0x140
jan 19 07:40:06 sentry kernel:  drv_attr_store+0x24/0x40
jan 19 07:40:06 sentry kernel:  sysfs_kf_write+0x3f/0x50
jan 19 07:40:06 sentry kernel:  kernfs_fop_write_iter+0x13f/0x1d0
jan 19 07:40:06 sentry kernel:  vfs_write+0x2a7/0x3b0
jan 19 07:40:06 sentry kernel:  ksys_write+0x67/0xf0
jan 19 07:40:06 sentry kernel:  __x64_sys_write+0x1a/0x20
jan 19 07:40:06 sentry kernel:  do_syscall_64+0x5c/0x90
jan 19 07:40:06 sentry kernel:  ? exit_to_user_mode_prepare+0x37/0x180
jan 19 07:40:06 sentry kernel:  ? syscall_exit_to_user_mode+0x26/0x50
jan 19 07:40:06 sentry kernel:  ? __x64_sys_newfstat+0x16/0x20
jan 19 07:40:06 sentry kernel:  ? do_syscall_64+0x69/0x90
jan 19 07:40:06 sentry kernel:  entry_SYSCALL_64_after_hwframe+0x63/0xcd
jan 19 07:40:06 sentry kernel: RIP: 0033:0x7fdfad3a3fb3
jan 19 07:40:06 sentry kernel: Code: 75 05 48 83 c4 58 c3 e8 cb 41 ff ff 66 2e 0f 1f 84 00 00 00 00 00 90 64 8b 04 25 18 00 00 00 85 c0 75 14 b8 01 00 00 00 >
jan 19 07:40:06 sentry kernel: RSP: 002b:00007ffc2c75ce48 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
jan 19 07:40:06 sentry kernel: RAX: ffffffffffffffda RBX: 00005565a9859260 RCX: 00007fdfad3a3fb3
jan 19 07:40:06 sentry kernel: RDX: 000000000000000c RSI: 00005565a9859260 RDI: 0000000000000009
jan 19 07:40:06 sentry kernel: RBP: 000000000000000c R08: 0000000000000000 R09: 00005565a270b3b0
jan 19 07:40:06 sentry kernel: R10: 00005565a9836cf8 R11: 0000000000000246 R12: 00005565a9857540
jan 19 07:40:06 sentry kernel: R13: 00005565a3cee2a0 R14: 0000000000000009 R15: 00005565a9857540
jan 19 07:40:06 sentry kernel:  </TASK>
 
On my system I saw the same stack trace. Since I could easily reproduce it, I decided to do a kernel bisect and filed a bug report upstream:
https://gitlab.freedesktop.org/drm/amd/-/issues/2374
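For anyone who wants to repeat a bisect like this: each step means booting a test kernel, doing the unbind, and deciding whether that kernel is good or bad. The sketch below shows one way to make that decision from the kernel log. It is not the procedure actually used for the bug report, and the helper names log_has_unbind_warning and bisect_verdict are made up.

```shell
#!/bin/sh
# Sketch: classify a kernel log (on stdin) for a manual bisect step.
# Both function names are hypothetical.

# Succeeds if the log contains the WARNING shown earlier in this thread.
log_has_unbind_warning() {
    grep -q 'WARNING: .*drm_mode_object_unregister'
}

# Prints "bad" if the warning is present, "good" otherwise.
bisect_verdict() {
    if log_has_unbind_warning; then
        echo bad
    else
        echo good
    fi
}

# Usage after booting the kernel under test and unbinding the GPU:
#   verdict="$(journalctl -k -b | bisect_verdict)"
#   git -C /path/to/linux bisect "$verdict"
```

The grep pattern only matches the drm_mode_object_unregister warning, so unrelated warnings in the log do not mark a kernel as bad.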

Good news: I just tested with Ubuntu mainline kernel 6.1.9 and the issue is gone. I reported this upstream and closed the bug for now.
Kernel 6.1.9 also contains amdgpu fixes for DisplayPort MST issues. Those are possibly not completely fixed yet, but this also fixed the issue I reported earlier.
So it seems the amdgpu driver polishing is showing nice results ;)
 
You're my hero!
Thank you for reporting back on this and more thanks for all the effort you put in.
 
