Issue with amdgpu crashing in LXC

tsingla07

New Member
May 20, 2024
I am running Jellyfin in a Debian 12 LXC container. I am using my Ryzen 3200G APU with Vega 8 graphics for transcoding videos. The amdgpu driver seems unstable and crashes very frequently with the same error. Has anyone faced the same issue, and what should I do to debug this? I am attaching journalctl logs from around the crash.


Code:
Jul 20 14:53:18 prox07 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring comp_1.1.1 timeout, signaled seq=8609, emitted seq=8611
Jul 20 14:53:18 prox07 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process ffmpeg pid 8077 thread ffmpeg:cs0 pid 8078
Jul 20 14:53:18 prox07 kernel: amdgpu 0000:2a:00.0: amdgpu: GPU reset begin!
Jul 20 14:53:22 prox07 kernel: amdgpu 0000:2a:00.0: amdgpu: failed to write reg 28b4 wait reg 28c6
Jul 20 14:53:25 prox07 kernel: amdgpu 0000:2a:00.0: amdgpu: failed to write reg 28b4 wait reg 28c6
Jul 20 14:53:42 prox07 sshd[8900]: Accepted publickey for root from 192.168.11.4 port 57992 ssh2: RSA SHA256:5R4ymA7QyduMgMQabgwFNXzIUgXiJmKg6NHF08KgN5U
Jul 20 14:53:42 prox07 sshd[8900]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0)
Jul 20 14:53:42 prox07 systemd-logind[714]: New session 4 of user root.
Jul 20 14:53:42 prox07 systemd[1]: Started session-4.scope - Session 4 of User root.
Jul 20 14:53:42 prox07 sshd[8900]: pam_env(sshd:session): deprecated reading of user environment enabled
Jul 20 14:53:54 prox07 kernel: watchdog: BUG: soft lockup - CPU#2 stuck for 25s! [kworker/u64:0:6684]
Jul 20 14:53:54 prox07 kernel: Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace netfs nf_conntrack_netlink xt_nat xt_conntrack nft_chain_nat xfrm_user xfrm_algo xt_addrtype nft_compat>
Jul 20 14:53:54 prox07 kernel:  rc_core soundcore k10temp i2c_algo_bit ccp mac_hid zfs(PO) spl(O) vhost_net vhost vhost_iotlb tap nct6775 nct6775_core hwmon_vid efi_pstore dmi_sysfs ip_tables x_tables au>
Jul 20 14:53:54 prox07 kernel: CPU: 2 PID: 6684 Comm: kworker/u64:0 Tainted: P           O       6.8.8-2-pve #1
Jul 20 14:53:54 prox07 kernel: Hardware name: Micro-Star International Co., Ltd. MS-7C52/B450M-A PRO MAX (MS-7C52), BIOS 3.L0 10/25/2023
Jul 20 14:53:54 prox07 kernel: Workqueue: amdgpu-reset-dev drm_sched_job_timedout [gpu_sched]
Jul 20 14:53:54 prox07 kernel: RIP: 0010:amdgpu_device_rreg+0xf3/0x120 [amdgpu]
Jul 20 14:53:54 prox07 kernel: Code: 75 1d f6 83 e8 13 04 00 10 74 14 48 8b 83 d8 2f 04 00 48 8d 78 18 e8 6c b1 72 cb 85 c0 75 10 4c 03 a3 e0 08 00 00 45 8b 24 24 <e9> 63 ff ff ff 48 89 df 31 d2 44 89 f6>
Jul 20 14:53:54 prox07 kernel: RSP: 0018:ffffad5b0791fb78 EFLAGS: 00000286
Jul 20 14:53:54 prox07 kernel: RAX: ffffffffc2065440 RBX: ffff96d683f80000 RCX: 0000000000000000
Jul 20 14:53:54 prox07 kernel: RDX: 0000000000000000 RSI: 000000000003b184 RDI: ffff96d683f80000
Jul 20 14:53:54 prox07 kernel: RBP: ffffad5b0791fba0 R08: 0000000000000000 R09: 0000000000000000
Jul 20 14:53:54 prox07 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 00000000ffffffff
Jul 20 14:53:54 prox07 kernel: R13: 0000000000000000 R14: 000000000000ec61 R15: ffff96d683f80770
Jul 20 14:53:54 prox07 kernel: FS:  0000000000000000(0000) GS:ffff96d9b0300000(0000) knlGS:0000000000000000
Jul 20 14:53:54 prox07 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 20 14:53:54 prox07 kernel: CR2: 00007f7ad7de5000 CR3: 0000000107eb8000 CR4: 00000000003506f0
Jul 20 14:53:54 prox07 kernel: Call Trace:
Jul 20 14:53:54 prox07 kernel:  <IRQ>
Jul 20 14:53:54 prox07 kernel:  ? show_regs+0x6d/0x80
Jul 20 14:53:54 prox07 kernel:  ? watchdog_timer_fn+0x206/0x290
Jul 20 14:53:54 prox07 kernel:  ? __pfx_watchdog_timer_fn+0x10/0x10
Jul 20 14:53:54 prox07 kernel:  ? __hrtimer_run_queues+0x108/0x280
Jul 20 14:53:54 prox07 kernel:  ? hrtimer_interrupt+0xf6/0x250
Jul 20 14:53:54 prox07 kernel:  ? __sysvec_apic_timer_interrupt+0x51/0x150
Jul 20 14:53:54 prox07 kernel:  ? sysvec_apic_timer_interrupt+0x8d/0xd0
Jul 20 14:53:54 prox07 kernel:  </IRQ>
Jul 20 14:53:54 prox07 kernel:  <TASK>
Jul 20 14:53:54 prox07 kernel:  ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
Jul 20 14:53:54 prox07 kernel:  ? amdgpu_device_rreg+0xf3/0x120 [amdgpu]
Jul 20 14:53:54 prox07 kernel:  ? delay_halt+0x40/0x80
Jul 20 14:53:54 prox07 kernel:  gfx_v9_0_rlc_stop+0x10a/0x3d0 [amdgpu]
Jul 20 14:53:54 prox07 kernel:  ? __const_udelay+0x3d/0x50
Jul 20 14:53:54 prox07 kernel:  gfx_v9_0_hw_fini+0xe1/0x990 [amdgpu]
Jul 20 14:53:54 prox07 kernel:  ? sdma_v4_0_hw_fini.part.0+0x93/0xb0 [amdgpu]
Jul 20 14:53:54 prox07 kernel:  gfx_v9_0_suspend+0xe/0x20 [amdgpu]
Jul 20 14:53:54 prox07 kernel:  amdgpu_device_ip_suspend_phase2+0x17a/0x220 [amdgpu]
 
What kernel are you running on Proxmox (check with uname -a)? I've seen the amdgpu driver crash (the GPU won't work until reboot) with 6.8. Maybe see whether it goes away with Proxmox kernel 6.5?
 
I am using kernel version 6.8.8-2-pve.
6.8.8-3-pve was released yesterday or so. You could try updating: https://pve.proxmox.com/pve-docs/pve-admin-guide.html#system_software_updates
Is it safe to revert to 6.5?
I don't know what you mean by "safe". I feel like people have reported fewer problems with kernel version 6.5 than with the current 6.8, but I'm not sure whether 6.5 is still getting updates and security fixes. It would be an informative data point to know whether 6.5 works better. I'm not telling you to run 6.5 permanently; I'm just asking you to test it.
 
By "safe" I meant whether I might run into any issues by downgrading the kernel. I will test with 6.5 and report back.
 
There could be issues like you might have a device that is not yet supported by 6.5. Or maybe there are (other) bugs in 6.5 that you might run into. Or maybe there is a security issue. No guarantees, sorry.
 
OK, I tried kernel 6.5. This time I got a different error (ring gfx_low timeout instead of ring comp_1.1.1 timeout). Is there anything else I can try?


Code:
Jul 20 16:28:55 prox07 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_low timeout, signaled seq=184238, emitted seq=184240
Jul 20 16:28:55 prox07 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process ffmpeg pid 5350 thread ffmpeg:cs0 pid 5351
Jul 20 16:28:55 prox07 kernel: amdgpu 0000:2a:00.0: amdgpu: GPU reset begin!
Jul 20 16:29:20 prox07 kernel: watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [kworker/u64:6:166]
Jul 20 16:29:20 prox07 kernel: Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache netfs nf_conntrack_netlink xt_nat xt_conntrack nft_chain_nat xfrm_user xfrm_algo xt_addrtype nft_compat overlay cfg80211 tcp_diag inet_diag veth xt_MASQUERADE xt_tcpudp xt_mark vfio_pci vfio_pci_core vfio_iommu_type1 vfio iommufd ebtable_filter ebtables ip_set ip6table_raw iptable_raw nf_tables ip6table_nat ip6table_filter ip6_tables iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter bpfilter softdog bonding tls sunrpc binfmt_misc nfnetlink_log nfnetlink amdgpu intel_rapl_msr intel_rapl_common edac_mce_amd amdxcp iommu_v2 kvm_amd drm_buddy gpu_sched drm_suballoc_helper kvm irqbypass crct10dif_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel snd_hda_codec_realtek sha256_ssse3 sha1_ssse3 snd_hda_codec_generic aesni_intel ledtrig_audio crypto_simd drm_ttm_helper cryptd ttm snd_hda_codec_hdmi drm_display_helper snd_hda_intel snd_intel_dspcfg cec snd_intel_sdw_acpi snd_hda_codec snd_hda_core
Jul 20 16:29:20 prox07 kernel:  snd_hwdep snd_pcm rc_core snd_timer snd drm_kms_helper rapl wmi_bmof pcspkr i2c_algo_bit ccp k10temp soundcore mac_hid zfs(PO) spl(O) vhost_net vhost vhost_iotlb tap nct6775 nct6775_core hwmon_vid drm efi_pstore dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic xor raid6_pq simplefb dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c nvme crc32_pclmul xhci_pci nvme_core i2c_piix4 r8169 xhci_pci_renesas nvme_common ahci realtek libahci xhci_hcd video wmi gpio_amdpt
Jul 20 16:29:20 prox07 kernel: CPU: 0 PID: 166 Comm: kworker/u64:6 Tainted: P           O       6.5.13-5-pve #1
Jul 20 16:29:20 prox07 kernel: Hardware name: Micro-Star International Co., Ltd. MS-7C52/B450M-A PRO MAX (MS-7C52), BIOS 3.L0 10/25/2023
Jul 20 16:29:20 prox07 kernel: Workqueue: amdgpu-reset-dev drm_sched_job_timedout [gpu_sched]
Jul 20 16:29:20 prox07 kernel: RIP: 0010:amdgpu_device_rreg+0xf3/0x120 [amdgpu]
Jul 20 16:29:20 prox07 kernel: Code: 75 1d f6 83 a8 5d 04 00 10 74 14 48 8b 83 38 79 04 00 48 8d 78 18 e8 2c ca b4 dd 85 c0 75 10 4c 03 a3 00 09 00 00 45 8b 24 24 <e9> 63 ff ff ff 48 89 df 44 89 f6 e8 2d 21 11 00 41 89 c4 48 8b 83
Jul 20 16:29:20 prox07 kernel: RSP: 0018:ffffa79480723b60 EFLAGS: 00000286
Jul 20 16:29:20 prox07 kernel: RAX: ffffffffc2141b20 RBX: ffff8d1c0c300000 RCX: 0000000000000000
Jul 20 16:29:20 prox07 kernel: RDX: 0000000000000000 RSI: 000000000003b014 RDI: ffff8d1c0c300000
Jul 20 16:29:20 prox07 kernel: RBP: ffffa79480723b88 R08: 0000000000000000 R09: 0000000000000000
Jul 20 16:29:20 prox07 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 00000000ffffffff
Jul 20 16:29:20 prox07 kernel: R13: 0000000000000000 R14: 000000000000ec05 R15: 0000000000000001
Jul 20 16:29:20 prox07 kernel: FS:  0000000000000000(0000) GS:ffff8d1f30a00000(0000) knlGS:0000000000000000
Jul 20 16:29:20 prox07 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 20 16:29:20 prox07 kernel: CR2: 00007ffcd4f02e94 CR3: 000000010c948000 CR4: 00000000003506f0
Jul 20 16:29:20 prox07 kernel: Call Trace:
Jul 20 16:29:20 prox07 kernel:  <IRQ>
Jul 20 16:29:20 prox07 kernel:  ? show_regs+0x6d/0x80
Jul 20 16:29:20 prox07 kernel:  ? watchdog_timer_fn+0x1d8/0x240
Jul 20 16:29:20 prox07 kernel:  ? __pfx_watchdog_timer_fn+0x10/0x10
Jul 20 16:29:20 prox07 kernel:  ? __hrtimer_run_queues+0x108/0x280
Jul 20 16:29:20 prox07 kernel:  ? hrtimer_interrupt+0xf6/0x250
Jul 20 16:29:20 prox07 kernel:  ? __sysvec_apic_timer_interrupt+0x62/0x140
Jul 20 16:29:20 prox07 kernel:  ? sysvec_apic_timer_interrupt+0x8d/0xd0
Jul 20 16:29:20 prox07 kernel:  </IRQ>
Jul 20 16:29:20 prox07 kernel:  <TASK>
Jul 20 16:29:20 prox07 kernel:  ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
Jul 20 16:29:20 prox07 kernel:  ? amdgpu_device_rreg+0xf3/0x120 [amdgpu]
Jul 20 16:29:20 prox07 kernel:  ? delay_halt+0x40/0x80
Jul 20 16:29:20 prox07 kernel:  gfx_v9_0_set_safe_mode+0xd3/0x140 [amdgpu]
Jul 20 16:29:20 prox07 kernel:  amdgpu_gfx_rlc_enter_safe_mode+0x6b/0x90 [amdgpu]
Jul 20 16:29:20 prox07 kernel:  gfx_v9_0_set_powergating_state+0x91/0x270 [amdgpu]
Jul 20 16:29:20 prox07 kernel:  amdgpu_device_set_pg_state+0xc5/0x130 [amdgpu]
Jul 20 16:29:20 prox07 kernel:  ? __irq_put_desc_unlock+0x1e/0x50
Jul 20 16:29:20 prox07 kernel:  amdgpu_device_ip_suspend_phase1+0x21/0x110 [amdgpu]
Jul 20 16:29:20 prox07 kernel:  amdgpu_device_ip_suspend+0x20/0x80 [amdgpu]
Jul 20 16:29:20 prox07 kernel:  amdgpu_device_pre_asic_reset+0xdb/0x310 [amdgpu]
Jul 20 16:29:20 prox07 kernel:  amdgpu_device_gpu_recover+0x4da/0xec0 [amdgpu]
Jul 20 16:29:20 prox07 kernel:  amdgpu_job_timedout+0x182/0x270 [amdgpu]
Jul 20 16:29:20 prox07 kernel:  drm_sched_job_timedout+0x70/0x120 [gpu_sched]
Jul 20 16:29:20 prox07 kernel:  process_one_work+0x23e/0x450
Jul 20 16:29:20 prox07 kernel:  worker_thread+0x50/0x3f0
Jul 20 16:29:20 prox07 kernel:  ? __pfx_worker_thread+0x10/0x10
Jul 20 16:29:20 prox07 kernel:  kthread+0xf2/0x120
Jul 20 16:29:20 prox07 kernel:  ? __pfx_kthread+0x10/0x10
Jul 20 16:29:20 prox07 kernel:  ret_from_fork+0x47/0x70
Jul 20 16:29:20 prox07 kernel:  ? __pfx_kthread+0x10/0x10
Jul 20 16:29:20 prox07 kernel:  ret_from_fork_asm+0x1b/0x30
Jul 20 16:29:20 prox07 kernel:  </TASK>
Jul 20 16:29:48 prox07 kernel: watchdog: BUG: soft lockup - CPU#0 stuck for 53s! [kworker/u64:6:166]
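
For anyone comparing crashes like the two above across kernels, a small script can pull the relevant lines out of a long journal. The following is only a rough sketch, assuming the default journalctl line layout seen in this thread; the names `RingTimeout`, `find_ring_timeouts` and `find_soft_lockups` are made up for illustration and are not part of any tool.

```python
import re
from typing import NamedTuple


class RingTimeout(NamedTuple):
    timestamp: str
    ring: str
    signaled: int
    emitted: int


# Matches the amdgpu_job_timedout lines as they appear in the logs above, e.g.
# "... *ERROR* ring comp_1.1.1 timeout, signaled seq=8609, emitted seq=8611"
TIMEOUT_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+[\d:]+)\s+\S+\s+kernel:.*amdgpu_job_timedout.*"
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<sig>\d+), emitted seq=(?P<emit>\d+)"
)

# Matches "watchdog: BUG: soft lockup - CPU#2 stuck for 25s!"
LOCKUP_RE = re.compile(r"soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s")


def find_ring_timeouts(log_text: str) -> list[RingTimeout]:
    """Return one entry per ring timeout reported by amdgpu."""
    events = []
    for line in log_text.splitlines():
        m = TIMEOUT_RE.search(line)
        if m:
            events.append(
                RingTimeout(m["ts"], m["ring"], int(m["sig"]), int(m["emit"]))
            )
    return events


def find_soft_lockups(log_text: str) -> list[tuple[int, int]]:
    """Return (cpu, seconds_stuck) for every soft lockup message."""
    return [
        (int(m["cpu"]), int(m["secs"])) for m in LOCKUP_RE.finditer(log_text)
    ]


# Two lines copied from the logs in this thread, used as sample input.
sample = (
    "Jul 20 14:53:18 prox07 kernel: [drm:amdgpu_job_timedout [amdgpu]] "
    "*ERROR* ring comp_1.1.1 timeout, signaled seq=8609, emitted seq=8611\n"
    "Jul 20 14:53:54 prox07 kernel: watchdog: BUG: soft lockup - "
    "CPU#2 stuck for 25s! [kworker/u64:0:6684]\n"
)

for ev in find_ring_timeouts(sample):
    # emitted - signaled = fences submitted to the ring but never completed
    print(f"{ev.timestamp}  ring={ev.ring}  unsignaled={ev.emitted - ev.signaled}")
for cpu, secs in find_soft_lockups(sample):
    print(f"soft lockup on CPU {cpu} for {secs}s")
```

To run it on real data, the sample string could be replaced with the saved output of `journalctl -k` for the boot in question. In both crashes posted here the gap between emitted and signaled sequence numbers is 2 (8611 vs 8609, and 184240 vs 184238), and both resets end in a soft lockup in the amdgpu reset worker, so the failure looks similar on 6.8 and 6.5 even though the ring name differs.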
 
