I am running Jellyfin in a Debian 12 LXC container, using my Ryzen 3200G APU with Vega 8 graphics for video transcoding. The amdgpu driver seems unstable and crashes very frequently with the same error. Has anyone faced the same issue, and what should I do to debug it? I am attaching the journalctl logs from around the crash.
Code:
Jul 20 14:53:18 prox07 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring comp_1.1.1 timeout, signaled seq=8609, emitted seq=8611
Jul 20 14:53:18 prox07 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process ffmpeg pid 8077 thread ffmpeg:cs0 pid 8078
Jul 20 14:53:18 prox07 kernel: amdgpu 0000:2a:00.0: amdgpu: GPU reset begin!
Jul 20 14:53:22 prox07 kernel: amdgpu 0000:2a:00.0: amdgpu: failed to write reg 28b4 wait reg 28c6
Jul 20 14:53:25 prox07 kernel: amdgpu 0000:2a:00.0: amdgpu: failed to write reg 28b4 wait reg 28c6
Jul 20 14:53:42 prox07 sshd[8900]: Accepted publickey for root from 192.168.11.4 port 57992 ssh2: RSA SHA256:5R4ymA7QyduMgMQabgwFNXzIUgXiJmKg6NHF08KgN5U
Jul 20 14:53:42 prox07 sshd[8900]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0)
Jul 20 14:53:42 prox07 systemd-logind[714]: New session 4 of user root.
Jul 20 14:53:42 prox07 systemd[1]: Started session-4.scope - Session 4 of User root.
Jul 20 14:53:42 prox07 sshd[8900]: pam_env(sshd:session): deprecated reading of user environment enabled
Jul 20 14:53:54 prox07 kernel: watchdog: BUG: soft lockup - CPU#2 stuck for 25s! [kworker/u64:0:6684]
Jul 20 14:53:54 prox07 kernel: Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace netfs nf_conntrack_netlink xt_nat xt_conntrack nft_chain_nat xfrm_user xfrm_algo xt_addrtype nft_compat>
Jul 20 14:53:54 prox07 kernel: rc_core soundcore k10temp i2c_algo_bit ccp mac_hid zfs(PO) spl(O) vhost_net vhost vhost_iotlb tap nct6775 nct6775_core hwmon_vid efi_pstore dmi_sysfs ip_tables x_tables au>
Jul 20 14:53:54 prox07 kernel: CPU: 2 PID: 6684 Comm: kworker/u64:0 Tainted: P O 6.8.8-2-pve #1
Jul 20 14:53:54 prox07 kernel: Hardware name: Micro-Star International Co., Ltd. MS-7C52/B450M-A PRO MAX (MS-7C52), BIOS 3.L0 10/25/2023
Jul 20 14:53:54 prox07 kernel: Workqueue: amdgpu-reset-dev drm_sched_job_timedout [gpu_sched]
Jul 20 14:53:54 prox07 kernel: RIP: 0010:amdgpu_device_rreg+0xf3/0x120 [amdgpu]
Jul 20 14:53:54 prox07 kernel: Code: 75 1d f6 83 e8 13 04 00 10 74 14 48 8b 83 d8 2f 04 00 48 8d 78 18 e8 6c b1 72 cb 85 c0 75 10 4c 03 a3 e0 08 00 00 45 8b 24 24 <e9> 63 ff ff ff 48 89 df 31 d2 44 89 f6>
Jul 20 14:53:54 prox07 kernel: RSP: 0018:ffffad5b0791fb78 EFLAGS: 00000286
Jul 20 14:53:54 prox07 kernel: RAX: ffffffffc2065440 RBX: ffff96d683f80000 RCX: 0000000000000000
Jul 20 14:53:54 prox07 kernel: RDX: 0000000000000000 RSI: 000000000003b184 RDI: ffff96d683f80000
Jul 20 14:53:54 prox07 kernel: RBP: ffffad5b0791fba0 R08: 0000000000000000 R09: 0000000000000000
Jul 20 14:53:54 prox07 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 00000000ffffffff
Jul 20 14:53:54 prox07 kernel: R13: 0000000000000000 R14: 000000000000ec61 R15: ffff96d683f80770
Jul 20 14:53:54 prox07 kernel: FS: 0000000000000000(0000) GS:ffff96d9b0300000(0000) knlGS:0000000000000000
Jul 20 14:53:54 prox07 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 20 14:53:54 prox07 kernel: CR2: 00007f7ad7de5000 CR3: 0000000107eb8000 CR4: 00000000003506f0
Jul 20 14:53:54 prox07 kernel: Call Trace:
Jul 20 14:53:54 prox07 kernel: <IRQ>
Jul 20 14:53:54 prox07 kernel: ? show_regs+0x6d/0x80
Jul 20 14:53:54 prox07 kernel: ? watchdog_timer_fn+0x206/0x290
Jul 20 14:53:54 prox07 kernel: ? __pfx_watchdog_timer_fn+0x10/0x10
Jul 20 14:53:54 prox07 kernel: ? __hrtimer_run_queues+0x108/0x280
Jul 20 14:53:54 prox07 kernel: ? hrtimer_interrupt+0xf6/0x250
Jul 20 14:53:54 prox07 kernel: ? __sysvec_apic_timer_interrupt+0x51/0x150
Jul 20 14:53:54 prox07 kernel: ? sysvec_apic_timer_interrupt+0x8d/0xd0
Jul 20 14:53:54 prox07 kernel: </IRQ>
Jul 20 14:53:54 prox07 kernel: <TASK>
Jul 20 14:53:54 prox07 kernel: ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
Jul 20 14:53:54 prox07 kernel: ? amdgpu_device_rreg+0xf3/0x120 [amdgpu]
Jul 20 14:53:54 prox07 kernel: ? delay_halt+0x40/0x80
Jul 20 14:53:54 prox07 kernel: gfx_v9_0_rlc_stop+0x10a/0x3d0 [amdgpu]
Jul 20 14:53:54 prox07 kernel: ? __const_udelay+0x3d/0x50
Jul 20 14:53:54 prox07 kernel: gfx_v9_0_hw_fini+0xe1/0x990 [amdgpu]
Jul 20 14:53:54 prox07 kernel: ? sdma_v4_0_hw_fini.part.0+0x93/0xb0 [amdgpu]
Jul 20 14:53:54 prox07 kernel: gfx_v9_0_suspend+0xe/0x20 [amdgpu]
Jul 20 14:53:54 prox07 kernel: amdgpu_device_ip_suspend_phase2+0x17a/0x220 [amdgpu]