Issue with kernel 5.13 after upgrade 7.0->7.1

Dec 29, 2021
This morning I updated my PVE server from 7.0 to 7.1, using the enterprise repository.
After installing all updates, I rebooted the server. It was extremely slow, and the containers didn't even start (or at least not within 5 minutes).

In the logs I found a kernel NULL pointer dereference, marked with "BUG", followed a few lines later by "Oops" and a stack trace.
It seems to be related to the GPU (the amdgpu driver), which I don't care much about anyway, as I run the server headless.

Code:
Dec 29 08:28:14 pve kernel: [    7.735608] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
Dec 29 08:28:14 pve kernel: [    7.735751] kfd kfd: amdgpu: error getting iommu info. is the iommu enabled?
Dec 29 08:28:14 pve kernel: [    7.735757] kfd kfd: amdgpu: Error initializing iommuv2
Dec 29 08:28:14 pve kernel: [    7.736873] kfd kfd: amdgpu: device 1002:9874 NOT added due to errors
Dec 29 08:28:14 pve kernel: [    7.736882] kfd kfd: amdgpu: Failed to resume IOMMU for device 1002:9874
Dec 29 08:28:14 pve kernel: [    7.736890] amdgpu 0000:00:01.0: amdgpu: amdgpu_device_ip_init failed
Dec 29 08:28:14 pve kernel: [    7.736897] amdgpu 0000:00:01.0: amdgpu: Fatal error during GPU init
Dec 29 08:28:14 pve kernel: [    7.736904] amdgpu 0000:00:01.0: amdgpu: amdgpu: finishing device.
Dec 29 08:28:14 pve kernel: [    7.739303] BUG: kernel NULL pointer dereference, address: 00000000000001db
Dec 29 08:28:14 pve kernel: [    7.739310] #PF: supervisor read access in kernel mode
Dec 29 08:28:14 pve kernel: [    7.739314] #PF: error_code(0x0000) - not-present page
Dec 29 08:28:14 pve kernel: [    7.739318] PGD 0 P4D 0 
Dec 29 08:28:14 pve kernel: [    7.739324] Oops: 0000 [#1] SMP NOPTI
Dec 29 08:28:14 pve kernel: [    7.739329] CPU: 0 PID: 692 Comm: systemd-udevd Tainted: P           O      5.13.19-2-pve #1
Dec 29 08:28:14 pve kernel: [    7.739335] Hardware name: HPE ProLiant MicroServer Gen10/ProLiant MicroServer Gen10, BIOS 5.12 06/26/2018
Dec 29 08:28:14 pve kernel: [    7.739340] RIP: 0010:smu8_dpm_powergate_acp+0xc/0x40 [amdgpu]
Dec 29 08:28:14 pve kernel: [    7.739902] Code: 7a f7 fd ff 44 89 ea 4c 89 e7 31 c9 be 13 00 00 00 e8 68 f7 fd ff 31 c0 41 5c 41 5d 5d c3 0f 1f 44 00 00 48 8b 87 c0 01 00 00 <40> 38 b0 db 01 00 00 74 23 55 31 d2 48 89 e5 40 84 f6 74 0c be 0b
Dec 29 08:28:14 pve kernel: [    7.739910] RSP: 0018:ffffb86640743918 EFLAGS: 00010286
Dec 29 08:28:14 pve kernel: [    7.739914] RAX: 0000000000000000 RBX: ffff9a6bc33a0000 RCX: 000000000000000a
Dec 29 08:28:14 pve kernel: [    7.739918] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9a6bc6b63400
Dec 29 08:28:14 pve kernel: [    7.739921] RBP: ffffb86640743938 R08: 000000000000000f R09: 0000000000000000
Dec 29 08:28:14 pve kernel: [    7.739924] R10: ffff9a6bc66fc801 R11: ffff9a6bc66fc800 R12: ffff9a6bc6b63400
Dec 29 08:28:14 pve kernel: [    7.739927] R13: ffffffffc1272300 R14: ffff9a6bc33a0010 R15: ffff9a6bc33a0000
Dec 29 08:28:14 pve kernel: [    7.739931] FS:  00007f5d7b9bc8c0(0000) GS:ffff9a729f400000(0000) knlGS:0000000000000000
Dec 29 08:28:14 pve kernel: [    7.739936] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 29 08:28:14 pve kernel: [    7.739940] CR2: 00000000000001db CR3: 00000001066ae000 CR4: 00000000001506f0
Dec 29 08:28:14 pve kernel: [    7.739944] Call Trace:
Dec 29 08:28:14 pve kernel: [    7.739950]  ? pp_set_powergating_by_smu+0x1ee/0x2b0 [amdgpu]
Dec 29 08:28:14 pve kernel: [    7.740262]  amdgpu_dpm_set_powergating_by_smu+0x70/0x100 [amdgpu]
Dec 29 08:28:14 pve kernel: [    7.740610]  ? amdgpu_dpm_set_powergating_by_smu+0x5/0x100 [amdgpu]
Dec 29 08:28:14 pve kernel: [    7.740936]  acp_hw_fini+0x154/0x160 [amdgpu]
Dec 29 08:28:14 pve kernel: [    7.741213]  amdgpu_device_fini+0x1d3/0x49f [amdgpu]
Dec 29 08:28:14 pve kernel: [    7.741521]  amdgpu_driver_unload_kms+0x43/0x70 [amdgpu]
Dec 29 08:28:14 pve kernel: [    7.741780]  amdgpu_driver_load_kms.cold+0x46/0x83 [amdgpu]
Dec 29 08:28:14 pve kernel: [    7.742104]  amdgpu_pci_probe+0x12a/0x1b0 [amdgpu]
Dec 29 08:28:14 pve kernel: [    7.742347]  local_pci_probe+0x48/0x80
Dec 29 08:28:14 pve kernel: [    7.742356]  pci_device_probe+0x105/0x1c0
Dec 29 08:28:14 pve kernel: [    7.742362]  really_probe+0x24b/0x4c0
Dec 29 08:28:14 pve kernel: [    7.742371]  driver_probe_device+0xf0/0x160
Dec 29 08:28:14 pve kernel: [    7.742375]  device_driver_attach+0xab/0xb0
Dec 29 08:28:14 pve kernel: [    7.742379]  __driver_attach+0xb2/0x140
Dec 29 08:28:14 pve kernel: [    7.742383]  ? device_driver_attach+0xb0/0xb0
Dec 29 08:28:14 pve kernel: [    7.742387]  bus_for_each_dev+0x7e/0xc0
Dec 29 08:28:14 pve kernel: [    7.742391]  driver_attach+0x1e/0x20
Dec 29 08:28:14 pve kernel: [    7.742395]  bus_add_driver+0x135/0x1f0
Dec 29 08:28:14 pve kernel: [    7.742398]  driver_register+0x91/0xf0
Dec 29 08:28:14 pve kernel: [    7.742402]  __pci_register_driver+0x57/0x60
Dec 29 08:28:14 pve kernel: [    7.742406]  amdgpu_init+0x77/0x1000 [amdgpu]
Dec 29 08:28:14 pve kernel: [    7.742652]  ? 0xffffffffc14ae000
Dec 29 08:28:14 pve kernel: [    7.742655]  do_one_initcall+0x48/0x1d0
Dec 29 08:28:14 pve kernel: [    7.742663]  ? kmem_cache_alloc_trace+0xfb/0x240
Dec 29 08:28:14 pve kernel: [    7.742669]  do_init_module+0x62/0x290
Dec 29 08:28:14 pve kernel: [    7.742674]  load_module+0x265e/0x2720
Dec 29 08:28:14 pve kernel: [    7.742679]  __do_sys_finit_module+0xc2/0x120
Dec 29 08:28:14 pve kernel: [    7.742683]  __x64_sys_finit_module+0x1a/0x20
Dec 29 08:28:14 pve kernel: [    7.742687]  do_syscall_64+0x61/0xb0
Dec 29 08:28:14 pve kernel: [    7.742694]  ? syscall_exit_to_user_mode+0x27/0x50
Dec 29 08:28:14 pve kernel: [    7.742697]  ? __x64_sys_newstat+0x16/0x20
Dec 29 08:28:14 pve kernel: [    7.742702]  ? do_syscall_64+0x6e/0xb0
Dec 29 08:28:14 pve kernel: [    7.742706]  ? do_syscall_64+0x6e/0xb0
Dec 29 08:28:14 pve kernel: [    7.742709]  ? do_syscall_64+0x6e/0xb0
Dec 29 08:28:14 pve kernel: [    7.742713]  ? sysvec_apic_timer_interrupt+0x4e/0x90
Dec 29 08:28:14 pve kernel: [    7.742717]  ? asm_sysvec_apic_timer_interrupt+0xa/0x20
Dec 29 08:28:14 pve kernel: [    7.742722]  entry_SYSCALL_64_after_hwframe+0x44/0xae
Dec 29 08:28:14 pve kernel: [    7.742726] RIP: 0033:0x7f5d7b7ff9b9
Dec 29 08:28:14 pve kernel: [    7.742732] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d a7 54 0c 00 f7 d8 64 89 01 48
Dec 29 08:28:14 pve kernel: [    7.742738] RSP: 002b:00007fffd7e7d098 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
Dec 29 08:28:14 pve kernel: [    7.742743] RAX: ffffffffffffffda RBX: 000056389ebf2450 RCX: 00007f5d7b7ff9b9
Dec 29 08:28:14 pve kernel: [    7.742747] RDX: 0000000000000000 RSI: 00007f5d7b9a3e2d RDI: 000000000000001a
Dec 29 08:28:14 pve kernel: [    7.742750] RBP: 0000000000020000 R08: 0000000000000000 R09: 000056389ebf2200
Dec 29 08:28:14 pve kernel: [    7.742753] R10: 000000000000001a R11: 0000000000000246 R12: 00007f5d7b9a3e2d
Dec 29 08:28:14 pve kernel: [    7.742756] R13: 0000000000000000 R14: 000056389ebf2280 R15: 000056389ebf2450
Dec 29 08:28:14 pve kernel: [    7.742761] Modules linked in: kvm_amd(+) ccp kvm fjes(-) irqbypass crct10dif_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd amdgpu(+) pcspkr efi_pstore fam15h_power k10temp iommu_v2 gpu_sched drm_ttm_helper ttm drm_kms_helper cec rc_core i2c_algo_bit fb_sys_fops syscopyarea sysfillrect sysimgblt 8250_dw mac_hid vhost_net vhost vhost_iotlb tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi drm sunrpc ip_tables x_tables autofs4 zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) btrfs blake2b_generic xor zstd_compress raid6_pq libcrc32c xhci_pci xhci_pci_renesas ehci_pci ahci crc32_pclmul tg3 xhci_hcd i2c_piix4 libahci ehci_hcd video
Dec 29 08:28:14 pve kernel: [    7.742843] CR2: 00000000000001db
Dec 29 08:28:14 pve kernel: [    7.742927] ---[ end trace d747077e97d28095 ]---
Dec 29 08:28:14 pve kernel: [    7.838380] RIP: 0010:smu8_dpm_powergate_acp+0xc/0x40 [amdgpu]
Dec 29 08:28:14 pve kernel: [    7.839403] Code: 7a f7 fd ff 44 89 ea 4c 89 e7 31 c9 be 13 00 00 00 e8 68 f7 fd ff 31 c0 41 5c 41 5d 5d c3 0f 1f 44 00 00 48 8b 87 c0 01 00 00 <40> 38 b0 db 01 00 00 74 23 55 31 d2 48 89 e5 40 84 f6 74 0c be 0b
Dec 29 08:28:14 pve kernel: [    7.839411] RSP: 0018:ffffb86640743918 EFLAGS: 00010286
Dec 29 08:28:14 pve kernel: [    7.839417] RAX: 0000000000000000 RBX: ffff9a6bc33a0000 RCX: 000000000000000a
Dec 29 08:28:14 pve kernel: [    7.839420] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9a6bc6b63400
Dec 29 08:28:14 pve kernel: [    7.839424] RBP: ffffb86640743938 R08: 000000000000000f R09: 0000000000000000
Dec 29 08:28:14 pve kernel: [    7.839427] R10: ffff9a6bc66fc801 R11: ffff9a6bc66fc800 R12: ffff9a6bc6b63400
Dec 29 08:28:14 pve kernel: [    7.839430] R13: ffffffffc1272300 R14: ffff9a6bc33a0010 R15: ffff9a6bc33a0000
Dec 29 08:28:14 pve kernel: [    7.839433] FS:  00007f5d7b9bc8c0(0000) GS:ffff9a729f400000(0000) knlGS:0000000000000000
Dec 29 08:28:14 pve kernel: [    7.839437] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 29 08:28:14 pve kernel: [    7.839440] CR2: 00000000000001db CR3: 00000001066ae000 CR4: 00000000001506f0
Dec 29 08:28:14 pve kernel: [    7.853846] MCE: In-kernel MCE decoding enabled.
Dec 29 08:28:14 pve kernel: [    7.855756] EDAC amd64: MCT channel count: 2
Dec 29 08:28:14 pve kernel: [    7.855865] EDAC MC0: Giving out device to module amd64_edac controller F15h_M60h: DEV 0000:00:18.3 (INTERRUPT)
Dec 29 08:28:14 pve kernel: [    7.855871] EDAC amd64: F15h_M60h detected (node 0).
Dec 29 08:28:14 pve kernel: [    7.855875] EDAC MC: DCT0 chip selects:
Dec 29 08:28:14 pve kernel: [    7.855878] EDAC amd64: MC: 0:  8192MB 1:  8192MB
Dec 29 08:28:14 pve kernel: [    7.855881] EDAC amd64: MC: 2:     0MB 3:     0MB
Dec 29 08:28:14 pve kernel: [    7.855884] EDAC amd64: MC: 4:     0MB 5:     0MB
Dec 29 08:28:14 pve kernel: [    7.855887] EDAC amd64: MC: 6:     0MB 7:     0MB
Dec 29 08:28:14 pve kernel: [    7.855890] EDAC MC: DCT1 chip selects:
Dec 29 08:28:14 pve kernel: [    7.855892] EDAC amd64: MC: 0:  8192MB 1:  8192MB
Dec 29 08:28:14 pve kernel: [    7.855895] EDAC amd64: MC: 2:     0MB 3:     0MB
Dec 29 08:28:14 pve kernel: [    7.855897] EDAC amd64: MC: 4:     0MB 5:     0MB
Dec 29 08:28:14 pve kernel: [    7.855900] EDAC amd64: MC: 6:     0MB 7:     0MB
Dec 29 08:28:14 pve kernel: [    7.855902] EDAC amd64: using x8 syndromes.
Dec 29 08:28:14 pve kernel: [    7.855919] EDAC PCI0: Giving out device to module amd64_edac controller EDAC PCI controller: DEV 0000:00:18.2 (POLLED)
Dec 29 08:28:14 pve kernel: [    7.855924] AMD64 EDAC driver v3.5.0

(See attached kern.log for complete boot process)
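
To pull the relevant lines out of a full kern.log without reading the whole boot sequence, something like the following rough Python sketch could help. It only assumes the line shapes visible in the excerpt above ("BUG:", "Oops:", and a kernel-mode "RIP: 0010:..." line); the function name `summarize_oops` is made up for illustration, and other kernels may format these lines differently.

```python
import re

# Patterns modelled on the log excerpt above (an assumption, not a kernel spec):
#   "... ] BUG: <message>"
#   "... ] Oops: <code> [#<n>] ..."
#   "... ] RIP: 0010:<symbol>+0x<off>/0x<len> [<module>]"
BUG_RE = re.compile(r"\] BUG: (?P<msg>.+)$")
OOPS_RE = re.compile(r"\] Oops: (?P<code>\S+) \[#(?P<count>\d+)\]")
RIP_RE = re.compile(
    r"\] RIP: 0010:(?P<symbol>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?: \[(?P<module>\w+)\])?"
)


def summarize_oops(lines):
    """Return one dict per 'BUG:' line, filled with the first Oops code
    and the first kernel-mode RIP symbol/module that follow it."""
    reports = []
    current = None
    for line in lines:
        m = BUG_RE.search(line)
        if m:
            # A new BUG line starts a new report.
            current = {"bug": m.group("msg"), "oops": None,
                       "symbol": None, "module": None}
            reports.append(current)
            continue
        if current is None:
            continue  # nothing to attach details to yet
        m = OOPS_RE.search(line)
        if m and current["oops"] is None:
            current["oops"] = m.group("code")
            continue
        m = RIP_RE.search(line)
        if m and current["symbol"] is None:
            current["symbol"] = m.group("symbol")
            current["module"] = m.group("module")
    return reports

# Possible usage (the file name is whatever log you have at hand):
#   with open("kern.log") as f:
#       for report in summarize_oops(f):
#           print(report)
```

For the excerpt above, this would report the faulting function `smu8_dpm_powergate_acp` in the `amdgpu` module. The user-space "RIP: 0033:..." line is deliberately not matched, since it only shows where the syscall came from.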

Hardware is an HP MicroServer Gen10 with 32GB RAM.
I was eventually able to restart using an older 5.11 kernel ("slightly" inconvenient because it is a headless server in a hard-to-reach place). For now I edited the loader.conf on the EFI partition to start with the 5.11 kernel instead of the 5.13 kernel.
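
In case it is useful to someone else, here is a minimal sketch of that kind of edit in Python. It assumes a systemd-boot style loader.conf in which a `default` line selects the boot entry; the helper name `pin_default_entry` and the entry name in the usage comment are placeholders, not the real names on my system. Boot tooling may rewrite this file on a later kernel update, in which case the pin would have to be applied again.

```python
def pin_default_entry(conf_text, entry):
    """Return loader.conf text whose single 'default' line points at `entry`.

    All other lines are kept unchanged. If the input has no 'default'
    line, one is appended at the end.
    """
    out = []
    replaced = False
    for line in conf_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "default":
            if not replaced:
                out.append(f"default {entry}")
                replaced = True
            # Any further 'default' lines are dropped so only one remains.
            continue
        out.append(line)
    if not replaced:
        out.append(f"default {entry}")
    return "\n".join(out) + "\n"

# Possible usage (the entry name is a placeholder; use the name of the
# 5.11 entry that actually exists on the EFI partition):
#   new_text = pin_default_entry(old_text, "example-5.11-entry.conf")
```

The function only transforms text and never touches the disk, so the result can be reviewed before anything is written back to the EFI partition.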

How should I proceed from here? How will I know whether this is fixed in a future update, or whether it will happen again when I do the next "apt upgrade"?
I don't run apt upgrade automatically; I always do it manually, just in case something weird happens (like today).

Is there any information I can gather to help you analyze/solve this issue?
 

Thanks Mira for the quick response.
So my options are either to stay with the 5.11 kernel until 7.2 is released or to update to the 5.15 kernel?

I guess I'll stick with 5.11 for a few months then. Or would that cause other issues, since the rest of PVE is updated to 7.1 (the current version)?
 
Hard to say. There might be issues with future upgrades, because we only test changes on the current (and future) kernels.