This morning I updated my PVE-server from 7.0 to 7.1, using the enterprise-repository.
After installing all updates I rebooted the server and it was extremely slow and the containers didn't even start (or at least not within 5 minutes).
In the logs I found kernel entries about a NULL pointer dereference, flagged with "BUG", followed a few lines later by "Oops" and a stack trace.
It seems to be related to the GPU (the amdgpu driver), which I don't care much about anyway since I run the server headless.
Code:
Dec 29 08:28:14 pve kernel: [ 7.735608] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
Dec 29 08:28:14 pve kernel: [ 7.735751] kfd kfd: amdgpu: error getting iommu info. is the iommu enabled?
Dec 29 08:28:14 pve kernel: [ 7.735757] kfd kfd: amdgpu: Error initializing iommuv2
Dec 29 08:28:14 pve kernel: [ 7.736873] kfd kfd: amdgpu: device 1002:9874 NOT added due to errors
Dec 29 08:28:14 pve kernel: [ 7.736882] kfd kfd: amdgpu: Failed to resume IOMMU for device 1002:9874
Dec 29 08:28:14 pve kernel: [ 7.736890] amdgpu 0000:00:01.0: amdgpu: amdgpu_device_ip_init failed
Dec 29 08:28:14 pve kernel: [ 7.736897] amdgpu 0000:00:01.0: amdgpu: Fatal error during GPU init
Dec 29 08:28:14 pve kernel: [ 7.736904] amdgpu 0000:00:01.0: amdgpu: amdgpu: finishing device.
Dec 29 08:28:14 pve kernel: [ 7.739303] BUG: kernel NULL pointer dereference, address: 00000000000001db
Dec 29 08:28:14 pve kernel: [ 7.739310] #PF: supervisor read access in kernel mode
Dec 29 08:28:14 pve kernel: [ 7.739314] #PF: error_code(0x0000) - not-present page
Dec 29 08:28:14 pve kernel: [ 7.739318] PGD 0 P4D 0
Dec 29 08:28:14 pve kernel: [ 7.739324] Oops: 0000 [#1] SMP NOPTI
Dec 29 08:28:14 pve kernel: [ 7.739329] CPU: 0 PID: 692 Comm: systemd-udevd Tainted: P O 5.13.19-2-pve #1
Dec 29 08:28:14 pve kernel: [ 7.739335] Hardware name: HPE ProLiant MicroServer Gen10/ProLiant MicroServer Gen10, BIOS 5.12 06/26/2018
Dec 29 08:28:14 pve kernel: [ 7.739340] RIP: 0010:smu8_dpm_powergate_acp+0xc/0x40 [amdgpu]
Dec 29 08:28:14 pve kernel: [ 7.739902] Code: 7a f7 fd ff 44 89 ea 4c 89 e7 31 c9 be 13 00 00 00 e8 68 f7 fd ff 31 c0 41 5c 41 5d 5d c3 0f 1f 44 00 00 48 8b 87 c0 01 00 00 <40> 38 b0 db 01 00 00 74 23 55 31 d2 48 89 e5 40 84 f6 74 0c be 0b
Dec 29 08:28:14 pve kernel: [ 7.739910] RSP: 0018:ffffb86640743918 EFLAGS: 00010286
Dec 29 08:28:14 pve kernel: [ 7.739914] RAX: 0000000000000000 RBX: ffff9a6bc33a0000 RCX: 000000000000000a
Dec 29 08:28:14 pve kernel: [ 7.739918] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9a6bc6b63400
Dec 29 08:28:14 pve kernel: [ 7.739921] RBP: ffffb86640743938 R08: 000000000000000f R09: 0000000000000000
Dec 29 08:28:14 pve kernel: [ 7.739924] R10: ffff9a6bc66fc801 R11: ffff9a6bc66fc800 R12: ffff9a6bc6b63400
Dec 29 08:28:14 pve kernel: [ 7.739927] R13: ffffffffc1272300 R14: ffff9a6bc33a0010 R15: ffff9a6bc33a0000
Dec 29 08:28:14 pve kernel: [ 7.739931] FS: 00007f5d7b9bc8c0(0000) GS:ffff9a729f400000(0000) knlGS:0000000000000000
Dec 29 08:28:14 pve kernel: [ 7.739936] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 29 08:28:14 pve kernel: [ 7.739940] CR2: 00000000000001db CR3: 00000001066ae000 CR4: 00000000001506f0
Dec 29 08:28:14 pve kernel: [ 7.739944] Call Trace:
Dec 29 08:28:14 pve kernel: [ 7.739950] ? pp_set_powergating_by_smu+0x1ee/0x2b0 [amdgpu]
Dec 29 08:28:14 pve kernel: [ 7.740262] amdgpu_dpm_set_powergating_by_smu+0x70/0x100 [amdgpu]
Dec 29 08:28:14 pve kernel: [ 7.740610] ? amdgpu_dpm_set_powergating_by_smu+0x5/0x100 [amdgpu]
Dec 29 08:28:14 pve kernel: [ 7.740936] acp_hw_fini+0x154/0x160 [amdgpu]
Dec 29 08:28:14 pve kernel: [ 7.741213] amdgpu_device_fini+0x1d3/0x49f [amdgpu]
Dec 29 08:28:14 pve kernel: [ 7.741521] amdgpu_driver_unload_kms+0x43/0x70 [amdgpu]
Dec 29 08:28:14 pve kernel: [ 7.741780] amdgpu_driver_load_kms.cold+0x46/0x83 [amdgpu]
Dec 29 08:28:14 pve kernel: [ 7.742104] amdgpu_pci_probe+0x12a/0x1b0 [amdgpu]
Dec 29 08:28:14 pve kernel: [ 7.742347] local_pci_probe+0x48/0x80
Dec 29 08:28:14 pve kernel: [ 7.742356] pci_device_probe+0x105/0x1c0
Dec 29 08:28:14 pve kernel: [ 7.742362] really_probe+0x24b/0x4c0
Dec 29 08:28:14 pve kernel: [ 7.742371] driver_probe_device+0xf0/0x160
Dec 29 08:28:14 pve kernel: [ 7.742375] device_driver_attach+0xab/0xb0
Dec 29 08:28:14 pve kernel: [ 7.742379] __driver_attach+0xb2/0x140
Dec 29 08:28:14 pve kernel: [ 7.742383] ? device_driver_attach+0xb0/0xb0
Dec 29 08:28:14 pve kernel: [ 7.742387] bus_for_each_dev+0x7e/0xc0
Dec 29 08:28:14 pve kernel: [ 7.742391] driver_attach+0x1e/0x20
Dec 29 08:28:14 pve kernel: [ 7.742395] bus_add_driver+0x135/0x1f0
Dec 29 08:28:14 pve kernel: [ 7.742398] driver_register+0x91/0xf0
Dec 29 08:28:14 pve kernel: [ 7.742402] __pci_register_driver+0x57/0x60
Dec 29 08:28:14 pve kernel: [ 7.742406] amdgpu_init+0x77/0x1000 [amdgpu]
Dec 29 08:28:14 pve kernel: [ 7.742652] ? 0xffffffffc14ae000
Dec 29 08:28:14 pve kernel: [ 7.742655] do_one_initcall+0x48/0x1d0
Dec 29 08:28:14 pve kernel: [ 7.742663] ? kmem_cache_alloc_trace+0xfb/0x240
Dec 29 08:28:14 pve kernel: [ 7.742669] do_init_module+0x62/0x290
Dec 29 08:28:14 pve kernel: [ 7.742674] load_module+0x265e/0x2720
Dec 29 08:28:14 pve kernel: [ 7.742679] __do_sys_finit_module+0xc2/0x120
Dec 29 08:28:14 pve kernel: [ 7.742683] __x64_sys_finit_module+0x1a/0x20
Dec 29 08:28:14 pve kernel: [ 7.742687] do_syscall_64+0x61/0xb0
Dec 29 08:28:14 pve kernel: [ 7.742694] ? syscall_exit_to_user_mode+0x27/0x50
Dec 29 08:28:14 pve kernel: [ 7.742697] ? __x64_sys_newstat+0x16/0x20
Dec 29 08:28:14 pve kernel: [ 7.742702] ? do_syscall_64+0x6e/0xb0
Dec 29 08:28:14 pve kernel: [ 7.742706] ? do_syscall_64+0x6e/0xb0
Dec 29 08:28:14 pve kernel: [ 7.742709] ? do_syscall_64+0x6e/0xb0
Dec 29 08:28:14 pve kernel: [ 7.742713] ? sysvec_apic_timer_interrupt+0x4e/0x90
Dec 29 08:28:14 pve kernel: [ 7.742717] ? asm_sysvec_apic_timer_interrupt+0xa/0x20
Dec 29 08:28:14 pve kernel: [ 7.742722] entry_SYSCALL_64_after_hwframe+0x44/0xae
Dec 29 08:28:14 pve kernel: [ 7.742726] RIP: 0033:0x7f5d7b7ff9b9
Dec 29 08:28:14 pve kernel: [ 7.742732] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d a7 54 0c 00 f7 d8 64 89 01 48
Dec 29 08:28:14 pve kernel: [ 7.742738] RSP: 002b:00007fffd7e7d098 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
Dec 29 08:28:14 pve kernel: [ 7.742743] RAX: ffffffffffffffda RBX: 000056389ebf2450 RCX: 00007f5d7b7ff9b9
Dec 29 08:28:14 pve kernel: [ 7.742747] RDX: 0000000000000000 RSI: 00007f5d7b9a3e2d RDI: 000000000000001a
Dec 29 08:28:14 pve kernel: [ 7.742750] RBP: 0000000000020000 R08: 0000000000000000 R09: 000056389ebf2200
Dec 29 08:28:14 pve kernel: [ 7.742753] R10: 000000000000001a R11: 0000000000000246 R12: 00007f5d7b9a3e2d
Dec 29 08:28:14 pve kernel: [ 7.742756] R13: 0000000000000000 R14: 000056389ebf2280 R15: 000056389ebf2450
Dec 29 08:28:14 pve kernel: [ 7.742761] Modules linked in: kvm_amd(+) ccp kvm fjes(-) irqbypass crct10dif_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd amdgpu(+) pcspkr efi_pstore fam15h_power k10temp iommu_v2 gpu_sched drm_ttm_helper ttm drm_kms_helper cec rc_core i2c_algo_bit fb_sys_fops syscopyarea sysfillrect sysimgblt 8250_dw mac_hid vhost_net vhost vhost_iotlb tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi drm sunrpc ip_tables x_tables autofs4 zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) btrfs blake2b_generic xor zstd_compress raid6_pq libcrc32c xhci_pci xhci_pci_renesas ehci_pci ahci crc32_pclmul tg3 xhci_hcd i2c_piix4 libahci ehci_hcd video
Dec 29 08:28:14 pve kernel: [ 7.742843] CR2: 00000000000001db
Dec 29 08:28:14 pve kernel: [ 7.742927] ---[ end trace d747077e97d28095 ]---
Dec 29 08:28:14 pve kernel: [ 7.838380] RIP: 0010:smu8_dpm_powergate_acp+0xc/0x40 [amdgpu]
Dec 29 08:28:14 pve kernel: [ 7.839403] Code: 7a f7 fd ff 44 89 ea 4c 89 e7 31 c9 be 13 00 00 00 e8 68 f7 fd ff 31 c0 41 5c 41 5d 5d c3 0f 1f 44 00 00 48 8b 87 c0 01 00 00 <40> 38 b0 db 01 00 00 74 23 55 31 d2 48 89 e5 40 84 f6 74 0c be 0b
Dec 29 08:28:14 pve kernel: [ 7.839411] RSP: 0018:ffffb86640743918 EFLAGS: 00010286
Dec 29 08:28:14 pve kernel: [ 7.839417] RAX: 0000000000000000 RBX: ffff9a6bc33a0000 RCX: 000000000000000a
Dec 29 08:28:14 pve kernel: [ 7.839420] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9a6bc6b63400
Dec 29 08:28:14 pve kernel: [ 7.839424] RBP: ffffb86640743938 R08: 000000000000000f R09: 0000000000000000
Dec 29 08:28:14 pve kernel: [ 7.839427] R10: ffff9a6bc66fc801 R11: ffff9a6bc66fc800 R12: ffff9a6bc6b63400
Dec 29 08:28:14 pve kernel: [ 7.839430] R13: ffffffffc1272300 R14: ffff9a6bc33a0010 R15: ffff9a6bc33a0000
Dec 29 08:28:14 pve kernel: [ 7.839433] FS: 00007f5d7b9bc8c0(0000) GS:ffff9a729f400000(0000) knlGS:0000000000000000
Dec 29 08:28:14 pve kernel: [ 7.839437] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 29 08:28:14 pve kernel: [ 7.839440] CR2: 00000000000001db CR3: 00000001066ae000 CR4: 00000000001506f0
Dec 29 08:28:14 pve kernel: [ 7.853846] MCE: In-kernel MCE decoding enabled.
Dec 29 08:28:14 pve kernel: [ 7.855756] EDAC amd64: MCT channel count: 2
Dec 29 08:28:14 pve kernel: [ 7.855865] EDAC MC0: Giving out device to module amd64_edac controller F15h_M60h: DEV 0000:00:18.3 (INTERRUPT)
Dec 29 08:28:14 pve kernel: [ 7.855871] EDAC amd64: F15h_M60h detected (node 0).
Dec 29 08:28:14 pve kernel: [ 7.855875] EDAC MC: DCT0 chip selects:
Dec 29 08:28:14 pve kernel: [ 7.855878] EDAC amd64: MC: 0: 8192MB 1: 8192MB
Dec 29 08:28:14 pve kernel: [ 7.855881] EDAC amd64: MC: 2: 0MB 3: 0MB
Dec 29 08:28:14 pve kernel: [ 7.855884] EDAC amd64: MC: 4: 0MB 5: 0MB
Dec 29 08:28:14 pve kernel: [ 7.855887] EDAC amd64: MC: 6: 0MB 7: 0MB
Dec 29 08:28:14 pve kernel: [ 7.855890] EDAC MC: DCT1 chip selects:
Dec 29 08:28:14 pve kernel: [ 7.855892] EDAC amd64: MC: 0: 8192MB 1: 8192MB
Dec 29 08:28:14 pve kernel: [ 7.855895] EDAC amd64: MC: 2: 0MB 3: 0MB
Dec 29 08:28:14 pve kernel: [ 7.855897] EDAC amd64: MC: 4: 0MB 5: 0MB
Dec 29 08:28:14 pve kernel: [ 7.855900] EDAC amd64: MC: 6: 0MB 7: 0MB
Dec 29 08:28:14 pve kernel: [ 7.855902] EDAC amd64: using x8 syndromes.
Dec 29 08:28:14 pve kernel: [ 7.855919] EDAC PCI0: Giving out device to module amd64_edac controller EDAC PCI controller: DEV 0000:00:18.2 (POLLED)
Dec 29 08:28:14 pve kernel: [ 7.855924] AMD64 EDAC driver v3.5.0
(See attached kern.log for complete boot process)
Hardware is an HP MicroServer Gen10 with 32GB RAM.
I was eventually able to restart using an older 5.11 kernel ("slightly" inconvenient because it is a headless server in a hard-to-reach place). For now I edited the loader.conf on the EFI partition to boot the 5.11 kernel instead of the 5.13 kernel.
How should I proceed from here? How will I know whether this is fixed in a future update, or whether it will happen again the next time I run "apt upgrade"?
I don't run apt upgrade automatically, always manually, just in case something weird happens (like today).
Is there any information I can gather to help you analyze/solve this issue?
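In case it helps with triage, here is a rough sketch of how I could pull the relevant lines out of kern.log instead of pasting the whole boot. It only looks for the "BUG:"/"Oops:" markers and the kernel-mode "RIP:" lines seen in the excerpt above. The function name and the regular expressions are my own and only assume the log format shown in this post, so treat it as illustrative rather than a proper tool.

```python
import re

# Markers the kernel prints when it hits a fatal fault (as seen in the log above).
FAULT = re.compile(r"\b(BUG:|Oops:)")

# Kernel-mode faulting location, e.g.
#   "RIP: 0010:smu8_dpm_powergate_acp+0xc/0x40 [amdgpu]"
# The "[module]" suffix is optional (absent for built-in code).
# User-space lines such as "RIP: 0033:0x7f..." are deliberately not matched.
RIP = re.compile(
    r"RIP: 0010:(?P<symbol>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?: \[(?P<module>\w+)\])?"
)


def summarize_faults(lines):
    """Scan kern.log lines and return (fault_lines, locations).

    fault_lines: every line containing a BUG:/Oops: marker, stripped.
    locations:   unique (symbol, module) pairs from kernel-mode RIP lines,
                 in order of first appearance; module is None if not shown.
    """
    fault_lines = []
    locations = []
    for line in lines:
        if FAULT.search(line):
            fault_lines.append(line.strip())
        match = RIP.search(line)
        if match:
            location = (match.group("symbol"), match.group("module"))
            if location not in locations:
                locations.append(location)
    return fault_lines, locations
```

To use it I would open the log file (on my system /var/log/kern.log) and pass the file object to `summarize_faults`. For the boot above I would expect it to point at `smu8_dpm_powergate_acp` in the `amdgpu` module.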