One of my PVE hosts was no longer reachable via the web GUI and had crashed completely (no syslog entries for hours), so I used Hetzner's "Automatischer Hardware-Reset" (automatic hardware reset) option to trigger a reset and reboot the system.
I found the following in /var/log/kern.log:
Code:
Jun 8 16:00:17 13900HostHel kernel: [6660351.394271] BUG: kernel NULL pointer dereference, address: 0000000000000000
Jun 8 16:00:17 13900HostHel kernel: [6660351.394276] #PF: supervisor instruction fetch in kernel mode
Jun 8 16:00:17 13900HostHel kernel: [6660351.394277] #PF: error_code(0x0010) - not-present page
Jun 8 16:00:17 13900HostHel kernel: [6660351.394278] PGD 0 P4D 0
Jun 8 16:00:17 13900HostHel kernel: [6660351.394280] Thread overran stack, or stack corrupted
Jun 8 16:00:17 13900HostHel kernel: [6660351.394280] Oops: 0010 [#1] SMP NOPTI
Jun 8 16:00:17 13900HostHel kernel: [6660351.394282] CPU: 0 PID: 3749064 Comm: kthreadd Tainted: P O 5.15.102-1-pve #1
Jun 8 16:00:17 13900HostHel kernel: [6660351.394284] Hardware name: Hetzner /W680D4U-1L, BIOS 10.23 10/13/2022
Jun 8 16:00:17 13900HostHel kernel: [6660351.394284] RIP: 0010:0x0
Jun 8 16:00:17 13900HostHel kernel: [6660351.394288] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
Jun 8 16:00:17 13900HostHel kernel: [6660351.394288] RSP: 0018:ffffa73f41423f58 EFLAGS: 00010092
Jun 8 16:00:17 13900HostHel kernel: [6660351.394290] RAX: ffffffffa081b440 RBX: 0000000000000000 RCX: 0000000000000000
Jun 8 16:00:17 13900HostHel kernel: [6660351.394291] RDX: 0000000000000800 RSI: ffff98dedc174bc0 RDI: ffffffffa081b440
Jun 8 16:00:17 13900HostHel kernel: [6660351.394292] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
Jun 8 16:00:17 13900HostHel kernel: [6660351.394292] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
Jun 8 16:00:17 13900HostHel kernel: [6660351.394293] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
Jun 8 16:00:17 13900HostHel kernel: [6660351.394294] FS: 0000000000000000(0000) GS:ffff98f8bf000000(0000) knlGS:0000000000000000
Jun 8 16:00:17 13900HostHel kernel: [6660351.394295] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 8 16:00:17 13900HostHel kernel: [6660351.394296] CR2: ffffffffffffffd6 CR3: 00000016e4410002 CR4: 0000000000772ef0
Jun 8 16:00:17 13900HostHel kernel: [6660351.394297] PKRU: 55555554
Jun 8 16:00:17 13900HostHel kernel: [6660351.394298] Call Trace:
Jun 8 16:00:17 13900HostHel kernel: [6660351.394299] WARNING: kernel stack frame pointer at 00000000af54c61a in kthreadd:3749064 has bad value 0000000000000000
Jun 8 16:00:17 13900HostHel kernel: [6660351.394300] unwind stack type:0 next_sp:0000000000000000 mask:0x2 graph_idx:0
Jun 8 16:00:17 13900HostHel kernel: [6660351.394301] 00000000af54c61a: 0000000000000000 ...
Jun 8 16:00:17 13900HostHel kernel: [6660351.394301] <TASK>
Jun 8 16:00:17 13900HostHel kernel: [6660351.394304] </TASK>
Jun 8 16:00:17 13900HostHel kernel: [6660351.394304] Modules linked in: nft_limit nft_counter nft_compat cfg80211 xt_MASQUERADE xt_mark iptable_nat ip6table_nat nf_nat xt_recent tcp_diag inet_diag binfmt_misc veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw nf_tables bonding tls softdog nfnetlink_log nfnetlink i915 snd_sof_pci_intel_tgl snd_sof_intel_hda_common soundwire_intel soundwire_generic_allocation soundwire_cadence snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_soc_hdac_hda snd_hda_ext_core ttm snd_soc_acpi_intel_match snd_soc_acpi soundwire_bus ledtrig_audio intel_rapl_msr drm_kms_helper intel_rapl_common snd_soc_core x86_pkg_temp_thermal cec intel_powerclamp rc_core fb_sys_fops snd_compress syscopyarea sysfillrect ac97_bus coretemp sysimgblt snd_pcm_dmaengine snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi kvm_intel snd_hda_codec snd_hda_core snd_hwdep mei_hdcp snd_pcm kvm irqbypass crct10dif_pclmul ghash_clmulni_intel snd_timer aesni_intel snd crypto_simd cryptd wmi_bmof pcspkr soundcore
Jun 8 16:00:17 13900HostHel kernel: [6660351.394329] efi_pstore mei_me mei ip6t_REJECT nf_reject_ipv6 acpi_tad acpi_pad xt_hl mac_hid ip6t_rt ipt_REJECT nf_reject_ipv4 xt_LOG nf_log_syslog vhost_net xt_limit vhost vhost_iotlb tap xt_addrtype ib_iser xt_tcpudp rdma_cm iw_cm xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ib_cm ib_core ip6table_filter ip6_tables iscsi_tcp libiscsi_tcp iptable_filter libiscsi bpfilter scsi_transport_iscsi drm sunrpc ip_tables x_tables autofs4 zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) btrfs blake2b_generic xor zstd_compress raid6_pq libcrc32c simplefb xhci_pci igb xhci_pci_renesas crc32_pclmul nvme intel_lpss_pci i2c_i801 i2c_algo_bit ahci intel_lpss i2c_smbus dca xhci_hcd libahci idma64 nvme_core wmi video [last unloaded: cpuid]
Jun 8 16:00:17 13900HostHel kernel: [6660351.394358] CR2: 0000000000000000
Jun 8 16:00:17 13900HostHel kernel: [6660351.394359] ---[ end trace 12eaba8011fa24d2 ]---
Jun 8 16:00:17 13900HostHel kernel: [6660351.482635] RIP: 0010:0x0
Jun 8 16:00:17 13900HostHel kernel: [6660351.482642] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
Jun 8 16:00:17 13900HostHel kernel: [6660351.482644] RSP: 0018:ffffa73f41423f58 EFLAGS: 00010092
[...]
Similar messages continue until shortly before 16:07; after that, the next entries only appear from the hardware reset at 19:31 onward. Apparently the system had crashed or frozen completely.
Code:
[...]
Jun 8 16:06:59 13900HostHel kernel: [6660753.633378] RIP: 0010:smp_call_function_single+0xed/0x130
Jun 8 16:06:59 13900HostHel kernel: [6660753.633384] Code: c3 cc cc cc cc 48 89 e6 4c 89 44 24 10 48 89 54 24 18 e8 26 fe ff ff 41 89 c0 8b 44 24 08 a8 01 74 0a f3 90 8b 44 24 08 a8 01 <75> f6 eb be 9c 58 0f 1f 44 00 00 f6 c4 02 0f 85 65 ff ff ff 8b 05
Jun 8 16:06:59 13900HostHel kernel: [6660753.633387] RSP: 0018:ffffa73f8a8cfa60 EFLAGS: 00000202
Jun 8 16:06:59 13900HostHel kernel: [6660753.633389] RAX: 0000000000000011 RBX: ffff98f032de0000 RCX: ffffa73f8aa53a60
Jun 8 16:06:59 13900HostHel kernel: [6660753.633391] RDX: ffff98f8bf571bc0 RSI: ffffa73f8a8cfa60 RDI: ffffa73f8a8cfa60
Jun 8 16:06:59 13900HostHel kernel: [6660753.633392] RBP: ffffa73f8a8cfab8 R08: 0000000000000000 R09: 0000000000000015
Jun 8 16:06:59 13900HostHel kernel: [6660753.633394] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000013
Jun 8 16:06:59 13900HostHel kernel: [6660753.633396] R13: 0000000000000013 R14: 0000000000000000 R15: ffff98f8bf4f0bc0
Jun 8 16:06:59 13900HostHel kernel: [6660753.633397] FS: 00007f0ec4ff9700(0000) GS:ffff98f8bf4c0000(0000) knlGS:0000000000000000
Jun 8 16:06:59 13900HostHel kernel: [6660753.633399] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 8 16:06:59 13900HostHel kernel: [6660753.633401] CR2: 00000c05862cc075 CR3: 0000001c4460c005 CR4: 0000000000772ee0
Jun 8 16:06:59 13900HostHel kernel: [6660753.633403] PKRU: 55555554
Jun 8 16:06:59 13900HostHel kernel: [6660753.633404] Call Trace:
Jun 8 16:06:59 13900HostHel kernel: [6660753.633406] <TASK>
Jun 8 16:06:59 13900HostHel kernel: [6660753.633409] ? crash_vmclear_local_loaded_vmcss+0x160/0x160 [kvm_intel]
Jun 8 16:06:59 13900HostHel kernel: [6660753.633419] vmx_vcpu_load_vmcs+0x15d/0x4e0 [kvm_intel]
Jun 8 16:06:59 13900HostHel kernel: [6660753.633425] vmx_vcpu_load+0x19/0x40 [kvm_intel]
Jun 8 16:06:59 13900HostHel kernel: [6660753.633431] kvm_arch_vcpu_load+0x48/0x230 [kvm]
Jun 8 16:06:59 13900HostHel kernel: [6660753.633484] ? vmx_prepare_switch_to_host+0xf7/0x190 [kvm_intel]
Jun 8 16:06:59 13900HostHel kernel: [6660753.633489] kvm_sched_in+0x3d/0x50 [kvm]
Jun 8 16:06:59 13900HostHel kernel: [6660753.633510] finish_task_switch.isra.0+0x17f/0x2b0
Jun 8 16:06:59 13900HostHel kernel: [6660753.633514] __schedule+0x356/0x1740
Jun 8 16:06:59 13900HostHel kernel: [6660753.633518] schedule+0x69/0x110
Jun 8 16:06:59 13900HostHel kernel: [6660753.633520] kvm_vcpu_block+0x70/0x3b0 [kvm]
Jun 8 16:06:59 13900HostHel kernel: [6660753.633542] kvm_arch_vcpu_ioctl_run+0x787/0x1730 [kvm]
Jun 8 16:06:59 13900HostHel kernel: [6660753.633569] ? __wake_up_locked_key+0x1b/0x30
Jun 8 16:06:59 13900HostHel kernel: [6660753.633572] kvm_vcpu_ioctl+0x252/0x6b0 [kvm]
Jun 8 16:06:59 13900HostHel kernel: [6660753.633593] ? vfs_write+0xc8/0x270
Jun 8 16:06:59 13900HostHel kernel: [6660753.633597] ? __fget_files+0x86/0xc0
Jun 8 16:06:59 13900HostHel kernel: [6660753.633600] __x64_sys_ioctl+0x92/0xd0
Jun 8 16:06:59 13900HostHel kernel: [6660753.633603] do_syscall_64+0x59/0xc0
Jun 8 16:06:59 13900HostHel kernel: [6660753.633606] ? syscall_exit_to_user_mode+0x27/0x50
Jun 8 16:06:59 13900HostHel kernel: [6660753.633608] ? do_syscall_64+0x69/0xc0
Jun 8 16:06:59 13900HostHel kernel: [6660753.633609] ? do_syscall_64+0x69/0xc0
Jun 8 16:06:59 13900HostHel kernel: [6660753.633611] ? do_syscall_64+0x69/0xc0
Jun 8 16:06:59 13900HostHel kernel: [6660753.633613] entry_SYSCALL_64_after_hwframe+0x61/0xcb
Jun 8 16:06:59 13900HostHel kernel: [6660753.633616] RIP: 0033:0x7f15a30745f7
Jun 8 16:06:59 13900HostHel kernel: [6660753.633617] Code: 00 00 00 48 8b 05 99 c8 0d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 69 c8 0d 00 f7 d8 64 89 01 48
Jun 8 16:06:59 13900HostHel kernel: [6660753.633620] RSP: 002b:00007f0ec4ff4288 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Jun 8 16:06:59 13900HostHel kernel: [6660753.633622] RAX: ffffffffffffffda RBX: 000000000000ae80 RCX: 00007f15a30745f7
Jun 8 16:06:59 13900HostHel kernel: [6660753.633624] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 000000000000002b
Jun 8 16:06:59 13900HostHel kernel: [6660753.633626] RBP: 0000562db97c2db0 R08: 0000562db822a240 R09: 0000000000000039
Jun 8 16:06:59 13900HostHel kernel: [6660753.633627] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000000
Jun 8 16:06:59 13900HostHel kernel: [6660753.633629] R13: 0000562db8935020 R14: 0000000000000001 R15: 0000000000000000
Jun 8 16:06:59 13900HostHel kernel: [6660753.633631] </TASK>
Jun 8 16:07:00 13900HostHel tailscaled[2288732]: derphttp.Client.Recv: connecting to derp-4 (fra)
Jun 8 19:31:57 13900HostHel systemd-modules-load[1357]: Inserted module 'coretemp'
Jun 8 19:31:57 13900HostHel systemd-pstore[1367]: PStore dmesg-erst-7242314802456952836 moved to /var/lib/systemd/pstore/7242314802456/dmesg-erst-7242314802456952836
Jun 8 19:31:57 13900HostHel systemd-pstore[1367]: PStore dmesg-erst-7242314802456952835 moved to /var/lib/systemd/pstore/7242314802456/dmesg-erst-7242314802456952835
Jun 8 19:31:57 13900HostHel systemd-pstore[1367]: PStore dmesg-erst-7242314802456952834 moved to /var/lib/systemd/pstore/7242314802456/dmesg-erst-7242314802456952834
Jun 8 19:31:57 13900HostHel systemd-pstore[1367]: PStore dmesg-erst-7242314802456952833 moved to /var/lib/systemd/pstore/7242314802456/dmesg-erst-7242314802456952833
Jun 8 19:31:57 13900HostHel systemd-modules-load[1357]: Inserted module 'nct6775'
Jun 8 19:31:57 13900HostHel kernel: [ 0.000000] Linux version 5.15.102-1-pve (build@proxmox) (gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP PVE 5.15.102-1 (2023-03-14T13:48Z) ()
Jun 8 19:31:57 13900HostHel systemd-modules-load[1357]: Inserted module 'iscsi_tcp'
Jun 8 19:31:57 13900HostHel kernel: [ 0.000000] Command line: initrd=\EFI\proxmox\5.15.102-1-pve\initrd.img-5.15.102-1-pve root=ZFS=rpool/ROOT/pve-1 boot=zfs
Jun 8 19:31:57 13900HostHel kernel: [ 0.000000] KERNEL supported cpus:
Jun 8 19:31:57 13900HostHel kernel: [ 0.000000] Intel GenuineIntel
Jun 8 19:31:57 13900HostHel kernel: [ 0.000000] AMD AuthenticAMD
Jun 8 19:31:57 13900HostHel kernel: [ 0.000000] Hygon HygonGenuine
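To pin down the window in which the host logged nothing, a small script along these lines might help. It is only a minimal sketch: the function name `find_log_gaps`, the 30-minute threshold and the default year are my own assumptions (syslog-style timestamps like the ones above carry no year), and the month parsing relies on English month abbreviations in the current locale.

```python
from datetime import datetime, timedelta


def find_log_gaps(lines, min_gap=timedelta(minutes=30), year=2023):
    """Return (previous, next) timestamp pairs that are further apart than min_gap.

    Assumes syslog-style lines starting with "Mon D HH:MM:SS", as in kern.log.
    The year is not part of that format, so it is passed in (hypothetical default).
    """
    gaps = []
    prev = None
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # too short to carry a timestamp
        try:
            ts = datetime.strptime(
                f"{year} {parts[0]} {parts[1]} {parts[2]}", "%Y %b %d %H:%M:%S"
            )
        except ValueError:
            continue  # not a timestamped line, e.g. "[...]"
        if prev is not None and ts - prev > min_gap:
            gaps.append((prev, ts))
        prev = ts
    return gaps


if __name__ == "__main__":
    # Three lines taken from the log excerpt above.
    sample = [
        "Jun 8 16:06:59 13900HostHel kernel: [6660753.633631] </TASK>",
        "Jun 8 16:07:00 13900HostHel tailscaled[2288732]: derphttp.Client.Recv: connecting to derp-4 (fra)",
        "Jun 8 19:31:57 13900HostHel systemd-modules-load[1357]: Inserted module 'coretemp'",
    ]
    for before, after in find_log_gaps(sample):
        print(f"no log entries between {before} and {after} ({after - before})")
```

Run against the full syslog (for example by passing `open("/var/log/syslog")` as `lines`), this should list every silent period longer than the threshold, which makes it easier to see whether the host froze before or after the last oops.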
Info:
uname -a
Linux 13900HostHel 5.15.102-1-pve #1 SMP PVE 5.15.102-1 (2023-03-14T13:48Z) x86_64 GNU/Linux
The machine is a Hetzner EX101 with 128 GB DDR5 ECC RAM and an Intel® i9-13900.
Attached are the full kern.log, the dmidecode output, and the boot excerpt from the syslog.
As a first step, I have installed the kernel update:
uname -a
Linux 13900HostHel 5.15.107-2-pve #1 SMP PVE 5.15.107-2 (2023-05-10T09:10Z) x86_64 GNU/Linux
It is "only" the backup server, but the main server is configured very similarly, hence my question: what can I do to prevent this?
Does this look like a hardware problem, or more like a kernel bug?
If it is a kernel bug, is the opt-in 6.2 kernel recommended for this relatively new Intel CPU?