Random crashes on one of two identical hosts.

Guillaume Soucy

Oct 20, 2017
L'Orignal, Canada
guillaumesoucy.com
Hello,

I have two identical hosts with exactly the same hardware. One of them keeps crashing at random times. I searched this forum and found lots of similar posts, but I don't want to try random fixes, break the system, and make things even worse.

When it crashes, it won't reboot by itself.

The logs show the following kernel errors:

Code:
Feb 26 11:39:49 pve-corp-010-dc kernel: pvestatd[1181]: segfault at 589417000cf3 ip 00005893b26ab752 sp 00007fff6fb3a450 error 4 in perl[5893b25e0000+195000] likely on CPU 2 (core 2, socket 0)
Feb 26 11:39:49 pve-corp-010-dc kernel: Code: 00 48 8b 13 44 8b 78 f8 48 8d 70 f8 44 89 f9 23 4a 18 48 8b 53 10 4c 8b 04 ca 4d 85 c0 0f 84 3e fe ff ff 48 63 48 fc 4d 89 c1 <0f> b6 4c 08 01 eb 13 0f 1f 80 00 00 00 00 4d 8b 09 4d 85 c9 0f 84
Feb 26 11:39:49 pve-corp-010-dc kernel: BUG: kernel NULL pointer dereference, address: 0000000000000620
Feb 26 11:39:49 pve-corp-010-dc kernel: #PF: supervisor read access in kernel mode
Feb 26 11:39:49 pve-corp-010-dc kernel: #PF: error_code(0x0000) - not-present page
Feb 26 11:39:49 pve-corp-010-dc kernel: PGD 0 P4D 0
Feb 26 11:39:49 pve-corp-010-dc kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
Feb 26 11:39:49 pve-corp-010-dc kernel: CPU: 2 PID: 1181 Comm: pvestatd Tainted: P           O       6.8.12-4-pve #1
Feb 26 11:39:49 pve-corp-010-dc kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./B450M Pro4 R2.0, BIOS P5.70 10/20/2022
Feb 26 11:39:49 pve-corp-010-dc kernel: RIP: 0010:folio_lruvec_lock_irqsave+0x4e/0xa0
Feb 26 11:39:49 pve-corp-010-dc kernel: Code: 8b 17 48 c1 ea 36 48 8b 14 d5 e0 e2 37 a0 66 90 48 63 8a 40 9e 02 00 48 85 c0 48 0f 44 05 2a 9e 11 02 48 8b 9c c8 90 08 00 00 <48> 3b 93 20 06 00 00 75 31 48 8d 7b 50 e8 b0 d3 ce 00 49 89 04 24
Feb 26 11:39:49 pve-corp-010-dc kernel: RSP: 0000:ffffa0c940d8f8f8 EFLAGS: 00010286
Feb 26 11:39:49 pve-corp-010-dc kernel: RAX: ffff93d4c3321800 RBX: 0000000000000000 RCX: 0000000000000000
Feb 26 11:39:49 pve-corp-010-dc kernel: RDX: ffff93db7efd5000 RSI: ffffa0c940d8f940 RDI: ffffdbb506197d00
Feb 26 11:39:49 pve-corp-010-dc kernel: RBP: ffffa0c940d8f908 R08: 0000000000000000 R09: 0000000000000000
Feb 26 11:39:49 pve-corp-010-dc kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffffa0c940d8f940
Feb 26 11:39:49 pve-corp-010-dc kernel: R13: ffff93d4d180f000 R14: ffff93d8bada8ee8 R15: 000000000000001b
Feb 26 11:39:49 pve-corp-010-dc kernel: FS:  000078bb33771b80(0000) GS:ffff93db5ff00000(0000) knlGS:0000000000000000
Feb 26 11:39:49 pve-corp-010-dc kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 26 11:39:49 pve-corp-010-dc kernel: CR2: 0000000000000620 CR3: 0000000587236000 CR4: 00000000003506f0
Feb 26 11:39:49 pve-corp-010-dc kernel: Call Trace:
Feb 26 11:39:49 pve-corp-010-dc kernel:  <TASK>
Feb 26 11:39:49 pve-corp-010-dc kernel:  ? show_regs+0x6d/0x80
Feb 26 11:39:49 pve-corp-010-dc kernel:  ? __die+0x24/0x80
Feb 26 11:39:49 pve-corp-010-dc kernel:  ? page_fault_oops+0x176/0x500
Feb 26 11:39:49 pve-corp-010-dc kernel:  ? do_user_addr_fault+0x2ed/0x660
Feb 26 11:39:49 pve-corp-010-dc kernel:  ? exc_page_fault+0x83/0x1b0
Feb 26 11:39:49 pve-corp-010-dc kernel:  ? asm_exc_page_fault+0x27/0x30
Feb 26 11:39:49 pve-corp-010-dc kernel:  ? folio_lruvec_lock_irqsave+0x4e/0xa0
Feb 26 11:39:49 pve-corp-010-dc kernel:  release_pages+0x268/0x4c0
Feb 26 11:39:49 pve-corp-010-dc kernel:  free_pages_and_swap_cache+0x4a/0x60
Feb 26 11:39:49 pve-corp-010-dc kernel:  tlb_batch_pages_flush+0x43/0x80
Feb 26 11:39:49 pve-corp-010-dc kernel:  tlb_flush_mmu+0x3d/0x110
Feb 26 11:39:49 pve-corp-010-dc kernel:  unmap_page_range+0xcf3/0x1170
Feb 26 11:39:49 pve-corp-010-dc kernel:  unmap_single_vma+0x89/0xf0
Feb 26 11:39:49 pve-corp-010-dc kernel:  unmap_vmas+0xb5/0x190
Feb 26 11:39:49 pve-corp-010-dc kernel:  exit_mmap+0x10a/0x3f0
Feb 26 11:39:49 pve-corp-010-dc kernel:  __mmput+0x41/0x140
Feb 26 11:39:49 pve-corp-010-dc kernel:  mmput+0x31/0x40
Feb 26 11:39:49 pve-corp-010-dc kernel:  do_exit+0x324/0xae0
Feb 26 11:39:49 pve-corp-010-dc kernel:  do_group_exit+0x35/0x90
Feb 26 11:39:49 pve-corp-010-dc kernel:  get_signal+0xa8d/0xa90
Feb 26 11:39:49 pve-corp-010-dc kernel:  ? force_sig_info_to_task+0x11b/0x190
Feb 26 11:39:49 pve-corp-010-dc kernel:  arch_do_signal_or_restart+0x42/0x280
Feb 26 11:39:49 pve-corp-010-dc kernel:  irqentry_exit_to_user_mode+0x1fe/0x260
Feb 26 11:39:49 pve-corp-010-dc kernel:  irqentry_exit+0x43/0x50
Feb 26 11:39:49 pve-corp-010-dc kernel:  exc_page_fault+0x94/0x1b0
Feb 26 11:39:49 pve-corp-010-dc kernel:  asm_exc_page_fault+0x27/0x30
Feb 26 11:39:49 pve-corp-010-dc kernel: RIP: 0033:0x5893b26ab752
Feb 26 11:39:49 pve-corp-010-dc kernel: Code: Unable to access opcode bytes at 0x5893b26ab728.
Feb 26 11:39:49 pve-corp-010-dc kernel: RSP: 002b:00007fff6fb3a450 EFLAGS: 00010206
Feb 26 11:39:49 pve-corp-010-dc kernel: RAX: 00005893b4d2a080 RBX: 00005893ba841240 RCX: 00000000622d6c72
Feb 26 11:39:49 pve-corp-010-dc kernel: RDX: 00005893ba7e9d50 RSI: 00005893b4d2a078 RDI: 000000002000000c
Feb 26 11:39:49 pve-corp-010-dc kernel: RBP: 0000000000000400 R08: 00005893ba80bd60 R09: 00005893ba80bd60
Feb 26 11:39:49 pve-corp-010-dc kernel: R10: 00005893b706fbb0 R11: 00005893b28de8e0 R12: 00005893b4d2a080
Feb 26 11:39:49 pve-corp-010-dc kernel: R13: 00005893b49192a0 R14: 00005893ba84b450 R15: 0000000065702f75
Feb 26 11:39:49 pve-corp-010-dc kernel:  </TASK>
Feb 26 11:39:49 pve-corp-010-dc kernel: Modules linked in: veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs nf_tables lockd grace netfs bonding tls softdog sunrpc nfnetlink_log nfnetlink binfmt_misc intel_rapl_msr intel_rapl_common edac_mce_amd kvm_amd snd_hda_codec_hdmi snd_hda_intel amdgpu snd_intel_dspcfg kvm snd_intel_sdw_acpi amdxcp drm_exec irqbypass gpu_sched drm_buddy drm_suballoc_helper drm_ttm_helper crct10dif_pclmul polyval_clmulni polyval_generic snd_hda_codec ghash_clmulni_intel sha256_ssse3 ttm sha1_ssse3 snd_hda_core snd_hwdep aesni_intel snd_pcm drm_display_helper crypto_simd cec snd_timer cryptd rc_core snd i2c_algo_bit rapl soundcore wmi_bmof pcspkr ccp k10temp input_leds mac_hid zfs(PO) spl(O) vhost_net vhost vhost_iotlb tap efi_pstore dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid0 raid1 dm_thin_pool dm_persistent_data dm_bio_prison
Feb 26 11:39:49 pve-corp-010-dc kernel:  dm_bufio libcrc32c hid_generic usbkbd usbhid hid crc32_pclmul xhci_pci r8169 xhci_pci_renesas realtek xhci_hcd i2c_piix4 ahci libahci video wmi gpio_amdpt
Feb 26 11:39:49 pve-corp-010-dc kernel: CR2: 0000000000000620
Feb 26 11:39:49 pve-corp-010-dc kernel: ---[ end trace 0000000000000000 ]---
Feb 26 11:39:49 pve-corp-010-dc kernel: RIP: 0010:folio_lruvec_lock_irqsave+0x4e/0xa0
Feb 26 11:39:49 pve-corp-010-dc kernel: Code: 8b 17 48 c1 ea 36 48 8b 14 d5 e0 e2 37 a0 66 90 48 63 8a 40 9e 02 00 48 85 c0 48 0f 44 05 2a 9e 11 02 48 8b 9c c8 90 08 00 00 <48> 3b 93 20 06 00 00 75 31 48 8d 7b 50 e8 b0 d3 ce 00 49 89 04 24
Feb 26 11:39:49 pve-corp-010-dc kernel: RSP: 0000:ffffa0c940d8f8f8 EFLAGS: 00010286
Feb 26 11:39:49 pve-corp-010-dc kernel: RAX: ffff93d4c3321800 RBX: 0000000000000000 RCX: 0000000000000000
Feb 26 11:39:49 pve-corp-010-dc kernel: RDX: ffff93db7efd5000 RSI: ffffa0c940d8f940 RDI: ffffdbb506197d00
Feb 26 11:39:49 pve-corp-010-dc kernel: RBP: ffffa0c940d8f908 R08: 0000000000000000 R09: 0000000000000000
Feb 26 11:39:49 pve-corp-010-dc kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffffa0c940d8f940
Feb 26 11:39:49 pve-corp-010-dc kernel: R13: ffff93d4d180f000 R14: ffff93d8bada8ee8 R15: 000000000000001b
Feb 26 11:39:49 pve-corp-010-dc kernel: FS:  000078bb33771b80(0000) GS:ffff93db5ff00000(0000) knlGS:0000000000000000
Feb 26 11:39:49 pve-corp-010-dc kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 26 11:39:49 pve-corp-010-dc kernel: CR2: 0000000000000620 CR3: 00000001468be000 CR4: 00000000003506f0
Feb 26 11:39:49 pve-corp-010-dc kernel: note: pvestatd[1181] exited with irqs disabled
Feb 26 11:39:49 pve-corp-010-dc kernel: Fixing recursive fault but reboot is needed!
Feb 26 11:39:49 pve-corp-010-dc kernel: BUG: scheduling while atomic: pvestatd/1181/0x00000000
Feb 26 11:39:49 pve-corp-010-dc kernel: Modules linked in: veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs nf_tables lockd grace netfs bonding tls softdog sunrpc nfnetlink_log nfnetlink binfmt_misc intel_rapl_msr intel_rapl_common edac_mce_amd kvm_amd snd_hda_codec_hdmi snd_hda_intel amdgpu snd_intel_dspcfg kvm snd_intel_sdw_acpi amdxcp drm_exec irqbypass gpu_sched drm_buddy drm_suballoc_helper drm_ttm_helper crct10dif_pclmul polyval_clmulni polyval_generic snd_hda_codec ghash_clmulni_intel sha256_ssse3 ttm sha1_ssse3 snd_hda_core snd_hwdep aesni_intel snd_pcm drm_display_helper crypto_simd cec snd_timer cryptd rc_core snd i2c_algo_bit rapl soundcore wmi_bmof pcspkr ccp k10temp input_leds mac_hid zfs(PO) spl(O) vhost_net vhost vhost_iotlb tap efi_pstore dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid0 raid1 dm_thin_pool dm_persistent_data dm_bio_prison
Feb 26 11:39:49 pve-corp-010-dc kernel:  dm_bufio libcrc32c hid_generic usbkbd usbhid hid crc32_pclmul xhci_pci r8169 xhci_pci_renesas realtek xhci_hcd i2c_piix4 ahci libahci video wmi gpio_amdpt
Feb 26 11:39:49 pve-corp-010-dc kernel: CPU: 2 PID: 1181 Comm: pvestatd Tainted: P      D    O       6.8.12-4-pve #1
Feb 26 11:39:49 pve-corp-010-dc kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./B450M Pro4 R2.0, BIOS P5.70 10/20/2022
Feb 26 11:39:49 pve-corp-010-dc kernel: Call Trace:
Feb 26 11:39:49 pve-corp-010-dc kernel:  <TASK>
Feb 26 11:39:49 pve-corp-010-dc kernel:  dump_stack_lvl+0x76/0xa0
Feb 26 11:39:49 pve-corp-010-dc kernel:  dump_stack+0x10/0x20
Feb 26 11:39:49 pve-corp-010-dc kernel:  __schedule_bug+0x64/0x80
Feb 26 11:39:49 pve-corp-010-dc kernel:  __schedule+0x10f1/0x15e0
Feb 26 11:39:49 pve-corp-010-dc kernel:  ? vprintk+0x42/0x80
Feb 26 11:39:49 pve-corp-010-dc kernel:  ? _printk+0x60/0x90
Feb 26 11:39:49 pve-corp-010-dc kernel:  do_task_dead+0x44/0x50
Feb 26 11:39:49 pve-corp-010-dc kernel:  make_task_dead+0x14c/0x170
Feb 26 11:39:49 pve-corp-010-dc kernel:  rewind_stack_and_make_dead+0x17/0x20
Feb 26 11:39:49 pve-corp-010-dc kernel: RIP: 0033:0x5893b26ab752
Feb 26 11:39:49 pve-corp-010-dc kernel: Code: Unable to access opcode bytes at 0x5893b26ab728.
Feb 26 11:39:49 pve-corp-010-dc kernel: RSP: 002b:00007fff6fb3a450 EFLAGS: 00010206
Feb 26 11:39:49 pve-corp-010-dc kernel: RAX: 00005893b4d2a080 RBX: 00005893ba841240 RCX: 00000000622d6c72
Feb 26 11:39:49 pve-corp-010-dc kernel: RDX: 00005893ba7e9d50 RSI: 00005893b4d2a078 RDI: 000000002000000c
Feb 26 11:39:49 pve-corp-010-dc kernel: RBP: 0000000000000400 R08: 00005893ba80bd60 R09: 00005893ba80bd60
Feb 26 11:39:49 pve-corp-010-dc kernel: R10: 00005893b706fbb0 R11: 00005893b28de8e0 R12: 00005893b4d2a080
Feb 26 11:39:49 pve-corp-010-dc kernel: R13: 00005893b49192a0 R14: 00005893ba84b450 R15: 0000000065702f75
Feb 26 11:39:49 pve-corp-010-dc kernel:  </TASK>

Also, the host hasn't crashed yet this time, but the web GUI is showing "?":

Screenshot at 2025-02-26 16-46-57.png

What could be wrong with this host? Its twin is working just fine.

Guillaume