Inexplicable unknown/"gray question mark" crash state of node in cluster with ceph

lewinernst

Member
Jul 31, 2021
Hi everyone

For a couple of months, one of my nodes has randomly entered the gray question mark state anywhere between 5 minutes and two days after booting. After it happens, VM management (shutdown/reboot) becomes unresponsive, but the guests keep functioning for a few more hours, and so does the local Ceph OSD. Sometimes the host console also remains available.

I have swapped the RAM and removed every single PCIe device in the system, but the behaviour still occurs. Testing 4 different BIOS versions and multiple kernels did not change it either. The syslog shows the following around the time of the "crash":

Code:
Jun 16 17:13:47 aspvendin kernel: BUG: kernel NULL pointer dereference, address: 0000000000000030
Jun 16 17:13:47 aspvendin kernel: #PF: supervisor read access in kernel mode
Jun 16 17:13:47 aspvendin kernel: #PF: error_code(0x0000) - not-present page
Jun 16 17:13:47 aspvendin kernel: PGD 0 P4D 0
Jun 16 17:13:47 aspvendin kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
Jun 16 17:13:47 aspvendin kernel: CPU: 9 PID: 171 Comm: ksmd Tainted: P           O       6.8.4-3-pve #1
Jun 16 17:13:47 aspvendin kernel: Hardware name: HP HP Z2 SFF G9 Workstation Desktop PC/895D, BIOS U50 Ver. 03.01.03 02/22/2024
Jun 16 17:13:47 aspvendin kernel: RIP: 0010:get_ksm_page+0x32/0x2b0
Jun 16 17:13:47 aspvendin kernel: Code: e5 41 57 41 56 49 89 fe 41 55 41 54 49 89 fc 53 49 83 cc 03 48 83 ec 08 89 75 d4 eb 0d 49 8b 46 30 49 39 c5 0f 84 29 01 00 00 <4d> 8b 6e 30 4c 89 eb 48 c1 e3 06 48 03 1d 44 9c 66 01 48 8b 43 18
Jun 16 17:13:47 aspvendin kernel: RSP: 0018:ffffaa834072bdb0 EFLAGS: 00010282
Jun 16 17:13:47 aspvendin kernel: RAX: 0000762c7eeaf000 RBX: ffff8fd410162080 RCX: 0000000000000002
Jun 16 17:13:47 aspvendin kernel: RDX: 0000762c7eeaf000 RSI: 0000000000000001 RDI: 0000000000000000
Jun 16 17:13:47 aspvendin kernel: RBP: ffffaa834072bde0 R08: 0000000000000001 R09: 0000000000000000
Jun 16 17:13:47 aspvendin kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000003
Jun 16 17:13:47 aspvendin kernel: R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
Jun 16 17:13:47 aspvendin kernel: FS:  0000000000000000(0000) GS:ffff8fd47f080000(0000) knlGS:0000000000000000
Jun 16 17:13:47 aspvendin kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 16 17:13:47 aspvendin kernel: CR2: 0000000000000030 CR3: 000000100f236000 CR4: 0000000000f52ef0
Jun 16 17:13:47 aspvendin kernel: PKRU: 55555554
Jun 16 17:13:47 aspvendin kernel: Call Trace:
Jun 16 17:13:47 aspvendin kernel:  <TASK>
Jun 16 17:13:47 aspvendin kernel:  ? show_regs+0x6d/0x80
Jun 16 17:13:47 aspvendin kernel:  ? __die+0x24/0x80
Jun 16 17:13:47 aspvendin kernel:  ? page_fault_oops+0x176/0x500
Jun 16 17:13:47 aspvendin kernel:  ? do_user_addr_fault+0x2f9/0x6b0
Jun 16 17:13:47 aspvendin kernel:  ? exc_page_fault+0x83/0x1b0
Jun 16 17:13:47 aspvendin kernel:  ? asm_exc_page_fault+0x27/0x30
Jun 16 17:13:47 aspvendin kernel:  ? get_ksm_page+0x32/0x2b0
Jun 16 17:13:47 aspvendin kernel:  remove_rmap_item_from_tree+0x74/0x1d0
Jun 16 17:13:47 aspvendin kernel:  ksm_scan_thread+0x824/0x2300
Jun 16 17:13:47 aspvendin kernel:  ? __pfx_ksm_scan_thread+0x10/0x10
Jun 16 17:13:47 aspvendin kernel:  kthread+0xef/0x120
Jun 16 17:13:47 aspvendin kernel:  ? __pfx_kthread+0x10/0x10
Jun 16 17:13:47 aspvendin kernel:  ret_from_fork+0x44/0x70
Jun 16 17:13:47 aspvendin kernel:  ? __pfx_kthread+0x10/0x10
Jun 16 17:13:47 aspvendin kernel:  ret_from_fork_asm+0x1b/0x30
Jun 16 17:13:47 aspvendin kernel:  </TASK>
Jun 16 17:13:47 aspvendin kernel: Modules linked in: rbd ceph libceph netfs act_police cls_basic sch_ingress sch_htb veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter sctp ip6_udp_tunnel udp_tunnel nf_tables scsi_transport_iscsi nvme_fabrics bonding tls softdog sunrpc nfnetlink_log nfnetlink binfmt_misc intel_rapl_msr intel_rapl_common snd_hda_codec_realtek xe snd_sof_pci_intel_tgl snd_hda_codec_generic snd_sof_intel_hda_common intel_uncore_frequency soundwire_intel intel_uncore_frequency_common snd_sof_intel_hda_mlink intel_pmc_core soundwire_cadence intel_vsec snd_sof_intel_hda pmt_telemetry snd_sof_pci pmt_class snd_sof_xtensa_dsp drm_suballoc_helper snd_sof snd_sof_utils snd_soc_hdac_hda snd_hda_ext_core snd_soc_acpi_intel_match snd_soc_acpi soundwire_generic_allocation soundwire_bus snd_soc_core snd_hda_codec_hdmi x86_pkg_temp_thermal intel_powerclamp snd_compress ac97_bus snd_pcm_dmaengine kvm_intel snd_hda_intel i915 nouveau snd_intel_dspcfg snd_intel_sdw_acpi kvm snd_hda_codec
Jun 16 17:13:47 aspvendin kernel:  crct10dif_pclmul mxm_wmi irdma snd_hda_core drm_gpuvm polyval_clmulni polyval_generic drm_exec ghash_clmulni_intel gpu_sched snd_hwdep sha256_ssse3 snd_pcm sha1_ssse3 drm_buddy drm_ttm_helper i40e aesni_intel ttm snd_timer crypto_simd cryptd drm_display_helper ib_uverbs cmdlinepart snd rapl spi_nor cec hp_wmi ucsi_ccg ucsi_acpi mei_me soundcore ib_core rc_core sparse_keymap intel_cstate pcspkr typec_ucsi mtd serio_raw platform_profile mei i2c_algo_bit wmi_bmof typec acpi_tad acpi_pad mac_hid vhost_net vhost vhost_iotlb tap nct6775_core hwmon_vid coretemp vfio_pci vfio_pci_core irqbypass vfio_iommu_type1 vfio iommufd efi_pstore dmi_sysfs ip_tables x_tables autofs4 zfs(PO) spl(O) btrfs blake2b_generic xor raid6_pq libcrc32c cdc_ncm cdc_ether usbnet r8152 mii uas usb_storage xhci_pci nvme xhci_pci_renesas ice video crc32_pclmul e1000e nvme_core psmouse ahci spi_intel_pci xhci_hcd gnss i2c_i801 spi_intel libahci i2c_smbus i2c_nvidia_gpu nvme_auth i2c_ccgx_ucsi wmi
Jun 16 17:13:47 aspvendin kernel: CR2: 0000000000000030
Jun 16 17:13:47 aspvendin kernel: ---[ end trace 0000000000000000 ]---
Jun 16 17:13:47 aspvendin kernel: RIP: 0010:get_ksm_page+0x32/0x2b0
Jun 16 17:13:47 aspvendin kernel: Code: e5 41 57 41 56 49 89 fe 41 55 41 54 49 89 fc 53 49 83 cc 03 48 83 ec 08 89 75 d4 eb 0d 49 8b 46 30 49 39 c5 0f 84 29 01 00 00 <4d> 8b 6e 30 4c 89 eb 48 c1 e3 06 48 03 1d 44 9c 66 01 48 8b 43 18
Jun 16 17:13:47 aspvendin kernel: RSP: 0018:ffffaa834072bdb0 EFLAGS: 00010282
Jun 16 17:13:47 aspvendin kernel: RAX: 0000762c7eeaf000 RBX: ffff8fd410162080 RCX: 0000000000000002
Jun 16 17:13:47 aspvendin kernel: RDX: 0000762c7eeaf000 RSI: 0000000000000001 RDI: 0000000000000000
Jun 16 17:13:47 aspvendin kernel: RBP: ffffaa834072bde0 R08: 0000000000000001 R09: 0000000000000000
Jun 16 17:13:47 aspvendin kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000003
Jun 16 17:13:47 aspvendin kernel: R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
Jun 16 17:13:47 aspvendin kernel: FS:  0000000000000000(0000) GS:ffff8fd47f080000(0000) knlGS:0000000000000000
Jun 16 17:13:47 aspvendin kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 16 17:13:47 aspvendin kernel: CR2: 0000000000000030 CR3: 00000008a8aae000 CR4: 0000000000f52ef0
Jun 16 17:13:47 aspvendin kernel: PKRU: 55555554
Jun 16 17:13:47 aspvendin kernel: note: ksmd[171] exited with irqs disabled


At this point I don't know what else to try. Any pointers?
 
Hello,
I have exactly the same problem. Maybe this is a bug in the kernel? I have a Hetzner root server, and datacenter support has already replaced the whole machine. I have also reinstalled Proxmox twice.
Sometimes this happens when I try to reboot, and sometimes it happens randomly. As you said, it can happen 5 minutes after a reboot or 8 hours later.
I have an EX44 with an i5-13500.
Here is my log:

Code:
Jun 17 06:00:05 panigale kernel: BUG: kernel NULL pointer dereference, address: 0000000000000008
Jun 17 06:00:05 panigale kernel: #PF: supervisor write access in kernel mode
Jun 17 06:00:05 panigale kernel: #PF: error_code(0x0002) - not-present page
Jun 17 06:00:05 panigale kernel: PGD 0 P4D 0
Jun 17 06:00:05 panigale kernel: Oops: 0002 [#1] PREEMPT SMP NOPTI
Jun 17 06:00:05 panigale kernel: CPU: 1 PID: 1268 Comm: kvm Not tainted 6.8.4-3-pve #1
Jun 17 06:00:05 panigale kernel: Hardware name: ASUS System Product Name/PRIME B760M-A D4, BIOS 9006 02/20/2023
Jun 17 06:00:05 panigale kernel: RIP: 0010:blk_flush_complete_seq+0x291/0x2d0
Jun 17 06:00:05 panigale kernel: Code: 0f b6 f6 49 8d 56 01 49 c1 e6 04 4d 01 ee 48 c1 e2 04 49 8b 4e 10 4c 01 ea 48 39 ca 74 2b 48 8b 4b 50 48 8b 7b 48 48 8d 73 48 <48> 89 4f 08 48 89 39 49 8b 4e 18 49 89 76 18 48 89 53 48 48 89 4b
Jun 17 06:00:05 panigale kernel: RSP: 0018:ffffafcc00f0ba20 EFLAGS: 00010046
Jun 17 06:00:05 panigale kernel: RAX: 0000000000000000 RBX: ffff9fe559b88000 RCX: ffff9fe559b88048
Jun 17 06:00:05 panigale kernel: RDX: ffff9fe558d1ea10 RSI: ffff9fe559b88048 RDI: 0000000000000000
Jun 17 06:00:05 panigale kernel: RBP: ffffafcc00f0ba60 R08: 0000000000000000 R09: 0000000000000000
Jun 17 06:00:05 panigale kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000029801
Jun 17 06:00:05 panigale kernel: R13: ffff9fe558d1ea00 R14: ffff9fe558d1ea00 R15: ffff9fe55b2d3250
Jun 17 06:00:05 panigale kernel: FS:  000079220d6006c0(0000) GS:ffff9ff47ec80000(0000) knlGS:0000000000000000
Jun 17 06:00:05 panigale kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 17 06:00:05 panigale kernel: CR2: 0000000000000008 CR3: 00000001285dc000 CR4: 0000000000f52ef0
Jun 17 06:00:05 panigale kernel: PKRU: 55555554
Jun 17 06:00:05 panigale kernel: Call Trace:
Jun 17 06:00:05 panigale kernel:  <TASK>
Jun 17 06:00:05 panigale kernel:  ? show_regs+0x6d/0x80
Jun 17 06:00:05 panigale kernel:  ? __die+0x24/0x80
Jun 17 06:00:05 panigale kernel:  ? page_fault_oops+0x176/0x500
Jun 17 06:00:05 panigale kernel:  ? md_submit_bio+0x63/0xb0
Jun 17 06:00:05 panigale kernel:  ? do_user_addr_fault+0x2f9/0x6b0
Jun 17 06:00:05 panigale kernel:  ? exc_page_fault+0x83/0x1b0
Jun 17 06:00:05 panigale kernel:  ? asm_exc_page_fault+0x27/0x30
Jun 17 06:00:05 panigale kernel:  ? blk_flush_complete_seq+0x291/0x2d0
Jun 17 06:00:05 panigale kernel:  ? __blk_mq_alloc_requests+0x3e7/0x450
Jun 17 06:00:05 panigale kernel:  ? wbt_wait+0x33/0x100
Jun 17 06:00:05 panigale kernel:  blk_insert_flush+0xce/0x220
Jun 17 06:00:05 panigale kernel:  blk_mq_submit_bio+0x641/0x750
Jun 17 06:00:05 panigale kernel:  __submit_bio+0xb3/0x1c0
Jun 17 06:00:05 panigale kernel:  submit_bio_noacct_nocheck+0x2b7/0x390
Jun 17 06:00:05 panigale kernel:  submit_bio_noacct+0x1f3/0x650
Jun 17 06:00:05 panigale kernel:  ? ext4_file_write_iter+0x380/0x7e0
Jun 17 06:00:05 panigale kernel:  submit_bio+0xb2/0x110
Jun 17 06:00:05 panigale kernel:  md_super_write+0xcf/0x110
Jun 17 06:00:05 panigale kernel:  write_sb_page+0x148/0x300
Jun 17 06:00:05 panigale kernel:  filemap_write_page+0x5b/0x70
Jun 17 06:00:05 panigale kernel:  md_bitmap_unplug+0x99/0x200
Jun 17 06:00:05 panigale kernel:  flush_bio_list+0x108/0x110 [raid1]
Jun 17 06:00:05 panigale kernel:  raid1_unplug+0x3c/0xf0 [raid1]
Jun 17 06:00:05 panigale kernel:  __blk_flush_plug+0xbe/0x130
Jun 17 06:00:05 panigale kernel:  blk_finish_plug+0x31/0x50
Jun 17 06:00:05 panigale kernel:  io_submit_sqes+0x549/0x680
Jun 17 06:00:05 panigale kernel:  __do_sys_io_uring_enter+0x57c/0xbf0
Jun 17 06:00:05 panigale kernel:  ? syscall_exit_to_user_mode+0x86/0x260
Jun 17 06:00:05 panigale kernel:  __x64_sys_io_uring_enter+0x22/0x40
Jun 17 06:00:05 panigale kernel:  x64_sys_call+0x20b9/0x24b0
Jun 17 06:00:05 panigale kernel:  do_syscall_64+0x81/0x170
Jun 17 06:00:05 panigale kernel:  ? do_syscall_64+0x8d/0x170
Jun 17 06:00:05 panigale kernel:  ? do_syscall_64+0x8d/0x170
Jun 17 06:00:05 panigale kernel:  ? syscall_exit_to_user_mode+0x86/0x260
Jun 17 06:00:05 panigale kernel:  ? do_syscall_64+0x8d/0x170
Jun 17 06:00:05 panigale kernel:  ? do_syscall_64+0x8d/0x170
Jun 17 06:00:05 panigale kernel:  ? common_interrupt+0x54/0xb0
Jun 17 06:00:05 panigale kernel:  entry_SYSCALL_64_after_hwframe+0x78/0x80
Jun 17 06:00:05 panigale kernel: RIP: 0033:0x79221a8b7b95
Jun 17 06:00:05 panigale kernel: Code: 00 00 00 44 89 d0 41 b9 08 00 00 00 83 c8 10 f6 87 d0 00 00 00 01 8b bf cc 00 00 00 44 0f 45 d0 45 31 c0 b8 aa 01 00 00 0f 05 <c3> 66 2e 0f 1f 84 00 00 00 00 00 41 83 e2 02 74 c2 f0 48 83 0c 24
Jun 17 06:00:05 panigale kernel: RSP: 002b:000079220d5fafa8 EFLAGS: 00000246 ORIG_RAX: 00000000000001aa
Jun 17 06:00:05 panigale kernel: RAX: ffffffffffffffda RBX: 000055b8d72d6190 RCX: 000079221a8b7b95
Jun 17 06:00:05 panigale kernel: RDX: 0000000000000000 RSI: 0000000000000003 RDI: 0000000000000036
Jun 17 06:00:05 panigale kernel: RBP: 000055b8d72d6198 R08: 0000000000000000 R09: 0000000000000008
Jun 17 06:00:05 panigale kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 000055b8d72d6280
Jun 17 06:00:05 panigale kernel: R13: 0000000000000001 R14: 000055b8d7186f88 R15: 0000000000000000
Jun 17 06:00:05 panigale kernel:  </TASK>
Jun 17 06:00:05 panigale kernel: Modules linked in: nf_conntrack_netlink xt_tcpudp xt_conntrack nft_chain_nat xfrm_user xfrm_algo xt_addrtype nft_compat overlay cmac nls_utf8 cifs cifs_arc4 nls_ucs2_utils rdma_cm iw_cm ib_cm ib_core cifs_md4 netfs cfg80211 veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter nf_tables xt_MASQUERADE iptable_nat xt_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_multiport bonding tls sunrpc nfnetlink_log binfmt_misc nfnetlink xe drm_gpuvm drm_exec gpu_sched drm_suballoc_helper intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common intel_tcc_cooling x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel i915 kvm irqbypass drm_buddy rapl cmdlinepart drm_display_helper spi_nor intel_cstate cec eeepc_wmi wmi_bmof mtd rc_core ee1004 i2c_algo_bit intel_pmc_core intel_vsec pmt_telemetry pmt_class joydev acpi_pad acpi_tad input_leds serio_raw mac_hid vhost_net vhost vhost_iotlb tap efi_pstore dmi_sysfs ip_tables x_tables
Jun 17 06:00:05 panigale kernel:  autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 bochs drm_vram_helper drm_ttm_helper ttm hid_generic usbkbd usbmouse usbhid hid raid1 mfd_aaeon asus_wmi crct10dif_pclmul crc32_pclmul ledtrig_audio polyval_clmulni sparse_keymap polyval_generic platform_profile xhci_pci ghash_clmulni_intel xhci_pci_renesas sha256_ssse3 nvme intel_lpss_pci spi_intel_pci sha1_ssse3 psmouse xhci_hcd r8169 spi_intel intel_lpss i2c_i801 nvme_core ahci realtek i2c_smbus video idma64 libahci nvme_auth wmi pinctrl_alderlake aesni_intel crypto_simd cryptd
Jun 17 06:00:05 panigale kernel: CR2: 0000000000000008
Jun 17 06:00:05 panigale kernel: ---[ end trace 0000000000000000 ]---
Jun 17 06:00:05 panigale kernel: RIP: 0010:blk_flush_complete_seq+0x291/0x2d0
Jun 17 06:00:05 panigale kernel: Code: 0f b6 f6 49 8d 56 01 49 c1 e6 04 4d 01 ee 48 c1 e2 04 49 8b 4e 10 4c 01 ea 48 39 ca 74 2b 48 8b 4b 50 48 8b 7b 48 48 8d 73 48 <48> 89 4f 08 48 89 39 49 8b 4e 18 49 89 76 18 48 89 53 48 48 89 4b
Jun 17 06:00:05 panigale kernel: RSP: 0018:ffffafcc00f0ba20 EFLAGS: 00010046
Jun 17 06:00:05 panigale kernel: RAX: 0000000000000000 RBX: ffff9fe559b88000 RCX: ffff9fe559b88048
Jun 17 06:00:05 panigale kernel: RDX: ffff9fe558d1ea10 RSI: ffff9fe559b88048 RDI: 0000000000000000
Jun 17 06:00:05 panigale kernel: RBP: ffffafcc00f0ba60 R08: 0000000000000000 R09: 0000000000000000
Jun 17 06:00:05 panigale kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000029801
Jun 17 06:00:05 panigale kernel: R13: ffff9fe558d1ea00 R14: ffff9fe558d1ea00 R15: ffff9fe55b2d3250
Jun 17 06:00:05 panigale kernel: FS:  000079220d6006c0(0000) GS:ffff9ff47ec80000(0000) knlGS:0000000000000000
Jun 17 06:00:05 panigale kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 17 06:00:05 panigale kernel: CR2: 0000000000000008 CR3: 00000001285dc000 CR4: 0000000000f52ef0
Jun 17 06:00:05 panigale kernel: PKRU: 55555554
Jun 17 06:00:05 panigale kernel: note: kvm[1268] exited with irqs disabled
Jun 17 06:00:05 panigale kernel: note: kvm[1268] exited with preempt_count 1
Jun 17 06:00:05 panigale kernel: ------------[ cut here ]------------
Jun 17 06:00:05 panigale kernel: WARNING: CPU: 1 PID: 1268 at kernel/exit.c:820 do_exit+0x8dd/0xae0
Jun 17 06:00:05 panigale kernel: Modules linked in: nf_conntrack_netlink xt_tcpudp xt_conntrack nft_chain_nat xfrm_user xfrm_algo xt_addrtype nft_compat overlay cmac nls_utf8 cifs cifs_arc4 nls_ucs2_utils rdma_cm iw_cm ib_cm ib_core cifs_md4 netfs cfg80211 veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter nf_tables xt_MASQUERADE iptable_nat xt_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_multiport bonding tls sunrpc nfnetlink_log binfmt_misc nfnetlink xe drm_gpuvm drm_exec gpu_sched drm_suballoc_helper intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common intel_tcc_cooling x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel i915 kvm irqbypass drm_buddy rapl cmdlinepart drm_display_helper spi_nor intel_cstate cec eeepc_wmi wmi_bmof mtd rc_core ee1004 i2c_algo_bit intel_pmc_core intel_vsec pmt_telemetry pmt_class joydev acpi_pad acpi_tad input_leds serio_raw mac_hid vhost_net vhost vhost_iotlb tap efi_pstore dmi_sysfs ip_tables x_tables
Jun 17 06:00:05 panigale kernel:  autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 bochs drm_vram_helper drm_ttm_helper ttm hid_generic usbkbd usbmouse usbhid hid raid1 mfd_aaeon asus_wmi crct10dif_pclmul crc32_pclmul ledtrig_audio polyval_clmulni sparse_keymap polyval_generic platform_profile xhci_pci ghash_clmulni_intel xhci_pci_renesas sha256_ssse3 nvme intel_lpss_pci spi_intel_pci sha1_ssse3 psmouse xhci_hcd r8169 spi_intel intel_lpss i2c_i801 nvme_core ahci realtek i2c_smbus video idma64 libahci nvme_auth wmi pinctrl_alderlake aesni_intel crypto_simd cryptd
Jun 17 06:00:05 panigale kernel: CPU: 1 PID: 1268 Comm: kvm Tainted: G      D            6.8.4-3-pve #1
Jun 17 06:00:05 panigale kernel: Hardware name: ASUS System Product Name/PRIME B760M-A D4, BIOS 9006 02/20/2023
Jun 17 06:00:05 panigale kernel: RIP: 0010:do_exit+0x8dd/0xae0
Jun 17 06:00:05 panigale kernel: Code: e9 42 f8 ff ff 48 8b bb e0 09 00 00 31 f6 e8 9a e0 ff ff e9 ee fd ff ff 4c 89 ee bf 05 06 00 00 e8 08 3a 01 00 e9 6e f8 ff ff <0f> 0b e9 9c f7 ff ff 0f 0b e9 55 f7 ff ff 48 89 df e8 0d 2f 14 00
Jun 17 06:00:05 panigale kernel: RSP: 0018:ffffafcc00f0bec8 EFLAGS: 00010282
Jun 17 06:00:05 panigale kernel: RAX: 0000000000000000 RBX: ffff9fe54d325200 RCX: 0000000000000000
Jun 17 06:00:05 panigale kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
Jun 17 06:00:05 panigale kernel: RBP: ffffafcc00f0bf20 R08: 0000000000000000 R09: 0000000000000000
Jun 17 06:00:05 panigale kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff9fe549c93f00
Jun 17 06:00:05 panigale kernel: R13: 0000000000000009 R14: ffff9fe553c8f380 R15: 0000000000000000
Jun 17 06:00:05 panigale kernel: FS:  000079220d6006c0(0000) GS:ffff9ff47ec80000(0000) knlGS:0000000000000000
Jun 17 06:00:05 panigale kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 17 06:00:05 panigale kernel: CR2: 0000000000000008 CR3: 00000001285dc000 CR4: 0000000000f52ef0
Jun 17 06:00:05 panigale kernel: PKRU: 55555554
Jun 17 06:00:05 panigale kernel: Call Trace:
Jun 17 06:00:05 panigale kernel:  <TASK>
Jun 17 06:00:05 panigale kernel:  ? show_regs+0x6d/0x80
Jun 17 06:00:05 panigale kernel:  ? __warn+0x89/0x160
Jun 17 06:00:05 panigale kernel:  ? do_exit+0x8dd/0xae0
Jun 17 06:00:05 panigale kernel:  ? report_bug+0x17e/0x1b0
Jun 17 06:00:05 panigale kernel:  ? handle_bug+0x46/0x90
Jun 17 06:00:05 panigale kernel:  ? exc_invalid_op+0x18/0x80
Jun 17 06:00:05 panigale kernel:  ? asm_exc_invalid_op+0x1b/0x20
Jun 17 06:00:05 panigale kernel:  ? do_exit+0x8dd/0xae0
Jun 17 06:00:05 panigale kernel:  ? do_exit+0x72/0xae0
Jun 17 06:00:05 panigale kernel:  ? _printk+0x60/0x90
Jun 17 06:00:05 panigale kernel:  make_task_dead+0x83/0x170
Jun 17 06:00:05 panigale kernel:  rewind_stack_and_make_dead+0x17/0x20
Jun 17 06:00:05 panigale kernel: RIP: 0033:0x79221a8b7b95
Jun 17 06:00:05 panigale kernel: Code: 00 00 00 44 89 d0 41 b9 08 00 00 00 83 c8 10 f6 87 d0 00 00 00 01 8b bf cc 00 00 00 44 0f 45 d0 45 31 c0 b8 aa 01 00 00 0f 05 <c3> 66 2e 0f 1f 84 00 00 00 00 00 41 83 e2 02 74 c2 f0 48 83 0c 24
Jun 17 06:00:05 panigale kernel: RSP: 002b:000079220d5fafa8 EFLAGS: 00000246 ORIG_RAX: 00000000000001aa
Jun 17 06:00:05 panigale kernel: RAX: ffffffffffffffda RBX: 000055b8d72d6190 RCX: 000079221a8b7b95
Jun 17 06:00:05 panigale kernel: RDX: 0000000000000000 RSI: 0000000000000003 RDI: 0000000000000036
Jun 17 06:00:05 panigale kernel: RBP: 000055b8d72d6198 R08: 0000000000000000 R09: 0000000000000008
Jun 17 06:00:05 panigale kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 000055b8d72d6280
Jun 17 06:00:05 panigale kernel: R13: 0000000000000001 R14: 000055b8d7186f88 R15: 0000000000000000
Jun 17 06:00:05 panigale kernel:  </TASK>
Jun 17 06:00:05 panigale kernel: ---[ end trace 0000000000000000 ]---
Jun 17 06:00:15 panigale pvestatd[1220]: VM 5001 qmp command failed - VM 5001 qmp command 'query-proxmox-support' failed - got timeout
Jun 17 06:00:22 panigale pvestatd[1220]: status update time (15.403 seconds)
Jun 17 06:00:30 panigale pvestatd[1220]: VM 5001 qmp command failed - VM 5001 qmp command 'query-proxmox-support' failed - unable to connect to VM 5001 qmp socket - timeout after 51 retries
Jun 17 06:00:38 panigale pvestatd[1220]: status update time (15.420 seconds)
Jun 17 06:00:46 panigale pvestatd[1220]: VM 5001 qmp command failed - VM 5001 qmp command 'query-proxmox-support' failed - unable to connect to VM 5001 qmp socket - timeout after 51 retries
 
Hi,

The two kernel NULL pointer dereferences reported here look similar at first, but are most likely different problems. NULL pointer dereferences can happen in different code paths within the kernel, so in order to tell whether two NULL pointer dereferences are the same problem, we have to look at the details.

The NULL pointer dereference reported by @Ksdmg has RIP (the instruction pointer) pointing to blk_flush_complete_seq ...
Code:
Jun 17 06:00:05 panigale kernel: BUG: kernel NULL pointer dereference, address: 0000000000000008
...
Jun 17 06:00:05 panigale kernel: RIP: 0010:blk_flush_complete_seq+0x291/0x2d0
...
... which is due to a bug in early 6.8 kernels that should be fixed in proxmox-kernel-6.8 6.8.8-1 and higher. See [1] for more information.
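To check quickly whether a host is still on a 6.8 kernel older than the fixed version named above, a comparison along these lines might help. It is only a sketch: it assumes the running kernel release (as printed by `uname -r`) has the form `6.8.4-3-pve` and that its numbers line up with the package version `6.8.8-1`. The names `parse_release` and `may_have_flush_bug` are made up for this example.

```python
import re

# Assumption: the kernel release string looks like "<major>.<minor>.<patch>-<rev>-pve",
# e.g. "6.8.4-3-pve", and maps directly onto the package version.
RELEASE_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)-(\d+)-pve$")

# First version named in this thread as containing the fix.
FIXED_IN = (6, 8, 8, 1)


def parse_release(release: str) -> tuple:
    """Turn '6.8.4-3-pve' into (6, 8, 4, 3)."""
    match = RELEASE_RE.match(release.strip())
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return tuple(int(part) for part in match.groups())


def may_have_flush_bug(release: str) -> bool:
    """True only for 6.8 kernels older than FIXED_IN; other series are not judged."""
    version = parse_release(release)
    return version[:2] == (6, 8) and version < FIXED_IN
```

On a live host the input could come from `platform.release()` in the standard library. A `False` result only means the release is not an early 6.8 kernel; it says nothing about other bugs.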

The NULL pointer dereference reported by @lewinernst however has RIP pointing to get_ksm_page ...
Code:
Jun 16 17:13:47 aspvendin kernel: BUG: kernel NULL pointer dereference, address: 0000000000000030
...
Jun 16 17:13:47 aspvendin kernel: RIP: 0010:get_ksm_page+0x32/0x2b0
...
which means it's most likely a different issue. I see a crash in get_ksm_page was already reported at [3]. My initial guess would be faulty RAM, but you mention you have replaced it already (maybe still run a memtest86+ to be sure?).
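The comparison described above (same kind of BUG line, different RIP) can also be done mechanically when there are many logs to go through. The following is a minimal sketch, not an official tool: `oops_signature` is a hypothetical helper, and the regular expressions only cover the line formats that appear in the two logs posted in this thread.

```python
import re

# Kernel-mode faulting location, e.g. "RIP: 0010:get_ksm_page+0x32/0x2b0".
# User-mode lines such as "RIP: 0033:0x79221a8b7b95" have no symbol and do not match.
RIP_RE = re.compile(r"RIP: 0010:([A-Za-z_][\w.]*)\+(0x[0-9a-f]+)/0x[0-9a-f]+")
ADDR_RE = re.compile(r"BUG: kernel NULL pointer dereference, address: ([0-9a-f]+)")


def oops_signature(log_text: str):
    """Return (symbol, offset, fault address) for the first oops in log_text, or None."""
    rip = RIP_RE.search(log_text)
    addr = ADDR_RE.search(log_text)
    if rip is None or addr is None:
        return None
    return rip.group(1), rip.group(2), int(addr.group(1), 16)


# Lines copied from the two reports in this thread.
first = """\
Jun 16 17:13:47 aspvendin kernel: BUG: kernel NULL pointer dereference, address: 0000000000000030
Jun 16 17:13:47 aspvendin kernel: RIP: 0010:get_ksm_page+0x32/0x2b0
"""
second = """\
Jun 17 06:00:05 panigale kernel: BUG: kernel NULL pointer dereference, address: 0000000000000008
Jun 17 06:00:05 panigale kernel: RIP: 0010:blk_flush_complete_seq+0x291/0x2d0
"""

sig_a = oops_signature(first)
sig_b = oops_signature(second)
same_site = sig_a is not None and sig_b is not None and sig_a[0] == sig_b[0]
```

Here `same_site` comes out false because the two faulting symbols differ, which matches the reading above. A matching symbol would still only be a hint that two crashes share a cause, not proof.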

Can you check whether you also encounter the crash with kernel 6.5 and kernel 6.8.8-2?

If this is an Intel host, one difference between kernels 6.5 and 6.8 is that intel_iommu now defaults to on [4, under "Kernel: intel_iommu now defaults to on"]. If you do not see the crash on 6.5 but see it on 6.8.8-2, can you check whether setting intel_iommu=off on 6.8.8-2 helps? [3]
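Before and after changing the parameter, it can be worth confirming what the running kernel was actually booted with. The sketch below reads the kernel command line from `/proc/cmdline` and reports the `intel_iommu=` value, if any. The helper names are invented for this example, and it assumes that when the parameter appears more than once, the last occurrence is the one that takes effect.

```python
def intel_iommu_setting(cmdline: str):
    """Return the value of the last intel_iommu= token, or None if it is not set.

    None means the kernel's built-in default applies.
    """
    value = None
    for token in cmdline.split():
        if token.startswith("intel_iommu="):
            # Assumption: a later occurrence overrides an earlier one.
            value = token.split("=", 1)[1]
    return value


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the command line the running kernel was booted with."""
    with open(path, encoding="ascii", errors="replace") as handle:
        return handle.read()
```

On the host, `intel_iommu_setting(read_cmdline())` should return `"off"` after a reboot with the parameter applied. If it returns `None`, the parameter did not reach the kernel, for example because the bootloader configuration was edited but not refreshed.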

The call trace mentions KSM [2], so disabling it might be a workaround (if you don't need it), but it's also possible that the crash then resurfaces at another call site.
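To verify that KSM is really off after following [2], the kernel's state can be read from sysfs. This is a rough sketch that only reads values and changes nothing. It assumes the standard KSM sysfs files `run`, `pages_shared` and `pages_sharing` under `/sys/kernel/mm/ksm`, and `ksm_status` is a hypothetical helper name.

```python
from pathlib import Path

KSM_DIR = Path("/sys/kernel/mm/ksm")

# Meaning of the "run" file as described in the kernel's KSM documentation.
RUN_STATES = {
    0: "stopped (already merged pages stay merged)",
    1: "running",
    2: "stopped, all merged pages unmerged",
}


def ksm_status(ksm_dir: Path = KSM_DIR) -> dict:
    """Read the current KSM state; the directory is a parameter so it can be tested."""
    run = int((ksm_dir / "run").read_text().strip())
    return {
        "run": run,
        "state": RUN_STATES.get(run, "unknown"),
        "pages_shared": int((ksm_dir / "pages_shared").read_text().strip()),
        "pages_sharing": int((ksm_dir / "pages_sharing").read_text().strip()),
    }
```

If `run` is 1, or `pages_shared` stays above zero, the scanner thread may still be working on merged pages, so a crash in the KSM code path would not yet be ruled out.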

If you try any of the above and encounter another crash, please provide the complete message.

[1] https://forum.proxmox.com/threads/random-6-8-4-2-pve-kernel-crashes.145760/post-674842
[2] https://pve.proxmox.com/wiki/Kernel_Samepage_Merging_(KSM)
[3] https://forum.proxmox.com/threads/ideas-about-general-protection-fault.148773/#post-673069
 
Hi, thanks for the elaborate feedback. I have tried all of the 6.8 versions and have also reverted to 6.5.13-5-pve, and it still happens. Since you mentioned IOMMU: I noticed that if I never start VMs that use PCIe passthrough (I have SATA controllers and network cards passed through, and sometimes the GPU), the crashes happen on neither major kernel version (I ran both for 24 hours). 6.5 was actually the kernel update that broke GPU PCIe passthrough for me (it had been running fine since 5.x, then I got error 43 in Windows), but in 6.8 that error is gone. I have run memtest86+ extensively on both RAM kits, so I am afraid that's not it. Maybe it was one of the kernel bugs and, because I now often have to force a reset, some files on disk are corrupted. Or would that manifest differently?

This is the last crash activity as far as I can tell; this time it crashed without the same error. I am not exactly sure when it went into "grey mode", since I was away and only noticed when I got home, but I am about 98% sure this is the time. There were too many errors within 5 minutes to post here, so here is the pastebin:
https://pastebin.com/xkbNBECC
 
This "get_ksm_page dereference" issue has happened to me 3 times now.

The first time it happened, I upgraded to Linux 6.8.12-1-pve, but it has happened twice on this version since.
 
I've disabled KSM now to test whether it's a bug in KSM (likely) or whether the problem just migrates to somewhere else in the kernel.
 
