Hi everyone
For a couple of months, one of my nodes has been randomly entering the gray question mark state, anywhere between 5 minutes and two days after booting. Once that happens, VM management (shutdown/reboot) becomes unresponsive, but the guests keep running for a few more hours, and so does the local Ceph OSD. Sometimes the host console also remains available (
I have swapped the RAM and removed every single PCIe device in the system, but the behaviour still occurs. Testing 4 different BIOS versions and multiple kernels did not change it either. The syslog shows the following around the time of the "crash":
Code:
Jun 16 17:13:47 aspvendin kernel: BUG: kernel NULL pointer dereference, address: 0000000000000030
Jun 16 17:13:47 aspvendin kernel: #PF: supervisor read access in kernel mode
Jun 16 17:13:47 aspvendin kernel: #PF: error_code(0x0000) - not-present page
Jun 16 17:13:47 aspvendin kernel: PGD 0 P4D 0
Jun 16 17:13:47 aspvendin kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
Jun 16 17:13:47 aspvendin kernel: CPU: 9 PID: 171 Comm: ksmd Tainted: P O 6.8.4-3-pve #1
Jun 16 17:13:47 aspvendin kernel: Hardware name: HP HP Z2 SFF G9 Workstation Desktop PC/895D, BIOS U50 Ver. 03.01.03 02/22/2024
Jun 16 17:13:47 aspvendin kernel: RIP: 0010:get_ksm_page+0x32/0x2b0
Jun 16 17:13:47 aspvendin kernel: Code: e5 41 57 41 56 49 89 fe 41 55 41 54 49 89 fc 53 49 83 cc 03 48 83 ec 08 89 75 d4 eb 0d 49 8b 46 30 49 39 c5 0f 84 29 01 00 00 <4d> 8b 6e 30 4c 89 eb 48 c1 e3 06 48 03 1d 44 9c 66 01 48 8b 43 18
Jun 16 17:13:47 aspvendin kernel: RSP: 0018:ffffaa834072bdb0 EFLAGS: 00010282
Jun 16 17:13:47 aspvendin kernel: RAX: 0000762c7eeaf000 RBX: ffff8fd410162080 RCX: 0000000000000002
Jun 16 17:13:47 aspvendin kernel: RDX: 0000762c7eeaf000 RSI: 0000000000000001 RDI: 0000000000000000
Jun 16 17:13:47 aspvendin kernel: RBP: ffffaa834072bde0 R08: 0000000000000001 R09: 0000000000000000
Jun 16 17:13:47 aspvendin kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000003
Jun 16 17:13:47 aspvendin kernel: R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
Jun 16 17:13:47 aspvendin kernel: FS: 0000000000000000(0000) GS:ffff8fd47f080000(0000) knlGS:0000000000000000
Jun 16 17:13:47 aspvendin kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 16 17:13:47 aspvendin kernel: CR2: 0000000000000030 CR3: 000000100f236000 CR4: 0000000000f52ef0
Jun 16 17:13:47 aspvendin kernel: PKRU: 55555554
Jun 16 17:13:47 aspvendin kernel: Call Trace:
Jun 16 17:13:47 aspvendin kernel: <TASK>
Jun 16 17:13:47 aspvendin kernel: ? show_regs+0x6d/0x80
Jun 16 17:13:47 aspvendin kernel: ? __die+0x24/0x80
Jun 16 17:13:47 aspvendin kernel: ? page_fault_oops+0x176/0x500
Jun 16 17:13:47 aspvendin kernel: ? do_user_addr_fault+0x2f9/0x6b0
Jun 16 17:13:47 aspvendin kernel: ? exc_page_fault+0x83/0x1b0
Jun 16 17:13:47 aspvendin kernel: ? asm_exc_page_fault+0x27/0x30
Jun 16 17:13:47 aspvendin kernel: ? get_ksm_page+0x32/0x2b0
Jun 16 17:13:47 aspvendin kernel: remove_rmap_item_from_tree+0x74/0x1d0
Jun 16 17:13:47 aspvendin kernel: ksm_scan_thread+0x824/0x2300
Jun 16 17:13:47 aspvendin kernel: ? __pfx_ksm_scan_thread+0x10/0x10
Jun 16 17:13:47 aspvendin kernel: kthread+0xef/0x120
Jun 16 17:13:47 aspvendin kernel: ? __pfx_kthread+0x10/0x10
Jun 16 17:13:47 aspvendin kernel: ret_from_fork+0x44/0x70
Jun 16 17:13:47 aspvendin kernel: ? __pfx_kthread+0x10/0x10
Jun 16 17:13:47 aspvendin kernel: ret_from_fork_asm+0x1b/0x30
Jun 16 17:13:47 aspvendin kernel: </TASK>
Jun 16 17:13:47 aspvendin kernel: Modules linked in: rbd ceph libceph netfs act_police cls_basic sch_ingress sch_htb veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter sctp ip6_udp_tunnel udp_tunnel nf_tables scsi_transport_iscsi nvme_fabrics bonding tls softdog sunrpc nfnetlink_log nfnetlink binfmt_misc intel_rapl_msr intel_rapl_common snd_hda_codec_realtek xe snd_sof_pci_intel_tgl snd_hda_codec_generic snd_sof_intel_hda_common intel_uncore_frequency soundwire_intel intel_uncore_frequency_common snd_sof_intel_hda_mlink intel_pmc_core soundwire_cadence intel_vsec snd_sof_intel_hda pmt_telemetry snd_sof_pci pmt_class snd_sof_xtensa_dsp drm_suballoc_helper snd_sof snd_sof_utils snd_soc_hdac_hda snd_hda_ext_core snd_soc_acpi_intel_match snd_soc_acpi soundwire_generic_allocation soundwire_bus snd_soc_core snd_hda_codec_hdmi x86_pkg_temp_thermal intel_powerclamp snd_compress ac97_bus snd_pcm_dmaengine kvm_intel snd_hda_intel i915 nouveau snd_intel_dspcfg snd_intel_sdw_acpi kvm snd_hda_codec
Jun 16 17:13:47 aspvendin kernel: crct10dif_pclmul mxm_wmi irdma snd_hda_core drm_gpuvm polyval_clmulni polyval_generic drm_exec ghash_clmulni_intel gpu_sched snd_hwdep sha256_ssse3 snd_pcm sha1_ssse3 drm_buddy drm_ttm_helper i40e aesni_intel ttm snd_timer crypto_simd cryptd drm_display_helper ib_uverbs cmdlinepart snd rapl spi_nor cec hp_wmi ucsi_ccg ucsi_acpi mei_me soundcore ib_core rc_core sparse_keymap intel_cstate pcspkr typec_ucsi mtd serio_raw platform_profile mei i2c_algo_bit wmi_bmof typec acpi_tad acpi_pad mac_hid vhost_net vhost vhost_iotlb tap nct6775_core hwmon_vid coretemp vfio_pci vfio_pci_core irqbypass vfio_iommu_type1 vfio iommufd efi_pstore dmi_sysfs ip_tables x_tables autofs4 zfs(PO) spl(O) btrfs blake2b_generic xor raid6_pq libcrc32c cdc_ncm cdc_ether usbnet r8152 mii uas usb_storage xhci_pci nvme xhci_pci_renesas ice video crc32_pclmul e1000e nvme_core psmouse ahci spi_intel_pci xhci_hcd gnss i2c_i801 spi_intel libahci i2c_smbus i2c_nvidia_gpu nvme_auth i2c_ccgx_ucsi wmi
Jun 16 17:13:47 aspvendin kernel: CR2: 0000000000000030
Jun 16 17:13:47 aspvendin kernel: ---[ end trace 0000000000000000 ]---
Jun 16 17:13:47 aspvendin kernel: RIP: 0010:get_ksm_page+0x32/0x2b0
Jun 16 17:13:47 aspvendin kernel: Code: e5 41 57 41 56 49 89 fe 41 55 41 54 49 89 fc 53 49 83 cc 03 48 83 ec 08 89 75 d4 eb 0d 49 8b 46 30 49 39 c5 0f 84 29 01 00 00 <4d> 8b 6e 30 4c 89 eb 48 c1 e3 06 48 03 1d 44 9c 66 01 48 8b 43 18
Jun 16 17:13:47 aspvendin kernel: RSP: 0018:ffffaa834072bdb0 EFLAGS: 00010282
Jun 16 17:13:47 aspvendin kernel: RAX: 0000762c7eeaf000 RBX: ffff8fd410162080 RCX: 0000000000000002
Jun 16 17:13:47 aspvendin kernel: RDX: 0000762c7eeaf000 RSI: 0000000000000001 RDI: 0000000000000000
Jun 16 17:13:47 aspvendin kernel: RBP: ffffaa834072bde0 R08: 0000000000000001 R09: 0000000000000000
Jun 16 17:13:47 aspvendin kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000003
Jun 16 17:13:47 aspvendin kernel: R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
Jun 16 17:13:47 aspvendin kernel: FS: 0000000000000000(0000) GS:ffff8fd47f080000(0000) knlGS:0000000000000000
Jun 16 17:13:47 aspvendin kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 16 17:13:47 aspvendin kernel: CR2: 0000000000000030 CR3: 00000008a8aae000 CR4: 0000000000f52ef0
Jun 16 17:13:47 aspvendin kernel: PKRU: 55555554
Jun 16 17:13:47 aspvendin kernel: note: ksmd[171] exited with irqs disabled
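In case it helps anyone compare this against their own logs: the trace above boils down to the ksmd kernel thread faulting in get_ksm_page on 6.8.4-3-pve. Below is a minimal sketch, not an official tool, that pulls those fields out of syslog lines. The helper name `summarize_oops` and both regular expressions are my own assumptions, based only on the line formats shown in this log.

```python
import re

# Hypothetical helper, not part of any Proxmox or kernel tooling.
# The patterns assume the "CPU: .. PID: .. Comm: .." and "RIP: ..:func+off/len"
# line formats exactly as they appear in the log above.
COMM_RE = re.compile(r"CPU: (\d+) PID: (\d+) Comm: (\S+) .*?(\d+\.\d+\.\d+\S*) #")
RIP_RE = re.compile(r"RIP: \d+:(\w+)\+(0x[0-9a-f]+)/(0x[0-9a-f]+)")


def summarize_oops(lines):
    """Return the first faulting task and RIP symbol found in the given lines."""
    summary = {}
    for line in lines:
        m = COMM_RE.search(line)
        if m and "task" not in summary:
            summary.update(
                cpu=int(m.group(1)),
                pid=int(m.group(2)),
                task=m.group(3),
                kernel=m.group(4),
            )
        m = RIP_RE.search(line)
        if m and "function" not in summary:
            summary.update(function=m.group(1), offset=m.group(2))
    return summary


# Two lines copied from the log above.
sample = [
    "Jun 16 17:13:47 aspvendin kernel: CPU: 9 PID: 171 Comm: ksmd Tainted: P O 6.8.4-3-pve #1",
    "Jun 16 17:13:47 aspvendin kernel: RIP: 0010:get_ksm_page+0x32/0x2b0",
]

if __name__ == "__main__":
    print(summarize_oops(sample))
```

Feeding it the lines of a saved journal export (for example `open("oops.log").read().splitlines()`, where the file name is just a placeholder) gives one small dict per crash, which makes it easier to see whether every occurrence hits the same task and function.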
At this point I don't know what else to try. Any pointers?