Hi all,
I have a brand-new 3-node Proxmox cluster with full-mesh Ceph storage.
Specs per host are:
24 x 13th Gen Intel(R) Core(TM) i7-13700 (1 Socket)
128 GiB RAM
2 x 1 Gbps NIC for network traffic.
2 x 10 Gbps NIC for Ceph traffic.
2 x 4TB SSD for storage.
1 x 2TB NVMe SSD for OS
On host 1, when I put it under load by booting or cloning VMs, the host sometimes crashes and shuts down, and I have to power it back on manually.
Looking at the logs for host 1, I find the following:
May 14 16:49:14 AVHOST01 kernel: BUG: unable to handle page fault for address: ffff8d90d6f42210
May 14 16:49:14 AVHOST01 kernel: #PF: supervisor write access in kernel mode
May 14 16:49:14 AVHOST01 kernel: #PF: error_code(0x0002) - not-present page
May 14 16:49:14 AVHOST01 kernel: PGD a38601067 P4D a38601067 PUD 0
May 14 16:49:14 AVHOST01 kernel: Oops: 0002 [#1] SMP NOPTI
May 14 16:49:14 AVHOST01 kernel: CPU: 2 PID: 121889 Comm: kvm Tainted: P O 5.15.102-1-pve #1
May 14 16:49:14 AVHOST01 kernel: Hardware name: Gigabyte Technology Co., Ltd. B760 DS3H AX DDR4/B760 DS3H AX DDR4, BIOS F1 10/03/2022
May 14 16:49:14 AVHOST01 kernel: RIP: 0010:remove_wait_queue+0x29/0x50
May 14 16:49:14 AVHOST01 kernel: Code: 00 0f 1f 44 00 00 55 48 89 e5 41 54 49 89 fc 53 48 89 f3 e8 39 d3 c7 00 48 8b 53 18 4c 89 e7 48 89 c6 48 8b 43 20 48 89 42 08 <48> 89 10 48 b8 00 01 00 00 00 00 ad de 48 89 43 18 48 83 c0 22 48
May 14 16:49:14 AVHOST01 kernel: RSP: 0018:ffff9b374afa79f8 EFLAGS: 00010046
May 14 16:49:14 AVHOST01 kernel: RAX: ffff8d90d6f42210 RBX: ffff8db1097886e0 RCX: 0000000000000000
May 14 16:49:14 AVHOST01 kernel: RDX: ffff8db0d6f42210 RSI: 0000000000000282 RDI: ffff8db0d6f42208
May 14 16:49:14 AVHOST01 kernel: RBP: ffff9b374afa7a08 R08: 0000000000000003 R09: ffff8dafd36c43a0
May 14 16:49:14 AVHOST01 kernel: R10: ffff9b3741b33c28 R11: 0000000000000000 R12: ffff8db0d6f42208
May 14 16:49:14 AVHOST01 kernel: R13: ffff8db109788000 R14: ffff9b374afa7be8 R15: 0000000000000009
May 14 16:49:14 AVHOST01 kernel: FS: 00007fd4fda40200(0000) GS:ffff8dceff680000(0000) knlGS:0000000000000000
May 14 16:49:14 AVHOST01 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 14 16:49:14 AVHOST01 kernel: CR2: ffff8d90d6f42210 CR3: 0000000235234002 CR4: 0000000000772ee0
May 14 16:49:14 AVHOST01 kernel: PKRU: 55555554
May 14 16:49:14 AVHOST01 kernel: Call Trace:
May 14 16:49:14 AVHOST01 kernel: <TASK>
May 14 16:49:14 AVHOST01 kernel: poll_freewait+0x6f/0xb0
May 14 16:49:14 AVHOST01 kernel: do_sys_poll+0x56e/0x690
May 14 16:49:14 AVHOST01 kernel: ? __pollwait+0xe0/0xe0
May 14 16:49:14 AVHOST01 kernel: ? __pollwait+0xe0/0xe0
May 14 16:49:14 AVHOST01 kernel: ? __pollwait+0xe0/0xe0
May 14 16:49:14 AVHOST01 kernel: ? __pollwait+0xe0/0xe0
May 14 16:49:14 AVHOST01 kernel: ? __pollwait+0xe0/0xe0
May 14 16:49:14 AVHOST01 kernel: ? __pollwait+0xe0/0xe0
May 14 16:49:14 AVHOST01 kernel: ? __pollwait+0xe0/0xe0
May 14 16:49:14 AVHOST01 kernel: ? __pollwait+0xe0/0xe0
May 14 16:49:14 AVHOST01 kernel: ? __pollwait+0xe0/0xe0
May 14 16:49:14 AVHOST01 kernel: __x64_sys_ppoll+0xbc/0x150
May 14 16:49:14 AVHOST01 kernel: do_syscall_64+0x59/0xc0
May 14 16:49:14 AVHOST01 kernel: ? syscall_exit_to_user_mode+0x27/0x50
May 14 16:49:14 AVHOST01 kernel: ? do_syscall_64+0x69/0xc0
May 14 16:49:14 AVHOST01 kernel: ? do_syscall_64+0x69/0xc0
May 14 16:49:14 AVHOST01 kernel: ? do_syscall_64+0x69/0xc0
May 14 16:49:14 AVHOST01 kernel: ? sysvec_apic_timer_interrupt+0x4e/0x90
May 14 16:49:14 AVHOST01 kernel: entry_SYSCALL_64_after_hwframe+0x61/0xcb
May 14 16:49:14 AVHOST01 kernel: RIP: 0033:0x7fd5003cee26
May 14 16:49:14 AVHOST01 kernel: Code: 7c 24 08 e8 7c 0f f9 ff 4c 8b 54 24 18 48 8b 74 24 10 41 b8 08 00 00 00 41 89 c1 48 8b 7c 24 08 4c 89 e2 b8 0f 01 00 00 0f 05 <48> 3d 00 f0 ff ff 77 2a 44 89 cf 89 44 24 08 e8 a6 0f f9 ff 8b 44
May 14 16:49:14 AVHOST01 kernel: RSP: 002b:00007ffec0464ef0 EFLAGS: 00000293 ORIG_RAX: 000000000000010f
May 14 16:49:14 AVHOST01 kernel: RAX: ffffffffffffffda RBX: 00005612e12956b0 RCX: 00007fd5003cee26
May 14 16:49:14 AVHOST01 kernel: RDX: 00007ffec0464f10 RSI: 0000000000000049 RDI: 00005612e1d031f0
May 14 16:49:14 AVHOST01 kernel: RBP: 00007ffec0464f7c R08: 0000000000000008 R09: 0000000000000000
May 14 16:49:14 AVHOST01 kernel: R10: 0000000000000000 R11: 0000000000000293 R12: 00007ffec0464f10
May 14 16:49:14 AVHOST01 kernel: R13: 00005612e12956b0 R14: 00007ffec0464f80 R15: 0000000000000000
May 14 16:49:14 AVHOST01 kernel: </TASK>
May 14 16:49:14 AVHOST01 kernel: Modules linked in: veth snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter bpfilter sctp ip6_udp_tunnel udp_tunnel nf_tables nfnetlink_cttimeout bonding tls openvswitch nsh nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 softdog nfnetlink_log nfnetlink i915 ttm drm_kms_helper cec intel_rapl_msr rc_core intel_rapl_common i2c_algo_bit fb_sys_fops syscopyarea sysfillrect sysimgblt x86_pkg_temp_thermal intel_powerclamp snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi btusb mt7921e snd_hda_codec coretemp btrtl mt76_connac_lib btbcm mt76 snd_hda_core btintel snd_hwdep bluetooth kvm_intel snd_pcm ecdh_generic mei_hdcp ecc mac80211 kvm irqbypass snd_timer crct10dif_pclmul snd cfg80211 ghash_clmulni_intel aesni_intel gigabyte_wmi crypto_simd cryptd wmi_bmof pcspkr efi_pstore libarc4 soundcore ov01a1s mei_me power_ctrl_logic mei v4l2_fwnode
May 14 16:49:14 AVHOST01 kernel: v4l2_async videodev intel_hid mc acpi_pad sparse_keymap acpi_tad zfs(PO) zunicode(PO) mac_hid zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) vhost_net vhost vhost_iotlb tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi drm sunrpc ip_tables x_tables autofs4 btrfs blake2b_generic xor zstd_compress raid6_pq simplefb hid_generic usbhid hid dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c crc32_pclmul nvme atlantic xhci_pci ahci xhci_pci_renesas macsec r8169 libahci nvme_core xhci_hcd realtek wmi video
May 14 16:49:14 AVHOST01 kernel: CR2: ffff8d90d6f42210
May 14 16:49:14 AVHOST01 kernel: ---[ end trace 361f80607622d2a5 ]---
May 14 16:49:14 AVHOST01 kernel: RIP: 0010:remove_wait_queue+0x29/0x50
May 14 16:49:14 AVHOST01 kernel: Code: 00 0f 1f 44 00 00 55 48 89 e5 41 54 49 89 fc 53 48 89 f3 e8 39 d3 c7 00 48 8b 53 18 4c 89 e7 48 89 c6 48 8b 43 20 48 89 42 08 <48> 89 10 48 b8 00 01 00 00 00 00 ad de 48 89 43 18 48 83 c0 22 48
May 14 16:49:14 AVHOST01 kernel: RSP: 0018:ffff9b374afa79f8 EFLAGS: 00010046
May 14 16:49:14 AVHOST01 kernel: RAX: ffff8d90d6f42210 RBX: ffff8db1097886e0 RCX: 0000000000000000
May 14 16:49:14 AVHOST01 kernel: RDX: ffff8db0d6f42210 RSI: 0000000000000282 RDI: ffff8db0d6f42208
May 14 16:49:14 AVHOST01 kernel: RBP: ffff9b374afa7a08 R08: 0000000000000003 R09: ffff8dafd36c43a0
May 14 16:49:14 AVHOST01 kernel: R10: ffff9b3741b33c28 R11: 0000000000000000 R12: ffff8db0d6f42208
May 14 16:49:14 AVHOST01 kernel: R13: ffff8db109788000 R14: ffff9b374afa7be8 R15: 0000000000000009
May 14 16:49:14 AVHOST01 kernel: FS: 00007fd4fda40200(0000) GS:ffff8dceff680000(0000) knlGS:0000000000000000
May 14 16:49:14 AVHOST01 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 14 16:49:14 AVHOST01 kernel: CR2: ffff8d90d6f42210 CR3: 0000000235234002 CR4: 0000000000772ee0
May 14 16:49:14 AVHOST01 kernel: PKRU: 55555554
May 14 16:49:16 AVHOST01 pvedaemon[122639]: VM 101 qmp command failed - VM 101 not running
-- Reboot --
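For anyone reading the trace, here is a small sketch of how the page-fault error code in the "#PF: error_code(0x0002)" line breaks down. The bit meanings are the standard x86 ones; the function name and table are just my own illustration, not anything from Proxmox or the kernel. It only decodes the low five bits and says nothing about the root cause.

```python
# Minimal sketch: decode the low bits of an x86 page-fault error code.
# PF_BITS and decode_pf_error_code are illustrative names, not a kernel API.
PF_BITS = [
    # (mask, meaning if bit is set, meaning if bit is clear)
    (0x01, "protection violation", "not-present page"),
    (0x02, "write access", "read access"),
    (0x04, "user mode", "supervisor (kernel) mode"),
    (0x08, "reserved bit set in page table", None),
    (0x10, "instruction fetch", None),
]


def decode_pf_error_code(code):
    """Return a list of human-readable flags for a page-fault error code."""
    flags = []
    for mask, if_set, if_clear in PF_BITS:
        text = if_set if code & mask else if_clear
        if text is not None:
            flags.append(text)
    return flags


# Value taken from the log above.
print(decode_pf_error_code(0x0002))
# → ['not-present page', 'write access', 'supervisor (kernel) mode']
```

That matches the two "#PF:" lines in the log: a kernel-mode write to an address that has no page mapped.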
Any thoughts?