PVE crashing / hanging with very high load average

Hello Fabian,
I was not sure about the SSD I was using to run PVE, so I re-installed Proxmox on a different drive. Since then the system seems to run without crashing. The only thing I've noticed is that pvestatd keeps erroring out and I have to run "systemctl restart pvestatd", but the system itself seems fine. Here is one error I've noticed:

Jan 28 21:26:33 home-pve pvestatd[462598]: qemu status update error: unable to find configuration file for VM 100 on node 'home-pve'

I noticed another error, but lost it. Will update if I see it again. Thoughts?
 
What do you make of these?
Code:
Jan 28 22:18:19 home-pve kernel: BUG: unable to handle page fault for address: fffffff797578d70
Jan 28 22:18:19 home-pve kernel: #PF: supervisor instruction fetch in kernel mode
Jan 28 22:18:19 home-pve kernel: #PF: error_code(0x0010) - not-present page
Jan 28 22:18:19 home-pve kernel: PGD 130f43b067 P4D 130f43b067 PUD 0
Jan 28 22:18:19 home-pve kernel: Oops: 0010 [#1] PREEMPT SMP NOPTI
Jan 28 22:18:19 home-pve kernel: CPU: 10 PID: 564 Comm: jbd2/dm-4-8 Tainted: P           O       6.8.12-7-pve #1
Jan 28 22:18:19 home-pve kernel: Hardware name: MicroElectronics B944/PRIME Z790-V AX, BIOS 1645 03/15/2024
Jan 28 22:18:19 home-pve kernel: RIP: 0010:0xfffffff797578d70
Jan 28 22:18:19 home-pve kernel: Code: Unable to access opcode bytes at 0xfffffff797578d46.
Jan 28 22:18:19 home-pve kernel: RSP: 0018:ffffb3e185f4bac8 EFLAGS: 00010282
Jan 28 22:18:19 home-pve kernel: RAX: fffffff797578d70 RBX: ffff994bd69be600 RCX: 0000000000000000
Jan 28 22:18:19 home-pve kernel: RDX: 0000000000000000 RSI: 000000000000000f RDI: ffff994be0ab8980
Jan 28 22:18:19 home-pve kernel: RBP: ffffb3e185f4bb68 R08: 0000000000000000 R09: ffff994bd9731c00
Jan 28 22:18:19 home-pve kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff994be0ab8980
Jan 28 22:18:19 home-pve kernel: R13: ffffb3e185f4bb18 R14: 0000000000000000 R15: ffff994bdb68d640
Jan 28 22:18:19 home-pve kernel: FS:  0000000000000000(0000) GS:ffff996abf300000(0000) knlGS:0000000000000000
Jan 28 22:18:19 home-pve kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 28 22:18:19 home-pve kernel: CR2: fffffff797578d46 CR3: 0000000135db2006 CR4: 0000000000f72ef0
Jan 28 22:18:19 home-pve kernel: PKRU: 55555554
Jan 28 22:18:19 home-pve kernel: Call Trace:
Jan 28 22:18:19 home-pve kernel:  <TASK>
Jan 28 22:18:19 home-pve kernel:  ? show_regs+0x6d/0x80
Jan 28 22:18:19 home-pve kernel:  ? __die+0x24/0x80
Jan 28 22:18:19 home-pve kernel:  ? page_fault_oops+0x176/0x500
Jan 28 22:18:19 home-pve kernel:  ? kernelmode_fixup_or_oops.constprop.0+0x69/0x90
Jan 28 22:18:19 home-pve kernel:  ? __bad_area_nosemaphore+0x19d/0x270
Jan 28 22:18:19 home-pve kernel:  ? bad_area_nosemaphore+0x16/0x30
Jan 28 22:18:19 home-pve kernel:  ? do_kern_addr_fault+0x7b/0xa0
Jan 28 22:18:19 home-pve kernel:  ? exc_page_fault+0x10d/0x1b0
Jan 28 22:18:19 home-pve kernel:  ? asm_exc_page_fault+0x27/0x30
Jan 28 22:18:19 home-pve kernel:  ? __blk_mq_sched_dispatch_requests+0x3bb/0x5d0
Jan 28 22:18:19 home-pve kernel:  blk_mq_sched_dispatch_requests+0x2c/0x70
Jan 28 22:18:19 home-pve kernel:  blk_mq_run_hw_queue+0x1bf/0x210
Jan 28 22:18:19 home-pve kernel:  blk_mq_flush_plug_list.part.0+0x187/0x5c0
Jan 28 22:18:19 home-pve kernel:  blk_mq_flush_plug_list+0x19/0x30
Jan 28 22:18:19 home-pve kernel:  __blk_flush_plug+0xdf/0x130
Jan 28 22:18:19 home-pve kernel:  ? submit_bio+0xb2/0x110
Jan 28 22:18:19 home-pve kernel:  blk_finish_plug+0x31/0x50
Jan 28 22:18:19 home-pve kernel:  jbd2_journal_commit_transaction+0x103d/0x1b00
Jan 28 22:18:19 home-pve kernel:  kjournald2+0xab/0x280
Jan 28 22:18:19 home-pve kernel:  ? __pfx_autoremove_wake_function+0x10/0x10
Jan 28 22:18:19 home-pve kernel:  ? __pfx_kjournald2+0x10/0x10
Jan 28 22:18:19 home-pve kernel:  kthread+0xef/0x120
Jan 28 22:18:19 home-pve kernel:  ? __pfx_kthread+0x10/0x10
Jan 28 22:18:19 home-pve kernel:  ret_from_fork+0x44/0x70
Jan 28 22:18:19 home-pve kernel:  ? __pfx_kthread+0x10/0x10
Jan 28 22:18:19 home-pve kernel:  ret_from_fork_asm+0x1b/0x30
Jan 28 22:18:19 home-pve kernel:  </TASK>
Jan 28 22:18:19 home-pve kernel: Modules linked in: veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter nf_tables bonding tls softdog sunrpc nfnetlink_log nfnetlink binfmt_misc snd_sof_pci_intel_tgl snd_sof_intel_hda_common soundwire_intel snd_sof_intel_hda_mlink soundwire_cadence snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_sof_utils snd_soc_hdac_hda snd_hda_ext_core snd_soc_acpi_intel_match nouveau snd_soc_acpi soundwire_generic_allocation intel_rapl_msr intel_rapl_common soundwire_bus mxm_wmi drm_gpuvm intel_uncore_frequency intel_uncore_frequency_common drm_exec intel_tcc_cooling gpu_sched snd_soc_core x86_pkg_temp_thermal rtw89_8851be intel_powerclamp snd_compress coretemp ac97_bus drm_ttm_helper kvm_intel snd_hda_codec_realtek ttm kvm rtw89_8851b drm_display_helper snd_pcm_dmaengine snd_hda_codec_hdmi btusb cec btrtl btintel snd_hda_codec_generic irqbypass snd_hda_intel snd_intel_dspcfg btbcm snd_intel_sdw_acpi btmtk snd_hda_codec rtw89_pci bluetooth rc_core
Jan 28 22:18:19 home-pve kernel:  rtw89_core crct10dif_pclmul polyval_clmulni snd_hda_core ecdh_generic ecc i2c_algo_bit polyval_generic joydev input_leds mac80211 snd_hwdep ghash_clmulni_intel snd_pcm cmdlinepart cfg80211 snd_timer spi_nor mei_pxp mei_hdcp sha256_ssse3 snd sha1_ssse3 mtd mei_me libarc4 soundcore aesni_intel mei crypto_simd cryptd rapl intel_cstate mac_hid pcspkr eeepc_wmi intel_pmc_core intel_vsec pmt_telemetry pmt_class wmi_bmof acpi_pad acpi_tad zfs(PO) spl(O) vhost_net vhost vhost_iotlb tap efi_pstore dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic xor raid6_pq dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c usbkbd hid_generic usbhid hid nvme ahci nvme_core libahci nvme_auth mfd_aaeon asus_wmi xhci_pci ledtrig_audio xhci_pci_renesas sparse_keymap platform_profile r8169 crc32_pclmul xhci_hcd i2c_i801 spi_intel_pci intel_lpss_pci realtek i2c_smbus spi_intel intel_lpss vmd idma64 video wmi pinctrl_alderlake
Jan 28 22:18:19 home-pve kernel: CR2: fffffff797578d70
Jan 28 22:18:19 home-pve kernel: ---[ end trace 0000000000000000 ]---
Jan 28 22:18:19 home-pve kernel: RIP: 0010:0xfffffff797578d70
Jan 28 22:18:19 home-pve kernel: Code: Unable to access opcode bytes at 0xfffffff797578d46.
Jan 28 22:18:19 home-pve kernel: RSP: 0018:ffffb3e185f4bac8 EFLAGS: 00010282
Jan 28 22:18:19 home-pve kernel: RAX: fffffff797578d70 RBX: ffff994bd69be600 RCX: 0000000000000000
Jan 28 22:18:19 home-pve kernel: RDX: 0000000000000000 RSI: 000000000000000f RDI: ffff994be0ab8980
Jan 28 22:18:19 home-pve kernel: RBP: ffffb3e185f4bb68 R08: 0000000000000000 R09: ffff994bd9731c00
Jan 28 22:18:19 home-pve kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff994be0ab8980
Jan 28 22:18:19 home-pve kernel: R13: ffffb3e185f4bb18 R14: 0000000000000000 R15: ffff994bdb68d640
Jan 28 22:18:19 home-pve kernel: FS:  0000000000000000(0000) GS:ffff996abf300000(0000) knlGS:0000000000000000
Jan 28 22:18:19 home-pve kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 28 22:18:19 home-pve kernel: CR2: fffffff797578d46 CR3: 0000000135db2006 CR4: 0000000000f72ef0
Jan 28 22:18:19 home-pve kernel: PKRU: 55555554
Jan 28 22:18:19 home-pve kernel: note: jbd2/dm-4-8[564] exited with irqs disabled
Jan 28 22:18:19 home-pve kernel: ------------[ cut here ]------------
Jan 28 22:18:19 home-pve kernel: WARNING: CPU: 10 PID: 564 at kernel/exit.c:821 do_exit+0x8e5/0xaf0
Jan 28 22:18:19 home-pve kernel: Modules linked in: veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter nf_tables bonding tls softdog sunrpc nfnetlink_log nfnetlink binfmt_misc snd_sof_pci_intel_tgl snd_sof_intel_hda_common soundwire_intel snd_sof_intel_hda_mlink soundwire_cadence snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_sof_utils snd_soc_hdac_hda snd_hda_ext_core snd_soc_acpi_intel_match nouveau snd_soc_acpi soundwire_generic_allocation intel_rapl_msr intel_rapl_common soundwire_bus mxm_wmi drm_gpuvm intel_uncore_frequency intel_uncore_frequency_common drm_exec intel_tcc_cooling gpu_sched snd_soc_core x86_pkg_temp_thermal rtw89_8851be intel_powerclamp snd_compress coretemp ac97_bus drm_ttm_helper kvm_intel snd_hda_codec_realtek ttm kvm rtw89_8851b drm_display_helper snd_pcm_dmaengine snd_hda_codec_hdmi btusb cec btrtl btintel snd_hda_codec_generic irqbypass snd_hda_intel snd_intel_dspcfg btbcm snd_intel_sdw_acpi btmtk snd_hda_codec rtw89_pci bluetooth rc_core
Jan 28 22:18:19 home-pve kernel:  rtw89_core crct10dif_pclmul polyval_clmulni snd_hda_core ecdh_generic ecc i2c_algo_bit polyval_generic joydev input_leds mac80211 snd_hwdep ghash_clmulni_intel snd_pcm cmdlinepart cfg80211 snd_timer spi_nor mei_pxp mei_hdcp sha256_ssse3 snd sha1_ssse3 mtd mei_me libarc4 soundcore aesni_intel mei crypto_simd cryptd rapl intel_cstate mac_hid pcspkr eeepc_wmi intel_pmc_core intel_vsec pmt_telemetry pmt_class wmi_bmof acpi_pad acpi_tad zfs(PO) spl(O) vhost_net vhost vhost_iotlb tap efi_pstore dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic xor raid6_pq dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c usbkbd hid_generic usbhid hid nvme ahci nvme_core libahci nvme_auth mfd_aaeon asus_wmi xhci_pci ledtrig_audio xhci_pci_renesas sparse_keymap platform_profile r8169 crc32_pclmul xhci_hcd i2c_i801 spi_intel_pci intel_lpss_pci realtek i2c_smbus spi_intel intel_lpss vmd idma64 video wmi pinctrl_alderlake
Jan 28 22:18:19 home-pve kernel: CPU: 10 PID: 564 Comm: jbd2/dm-4-8 Tainted: P      D    O       6.8.12-7-pve #1
Jan 28 22:18:19 home-pve kernel: Hardware name: MicroElectronics B944/PRIME Z790-V AX, BIOS 1645 03/15/2024
Jan 28 22:18:19 home-pve kernel: RIP: 0010:do_exit+0x8e5/0xaf0
Jan 28 22:18:19 home-pve kernel: Code: e9 3a f8 ff ff 48 8b bb f8 09 00 00 31 f6 e8 92 e0 ff ff e9 ee fd ff ff 4c 89 ee bf 05 06 00 00 e8 60 3d 01 00 e9 66 f8 ff ff <0f> 0b e9 94 f7 ff ff 0f 0b e9 4d f7 ff ff 48 89 df e8 b5 31 14 00
Jan 28 22:18:19 home-pve kernel: RSP: 0018:ffffb3e185f4bec8 EFLAGS: 00010282
Jan 28 22:18:19 home-pve kernel: RAX: 0000000000000000 RBX: ffff994be130a900 RCX: 0000000000000000
Jan 28 22:18:19 home-pve kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
Jan 28 22:18:19 home-pve kernel: RBP: ffffb3e185f4bf20 R08: 0000000000000000 R09: 0000000000000000
Jan 28 22:18:19 home-pve kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff994be1059680
Jan 28 22:18:19 home-pve kernel: R13: 0000000000000009 R14: ffff994bdaa22940 R15: 0000000000000000
Jan 28 22:18:19 home-pve kernel: FS:  0000000000000000(0000) GS:ffff996abf300000(0000) knlGS:0000000000000000
Jan 28 22:18:19 home-pve kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 28 22:18:19 home-pve kernel: CR2: fffffff797578d46 CR3: 0000000135db2006 CR4: 0000000000f72ef0
Jan 28 22:18:19 home-pve kernel: PKRU: 55555554
Jan 28 22:18:19 home-pve kernel: Call Trace:
Jan 28 22:18:19 home-pve kernel:  <TASK>
Jan 28 22:18:19 home-pve kernel:  ? show_regs+0x6d/0x80
Jan 28 22:18:19 home-pve kernel:  ? __warn+0x89/0x160
Jan 28 22:18:19 home-pve kernel:  ? do_exit+0x8e5/0xaf0
Jan 28 22:18:19 home-pve kernel:  ? report_bug+0x17e/0x1b0
Jan 28 22:18:19 home-pve kernel:  ? handle_bug+0x46/0x90
Jan 28 22:18:19 home-pve kernel:  ? exc_invalid_op+0x18/0x80
Jan 28 22:18:19 home-pve kernel:  ? asm_exc_invalid_op+0x1b/0x20
Jan 28 22:18:19 home-pve kernel:  ? do_exit+0x8e5/0xaf0
Jan 28 22:18:19 home-pve kernel:  ? do_exit+0x72/0xaf0
Jan 28 22:18:19 home-pve kernel:  make_task_dead+0x83/0x170
Jan 28 22:18:19 home-pve kernel:  rewind_stack_and_make_dead+0x17/0x20
Jan 28 22:18:19 home-pve kernel:  </TASK>
Jan 28 22:18:19 home-pve kernel: ---[ end trace 0000000000000000 ]---
 
Fabian,

I reinstalled PVE on Sunday and have had no issues since then. Tonight I upgraded to the new kernel (proxmox-kernel-6.8.12-7-pve-signed:amd64 (6.8.12-7, automatic)), and now, a few hours after installing it and rebooting, it's crashing again.

I don't know why, but if it's the Intel chip, how come it wasn't crashing before the new kernel?
 
I already gave you my answer above - your CPU is most likely broken, please return it.
 
Fabian, with all due respect, this is a brand new computer and it works perfectly fine when I'm running the previous kernel. The moment I upgraded to the 6.8.12-7 kernel it started crashing. How can it be the CPU if I can run Rocky 8.10 or even Windows natively, but Proxmox keeps having issues when I upgrade?
 
Hello usridzero, just an idea: if you have only one disk and the VMs are saturating its I/O, it is quite normal for the OS to hang, freeze, or crash. Did you try to monitor the resources with "atop 1" or "iostat -xk 1"?
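
If atop is not installed, the same question (is the host stuck waiting on disk I/O?) can be roughly answered from /proc/stat. The following is only a minimal sketch, not something from the original posts: the function names are made up for illustration, and it assumes the standard Linux field order of the aggregate "cpu" line, where iowait is the fifth counter.

```python
def read_cpu_times(path="/proc/stat"):
    """Return the aggregate 'cpu' counters from /proc/stat as a list of ints."""
    with open(path) as f:
        for line in f:
            if line.startswith("cpu "):
                return [int(x) for x in line.split()[1:]]
    raise RuntimeError("no aggregate cpu line found")


def iowait_percent(before, after):
    """Share of the sampling interval spent in iowait, in percent.

    Both arguments are counter lists as returned by read_cpu_times().
    Field order: user nice system idle iowait irq softirq steal ...
    """
    deltas = [b - a for a, b in zip(before, after)]
    # Guest time is already accounted in user/nice, so only sum the first 8.
    total = sum(deltas[:8])
    return 100.0 * deltas[4] / total if total else 0.0
```

Calling read_cpu_times() twice, about a second apart, and passing both results to iowait_percent() gives a rough figure. A value that stays high while the host hangs would point at storage; a value near zero would point elsewhere.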
 
Hello ghusson, thanks for the suggestion. I have PVE installed on a 1TB spinning disk, while the LVM volumes for the instances are on two separate NVMe drives, one 2TB and one 500GB. The OS has plenty of space, even for backups, while the VMs are on the NVMe drives and should not take any resources from the OS disk.
When I look at htop, the only thing killing it is KVM (a kernel-related process). That is why I re-installed with the old kernel first, and while using the old kernel I didn't see any issues. As soon as I upgraded to the new kernel the system crashed.
Thoughts?
 
  1. PRIME Z790-V AX, BIOS 1645 03/15/2024
    You should really update the BIOS/UEFI to the most recent version! [1] (Obviously, this will not help if the CPU is already damaged...)
  2. Installing the intel-microcode package generally does not hurt either. [2]
  3. You could try the 6.11 kernel. [3]

[1] https://www.theverge.com/24216305/i...crash-news-updates-patches-fixes-motherboards
[2] https://pve.proxmox.com/pve-docs/chapter-sysadmin.html#sysadmin_firmware_cpu
[3] https://forum.proxmox.com/threads/o...ve-8-available-on-test-no-subscription.156818
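
To see whether a BIOS or intel-microcode update actually took effect, the microcode revision the kernel reports can be compared before and after the update. This is just a small sketch under assumptions: the function name is invented here, and it relies on the "microcode" field that x86 kernels print in /proc/cpuinfo.

```python
def microcode_revisions(cpuinfo_text):
    """Collect the distinct 'microcode' values found in /proc/cpuinfo text."""
    revisions = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "microcode":
            revisions.add(value.strip())
    return sorted(revisions)


def current_microcode(path="/proc/cpuinfo"):
    """Read the running system's cpuinfo and return its microcode revisions."""
    with open(path) as f:
        return microcode_revisions(f.read())
```

Run current_microcode() once before the update and once after the reboot. If the value did not change, the new microcode was not loaded. Which revision contains the fix for this CPU is not something the sketch knows; check the board vendor's and Intel's release notes for that.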
 
@usridzero:
And when the problem occurs, you are not running any backup and there are no unusual I/O waits on the OS disk?
Although you might think the CPU bug is unrelated to your experience with changing kernels, it might be correlated. You never know how the kernel manages CPU timings, caches, and so on; a change there could have triggered the CPU bug. The CPU bug could also be mitigated by a BIOS and microcode update (October 2024 for this particular CPU issue, I think). So why not try it? Look at the links given by Neobin, and at https://wiki.debian.org/Microcode.
 
I just recorded these errors from the dmesg output and I'm taking the server to the vendor for a replacement. Thanks for the support.


Code:
[80159.979718] get_swap_device: Bad swap offset entry 3fffffbffffff
[80159.980763] BUG: Bad page map in process (udev-worker)  pte:800000000 pmd:1d3c7d067
[80159.981560] addr:000075f166f09000 vm_flags:08000071 anon_vma:0000000000000000 mapping:ffff903e492fe3c0 index:a6
[80159.982252] file:modules.dep.bin fault:filemap_fault mmap:ext4_file_mmap read_folio:ext4_read_folio
[80159.983182] CPU: 6 PID: 213092 Comm: (udev-worker) Tainted: P      D    O       6.8.4-2-pve #1
[80159.984040] Hardware name: MicroElectronics B944/PRIME Z790-V AX, BIOS 1645 03/15/2024
[80159.984933] Call Trace:
[80159.985801]  <TASK>
[80159.986640]  dump_stack_lvl+0x48/0x70
[80159.987432]  dump_stack+0x10/0x20
[80159.988161]  print_bad_pte+0x1be/0x280
[80159.988858]  unmap_page_range+0xd93/0x1170
[80159.989542]  ? __mod_memcg_lruvec_state+0x87/0x140
[80159.990169]  unmap_single_vma+0x89/0xf0
[80159.990771]  unmap_vmas+0xb5/0x190
[80159.991360]  unmap_region+0xe8/0x180
[80159.991952]  do_vmi_align_munmap+0x3e8/0x5b0
[80159.992568]  do_vmi_munmap+0xdf/0x190
[80159.993128]  __vm_munmap+0xad/0x180
[80159.993664]  __x64_sys_munmap+0x27/0x40
[80159.994194]  do_syscall_64+0x84/0x180
[80159.994719]  ? irqentry_exit+0x43/0x50
[80159.995238]  ? exc_page_fault+0x94/0x1b0
[80159.995757]  entry_SYSCALL_64_after_hwframe+0x73/0x7b
[80159.996279] RIP: 0033:0x75f1676ef977
[80159.996806] Code: 00 00 00 48 8b 15 89 04 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 b8 0b 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 59 04 0d 00 f7 d8 64 89 01 48
[80159.997368] RSP: 002b:00007ffff93186f8 EFLAGS: 00000206 ORIG_RAX: 000000000000000b
[80159.997925] RAX: ffffffffffffffda RBX: 0000606c8de0a730 RCX: 000075f1676ef977
[80159.998481] RDX: 0000606c8de00e00 RSI: 00000000001021af RDI: 000075f166e63000
[80159.999036] RBP: 0000606c8de00e70 R08: 0000000000000007 R09: 0000606c8de00d40
[80159.999587] R10: a28453d7e824915d R11: 0000000000000206 R12: 00007ffff93187cc
[80160.000135] R13: 0000606c8def0c80 R14: 0000606c8ddffbf0 R15: 00007ffff93187a8
[80160.000681]  </TASK>
[80165.955512] BUG: Bad rss-counter state mm:00000000e0b5bce6 type:MM_SWAPENTS val:-1
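
For anyone comparing crash dumps like the ones in this thread, here is a minimal sketch (not from the original posts) that pulls the first BUG message, the task name, the kernel release, and the taint flags out of pasted log lines. The function name and the regular expression are assumptions written against the log format shown above. The taint meanings follow the kernel's tainted-kernels documentation, limited to the flags that appear here.

```python
import re

# Only the taint flags seen in this thread; others are passed through as-is.
TAINT_FLAGS = {
    "P": "proprietary module loaded",
    "O": "out-of-tree module loaded",
    "D": "kernel has already oopsed once",
}

# Matches e.g. "CPU: 10 PID: 564 Comm: jbd2/dm-4-8 Tainted: P   O   6.8.12-7-pve #1"
CPU_LINE = re.compile(
    r"CPU: \d+ PID: \d+ Comm: (?P<comm>.+?) "
    r"Tainted: (?P<taint>[A-Z ]+?)\s+(?P<release>\d\S*) #"
)


def summarize_oops(lines):
    """Return the first BUG message and the first CPU/task line details."""
    summary = {}
    for line in lines:
        if "bug" not in summary:
            m = re.search(r"BUG: (.*)", line)
            if m:
                summary["bug"] = m.group(1).strip()
        if "comm" not in summary:
            m = CPU_LINE.search(line)
            if m:
                summary["comm"] = m.group("comm")
                summary["release"] = m.group("release")
                summary["taint"] = [
                    TAINT_FLAGS.get(c, c) for c in m.group("taint") if c != " "
                ]
    return summary
```

Run against the two dumps in this thread, the release field is the part worth comparing: the first trace was logged on 6.8.12-7-pve, while this last one was logged on 6.8.4-2-pve and already carries the D flag.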
 