pvestatd crashes

penultimatum

Hi. I am getting fairly regular crashes of pvestatd, maybe 2-3 times a day. Occasionally pveproxy crashes too, and sometimes a VM goes to a status of 'internal error' not long after starting up.

I ran memtest, initially for 2 passes and then another 6, with no errors. I have also re-installed PVE but am still getting the issues. What can I check next? I'm running fairly new hardware - 14900K + MSI Z790-A + 192 GB RAM - with the latest MSI BIOS.
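The kernel messages below are from dmesg. For anyone wanting to pull out the same entries, a filter along these lines should work (the grep pattern is just my own choice, not anything official):

Code:
dmesg -T | grep -iE 'segfault|split_lock|oops'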

Code:
[Sat Jan  6 18:49:27 2024] x86/split lock detection: #AC: CPU 0/KVM/858994 took a split_lock trap at address: 0x26a8dce1888
[Sat Jan  6 18:49:27 2024] pveproxy worker[856467]: segfault at 9 ip 000055c80750812a sp 00007ffd34c97c00 error 4 in perl[55c80741f000+195000] likely on CPU 0 (core 0, socket 0)
[Sat Jan  6 18:49:27 2024] Code: ff 00 00 00 81 e2 00 00 00 04 75 11 49 8b 96 f8 00 00 00 48 89 10 49 89 86 f8 00 00 00 49 83 ae f0 00 00 00 01 4d 85 ff 74 19 <41> 8b 47 08 85 c0 0f 84 c2 00 00 00 83 e8 01 41 89 47 08 0f 84 05
[Sat Jan  6 18:49:35 2024] perf: interrupt took too long (3973 > 3920), lowering kernel.perf_event_max_sample_rate to 50250

Code:
[Sat Jan  6 09:35:29 2024] x86/split lock detection: #AC: CPU 0/KVM/524150 took a split_lock trap at address: 0xfffff80064e42fb3
[Sat Jan  6 15:33:18 2024] pvestatd[226428]: segfault at 107 ip 0000557ac583012a sp 00007ffe91c72d30 error 4 in perl[557ac5747000+195000] likely on CPU 0 (core 0, socket 0)
[Sat Jan  6 15:33:18 2024] Code: ff 00 00 00 81 e2 00 00 00 04 75 11 49 8b 96 f8 00 00 00 48 89 10 49 89 86 f8 00 00 00 49 83 ae f0 00 00 00 01 4d 85 ff 74 19 <41> 8b 47 08 85 c0 0f 84 c2 00 00 00 83 e8 01 41 89 47 08 0f 84 05

Code:
# pveversion -v
proxmox-ve: 8.1.0 (running kernel: 6.5.11-7-pve)
pve-manager: 8.1.3 (running version: 8.1.3/b46aac3b42da5d15)
proxmox-kernel-helper: 8.1.0
proxmox-kernel-6.5: 6.5.11-7
proxmox-kernel-6.5.11-7-pve-signed: 6.5.11-7
proxmox-kernel-6.5.11-4-pve-signed: 6.5.11-4
ceph-fuse: 18.2.0-pve2
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx7
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.0
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.0.7
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.1.0
libpve-guest-common-perl: 5.0.6
libpve-http-server-perl: 5.0.5
libpve-network-perl: 0.9.5
libpve-rs-perl: 0.8.7
libpve-storage-perl: 8.0.5
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve4
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.1.2-1
proxmox-backup-file-restore: 3.1.2-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.2
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.3
proxmox-widget-toolkit: 4.1.3
pve-cluster: 8.0.5
pve-container: 5.0.8
pve-docs: 8.1.3
pve-edk2-firmware: 4.2023.08-2
pve-firewall: 5.0.3
pve-firmware: 3.9-1
pve-ha-manager: 4.0.3
pve-i18n: 3.1.5
pve-qemu-kvm: 8.1.2-6
pve-xtermjs: 5.3.0-3
qemu-server: 8.0.10
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.2-pve1

I am occasionally seeing a VM go into a status of 'Internal Error', and then I see one of the CPU cores reporting 100°C (I assume it's also running at 100%). I then have to do a stop and restart of the VM.


Thanks.
 
Hi,
Code:
[Sat Jan  6 09:35:29 2024] x86/split lock detection: #AC: CPU 0/KVM/524150 took a split_lock trap at address: 0xfffff80064e42fb3
[Sat Jan  6 15:33:18 2024] pvestatd[226428]: segfault at 107 ip 0000557ac583012a sp 00007ffe91c72d30 error 4 in perl[557ac5747000+195000] likely on CPU 0 (core 0, socket 0)
[Sat Jan  6 15:33:18 2024] Code: ff 00 00 00 81 e2 00 00 00 04 75 11 49 8b 96 f8 00 00 00 48 89 10 49 89 86 f8 00 00 00 49 83 ae f0 00 00 00 01 4d 85 ff 74 19 <41> 8b 47 08 85 c0 0f 84 c2 00 00 00 83 e8 01 41 89 47 08 0f 84 05
does the segfault correlate with any specific event? Anything noticeable in the output of journalctl -r -u pvestatd.service pveproxy.service? Have you already installed the latest intel-microcode package: https://pve.proxmox.com/pve-docs/chapter-sysadmin.html#sysadmin_firmware_cpu ?
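For reference, the currently loaded microcode revision can be checked with, for example:
Code:
journalctl -k | grep -i microcode
grep -m1 microcode /proc/cpuinfo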

I am occasionally seeing a VM go into a status of 'Internal Error', and then I see one of the CPU cores reporting 100°C (I assume it's also running at 100%). I then have to do a stop and restart of the VM.
Is there anything in the system logs/journal around the time the issue happened? Please also share the VM configuration: qm config <ID>
 
So I have tried taking out 2 x 48 GB RAM modules to help confirm it is not a memory issue, and will swap them over again later.

Today I had errors regarding an 'exploit attempt' and a page fault:
Code:
[Mon Jan  8 13:10:43 2024] x86/split lock detection: #AC: CPU 2/KVM/261700 took a split_lock trap at address: 0x7ef1d050
[Mon Jan  8 13:13:47 2024] kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
[Mon Jan  8 13:13:47 2024] BUG: unable to handle page fault for address: ffffab892fae3f58
[Mon Jan  8 13:13:47 2024] #PF: supervisor instruction fetch in kernel mode
[Mon Jan  8 13:13:47 2024] #PF: error_code(0x0011) - permissions violation
[Mon Jan  8 13:13:47 2024] PGD 100000067 P4D 100000067 PUD 100204067 PMD 7b5fd6067 PTE 8000001060c98163
[Mon Jan  8 13:13:47 2024] Oops: 0011 [#1] PREEMPT SMP NOPTI
[Mon Jan  8 13:13:47 2024] CPU: 9 PID: 199104 Comm: kvm Tainted: P           OE      6.5.11-7-pve #1
[Mon Jan  8 13:13:47 2024] Hardware name: Micro-Star International Co., Ltd. MS-7E07/PRO Z790-A WIFI (MS-7E07), BIOS A.80 10/30/2023
[Mon Jan  8 13:13:47 2024] RIP: 0010:0xffffab892fae3f58
[Mon Jan  8 13:13:47 2024] Code: ff ff 77 6f c5 8e ff ff ff ff 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 e6 00 e0 8e ff ff ff ff <80> ba f1 92 c9 55 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00
[Mon Jan  8 13:13:47 2024] RSP: 0018:ffffab892fae3f18 EFLAGS: 00010046
[Mon Jan  8 13:13:47 2024] RAX: 0000000000000000 RBX: ffffab892fae3f58 RCX: 0000000000000000
[Mon Jan  8 13:13:47 2024] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[Mon Jan  8 13:13:47 2024] RBP: ffffab892fae3f18 R08: 0000000000000000 R09: 0000000000000000
[Mon Jan  8 13:13:47 2024] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[Mon Jan  8 13:13:47 2024] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[Mon Jan  8 13:13:47 2024] FS:  00007fab82272500(0000) GS:ffff9b569f240000(0000) knlGS:0000000000000000
[Mon Jan  8 13:13:47 2024] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Mon Jan  8 13:13:47 2024] CR2: ffffab892fae3f58 CR3: 0000000278a0c000 CR4: 0000000000752ee0
[Mon Jan  8 13:13:47 2024] PKRU: 55555554
[Mon Jan  8 13:13:47 2024] Call Trace:
[Mon Jan  8 13:13:47 2024]  <TASK>
[Mon Jan  8 13:13:47 2024]  ? show_regs+0x6d/0x80
[Mon Jan  8 13:13:47 2024]  ? __die+0x24/0x80
[Mon Jan  8 13:13:47 2024]  ? page_fault_oops+0x176/0x500
[Mon Jan  8 13:13:47 2024]  ? kernelmode_fixup_or_oops+0xb2/0x140
[Mon Jan  8 13:13:47 2024]  ? __bad_area_nosemaphore+0x1a5/0x280
[Mon Jan  8 13:13:47 2024]  ? bad_area_nosemaphore+0x16/0x30
[Mon Jan  8 13:13:47 2024]  ? do_kern_addr_fault+0x7b/0xa0
[Mon Jan  8 13:13:47 2024]  ? exc_page_fault+0x10d/0x1b0
[Mon Jan  8 13:13:47 2024]  ? asm_exc_page_fault+0x27/0x30
[Mon Jan  8 13:13:47 2024]  do_syscall_64+0x67/0x90
[Mon Jan  8 13:13:47 2024]  ? do_syscall_64+0x67/0x90
[Mon Jan  8 13:13:47 2024]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[Mon Jan  8 13:13:47 2024] RIP: 0033:0x7fab8514c17f
[Mon Jan  8 13:13:47 2024] Code: 89 54 24 18 48 89 74 24 10 89 7c 24 08 e8 49 d5 f8 ff 48 8b 54 24 18 48 8b 74 24 10 41 89 c0 8b 7c 24 08 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 31 44 89 c7 48 89 44 24 08 e8 9c d5 f8 ff 48
[Mon Jan  8 13:13:47 2024] RSP: 002b:00007ffd24d7f1c0 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
[Mon Jan  8 13:13:47 2024] RAX: 0000000000000008 RBX: 000055c9926a9c70 RCX: 00007fab8514c17f
[Mon Jan  8 13:13:47 2024] RDX: 0000000000000008 RSI: 00007ffd24d7f1f0 RDI: 0000000000000008
[Mon Jan  8 13:13:47 2024] RBP: 00007ffd24d7f1f0 R08: 0000000000000000 R09: 00007fab86bb42c0
[Mon Jan  8 13:13:47 2024] R10: 0000000000000001 R11: 0000000000000293 R12: 0000000000000000
[Mon Jan  8 13:13:47 2024] R13: 0000000000000000 R14: 0000000000000001 R15: 000055c992f1ba80
[Mon Jan  8 13:13:47 2024]  </TASK>
[Mon Jan  8 13:13:47 2024] Modules linked in: veth tcp_diag inet_diag ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter bpfilter sctp ip6_udp_tunnel udp_tunnel nf_tables 8021q garp mrp bonding tls softdog sunrpc nfnetlink_log nfnetlink binfmt_misc nvidia_vgpu_vfio(OE) snd_sof_pci_intel_tgl snd_sof_intel_hda_common soundwire_intel snd_sof_intel_hda_mlink soundwire_cadence snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_sof_utils snd_soc_hdac_hda snd_hda_ext_core snd_soc_acpi_intel_match snd_soc_acpi intel_rapl_msr soundwire_generic_allocation intel_rapl_common soundwire_bus snd_soc_core intel_uncore_frequency intel_uncore_frequency_common snd_compress intel_tcc_cooling btusb x86_pkg_temp_thermal ac97_bus nvidia(POE) snd_pcm_dmaengine snd_hda_codec_hdmi btrtl intel_powerclamp i915 snd_usb_audio snd_hda_intel btbcm coretemp btintel snd_intel_dspcfg snd_usbmidi_lib snd_ump btmtk snd_intel_sdw_acpi kvm_intel drm_buddy iwlmvm snd_rawmidi snd_hda_codec bluetooth crct10dif_pclmul
[Mon Jan  8 13:13:47 2024]  ttm snd_seq_device polyval_clmulni polyval_generic mc snd_hda_core joydev ecdh_generic drm_display_helper input_leds ghash_clmulni_intel ecc mac80211 snd_hwdep cec aesni_intel snd_pcm mei_hdcp mei_pxp rc_core snd_timer crypto_simd libarc4 iwlwifi cryptd cmdlinepart mdev pmt_telemetry snd spi_nor drm_kms_helper rapl mei_me pmt_class kvm soundcore cfg80211 intel_cstate mtd i2c_algo_bit pcspkr mxm_wmi wmi_bmof intel_vsec serial_multi_instantiate acpi_pad acpi_tad mei mac_hid zfs(PO) spl(O) vhost_net vhost vhost_iotlb tap vfio_pci vfio_pci_core irqbypass vfio_iommu_type1 vfio iommufd drm efi_pstore dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic xor raid6_pq simplefb usbmouse usbkbd hid_generic dm_thin_pool usbhid dm_persistent_data hid dm_bio_prison dm_bufio libcrc32c xhci_pci nvme xhci_pci_renesas crc32_pclmul spi_intel_pci i2c_i801 xhci_hcd nvme_core ahci spi_intel igc i2c_smbus libahci nvme_common video pinctrl_alderlake wmi [last unloaded: cpuid]
[Mon Jan  8 13:13:47 2024] CR2: ffffab892fae3f58
[Mon Jan  8 13:13:47 2024] ---[ end trace 0000000000000000 ]---
[Mon Jan  8 13:13:47 2024] RIP: 0010:0xffffab892fae3f58
[Mon Jan  8 13:13:47 2024] Code: ff ff 77 6f c5 8e ff ff ff ff 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 e6 00 e0 8e ff ff ff ff <80> ba f1 92 c9 55 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00
[Mon Jan  8 13:13:47 2024] RSP: 0018:ffffab892fae3f18 EFLAGS: 00010046
[Mon Jan  8 13:13:47 2024] RAX: 0000000000000000 RBX: ffffab892fae3f58 RCX: 0000000000000000
[Mon Jan  8 13:13:47 2024] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[Mon Jan  8 13:13:47 2024] RBP: ffffab892fae3f18 R08: 0000000000000000 R09: 0000000000000000
[Mon Jan  8 13:13:47 2024] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[Mon Jan  8 13:13:47 2024] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[Mon Jan  8 13:13:47 2024] FS:  00007fab82272500(0000) GS:ffff9b569f240000(0000) knlGS:0000000000000000
[Mon Jan  8 13:13:47 2024] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Mon Jan  8 13:13:47 2024] CR2: ffffab892fae3f58 CR3: 0000000278a0c000 CR4: 0000000000752ee0
[Mon Jan  8 13:13:47 2024] PKRU: 55555554
[Mon Jan  8 13:13:47 2024] note: kvm[199104] exited with irqs disabled

And then a little later, another pvestatd crash:

Code:
[Mon Jan  8 13:35:30 2024] pvestatd[264563]: segfault at ffffffffffffffff ip 0000556cc0ded4cc sp 00007ffd55686aa0 error 7 in perl[556cc0d02000+195000] likely on CPU 1 (core 0, socket 0)
[Mon Jan  8 13:35:30 2024] Code: 8b 43 0c e9 6a ff ff ff 66 0f 1f 44 00 00 3c 02 0f 86 a0 00 00 00 0d 00 00 00 10 48 8b 55 10 89 45 0c 48 8b 45 00 48 8b 40 18 <c6> 44 02 ff 00 48 8b 45 00 48 8b 75 10 48 8b 40 18 e9 73 ff ff ff

journalctl -r -u pvestatd.service

Code:
Jan 08 13:35:31 raptor systemd[1]: pvestatd.service: Consumed 5.495s CPU time.
Jan 08 13:35:31 raptor systemd[1]: pvestatd.service: Failed with result 'signal'.
Jan 08 13:35:31 raptor systemd[1]: pvestatd.service: Main process exited, code=killed, status=11/SEGV

I do have the latest Intel microcode applied via MSI BIOS updates. I will take a look at the Debian early microcode updates you linked. Thanks.
 
OK, an update on where I have got to. Still getting segfaults, mainly with pvestatd, and occasionally with pveproxy & pvedaemon.

Code:
[Wed Jan 24 19:40:18 2024] pvestatd[351154]: segfault at 128 ip 00005625ee79b12a sp 00007fff23a36a90 error 4 in perl[5625ee6b2000+195000] likely on CPU 1 (core 0, socket 0)
[Wed Jan 24 19:40:18 2024] Code: ff 00 00 00 81 e2 00 00 00 04 75 11 49 8b 96 f8 00 00 00 48 89 10 49 89 86 f8 00 00 00 49 83 ae f0 00 00 00 01 4d 85 ff 74 19 <41> 8b 47 08 85 c0 0f 84 c2 00 00 00 83 e8 01 41 89 47 08 0f 84 05

[Wed Jan 24 19:50:00 2024] pvedaemon worke[405081]: segfault at 9 ip 000055c87890e12a sp 00007ffe4eac2250 error 4 in perl[55c878825000+195000] likely on CPU 1 (core 0, socket 0)
[Wed Jan 24 19:50:00 2024] Code: ff 00 00 00 81 e2 00 00 00 04 75 11 49 8b 96 f8 00 00 00 48 89 10 49 89 86 f8 00 00 00 49 83 ae f0 00 00 00 01 4d 85 ff 74 19 <41> 8b 47 08 85 c0 0f 84 c2 00 00 00 83 e8 01 41 89 47 08 0f 84 05

[Wed Jan 24 20:03:34 2024] pvestatd[407512]: segfault at 107 ip 000055a75aebe12a sp 00007ffe4e3214f0 error 4 in perl[55a75add5000+195000] likely on CPU 1 (core 0, socket 0)
[Wed Jan 24 20:03:34 2024] Code: ff 00 00 00 81 e2 00 00 00 04 75 11 49 8b 96 f8 00 00 00 48 89 10 49 89 86 f8 00 00 00 49 83 ae f0 00 00 00 01 4d 85 ff 74 19 <41> 8b 47 08 85 c0 0f 84 c2 00 00 00 83 e8 01 41 89 47 08 0f 84 05

[Wed Jan 24 20:21:30 2024] pvestatd[421299]: segfault at 107 ip 000055a228c6b12a sp 00007ffd51f852c0 error 4 in perl[55a228b82000+195000] likely on CPU 1 (core 0, socket 0)
[Wed Jan 24 20:21:30 2024] Code: ff 00 00 00 81 e2 00 00 00 04 75 11 49 8b 96 f8 00 00 00 48 89 10 49 89 86 f8 00 00 00 49 83 ae f0 00 00 00 01 4d 85 ff 74 19 <41> 8b 47 08 85 c0 0f 84 c2 00 00 00 83 e8 01 41 89 47 08 0f 84 05

[Wed Jan 24 21:21:26 2024] pvestatd[431374]: segfault at 1e008 ip 000055597004312a sp 00007ffc73f7af60 error 4 in perl[55596ff5a000+195000] likely on CPU 0 (core 0, socket 0)
[Wed Jan 24 21:21:26 2024] Code: ff 00 00 00 81 e2 00 00 00 04 75 11 49 8b 96 f8 00 00 00 48 89 10 49 89 86 f8 00 00 00 49 83 ae f0 00 00 00 01 4d 85 ff 74 19 <41> 8b 47 08 85 c0 0f 84 c2 00 00 00 83 e8 01 41 89 47 08 0f 84 05

[Wed Jan 24 21:57:14 2024] pveproxy worker[484344]: segfault at 9 ip 000055737f1d812a sp 00007ffedb4ad8b0 error 4 in perl[55737f0ef000+195000] likely on CPU 0 (core 0, socket 0)
[Wed Jan 24 21:57:14 2024] Code: ff 00 00 00 81 e2 00 00 00 04 75 11 49 8b 96 f8 00 00 00 48 89 10 49 89 86 f8 00 00 00 49 83 ae f0 00 00 00 01 4d 85 ff 74 19 <41> 8b 47 08 85 c0 0f 84 c2 00 00 00 83 e8 01 41 89 47 08 0f 84 05

[Wed Jan 24 22:50:41 2024] pvestatd[465526]: segfault at ffffffffffffffff ip 000055c3f7d574cc sp 00007fffdfac4fd0 error 7 in perl[55c3f7c6c000+195000] likely on CPU 1 (core 0, socket 0)
[Wed Jan 24 22:50:41 2024] Code: 8b 43 0c e9 6a ff ff ff 66 0f 1f 44 00 00 3c 02 0f 86 a0 00 00 00 0d 00 00 00 10 48 8b 55 10 89 45 0c 48 8b 45 00 48 8b 40 18 <c6> 44 02 ff 00 48 8b 45 00 48 8b 75 10 48 8b 40 18 e9 73 ff ff ff

[Wed Jan 24 23:55:37 2024] pvestatd[515903]: segfault at ffffffffffffffff ip 00005599e4a644cc sp 00007ffc5e855790 error 7 in perl[5599e4979000+195000] likely on CPU 1 (core 0, socket 0)
[Wed Jan 24 23:55:37 2024] Code: 8b 43 0c e9 6a ff ff ff 66 0f 1f 44 00 00 3c 02 0f 86 a0 00 00 00 0d 00 00 00 10 48 8b 55 10 89 45 0c 48 8b 45 00 48 8b 40 18 <c6> 44 02 ff 00 48 8b 45 00 48 8b 75 10 48 8b 40 18 e9 73 ff ff ff


So far I have:
Run memtest for approx. 8 passes with 0 issues flagged.
Removed 2 sticks of the RAM, then swapped in the other 2 sticks, to rule out a faulty module.
Swapped out the GPUs.
Tested a previous kernel.
Re-installed PVE 8.1.3 onto a different SSD drive.
Moved the data/VM drives onto a different NVMe.
Disabled XMP to reduce the RAM from 5200 to 4800 (see the memory-speed check below).
Ensured the latest microcode is applied.
Reverted the BIOS to the previous release, then updated to the latest, which was just released.
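For reference, the configured memory speed after disabling XMP can be confirmed with dmidecode (assuming it is installed), something like:
Code:
dmidecode --type memory | grep -i 'configured memory speed'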

There are no specific events in journalctl prior to the segfaults.
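One way to check this is to look at the full journal in a window around one of the segfault timestamps above, e.g.:
Code:
journalctl --since "2024-01-24 19:30" --until "2024-01-24 19:45"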

I have approx 20 VMs running (mainly Windows 10) of various configs. Below are a couple which are less standard, as I have set a specific CPU type rather than host.


Code:
args: -cpu Westmere,-hypervisor,kvm=off -smbios type=0,vendor="",version=1,date=01/01/1970,release=1.0
balloon: 0
bios: ovmf
boot: order=sata0;net0
cores: 4
cpu: Westmere
efidisk0: data3:vm-242-disk-0,efitype=4m,size=4M
hostpci0: mapping=Nvidia02,mdev=nvidia-435,pcie=1,x-vga=1
machine: pc-q35-8.1
memory: 4096
meta: creation-qemu=8.1.2,ctime=1702896587
name: VM0242
net0: e1000=BC:24:11:2B:DA:F3,bridge=vmbr0,tag=224
numa: 0
ostype: win10
sata0: data3:vm-242-disk-1,discard=on,size=200G
scsihw: virtio-scsi-single
smbios1: uuid=05ad4213-17e6-4b6d-8c94-a0f99bd58667
sockets: 1
vga: none
vmgenid: df163725-05e6-4aa8-8630-3225a768c573

Code:
cat 243.conf
args: -cpu Westmere,-hypervisor,kvm=off -smbios type=0,vendor="",version=1,date=01/01/1970,release=1.0
balloon: 0
bios: ovmf
boot: order=virtio0;net0
cores: 4
cpu: Westmere
efidisk0: data4:vm-243-disk-0,efitype=4m,size=4M
hostpci0: mapping=Nvidia01,mdev=nvidia-435,pcie=1,x-vga=1
machine: pc-q35-8.1
memory: 4096
meta: creation-qemu=8.1.2,ctime=1702896702
name: VM0243
net0: virtio=BC:24:11:2F:C4:EB,bridge=vmbr0,tag=20
numa: 0
ostype: win10
scsihw: virtio-scsi-single
smbios1: uuid=af8c590a-6af4-4aba-8c06-79978fb0d187
sockets: 1
vga: none
virtio0: data4:vm-243-disk-1,discard=on,iothread=1,size=200G
vmgenid: cacac64e-2664-4652-bff9-3b4e2740df9e


I typically have 15-20 VMs running and the CPU runs at 20-40%. sensors shows the CPU cores generally at 55-65°C, with individual cores briefly spiking up to 80°C at times, so cooling appears to be fine.

Any suggestions on further things to try to fix or diagnose the issue? Is this likely a CPU / motherboard / memory issue, or could it be other hardware? Any tips on how to work out which component, ideally without having to purchase another 14900K! :)

Thanks.
 
You could try installing and running "debsums -c" to rule out something like a core system library being corrupt - but this does sound like a hardware issue (CPU/memory/disk).
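A minimal sketch of that check (debsums is available from the standard Debian repositories):
Code:
apt install debsums
debsums -c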
 
Have you already installed the latest intel-microcode package: https://pve.proxmox.com/pve-docs/chapter-sysadmin.html#sysadmin_firmware_cpu ?
Is this to get rid of the 'data leak possible' message after booting up? If so, I have checked but could not find the repository.

My current /etc/apt/sources.list:
Code:
deb http://ftp.us.debian.org/debian bookworm main contrib

deb http://ftp.us.debian.org/debian bookworm-updates main contrib

# security updates
deb http://security.debian.org bookworm-security main contrib

deb http://download.proxmox.com/debian/pve bookworm pve-no-subscription

The link you cited says to append: "To be able to install packages from this component, run editor /etc/apt/sources.list, append non-free-firmware to the end of each .debian.org repository line and run apt update."

Does that mean, for example, this: deb http://download.proxmox.com/debian/pve bookworm pve-no-subscription/non-free-firmware?
Is there a charge for the firmware?
 
Code:
deb http://ftp.us.debian.org/debian bookworm main contrib

deb http://ftp.us.debian.org/debian bookworm-updates main contrib

# security updates
deb http://security.debian.org bookworm-security main contrib
The above are the Debian repositories.
The link you cited says to append: "To be able to install packages from this component, run editor /etc/apt/sources.list, append non-free-firmware to the end of each .debian.org repository line and run apt update."
Does that mean, for example, this: deb http://download.proxmox.com/debian/pve bookworm pve-no-subscription/non-free-firmware?
No, this is one you don't need to add it to, because it is not a .debian.org repository.

So you should change
Code:
deb http://ftp.us.debian.org/debian bookworm main contrib
to
Code:
deb http://ftp.us.debian.org/debian bookworm main contrib non-free-firmware
and similar for the others.
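For completeness, the Debian lines from your sources.list would then look like this:
Code:
deb http://ftp.us.debian.org/debian bookworm main contrib non-free-firmware

deb http://ftp.us.debian.org/debian bookworm-updates main contrib non-free-firmware

# security updates
deb http://security.debian.org bookworm-security main contrib non-free-firmware
followed by updating and installing the package:
Code:
apt update
apt install intel-microcode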

Is there a charge for the firmware?
The firmware is provided by Debian, so no. And there is no charge for any software provided by Proxmox and it's all AGPL-licensed. What you can buy is subscriptions for support and access to the enterprise repository (which contains better-tested packages): https://proxmox.com/en/services/support
 
Old post, but your problem seems to be identical to mine: https://forum.proxmox.com/threads/pvestatd-crash.144066/
where nobody replied.
On one server I had the datacenter do a full hardware change, but the service still crashes, although now only once every 3 days.
The configuration and the VMs are identical to yours, 20 Windows VMs, but another node of the cluster with identical hardware has never crashed, so it does seem to be a hardware-related problem.
 
Try checking your CPU voltage. I believe my problem was that the CPU needed the voltage increased, as the MSI default was a little low, or my particular CPU wasn't so efficient.

I installed Windows on my setup and ran benchmarks and stress tests, and found it would blue screen or error shortly after starting a stress test. I increased the CPU voltage (I think the MSI motherboard setting is 'CPU Lite Load'; I changed the default from 10 to 12) and then the stress tests did not blue screen or report errors. I also had to upgrade the cooling to a decent AIO liquid cooler in order to properly stress test without cooking the CPU.

I believe that has solved the issue. I'm not running as many VMs now, but I have not had a crash in the last month.
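For anyone who wants to run a similar stress test from within Proxmox/Debian rather than Windows, stress-ng from the Debian repositories can load all cores for a fixed period (just a sketch, the duration is arbitrary):
Code:
apt install stress-ng
stress-ng --cpu 0 --timeout 10m --metrics-brief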
 
