Proxmox nodes crashing randomly

biebernguyen

New Member
Aug 26, 2024
Hi everyone

My company uses Proxmox to run its services. We currently have 9 nodes in 2 clusters: one cluster of 6 nodes and one cluster of 3 nodes.
Recently, some nodes have automatically rebooted, interrupting the services.
The iDRAC console does not show anything in the logs; the nodes just crash and I have to restart them before they can be used again.

HA is configured, but it was hindered as well, because I had to restart the node manually through iDRAC.

This is the spec of one node:
Dell R650SX
112 x Intel(R) Xeon(R) Gold 6330 CPU @ 2.00GHz (2 Sockets)
512 GB RAM

The nodes run Ceph on SATA SSDs.

This is the log after I rebooted using iDRAC. The iDRAC event logs did not record any hardware errors.

journalctl --since "2024-12-09 12:15:00" --until "2024-12-09 13:37:00"
Dec 09 13:05:47 pve2 snmpd[514757]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
Dec 09 13:06:47 pve2 snmpd[514757]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
Dec 09 13:07:47 pve2 snmpd[514757]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
Dec 09 13:08:47 pve2 snmpd[514757]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
Dec 09 13:09:47 pve2 snmpd[514757]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
Dec 09 13:10:47 pve2 snmpd[514757]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
Dec 09 13:11:47 pve2 snmpd[514757]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
Dec 09 13:12:47 pve2 snmpd[514757]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
Dec 09 13:13:47 pve2 snmpd[514757]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
Dec 09 13:14:45 pve2 systemd[1]: Starting apt-daily.service - Daily apt download activities...
Dec 09 13:14:45 pve2 systemd[1]: apt-daily.service: Deactivated successfully.
Dec 09 13:14:45 pve2 systemd[1]: Finished apt-daily.service - Daily apt download activities.
Dec 09 13:14:47 pve2 snmpd[514757]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
Dec 09 13:15:47 pve2 snmpd[514757]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
Dec 09 13:16:47 pve2 snmpd[514757]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
Dec 09 13:17:01 pve2 CRON[2138763]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Dec 09 13:17:01 pve2 CRON[2138764]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Dec 09 13:17:01 pve2 CRON[2138763]: pam_unix(cron:session): session closed for user root
Dec 09 13:17:47 pve2 snmpd[514757]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
Dec 09 13:18:47 pve2 snmpd[514757]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
Dec 09 13:19:47 pve2 snmpd[514757]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
Dec 09 13:20:47 pve2 snmpd[514757]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
-- Boot 328026d952c14b4c8957590e6d638f67 --
Dec 09 13:29:19 pve2 kernel: Linux version 6.8.4-2-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for De>
Dec 09 13:29:19 pve2 kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-6.8.4-2-pve root=/dev/mapper/pve-root ro quiet
Dec 09 13:29:19 pve2 kernel: KERNEL supported cpus:
Dec 09 13:29:19 pve2 kernel: Intel GenuineIntel
Dec 09 13:29:19 pve2 kernel: AMD AuthenticAMD
Dec 09 13:29:19 pve2 kernel: Hygon HygonGenuine
Dec 09 13:29:19 pve2 kernel: Centaur CentaurHauls
Dec 09 13:29:19 pve2 kernel: zhaoxin Shanghai
Dec 09 13:29:19 pve2 kernel: BIOS-provided physical RAM map:



root@pve2:~# journalctl -p err -f
Dec 09 13:29:21 pve2 smartd[1380]: Device: /dev/bus/6 [megaraid_disk_00] [SAT], no ATA CHECK POWER STATUS support, ignoring -n Directive
Dec 09 13:29:21 pve2 smartd[1380]: Device: /dev/bus/6 [megaraid_disk_01] [SAT], no ATA CHECK POWER STATUS support, ignoring -n Directive
Dec 09 13:29:24 pve2 pmxcfs[1694]: [quorum] crit: quorum_initialize failed: 2
Dec 09 13:29:24 pve2 pmxcfs[1694]: [quorum] crit: can't initialize service
Dec 09 13:29:24 pve2 pmxcfs[1694]: [confdb] crit: cmap_initialize failed: 2
Dec 09 13:29:24 pve2 pmxcfs[1694]: [confdb] crit: can't initialize service
Dec 09 13:29:24 pve2 pmxcfs[1694]: [dcdb] crit: cpg_initialize failed: 2
Dec 09 13:29:24 pve2 pmxcfs[1694]: [dcdb] crit: can't initialize service
Dec 09 13:29:24 pve2 pmxcfs[1694]: [status] crit: cpg_initialize failed: 2
Dec 09 13:29:24 pve2 pmxcfs[1694]: [status] crit: can't initialize service
 
Logs from another node

root@pve5:~# journalctl --since "2024-12-04 16:30:00" --until "2024-12-04 17:00:00"
Dec 04 16:30:01 pve5 snmpd[2160]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
Dec 04 16:30:02 pve5 pmxcfs[2254]: [status] notice: received log
Dec 04 16:30:08 pve5 pmxcfs[2254]: [status] notice: received log
Dec 04 16:31:01 pve5 snmpd[2160]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
Dec 04 16:32:01 pve5 snmpd[2160]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
Dec 04 16:33:01 pve5 snmpd[2160]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
Dec 04 16:34:01 pve5 snmpd[2160]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
Dec 04 16:34:28 pve5 pmxcfs[2254]: [dcdb] notice: data verification successful
Dec 04 16:35:01 pve5 snmpd[2160]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
Dec 04 16:35:44 pve5 pmxcfs[2254]: [status] notice: received log
Dec 04 16:36:01 pve5 snmpd[2160]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
Dec 04 16:37:01 pve5 snmpd[2160]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
Dec 04 16:37:05 pve5 pmxcfs[2254]: [status] notice: received log
Dec 04 16:38:01 pve5 snmpd[2160]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
Dec 04 16:39:01 pve5 snmpd[2160]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
Dec 04 16:40:01 pve5 snmpd[2160]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
Dec 04 16:41:01 pve5 snmpd[2160]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
Dec 04 16:42:01 pve5 snmpd[2160]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
Dec 04 16:43:01 pve5 snmpd[2160]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
Dec 04 16:43:33 pve5 pmxcfs[2254]: [status] notice: received log
Dec 04 16:43:37 pve5 pmxcfs[2254]: [status] notice: received log
Dec 04 16:44:01 pve5 snmpd[2160]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
-- Boot 9d32a0e4a1334081aedb9271937a456b --
Dec 04 16:53:05 pve5 kernel: Linux version 6.8.4-2-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX>
Dec 04 16:53:05 pve5 kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-6.8.4-2-pve root=/dev/mapper/pve-root ro quiet
Dec 04 16:53:05 pve5 kernel: KERNEL supported cpus:
Dec 04 16:53:05 pve5 kernel: Intel GenuineIntel
Dec 04 16:53:05 pve5 kernel: AMD AuthenticAMD
Dec 04 16:53:05 pve5 kernel: Hygon HygonGenuine
Dec 04 16:53:05 pve5 kernel: Centaur CentaurHauls
Dec 04 16:53:05 pve5 kernel: zhaoxin Shanghai
Dec 04 16:53:05 pve5 kernel: x86/tme: not enabled by BIOS
Dec 04 16:53:05 pve5 kernel: x86/split lock detection: #AC: crashing the kernel on kernel split_locks and warning on user-space split_locks
Dec 04 16:53:05 pve5 kernel: BIOS-provided physical RAM map:
Dec 04 16:53:05 pve5 kernel: BIOS-e820: [mem 0x0000000000000000-0x000000000008efff] usable
Dec 04 16:53:05 pve5 kernel: BIOS-e820: [mem 0x000000000008f000-0x000000000008ffff] reserved
Dec 04 16:53:05 pve5 kernel: BIOS-e820: [mem 0x0000000000090000-0x000000000009ffff] usable


root@pve5:~# journalctl -p err -f
Dec 04 16:53:11 pve5 pmxcfs[2254]: [quorum] crit: quorum_initialize failed: 2
Dec 04 16:53:11 pve5 pmxcfs[2254]: [quorum] crit: can't initialize service
Dec 04 16:53:11 pve5 pmxcfs[2254]: [confdb] crit: cmap_initialize failed: 2
Dec 04 16:53:11 pve5 pmxcfs[2254]: [confdb] crit: can't initialize service
Dec 04 16:53:11 pve5 pmxcfs[2254]: [dcdb] crit: cpg_initialize failed: 2
Dec 04 16:53:11 pve5 pmxcfs[2254]: [dcdb] crit: can't initialize service
Dec 04 16:53:11 pve5 pmxcfs[2254]: [status] crit: cpg_initialize failed: 2
Dec 04 16:53:11 pve5 pmxcfs[2254]: [status] crit: can't initialize service
Dec 04 16:53:20 pve5 kernel: bond2: (slave eno12429): speed changed to 0 on port 2
Dec 06 01:14:12 pve5 pvedaemon[2891]: VM 173 qmp command failed - VM 173 qmp command 'guest-ping' failed - got timeout
 
Code:
Dec 04 16:53:05 pve5 kernel: x86/split lock detection: #AC: crashing the kernel on kernel split_locks and warning on user-space split_locks
If you have checked and there are no hardware issues, try disabling split lock detection by adding "split_lock_detect=off" to the kernel boot parameters.
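A minimal sketch of how that parameter could be added, assuming the node boots through GRUB (the logged command line starts with BOOT_IMAGE=, which suggests it does). The helper name add_kernel_param is hypothetical, and the demo runs on a scratch copy rather than on /etc/default/grub. Installations that boot through systemd-boot keep the command line in /etc/kernel/cmdline instead and are refreshed with proxmox-boot-tool refresh. Note that, as far as I know, the quoted kernel line is printed at boot to describe the active split lock mode and does not by itself record a crash.

```shell
# Hypothetical helper: append a kernel parameter to GRUB_CMDLINE_LINUX_DEFAULT
# in the given file, unless the parameter is already there.
# Assumes GNU sed and a parameter without "/" characters.
add_kernel_param() {
    file="$1"
    param="$2"
    if grep -q "^GRUB_CMDLINE_LINUX_DEFAULT=.*${param}" "$file"; then
        return 0
    fi
    sed -i "s/^GRUB_CMDLINE_LINUX_DEFAULT=\"\(.*\)\"/GRUB_CMDLINE_LINUX_DEFAULT=\"\1 ${param}\"/" "$file"
}

# Demo on a scratch file that mimics the default line.
demo="$(mktemp)"
printf 'GRUB_CMDLINE_LINUX_DEFAULT="quiet"\n' > "$demo"
add_kernel_param "$demo" "split_lock_detect=off"
cat "$demo"
rm -f "$demo"

# On the node itself (as root), the equivalent steps would be:
#   add_kernel_param /etc/default/grub split_lock_detect=off
#   update-grub
#   reboot
```

After the reboot, /proc/cmdline should contain the new parameter.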


Code:
Dec 04 16:53:11 pve5 pmxcfs[2254]: [quorum] crit: quorum_initialize failed: 2
Dec 04 16:53:11 pve5 pmxcfs[2254]: [quorum] crit: can't initialize service
It also seems you don't have enough nodes up in the cluster for it to run.
 
Dec 04 16:53:11 pve5 pmxcfs[2254]: [quorum] crit: quorum_initialize failed: 2
Dec 04 16:53:11 pve5 pmxcfs[2254]: [quorum] crit: can't initialize service

These logs appeared after I rebooted.
Is split_lock_detect=off required? Is split lock detection the cause of the Proxmox crashes?
 
I don't understand. From what you wrote, it seems you already use "split_lock_detect=off", but in that case the message should not be printed, AFAIK.
I also don't understand whether you did a clean reboot (you now write that the logs appeared after you rebooted) or whether there was a crash.
I suspect that you saw Proxmox "not working" (perhaps due to missing quorum), assumed it had a problem, and did a hard shutdown.

https://pve.proxmox.com/wiki/Cluster_Manager#_quorum
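Whether the parameter is already in use can be read from the kernel command line. The sketch below checks the command line that the pve5 boot log in this thread printed; the function name cmdline_has is hypothetical.

```shell
# Returns success (0) if the kernel command line in $1 contains the
# parameter in $2 as a whole word.
cmdline_has() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Command line copied from the pve5 boot log posted in this thread.
logged='BOOT_IMAGE=/boot/vmlinuz-6.8.4-2-pve root=/dev/mapper/pve-root ro quiet'

if cmdline_has "$logged" 'split_lock_detect=off'; then
    echo "split_lock_detect=off is set"
else
    echo "split_lock_detect=off is not set"
fi

# On a running node, check the live value instead:
#   cmdline_has "$(cat /proc/cmdline)" 'split_lock_detect=off'
```

For the logged command line this reports that the parameter is not set, which would be consistent with the split lock line still being printed at boot.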
 
During the time when pve5 was frozen, all other nodes were functioning normally. Initially, I thought it was a network issue. I used iDRAC to interact with the server directly, and everything seemed completely unresponsive, with no feedback at all.

Based on the logs and monitoring on my end, the issue occurred from approximately 16:44:01 to 16:53:05, after which I had to reboot using iDRAC.

These messages appeared after the issue was resolved, so are they related?
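The start of such a window can be read from the journal as the last entry written before the "-- Boot" marker. A small sketch, using lines copied from the pve5 excerpt above as sample input; the function name last_before_boot is hypothetical.

```shell
# Print the last journal line seen before each "-- Boot <id> --" marker.
last_before_boot() {
    awk '/^-- Boot / { if (prev != "") print prev } { prev = $0 }'
}

# Sample lines copied from the pve5 excerpt in this thread.
sample='Dec 04 16:43:37 pve5 pmxcfs[2254]: [status] notice: received log
Dec 04 16:44:01 pve5 snmpd[2160]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
-- Boot 9d32a0e4a1334081aedb9271937a456b --
Dec 04 16:53:05 pve5 kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-6.8.4-2-pve root=/dev/mapper/pve-root ro quiet'

printf '%s\n' "$sample" | last_before_boot

# On the node itself:
#   journalctl --since "2024-12-04 16:30:00" --until "2024-12-04 17:00:00" | last_before_boot
#   journalctl --list-boots     # one line per recorded boot
#   journalctl -b -1 -e         # jump to the end of the previous boot
```

If the previous boot ends on an ordinary periodic message like this one, with no shutdown messages from systemd before the marker, the node most likely did not go down cleanly. That is a heuristic, not proof of a kernel crash.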
 
I'm having trouble understanding the situation from what you write; maybe someone else who understands it can help you better.
In a cluster, if there are network issues and/or not enough nodes are up and reachable for quorum, the services stop and the node remains read-only, so you have to be very careful to avoid that.
In both of your first two posts, the logs show the quorum error.
Judging by the timestamps, the quorum messages were logged after the reboot, but if you rebooted the nodes one at a time, the situation was likely the same even before.
It also needs to be established whether the system was really hung: the logs you posted run up to shortly before the reboot, so the system itself seems to have been working (with services stopped by the missing quorum) when you rebooted. Whether you rebooted a system that had not actually crashed is not clear from what you write.
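As a rough illustration of the quorum rule for the two clusters described in this thread, assuming the default of one vote per node and no QDevice (neither is confirmed here), a majority of floor(n/2) + 1 votes is needed:

```shell
# Votes needed for quorum with one vote per node: floor(n / 2) + 1.
quorum_votes() {
    echo $(( $1 / 2 + 1 ))
}

for nodes in 6 3; do
    need=$(quorum_votes "$nodes")
    echo "$nodes nodes: quorum needs $need votes, tolerates $(( nodes - need )) node(s) down"
done

# On a cluster node, the live values are shown by:
#   pvecm status
```

Under these assumptions the 6-node cluster keeps quorum with two nodes down and the 3-node cluster with one node down, so a single hung node should not by itself cost the remaining nodes their quorum.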

EDIT:
It would also be useful to know whether you recently updated Proxmox, mainly the kernel version. If this is a regression in a newer kernel, it may help to try an older kernel.
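A sketch of how the kernel information could be collected. The Proxmox-specific commands are guarded so the snippet does nothing harmful on other systems, and the version in the pin example is a placeholder, not a recommendation.

```shell
# Kernel currently running (the boot logs in this thread show 6.8.4-2-pve).
running_kernel="$(uname -r)"
echo "running kernel: $running_kernel"

# On a Proxmox VE node, show the kernel-related package versions.
if command -v pveversion >/dev/null 2>&1; then
    pveversion -v | grep -i kernel
fi

# List the kernels that are installed and selectable at boot.
if command -v proxmox-boot-tool >/dev/null 2>&1; then
    proxmox-boot-tool kernel list
fi

# To boot an older installed kernel by default (assumes Proxmox VE 7.2 or
# later; <older-version> is a placeholder taken from the list above):
#   proxmox-boot-tool kernel pin <older-version>
```

Comparing this output across the nodes that crashed and the ones that did not may show whether the crashes line up with a particular kernel version.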
 
This morning, my pve9 node crashed at 2:34 a.m. I found out at 8:13 a.m. and rebooted it. Here are the logs.

Dec 20 02:16:56 pve9 snmpd[2141]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
Dec 20 02:17:01 pve9 CRON[1028625]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Dec 20 02:17:01 pve9 CRON[1028626]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Dec 20 02:17:01 pve9 CRON[1028625]: pam_unix(cron:session): session closed for user root
Dec 20 02:17:56 pve9 snmpd[2141]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
Dec 20 02:18:56 pve9 snmpd[2141]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
Dec 20 02:19:56 pve9 snmpd[2141]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
Dec 20 02:20:56 pve9 snmpd[2141]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
Dec 20 02:21:56 pve9 snmpd[2141]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
Dec 20 02:22:56 pve9 snmpd[2141]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
Dec 20 02:23:56 pve9 snmpd[2141]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
Dec 20 02:24:56 pve9 snmpd[2141]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
-- Boot 1f72b38e5e3141a9b332afa24103117b --
Dec 20 08:13:46 pve9 kernel: Linux version 6.8.4-2-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1>
Dec 20 08:13:46 pve9 kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-6.8.4-2-pve root=/dev/mapper/pve-root ro quiet
Dec 20 08:13:46 pve9 kernel: KERNEL supported cpus:
Dec 20 08:13:46 pve9 kernel: Intel GenuineIntel
Dec 20 08:13:46 pve9 kernel: AMD AuthenticAMD
Dec 20 08:13:46 pve9 kernel: Hygon HygonGenuine
Dec 20 08:13:46 pve9 kernel: Centaur CentaurHauls
Dec 20 08:13:46 pve9 kernel: zhaoxin Shanghai
Dec 20 08:13:46 pve9 kernel: x86/tme: not enabled by BIOS
 
