Proxmox node random reboot

enrgienhi

New Member
Jun 7, 2024
Hi,

We've had a problem today with one of our Proxmox nodes. The node seems to have crashed, but we can't find anything useful to explain what happened.

Just before the "crash", the Proxmox server was running two Windows Server 2019 VMs and one Windows Server 2022 VM, all using the SPICE graphics driver and the SPICE guest tools. At the same time, one of my colleagues was copying a VM to the node via scp.

Here is the syslog around the time of the crash. I think the file transfer was consuming a lot of CPU and might have caused the node to fall behind, but I'm not entirely sure:
Oct 24 11:55:23 NODENAME sshd[2250189]: Received disconnect from IP_CLIENT port 47818:11: disconnected by user
Oct 24 11:55:24 NODENAME sshd[2250189]: Disconnected from user root IP_CLIENT port 47818
Oct 24 11:55:25 NODENAME systemd[1]: session-1453.scope: Deactivated successfully.
Oct 24 11:55:25 NODENAME sshd[2250189]: pam_unix(sshd:session): session closed for user root
Oct 24 11:55:25 NODENAME systemd[1]: session-1453.scope: Consumed 47.979s CPU time.
Oct 24 11:55:25 NODENAME systemd-logind[1279]: Session 1453 logged out. Waiting for processes to exit.
Oct 24 11:55:25 NODENAME systemd-logind[1279]: Removed session 1453.
Oct 24 11:55:40 NODENAME ceph-mon[2200]: 2024-10-24T11:55:40.565+0200 7d345c6006c0 -1 mon.NODENAME@5(peon).paxos(paxos updating c 20955740..20956304) lease_expire from mon.0 v2:OTHERNODEIP:3300/0 is 2.859154224s seconds in the past; mons are probably laggy (or possibly clocks are too skewed)
Oct 24 11:55:52 NODENAME sshd[2248054]: Received disconnect from CLIENT_IP_2 port 56736:11: disconnected by user
Oct 24 11:55:54 NODENAME systemd-logind[1279]: Session 1449 logged out. Waiting for processes to exit.
Oct 24 11:55:55 NODENAME sshd[2248054]: Disconnected from user root CLIENT_IP_2 port 56736
Oct 24 11:55:55 NODENAME ceph-mon[2200]: 2024-10-24T11:55:53.923+0200 7d345c6006c0 -1 mon.NODENAME@5(peon).paxos(paxos updating c 20955740..20956306) lease_expire from mon.0 v2:OTHERNODEIP:3300/0 is 1.835228562s seconds in the past; mons are probably laggy (or possibly clocks are too skewed)
Oct 24 11:55:55 NODENAME systemd[1]: session-1449.scope: Deactivated successfully.
Oct 24 11:55:55 NODENAME sshd[2248054]: pam_unix(sshd:session): session closed for user root
Oct 24 11:55:55 NODENAME systemd-logind[1279]: Removed session 1449.
Oct 24 11:56:02 NODENAME ceph-mon[2200]: 2024-10-24T11:56:02.460+0200 7d345c6006c0 -1 mon.NODENAME@5(peon).paxos(paxos updating c 20955740..20956308) lease_expire from mon.0 v2:OTHERNODEIP:3300/0 is 0.109672904s seconds in the past; mons are probably laggy (or possibly clocks are too skewed)
Oct 24 11:56:03 NODENAME watchdog-mux[1282]: client watchdog expired - disable watchdog updates
-- Reboot --
Oct 24 11:58:43 NODENAME kernel: Linux version 6.8.8-4-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.8.8-4 (2024-07-26T11:15Z) ()
Oct 24 11:58:43 NODENAME kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-6.8.8-4-pve root=/dev/mapper/pve-root ro quiet intel_iommu=off
Oct 24 11:58:43 NODENAME kernel: KERNEL supported cpus:
Oct 24 11:58:43 NODENAME kernel: Intel GenuineIntel
Oct 24 11:58:43 NODENAME kernel: AMD AuthenticAMD
Oct 24 11:58:43 NODENAME kernel: Hygon HygonGenuine
Oct 24 11:58:43 NODENAME kernel: Centaur CentaurHauls
Oct 24 11:58:43 NODENAME kernel: zhaoxin Shanghai
Oct 24 11:58:43 NODENAME kernel: BIOS-provided physical RAM map:
Oct 24 11:58:43 NODENAME kernel: BIOS-e820: [mem 0x0000000000000000-0x00000000000987ff] usable
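
Looking at that last line before the reboot, watchdog-mux reports the client watchdog expiring, so I suspect the node fenced itself. If it helps, this is roughly what I would plan to run next to check the cluster state and the previous boot's logs (standard commands, nothing we have run yet):

pvecm status                # cluster membership and quorum
corosync-cfgtool -s         # per-link status as corosync sees it
ha-manager status           # HA manager / resource state
journalctl -b -1 -u corosync -u pve-ha-lrm -u pve-ha-crm -u watchdog-mux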

Does anybody have any idea how we could debug this issue so it won't happen again?

Thanks a lot
 
I think the file transfer was consuming a lot of CPU
Not only CPU but also bandwidth - which possibly increases the latency on the network. Corosync is really sensitive in this regard.

How many corosync rings are configured on how many independent physical wires? (VLANs are not independent in this sense.) It is recommended to have a separate NIC for this with a fallback onto one of the other networks.
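
Just as a rough illustration (the node name, addresses and priorities below are made up for the example), a two-link setup in /etc/pve/corosync.conf looks something like this; with link_mode set to passive, the link with the higher knet_link_priority is used and the other one acts as the fallback:

nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    # link 0: dedicated corosync NIC
    ring0_addr: 10.10.10.1
    # link 1: fallback over one of the other networks
    ring1_addr: 10.20.20.1
  }
  # ... one entry per cluster node ...
}

totem {
  cluster_name: examplecluster
  config_version: 5
  interface {
    linknumber: 0
    knet_link_priority: 20
  }
  interface {
    linknumber: 1
    knet_link_priority: 10
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}

Keep in mind that /etc/pve/corosync.conf should be edited carefully (work on a copy, bump config_version, then put it back in place) so every node picks up the same configuration.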

Best regards
 
Thanks to both of you.

It's a lot clearer now what happened and what the cause might have been.

From what I know, we have two rings, each on its own NIC. However, the "management" interface is also shared on those NICs, so the file transfer could have interfered with the Corosync traffic and caused the node to lose connectivity with the cluster.

I'll talk with my colleagues about rethinking the network architecture to:
1) Stop doing file transfers over the management interface
2) Move the management interface to another independent NIC
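
In the meantime we'll probably also cap the bandwidth of big copies so they can't saturate the links Corosync currently shares, e.g. something like this (the paths and the limit are just examples; scp's -l takes Kbit/s, so 100000 is roughly 100 Mbit/s):

scp -l 100000 /path/to/vm-disk.qcow2 root@NODENAME:/var/lib/vz/images/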

Have a good day
 