Hi,
We've had a problem today with one of our Proxmox nodes. The node seems to have crashed, but we can't find anything useful to explain what happened.
Just before the "crash", the Proxmox server was running two Windows Server 2019 VMs and one Windows Server 2022 VM, all using the SPICE graphics driver and the SPICE guest tools. At the same time, one of my colleagues was copying a VM to the node over scp.
Here is the syslog from around the time of the crash. I think the file transfer was consuming a lot of CPU and might have caused the node to fall behind, but I'm not entirely sure:
Oct 24 11:55:23 NODENAME sshd[2250189]: Received disconnect from IP_CLIENT port 47818:11: disconnected by user
Oct 24 11:55:24 NODENAME sshd[2250189]: Disconnected from user root IP_CLIENT port 47818
Oct 24 11:55:25 NODENAME systemd[1]: session-1453.scope: Deactivated successfully.
Oct 24 11:55:25 NODENAME sshd[2250189]: pam_unix(sshd:session): session closed for user root
Oct 24 11:55:25 NODENAME systemd[1]: session-1453.scope: Consumed 47.979s CPU time.
Oct 24 11:55:25 NODENAME systemd-logind[1279]: Session 1453 logged out. Waiting for processes to exit.
Oct 24 11:55:25 NODENAME systemd-logind[1279]: Removed session 1453.
Oct 24 11:55:40 NODENAME ceph-mon[2200]: 2024-10-24T11:55:40.565+0200 7d345c6006c0 -1 mon.NODENAME@5(peon).paxos(paxos updating c 20955740..20956304) lease_expire from mon.0 v2:OTHERNODEIP:3300/0 is 2.859154224s seconds in the past; mons are probably laggy (or possibly clocks are too skewed)
Oct 24 11:55:52 NODENAME sshd[2248054]: Received disconnect from CLIENT_IP_2 port 56736:11: disconnected by user
Oct 24 11:55:54 NODENAME systemd-logind[1279]: Session 1449 logged out. Waiting for processes to exit.
Oct 24 11:55:55 NODENAME sshd[2248054]: Disconnected from user root CLIENT_IP_2 port 56736
Oct 24 11:55:55 NODENAME ceph-mon[2200]: 2024-10-24T11:55:53.923+0200 7d345c6006c0 -1 mon.NODENAME@5(peon).paxos(paxos updating c 20955740..20956306) lease_expire from mon.0 v2:OTHERNODEIP:3300/0 is 1.835228562s seconds in the past; mons are probably laggy (or possibly clocks are too skewed)
Oct 24 11:55:55 NODENAME systemd[1]: session-1449.scope: Deactivated successfully.
Oct 24 11:55:55 NODENAME sshd[2248054]: pam_unix(sshd:session): session closed for user root
Oct 24 11:55:55 NODENAME systemd-logind[1279]: Removed session 1449.
Oct 24 11:56:02 NODENAME ceph-mon[2200]: 2024-10-24T11:56:02.460+0200 7d345c6006c0 -1 mon.NODENAME@5(peon).paxos(paxos updating c 20955740..20956308) lease_expire from mon.0 v2:OTHERNODEIP:3300/0 is 0.109672904s seconds in the past; mons are probably laggy (or possibly clocks are too skewed)
Oct 24 11:56:03 NODENAME watchdog-mux[1282]: client watchdog expired - disable watchdog updates
-- Reboot --
Oct 24 11:58:43 NODENAME kernel: Linux version 6.8.8-4-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.8.8-4 (2024-07-26T11:15Z) ()
Oct 24 11:58:43 NODENAME kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-6.8.8-4-pve root=/dev/mapper/pve-root ro quiet intel_iommu=off
Oct 24 11:58:43 NODENAME kernel: KERNEL supported cpus:
Oct 24 11:58:43 NODENAME kernel: Intel GenuineIntel
Oct 24 11:58:43 NODENAME kernel: AMD AuthenticAMD
Oct 24 11:58:43 NODENAME kernel: Hygon HygonGenuine
Oct 24 11:58:43 NODENAME kernel: Centaur CentaurHauls
Oct 24 11:58:43 NODENAME kernel: zhaoxin Shanghai
Oct 24 11:58:43 NODENAME kernel: BIOS-provided physical RAM map:
Oct 24 11:58:43 NODENAME kernel: BIOS-e820: [mem 0x0000000000000000-0x00000000000987ff] usable
Does anybody have any idea on how we could debug this issue so it won't happen again?
Thanks a lot
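PS: in case it helps anyone look at the same kind of log, here is a minimal Python sketch of how the lines just before a reboot could be pulled out and filtered. It assumes the log is plain text with one entry per line and a "-- Reboot --" marker between boots, as in the excerpt above. The function names are made up for illustration, and the two patterns it looks for (the ceph-mon lease_expire message and anything mentioning "watchdog") are just the ones that show up in this excerpt, not a complete list of things worth checking.

```python
import re
from collections import deque

REBOOT_MARKER = "-- Reboot --"
# Matches the ceph-mon message from the excerpt, e.g.
# "... lease_expire from mon.0 ... is 2.859154224s seconds in the past; ..."
LEASE_RE = re.compile(r"lease_expire .* is ([0-9.]+)s seconds in the past")


def context_before_reboots(lines, context=20):
    """Return, for each reboot marker, the last `context` lines logged before it."""
    recent = deque(maxlen=context)
    windows = []
    for line in lines:
        line = line.rstrip("\n")
        if line.strip() == REBOOT_MARKER:
            windows.append(list(recent))
            recent.clear()
        else:
            recent.append(line)
    return windows


def lease_lag_seconds(line):
    """Return the lease lag reported by ceph-mon in seconds, or None if absent."""
    match = LEASE_RE.search(line)
    return float(match.group(1)) if match else None


def summarize(window):
    """Keep only the lines of one pre-reboot window that deserve a closer look."""
    findings = []
    for line in window:
        lag = lease_lag_seconds(line)
        if lag is not None:
            # The first 15 characters are the syslog timestamp ("Oct 24 11:55:40").
            findings.append(f"ceph-mon lease lag {lag:.3f}s at {line[:15]}")
        elif "watchdog" in line:
            findings.append(f"watchdog: {line}")
    return findings


# Three lines taken from the excerpt above (ceph-mon line shortened).
sample = [
    "Oct 24 11:56:02 NODENAME ceph-mon[2200]: lease_expire from mon.0 "
    "v2:OTHERNODEIP:3300/0 is 0.109672904s seconds in the past; mons are probably laggy",
    "Oct 24 11:56:03 NODENAME watchdog-mux[1282]: client watchdog expired - disable watchdog updates",
    "-- Reboot --",
]
for window in context_before_reboots(sample):
    for finding in summarize(window):
        print(finding)
```

To run it on a real log, the `sample` list could be replaced by an open file handle, since `context_before_reboots` accepts any iterable of lines.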