Hello,
# pveversion -v
proxmox-ve: 5.3-1 (running kernel: 4.15.18-9-pve)
pve-manager: 5.3-5 (running version: 5.3-5/97ae681d)
pve-kernel-4.15: 5.2-12
pve-kernel-4.15.18-9-pve: 4.15.18-30
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-3
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-43
libpve-guest-common-perl: 2.0-18
libpve-http-server-perl: 2.0-11
libpve-storage-perl: 5.0-33
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.0.2+pve1-5
lxcfs: 3.0.2-2
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-22
pve-cluster: 5.0-31
pve-container: 2.0-31
pve-docs: 5.3-1
pve-edk2-firmware: 1.20181023-1
pve-firewall: 3.0-16
pve-firmware: 2.0-6
pve-ha-manager: 2.0-5
pve-i18n: 1.0-9
pve-libspice-server1: 0.14.1-1
pve-qemu-kvm: 2.12.1-1
pve-xtermjs: 1.0-5
qemu-server: 5.0-43
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
An incident occurs from time to time on a Proxmox node. For several hours, we see the following messages in daemon.log:
Code:
Dec 3 19:06:18 ns3058 systemd[1]: Started Proxmox VE replication runner.
Dec 3 19:07:02 ns3058 systemd[1]: Starting Proxmox VE replication runner...
Dec 3 19:07:26 ns3058 pmxcfs[1840]: [status] notice: received log
Dec 3 19:07:27 ns3058 pve-firewall[2026]: firewall update time (9.154 seconds)
Dec 3 19:09:15 ns3058 pve-firewall[2026]: firewall update time (5.141 seconds)
Dec 3 19:09:35 ns3058 smartd[1071]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 240 to 230
Dec 3 19:10:02 ns3058 pve-firewall[2026]: firewall update time (6.020 seconds)
Dec 3 19:10:17 ns3058 pvestatd[2027]: status update time (239.456 seconds)
Dec 3 19:11:05 ns3058 systemd[1]: Started Proxmox VE replication runner.
Dec 3 19:11:10 ns3058 systemd[1]: Starting Proxmox VE replication runner...
Dec 3 19:12:05 ns3058 pve-firewall[2026]: firewall update time (5.250 seconds)
Dec 3 19:12:44 ns3058 pvestatd[2027]: got timeout
Dec 3 19:13:23 ns3058 pve-firewall[2026]: firewall update time (5.817 seconds)
Dec 3 19:14:22 ns3058 pve-firewall[2026]: firewall update time (5.133 seconds)
Dec 3 19:15:14 ns3058 pvestatd[2027]: got timeout
Dec 3 19:15:33 ns3058 sssd[be[coffreo.camp]]: Shutting down
Dec 3 19:15:45 ns3058 pvestatd[2027]: got timeout
Dec 3 19:15:55 ns3058 sssd[be[coffreo.camp]]: Starting up
Dec 3 19:16:06 ns3058 pve-firewall[2026]: firewall update time (7.520 seconds)
Dec 3 19:16:47 ns3058 pvestatd[2027]: got timeout
Dec 3 19:17:22 ns3058 pve-firewall[2026]: firewall update time (5.048 seconds)
Dec 3 19:17:37 ns3058 pvestatd[2027]: got timeout
Dec 3 19:18:20 ns3058 pvestatd[2027]: got timeout
Dec 3 19:18:37 ns3058 pve-firewall[2026]: firewall update time (7.670 seconds)
Dec 3 19:18:48 ns3058 systemd[1]: Started Proxmox VE replication runner.
Dec 3 19:19:02 ns3058 systemd[1]: Starting Proxmox VE replication runner...
Dec 3 19:19:32 ns3058 pve-firewall[2026]: firewall update time (5.560 seconds)
Dec 3 19:20:08 ns3058 pve-firewall[2026]: firewall update time (30.909 seconds)
Dec 3 19:20:10 ns3058 pve-ha-lrm[2117]: loop take too long (37 seconds)
...
Dec 3 22:00:02 ns3058 pve-ha-crm[2076]: loop take too long (35 seconds)
Dec 3 22:00:59 ns3058 pve-ha-crm[2076]: loop take too long (37 seconds)
Dec 3 22:01:25 ns3058 corosync[1985]: info [MAIN ] Q empty, queued:0 sent:1562.
Dec 3 22:01:32 ns3058 corosync[1985]: [MAIN ] Q empty, queued:0 sent:1562.
Dec 3 22:01:33 ns3058 pve-ha-lrm[2117]: loop take too long (70 seconds)
Dec 3 22:01:37 ns3058 pvestatd[2027]: got timeout
Dec 3 22:01:54 ns3058 pve-ha-crm[2076]: loop take too long (41 seconds)
Dec 3 22:02:45 ns3058 pve-ha-crm[2076]: loop take too long (32 seconds)
Dec 3 22:03:09 ns3058 pve-ha-lrm[2117]: loop take too long (82 seconds)
Dec 3 22:03:25 ns3058 pve-firewall[2026]: firewall update time (228.760 seconds)
Dec 3 22:03:45 ns3058 pvestatd[2027]: got timeout
Dec 3 22:03:53 ns3058 sssd[be[coffreo.camp]]: Shutting down
Dec 3 22:04:34 ns3058 pve-ha-crm[2076]: loop take too long (58 seconds)
Dec 3 22:05:15 ns3058 sssd[be[coffreo.camp]]: Starting up
Dec 3 22:05:31 ns3058 pve-ha-lrm[2117]: loop take too long (123 seconds)
Dec 3 22:05:42 ns3058 pmxcfs[1840]: [dcdb] notice: data verification successful
Dec 3 22:06:03 ns3058 pvestatd[2027]: got timeout
When these messages start, the indicators in the Proxmox web interface stop updating. Over the last 24 hours, the server load and IO delay increased rapidly, while CPU usage and memory remained stable. After 2 hours, the node reboots and everything is OK again.
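Since the web interface stops reporting during the incident, one way to keep an eye on IO wait from the shell is to sample the aggregate `cpu` line of `/proc/stat` twice and compare. This is only a minimal sketch: the function names `cpu_times` and `iowait_percent` are made up for illustration, and it assumes the IO delay graph roughly tracks the kernel's iowait counter, which we have not verified.

```python
def cpu_times(stat_line):
    """Parse the aggregate 'cpu' line of /proc/stat into a list of ints."""
    fields = stat_line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(value) for value in fields[1:]]


def iowait_percent(before, after):
    """Percentage of CPU time spent in iowait between two samples."""
    first, second = cpu_times(before), cpu_times(after)
    deltas = [new - old for old, new in zip(first, second)]
    # Field order: user nice system idle iowait irq softirq steal guest guest_nice.
    # guest and guest_nice are already counted in user and nice, so sum only 8 fields.
    total = sum(deltas[:8])
    return 100.0 * deltas[4] / total if total else 0.0


def read_cpu_line(path="/proc/stat"):
    """Return the first line of /proc/stat (the aggregate 'cpu' line)."""
    with open(path) as stat_file:
        return stat_file.readline()
```

Calling `read_cpu_line()` twice a few seconds apart and passing both results to `iowait_percent` gives the iowait share for that interval, independently of pvestatd.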
We are investigating on our side. Do the logs point to anything at the Proxmox level?
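To make the excerpt easier to compare between incidents, the timing messages can be tallied per daemon. The following is a rough sketch, not Proxmox tooling: the regular expression only covers the three message formats visible in the log above, and the name `summarize_delays` is hypothetical.

```python
import re
from collections import defaultdict

# Matches the timing messages seen in the excerpt, for example:
#   pve-firewall[2026]: firewall update time (9.154 seconds)
#   pvestatd[2027]: status update time (239.456 seconds)
#   pve-ha-lrm[2117]: loop take too long (37 seconds)
TIMING_RE = re.compile(
    r"(?P<daemon>[\w-]+)\[\d+\]: "
    r"(?:firewall update time|status update time|loop take too long) "
    r"\((?P<secs>\d+(?:\.\d+)?) seconds\)"
)


def summarize_delays(lines):
    """Return {daemon: (message_count, max_seconds)} for matching lines."""
    stats = defaultdict(lambda: [0, 0.0])
    for line in lines:
        match = TIMING_RE.search(line)
        if match:
            entry = stats[match.group("daemon")]
            entry[0] += 1
            entry[1] = max(entry[1], float(match.group("secs")))
    return {daemon: (count, worst) for daemon, (count, worst) in stats.items()}


# A few lines copied from the excerpt above.
SAMPLE = [
    "Dec 3 19:07:27 ns3058 pve-firewall[2026]: firewall update time (9.154 seconds)",
    "Dec 3 19:10:17 ns3058 pvestatd[2027]: status update time (239.456 seconds)",
    "Dec 3 19:12:44 ns3058 pvestatd[2027]: got timeout",
    "Dec 3 22:03:25 ns3058 pve-firewall[2026]: firewall update time (228.760 seconds)",
    "Dec 3 22:05:31 ns3058 pve-ha-lrm[2117]: loop take too long (123 seconds)",
]

for daemon, (count, worst) in sorted(summarize_delays(SAMPLE).items()):
    print(f"{daemon}: {count} message(s), worst {worst:.3f} s")
```

Lines such as `got timeout` or the sssd restarts carry no duration and are skipped, so they would need to be counted separately.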
Thanks for your insights.