Node instability

lc63

New Member
Jun 8, 2018
Hello,

# pveversion -v
proxmox-ve: 5.3-1 (running kernel: 4.15.18-9-pve)
pve-manager: 5.3-5 (running version: 5.3-5/97ae681d)
pve-kernel-4.15: 5.2-12
pve-kernel-4.15.18-9-pve: 4.15.18-30
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-3
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-43
libpve-guest-common-perl: 2.0-18
libpve-http-server-perl: 2.0-11
libpve-storage-perl: 5.0-33
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.0.2+pve1-5
lxcfs: 3.0.2-2
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-22
pve-cluster: 5.0-31
pve-container: 2.0-31
pve-docs: 5.3-1
pve-edk2-firmware: 1.20181023-1
pve-firewall: 3.0-16
pve-firmware: 2.0-6
pve-ha-manager: 2.0-5
pve-i18n: 1.0-9
pve-libspice-server1: 0.14.1-1
pve-qemu-kvm: 2.12.1-1
pve-xtermjs: 1.0-5
qemu-server: 5.0-43
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3

An incident occurs from time to time on one of our Proxmox nodes. For several hours, daemon.log shows messages like the following:

Code:
Dec  3 19:06:18 ns3058 systemd[1]: Started Proxmox VE replication runner.
Dec  3 19:07:02 ns3058 systemd[1]: Starting Proxmox VE replication runner...
Dec  3 19:07:26 ns3058 pmxcfs[1840]: [status] notice: received log
Dec  3 19:07:27 ns3058 pve-firewall[2026]: firewall update time (9.154 seconds)
Dec  3 19:09:15 ns3058 pve-firewall[2026]: firewall update time (5.141 seconds)
Dec  3 19:09:35 ns3058 smartd[1071]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 240 to 230
Dec  3 19:10:02 ns3058 pve-firewall[2026]: firewall update time (6.020 seconds)
Dec  3 19:10:17 ns3058 pvestatd[2027]: status update time (239.456 seconds)
Dec  3 19:11:05 ns3058 systemd[1]: Started Proxmox VE replication runner.
Dec  3 19:11:10 ns3058 systemd[1]: Starting Proxmox VE replication runner...
Dec  3 19:12:05 ns3058 pve-firewall[2026]: firewall update time (5.250 seconds)
Dec  3 19:12:44 ns3058 pvestatd[2027]: got timeout
Dec  3 19:13:23 ns3058 pve-firewall[2026]: firewall update time (5.817 seconds)
Dec  3 19:14:22 ns3058 pve-firewall[2026]: firewall update time (5.133 seconds)
Dec  3 19:15:14 ns3058 pvestatd[2027]: got timeout
Dec  3 19:15:33 ns3058 sssd[be[coffreo.camp]]: Shutting down
Dec  3 19:15:45 ns3058 pvestatd[2027]: got timeout
Dec  3 19:15:55 ns3058 sssd[be[coffreo.camp]]: Starting up
Dec  3 19:16:06 ns3058 pve-firewall[2026]: firewall update time (7.520 seconds)
Dec  3 19:16:47 ns3058 pvestatd[2027]: got timeout
Dec  3 19:17:22 ns3058 pve-firewall[2026]: firewall update time (5.048 seconds)
Dec  3 19:17:37 ns3058 pvestatd[2027]: got timeout
Dec  3 19:18:20 ns3058 pvestatd[2027]: got timeout
Dec  3 19:18:37 ns3058 pve-firewall[2026]: firewall update time (7.670 seconds)
Dec  3 19:18:48 ns3058 systemd[1]: Started Proxmox VE replication runner.
Dec  3 19:19:02 ns3058 systemd[1]: Starting Proxmox VE replication runner...
Dec  3 19:19:32 ns3058 pve-firewall[2026]: firewall update time (5.560 seconds)
Dec  3 19:20:08 ns3058 pve-firewall[2026]: firewall update time (30.909 seconds)
Dec  3 19:20:10 ns3058 pve-ha-lrm[2117]: loop take too long (37 seconds)

...

Dec  3 22:00:02 ns3058 pve-ha-crm[2076]: loop take too long (35 seconds)
Dec  3 22:00:59 ns3058 pve-ha-crm[2076]: loop take too long (37 seconds)
Dec  3 22:01:25 ns3058 corosync[1985]: info    [MAIN  ] Q empty, queued:0 sent:1562.
Dec  3 22:01:32 ns3058 corosync[1985]:  [MAIN  ] Q empty, queued:0 sent:1562.
Dec  3 22:01:33 ns3058 pve-ha-lrm[2117]: loop take too long (70 seconds)
Dec  3 22:01:37 ns3058 pvestatd[2027]: got timeout
Dec  3 22:01:54 ns3058 pve-ha-crm[2076]: loop take too long (41 seconds)
Dec  3 22:02:45 ns3058 pve-ha-crm[2076]: loop take too long (32 seconds)
Dec  3 22:03:09 ns3058 pve-ha-lrm[2117]: loop take too long (82 seconds)
Dec  3 22:03:25 ns3058 pve-firewall[2026]: firewall update time (228.760 seconds)
Dec  3 22:03:45 ns3058 pvestatd[2027]: got timeout
Dec  3 22:03:53 ns3058 sssd[be[coffreo.camp]]: Shutting down
Dec  3 22:04:34 ns3058 pve-ha-crm[2076]: loop take too long (58 seconds)
Dec  3 22:05:15 ns3058 sssd[be[coffreo.camp]]: Starting up
Dec  3 22:05:31 ns3058 pve-ha-lrm[2117]: loop take too long (123 seconds)
Dec  3 22:05:42 ns3058 pmxcfs[1840]: [dcdb] notice: data verification successful
Dec  3 22:06:03 ns3058 pvestatd[2027]: got timeout

When these messages start, the indicators on the Proxmox web interface stop updating. Over the last 24 hours, the server load and IO delay increased rapidly, while CPU usage and memory remained stable. After 2 hours, the node reboots and everything is fine again.
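To get an overview of how long the delays become, the timing messages can be summarized per daemon. Below is a minimal Python sketch, assuming the line formats shown in the excerpt above. The function name `summarize` and both regular expressions are my own illustration and not part of any Proxmox tool.

```python
import re
from collections import defaultdict

# Matches e.g. "pve-firewall[2026]: firewall update time (9.154 seconds)"
# and "pve-ha-lrm[2117]: loop take too long (123 seconds)".
DURATION_RE = re.compile(
    r"\s(?P<daemon>[\w-]+)\[\d+\]: .*\((?P<secs>\d+(?:\.\d+)?) seconds\)"
)
# Matches e.g. "pvestatd[2027]: got timeout".
TIMEOUT_RE = re.compile(r"\s(?P<daemon>[\w-]+)\[\d+\]: got timeout")


def summarize(lines):
    """Return (worst duration per daemon, timeout count per daemon)."""
    worst = defaultdict(float)
    timeouts = defaultdict(int)
    for line in lines:
        m = DURATION_RE.search(line)
        if m:
            daemon = m.group("daemon")
            worst[daemon] = max(worst[daemon], float(m.group("secs")))
            continue
        m = TIMEOUT_RE.search(line)
        if m:
            timeouts[m.group("daemon")] += 1
    return dict(worst), dict(timeouts)


# A few lines copied from the excerpt above.
sample = [
    "Dec  3 19:07:27 ns3058 pve-firewall[2026]: firewall update time (9.154 seconds)",
    "Dec  3 19:10:17 ns3058 pvestatd[2027]: status update time (239.456 seconds)",
    "Dec  3 19:12:44 ns3058 pvestatd[2027]: got timeout",
    "Dec  3 22:03:25 ns3058 pve-firewall[2026]: firewall update time (228.760 seconds)",
    "Dec  3 22:05:31 ns3058 pve-ha-lrm[2117]: loop take too long (123 seconds)",
]
worst, timeouts = summarize(sample)
print(worst)
print(timeouts)
```

To run it on the full log, pass an open file instead of the sample list, for example `summarize(open("/var/log/daemon.log"))`. The path is an assumption based on the usual Debian location.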

We are investigating on our side. Do the logs point to anything at the Proxmox level?
Thanks for your insights.
 