I'm running the latest proxmox version on a 6 node cluster.
Today and yesterday, the cluster degraded and in all the nodes, i see this messages:
I am unable to stop the pve-firewall, at this moment. It just hangs.
The nodes are using 30% of their RAM, less than 2% of the cpu and the I/O load is less than 0.3
LE: Afer stopping corosync, pve-firewall managed to be stopped.
LE2: Stopping corosync on all the nodes and then starting it again seems to have solved the issue (temporary)
Today and yesterday, the cluster degraded and in all the nodes, i see this messages:
Code:
Dec 23 04:39:06 dx411-s06 kernel: [82074.269775] INFO: task pve-firewall:2668 blocked for more than 120 seconds.
Dec 23 04:39:06 dx411-s06 kernel: [82074.269830] Tainted: P O 4.4.35-1-pve #1
Dec 23 04:39:06 dx411-s06 kernel: [82074.269870] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 23 04:39:06 dx411-s06 kernel: [82074.269938] pve-firewall D ffff8818101ffc58 0 2668 1 0x00000000
Dec 23 04:39:06 dx411-s06 kernel: [82074.269944] ffff8818101ffc58 ffffffff810e7237 ffffffff81e12580 ffff88180df7a940
Dec 23 04:39:06 dx411-s06 kernel: [82074.269946] ffff881810200000 ffff8817eb3a03ac ffff88180df7a940 00000000ffffffff
Dec 23 04:39:06 dx411-s06 kernel: [82074.269948] ffff8817eb3a03b0 ffff8818101ffc70 ffffffff81856155 ffff8817eb3a03a8
Dec 23 04:39:06 dx411-s06 kernel: [82074.269951] Call Trace:
Dec 23 04:39:06 dx411-s06 kernel: [82074.269963] [<ffffffff810e7237>] ? call_rcu_sched+0x17/0x20
Dec 23 04:39:06 dx411-s06 kernel: [82074.269968] [<ffffffff81856155>] schedule+0x35/0x80
Dec 23 04:39:06 dx411-s06 kernel: [82074.269970] [<ffffffff8185640e>] schedule_preempt_disabled+0xe/0x10
Dec 23 04:39:06 dx411-s06 kernel: [82074.269972] [<ffffffff81858109>] __mutex_lock_slowpath+0xb9/0x130
Dec 23 04:39:06 dx411-s06 kernel: [82074.269974] [<ffffffff8185819f>] mutex_lock+0x1f/0x30
Dec 23 04:39:06 dx411-s06 kernel: [82074.269990] [<ffffffff8121c814>] path_openat+0x534/0x1530
Dec 23 04:39:06 dx411-s06 kernel: [82074.269994] [<ffffffff811ed49a>] ? __slab_free+0xca/0x2b0
Dec 23 04:39:06 dx411-s06 kernel: [82074.270000] [<ffffffff811bef53>] ? do_wp_page+0x2a3/0x6f0
Dec 23 04:39:06 dx411-s06 kernel: [82074.270003] [<ffffffff8121ea11>] do_filp_open+0x91/0x100
Dec 23 04:39:06 dx411-s06 kernel: [82074.270007] [<ffffffff8122c3f6>] ? __alloc_fd+0x46/0x180
Dec 23 04:39:06 dx411-s06 kernel: [82074.270011] [<ffffffff8120cce9>] do_sys_open+0x139/0x2a0
Dec 23 04:39:06 dx411-s06 kernel: [82074.270015] [<ffffffff8106b4ea>] ? __do_page_fault+0x1ba/0x410
Dec 23 04:39:06 dx411-s06 kernel: [82074.270020] [<ffffffff81003356>] ? exit_to_usermode_loop+0xa6/0xd0
Dec 23 04:39:06 dx411-s06 kernel: [82074.270022] [<ffffffff8120ce6e>] SyS_open+0x1e/0x20
Dec 23 04:39:06 dx411-s06 kernel: [82074.270026] [<ffffffff8185a276>] entry_SYSCALL_64_fastpath+0x16/0x75
Dec 23 04:41:06 dx411-s06 kernel: [82194.260491] INFO: task pve-firewall:2668 blocked for more than 120 seconds.
Dec 23 04:41:06 dx411-s06 kernel: [82194.260542] Tainted: P O 4.4.35-1-pve #1
Code:
root@dx411-s04:~# pveversion -v
proxmox-ve: 4.4-76 (running kernel: 4.4.35-1-pve)
pve-manager: 4.4-1 (running version: 4.4-1/eb2d6f1e)
pve-kernel-4.4.35-1-pve: 4.4.35-76
pve-kernel-4.4.21-1-pve: 4.4.21-71
pve-kernel-4.4.19-1-pve: 4.4.19-66
lvm2: 2.02.116-pve3
corosync-pve: 2.4.0-1
libqb0: 1.0-1
pve-cluster: 4.0-48
qemu-server: 4.0-101
pve-firmware: 1.1-10
libpve-common-perl: 4.0-83
libpve-access-control: 4.0-19
libpve-storage-perl: 4.0-70
pve-libspice-server1: 0.12.8-1
vncterm: 1.2-1
pve-docs: 4.4-1
pve-qemu-kvm: 2.7.0-9
pve-container: 1.0-88
pve-firewall: 2.0-33
pve-ha-manager: 1.0-38
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u2
lxc-pve: 2.0.6-2
lxcfs: 2.0.5-pve1
criu: 1.6.0-1
novnc-pve: 0.5-8
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.8-pve13~bpo80
I am unable to stop the pve-firewall, at this moment. It just hangs.
The nodes are using 30% of their RAM, less than 2% of the cpu and the I/O load is less than 0.3
LE: Afer stopping corosync, pve-firewall managed to be stopped.
LE2: Stopping corosync on all the nodes and then starting it again seems to have solved the issue (temporary)
Last edited: