task pve-firewall: blocked for more than 120 seconds

BelCloud

I'm running the latest Proxmox version on a 6-node cluster.

Today and yesterday the cluster degraded, and on all the nodes I see these messages:

Code:
Dec 23 04:39:06 dx411-s06 kernel: [82074.269775] INFO: task pve-firewall:2668 blocked for more than 120 seconds.
Dec 23 04:39:06 dx411-s06 kernel: [82074.269830]  Tainted: P  O  4.4.35-1-pve #1
Dec 23 04:39:06 dx411-s06 kernel: [82074.269870] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 23 04:39:06 dx411-s06 kernel: [82074.269938] pve-firewall  D ffff8818101ffc58  0  2668  1 0x00000000
Dec 23 04:39:06 dx411-s06 kernel: [82074.269944]  ffff8818101ffc58 ffffffff810e7237 ffffffff81e12580 ffff88180df7a940
Dec 23 04:39:06 dx411-s06 kernel: [82074.269946]  ffff881810200000 ffff8817eb3a03ac ffff88180df7a940 00000000ffffffff
Dec 23 04:39:06 dx411-s06 kernel: [82074.269948]  ffff8817eb3a03b0 ffff8818101ffc70 ffffffff81856155 ffff8817eb3a03a8
Dec 23 04:39:06 dx411-s06 kernel: [82074.269951] Call Trace:
Dec 23 04:39:06 dx411-s06 kernel: [82074.269963]  [<ffffffff810e7237>] ? call_rcu_sched+0x17/0x20
Dec 23 04:39:06 dx411-s06 kernel: [82074.269968]  [<ffffffff81856155>] schedule+0x35/0x80
Dec 23 04:39:06 dx411-s06 kernel: [82074.269970]  [<ffffffff8185640e>] schedule_preempt_disabled+0xe/0x10
Dec 23 04:39:06 dx411-s06 kernel: [82074.269972]  [<ffffffff81858109>] __mutex_lock_slowpath+0xb9/0x130
Dec 23 04:39:06 dx411-s06 kernel: [82074.269974]  [<ffffffff8185819f>] mutex_lock+0x1f/0x30
Dec 23 04:39:06 dx411-s06 kernel: [82074.269990]  [<ffffffff8121c814>] path_openat+0x534/0x1530
Dec 23 04:39:06 dx411-s06 kernel: [82074.269994]  [<ffffffff811ed49a>] ? __slab_free+0xca/0x2b0
Dec 23 04:39:06 dx411-s06 kernel: [82074.270000]  [<ffffffff811bef53>] ? do_wp_page+0x2a3/0x6f0
Dec 23 04:39:06 dx411-s06 kernel: [82074.270003]  [<ffffffff8121ea11>] do_filp_open+0x91/0x100
Dec 23 04:39:06 dx411-s06 kernel: [82074.270007]  [<ffffffff8122c3f6>] ? __alloc_fd+0x46/0x180
Dec 23 04:39:06 dx411-s06 kernel: [82074.270011]  [<ffffffff8120cce9>] do_sys_open+0x139/0x2a0
Dec 23 04:39:06 dx411-s06 kernel: [82074.270015]  [<ffffffff8106b4ea>] ? __do_page_fault+0x1ba/0x410
Dec 23 04:39:06 dx411-s06 kernel: [82074.270020]  [<ffffffff81003356>] ? exit_to_usermode_loop+0xa6/0xd0
Dec 23 04:39:06 dx411-s06 kernel: [82074.270022]  [<ffffffff8120ce6e>] SyS_open+0x1e/0x20
Dec 23 04:39:06 dx411-s06 kernel: [82074.270026]  [<ffffffff8185a276>] entry_SYSCALL_64_fastpath+0x16/0x75
Dec 23 04:41:06 dx411-s06 kernel: [82194.260491] INFO: task pve-firewall:2668 blocked for more than 120 seconds.
Dec 23 04:41:06 dx411-s06 kernel: [82194.260542]  Tainted: P  O  4.4.35-1-pve #1

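The trace shows pve-firewall blocked in path_openat, i.e. while opening a file. On a PVE node that is most likely a path under /etc/pve (the pmxcfs FUSE mount), so once pmxcfs stalls, anything touching /etc/pve ends up in uninterruptible sleep. A minimal sketch of how one could confirm that on an affected node (the 5-second timeout is arbitrary, PID 2668 is taken from the trace above):

Code:
# If pmxcfs is wedged, a plain listing of /etc/pve will hang as well,
# so wrap it in a timeout instead of blocking the shell.
timeout 5 ls -l /etc/pve >/dev/null && echo "/etc/pve responds" || echo "/etc/pve did not respond within 5s"

# Show every task in uninterruptible sleep (state D); these are the
# tasks the hung-task watchdog reports after 120 seconds.
ps -eo pid,state,comm,wchan:32 | awk '$2 == "D"'

# Kernel stack of the blocked pve-firewall worker (PID from the trace).
cat /proc/2668/stack
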
Code:
root@dx411-s04:~# pveversion -v
proxmox-ve: 4.4-76 (running kernel: 4.4.35-1-pve)
pve-manager: 4.4-1 (running version: 4.4-1/eb2d6f1e)
pve-kernel-4.4.35-1-pve: 4.4.35-76
pve-kernel-4.4.21-1-pve: 4.4.21-71
pve-kernel-4.4.19-1-pve: 4.4.19-66
lvm2: 2.02.116-pve3
corosync-pve: 2.4.0-1
libqb0: 1.0-1
pve-cluster: 4.0-48
qemu-server: 4.0-101
pve-firmware: 1.1-10
libpve-common-perl: 4.0-83
libpve-access-control: 4.0-19
libpve-storage-perl: 4.0-70
pve-libspice-server1: 0.12.8-1
vncterm: 1.2-1
pve-docs: 4.4-1
pve-qemu-kvm: 2.7.0-9
pve-container: 1.0-88
pve-firewall: 2.0-33
pve-ha-manager: 1.0-38
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u2
lxc-pve: 2.0.6-2
lxcfs: 2.0.5-pve1
criu: 1.6.0-1
novnc-pve: 0.5-8
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.8-pve13~bpo80

At this moment I am unable to stop pve-firewall; it just hangs.

The nodes are using about 30% of their RAM, less than 2% of the CPU, and the I/O load is below 0.3.

LE: After stopping corosync, pve-firewall could finally be stopped.

LE2: Stopping corosync on all the nodes and then starting it again seems to have solved the issue (temporarily).
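For reference, a rough sketch of how that stop-everywhere / start-everywhere cycle could be scripted from a single host, assuming passwordless root SSH between the nodes; only dx411-s04/-s05/-s06 actually appear in this thread, the other names are placeholders for the remaining cluster members:

Code:
# Assumed node list - adjust to the real cluster.
NODES="dx411-s01 dx411-s02 dx411-s03 dx411-s04 dx411-s05 dx411-s06"

# Stop corosync on every node first...
for n in $NODES; do ssh "root@$n" "systemctl stop corosync"; done

# ...then start it again everywhere and verify quorum came back.
for n in $NODES; do ssh "root@$n" "systemctl start corosync"; done
for n in $NODES; do echo "== $n =="; ssh "root@$n" "pvecm status | grep -i quorate"; done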
 
Currently, in syslog I see only:
Code:
Dec 23 05:05:39 dx411-s04 pmxcfs[2923]: [status] notice: received log
Dec 23 05:05:55 dx411-s04 pmxcfs[2923]: [status] notice: received log
Dec 23 05:06:03 dx411-s04 pmxcfs[2923]: [status] notice: received log
Dec 23 05:06:39 dx411-s04 pmxcfs[2923]: [status] notice: received log
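
Even when syslog only shows these "received log" notices, the quorum and cluster-filesystem state can be checked directly with the standard tools:

Code:
# Cluster membership and quorum as PVE sees it.
pvecm status

# Service state of the cluster filesystem and corosync on this node.
systemctl status pve-cluster corosync

# Per-ring corosync status on this node; a FAULTY ring is a red flag.
corosync-cfgtool -s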
 
Same issue again, this time not related to the firewall; this is what I see in the logs:
Code:
Dec 29 01:24:39 dx411-s05 pmxcfs[2512]: [status] crit: cpg_send_message failed: 2
Dec 29 01:24:39 dx411-s05 pmxcfs[2512]: [status] crit: cpg_send_message failed: 2
Dec 29 01:24:39 dx411-s05 pmxcfs[2512]: [status] crit: cpg_send_message failed: 2
Dec 29 01:24:39 dx411-s05 pmxcfs[2512]: [status] crit: cpg_send_message failed: 2
Dec 29 01:24:39 dx411-s05 pmxcfs[2512]: [status] crit: cpg_send_message failed: 2
Dec 29 01:24:39 dx411-s05 pmxcfs[2512]: [status] crit: cpg_send_message failed: 2
Dec 29 01:24:39 dx411-s05 pmxcfs[2512]: [status] crit: cpg_send_message failed: 2
Dec 29 01:24:39 dx411-s05 pmxcfs[2512]: [status] crit: cpg_send_message failed: 2
Dec 29 01:24:39 dx411-s05 pmxcfs[2512]: [status] crit: cpg_send_message failed: 2
Dec 29 01:24:39 dx411-s05 pmxcfs[2512]: [status] crit: cpg_send_message failed: 2
Dec 29 01:24:39 dx411-s05 pmxcfs[2512]: [status] crit: cpg_send_message failed: 2
Dec 29 01:24:39 dx411-s05 pmxcfs[2512]: [status] crit: cpg_send_message failed: 2
Dec 29 01:24:39 dx411-s05 pmxcfs[2512]: [status] crit: cpg_send_message failed: 2
Dec 29 01:24:39 dx411-s05 pmxcfs[2512]: [status] crit: cpg_send_message failed: 2
Dec 29 01:24:39 dx411-s05 pmxcfs[2512]: [status] crit: cpg_send_message failed: 2
Dec 29 01:24:40 dx411-s05 pmxcfs[2512]: [quorum] crit: quorum_dispatch failed: 2
Dec 29 01:24:40 dx411-s05 pmxcfs[2512]: [status] notice: node lost quorum
Dec 29 01:24:40 dx411-s05 pmxcfs[2512]: [dcdb] crit: cpg_dispatch failed: 2
Dec 29 01:24:40 dx411-s05 pmxcfs[2512]: [dcdb] crit: cpg_leave failed: 2
Dec 29 01:24:40 dx411-s05 pmxcfs[2512]: [confdb] crit: cmap_dispatch failed: 2
Dec 29 01:24:40 dx411-s05 pve-ha-lrm[2652]: unable to write lrm status file - unable to open file '/etc/pve/nodes/dx411-s05/lrm_status.tmp.2652' - Permission denied
Dec 29 01:24:40 dx411-s05 pmxcfs[2512]: [status] notice: update cluster info (cluster name  vps1, version = 9)
Dec 29 01:24:40 dx411-s05 pmxcfs[2512]: [dcdb] notice: start cluster connection
Dec 29 01:24:40 dx411-s05 pmxcfs[2512]: [status] crit: cpg_dispatch failed: 2
Dec 29 01:24:40 dx411-s05 pmxcfs[2512]: [status] crit: cpg_leave failed: 2
Dec 29 01:24:40 dx411-s05 pmxcfs[2512]: [status] notice: node has quorum

Restarting corosync on all nodes fixes the issue for about a day.
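
Since "cpg_send_message failed: 2" usually means pmxcfs lost its connection to the corosync daemon (error 2 maps to CS_ERR_LIBRARY in the corosync API), it may be worth watching the corosync/multicast side before the next occurrence. A sketch with placeholder addresses - 10.0.0.4/10.0.0.5 would have to be replaced with the real cluster-network IPs of two nodes, and omping has to be started on both ends:

Code:
# Quorum and membership as corosync itself reports them.
corosync-quorumtool -s

# Long-running multicast test between two nodes (run on both at once);
# the addresses are placeholders, not taken from this thread.
omping -c 600 -i 1 -q 10.0.0.4 10.0.0.5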