task pve-firewall: blocked for more than 120 seconds

BelCloud

I'm running the latest Proxmox version on a 6-node cluster.

Today and yesterday, the cluster degraded, and on all the nodes I see these messages:

Code:
Dec 23 04:39:06 dx411-s06 kernel: [82074.269775] INFO: task pve-firewall:2668 blocked for more than 120 seconds.
Dec 23 04:39:06 dx411-s06 kernel: [82074.269830]  Tainted: P  O  4.4.35-1-pve #1
Dec 23 04:39:06 dx411-s06 kernel: [82074.269870] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 23 04:39:06 dx411-s06 kernel: [82074.269938] pve-firewall  D ffff8818101ffc58  0  2668  1 0x00000000
Dec 23 04:39:06 dx411-s06 kernel: [82074.269944]  ffff8818101ffc58 ffffffff810e7237 ffffffff81e12580 ffff88180df7a940
Dec 23 04:39:06 dx411-s06 kernel: [82074.269946]  ffff881810200000 ffff8817eb3a03ac ffff88180df7a940 00000000ffffffff
Dec 23 04:39:06 dx411-s06 kernel: [82074.269948]  ffff8817eb3a03b0 ffff8818101ffc70 ffffffff81856155 ffff8817eb3a03a8
Dec 23 04:39:06 dx411-s06 kernel: [82074.269951] Call Trace:
Dec 23 04:39:06 dx411-s06 kernel: [82074.269963]  [<ffffffff810e7237>] ? call_rcu_sched+0x17/0x20
Dec 23 04:39:06 dx411-s06 kernel: [82074.269968]  [<ffffffff81856155>] schedule+0x35/0x80
Dec 23 04:39:06 dx411-s06 kernel: [82074.269970]  [<ffffffff8185640e>] schedule_preempt_disabled+0xe/0x10
Dec 23 04:39:06 dx411-s06 kernel: [82074.269972]  [<ffffffff81858109>] __mutex_lock_slowpath+0xb9/0x130
Dec 23 04:39:06 dx411-s06 kernel: [82074.269974]  [<ffffffff8185819f>] mutex_lock+0x1f/0x30
Dec 23 04:39:06 dx411-s06 kernel: [82074.269990]  [<ffffffff8121c814>] path_openat+0x534/0x1530
Dec 23 04:39:06 dx411-s06 kernel: [82074.269994]  [<ffffffff811ed49a>] ? __slab_free+0xca/0x2b0
Dec 23 04:39:06 dx411-s06 kernel: [82074.270000]  [<ffffffff811bef53>] ? do_wp_page+0x2a3/0x6f0
Dec 23 04:39:06 dx411-s06 kernel: [82074.270003]  [<ffffffff8121ea11>] do_filp_open+0x91/0x100
Dec 23 04:39:06 dx411-s06 kernel: [82074.270007]  [<ffffffff8122c3f6>] ? __alloc_fd+0x46/0x180
Dec 23 04:39:06 dx411-s06 kernel: [82074.270011]  [<ffffffff8120cce9>] do_sys_open+0x139/0x2a0
Dec 23 04:39:06 dx411-s06 kernel: [82074.270015]  [<ffffffff8106b4ea>] ? __do_page_fault+0x1ba/0x410
Dec 23 04:39:06 dx411-s06 kernel: [82074.270020]  [<ffffffff81003356>] ? exit_to_usermode_loop+0xa6/0xd0
Dec 23 04:39:06 dx411-s06 kernel: [82074.270022]  [<ffffffff8120ce6e>] SyS_open+0x1e/0x20
Dec 23 04:39:06 dx411-s06 kernel: [82074.270026]  [<ffffffff8185a276>] entry_SYSCALL_64_fastpath+0x16/0x75
Dec 23 04:41:06 dx411-s06 kernel: [82194.260491] INFO: task pve-firewall:2668 blocked for more than 120 seconds.
Dec 23 04:41:06 dx411-s06 kernel: [82194.260542]  Tainted: P  O  4.4.35-1-pve #1

Code:
root@dx411-s04:~# pveversion -v
proxmox-ve: 4.4-76 (running kernel: 4.4.35-1-pve)
pve-manager: 4.4-1 (running version: 4.4-1/eb2d6f1e)
pve-kernel-4.4.35-1-pve: 4.4.35-76
pve-kernel-4.4.21-1-pve: 4.4.21-71
pve-kernel-4.4.19-1-pve: 4.4.19-66
lvm2: 2.02.116-pve3
corosync-pve: 2.4.0-1
libqb0: 1.0-1
pve-cluster: 4.0-48
qemu-server: 4.0-101
pve-firmware: 1.1-10
libpve-common-perl: 4.0-83
libpve-access-control: 4.0-19
libpve-storage-perl: 4.0-70
pve-libspice-server1: 0.12.8-1
vncterm: 1.2-1
pve-docs: 4.4-1
pve-qemu-kvm: 2.7.0-9
pve-container: 1.0-88
pve-firewall: 2.0-33
pve-ha-manager: 1.0-38
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u2
lxc-pve: 2.0.6-2
lxcfs: 2.0.5-pve1
criu: 1.6.0-1
novnc-pve: 0.5-8
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.8-pve13~bpo80

I am unable to stop pve-firewall at the moment. It just hangs.

The nodes are using about 30% of their RAM, less than 2% of the CPU, and the I/O load is below 0.3.

LE: After stopping corosync, pve-firewall could be stopped.

LE2: Stopping corosync on all the nodes and then starting it again seems to have solved the issue (temporarily).
 
Currently, in syslog I see only:
Code:
Dec 23 05:05:39 dx411-s04 pmxcfs[2923]: [status] notice: received log
Dec 23 05:05:55 dx411-s04 pmxcfs[2923]: [status] notice: received log
Dec 23 05:06:03 dx411-s04 pmxcfs[2923]: [status] notice: received log
Dec 23 05:06:39 dx411-s04 pmxcfs[2923]: [status] notice: received log
 
Same issue, not related to the firewall; this is what I see in the logs:
Code:
Dec 29 01:24:39 dx411-s05 pmxcfs[2512]: [status] crit: cpg_send_message failed: 2
Dec 29 01:24:39 dx411-s05 pmxcfs[2512]: [status] crit: cpg_send_message failed: 2
Dec 29 01:24:39 dx411-s05 pmxcfs[2512]: [status] crit: cpg_send_message failed: 2
Dec 29 01:24:39 dx411-s05 pmxcfs[2512]: [status] crit: cpg_send_message failed: 2
Dec 29 01:24:39 dx411-s05 pmxcfs[2512]: [status] crit: cpg_send_message failed: 2
Dec 29 01:24:39 dx411-s05 pmxcfs[2512]: [status] crit: cpg_send_message failed: 2
Dec 29 01:24:39 dx411-s05 pmxcfs[2512]: [status] crit: cpg_send_message failed: 2
Dec 29 01:24:39 dx411-s05 pmxcfs[2512]: [status] crit: cpg_send_message failed: 2
Dec 29 01:24:39 dx411-s05 pmxcfs[2512]: [status] crit: cpg_send_message failed: 2
Dec 29 01:24:39 dx411-s05 pmxcfs[2512]: [status] crit: cpg_send_message failed: 2
Dec 29 01:24:39 dx411-s05 pmxcfs[2512]: [status] crit: cpg_send_message failed: 2
Dec 29 01:24:39 dx411-s05 pmxcfs[2512]: [status] crit: cpg_send_message failed: 2
Dec 29 01:24:39 dx411-s05 pmxcfs[2512]: [status] crit: cpg_send_message failed: 2
Dec 29 01:24:39 dx411-s05 pmxcfs[2512]: [status] crit: cpg_send_message failed: 2
Dec 29 01:24:39 dx411-s05 pmxcfs[2512]: [status] crit: cpg_send_message failed: 2
Dec 29 01:24:40 dx411-s05 pmxcfs[2512]: [quorum] crit: quorum_dispatch failed: 2
Dec 29 01:24:40 dx411-s05 pmxcfs[2512]: [status] notice: node lost quorum
Dec 29 01:24:40 dx411-s05 pmxcfs[2512]: [dcdb] crit: cpg_dispatch failed: 2
Dec 29 01:24:40 dx411-s05 pmxcfs[2512]: [dcdb] crit: cpg_leave failed: 2
Dec 29 01:24:40 dx411-s05 pmxcfs[2512]: [confdb] crit: cmap_dispatch failed: 2
Dec 29 01:24:40 dx411-s05 pve-ha-lrm[2652]: unable to write lrm status file - unable to open file '/etc/pve/nodes/dx411-s05/lrm_status.tmp.2652' - Permission denied
Dec 29 01:24:40 dx411-s05 pmxcfs[2512]: [status] notice: update cluster info (cluster name  vps1, version = 9)
Dec 29 01:24:40 dx411-s05 pmxcfs[2512]: [dcdb] notice: start cluster connection
Dec 29 01:24:40 dx411-s05 pmxcfs[2512]: [status] crit: cpg_dispatch failed: 2
Dec 29 01:24:40 dx411-s05 pmxcfs[2512]: [status] crit: cpg_leave failed: 2
Dec 29 01:24:40 dx411-s05 pmxcfs[2512]: [status] notice: node has quorum

Restarting corosync on all nodes fixes the issue for a day or so.
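For reference, the workaround described above can be scripted roughly like this. This is only a sketch: the node names are the ones appearing in this thread's logs, passwordless root SSH between nodes is assumed, and the script defaults to a dry run that just prints the commands (set DRY_RUN=0 to actually execute them):

```shell
#!/bin/sh
# Sketch of the workaround: stop corosync on every node first, then
# start it again everywhere. Node names are taken from the logs in
# this thread -- substitute your own cluster members.
NODES="dx411-s04 dx411-s05 dx411-s06"

# Runs a command on a node via SSH; with DRY_RUN unset or =1 it only
# prints what it would do.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "would run on $1: $2"
    else
        ssh "root@$1" "$2"
    fi
}

# Stop corosync everywhere first ...
for node in $NODES; do
    run "$node" "systemctl stop corosync"
done

# ... then bring it back up on every node.
for node in $NODES; do
    run "$node" "systemctl start corosync"
done

# Finally, confirm the cluster regained quorum on one of the nodes.
run "dx411-s04" "pvecm status"
```

Stopping corosync on all nodes before starting any of them matters here: it forces every pmxcfs to rejoin a fresh membership instead of reconnecting to a wedged one.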