CLVMD over DRBD hangs on remote volume access

alitvak69

Oct 2, 2015
I really love Proxmox and plan to build a production cluster after successfully setting up a test cluster a few months ago.

I built a new cluster on v3.4.11 with 2 nodes and a qdisk for quorum. Everything seems to fall into place except clvm on DRBD.
I lost my notes from the test cluster, so I don't remember exactly how I did it then, but the firewall seems to be the issue.

With pve-firewall stopped on both nodes, clvmd starts on both nodes and they associate over SCTP.
However, when I start the firewall and run LVM commands, I get this:

vgs
Error locking on node virt2n3-la: Command timed out
Error locking on node virt2n3-la: Command timed out

Essentially it scans the local node's VGs, but it takes forever.

If I reboot one of the nodes while the other has the firewall up, clvmd fails to connect, the boot hangs indefinitely, and I get this in the logs:

Oct 1 16:03:27 virt2n3-la kernel: [ 379.790005] dlm: Can't start SCTP association - retrying

Then I see related kernel hung-task messages every 120 seconds:

Oct 1 16:01:03 virt2n4-la kernel: [ 240.882245] INFO: task clvmd:3447 blocked for more than 120 seconds.
Oct 1 16:01:03 virt2n4-la kernel: [ 240.882793] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 1 16:01:03 virt2n4-la kernel: [ 240.883299] clvmd D ffff88083fc33640 0 3447 1 0x00000000
Oct 1 16:01:03 virt2n4-la kernel: [ 240.883307] ffff88082236bc48 0000000000000086 ffff880828b85010 ffff88082236bfd8
Oct 1 16:01:03 virt2n4-la kernel: [ 240.883312] ffff88082236bfd8 ffff88082236bfd8 ffff8808296bf260 ffff880828b85010
Oct 1 16:01:03 virt2n4-la kernel: [ 240.883316] ffff88042fcb3640 ffff880035cfe658 ffff880035cfe660 7fffffffffffffff
Oct 1 16:01:03 virt2n4-la kernel: [ 240.883319] Call Trace:
Oct 1 16:01:03 virt2n4-la kernel: [ 240.883327] [<ffffffff8163cd39>] schedule+0x29/0x70
Oct 1 16:01:03 virt2n4-la kernel: [ 240.883332] [<ffffffff8163a0dc>] schedule_timeout+0x22c/0x2c0
Oct 1 16:01:03 virt2n4-la kernel: [ 240.883335] [<ffffffff8163bbc3>] ? __schedule+0x2f3/0x810
Oct 1 16:01:03 virt2n4-la kernel: [ 240.883341] [<ffffffff8109542b>] ? prepare_to_wait+0x5b/0x90
Oct 1 16:01:03 virt2n4-la kernel: [ 240.883344] [<ffffffff8163cb09>] wait_for_completion+0xf9/0x150
Oct 1 16:01:03 virt2n4-la kernel: [ 240.883353] [<ffffffff810a61d0>] ? try_to_wake_up+0x290/0x290
Oct 1 16:01:03 virt2n4-la kernel: [ 240.883366] [<ffffffffa08c7d90>] new_lockspace+0x970/0xa80 [dlm]
Oct 1 16:01:03 virt2n4-la kernel: [ 240.883369] [<ffffffff81095260>] ? wake_up_bit+0x40/0x40
Oct 1 16:01:03 virt2n4-la kernel: [ 240.883376] [<ffffffffa08c8165>] dlm_new_lockspace+0x75/0x180 [dlm]
Oct 1 16:01:03 virt2n4-la kernel: [ 240.883384] [<ffffffffa08d1c6e>] device_write+0x3ae/0x720 [dlm]
Oct 1 16:01:03 virt2n4-la kernel: [ 240.883393] [<ffffffff812740dc>] ? security_file_permission+0x2c/0xb0
Oct 1 16:01:03 virt2n4-la kernel: [ 240.883398] [<ffffffff811c0e65>] vfs_write+0xc5/0x1f0
Oct 1 16:01:03 virt2n4-la kernel: [ 240.883402] [<ffffffff811c1352>] SyS_write+0x52/0xa0
Oct 1 16:01:03 virt2n4-la kernel: [ 240.883408] [<ffffffff81646689>] system_call_fastpath+0x16/0x1b
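
To see at a glance which node shows which of the two symptoms quoted above (the dlm SCTP retry and the hung clvmd task), a small filter along these lines may help. This is only a sketch: the function name summarize_dlm_symptoms is made up for illustration, and it assumes syslog-style lines where the hostname is the fourth field, as in the excerpts above.

```shell
#!/bin/sh
# Hypothetical helper: count DLM/clvmd symptoms per node.
# Reads syslog-style text on stdin; assumes field 4 is the hostname.
summarize_dlm_symptoms() {
    awk '
        # "Can.t" matches "Can'\''t" without needing a quote inside the awk program
        /dlm: Can.t start SCTP association/       { sctp[$4]++; hosts[$4] = 1 }
        /task clvmd:[0-9]+ blocked for more than/ { hung[$4]++; hosts[$4] = 1 }
        END {
            for (h in hosts)
                printf "%s sctp_retries=%d hung_clvmd=%d\n", h, sctp[h], hung[h]
        }
    ' | sort
}
```

It could be fed from whichever file holds the kernel messages on the node, for example `summarize_dlm_symptoms < /var/log/syslog` (the path is an assumption and may differ on your system).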



I tried different firewall settings, and logically they seem correct, but something is wrong. I see no messages in the pve-firewall log pointing to the issue (logging is set to DEBUG).

I would really appreciate any help or guidance, as I am at a loss here. I attached a text file with configs and stats.

View attachment cluster-configs.txt
 
I had help from a great engineer working for our company. Essentially, it took adding iptables rules to rc.local. Note that these rules are inserted at the front of the chains (-I), so they are evaluated before the existing rules. It definitely helped me. Make sure to change the 10.0.20.0/22 and 38.102.250.0/24 subnets to your own networks.

/sbin/iptables -I INPUT -s 38.102.250.0/24 -j ACCEPT
/sbin/iptables -I INPUT -d 38.102.250.0/24 -j ACCEPT
/sbin/iptables -I INPUT -s 10.0.20.0/22 -j ACCEPT
/sbin/iptables -I INPUT -d 10.0.20.0/22 -j ACCEPT
/sbin/iptables -I INPUT -s 224.0.0.0/4 -j ACCEPT
/sbin/iptables -I INPUT -d 224.0.0.0/4 -j ACCEPT
/sbin/iptables -I INPUT -m pkttype --pkt-type multicast -j ACCEPT
/sbin/iptables -I INPUT -m pkttype --pkt-type broadcast -j ACCEPT
/sbin/iptables -I OUTPUT -s 38.102.250.0/24 -j ACCEPT
/sbin/iptables -I OUTPUT -d 38.102.250.0/24 -j ACCEPT
/sbin/iptables -I OUTPUT -s 10.0.20.0/22 -j ACCEPT
/sbin/iptables -I OUTPUT -d 10.0.20.0/22 -j ACCEPT
/sbin/iptables -I OUTPUT -s 224.0.0.0/4 -j ACCEPT
/sbin/iptables -I OUTPUT -d 224.0.0.0/4 -j ACCEPT
/sbin/iptables -I OUTPUT -m pkttype --pkt-type multicast -j ACCEPT
/sbin/iptables -I OUTPUT -m pkttype --pkt-type broadcast -j ACCEPT
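
Since the subnets have to be swapped for your own, it may be easier to generate the list than to edit it by hand. Here is a minimal sketch that prints (does not apply) the same rules for whatever subnets are passed in. The function name print_cluster_rules is hypothetical, and the rule set is simply the one above; it is not a recommendation for a tighter policy.

```shell
#!/bin/sh
# Hypothetical helper: print the accept rules listed above for the given subnets.
# Nothing is applied; the output is plain text that can be reviewed first.
print_cluster_rules() {
    for chain in INPUT OUTPUT; do
        # Each cluster subnet plus the IPv4 multicast range, as source and as destination
        for net in "$@" 224.0.0.0/4; do
            echo "/sbin/iptables -I $chain -s $net -j ACCEPT"
            echo "/sbin/iptables -I $chain -d $net -j ACCEPT"
        done
        # Packet-type matches for multicast and broadcast traffic
        for type in multicast broadcast; do
            echo "/sbin/iptables -I $chain -m pkttype --pkt-type $type -j ACCEPT"
        done
    done
}
```

Calling `print_cluster_rules 38.102.250.0/24 10.0.20.0/22` reproduces the 16 lines above. After checking the output, it can be pasted into rc.local or piped to `sh` as root.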
 
