I really love Proxmox and am planning a production cluster after successfully implementing a test one a few months ago.
I built a new one, v3.4.11, with 2 nodes and a qdisk for quorum, and everything seems to fall into place except clvm on DRBD.
I lost my notes from the test cluster, so I don't remember exactly how I did it then, but the firewall seems to be the issue.
When pve-firewall is stopped on both nodes, clvmd starts on both nodes and they associate using SCTP.
However, when I start the firewall and execute LVM commands, I get this:
vgs
Error locking on node virt2n3-la: Command timed out
Error locking on node virt2n3-la: Command timed out
Essentially it does scan the local node's VGs, but it takes forever.
If I reboot one of the nodes while the other has the firewall up, clvmd fails to connect, the boot hangs indefinitely, and I get this in the logs:
Oct 1 16:03:27 virt2n3-la kernel: [ 379.790005] dlm: Can't start SCTP association - retrying
And then I see related kernel hung-task messages every 120 seconds:
Oct 1 16:01:03 virt2n4-la kernel: [ 240.882245] INFO: task clvmd:3447 blocked for more than 120 seconds.
Oct 1 16:01:03 virt2n4-la kernel: [ 240.882793] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 1 16:01:03 virt2n4-la kernel: [ 240.883299] clvmd D ffff88083fc33640 0 3447 1 0x00000000
Oct 1 16:01:03 virt2n4-la kernel: [ 240.883307] ffff88082236bc48 0000000000000086 ffff880828b85010 ffff88082236bfd8
Oct 1 16:01:03 virt2n4-la kernel: [ 240.883312] ffff88082236bfd8 ffff88082236bfd8 ffff8808296bf260 ffff880828b85010
Oct 1 16:01:03 virt2n4-la kernel: [ 240.883316] ffff88042fcb3640 ffff880035cfe658 ffff880035cfe660 7fffffffffffffff
Oct 1 16:01:03 virt2n4-la kernel: [ 240.883319] Call Trace:
Oct 1 16:01:03 virt2n4-la kernel: [ 240.883327] [<ffffffff8163cd39>] schedule+0x29/0x70
Oct 1 16:01:03 virt2n4-la kernel: [ 240.883332] [<ffffffff8163a0dc>] schedule_timeout+0x22c/0x2c0
Oct 1 16:01:03 virt2n4-la kernel: [ 240.883335] [<ffffffff8163bbc3>] ? __schedule+0x2f3/0x810
Oct 1 16:01:03 virt2n4-la kernel: [ 240.883341] [<ffffffff8109542b>] ? prepare_to_wait+0x5b/0x90
Oct 1 16:01:03 virt2n4-la kernel: [ 240.883344] [<ffffffff8163cb09>] wait_for_completion+0xf9/0x150
Oct 1 16:01:03 virt2n4-la kernel: [ 240.883353] [<ffffffff810a61d0>] ? try_to_wake_up+0x290/0x290
Oct 1 16:01:03 virt2n4-la kernel: [ 240.883366] [<ffffffffa08c7d90>] new_lockspace+0x970/0xa80 [dlm]
Oct 1 16:01:03 virt2n4-la kernel: [ 240.883369] [<ffffffff81095260>] ? wake_up_bit+0x40/0x40
Oct 1 16:01:03 virt2n4-la kernel: [ 240.883376] [<ffffffffa08c8165>] dlm_new_lockspace+0x75/0x180 [dlm]
Oct 1 16:01:03 virt2n4-la kernel: [ 240.883384] [<ffffffffa08d1c6e>] device_write+0x3ae/0x720 [dlm]
Oct 1 16:01:03 virt2n4-la kernel: [ 240.883393] [<ffffffff812740dc>] ? security_file_permission+0x2c/0xb0
Oct 1 16:01:03 virt2n4-la kernel: [ 240.883398] [<ffffffff811c0e65>] vfs_write+0xc5/0x1f0
Oct 1 16:01:03 virt2n4-la kernel: [ 240.883402] [<ffffffff811c1352>] SyS_write+0x52/0xa0
Oct 1 16:01:03 virt2n4-la kernel: [ 240.883408] [<ffffffff81646689>] system_call_fastpath+0x16/0x1b
I tried different firewall settings, and logically they seem correct, but something is still wrong. I see no messages in the pve-firewall log pointing to the issue (logging is set to DEBUG).
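For reference, the kind of rule that would normally be expected to let DLM traffic through looks roughly like the sketch below. This is only an illustration under assumptions, not the configuration from the attachment: it assumes DLM is listening on its default port 21064, that pve-firewall accepts `sctp` as a protocol name and passes it on to iptables, and that 10.10.10.0/24 stands in for the cluster network (a placeholder subnet).

```
# /etc/pve/firewall/cluster.fw (sketch; subnet and port are assumptions)
[RULES]
# DLM over SCTP, as shown in the "Can't start SCTP association" message
IN ACCEPT -source 10.10.10.0/24 -p sctp -dport 21064
# DLM over TCP, in case the lockspace is configured to use TCP instead
IN ACCEPT -source 10.10.10.0/24 -p tcp -dport 21064
```

If rules along these lines are already in place and the association still fails, the rules alone may not be the cause, and it may be worth comparing the generated iptables ruleset with what actually arrives on the wire.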
I would really appreciate any help or guidance, as I am at a loss here. I have attached a text file with configs and stats.
View attachment cluster-configs.txt