2 Node + 1 Quorum Disk: Expected Behavior

adamb

Just trying to get some input.

I have a 2-node setup with one quorum disk. DRBD and cluster communication run over a dedicated 10Gb connection between the two nodes. When I pull the 10Gb network from one of the nodes, they end up fencing each other off. I thought this is exactly what the quorum disk was supposed to prevent? Trying to pin down how to prevent this. I appreciate the input!
 
Is the quorum disk also connected through the network you pull? Or does it use another connection?
 
Do you use heuristics for the qdisk? Can you please post your qdisk configuration?

The qdisk is on another network.

Code:
<?xml version="1.0"?>
<cluster config_version="6" name="medprox">
  <cman expected_votes="3" keyfile="/var/lib/pve-cluster/corosync.authkey"/>
  <quorumd allow_kill="0" interval="3" label="medprox_qdisk" tko="10">
    <heuristic interval="3" program="ping $GATEWAY -c1 -w1" score="1" tko="4"/>
    <heuristic interval="3" program="ip addr | grep eth2 | grep -q UP" score="2" tko="3"/>
  </quorumd>
  <totem token="54000"/>
  <fencedevices>
    <fencedevice agent="fence_ipmilan" ipaddr="10.80.12.149" lanplus="1" login="USERID" name="ipmi1" passwd="PASSW0RD" power_wait="5"/>
    <fencedevice agent="fence_ipmilan" ipaddr="10.80.12.150" lanplus="1" login="USERID" name="ipmi2" passwd="PASSW0RD" power_wait="5"/>
  </fencedevices>
  <clusternodes>
    <clusternode name="medprox1" nodeid="1" votes="1">
      <fence>
        <method name="1">
          <device name="ipmi1"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="medprox2" nodeid="2" votes="1">
      <fence>
        <method name="1">
          <device name="ipmi2"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <rm>
    <pvevm autostart="1" vmid="100"/>
    <pvevm autostart="1" vmid="101"/>
  </rm>
</cluster>
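
For reference, the vote arithmetic this config relies on: the two nodes contribute one vote each and the qdisk accounts for the third, matching expected_votes="3", so quorum is 2 votes. In theory the node that can still reach the qdisk keeps two of the three votes and stays quorate, while the isolated node drops to one and is the one that gets fenced. A minimal sketch of the relevant lines (values copied from the config above):

Code:
<!-- Sketch only: 2 node votes + 1 qdisk vote = expected_votes 3, quorum = 2.
     The node that still reaches the qdisk should keep 2 of 3 votes and stay
     quorate; the isolated node drops to 1 and should be the one fenced. -->
<cman expected_votes="3" keyfile="/var/lib/pve-cluster/corosync.authkey"/>
<quorumd allow_kill="0" interval="3" label="medprox_qdisk" tko="10">
  <!-- the heuristics (as above) decide which node qdiskd treats as healthy -->
</quorumd>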
 
Looks like I should be pointing the heuristic "<heuristic interval="3" program="ip addr | grep eth2 | grep -q UP" score="2" tko="3"/>" at the NIC that DRBD and cluster communication run on.
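
As a sketch of that change (assuming, purely for illustration, that the DRBD/cluster link is eth3 on these hosts; substitute whichever interface it actually is):

Code:
<!-- Sketch only: "eth3" is a placeholder for the NIC that actually carries
     the DRBD/cluster traffic on these hosts. -->
<heuristic interval="3" program="ip addr | grep eth3 | grep -q UP" score="2" tko="3"/>

A slightly stricter variant would be something like ip link show eth3 | grep -q 'state UP', which matches the interface's operational state rather than any line containing "UP".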

After making this change, failover seems to work better. The only issue is that after the node comes back up, I end up with this message repeated over and over.

[TOTEM ] Retransmit List: 31d 31e 31f 320 321 322 323 324 32f 330 331 332 333 334 335 336 337

Now I am also seeing this.

Feb 13 09:20:56 medprox2 kernel: [<ffffffff81065d8c>] ? dequeue_task_fair+0x1fc/0x200
Feb 13 09:20:56 medprox2 kernel: [<ffffffff81529085>] schedule_timeout+0x215/0x2e0
Feb 13 09:20:56 medprox2 kernel: [<ffffffff815281f4>] ? thread_return+0xba/0x7e6
Feb 13 09:20:56 medprox2 kernel: [<ffffffff8135a633>] ? ve_kobject_uevent_env+0xa3/0xc0
Feb 13 09:20:56 medprox2 kernel: [<ffffffff81528cf3>] wait_for_common+0x123/0x190
Feb 13 09:20:56 medprox2 kernel: [<ffffffff81059f10>] ? default_wake_function+0x0/0x20
Feb 13 09:20:56 medprox2 kernel: [<ffffffff81528e1d>] wait_for_completion+0x1d/0x20
Feb 13 09:20:56 medprox2 kernel: [<ffffffffa043b2ca>] dlm_new_lockspace+0x9fa/0xaa0 [dlm]
Feb 13 09:20:56 medprox2 kernel: [<ffffffffa04440b7>] device_write+0x317/0x7d0 [dlm]
Feb 13 09:20:56 medprox2 kernel: [<ffffffff81059f10>] ? default_wake_function+0x0/0x20
Feb 13 09:20:56 medprox2 kernel: [<ffffffff8127e5ca>] ? strncpy_from_user+0x4a/0x90
Feb 13 09:20:56 medprox2 kernel: [<ffffffff81196768>] vfs_write+0xb8/0x1a0
Feb 13 09:20:56 medprox2 kernel: [<ffffffff81197181>] sys_write+0x51/0x90
Feb 13 09:20:56 medprox2 kernel: [<ffffffff8100b182>] system_call_fastpath+0x16/0x1b

I ended up doing a few reboots to get away from this. Here is what I did.

1. Disconnected the DRBD connection
2. Node #2 was successfully fenced
3. Once it came back up, the totem retransmits started
4. Then Node #1 was fenced off
5. Once back up, the kernel messages started appearing
6. Rebooted both nodes and all seems well
 
Seeing this again now. Haven't done anything more to the cluster; everything else seems fine.

[TOTEM ] Retransmit List: 1d9a 1d9b 1d9c 1d9d 1d9e 1d9f 1da0 1da1 1da2 1dae 1daf 1db0 1db1 1db2 1db3 1db4 1db5

From what I understand this is caused when one node is slower than the other. Both of these nodes are brand new IBM x3650 M4s with 100GB of RAM, and pveperf reports very similar numbers on both. Not sure what the cause could be.


 
'man qdisk' mentions 'master_wins' option - that should avoid the fence race.

Can anyone who is using a qdisk comment on whether they are seeing this issue? It also seems I cannot use heuristics along with the master_wins option, which seems like it would cause more issues than it fixes. I appreciate the input.
 
Why do you need heuristics? The master_wins option seems to be the solution to me.
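
For anyone following along, a minimal sketch of what the master_wins variant could look like, based on qdisk(5) (everything except master_wins is copied from the config posted earlier; the heuristics are dropped since, per the posts above, they are not meant to be combined with master_wins):

Code:
<!-- Sketch only: with master_wins="1" only the current qdiskd master advertises
     the qdisk vote, so in a split the master side should win the fence race.
     Heuristics are omitted because they reportedly cannot be combined with
     master_wins. -->
<quorumd allow_kill="0" interval="3" label="medprox_qdisk" tko="10" master_wins="1"/>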
 
I guess it is a bad idea to use the same network for storage replication and cluster communication.
 
Very odd, as I didn't have this issue when both nodes were on the bench. It didn't seem to arise until I put them in the rack. I wonder if something is up with my cable run.

Both of these nodes are not in production and have no activity. There are two VMs, both fresh installs with nothing set up, just for HA testing.

Something else must be off.
 
Iperf is looking good and there are no errors on the interfaces.

root@medprox1:~# iperf -s
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[  4] local 10.211.48.1 port 5001 connected with 10.211.48.2 port 35029
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-10.0 sec  10.9 GBytes  9.39 Gbits/sec
[  5] local 10.211.48.1 port 5001 connected with 10.211.48.2 port 35072
[  5]  0.0-10.0 sec  10.9 GBytes  9.38 Gbits/sec
[  4] local 10.211.48.1 port 5001 connected with 10.211.48.2 port 35121
[  4]  0.0-10.0 sec  10.9 GBytes  9.39 Gbits/sec

I understand that having cluster communication and replication on a saturated link is a bad idea, but I don't see how it could cause issues on a completely idle system with two inactive VMs.
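
For what it's worth, a quick sketch of how the "no errors on the interfaces" check can be made (assuming eth2 is the 10Gb back-to-back link, as in the heuristic earlier; adjust the name if not):

Code:
# Sketch: per-interface RX/TX error and drop counters for the back-to-back link.
# "eth2" is assumed from the earlier heuristic; change it to the actual NIC.
ip -s link show eth2
# Driver-level counters, filtered for anything error- or drop-related:
ethtool -S eth2 | grep -Ei 'err|drop'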
 
After more digging, it ended up being the backend link. I guess it's not best to trust iperf.

Well, I did have an issue with the backend link, but the retransmits are now happening again.

I am also seeing this issue on my first cluster, which has DRBD and cluster communication on separate 1Gb links. That is a dedicated connection with no switches. Having cluster communication on the same link doesn't seem to be the issue.
 
I am trying to add this to my cluster config, but it doesn't seem to take; I get a validation error.

<totem window_size="50"/>

Everything seems ok with my nodes and performance is spot on. I appreciate any and all input on this subject.
 
Hard to verify without having the whole file.


Here it is.

Code:
<?xml version="1.0"?>
<cluster config_version="3" name="medprox">
  <cman expected_votes="3" keyfile="/var/lib/pve-cluster/corosync.authkey"/>
  <quorumd allow_kill="0" interval="3" label="medprox_qdisk" tko="10">
    <heuristic interval="3" program="ping 10.80.1.8 -c1 -w1" score="1" tko="4"/>
    <heuristic interval="3" program="ip addr | grep eth0 | grep -q UP" score="2" tko="3"/>
    <heuristic interval="3" program="ip addr | grep eth2 | grep -q UP" score="2" tko="3"/>
  </quorumd>
  <totem token="54000"/>
  <totem window_size="50"/>
  <fencedevices>
    <fencedevice agent="fence_ipmilan" ipaddr="10.80.12.149" lanplus="1" login="USERID" name="ipmi1" passwd="PASSW0RD" power_wait="5"/>
    <fencedevice agent="fence_ipmilan" ipaddr="10.80.12.150" lanplus="1" login="USERID" name="ipmi2" passwd="PASSW0RD" power_wait="5"/>
  </fencedevices>
  <clusternodes>
    <clusternode name="medprox1" nodeid="1" votes="1">
      <fence>
        <method name="1">
          <device name="ipmi1"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="medprox2" nodeid="2" votes="1">
      <fence>
        <method name="1">
          <device name="ipmi2"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <rm>
  </rm>
</cluster>
 
I did a complete reinstall of the cluster that is not in production yet.

Still hitting this issue. Everything was going well; I was working on cluster.conf when /etc/pve pretty much became unusable due to the totem retransmission issue. After a few minutes Node #2 ended up fencing Node #1. Once back up, all is well.

All of this seems like a network issue, but everything looks good with the network. It is a dedicated 10Gb run, only 50 feet, with no switch. iperf reports good numbers, and I can scp/rsync files at expected speeds. There are no errors on the interfaces, and I can't seem to reproduce any type of packet loss.
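
As a sketch of the sort of loss check being described here (the 10.211.48.2 address is taken from the iperf output earlier; the interval and count are arbitrary):

Code:
# Sketch: sustained ping across the dedicated link, then look at the summary
# for packet loss and latency spikes. 10.211.48.2 is the peer from the iperf
# output above; tune -i and -c as needed.
ping -i 0.2 -c 1000 10.211.48.2 | tail -n 3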
 
Here it is.
Code:
<totem token="54000"/>
<totem window_size="50"/>

You cannot have two totem sections, only one is allowed:

Code:
<totem token="54000" window_size="50"/>

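A sketch of one way to apply and sanity-check that change (this assumes the cman config tools such as ccs_config_validate are installed, and follows the usual PVE 2.x/3.x habit of editing a cluster.conf.new copy and bumping config_version before activating it):

Code:
# Sketch only: merge the two totem elements into one, bump config_version,
# then validate the file before activating it.
cp /etc/pve/cluster.conf /etc/pve/cluster.conf.new
# edit cluster.conf.new: raise config_version and keep a single
#   <totem token="54000" window_size="50"/>
ccs_config_validate -f /etc/pve/cluster.conf.new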
 
