[SOLVED] Proxmox 4 / Cluster over MPLS

fstrankowski

Well-Known Member
Nov 28, 2016
78
18
48
40
Hamburg
Hey guys, here is a tricky one:

We've setup a proxmox (4.4) cluster of 14 machines and wanted to join more servers into it. Cluster is running fine, there is no problem at all. The new servers got linked into the very same VLAN where the current running Cluster is operating in.

Difference: The link goes over MPLS.

I can ping the other hosts, Multicast seems to work (atleast hosts join the multicast group and can see pings), but when i try to connect the new servers to the existing cluster (pvecm add [..]), i get "waiting quorum....". The exact same config / setup works without MPLS.

So i really would like to know if i missed something here or if i need to setup anything special additional to Multicast.

Regards
 
Hey Jonas,

already tested that prior to my post and indeed its working. Thats what makes me so curious why its not working.
 
What is the latency like? You was able to run a large amount of omping without a drop?

Can every node ping another / corosync hostname e.t.c
 
What is the latency like? You was able to run a large amount of omping without a drop?

Can every node ping another / corosync hostname e.t.c
Latency like 0.003ms. ATM the box is in the same datacenter and should be put into another one which has a latency of 0.007ms (a few km away). I can ping Management-Interface (Proxmox IP) and also Corosync IP from and to all hosts of the cluster, so thats why expect the network to be setup correctly. I use tcpdump to read the traffic going back and forth and couldnt see something unusual. I was able to see Multicast-Traffic from all hosts of the cluster.

Can someone tell me what exactly is happening at the "waiting for quorum" step? What data is beeing exchanged, what processes take place at this specific time.
 
Hi,
Can someone tell me what exactly is happening at the "waiting for quorum" step? What data is beeing exchanged, what processes take place at this specific time.

We check if /etc/pve/local (/which is a "magic" symlink to the local nodes /etc/pve/nodes/<nodename>/ directory) is writeable, the writeable status gets controlled by our cluster file system which allows writes only if it has quorum from the corosync cluster engine.

If you try to add the cluster can you look a t the logs during this (from the node you want to add) with:

Code:
journalctl -f

if you have much else going on in the logs filterm them with:
Code:
journalctl -f -u corosync -u pve-cluster

And give me some outputs during it waits for quorum?

Namely:
Code:
systemctl status corosync pve-cluster
corosync-quorumtool
 
Thanks for the update. So because i couldnt get any errors i've decided to resetup the whole cluster. (I show here the status after setting up 3 hosts in 3 different datacenters with MPLS-Links)

1.) In the previous setup we've went with IPs instead of hostnames for corosync, now we go by hostnames in /etc/hosts

2.) /etc/hosts look like this now:

Code:
root@PX1-C1-N06:~# cat /etc/hosts
127.0.0.1 localhost.localdomain localhost
172.18.40.40 PX1-C1-N06.ourdomain.de PX1-C1-N06 pvelocalhost

# The following lines are desirable for IPv6 capable hosts

::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

# Corosync Hostnames

192.168.41.39 PX1-C3-QUO-CS.ourdomain.de PX1-C3-QUO-CS
192.168.41.40 PX1-C1-N06-CS.ourdomain.de PX1-C1-N06-CS
192.168.41.41 PX1-C1-N07-CS.ourdomain.de PX1-C1-N07-CS
Is this okay? I mean using two different hostnames for management and quorum/corosync?

3.) After resetting everything up from scratch it works. There absolutely no difference in configs beside using hostnames instead of IPs.

Now its time to understand this, which might be close to impossible.


UPDATE: After 30 Minutes the Server with MPLS-Connection got disconnected from the cluster and is alone now!


Code:
Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      1
Quorum:           2 Activity blocked
Flags:


Journal Log for corosync / pve-cluster

Code:
-- Logs begin at Wed 2017-02-22 11:13:20 CET. --
Feb 22 12:11:02 PX1-C3-QUO corosync[6779]: [TOTEM ] Retransmit List: 3d8 3d9 3da 3dc 3dd 3de
Feb 22 12:11:02 PX1-C3-QUO corosync[6779]: [TOTEM ] Retransmit List: 3d8 3d9 3da 3dc 3dd 3de
Feb 22 12:11:02 PX1-C3-QUO corosync[6779]: [TOTEM ] Retransmit List: 3d8 3d9 3da 3dc 3dd 3de
Feb 22 12:11:02 PX1-C3-QUO corosync[6779]: [TOTEM ] Retransmit List: 3d8 3d9 3da 3dc 3dd 3de
Feb 22 12:11:03 PX1-C3-QUO corosync[6779]: [TOTEM ] JOIN or LEAVE message was thrown away during flush operation.
Feb 22 12:11:03 PX1-C3-QUO corosync[6779]: [TOTEM ] JOIN or LEAVE message was thrown away during flush operation.
Feb 22 12:11:04 PX1-C3-QUO corosync[6779]: [TOTEM ] JOIN or LEAVE message was thrown away during flush operation.
Feb 22 12:11:04 PX1-C3-QUO pmxcfs[6760]: [dcdb] notice: members: 3/6760
Feb 22 12:11:04 PX1-C3-QUO pmxcfs[6760]: [status] notice: members: 3/6760
Feb 22 12:11:04 PX1-C3-QUO pmxcfs[6760]: [status] notice: node lost quorum

And the status

Code:
Feb 22 12:07:16 PX1-C3-QUO pmxcfs[6760]: [dcdb] notice: synced members: 1/10292, 2/10429
Feb 22 12:07:16 PX1-C3-QUO pmxcfs[6760]: [dcdb] notice: waiting for updates from leader
Feb 22 12:07:16 PX1-C3-QUO pmxcfs[6760]: [status] notice: received all states
Feb 22 12:07:16 PX1-C3-QUO pmxcfs[6760]: [status] notice: all data is up to date
Feb 22 12:07:16 PX1-C3-QUO pmxcfs[6760]: [dcdb] notice: update complete - trying to commit (got 37 inode updates)
Feb 22 12:07:16 PX1-C3-QUO pmxcfs[6760]: [dcdb] notice: all data is up to date
Feb 22 12:08:54 PX1-C3-QUO pmxcfs[6760]: [status] notice: received log
Feb 22 12:11:04 PX1-C3-QUO pmxcfs[6760]: [dcdb] notice: members: 3/6760
Feb 22 12:11:04 PX1-C3-QUO pmxcfs[6760]: [status] notice: members: 3/6760
Feb 22 12:11:04 PX1-C3-QUO pmxcfs[6760]: [status] notice: node lost quorum
 
Last edited:
1.) In the previous setup we've went with IPs instead of hostnames for corosync, now we go by hostnames in /etc/hosts

With the `ring0_addr` param? Could it be that you forgot to set the correct `bindnet0_addr` address parameter when you created a cluster?
I.e. in your case you should have done:
Code:
root@PX1-C1-N06-CS# pvecm create my-cluster-name -ring0_addr 192.168.41.40 -bindnet0_addr 192.168.41.40
root@PX1-C1-N07-CS# pvecm add 192.168.41.40 -ring0_addr 192.168.41.41
root@PX1-C3-QUO-CS# pvecm add 192.168.41.40 -ring0_addr 192.168.41.39

See http://pve.proxmox.com/pve-docs/chapter-pvecm.html#_separate_cluster_network for more infos

If you forgot corosync listened on another interface (the one where 172.18.40.40 is configured), where no multicast traffic of other nodes came.
I'll rework the checks from pvecm add/create currently I try to improve this situation too.

Do you have the old pvecm create/add commands from the bash history still around?

Is this okay? I mean using two different hostnames for management and quorum/corosync?

Hmm, you overwrite the other host entry, normally you should be good, though, not sure if I miss a possible bad implication of this...

3.) After resetting everything up from scratch it works. There absolutely no difference in configs beside using hostnames instead of IPs.

Now its time to understand this, which might be close to impossible.

It could be possible, If you have the old corosync config around and attach both, old and new, here I can give it a quick look :)
 
I've set it up exactly as u said so. Including ring0 attributes :) Both times!
I dont overwrite the hostname, because im using the original one with -CS attached to it, so two hostnames, one for Management and one for corosync, i think this should be sufficient.

Did you read my update? The server using MPLS got disconnected from the Cluster, this is just hilarious. I could join it and it gets dc. Time is adjusted on all servers, all got the same time using a local atom-clock-backed ntp server sitting in our dc.

The one using MPLS is now spamming, cant get it back into the cluster, fresh install!

Code:
Feb 22 12:46:04 PX1-C3-QUO corosync[3718]: [MAIN  ] Completed service synchronization, ready to provide service.
Feb 22 12:46:06 PX1-C3-QUO corosync[3718]: [TOTEM ] A new membership (192.168.41.39:352) was formed. Members
Feb 22 12:46:06 PX1-C3-QUO corosync[3718]: [QUORUM] Members[1]: 3
Feb 22 12:46:06 PX1-C3-QUO corosync[3718]: [MAIN  ] Completed service synchronization, ready to provide service.
Feb 22 12:46:09 PX1-C3-QUO corosync[3718]: [TOTEM ] A new membership (192.168.41.39:356) was formed. Members
Feb 22 12:46:09 PX1-C3-QUO corosync[3718]: [QUORUM] Members[1]: 3
Feb 22 12:46:09 PX1-C3-QUO corosync[3718]: [MAIN  ] Completed service synchronization, ready to provide service.
Feb 22 12:46:12 PX1-C3-QUO corosync[3718]: [TOTEM ] A new membership (192.168.41.39:360) was formed. Members
Feb 22 12:46:12 PX1-C3-QUO corosync[3718]: [QUORUM] Members[1]: 3
Feb 22 12:46:12 PX1-C3-QUO corosync[3718]: [MAIN  ] Completed service synchronization, ready to provide service.
Feb 22 12:46:14 PX1-C3-QUO corosync[3718]: [TOTEM ] A new membership (192.168.41.39:364) was formed. Members
Feb 22 12:46:14 PX1-C3-QUO corosync[3718]: [QUORUM] Members[1]: 3
Feb 22 12:46:14 PX1-C3-QUO corosync[3718]: [MAIN  ] Completed service synchronization, ready to provide service.
Feb 22 12:46:16 PX1-C3-QUO corosync[3718]: [TOTEM ] A new membership (192.168.41.39:368) was formed. Members
Feb 22 12:46:16 PX1-C3-QUO corosync[3718]: [QUORUM] Members[1]: 3
Feb 22 12:46:16 PX1-C3-QUO corosync[3718]: [MAIN  ] Completed service synchronization, ready to provide service.
Feb 22 12:46:19 PX1-C3-QUO corosync[3718]: [TOTEM ] A new membership (192.168.41.39:372) was formed. Members
[...]
 
I've set it up exactly as u said so. Including ring0 attributes :) Both times!

ok.

I dont overwrite the hostname, because im using the original one with -CS attached to it, so two hostnames, one for Management and one for corosync, i think this should be sufficient.

sorry my bad, overlooked that :confused:

Did you read my update? The server using MPLS got disconnected from the Cluster, this is just hilarious. I could join it and it gets dc. Time is adjusted on all servers, all got the same time using a local atom-clock-backed ntp server sitting in our dc.

Yes I read it just after posting my reply.

Hmm, I've got not much experience with multicast or IGMP over MPLS, it seems it was initially also not intended to work:
https://en.wikipedia.org/wiki/Multiprotocol_Label_Switching#Multicast_addressing

Have to look up how mcats gets handled there.

[TOTEM ] Retransmit List: 3d8 3d9 3da 3dc 3dd 3de

Normally I would say that it looks like, a) to much traffic noise on the network. Corosync does not need much bandwidth but is timing sensitive .
b) Something is dropping multicast after an extended period of time. That would be normaly IGMP snooping from a switch, but 30 minutes is strange as the normal timeout is 5 minutes.

Is a longtime multicast test possible? If yes could you execute on each node (you may test it also just with too, not all three if that make it easier):
Code:
omping -c 2000 -i 1 -q NODE1-IP NODE2-IP ...
This has a very long duration of around ~30 min.
 
Haha, good idea with the "long term test". I think we got the proof that some of our network equipment has a malfunction (although IGMP-Snooping and everything is definitly turned off):

P.S.: Unicast is no option because we plan to expand our cluster to 20+ Nodes.

From .40

Code:
192.168.41.39 : waiting for response msg
192.168.41.41 : waiting for response msg
192.168.41.41 : joined (S,G) = (*, 232.43.211.234), pinging
192.168.41.39 : joined (S,G) = (*, 232.43.211.234), pinging
192.168.41.41 : waiting for response msg
192.168.41.41 : joined (S,G) = (*, 232.43.211.234), pinging
192.168.41.39 : waiting for response msg
192.168.41.39 : joined (S,G) = (*, 232.43.211.234), pinging
192.168.41.39 : given amount of query messages was sent
192.168.41.41 : given amount of query messages was sent

192.168.41.39 :   unicast, xmt/rcv/%loss = 2000/2000/0%, min/avg/max/std-dev = 0.148/0.349/0.657/0.067
192.168.41.39 : multicast, xmt/rcv/%loss = 2000/258/87%, min/avg/max/std-dev = 0.162/0.359/0.630/0.076
192.168.41.41 :   unicast, xmt/rcv/%loss = 2000/2000/0%, min/avg/max/std-dev = 0.025/0.076/0.212/0.038
192.168.41.41 : multicast, xmt/rcv/%loss = 2000/2000/0%, min/avg/max/std-dev = 0.031/0.087/0.216/0.040

From .41

Code:
192.168.41.39 : waiting for response msg
192.168.41.40 : waiting for response msg
192.168.41.40 : joined (S,G) = (*, 232.43.211.234), pinging
192.168.41.39 : joined (S,G) = (*, 232.43.211.234), pinging
192.168.41.39 : waiting for response msg
192.168.41.39 : joined (S,G) = (*, 232.43.211.234), pinging
192.168.41.40 : waiting for response msg
192.168.41.40 : server told us to stop
192.168.41.39 : given amount of query messages was sent

192.168.41.39 :   unicast, xmt/rcv/%loss = 2000/2000/0%, min/avg/max/std-dev = 0.166/0.349/0.627/0.061
192.168.41.39 : multicast, xmt/rcv/%loss = 2000/255/87%, min/avg/max/std-dev = 0.187/0.356/0.534/0.067
192.168.41.40 :   unicast, xmt/rcv/%loss = 1999/1999/0%, min/avg/max/std-dev = 0.026/0.085/0.184/0.039
192.168.41.40 : multicast, xmt/rcv/%loss = 1999/1999/0%, min/avg/max/std-dev = 0.029/0.095/0.203/0.042

From .39 (MPLS)

Code:
192.168.41.40 : waiting for response msg
192.168.41.41 : waiting for response msg
192.168.41.40 : joined (S,G) = (*, 232.43.211.234), pinging
192.168.41.41 : joined (S,G) = (*, 232.43.211.234), pinging
192.168.41.40 : waiting for response msg
192.168.41.40 : server told us to stop
192.168.41.41 : waiting for response msg
192.168.41.41 : server told us to stop

192.168.41.40 :   unicast, xmt/rcv/%loss = 1996/1996/0%, min/avg/max/std-dev = 0.147/0.295/0.531/0.060
192.168.41.40 : multicast, xmt/rcv/%loss = 1996/258/87%, min/avg/max/std-dev = 0.142/0.262/0.488/0.066
192.168.41.41 :   unicast, xmt/rcv/%loss = 1999/1999/0%, min/avg/max/std-dev = 0.148/0.284/0.513/0.058
192.168.41.41 : multicast, xmt/rcv/%loss = 1999/258/87%, min/avg/max/std-dev = 0.144/0.250/0.394/0.058
 
P.S.: Unicast is no option because we plan to expand our cluster to 20+ Nodes.

Haha ok, yeah I would definitively not use unicast with more than 4 nodes.

Haha, good idea with the "long term test". I think we got the proof that some of our network equipment has a malfunction (although IGMP-Snooping and everything is definitly turned off):
192.168.41.40 : unicast, xmt/rcv/%loss = 1996/1996/0%, min/avg/max/std-dev = 0.147/0.295/0.531/0.060 192.168.41.40 : multicast, xmt/rcv/%loss = 1996/258/87%, min/avg/max/std-dev = 0.142/0.262/0.488/0.066 192.168.41.41 : unicast, xmt/rcv/%loss = 1999/1999/0%, min/avg/max/std-dev = 0.148/0.284/0.513/0.058 192.168.41.41 : multicast, xmt/rcv/%loss = 1999/258/87%, min/avg/max/std-dev = 0.144/0.250/0.394/0.058

Yes, the cluster commuinication won't work with that. I find it a bit interesting that a its not hit or miss. But I guess it works the first X minutes and then stops completely and so the 87% miss rate comes into being.
I saw that some MPLS fixes and changes happened between our and the latest Kernel, you could try a 4.9/4.10 kernel (Ubuntu has PPAs with those) just to see if that addresses the problem.
But actually I'm not sure that the node+kernel is the culprit but rather think something else in the network stops forwarding the multicast packages after some time.
 
Okay, we found the problem after 1+ week of digging into our core routers. The problem was indeed the MPLS-Tunnel. We've had turned everything off, IGMP-Snooping, any flow controls so essentially everything which is not absolutely necessary to run our network at this point. So then, problems still persistet.

The fix has been that IGMP had been enabled for the tunnel itself. For whatever reason, this made the link drop pakets of the corosync selective multicast domain. We'll invest some more time examine this behavior but for now everything is running smooth.

Thanks everyone for helping us out at this point. Once we aquired more knowledge we'd be glad to assist other people in here too :)

So far, thanks again!

P.S.: We've just ordered our licenses for our current 25 CPUs ;)
 
  • Like
Reactions: bizzarrone

About

The Proxmox community has been around for many years and offers help and support for Proxmox VE, Proxmox Backup Server, and Proxmox Mail Gateway.
We think our community is one of the best thanks to people like you!

Get your subscription!

The Proxmox team works very hard to make sure you are running the best software and getting stable updates and security enhancements, as well as quick enterprise support. Tens of thousands of happy customers have a Proxmox subscription. Get yours easily in our online shop.

Buy now!