[SOLVED] Proxmox HA and 4GBE LACP - High availability almost immediately fails

Dan Manners
Sep 20, 2016
Morning everyone.

I've set up two Dell R610s running Proxmox in HA, each with a 4-port (4GbE) LACP bond to HP v1910-16 switches (see the diagram below).

[Attached image: network diagram]


When the Proxmox boxes boot, they see each other and initially act as though they're in HA, but the logs say otherwise. The two boxes never lose connection with each other, yet the quorum only shows one vote on each box while all of the other numbers are 2.
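(For anyone following along, the vote counts above come from the standard quorum tools; a minimal sketch, assuming Proxmox VE 4.x with corosync 2.x:)

  # run on each node and compare "Expected votes" vs "Total votes"
  pvecm status
  # the same quorum information straight from corosync
  corosync-quorumtool -s
  # which cluster members this node currently sees
  pvecm nodes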

I've made sure it isn't a jumbo-frame issue, nor (as best I can tell) an issue of the LACP not coming up fast enough, and I'm completely out of ideas. I've set up multiple Proxmox boxes in HA before (6 total, 2GbE LACP each) with no issues whatsoever, so I'm *reasonably* confident it's not something completely obvious. That said, I'm sure this is something I'm doing wrong and not a Proxmox issue.

Please and thank you in advance!

EDIT: Weirder still, either of the Proxmox units can still manage the other one through the GUI, yet they appear not to recognize each other; the logs are filled with errors about not being able to reach the other node.
 
Maybe some stupid questions, but can you SSH from one box to the other, and also the other way around?
Can you ping and let it run; will you lose pings?
How does your network configuration on the boxes look (Open vSwitch or Linux)?
Please give some more info :)
 
Maybe some stupid questions, but can you SSH from one box to the other, and also the other way around?
Yup, SSH works fine. Takes a second for the password field to appear, but works fine and fast after that.

Can you ping and let it run; will you lose pings?
Essentially no pings drop: out of nearly 100,000 pings I ran, only 6 packets were lost.

How does your network configuration on the boxes look (Open vSwitch or Linux)?
Please give some more info :)
I have it set up like this:
  • Proxmox 1
  • bond0 | Linux Bond | Active: Yes | Autostart: Yes | Ports/Slaves: eth0 eth1 eth2 eth3 | IP Address: 1.1.1.1 | Subnet Mask: 255.255.255.0
  • vmbr0 | Linux Bridge | Active: Yes | Autostart: Yes | Ports/Slaves: bond0 | IP Address: 10.5.0.201 | Subnet Mask: 255.255.255.0 | Gateway: 10.5.0.1
and

  • Proxmox 2
  • bond0 | Linux Bond | Active: Yes | Autostart: Yes | Ports/Slaves: eth0 eth1 eth2 eth3 | IP Address: 1.1.1.2 | Subnet Mask: 255.255.255.0
  • vmbr0 | Linux Bridge | Active: Yes | Autostart: Yes | Ports/Slaves: bond0 | IP Address: 10.5.0.202 | Subnet Mask: 255.255.255.0 | Gateway: 10.5.0.1

Also, I've verified that no packets are being dropped on the HP switches, and there's no port flapping (which would suggest an LACP misconfiguration).
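For completeness, the resulting /etc/network/interfaces on Proxmox 1 should look roughly like the sketch below; this is a reconstruction from the table above and assumes the bond is in 802.3ad (LACP) mode, with the address on bond0 kept exactly as the GUI lists it:

  auto bond0
  iface bond0 inet static
          # bond address as listed in the GUI table above
          address 1.1.1.1
          netmask 255.255.255.0
          slaves eth0 eth1 eth2 eth3
          bond_miimon 100
          bond_mode 802.3ad
          # hash policy is an assumption; the GUI does not show it
          bond_xmit_hash_policy layer2+3

  auto vmbr0
  iface vmbr0 inet static
          address 10.5.0.201
          netmask 255.255.255.0
          gateway 10.5.0.1
          bridge_ports bond0
          bridge_stp off
          bridge_fd 0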
 
Hmmm, I find that strange. When I cluster my servers I don't get a login prompt; it drops directly to the console (certificate-based login). Do you know for sure the key exchange and the clustering went well?
 
Okay, just rebooted both boxes and unplugged (not disabled) 3 of the 4 ethernet cables on each box.

[Attached screenshot]


Everything looks fine in this configuration, so I figure I'll reconnect the cables one by one and see when it breaks.
 
I've gone through and verified the following:
  1. Nodes are on the same subnet
  2. Nodes can unicast each other just fine (far below 1% loss)
  3. Nodes can resolve each other's hostnames just fine
For the life of me, I can't get omping working. I enter the node names and it tells me "omping: Can't find local address in arguments".

I'm struggling hard over here... what else would you guys recommend trying?
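(Side note on that omping error: omping expects the local node's own address to be in the argument list as well. A sketch of the test, assuming the bridge addresses above and run on both nodes at roughly the same time:)

  # run simultaneously on 10.5.0.201 and 10.5.0.202;
  # the list must include this node's own address, otherwise
  # omping exits with "Can't find local address in arguments"
  omping -c 600 -i 1 -q 10.5.0.201 10.5.0.202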
 
Okay, I've got it all working if I take LACP out of the picture; a single Ethernet cable on each server going to the same switch in standard (non-bonded) mode keeps the HA group online just fine.

Going to do my best to rule out these HP switches, but worst-case scenario I throw them out a window and get something else...
 
Really stupid question: are you mixing your cluster and service networks?

The only reason to bond the cluster traffic transport is fault tolerance, as bonding (even LACP) does not decrease your connection latency.
I humbly suggest you rethink your entire network architecture, something like this:

Ports 0 and 1 in an active/passive bond, connected DIRECTLY from one server to the other. Since you only have two nodes, this is the most dependable cabling available. This is your cluster traffic interface.
Ports 2 and 3 in an active/active bond (LACP if supported, balance-alb if not or if you want switch fault tolerance). This is your service network. If you're using LACP, you don't need the second switch at all. Truth be told, I don't know what benefit you would get from LACP.
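A sketch of what that layout could look like in /etc/network/interfaces, assuming a made-up 10.10.10.0/30 subnet for the direct cluster link and keeping the existing 10.5.0.x addressing for the service bridge:

  # cluster network: ports cabled directly between the two nodes
  auto bond0
  iface bond0 inet static
          # 10.10.10.2 on the other node (assumed addressing)
          address 10.10.10.1
          netmask 255.255.255.252
          slaves eth0 eth1
          bond_miimon 100
          bond_mode active-backup

  # service network: LACP to the switch, bridged for guests
  auto bond1
  iface bond1 inet manual
          slaves eth2 eth3
          bond_miimon 100
          bond_mode 802.3ad

  auto vmbr0
  iface vmbr0 inet static
          address 10.5.0.201
          netmask 255.255.255.0
          gateway 10.5.0.1
          bridge_ports bond1
          bridge_stp off
          bridge_fd 0

The cluster communication would then be pointed at the 10.10.10.x addresses rather than at the bridged service network.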
 
Really stupid question: are you mixing your cluster and service networks?
Not a really stupid question :) Yes, I'm mixing my cluster and service networks. It's my homelab; fewer than 20 clients at any time.

The only reason to bond the cluster traffic transport is fault tolerance, as bonding (even LACP) does not decrease your connection latency.
I humbly suggest you rethink your entire network architecture, something like this:

Ports 0 and 1 in an active/passive bond, connected DIRECTLY from one server to the other. Since you only have two nodes, this is the most dependable cabling available. This is your cluster traffic interface.
Ports 2 and 3 in an active/active bond (LACP if supported, balance-alb if not or if you want switch fault tolerance). This is your service network. If you're using LACP, you don't need the second switch at all.
I'm going to try this right now and see how it operates.

Truth be told, I don't know what benefit you would get from LACP.
The LACP is there to best communicate with my Synology DS1815+, which is running 4GbE LACP at the moment and serves NFS storage to the Proxmox boxes.
 
Okay, so I've narrowed it down to this: when the HA fails, I can still navigate both boxes from a single GUI. If I restart the corosync service on both boxes within 10 seconds or so of each other, the HA group goes back to normal for 3-8 minutes.
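(Concretely, the kind of back-to-back restart described here, assuming root SSH from node 1 to node 2:)

  # on Proxmox 1: restart corosync locally, then on the second node
  systemctl restart corosync
  ssh root@10.5.0.202 systemctl restart corosync
  # quorum comes back, but only for a few minutes
  pvecm status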

It definitely looks like a multicast issue, but troubleshooting is proving difficult since I've never had this issue before and I don't do a whole lot of IGMP configuration.

EDIT: Yup, definitely a multicast issue. 0% packet loss for unicast, but 57% on multicast.
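For anyone hitting the same symptoms: the usual fixes are either enabling IGMP snooping with an active querier on the switches (or disabling IGMP snooping on that VLAN entirely), or switching corosync to unicast as a workaround. A sketch of the unicast workaround, assuming Proxmox VE 4.x with corosync 2.x; the change goes into /etc/pve/corosync.conf and config_version must be bumped:

  # /etc/pve/corosync.conf -- totem section only; other settings unchanged
  # cluster_name is a placeholder, keep the existing name;
  # config_version must be incremented from the current value
  totem {
    cluster_name: homelab
    config_version: 3
    transport: udpu
    version: 2
  }

After the edit, restarting corosync on both nodes should bring both votes back even if IGMP snooping keeps misbehaving on the switches.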
 