[SOLVED] Proxmox HA and 4GBE LACP - High availability almost immediately fails

Dan Manners
Sep 20, 2016
Morning everyone.

I've set up two Dell R610s running Proxmox in HA, each with a 4-port (4GbE) LACP bond to HP v1910-16 switches (see the diagram below).

[Attached image: network diagram]


When the Proxmox boxes boot, they see each other and initially act as though they're in HA, but the logs say otherwise. The two boxes never lose connection with each other, yet the quorum only shows one vote on each box while all of the other numbers are 2.
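(For anyone following along, the vote counts above come from the standard quorum tools; a minimal sketch, assuming Proxmox VE 4.x with corosync 2.x:)

  # run on each node and compare "Expected votes" vs "Total votes"
  pvecm status
  # the same quorum information straight from corosync
  corosync-quorumtool -s
  # which cluster members this node currently sees
  pvecm nodes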

I've made sure it isn't a jumbo-frame issue, nor (as best I can tell) an issue of the LACP not coming up fast enough, and I'm completely out of ideas. I've set up multiple Proxmox boxes in HA before (6 total, 2GbE LACP each) with no issues whatsoever, so I'm *reasonably* confident it's not something completely obvious. That said, I'm sure this is something I'm doing wrong and not a Proxmox issue.

Please and thank you in advance!

EDIT: Weirder still, either of the Proxmox units can still manage the other one through the GUI, yet they appear not to recognize each other; the logs are filled with errors about not being able to reach the other node.
 
Maybe some stupid questions, but can you SSH from one box to the other, and also the other way around?
Can you ping and let it run; will you lose pings?
How does your network configuration on the boxes look (Open vSwitch or Linux)?
Please give some more info :)
 
Maybe some stupid questions, but can you SSH from one box to the other, and also the other way around?
Yup, SSH works fine. Takes a second for the password field to appear, but works fine and fast after that.

Can you ping and let it run; will you lose pings?
Essentially no pings drop: out of nearly 100,000 pings I ran, only 6 packets were lost.

How does your network configuration on the boxes look (Open vSwitch or Linux)?
Please give some more info :)
I have it set up like this:
  • Proxmox 1
  • bond0 | Linux Bond | Active: Yes | Autostart: Yes | Ports/Slaves: eth0 eth1 eth2 eth3 | IP Address: 1.1.1.1 | Subnet Mask: 255.255.255.0
  • vmbr0 | Linux Bridge | Active: Yes | Autostart: Yes | Ports/Slaves: bond0 | IP Address: 10.5.0.201 | Subnet Mask: 255.255.255.0 | Gateway: 10.5.0.1
and

  • Proxmox 2
  • bond0 | Linux Bond | Active: Yes | Autostart: Yes | Ports/Slaves: eth0 eth1 eth2 eth3 | IP Address: 1.1.1.2 | Subnet Mask: 255.255.255.0
  • vmbr0 | Linux Bridge | Active: Yes | Autostart: Yes | Ports/Slaves: bond0 | IP Address: 10.5.0.202 | Subnet Mask: 255.255.255.0 | Gateway: 10.5.0.1

Also, I've verified that no packets are being dropped on the HP switches, and there's no port flapping (which would suggest an LACP misconfiguration).
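For completeness, the resulting /etc/network/interfaces on Proxmox 1 should look roughly like the sketch below; this is a reconstruction from the table above and assumes the bond is in 802.3ad (LACP) mode, with the address on bond0 kept exactly as the GUI lists it:

  auto bond0
  iface bond0 inet static
          # bond address as listed in the GUI table above
          address 1.1.1.1
          netmask 255.255.255.0
          slaves eth0 eth1 eth2 eth3
          bond_miimon 100
          bond_mode 802.3ad
          # hash policy is an assumption; the GUI does not show it
          bond_xmit_hash_policy layer2+3

  auto vmbr0
  iface vmbr0 inet static
          address 10.5.0.201
          netmask 255.255.255.0
          gateway 10.5.0.1
          bridge_ports bond0
          bridge_stp off
          bridge_fd 0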
 
Hmmm, I find that strange. When I cluster my servers I don't get a login prompt; it drops directly to the console (certificate-based login). Do you know for sure the key exchange and the clustering went well?
 
Okay, just rebooted both boxes and unplugged (not disabled) 3 of the 4 ethernet cables on each box.

[Attached screenshot]


Everything looks fine in this configuration, so I figure I'll reconnect the cables one by one and see when it breaks.
 
I've gone through and verified the following:
  1. Nodes are on the same subnet
  2. Nodes can unicast each other just fine (far below 1% loss)
  3. Nodes can resolve each other's hostnames just fine
For the life of me, I can't get omping working. I enter the node names and it tells me "omping: Can't find local address in arguments".

I'm struggling hard over here... what else would you guys recommend trying?
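(Side note on that omping error: omping expects the local node's own address to be in the argument list as well. A sketch of the test, assuming the bridge addresses above and run on both nodes at roughly the same time:)

  # run simultaneously on 10.5.0.201 and 10.5.0.202;
  # the list must include this node's own address, otherwise
  # omping exits with "Can't find local address in arguments"
  omping -c 600 -i 1 -q 10.5.0.201 10.5.0.202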
 
Okay, I've got it all working if I take LACP out of the picture; a single Ethernet cable on each server going to the same switch in standard (non-bonded) mode keeps the HA group online just fine.

Going to do my best to rule out these HP switches, but worst-case scenario I throw them out a window and get something else...
 
Really stupid question: are you mixing your cluster and service networks?

The only reason to bond the cluster traffic transport is fault tolerance, as bonding (even LACP) does not decrease your connection latency.
I humbly suggest you rethink your entire network architecture, something like this:

Ports 0 and 1 in an active/passive bond, connected DIRECTLY from one server to the other. Since you only have two nodes, this is the most dependable cabling available. This is your cluster traffic interface.
Ports 2 and 3 in an active/active bond (LACP if supported, balance-alb if not or if you want switch fault tolerance). This is your service network. If you're using LACP, you don't need the second switch at all. Truth be told, I don't know what benefit you would get from LACP.
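A sketch of what that layout could look like in /etc/network/interfaces, assuming a made-up 10.10.10.0/30 subnet for the direct cluster link and keeping the existing 10.5.0.x addressing for the service bridge:

  # cluster network: ports cabled directly between the two nodes
  auto bond0
  iface bond0 inet static
          # 10.10.10.2 on the other node (assumed addressing)
          address 10.10.10.1
          netmask 255.255.255.252
          slaves eth0 eth1
          bond_miimon 100
          bond_mode active-backup

  # service network: LACP to the switch, bridged for guests
  auto bond1
  iface bond1 inet manual
          slaves eth2 eth3
          bond_miimon 100
          bond_mode 802.3ad

  auto vmbr0
  iface vmbr0 inet static
          address 10.5.0.201
          netmask 255.255.255.0
          gateway 10.5.0.1
          bridge_ports bond1
          bridge_stp off
          bridge_fd 0

The cluster communication would then be pointed at the 10.10.10.x addresses rather than at the bridged service network.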
 
Really stupid question: are you mixing your cluster and service networks?
Not a really stupid question :) Yes, I'm mixing my cluster and service networks. It's my homelab; fewer than 20 clients at any time.

The only reason to bond the cluster traffic transport is fault tolerance, as bonding (even LACP) does not decrease your connection latency.
I humbly suggest you rethink your entire network architecture, something like this:

Ports 0 and 1 in an active/passive bond, connected DIRECTLY from one server to the other. Since you only have two nodes, this is the most dependable cabling available. This is your cluster traffic interface.
Ports 2 and 3 in an active/active bond (LACP if supported, balance-alb if not or if you want switch fault tolerance). This is your service network. If you're using LACP, you don't need the second switch at all.
I'm going to try this right now and see how it operates.

Truth be told, I don't know what benefit you would get from LACP.
The LACP is there to best communicate with my Synology DS1815+, which is running 4GbE LACP at the moment and serves NFS storage to the Proxmox boxes.
 
Okay, so I've narrowed it down to this: when the HA fails, I can still navigate both boxes from a single GUI. If I restart the corosync service on both boxes within 10 seconds or so of each other, the HA group goes back to normal for 3-8 minutes.
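(Concretely, the kind of back-to-back restart described here, assuming root SSH from node 1 to node 2:)

  # on Proxmox 1: restart corosync locally, then on the second node
  systemctl restart corosync
  ssh root@10.5.0.202 systemctl restart corosync
  # quorum comes back, but only for a few minutes
  pvecm status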

It definitely looks like a multicast issue, but troubleshooting is proving difficult since I've never had this issue before and I don't do a whole lot of IGMP configuration.

EDIT: Yup, definitely a multicast issue. 0% packet loss for unicast, but 57% on multicast.
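For anyone hitting the same symptoms: the usual fixes are either enabling IGMP snooping with an active querier on the switches (or disabling IGMP snooping on that VLAN entirely), or switching corosync to unicast as a workaround. A sketch of the unicast workaround, assuming Proxmox VE 4.x with corosync 2.x; the change goes into /etc/pve/corosync.conf and config_version must be bumped:

  # /etc/pve/corosync.conf -- totem section only; other settings unchanged
  # cluster_name is a placeholder, keep the existing name;
  # config_version must be incremented from the current value
  totem {
    cluster_name: homelab
    config_version: 3
    transport: udpu
    version: 2
  }

After the edit, restarting corosync on both nodes should bring both votes back even if IGMP snooping keeps misbehaving on the switches.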
 