"Quorum dissolved" after short network outage due to STP topology change

jampy · Sep 2, 2015

Hmmm, ok, it should work as long as the connection between the two switches is working.

Again the same example:

Node #2 can't reach the primary switch and will switch to the backup switch.

Node #1 is not aware of that (since both links of node #1 are up) and will still try to reach node #2 using the primary switch. That switch will forward the packets to the backup switch, which can reach node #2.

In the end it should work, although it is not as robust as the mesh solution (which would still work even if the link between the two switches were down).
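To make the reasoning above concrete, here is a minimal sketch that models only the links each node is actively using and checks whether node #1 can still reach node #2. The names (`node1`, `sw_primary`, `sw_backup`) and the topology are assumptions made up for illustration, not the actual cluster configuration.

```python
from collections import deque

def reachable(links, src, dst):
    """Breadth-first search over undirected links; True if dst can be reached from src."""
    adj = {}
    for a, b in links:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    seen, queue = {src}, deque([src])
    while queue:
        cur = queue.popleft()
        if cur == dst:
            return True
        for nxt in adj.get(cur, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

# Hypothetical state after the failure: node1 still sends via the primary
# switch, node2 has failed over to the backup switch.
with_inter_switch_link = {
    ("node1", "sw_primary"),
    ("node2", "sw_backup"),
    ("sw_primary", "sw_backup"),  # link between the two switches
}
without_inter_switch_link = with_inter_switch_link - {("sw_primary", "sw_backup")}

print(reachable(with_inter_switch_link, "node1", "node2"))
print(reachable(without_inter_switch_link, "node1", "node2"))
```

In this simplified model the two nodes can only talk to each other while the inter-switch link is up, which is the weakness compared to the mesh setup mentioned above.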

I will give it a try during the next hours, when the cluster is not in use.

Still, I'd like to know why the Proxmox cluster crashes completely during a short network outage..

jampy · Sep 2, 2015

jampy said:
I will give it a try during the next hours, when the cluster is not in use.

Still, I'd like to know why the Proxmox cluster crashes completely during a short network outage..

Okay, the network is working fine with bonding and isn't affected by switch reboots either.
I could not test single-NIC failures, though (no physical access).

mgabriel · Sep 4, 2015

I'd also be interested in a solution or an explanation. We discovered exactly the same behaviour in a very common, yet fresh setup of a Proxmox VE cluster with 3 nodes.

Totally different hardware, but due to a switch outage of less than a second, the cluster falls apart and does not come back to a normal state. CMAN is locked, rgmanager not started. No fencing happens as all nodes only have one vote of three.
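The "one vote of three" remark can be illustrated with a small sketch of the quorum arithmetic. It assumes the usual simple-majority rule (more than half of the expected votes); the function names are made up for illustration and are not part of any cluster tool.

```python
def quorum_threshold(expected_votes):
    """Smallest number of votes that forms a strict majority."""
    return expected_votes // 2 + 1

def is_quorate(visible_votes, expected_votes):
    """True if a partition holding visible_votes has quorum."""
    return visible_votes >= quorum_threshold(expected_votes)

# Three nodes with one vote each (assumption matching the post above).
expected = 3
print(quorum_threshold(expected))  # votes needed for quorum
print(is_quorate(1, expected))     # a node that only sees itself
print(is_quorate(2, expected))     # two nodes that still see each other
```

Under that assumption, a node that is cut off from both peers holds only its own vote and loses quorum, so no partition is in a position to fence the others.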

We didn't discover such things earlier. I'm wondering if this might be a bug or if it is just coincidence.

Marco

manu · Sep 8, 2015

@jampy: can you maybe set the thread to SOLVED if HA works with bonding? It will help others who google that.
@mgabriel: can you open a new thread with your question, explaining in detail what your network setup is?

mgabriel · Sep 8, 2015

manu said:
@mgabriel: can you open a new thread with your question, explaining in detail what your network setup is?

We solved it yesterday, as it seems. Tests are ongoing.

We had two HP 5700 switches in place and wanted to use virtual trunking with IRF so that we could run LACP across both switches for redundancy. IRF didn't work for us, so we disabled it, but we left the link between the two switches in place, which created a network loop. We removed the link between the two switches and it seems to work now.

Thanks,
Marco

jampy · Sep 11, 2015

manu said:
@jampy: can you maybe set the thread to SOLVED if HA works with bonding?

To me, bonding in this case is just a workaround.

Bonding reduces the likelihood of a total network outage, but the original problem still persists: the Proxmox cluster becomes unusable when the network is inoperable for a short amount of time. I still wish somebody could explain how that can happen and what can be done to fix it.

jampy · Sep 15, 2015

How come nobody really cares about this problem? IMHO what I'm experiencing *could* indicate that there is some serious bug in Proxmox...

FYI, today the cluster crashed once again after rebooting two nodes (not at the same time). Once again rgmanager crashed (see below) and I had to force-reboot the two nodes. And this is with a seemingly stable network (bonding, no STP):
