"Quorum dissolved" after short network outage due to STP topology change

Hmmm, ok, it should work as long as the connection between the two switches is working.


Again the same example:

Node #2 can't reach the primary switch and will switch to the backup switch.

Node #1 is not aware of that (since both links of node #1 are up) and will still try to reach node #2 using the primary switch. That switch will forward the packets to the backup switch, which can reach node #2.

In the end it should work, although not as robust as the mesh solution (which would still work even if the link between the two switches is down).


I will give it a try during the next hours, when the cluster is not in use.

Still, I'd like to know why the Proxmox cluster crashes completely during a short network outage..
 
I will give it a try during the next hours, when the cluster is not in use.

Still, I'd like to know why the Proxmox cluster crashes completely during a short network outage..

okay, the network is working fine with bonding and also isn't affected by switch reboots.
could not test single nic failures, though (no physical access)
 
I'd also be interested in a solution or an explanation. We discovered exactly the same behaviour in a very common, yet fresh setup of a proxmox VE cluster with 3 Nodes.

Totally different hardware, but due to a switch outage of less than a second, the cluster falls apart and does not come back to a normal state. CMAN is locked, rgmanager not started. No fencing happens as all nodes only have one vote of three.

We didn't discover such things earlier. I'm wondering if this might be a bug or if it is just coincidence.

Marco
 
@jampy: can you maybe set the Thread to SOLVED if HA works with bonding ? it will help others who google that
@mgabriel: can you open a new thread with your question, explaining in detial what is your network setup ?
 
@mgabriel: can you open a new thread with your question, explaining in detial what is your network setup ?

We solved it yesterday, as it seems. Tests are ongoing.

We had two HP 5700 switches in place and wanted to use the virtual trunking wit IRF to be able to do LACP over both switches due to redundancy. IRF didn't work for us, so we disabled it but we still had the link between the two switches in place so that we had a network loop. Removed the link between the two switches and it seems to work now.

Thanks,
Marco
 
Last edited:
@jampy: can you maybe set the Thread to SOLVED if HA works with bonding ?

To me, bonding in this case is just a workaround.

Bonding reduces the possibility of a total network outage, but the original problem still persists: The Proxmox cluster becomes unusable when the network is inoperable for a short amount of time. I still wish somebody could give me an answer how that can happen and what can be done to fix it.
 
How comes nobody really cares about this problem? IMHO what I'm experiencing *could* indicate that there is some serious bug in Proxmox...

FYI, today the cluster crashed once again after rebooting two nodes (not at the same time). Once again rgmanager crashed (see below) and I had to force-reboot the two nodes. And this is with a seemingly stable network (bonding, no STP):

Sh19JAc.png

CqASphR.png
 

Attachments

  • CqASphR.png
    CqASphR.png
    25 KB · Views: 2

About

The Proxmox community has been around for many years and offers help and support for Proxmox VE, Proxmox Backup Server, and Proxmox Mail Gateway.
We think our community is one of the best thanks to people like you!

Get your subscription!

The Proxmox team works very hard to make sure you are running the best software and getting stable updates and security enhancements, as well as quick enterprise support. Tens of thousands of happy customers have a Proxmox subscription. Get yours easily in our online shop.

Buy now!