Recovering from poor network performance

Jonathan Spence

Active Member
Nov 10, 2017
Hey guys,
I thought I would share an experience I had yesterday and today with an unstable network and its impact on cluster quorum. My setup has 2x 40gbps/10gbps MikroTik switches that the hosts plug into (one 10gbps link into each switch, in an active/backup network config). Both switches also connect to a 10gbps/1gbps MikroTik switch that is copper only (apart from the 2x 10gbps fibre uplinks to the other switches) and is used for IPMI. All VLANs are available on all Proxmox host ports and on the ports connecting the switches together (the switches form a ring). There are 12 Proxmox hosts running hyper-converged Ceph.

I had an issue last night where, for some reason, spanning tree elected the 10gbps/1gbps switch as the root bridge. Instead of using the 40gbps link between the two faster switches, all traffic started going through the slower switch. I also hadn't realized that this switch didn't have hardware offloading enabled. With all the Ceph traffic, the switch CPU was overloaded and there were massive delays and packet loss. The switch UI was practically unusable as well.

In any case, I managed to resolve the networking situation by disabling one of the interfaces connecting the faster and slower switches - all traffic then took its correct path. I then updated the STP config so that one of the faster switches has a lower root bridge priority value (and so wins the root election), enabled hardware offload on the 1gbps copper switch just in case, and changed the path costs so that the 40gbps path has a much lower cost than the 10gbps path.
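For anyone wondering how a slow switch can end up as root bridge, here is a rough Python sketch of the standard STP root election: the bridge with the lowest bridge ID wins, where the ID is the priority followed by the MAC address as tie-breaker. The switch names, MAC addresses and the priority of 4096 are made up for illustration and are not the values from my switches. I am only assuming that all three switches were on the standard default priority of 32768. Path costs are not modelled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bridge:
    name: str      # hypothetical label, not a real hostname
    priority: int  # STP bridge priority; a lower value is preferred
    mac: str       # tie-breaker when priorities are equal; lower wins


def bridge_id(bridge: Bridge) -> tuple[int, int]:
    """Return the comparable bridge ID: (priority, MAC as an integer)."""
    return (bridge.priority, int(bridge.mac.replace(":", ""), 16))


def elect_root(bridges: list[Bridge]) -> Bridge:
    """The bridge with the numerically lowest bridge ID becomes root."""
    return min(bridges, key=bridge_id)


# Assumed state before the fix: every switch on the default priority,
# so the election falls through to the lowest MAC address.
before = [
    Bridge("fast-sw-1", 32768, "02:00:00:00:00:20"),
    Bridge("fast-sw-2", 32768, "02:00:00:00:00:30"),
    Bridge("ipmi-sw", 32768, "02:00:00:00:00:10"),
]

# After the fix: one fast switch gets a lower priority value (4096 is
# just an example), so it wins regardless of MAC address.
after = [
    Bridge("fast-sw-1", 4096, "02:00:00:00:00:20"),
    Bridge("fast-sw-2", 32768, "02:00:00:00:00:30"),
    Bridge("ipmi-sw", 32768, "02:00:00:00:00:10"),
]

print(elect_root(before).name)  # ipmi-sw
print(elect_root(after).name)   # fast-sw-1
```

The point is that with everything left on defaults, the root bridge is effectively picked by MAC address, which has nothing to do with link speed.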

By then, however, the network had been too bad for Proxmox: every node showed itself as isolated when I ran pvecm status. Nothing I tried fixed it: restarting corosync, restarting all Proxmox services, even rebooting a host. What solved it was shutting down all 12 hosts and bringing them up one by one. Having to turn everything off to reach quorum again was not ideal, but it worked, which is the main thing.
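To put numbers on why isolated nodes cannot do anything, here is a minimal sketch of the usual majority-quorum arithmetic. It assumes one vote per node and no special quorum options, which is my understanding of the default behaviour and not something I checked against my own corosync config. The function names are just for illustration.

```python
def votes_needed(total_votes: int) -> int:
    """Strict majority: more than half of all expected votes."""
    return total_votes // 2 + 1


def is_quorate(visible_votes: int, total_votes: int) -> bool:
    """A partition is quorate only if it can see a majority of votes."""
    return visible_votes >= votes_needed(total_votes)


TOTAL_NODES = 12  # one vote per host (assumption)

print(votes_needed(TOTAL_NODES))   # 7
print(is_quorate(1, TOTAL_NODES))  # False: a node that only sees itself
print(is_quorate(7, TOTAL_NODES))  # True: the 7th node to join restores quorum
```

So with 12 hosts, a node that only sees itself has 1 vote out of the 7 needed, and under these assumptions the cluster only becomes quorate again once 7 nodes can see each other.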

I just wanted to share this in case someone else comes across this situation. I am running Proxmox 6.4.

Thanks!
 
