[SOLVED] Issues with iowait after faulty LAG

chrispage1

Member
Sep 1, 2021
Hi,

We have a three-node setup, with each node running Proxmox & Ceph.

Each node has four 10GBit ports: two for Ceph and two for public/private networking.
There are also two 1GBit ports used purely for Corosync.

Link aggregation for Ceph works great, Ceph is nice and happy and has a green status.
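
For reference, a LAG like ours would typically be a Linux bond bridged for the VMs on Proxmox; here's a rough sketch of that kind of /etc/network/interfaces setup (interface names, bond mode and addresses are placeholders, not our exact config):

```
# /etc/network/interfaces (excerpt) - illustrative example only
auto bond0
iface bond0 inet manual
    bond-slaves enp65s0f0 enp65s0f1
    bond-miimon 100
    bond-mode 802.3ad
    bond-xmit-hash-policy layer3+4

auto vmbr0
iface vmbr0 inet static
    address 10.0.0.1/24
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0
```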

However, we had an issue with our public/private LAG on node 1 which dropped the transfer speeds to MBit/s rather than GBit/s. One thing we noticed was excessive iowait inside our VMs, to the point that when backups were performed, the load on a VM was so high that it became unresponsive.

Our monitors, managers & metadata services all speak over public/private networks but our OSDs sync via a dedicated network.

(attachment: 1680605551009.png)

Is this configuration OK? Why would I be getting the iowait issues?

Thanks,
Chris.
 
However, we had an issue with our public/private LAG on node 1 which dropped the transfer speeds to MBit/s rather than GBit/s.
Unless I'm misunderstanding something, one node currently only gets Mbit/s of bandwidth instead of Gbit/s?
Is this issue still present? If so, figure out why (bad cable?).

With Ceph, the network is always involved. If you make use of the optional Ceph cluster network, the OSD replication traffic goes over that, but the clients (VMs) still use the Ceph public network, and that also needs to be fast.

Since Ceph will spread the data over all nodes, even one Node with a slow network can cause performance issues.
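
Roughly speaking, the split between the two networks looks like this in ceph.conf (on Proxmox that file lives at /etc/pve/ceph.conf; the subnets below are just examples):

```
[global]
    # network used by clients (VMs), monitors, managers and MDS
    public_network = 10.0.0.0/24
    # optional dedicated network for OSD replication and heartbeats
    cluster_network = 10.0.1.0/24
```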


Sorry if I understood the situation wrong :)
 
Thanks for your reply, Aaron.

Is this issue still present? If so, figure out why (bad cable?).

The issue is still present, so we have taken the port out of the LAG, which resolves everything. We can't work out exactly what the cause is. Outside the LAG, the interface in question transfers 9GBit/s (on a 10GBit link) with no problem. As soon as it's added into the LAG, the LAG throughput drops to just MBit/s. Cable diagnostics come back OK, but we're swapping out the link and the SFP to rule out any hardware-specific issues.
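
In case it helps anyone debugging something similar, this is roughly how we've been comparing the raw link with the bond (the bond name and target IP are placeholders):

```
# check the negotiated bond mode, per-slave link speed and MII status
cat /proc/net/bonding/bond0

# raw throughput test between two nodes (iperf3 installed on both)
iperf3 -s                      # on the receiving node
iperf3 -c 10.0.0.2 -P 4 -t 10  # on the sending node: 4 parallel streams, 10 seconds
```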

With Ceph, the network is always involved. If you make use of the optional Ceph cluster network, the OSD replication traffic goes over that, but the clients (VMs) still use the Ceph public network, and that also needs to be fast.

Right, that makes complete sense. Ceph's `public_network` setting is configured to be our private network, 10.0.0.0/24, which in turn routes traffic through the faulty LAG. I presume there's no problem with that configuration? And once we've fixed the faulty link, in theory we should never drop below 10GBit/s.
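
For completeness, this is roughly how the setting can be double-checked (the `ceph config get` calls may come back empty if the value only lives in ceph.conf rather than the monitor config database):

```
# Proxmox keeps ceph.conf in the cluster filesystem
grep -E '(public|cluster)[_ ]network' /etc/pve/ceph.conf

# or ask the monitors directly (newer Ceph releases)
ceph config get mon public_network
ceph config get mon cluster_network
```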
 
Yep, once the network is operating fine again and Ceph gets enough bandwidth, you should see performance go back to previous levels.
 
Just to update anyone who comes across this thread: despite our fibre passing light tests and running at full speed when not aggregated, it turned out to be a faulty optic after all, and replacing it fixed the issue. We're now back to the full 20GBit/s bonded throughput.
 
