[SOLVED] Issues with iowait after faulty LAG

chrispage1

Member
Sep 1, 2021
Hi,

We have a three-node setup, with each node running Proxmox & Ceph.

Each node has four 10GBit ports: two for Ceph and two for public/private networking.
There are also two 1GBit ports used purely for Corosync.

Link aggregation for Ceph works great, Ceph is nice and happy and has a green status.
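
For reference, a LAG like ours would typically be a Linux bond bridged for the VMs on Proxmox; here's a rough sketch of that kind of /etc/network/interfaces setup (interface names, bond mode and addresses are placeholders, not our exact config):

```
# /etc/network/interfaces (excerpt) - illustrative example only
auto bond0
iface bond0 inet manual
    bond-slaves enp65s0f0 enp65s0f1
    bond-miimon 100
    bond-mode 802.3ad
    bond-xmit-hash-policy layer3+4

auto vmbr0
iface vmbr0 inet static
    address 10.0.0.1/24
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0
```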

However, we had an issue with our public/private LAG on node 1 which dropped the transfer speeds to MBit/s rather than GBit/s. One thing we noticed was excessive iowait inside our VMs, to the point that when backups were performed, the load on a VM was so high that it became unresponsive.

Our monitors, managers & metadata services all speak over public/private networks but our OSDs sync via a dedicated network.

(attachment: 1680605551009.png)

Is this configuration OK? Why would I be getting the iowait issues?

Thanks,
Chris.
 
However, we had an issue with our public/private LAG on node 1 which dropped the transfer speeds to MBit/s rather than GBit/s.
Unless I'm misunderstanding something, one node currently only gets Mbit/s of bandwidth instead of Gbit/s?
Is this issue still present? If so, figure out why (bad cable?).

With Ceph, the network is always involved. If you make use of the optional Ceph cluster network, the OSD replication traffic goes over that, but the clients (VMs) still use the Ceph public network, and that also needs to be fast.

Since Ceph will spread the data over all nodes, even one Node with a slow network can cause performance issues.
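
Roughly speaking, the split between the two networks looks like this in ceph.conf (on Proxmox that file lives at /etc/pve/ceph.conf; the subnets below are just examples):

```
[global]
    # network used by clients (VMs), monitors, managers and MDS
    public_network = 10.0.0.0/24
    # optional dedicated network for OSD replication and heartbeats
    cluster_network = 10.0.1.0/24
```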


Sorry if I understood the situation wrong :)
 
Thanks for your reply, Aaron.

Is this issue still present? If so, figure out why (bad cable?).

The issue is still present, so we have taken the port out of the LAG, which resolves everything. We can't work out exactly what the cause is. Outside the LAG, the interface in question transfers 9GBit/s (on a 10GBit link) with no problem. As soon as it's added into the LAG, the LAG throughput drops to just MBit/s. Cable diagnostics come back OK, but we're swapping out the link and the SFP to rule out any hardware-specific issues.
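
In case it helps anyone debugging something similar, this is roughly how we've been comparing the raw link with the bond (the bond name and target IP are placeholders):

```
# check the negotiated bond mode, per-slave link speed and MII status
cat /proc/net/bonding/bond0

# raw throughput test between two nodes (iperf3 installed on both)
iperf3 -s                      # on the receiving node
iperf3 -c 10.0.0.2 -P 4 -t 10  # on the sending node: 4 parallel streams, 10 seconds
```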

With Ceph, the network is always involved. If you make use of the optional Ceph cluster network, the OSD replication traffic goes over that, but the clients (VMs) still use the Ceph public network, and that also needs to be fast.

Right, that makes complete sense. Ceph's `public_network` setting is configured to be our private network, 10.0.0.0/24, which in turn routes traffic through the faulty LAG. I presume there's no problem with that configuration? And once we've fixed the faulty link, in theory we should never drop below 10GBit/s.
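
For completeness, this is roughly how the setting can be double-checked (the `ceph config get` calls may come back empty if the value only lives in ceph.conf rather than the monitor config database):

```
# Proxmox keeps ceph.conf in the cluster filesystem
grep -E '(public|cluster)[_ ]network' /etc/pve/ceph.conf

# or ask the monitors directly (newer Ceph releases)
ceph config get mon public_network
ceph config get mon cluster_network
```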
 
Yep, once the network is operating fine again and Ceph gets enough bandwidth, you should see performance go back to previous levels.
 
Just to update anyone who comes across this thread: despite our fibre passing light tests and running at full speed when not aggregated, it turned out to be a faulty optic after all, and replacing it fixed the issue. We're now back to the full 20GBit/s bonded throughput.
 
