High I/O delay

juliannieborg

New Member
Jul 19, 2022
Hi,
I'm facing a strangely high I/O delay.
I have three Dell PowerEdge 720XD servers with (almost) the same configuration.

The main specs that are the same across all three servers are:
- 26x 500GB SSDs in a ZFS pool (RaidZ3), ashift 12, with a block size of 256K
- 380GB of RAM
- Dell PERC H710P Mini Mono flashed to IT mode for passthrough (both are the D1 revision now; initially the node that is giving problems had the B0 revision).
- Same BIOS (version and settings)
- Same fresh install of Proxmox


The biggest difference is the CPU.
Two nodes have 2x E5-2697 v2 and one node has 2x E5-2690 v2.

The nodes with the 2697 have an average I/O delay between 0 and 2% and are currently running between 10 and 20 VMs.

The node with the 2690 already has an I/O delay of 50%+ with a single VM (which is also running on another node).
Even with no VMs running, just copying a 10GB file pushes the I/O delay to over 50%.
Is the CPU difference the bottleneck, or am I overlooking something else? I'm out of ideas.

Any advice or opinions are very much appreciated.

Thank you.
 
Although unlikely, it's possible that the difference in CPU between the two nodes is causing the higher I/O delay on the node with the 2690 CPUs. The 2697 v2 has more cores per socket than the 2690 v2, which could let it process I/O requests more quickly.
However, there could be other factors at play as well. To troubleshoot the issue further, I would recommend trying the following:
  1. Monitor the CPU utilization on both nodes while running a workload. If the 2690 node is consistently running at a noticeably higher CPU utilization than the 2697 nodes, this could indicate that the difference in CPU performance is causing the higher I/O delay.
  2. Test the I/O performance of the storage on both nodes using a tool like fio to see if there are any differences in the raw I/O performance of the disks. This can help you determine whether the higher I/O delay is due to the CPU or the storage (see the example commands after this list).
  3. (you might have already done this since you've mentioned that the BIOS settings are identical; if so, skip this step) Try disabling any power-saving features on the 2690 node to see if this has any impact on the I/O delay. Some power-saving features can cause the CPU to clock down under certain conditions, which could impact its ability to process I/O requests efficiently (a quick check for this is also shown below).
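For step 2, a minimal fio sketch you could run on both nodes and compare. The directory path and sizes are just placeholders; point it at a scratch dataset on the pool, since it creates test files:

Code:
mkdir -p /yourpool/fio-test   # placeholder path, use a dataset on your pool

# 4K random sync writes, roughly what a database-like workload generates
fio --name=syncwrite-test --directory=/yourpool/fio-test --size=4G \
    --rw=randwrite --bs=4k --ioengine=psync --sync=1 \
    --runtime=60 --time_based --group_reporting

# sequential reads for raw throughput
fio --name=seqread-test --directory=/yourpool/fio-test --size=4G \
    --rw=read --bs=1M --runtime=60 --time_based --group_reporting

And for step 3, a quick way to see whether the 2690 node is clocking down under load:

Code:
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | sort | uniq -c
grep 'cpu MHz' /proc/cpuinfo | sort | uniq -c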
Another possibility is that there is a problem with the PERC H710P Mini Mono RAID controller on the node with the 2690 V2. Since you mentioned that the controller was flashed to IT mode, it's possible that the firmware update didn't go as smoothly as it could have, or that the controller is not functioning properly. You could try flashing the controller again or replacing it with another controller to see if that fixes the problem.

It's also worth checking the system logs on the node with the 2690 V2 to see if there are any error messages or other clues that could help identify the cause of the high I/O delay.
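For example (the grep filter is just a starting point, adjust it as needed; smartctl comes from the smartmontools package):

Code:
journalctl -p err -b               # errors since the last boot
dmesg -T | grep -iE 'error|fail|timeout|reset'
smartctl -a /dev/sdX               # repeat per disk, replace sdX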

Overall, it's difficult to say exactly what is causing the high I/O delay without more information, but the difference in CPU performance and the potential issue with the RAID controller are two possible causes that you can investigate further.
 
In addition to what rason wrote:
- 26x 500gb SSDs in a ZFS pool (RaidZ3)
Only throughput performance will scale with the number of disks, and there the PCIe bandwidth might be the bottleneck.
No matter how many SSDs you have, a raidz1/2/3 will only give you the IOPS performance of a single disk, because IOPS scales with the number of vdevs, not the number of disks, and with a single raidz3 you only have one vdev. So the whole raidz3 will be as slow as the slowest disk. Maybe just one SSD is bad and is slowing the whole pool down. You could benchmark all 26 individual disks and see if one of them causes any trouble.
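A minimal read-only sketch for that, assuming the disks show up under /dev/disk/by-id/ (the glob is a placeholder, adjust it so it matches your 26 pool disks):

Code:
for disk in /dev/disk/by-id/ata-*; do
    [[ "$disk" == *-part* ]] && continue    # skip partitions
    echo "== $disk =="
    fio --name=readtest --filename="$disk" --rw=read --bs=1M \
        --direct=1 --readonly --runtime=10 --time_based \
      | grep 'READ:'
done

A disk that reports much lower bandwidth than its siblings is a good candidate for the one dragging the whole raidz3 down.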
For better IOPS performance it would be recommended to use a striped mirror or at least to stripe several smaller raidz2s (like 4x 6-disk raidz2 + 2 spares or 3x 8-disk raidz2 + 2 spares); see the sketch below for the latter layout.
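Just to illustrate that second layout, roughly what 3x 8-disk raidz2 + 2 spares would look like at creation time. "tank" and disk1..disk26 are placeholders for your pool name and the real /dev/disk/by-id/... paths, and this of course means destroying and rebuilding the existing pool:

Code:
zpool create -o ashift=12 tank \
    raidz2 disk1  disk2  disk3  disk4  disk5  disk6  disk7  disk8 \
    raidz2 disk9  disk10 disk11 disk12 disk13 disk14 disk15 disk16 \
    raidz2 disk17 disk18 disk19 disk20 disk21 disk22 disk23 disk24 \
    spare  disk25 disk26

Three raidz2 vdevs give you roughly 3x the IOPS of the single raidz3; a striped mirror (13 mirror vdevs) roughly 13x.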
Dell Perc 710p mini mono flashed to IT mode for passthrough mode
Are all 26 SSDs connected to that single HBA through a SAS expander, or do you have several HBAs per server? I only ask because the H710P only has 8 lanes of PCIe 2.0, so a total maximum bandwidth of 3.2 GB/s. That would mean 3.2 GB/s / 26 SSDs = 123MB/s of throughput per SSD.

ashift 12 with a block size of 256K
I hope you aren't running any databases or anything like that; the read/write amplification will be terrible. For example, each single 8K random sync write of a PostgreSQL DB will read+write a full 256K block, so you do 512K of reads/writes for just 8K of data. Even if sequential reads can be cached in ARC, there will be massive write amplification crippling the performance and causing massive SSD wear.
And yes, I know. You can't go that much lower with the volblocksize or padding overhead will increase even more, because your raidz3 is too big.
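To see what the existing VM disks are actually using ("yourpool" is a placeholder for your pool name):

Code:
zfs get volblocksize -r yourpool | grep -v '@'

In Proxmox, the volblocksize for new VM disks comes from the "Block Size" field of the ZFS storage (Datacenter -> Storage), and changing it only affects newly created zvols; existing disks would have to be recreated or moved.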

Are all 3 nodes running the same workload? Maybe the slow one is just doing more random sync writes, so it is hit harder by the bad IOPS performance and write amplification. You could check what zpool iostat is reporting and compare that between nodes. iostat from the sysstat package can also be helpful to see how much each individual disk is hit by IO (examples below).
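For example:

Code:
zpool iostat -v 5    # per-vdev/per-disk IOPS and bandwidth, refreshed every 5s
zpool iostat -w      # latency histograms, useful for spotting a single slow disk
iostat -x 5          # per-device utilisation and wait times (sysstat package)

Compare the slow node against one of the healthy ones under a similar load.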
 