Very bad I/O bottlenecks in my ZFS pools

feicipet

New Member
Dec 23, 2024
Hi,

I'm running a PVE 8.4.1 3-node cluster that I mostly use as a lab for a team of 10-15 developers to practice and train on. The rough specs of the 3 machines are:
  1. HP Z6 G4 - single Xeon Gold 5220 CPU - 384GB 2666 memory
  2. HP Z840 - dual Xeon E5-2699 V3 - 512GB 2133 memory
  3. HP Z840 - dual Xeon E5-2680 V4 - 512GB 2400 memory
In each of the machines, I have 6 consumer-grade M.2 SSDs for workload storage (operating system is running on another device), plugged in as follows:
  1. HP Z6 G4 - 2 M.2 devices connected directly to the motherboard and 4 M.2 devices connected to Asus M.2 X16 4 slot PCIE card
  2. Both HP Z840s - 2 devices are on a 2 slot M.2 PCIE card that I can't recall the brand of and 4 M.2 devices connected to Asus M.2 X16 4 slot PCIE card
Initially, I had set up one mirrored ZFS pool using the 2 M.2 devices on each machine (2x 2TB SSDs, effective size 2TB), and then I built a Ceph cluster using all 12 M.2 devices on the Asus PCIE cards.

Our workloads are primarily K8S clusters with worker nodes hosted on Proxmox VMs that I scatter across the 3 nodes. With the Ceph cluster, though, I kept seeing read/write latencies spiking to more than 100 ms, which caused a lot of intermittent issues for the developers using the cluster. Whatever was running on the ZFS pool was fine, so last weekend I destroyed the Ceph pool and repurposed the SSDs to create a 2nd ZFS pool on each node (4 M.2 SSDs per node in a RAID10-style configuration, i.e. striped mirrors; each SSD is 2TB, so I have an effective size of 4TB). I figured that since the first ZFS pool was working fine, the 2nd should work as well, and that I had sufficient memory for ARC for both pools.
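
For reference, the layout I'm describing is a ZFS striped mirror ("RAID10"). A rough sketch of how such a 4-disk pool is created (pool name and device paths below are placeholders, not my actual ones):

Code:
# 4-disk striped mirror: two 2-way mirrors striped together, ashift=12 for 4K sectors
zpool create -o ashift=12 tank2 \
  mirror /dev/disk/by-id/nvme-DISK_A /dev/disk/by-id/nvme-DISK_B \
  mirror /dev/disk/by-id/nvme-DISK_C /dev/disk/by-id/nvme-DISK_D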

Once my devs started using the cluster, though, we hit major I/O bottlenecks. In iostat, w_await would intermittently spike into the high three-digit milliseconds (refer to the attached file; my M.2 devices are the ones with the "nvme" prefix) and everything on the entire server would stall (even worse than on Ceph).
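
For anyone who wants to see the same view, I watch the extended device stats with something like this (5-second interval; the grep just keeps the header and the NVMe lines):

Code:
# w_await = average time (ms) for write requests, including queueing
iostat -x 5 | grep -E '^(Device|nvme)'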

I didn't limit the amount of memory ARC can use, and when I check arcstats I can confirm that it's sitting at roughly half of the system's memory (the following example is from the Z840 with 512GB of memory):

Code:
root@pve03:~# cat /proc/spl/kstat/zfs/arcstats | grep "^size"
size                            4    261057627064
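
I haven't capped ARC yet. If I do, my understanding is that it's set through the zfs_arc_max module parameter, roughly like this (the 64 GiB value below is only an example, not a recommendation):

Code:
# persistent cap of 64 GiB (value in bytes)
echo "options zfs zfs_arc_max=68719476736" > /etc/modprobe.d/zfs.conf
update-initramfs -u
# apply immediately without a reboot
echo 68719476736 > /sys/module/zfs/parameters/zfs_arc_max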

From my own estimates, the apps in the cluster are probably writing at most around 200-300MB/s into the ZFS pool at peak, and I can't figure out where my bottleneck is:
  1. The PCIE slot I'm using is an x16 with x4x4x4x4 bifurcation enabled - this being PCIE 3.0, I should be getting ~16GB/s of bandwidth, translating to ~4GB/s for each M.2
  2. The M.2 devices themselves, while consumer grade, have actually gotten fairly good reviews, and while I don't expect miracles out of them, I didn't expect them to bottleneck this badly
  3. Disk utilization in both ZFS pools is currently well below 40%, so I'm not running into the problems that come with pool utilization above 80% (the checks I use for points 1 and 3 are sketched after this list)
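
To sanity-check points 1 and 3, these are roughly the commands I use (pool name "tank2" is a placeholder; the lspci class filter selects NVMe controllers):

Code:
# negotiated PCIe link speed/width per NVMe controller (look at the LnkSta lines)
lspci -d ::0108 -vv | grep -E 'Non-Volatile|LnkSta:'
# pool occupancy and fragmentation
zpool list -o name,size,allocated,free,capacity,fragmentation
# per-vdev I/O and latency breakdown, refreshed every 5 seconds
zpool iostat -vl tank2 5
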
Right now, it seems that whenever someone redeploys an application into K8S and K8S starts pulling images and deploying it, the whole machine goes haywire. It's not a production issue per se, but it is causing quite a lot of productivity delays for my team.

Would appreciate it very much if someone with more experience can take a look at my setup as described above and tell me if there's any glaring mistake I made in the setup and whether there's a better way to set this up.

Thank you,
Wong
 

Attachments

  • image (1).png (155.6 KB)
Search the forum for QLC and you will find that ZFS does not work well with those kinds of drives. They might have good reviews for consumer/gaming use, but Proxmox VE is a clustered enterprise hypervisor. There are lots of threads with suggestions to use (second-hand) enterprise drives with PLP instead of consumer QLC flash memory.

Everybody who buys QLC drives ends up creating a thread on the forum about ZFS problems. It's never the other way around, because nobody who searches the forum first and uses ZFS buys QLC drives.
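
To confirm what NAND you actually bought and how worn the drives already are, something like this works (device names are placeholders; look up the reported model number to see whether it is QLC):

Code:
# model, firmware, temperature and wear/health attributes
smartctl -a /dev/nvme0
# with nvme-cli installed: percentage_used and media error counters
nvme smart-log /dev/nvme0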
 
Thanks for the response. If I were to switch to LVM/ext4 instead of ZFS, do you think the problem would be alleviated due to the lower overhead of the filesystem itself? I don't currently have any way of getting enterprise drives. As this is a training environment with no real SLAs, my daily backups are sufficient to protect against drive failures.
 
That would reduce sync writes and write amplification (but you lose most of the features that I like, such as mirroring and checksums). I've used consumer TLC flash drives for ZFS (expecting them to wear quickly, but to accelerate an HDD mirror) and it works well, and endures longer than expected, for some workloads. I don't want to use QLC, so I really can't comment on your idea.
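
If you want to measure how much of the stalling is sync writes before giving up on ZFS, one crude experiment (acceptable in a lab, never in production) is to relax sync on a disposable dataset and compare; "tank2/test" below is a placeholder:

Code:
# current setting (default is "standard")
zfs get sync tank2
# acknowledge sync writes from RAM only - risks losing the last few seconds of
# writes on power loss, so only do this on a throwaway test dataset
zfs set sync=disabled tank2/test
# revert afterwards
zfs set sync=standard tank2/test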
 