Hi,
I'm running a PVE 8.4.1 3-node cluster that I mostly use as a lab for a team of 10-15 developers to practice and train on. The rough specs of the 3 machines are:
- HP Z6 G4 - single Xeon Gold 5220 CPU - 384GB 2666 memory
- HP Z840 - dual Xeon E5-2699 V3 - 512GB 2133 memory
- HP Z840 - dual Xeon E5-2680 V4 - 512GB 2400 memory
The M.2 devices are attached as follows:
- HP Z6 G4 - 2 M.2 devices connected directly to the motherboard and 4 M.2 devices on an ASUS M.2 x16 4-slot PCIe card
- Both HP Z840s - 2 M.2 devices on a 2-slot M.2 PCIe card whose brand I can't recall, and 4 M.2 devices on an ASUS M.2 x16 4-slot PCIe card
Our workloads are primarily K8S clusters whose worker nodes are Proxmox VMs scattered across the 3 nodes. With the Ceph cluster, though, I kept seeing read/write latencies spike above 100 ms, which caused a lot of intermittent issues for the developers using the cluster. Whatever was running on the ZFS pool was fine, so last weekend I destroyed the Ceph pool and repurposed the SSDs to create a second ZFS pool on each node (each node has 4 M.2 SSDs of 2 TB each in a RAID10 layout, giving an effective capacity of 4 TB). I figured that since the first ZFS pool was working fine, the second would work as well, and that I had sufficient memory for ARC for both pools.
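For context, the second pool on each node is a plain striped mirror; a minimal sketch of how it was created (pool and device names here are placeholders, not my exact command) looks like this:
Code:
# RAID10 = two mirrored pairs striped together (placeholder names)
zpool create -o ashift=12 nvmepool \
  mirror /dev/nvme0n1 /dev/nvme1n1 \
  mirror /dev/nvme2n1 /dev/nvme3n1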
Once my devs started using the cluster, though, we hit major I/O bottlenecks. In iostat, w_await would intermittently spike into the high hundreds of milliseconds (refer to the attached file; my M.2 devices are the ones prefixed with "nvme") and everything on the entire server would stall (even worse than on Ceph).
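For reference, those numbers come from the extended iostat output, roughly like this (the 5-second interval is just illustrative):
Code:
# extended per-device stats in MB/s, refreshed every 5 seconds;
# w_await = average time (ms) a write request spends, including queueing
iostat -xm 5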
I didn't limit the amount of memory ARC can use, and when I check arcstats I can confirm it is using roughly half of the system's memory (example below from the Z840 with 512 GB of memory):
Code:
root@pve03:~# cat /proc/spl/kstat/zfs/arcstats | grep "^size"
size 4 261057627064
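That ~243 GiB is consistent with the default ARC ceiling of half of RAM. If I do end up capping ARC, my understanding is it would look something like this (the 64 GiB value is only an example, not something I'm running):
Code:
# cap ARC at 64 GiB (value is in bytes); takes effect immediately
echo 68719476736 > /sys/module/zfs/parameters/zfs_arc_max
# make it persistent and rebuild the initramfs so it applies at boot
echo "options zfs zfs_arc_max=68719476736" > /etc/modprobe.d/zfs.conf
update-initramfs -u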
From my own calculations, the apps in the cluster are probably writing at most around 200-300 MB/s into the ZFS pool at peak, and I can't figure out where my bottleneck is:
- The PCIe slot I'm using is an x16 slot with x4x4x4x4 bifurcation enabled - this being PCIe 3.0, I should be getting ~16 GB/s of bandwidth, translating to ~4 GB/s for each M.2 (I've put the quick arithmetic right after this list)
- The M.2 devices themselves, while consumer grade, have gotten fairly good reviews; I don't expect miracles from them, but I didn't expect them to bottleneck this badly
- Utilization of both ZFS pools is currently well below 40%, so I'm not running into the problems that come with pool utilization hitting 80%
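For completeness, the per-device bandwidth figure above is just the usual PCIe 3.0 arithmetic (approximate, after 128b/130b encoding overhead):
Code:
# PCIe 3.0 ≈ 0.985 GB/s usable per lane (8 GT/s, 128b/130b encoding)
awk 'BEGIN { lane = 0.985;
             printf "x16 slot: %.1f GB/s total, x4 per M.2: %.1f GB/s\n",
                    16 * lane, 4 * lane }'
# -> x16 slot: 15.8 GB/s total, x4 per M.2: 3.9 GB/s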
I would appreciate it very much if someone with more experience could take a look at the setup described above and tell me whether I've made any glaring mistakes and whether there's a better way to set this up.
Thank you,
Wong