Can ZFS be configured to throttle host VM deletion disk IO?

trey.b

New Member
Jan 22, 2024
I have been managing a PVE cluster at my company for about 5 years, and I have adapted settings and hardware configs as I have learned best practices with PVE, but one major issue keeps complicating operations: disk IO delay when deleting large VM disks on ZFS.

We host production Windows and Linux VMs in R&D to build and validate software, some of it dating back to the 1980s, which means a generic build VM can accumulate a massive amount of build requirements and artifacts. Our VMs are configured with 3TB virtio disks and may have a few million files/folders accumulated over time, because when you have over 1000 VMs the fastest network fetch is the one that never has to happen: a 2-minute compile shouldn't spend hours copying things from the network.

Whenever we need to recreate VMs, especially when we push out a new image annually, ZFS consumes an enormous amount of disk IO shortly after the old VM is deleted. Even on our latest servers I'll see 20% IO delay, which in my experience is just the tip of the iceberg of the actual delay.

When we recreate VMs we run commands like 'pvesm alloc' and 'qm set', and when the disks are busy and don't respond immediately those commands error out complaining that the disks aren't responding.
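
One crude way to paper over those timeouts is to simply retry the provisioning commands a few times before giving up; something like this rough sketch (the VM ID, storage name, and disk size below are placeholders, not our real config):

    # Retry a flaky provisioning command a few times before giving up.
    # VM ID 9001, storage 'local-zfs', and the 3072G size are placeholders.
    retry() {
        local attempt
        for attempt in 1 2 3 4 5; do
            "$@" && return 0            # command succeeded
            echo "attempt $attempt failed: $*" >&2
            sleep 30                    # give the pool time to settle
        done
        return 1
    }

    retry pvesm alloc local-zfs 9001 vm-9001-disk-0 3072G
    retry qm set 9001 --virtio0 local-zfs:vm-9001-disk-0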

Our latest servers are ones like Dell R7625 containing:
  • 2x AMD Zen 4 EPYC 9374F (64C total) @ ~4.1 GHz
  • 1280 GB RAM
  • ZFS RAID10 with 16x8TB drives (Samsung PM1743 or similar Kioxia Gen 5 NVMe SSDs, x4 trained)
These are some of the best enterprise Gen 5 NVMe SSDs on the market.

Ideally, I would like a way to tell ZFS to cap these sorts of operations to 30% and complete it over time or somehow as a lower priority. Is there any way to tune ZFS for this?
 
Maybe the delete causes a lot of trims (which slow down the drives)? Is autotrim enabled on the pool? Maybe try disabling it and trimming the pool on your own schedule.
I think autotrim is not enabled by default. Do you trim your drives regularly? Otherwise, deleting and creating large volumes might take longer if there are no free flash blocks.
Does the pool have at least 20% free? Otherwise large operations might take longer.
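
For reference, checking and changing that would look roughly like this ('tank' is a placeholder pool name):

    # Check current settings and free space ('tank' is a placeholder pool name)
    zpool get autotrim tank
    zpool list -o name,size,alloc,free,capacity tank

    # Disable automatic trim and run it manually at a time of your choosing
    zpool set autotrim=off tank
    zpool trim tank
    zpool status -t tank    # shows per-vdev trim progress
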
In the end you might indeed need a ZFS expert to take a look (I haven't seen such behavior since I switched to enterprise drives, but I am also an amateur and home user).
 
Thanks for the ideas.

I did look at the ZFS options, and there were some like zfs_free_max_blocks that might be related, but considering it probably has to be tweaked per server config and may be detrimental to other operations, it is probably best not to tinker.
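
For anyone curious, those knobs live under /sys/module/zfs/parameters, so inspecting or experimenting with one on a test box is straightforward; note that the exact parameter name differs by OpenZFS version (zfs_free_max_blocks in older releases, zfs_async_block_max_blocks in newer ones), and the value below is arbitrary, not a recommendation:

    # List delete/free-related tunables and their current values
    grep . /sys/module/zfs/parameters/*free* /sys/module/zfs/parameters/*async_block* 2>/dev/null

    # Change one at runtime (parameter name varies by OpenZFS version; value is just an example)
    echo 100000 > /sys/module/zfs/parameters/zfs_async_block_max_blocks

    # Or persist it across reboots via a modprobe config
    # (root-on-ZFS hosts may also need update-initramfs -u for it to apply early)
    echo "options zfs zfs_async_block_max_blocks=100000" > /etc/modprobe.d/zfs-tuning.conf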

It does sound like autotrim is the operation in question, and I don't see a way to tune it. We do have it enabled, and while we could disable it and schedule trims explicitly, we're already using every minute of the day and week. We schedule maintenance pipelines daily, in the middle of the night when there is less global activity, to prune artifacts that haven't been referenced in a month; that way we don't run out of disk space during production builds and spend 2 hours cleaning up and resyncing.
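
For completeness, the mechanics of an explicit schedule would just be a cron entry along these lines ('tank' and the Sunday 03:00 slot are placeholders); the problem for us isn't the mechanics, it's that there is no quiet window left:

    # /etc/cron.d/zfs-manual-trim -- pool name and time slot are placeholders
    0 3 * * 0  root  /usr/sbin/zpool trim tank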

It also wouldn't make sense to delay autotrim during a mass template redeployment: on an upcoming server, for example, we'll have 16x16TB RAID10 and 40 VMs with 4TB each - that's a lot of data to defer and process overnight, and it would compete with production builds.

Another way to handle this would be to serialize all VM creation per host:
  • Delete all VMs
  • Wait until autotrim or disk activity idles (rough sketch after this list)
  • Create all VMs
  • Bootstrap all VMs

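The "wait until idle" step would be a rough polling loop like this ('tank' and the 50 MB/s threshold are placeholders I haven't validated):

    # Block until the pool's write activity drops below a threshold.
    # 'tank' and the 50 MB/s cutoff are placeholders.
    while true; do
        # -H = scripted output, -p = exact byte values; take the 10-second
        # sample (second line), not the since-boot average (first line).
        # Column 7 of zpool iostat is write bandwidth.
        wbytes=$(zpool iostat -Hp tank 10 2 | tail -n 1 | awk '{print $7}')
        [ "$wbytes" -lt $((50 * 1024 * 1024)) ] && break
        echo "pool still busy (${wbytes} B/s written), waiting..."
    done
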
But the real world problem with this is that each host typically has 40 VMs that are production build agents in Azure DevOps. Some builds take 2 minutes. Some take 2 days. Operationally waiting until all are idle creates a lot of friction.

Ultimately, I need to work on an optimization I've already identified to resolve this issue. Today we have a single OS disk that gets recreated along with the VM, artifacts included. I want to refactor this so all non-OS files live on a separate artifact disk that persists across image recreation. I mocked up some simple commands to detach the disk and rename it (so it doesn't get auto-deleted when the VM is recreated), then rename it back and reattach it when the VM is created. That would reduce the IO load of recreating VMs from 2-4 TB to perhaps 50-100 GB - whatever is in the linked-clone delta disk for OS updates, TEMP files, i.e. all changes since creation.
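
The mockup is roughly the following (VM ID 100, storage 'local-zfs', and the rpool/data paths are placeholders, and the exact qm/zfs behavior is worth verifying on a test VM before trusting it):

    # Before destroying the VM: detach the artifact disk, then rename the zvol
    # so it no longer matches the vm-100-* naming PVE uses to track ownership.
    qm set 100 --delete virtio1
    zfs rename rpool/data/vm-100-disk-1 rpool/data/keep-artifacts-100
    # (the dangling unused0 entry left in the VM config may need removing by hand)
    qm destroy 100

    # After recreating VM 100 from the new template: rename back and reattach.
    zfs rename rpool/data/keep-artifacts-100 rpool/data/vm-100-disk-1
    qm set 100 --virtio1 local-zfs:vm-100-disk-1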

Side note - I keep suspecting that what aggravates this inefficient data layout is the server hardware IO topology. We're using Dell R-series dual-socket hosts, with AMD Zen 4 and a couple of generations of Xeons. AMD shows worse IO delay, and I'm looking forward to the Zen 5 chips we're ordering because reducing IO delay was a focus area for that generation. Before Linus helped fix the NUMA balancer bug in Linux kernel 6.8.3, the AMD hosts' Windows VMs would freeze and BSOD constantly - around 3 VMs out of 40 per day. Now it's about 3 of the 40 per year. The Intel hosts never BSOD, but they would briefly disconnect from AzDo, and Windows Event Viewer showed that NT restarted explorer.exe because the disks hadn't responded for 30 seconds. Since that NUMA fix, Intel never does that and AMD's rate has dropped by over 95%.

AMD Zen 4 has multiple CCD chiplets per CPU, and there are PCIe 5.0 lanes linking the 2 sockets. With ZFS RAID10, the writes during autotrim presumably have to cross and saturate that link while other production builds are going on.

I could maybe explore single-socket hosts, but the way CPUs are going they're getting huge, with more NUMA domains per CPU. Zen 6 is supposed to reduce the inter-socket delay massively. They'll probably also start stacking dies as a late-stage technique to get us beyond silicon's limitations.

So really, I should stop deleting TBs of data routinely when it could be avoided, even if technically the SSDs can handle it.
 
If you are willing to put your VM images on ZFS datasets instead of zvols, create a dataset for each VM; after use, just destroy the dataset.
With enterprise NVMe drives having anywhere from somewhat less than to well over 100% spare cells, trim isn't really necessary anymore - try it yourself!
We have lots of DC NVMe drives running and none of them have trim enabled at all; that's more a relic from earlier days (for consumer SSDs it depends).
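
A rough sketch of the dataset-per-VM idea ('rpool/vmdata' and the VM ID are made-up names; in PVE you'd point a directory storage at the dataset and keep the disk image as a file on it):

    # One dataset per VM; the disk image lives as a file inside it
    zfs create rpool/vmdata               # parent dataset, one-time setup
    zfs create rpool/vmdata/vm-9001       # per-VM dataset
    qemu-img create -f raw /rpool/vmdata/vm-9001/disk0.raw 3T

    # When the VM is retired, dropping the whole dataset is a single
    # operation (freed in the background) instead of a huge zvol delete
    zfs destroy -r rpool/vmdata/vm-9001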
 