High KVM CPU load when VM on ZFS

Ballistic

Hi Guys,

I've set up one of my new servers today for some testing.
Xeon Silver CPU
256GB DDR4
Samsung 1733 NVMe drives

As long as the VM is running on local lvm-thin storage, the disk performance in the VM is blazing fast (7GB/s read, 4GB/s write) with low CPU load.
As soon as I run the VM on ZFS storage (raidz, 10 or 1, it does not matter), the disk performance drops in half (boohoo, only 4GB/s read) and all 16 of my CPU cores go to 100% load. The weird thing is, it's not the ZFS process that's eating the CPU;
it's the KVM worker threads that are blasting the CPU. Inside the VM there is no CPU usage shown at all. Anyone have an idea why?

Data:
Windows VM with the VirtIO drivers
CPU type set to "host", 16 cores
Virtual disk options enabled: No cache, SSD emulation, IO thread, Discard

ZFS pool: compression off, atime off, ARC set to cache metadata only
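For reference, those properties were applied roughly like this ("pool" here is just a placeholder for the actual pool name):

zfs set compression=off pool
zfs set atime=off pool
zfs set primarycache=metadata pool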
 
What I also don't understand is why this command, run directly on the Proxmox host's ZFS pool, only does about 35MB/s:

fio --ioengine=libaio --filename=/pool/test --direct=1 --sync=1 --size=1GB --rw=write --bs=4K --numjobs=1 --iodepth=1 --runtime=60 --time_based --name=fio
 
You are doing single-threaded sync 4K writes. 35MB/s isn't bad for that workload; that's still about 8,750 IOPS (35MB/s ÷ 4KB). I've seen SSDs not even reaching 1MB/s.
 
Thanks! That explains it. What options do you recommend for more realistic linear and random read/write performance tests, like CrystalDiskMark or ATTO do?

As for ZFS and the CPU load: so far I think the overhead that ZFS adds (checksumming) simply chokes the CPU. However, when I enable the L1ARC, performance doubles in some tests, so that doesn't really make sense to me yet.
 
If you use the ARC for read caching you will basically be benchmarking your RAM. So it's no wonder the performance goes up, because RAM is way faster than your SSD.
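You can see how much of a run is actually served from the ARC by watching the hit/miss counters while the benchmark is running, for example with the arcstat tool that ships with OpenZFS, or directly from the kernel stats (just a sketch; tool availability can vary by install):

arcstat 1
grep -E '^(hits|misses|size)' /proc/spl/kstat/zfs/arcstats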

And I don't think CrystalDiskMark will be realistic. CrystalDiskMark uses highly parallelized async writes that will be cached in RAM, so it's basically benchmarking your RAM to show some very big synthetic numbers that you often won't see in a real-life scenario.

With your fio benchmark you basically did the opposite: you benchmarked the worst case (or nearly the worst case, since you only did sequential and not random writes) with small, unparallelized 4K sync writes and RAM caching disabled.

A real workload would be somewhere in the middle.

If you want to see some big numbers using fio too, you could enable ARC caching and run something like this (doing 8 parallelized jobs of 1M async sequential writes):
fio --ioengine=libaio --filename=/pool/test --direct=1 --size=5GB --rw=write --bs=1M --numjobs=8 --iodepth=32 --runtime=60 --time_based --name=fio --refill_buffers --group_reporting
You might need to increase the "--size=5GB" if it finishes in under 1 minute.
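If you also want something closer to CrystalDiskMark's random 4K runs, a variant like this should work (same assumptions: /pool/test is just an example path, and with ARC caching enabled much of it will be served from RAM):

fio --ioengine=libaio --filename=/pool/test --direct=1 --size=5GB --rw=randwrite --bs=4K --numjobs=8 --iodepth=32 --runtime=60 --time_based --name=fio --refill_buffers --group_reporting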

And most of the time it is recommended to enable lz4 compression for better performance. The benefit of writing less data to disk (because it is compressed) is bigger than the work needed to compress that data.
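Enabling it is a single property change, for example (assuming the pool is simply named "pool"; it only applies to newly written data):

zfs set compression=lz4 pool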
 
With L1ARC:
fio: (groupid=0, jobs=8): err= 0: pid=2351: Fri Dec 3 09:55:29 2021
write: IOPS=8595, BW=8595MiB/s (9013MB/s)(504GiB/60001msec); 0 zone resets
slat (usec): min=102, max=83658, avg=604.06, stdev=1098.89
clat (usec): min=2, max=125231, avg=28851.81, stdev=11989.98
lat (usec): min=496, max=126538, avg=29457.11, stdev=12268.10
clat percentiles (msec):
| 1.00th=[ 13], 5.00th=[ 14], 10.00th=[ 17], 20.00th=[ 20],
| 30.00th=[ 23], 40.00th=[ 24], 50.00th=[ 27], 60.00th=[ 31],
| 70.00th=[ 33], 80.00th=[ 36], 90.00th=[ 45], 95.00th=[ 51],
| 99.00th=[ 70], 99.50th=[ 83], 99.90th=[ 105], 99.95th=[ 109],
| 99.99th=[ 118]
bw ( MiB/s): min= 5394, max=16264, per=100.00%, avg=8613.06, stdev=294.61, samples=952
iops : min= 5394, max=16264, avg=8612.66, stdev=294.62, samples=952
lat (usec) : 4=0.01%, 10=0.01%, 1000=0.01%
lat (msec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=21.38%, 50=73.69%
lat (msec) : 100=4.75%, 250=0.16%
cpu : usr=34.29%, sys=46.12%, ctx=540458, majf=0, minf=116
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
issued rwts: total=0,515710,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
WRITE: bw=8595MiB/s (9013MB/s), 8595MiB/s-8595MiB/s (9013MB/s-9013MB/s), io=504GiB (541GB), run=60001-60001msec

Without L1ARC:
fio: (groupid=0, jobs=8): err= 0: pid=861606: Fri Dec 3 09:57:44 2021
write: IOPS=8016, BW=8016MiB/s (8406MB/s)(470GiB/60002msec); 0 zone resets
slat (usec): min=103, max=92676, avg=669.20, stdev=1206.22
clat (usec): min=2, max=122290, avg=30935.24, stdev=12438.04
lat (usec): min=265, max=122942, avg=31605.71, stdev=12722.56
clat percentiles (msec):
| 1.00th=[ 12], 5.00th=[ 16], 10.00th=[ 19], 20.00th=[ 22],
| 30.00th=[ 24], 40.00th=[ 27], 50.00th=[ 30], 60.00th=[ 32],
| 70.00th=[ 34], 80.00th=[ 40], 90.00th=[ 46], 95.00th=[ 52],
| 99.00th=[ 79], 99.50th=[ 90], 99.90th=[ 109], 99.95th=[ 113],
| 99.99th=[ 118]
bw ( MiB/s): min= 5104, max=16710, per=100.00%, avg=8023.09, stdev=252.58, samples=952
iops : min= 5104, max=16710, avg=8022.39, stdev=252.58, samples=952
lat (usec) : 4=0.01%, 10=0.01%, 750=0.01%, 1000=0.01%
lat (msec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=14.66%, 50=79.59%
lat (msec) : 100=5.49%, 250=0.25%
cpu : usr=31.84%, sys=44.09%, ctx=702138, majf=0, minf=118
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=99.9%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
issued rwts: total=0,480996,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
WRITE: bw=8016MiB/s (8406MB/s), 8016MiB/s-8016MiB/s (8406MB/s-8406MB/s), io=470GiB (504GB), run=60002-60002msec


The difference between RAM and NVMe performance in this test isn't even that big :)
The difference gets bigger in random 4K block tests, so it's wise to keep the L1ARC enabled, since this is mainly used as a VDI platform.

I will test the impact of lz4 compression and deduplication next. But either way this setup is totally CPU bottlenecked :)
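For the lz4 test, one quick way to see how much it actually saves once some data has been written (again assuming a pool named "pool"):

zfs get compressratio pool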
 
I wonder what the two of you are talking about, because the ARC is integral to ZFS and cannot be "removed", only adjusted in size.
You can tell ZFS what to use the ARC for; that's what the "primarycache" property is for. You can, for example, set "primarycache=none" and ZFS won't use the ARC to cache data or metadata any longer. So if you want to benchmark your disks, you should disable ARC caching first so you don't just benchmark your RAM.

See the OpenZFS documentation: https://openzfs.github.io/openzfs-docs/man/7/zfsprops.7.html?highlight=primarycache
primarycache=all|none|metadata
Controls what is cached in the primary cache (ARC). If this property is set to all, then both user data and metadata is cached. If this property is set to none, then neither user data nor metadata is cached. If this property is set to metadata, then only metadata is cached. The default value is all.
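As a concrete sketch (again assuming a pool named "pool"), disabling the cache for a benchmark and restoring the default afterwards looks like this:

zfs set primarycache=none pool
zfs get primarycache pool
zfs set primarycache=all pool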
 
Ah, that’s a great test feature I haven’t seen before (and searching seems to back up that it’s not super common knowledge). Thanks.

When I was testing hardware NVMe RAID0 on a single-CPU Bronze X11 system, we had to drop a second CPU in just to saturate the card (9460-16i) with 12x P4500 4TB on the Supermicro PCIe switch backplane behind it, so I'm not surprised your system is CPU-limited on ZFS.
 