High KVM CPU load when VM on ZFS

Ballistic

Hi Guys,

I've set up one of my new servers today for some testing.
Xeon Silver CPU
256GB DDR4
Samsung 1733 NVMe drives

As long as the VM is running on local lvm-thin storage, the disk performance in the VM is blazing fast (7GB/s read, 4GB/s write) with low CPU load.
As soon as I run the VM on ZFS storage (raidz, 10 or 1, it does not matter), the disk performance drops in half (boohoo, only 4GB/s read) and all 16 of my CPU cores go to 100% load. The weird thing is, it's not the ZFS process that's eating the CPU;
it's the KVM worker threads that are blasting the CPU. Inside the VM there is no CPU usage shown at all. Anyone have an idea why?

Data:
Windows VM with the VirtIO drivers
CPU type set to "host", 16 cores
Virtual disk options enabled: No cache, SSD emulation, IO thread, Discard

ZFS pool: compression off, atime off, ARC set to cache metadata only
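For reference, those properties were applied roughly like this ("pool" here is just a placeholder for the actual pool name):

zfs set compression=off pool
zfs set atime=off pool
zfs set primarycache=metadata pool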
 
What I also don't understand is why this command, run directly on the Proxmox host's ZFS pool, only does about 35MB/s:

fio --ioengine=libaio --filename=/pool/test --direct=1 --sync=1 --size=1GB --rw=write --bs=4K --numjobs=1 --iodepth=1 --runtime=60 --time_based --name=fio
 
You are doing single-threaded sync 4K writes. 35MB/s isn't bad for that workload; that's still about 8,750 IOPS (35MB/s ÷ 4KB). I've seen SSDs not even reaching 1MB/s.
 
Thanks! That explains it. What options do you recommend for more realistic linear and random read/write performance tests, like CrystalDiskMark or ATTO do?

As for ZFS and the CPU load: so far I think the overhead that ZFS adds (checksumming) simply chokes the CPU. However, when I enable the L1ARC, performance doubles in some tests, so that doesn't really make sense to me yet.
 
If you use the ARC for read caching you will basically be benchmarking your RAM. So it's no wonder the performance goes up, because RAM is way faster than your SSD.
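You can see how much of a run is actually served from the ARC by watching the hit/miss counters while the benchmark is running, for example with the arcstat tool that ships with OpenZFS, or directly from the kernel stats (just a sketch; tool availability can vary by install):

arcstat 1
grep -E '^(hits|misses|size)' /proc/spl/kstat/zfs/arcstats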

And I don't think CrystalDiskMark will be realistic. CrystalDiskMark uses highly parallelized async writes that will be cached in RAM, so it's basically benchmarking your RAM to show some very big synthetic numbers that you often won't see in a real-life scenario.

With your fio benchmark you basically did the opposite: you benchmarked the worst case (or nearly the worst case, since you only did sequential and not random writes) with small, unparallelized 4K sync writes and RAM caching disabled.

A real workload would be somewhere in the middle.

If you want to see some big numbers using fio too, you could enable ARC caching and run something like this (doing 8 parallelized jobs of 1M async sequential writes):
fio --ioengine=libaio --filename=/pool/test --direct=1 --size=5GB --rw=write --bs=1M --numjobs=8 --iodepth=32 --runtime=60 --time_based --name=fio --refill_buffers --group_reporting
You might need to increase the "--size=5GB" if it finishes in under 1 minute.
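If you also want something closer to CrystalDiskMark's random 4K runs, a variant like this should work (same assumptions: /pool/test is just an example path, and with ARC caching enabled much of it will be served from RAM):

fio --ioengine=libaio --filename=/pool/test --direct=1 --size=5GB --rw=randwrite --bs=4K --numjobs=8 --iodepth=32 --runtime=60 --time_based --name=fio --refill_buffers --group_reporting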

And most of the time it is recommended to enable lz4 compression for better performance. The benefit of writing less data to disk (because it is compressed) is bigger than the work needed to compress that data.
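Enabling it is a single property change, for example (assuming the pool is simply named "pool"; it only applies to newly written data):

zfs set compression=lz4 pool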
 
With L1ARC:
fio: (groupid=0, jobs=8): err= 0: pid=2351: Fri Dec 3 09:55:29 2021
write: IOPS=8595, BW=8595MiB/s (9013MB/s)(504GiB/60001msec); 0 zone resets
slat (usec): min=102, max=83658, avg=604.06, stdev=1098.89
clat (usec): min=2, max=125231, avg=28851.81, stdev=11989.98
lat (usec): min=496, max=126538, avg=29457.11, stdev=12268.10
clat percentiles (msec):
| 1.00th=[ 13], 5.00th=[ 14], 10.00th=[ 17], 20.00th=[ 20],
| 30.00th=[ 23], 40.00th=[ 24], 50.00th=[ 27], 60.00th=[ 31],
| 70.00th=[ 33], 80.00th=[ 36], 90.00th=[ 45], 95.00th=[ 51],
| 99.00th=[ 70], 99.50th=[ 83], 99.90th=[ 105], 99.95th=[ 109],
| 99.99th=[ 118]
bw ( MiB/s): min= 5394, max=16264, per=100.00%, avg=8613.06, stdev=294.61, samples=952
iops : min= 5394, max=16264, avg=8612.66, stdev=294.62, samples=952
lat (usec) : 4=0.01%, 10=0.01%, 1000=0.01%
lat (msec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=21.38%, 50=73.69%
lat (msec) : 100=4.75%, 250=0.16%
cpu : usr=34.29%, sys=46.12%, ctx=540458, majf=0, minf=116
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
issued rwts: total=0,515710,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
WRITE: bw=8595MiB/s (9013MB/s), 8595MiB/s-8595MiB/s (9013MB/s-9013MB/s), io=504GiB (541GB), run=60001-60001msec

Without L1ARC:
fio: (groupid=0, jobs=8): err= 0: pid=861606: Fri Dec 3 09:57:44 2021
write: IOPS=8016, BW=8016MiB/s (8406MB/s)(470GiB/60002msec); 0 zone resets
slat (usec): min=103, max=92676, avg=669.20, stdev=1206.22
clat (usec): min=2, max=122290, avg=30935.24, stdev=12438.04
lat (usec): min=265, max=122942, avg=31605.71, stdev=12722.56
clat percentiles (msec):
| 1.00th=[ 12], 5.00th=[ 16], 10.00th=[ 19], 20.00th=[ 22],
| 30.00th=[ 24], 40.00th=[ 27], 50.00th=[ 30], 60.00th=[ 32],
| 70.00th=[ 34], 80.00th=[ 40], 90.00th=[ 46], 95.00th=[ 52],
| 99.00th=[ 79], 99.50th=[ 90], 99.90th=[ 109], 99.95th=[ 113],
| 99.99th=[ 118]
bw ( MiB/s): min= 5104, max=16710, per=100.00%, avg=8023.09, stdev=252.58, samples=952
iops : min= 5104, max=16710, avg=8022.39, stdev=252.58, samples=952
lat (usec) : 4=0.01%, 10=0.01%, 750=0.01%, 1000=0.01%
lat (msec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=14.66%, 50=79.59%
lat (msec) : 100=5.49%, 250=0.25%
cpu : usr=31.84%, sys=44.09%, ctx=702138, majf=0, minf=118
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=99.9%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
issued rwts: total=0,480996,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
WRITE: bw=8016MiB/s (8406MB/s), 8016MiB/s-8016MiB/s (8406MB/s-8406MB/s), io=470GiB (504GB), run=60002-60002msec


The difference between RAM and NVMe performance in this test isn't even that big :)
The difference gets bigger in random 4K block tests, so it's wise to keep the L1ARC enabled, since this is mainly used as a VDI platform.

I will test the impact of lz4 compression and deduplication next. But either way this setup is totally CPU bottlenecked :)
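For the lz4 test, one quick way to see how much it actually saves once some data has been written (again assuming a pool named "pool"):

zfs get compressratio pool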
 
I wonder what the two of you are talking about, because the ARC is integral to ZFS and cannot be "removed", only adjusted in size.
You can tell ZFS what to use the ARC for; that's what the "primarycache" property is for. You can, for example, set "primarycache=none" and ZFS won't use the ARC to cache data or metadata any longer. So if you want to benchmark your disks, you should disable ARC caching first so you don't just benchmark your RAM.

See the OpenZFS documentation: https://openzfs.github.io/openzfs-docs/man/7/zfsprops.7.html?highlight=primarycache
primarycache=all|none|metadata
Controls what is cached in the primary cache (ARC). If this property is set to all, then both user data and metadata is cached. If this property is set to none, then neither user data nor metadata is cached. If this property is set to metadata, then only metadata is cached. The default value is all.
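As a concrete sketch (again assuming a pool named "pool"), disabling the cache for a benchmark and restoring the default afterwards looks like this:

zfs set primarycache=none pool
zfs get primarycache pool
zfs set primarycache=all pool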
 
Ah, that’s a great test feature I haven’t seen before (and searching seems to back up that it’s not super common knowledge). Thanks.

When I was testing hardware NVMe RAID0 on a single-CPU Bronze X11 system, we had to drop a second CPU in just to saturate the card (9460-16i) with 12x P4500 4TB on the Supermicro PCIe switch backplane behind it, so I'm not surprised your system is CPU-limited on ZFS.
 