High CPU IO during DD testing on VMs

en4ble

Hello,

Wondering if I could get some insight into my situation. This is a new build. During benchmarking on the VMs, which uses a DD test, we can see unhealthily high IO. I have a similar system without this behavior - the only difference is that this one is using ZFS (RAID1 - dual WD Red) for the OS.
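The exact dd invocation isn't quoted anywhere in the thread; an in-guest write test of the general shape below is assumed for illustration (the file path, block size and count are placeholders):

# assumed example of the kind of dd write benchmark run inside each VM
dd if=/dev/zero of=/root/ddtest bs=1M count=4096 conv=fdatasync status=progress
rm /root/ddtest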

A little about the system:

  1. dual EPYC 7702
  2. OS on ZFS (RAID1) using two WD Red SSDs, NOT shared with the VMs (dedicated)
  3. 80x VMs (2c/4t) sitting on 15 evenly distributed NVMe drives (SN850X) - about 5-6 VMs per drive
  4. all VM drives running at Gen4 speed using LVM-Thin (not ZFS)
  5. each VM has a disk I/O throttle set to 250/300 (burst) with the "unsafe" cache policy (a config sketch follows this list)
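For reference, a throttle plus cache policy like item 5 is normally applied per disk on the PVE host; in the sketch below the VM ID, storage and volume names are placeholders, not values from this system:

# assumed example: 250 MB/s sustained, 300 MB/s burst, cache=unsafe on VM 101's scsi0
qm set 101 --scsi0 local-lvm:vm-101-disk-0,cache=unsafe,mbps_rd=250,mbps_wr=250,mbps_rd_max=300,mbps_wr_max=300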

[Screenshot: IO delay chart]

On the chart we can see that when DD starts kicking in, the IO levels are concerning.

Apologies for my ignorance, but my question is: could the ZFS setup for the OS perhaps be the culprit?! If it is, I'm confused why, since the DD test does not touch the OS drives, only the dedicated NVMe drives for the VMs.

Are there any other artifacts I could show to help with this issue?!

Thank You in advance for any assistance!
 
Apologies for my ignorance, but my question is: could the ZFS setup for the OS perhaps be the culprit?!
Yes, very possible. ZFS has a lot of write amplification and overhead (because of useful features) and works better with enterprise SSDs with PLP. Additional details and suggestions for drives can be found on this forum, where this has been talked about a lot.
 
Yes, very possible. ZFS has a lot of write amplification and overhead (because of useful features) and works better with enterprise SSDs with PLP. Additional details and suggestions for drives can be found on this forum, where this has been talked about a lot.
Thanks @leesteken for the reply. I'm confused as to why ZFS would cause high IO if the testing is not done on those drives themselves?! As mentioned, those drives are dedicated to the OS only, and the DD test is happening on the NVMe drives.
 
Thanks @leesteken for the reply. I'm confused as to why ZFS would cause high IO if the testing is not done on those drives themselves?! As mentioned, those drives are dedicated to the OS only, and the DD test is happening on the NVMe drives.
Ah right, you don't use ZFS on the NVMe drives? Then please ignore my previous reply.
Maybe your consumer NVMe drives can't handle sustained writes (of multiple VMs) for very long and that's why the IO delay goes up?
 
Ah right, you don't use ZFS on the NVMe drives? Then please ignore my previous reply.
Maybe your consumer NVMe drives can't handle sustained writes (of multiple VMs) for very long and that's why the IO delay goes up?
Each VM has a throttle of 250/300 MB/s, using SN850X drives. 5 VMs at 300 MB/s would be around 1.5 GB/s; the drives should handle a burst like that. Another system (similar to this one) with the same drives and the same tier of CPU does not show those IO numbers, but it runs its OS on LVM-Thin (on NVMe, not SATA SSD). But again, it would be confusing why ZFS would cause that...
 
Is SWAP enabled on the host?
What about host RAM usage?
Good question. No swap on the system - I didn't want to over-provision on that front (which could introduce IO if the same drive were used for it). This was the default config when using ZFS.

RAM usage is at about 50-60%, fluctuating up and down.
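For anyone repeating the check, the host-side numbers come from commands along these lines:

swapon --show   # no output means no swap device or file is configured
free -h         # total, used and available RAM on the host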

[Screenshot: host RAM usage]

Ballooning is enabled on the VMs - this was the default - not sure what the best recommendation is.

[Screenshot: VM memory/ballooning settings]

Disk policies (going with more aggressive limits):
[Screenshots: VM disk throttle settings]

Testing the write-through cache as well, since it may benefit the dd test (TBD). EDIT - it does not; writeback gives the best performance so far.
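Switching cache modes for a test is a per-disk change; a sketch with the VM ID and volume as placeholders (the full disk option string is given each time, since options left out when re-specifying a disk typically fall back to their defaults):

# assumed example: try writethrough on VM 101's scsi0, then writeback for comparison
qm set 101 --scsi0 local-lvm:vm-101-disk-0,cache=writethrough
qm set 101 --scsi0 local-lvm:vm-101-disk-0,cache=writeback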
 
@Dunuin actually talks about using (none) when using ZFS, although again, the VMs are using LVM-Thin and only the OS is on ZFS - not sure if that would still apply?! He also says "Keep in mind that all caching modes besides none will use RAM to buffer data" - but we don't have an issue with RAM space.

I ran both VMs at the same time (unsafe and none) and both had very similar performance using DD, with a slight advantage for the unsafe one...
 
You know why "unsafe" is called that? It will do all important sync writes (data that shouldn't be lost, no matter what) as unsafe volatile async writes, so a failing PSU, kernel crash, power outage or other hardware failure might, for example, kill a whole filesystem or DB.

My guess would also be problems with caching. With "writeback" or "unsafe", the PVE host's RAM will cache all async writes of a VM until your host runs out of RAM. If the VM is writing too much, the storage can't keep up moving data from RAM to disk, and your RAM gets full, the IO delay should go up.
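One way to see the difference described here from inside a guest is to compare a dd run that can land in cache with one that forces the data to stable storage before reporting a rate; the commands below are an assumed illustration, not taken from the thread:

# mostly measures how fast writes land in cache
dd if=/dev/zero of=/root/testfile bs=1M count=2048
# forces the data to disk before dd reports its throughput
dd if=/dev/zero of=/root/testfile bs=1M count=2048 conv=fdatasync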
 
You know why "unsafe" is called that? It will do all important sync writes (data that shouldn't be lost, no matter what) as unsafe volatile async writes, so a failing PSU, kernel crash, power outage or other hardware failure might, for example, kill a whole filesystem or DB.

My guess would also be problems with caching. With "writeback" or "unsafe", the PVE host's RAM will cache all async writes of a VM until your host runs out of RAM. If the VM is writing too much, the storage can't keep up moving data from RAM to disk, and your RAM gets full, the IO delay should go up.
During the spike I mentioned at the beginning, RAM never went beyond 60% on the host. Perhaps I'll try all the VMs with none and see how it performs. So we don't think it's the ZFS on the OS, since the VMs are on dedicated drives?! As always, thanks for chipping in @Dunuin
 
I don't think ZFS is the problem. Most of what the system disks are writing is logs, metrics and updates of the cluster DB. HDDs could handle this, so consumer SSDs shouldn't have a problem either.
You could use tools like iotop or iostat to narrow down what's causing the IO delay. zpool iostat -vy 60 1 should also give you some hints on whether the system disks are busy.
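Typical invocations for that kind of narrowing-down, run on the PVE host, would look something like this (the intervals are arbitrary):

iotop -oPa              # only processes currently doing IO, accumulated per process
iostat -xm 5            # extended per-device statistics in MB, refreshed every 5 seconds
zpool iostat -vy 60 1   # per-vdev stats for the ZFS system pool, averaged over 60 seconds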
 
I don't think ZFS is the problem. Most of what the system disks are writing is logs, metrics and updates of the cluster DB. HDDs could handle this, so consumer SSDs shouldn't have a problem either.
You could use tools like iotop or iostat to narrow down what's causing the IO delay. zpool iostat -vy 60 1 should also give you some hints on whether the system disks are busy.
@Dunuin not sure how to read the iostat, I ran it twice:

root@epyc-pve01:~# zpool iostat -vy 60 1
                                                      capacity     operations     bandwidth
pool                                                 alloc   free   read  write   read  write
---------------------------------------------------  -----  -----  -----  -----  -----  -----
rpool                                                 156G   308G     30    190   277K  8.95M
  mirror-0                                            156G   308G     30    190   277K  8.95M
    ata-WDC_WDS500G1R0A-68A4W0_24070L800900-part3        -      -     14     95   134K  4.48M
    ata-WDC_WDS500G1R0A-68A4W0_24070L800864-part3        -      -     15     95   142K  4.48M
---------------------------------------------------  -----  -----  -----  -----  -----  -----
root@epyc-pve01:~# zpool iostat -vy 60 1
                                                      capacity     operations     bandwidth
pool                                                 alloc   free   read  write   read  write
---------------------------------------------------  -----  -----  -----  -----  -----  -----
rpool                                                 156G   308G     31    177   280K  8.25M
  mirror-0                                            156G   308G     31    177   280K  8.25M
    ata-WDC_WDS500G1R0A-68A4W0_24070L800900-part3        -      -     15     88   135K  4.13M
    ata-WDC_WDS500G1R0A-68A4W0_24070L800864-part3        -      -     15     88   145K  4.12M
---------------------------------------------------  -----  -----  -----  -----  -----  -----
root@epyc-pve01:~#

 
Interesting observation: when I moved all the VMs to the default (no cache) and started the dd benchmark, CPU IO was barely moving; the disk r/w performance was a lot worse vs "unsafe", BUT there was no CPU IO. When I moved back to unsafe and reran the test, I noticed the CPU IO going up again. So there is definitely some correlation between cache being enabled on the VMs vs off. I'm going to try leaving it at the default BUT perhaps put a higher burst on the disk throttle...
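If the plan is the default (no cache) plus a higher burst ceiling, that is again a per-disk change; a sketch only, with the VM ID, volume and limits as placeholders:

# assumed example: no host-side caching, 250 MB/s sustained, 500 MB/s burst
qm set 101 --scsi0 local-lvm:vm-101-disk-0,cache=none,mbps_rd=250,mbps_wr=250,mbps_rd_max=500,mbps_wr_max=500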
 
@Dunuin not sure how to read the iostat, I ran it twice:
[zpool iostat output shown above]
8 MB/s is a lot of writes for a system disk that isn't storing any guests. 177 IOPS shouldn't be a problem for consumer SSDs, even if those were sync writes.

Interesting observation: when I moved all the VMs to the default (no cache) and started the dd benchmark, CPU IO was barely moving; the disk r/w performance was a lot worse vs "unsafe", BUT there was no CPU IO. When I moved back to unsafe and reran the test, I noticed the CPU IO going up again. So there is definitely some correlation between cache being enabled on the VMs vs off. I'm going to try leaving it at the default BUT perhaps put a higher burst on the disk throttle...
Yes, so basically what I wrote above about the cache filling up and the disks not being able to keep up writing it out. With caching set to "none", ZFS will still be write-caching, but only the last 5 seconds, and then it will be flushed, so data can't pile up in RAM so much that the system has to wait because of overwhelmed disks.
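The roughly 5-second flush interval mentioned here corresponds to ZFS's transaction group timeout; on a host with ZFS loaded it can be read from the module parameters (this only matters for storage that actually sits on ZFS, here the OS mirror):

cat /sys/module/zfs/parameters/zfs_txg_timeout   # seconds between forced transaction group flushes (default 5)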

And assuming virtio is what you want, correct?
Yes, VirtIO SCSI single is the default and best practice.
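The controller type is a separate VM option from the per-disk settings; a sketch, with the VM ID as a placeholder:

# assumed example: use the VirtIO SCSI single controller for VM 101
qm set 101 --scsihw virtio-scsi-single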
 