High CPU IO during DD testing on VMs

en4ble

Hello,

Wondering if I could get some insight into my situation. This is a new build. During benchmarking on the VMs, which uses a DD test, we can see unhealthily high IO. I have a similar system without this behavior - the only difference is that this one is using ZFS (RAID1 - dual WD Red) for the OS.
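The exact dd invocation isn't quoted anywhere in the thread; an in-guest write test of the general shape below is assumed for illustration (the file path, block size and count are placeholders):

# assumed example of the kind of dd write benchmark run inside each VM
dd if=/dev/zero of=/root/ddtest bs=1M count=4096 conv=fdatasync status=progress
rm /root/ddtest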

A little about the system:

  1. dual EPYC 7702
  2. OS on ZFS (RAID1) using two WD Red SSDs, NOT shared with the VMs (dedicated)
  3. 80x VMs (2c/4t) sitting on 15 evenly distributed NVMe drives (SN850X) - about 5-6 VMs per drive
  4. all VM drives running at Gen4 speed using LVM-Thin (not ZFS)
  5. each VM has a disk I/O throttle set to 250/300 (burst) with the "unsafe" cache policy (a config sketch follows this list)
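For reference, a throttle plus cache policy like item 5 is normally applied per disk on the PVE host; in the sketch below the VM ID, storage and volume names are placeholders, not values from this system:

# assumed example: 250 MB/s sustained, 300 MB/s burst, cache=unsafe on VM 101's scsi0
qm set 101 --scsi0 local-lvm:vm-101-disk-0,cache=unsafe,mbps_rd=250,mbps_wr=250,mbps_rd_max=300,mbps_wr_max=300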

[Screenshot: IO delay chart]

On the chart we can see that when DD starts kicking in, the IO levels are concerning.

Apologies for my ignorance, but my question is: could the ZFS setup for the OS perhaps be the culprit?! If it is, I'm confused why, since the DD test does not touch the OS drives, only the dedicated NVMe drives for the VMs.

Are there any other artifacts I could show to help with this issue?!

Thank You in advance for any assistance!
 
Apologies for my ignorance, but my question is: could the ZFS setup for the OS perhaps be the culprit?!
Yes, very possible. ZFS has a lot of write amplification and overhead (because of useful features) and works better with enterprise SSDs with PLP. Additional details and suggestions for drives can be found on this forum, where this has been talked about a lot.
 
Yes, very possible. ZFS has a lot of write amplification and overhead (because of useful features) and works better with enterprise SSDs with PLP. Additional details and suggestions for drives can be found on this forum, where this has been talked about a lot.
Thanks @leesteken for the reply. I'm confused as to why ZFS would cause high IO if the testing is not done on those drives themselves?! As mentioned, those drives are dedicated to the OS only, and the DD test is happening on the NVMe drives.
 
Thanks @leesteken for the reply. I'm confused as to why ZFS would cause high IO if the testing is not done on those drives themselves?! As mentioned, those drives are dedicated to the OS only, and the DD test is happening on the NVMe drives.
Ah right, you don't use ZFS on the NVMe drives? Then please ignore my previous reply.
Maybe your consumer NVMe drives can't handle sustained writes (of multiple VMs) for very long and that's why the IO delay goes up?
 
Ah right, you don't use ZFS on the NVMe drives? Then please ignore my previous reply.
Maybe your consumer NVMe drives can't handle sustained writes (of multiple VMs) for very long and that's why the IO delay goes up?
Each VM has a throttle of 250/300 MB/s, using SN850X drives. 5 VMs at 300 MB/s would be around 1.5 GB/s; the drives should handle a burst like that. Another system (similar to this one) with the same drives and the same tier of CPU does not show those IO numbers, but it runs its OS on LVM-Thin (on NVMe, not SATA SSD). But again, it would be confusing why ZFS would cause that...
 
Is SWAP enabled on the host?
What about host RAM usage?
Good question. No swap on the system - I didn't want to over-provision on that front (which could introduce IO if the same drive were used for it). This was the default config when using ZFS.

RAM usage is at about 50-60%, fluctuating up and down.
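For anyone repeating the check, the host-side numbers come from commands along these lines:

swapon --show   # no output means no swap device or file is configured
free -h         # total, used and available RAM on the host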

[Screenshot: host RAM usage]

Ballooning is enabled on the VMs - this was the default - not sure what the best recommendation is.

[Screenshot: VM memory/ballooning settings]

Disk policies (going with more aggressive limits):
[Screenshots: VM disk throttle settings]

Testing the write-through cache as well, since it may benefit the dd test (TBD). EDIT - it does not; writeback gives the best performance so far.
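Switching cache modes for a test is a per-disk change; a sketch with the VM ID and volume as placeholders (the full disk option string is given each time, since options left out when re-specifying a disk typically fall back to their defaults):

# assumed example: try writethrough on VM 101's scsi0, then writeback for comparison
qm set 101 --scsi0 local-lvm:vm-101-disk-0,cache=writethrough
qm set 101 --scsi0 local-lvm:vm-101-disk-0,cache=writeback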
 
@Dunuin actually talks about using (none) when using ZFS, although again, the VMs are using LVM-Thin and only the OS is on ZFS - not sure if that would still apply?! He also says "Keep in mind that all caching modes besides none will use RAM to buffer data" - but we don't have an issue with RAM space.

I ran both VMs at the same time (unsafe and none) and both had very similar performance using DD, with a slight advantage for the unsafe one...
 
You know why "unsafe" is called that? It will do all important sync writes (data that shouldn't be lost, no matter what) as unsafe volatile async writes, so a failing PSU, kernel crash, power outage or other hardware failure might, for example, kill a whole filesystem or DB.

My guess would also be problems with caching. With "writeback" or "unsafe", the PVE host's RAM will cache all async writes of a VM until your host runs out of RAM. If the VM is writing too much, the storage can't keep up moving data from RAM to disk, and your RAM gets full, the IO delay should go up.
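One way to see the difference described here from inside a guest is to compare a dd run that can land in cache with one that forces the data to stable storage before reporting a rate; the commands below are an assumed illustration, not taken from the thread:

# mostly measures how fast writes land in cache
dd if=/dev/zero of=/root/testfile bs=1M count=2048
# forces the data to disk before dd reports its throughput
dd if=/dev/zero of=/root/testfile bs=1M count=2048 conv=fdatasync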
 
You know why "unsafe" is called that? It will do all important sync writes (data that shouldn't be lost, no matter what) as unsafe volatile async writes, so a failing PSU, kernel crash, power outage or other hardware failure might, for example, kill a whole filesystem or DB.

My guess would also be problems with caching. With "writeback" or "unsafe", the PVE host's RAM will cache all async writes of a VM until your host runs out of RAM. If the VM is writing too much, the storage can't keep up moving data from RAM to disk, and your RAM gets full, the IO delay should go up.
During the spike I mentioned at the beginning, RAM never went beyond 60% on the host. Perhaps I'll try all the VMs with none and see how it performs. So we don't think it's the ZFS on the OS, since the VMs are on dedicated drives?! As always, thanks for chipping in @Dunuin
 
I don't think ZFS is the problem. Most of what the system disks are writing is logs, metrics and updates of the cluster DB. HDDs could handle this, so consumer SSDs shouldn't have a problem either.
You could use tools like iotop or iostat to narrow down what's causing the IO delay. zpool iostat -vy 60 1 should also give you some hints on whether the system disks are busy.
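Typical invocations for that kind of narrowing-down, run on the PVE host, would look something like this (the intervals are arbitrary):

iotop -oPa              # only processes currently doing IO, accumulated per process
iostat -xm 5            # extended per-device statistics in MB, refreshed every 5 seconds
zpool iostat -vy 60 1   # per-vdev stats for the ZFS system pool, averaged over 60 seconds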
 
I don't think ZFS is the problem. Most of what the system disks are writing is logs, metrics and updates of the cluster DB. HDDs could handle this, so consumer SSDs shouldn't have a problem either.
You could use tools like iotop or iostat to narrow down what's causing the IO delay. zpool iostat -vy 60 1 should also give you some hints on whether the system disks are busy.
@Dunuin not sure how to read the iostat, I ran it twice:

root@epyc-pve01:~# zpool iostat -vy 60 1
                                                      capacity     operations     bandwidth
pool                                                 alloc   free   read  write   read  write
---------------------------------------------------  -----  -----  -----  -----  -----  -----
rpool                                                 156G   308G     30    190   277K  8.95M
  mirror-0                                            156G   308G     30    190   277K  8.95M
    ata-WDC_WDS500G1R0A-68A4W0_24070L800900-part3        -      -     14     95   134K  4.48M
    ata-WDC_WDS500G1R0A-68A4W0_24070L800864-part3        -      -     15     95   142K  4.48M
---------------------------------------------------  -----  -----  -----  -----  -----  -----
root@epyc-pve01:~# zpool iostat -vy 60 1
                                                      capacity     operations     bandwidth
pool                                                 alloc   free   read  write   read  write
---------------------------------------------------  -----  -----  -----  -----  -----  -----
rpool                                                 156G   308G     31    177   280K  8.25M
  mirror-0                                            156G   308G     31    177   280K  8.25M
    ata-WDC_WDS500G1R0A-68A4W0_24070L800900-part3        -      -     15     88   135K  4.13M
    ata-WDC_WDS500G1R0A-68A4W0_24070L800864-part3        -      -     15     88   145K  4.12M
---------------------------------------------------  -----  -----  -----  -----  -----  -----
root@epyc-pve01:~#

 
Interesting observation: when I moved all the VMs to the default (no cache) and started the dd benchmark, CPU IO was barely moving; the disk r/w performance was a lot worse vs "unsafe", BUT there was no CPU IO. When I moved back to unsafe and reran the test, I noticed the CPU IO going up again. So there is definitely some correlation between cache being enabled on the VMs vs off. I'm going to try leaving it at the default BUT perhaps put a higher burst on the disk throttle...
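If the plan is the default (no cache) plus a higher burst ceiling, that is again a per-disk change; a sketch only, with the VM ID, volume and limits as placeholders:

# assumed example: no host-side caching, 250 MB/s sustained, 500 MB/s burst
qm set 101 --scsi0 local-lvm:vm-101-disk-0,cache=none,mbps_rd=250,mbps_wr=250,mbps_rd_max=500,mbps_wr_max=500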
 
@Dunuin not sure how to read the iostat, I ran it twice:
[zpool iostat output shown above]
8 MB/s is a lot of writes for a system disk that isn't storing any guests. 177 IOPS shouldn't be a problem for consumer SSDs, even if those were sync writes.

Interesting observation: when I moved all the VMs to the default (no cache) and started the dd benchmark, CPU IO was barely moving; the disk r/w performance was a lot worse vs "unsafe", BUT there was no CPU IO. When I moved back to unsafe and reran the test, I noticed the CPU IO going up again. So there is definitely some correlation between cache being enabled on the VMs vs off. I'm going to try leaving it at the default BUT perhaps put a higher burst on the disk throttle...
Yes, so basically what I wrote above about the cache filling up and the disks not being able to keep up writing it out. With caching set to "none", ZFS will still be write-caching, but only the last 5 seconds, and then it will be flushed, so data can't pile up in RAM so much that the system has to wait because of overwhelmed disks.
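The roughly 5-second flush interval mentioned here corresponds to ZFS's transaction group timeout; on a host with ZFS loaded it can be read from the module parameters (this only matters for storage that actually sits on ZFS, here the OS mirror):

cat /sys/module/zfs/parameters/zfs_txg_timeout   # seconds between forced transaction group flushes (default 5)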

And assuming virtio is what you want, correct?
Yes, VirtIO SCSI single is the default and best practice.
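The controller type is a separate VM option from the per-disk settings; a sketch, with the VM ID as a placeholder:

# assumed example: use the VirtIO SCSI single controller for VM 101
qm set 101 --scsihw virtio-scsi-single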
 