Very poor disk performance on PVE Node

Poehlmann

New Member
Sep 4, 2023
Hello,
we are using a single PVE node with a ZFS RAID-Z pool, where all the VMs reside. We have 12 VMs running, and RAM and CPU are not stressed at all.
However, we noticed that a lot of VMs get blocked tasks; this is especially true for a gitlab-runner VM running Docker, which repeatedly cannot connect to the Docker engine.
After running the fio benchmark we saw write speeds of less than 400 KB/s. This is sequential writes with a 4k block size (see command below). iostat shows writes of around 40M, which hopefully is not the upper limit I can expect from modern disks. Any hints on where to look would be appreciated.

System Information

Hardware
Mainboard: Supermicro - X12DPi-NT6
CPU(s): 2 x Intel(R) Xeon(R) Gold 5318Y CPU @ 2.10GHz
Memory: 8x 32GB RAM - ATP X4B32QB4BNWESO-7-TN1
Disks:
  • INTEL SSDSC2KB96 1TB - System drive
  • 3x WDC WUH722222AL 20TB - Data drive
HBA Controller: Broadcom / LSI 9500-8i Tri-Mode HBA
Network:
  1. Super Micro Computer Inc Ethernet Controller X550
  2. Intel Corporation Ethernet Controller X550
Graphics: ASPEED Technology, Inc. ASPEED Graphics Family
Software
Kernel Version: Linux 6.8.8-1-pve (2024-06-10T11:42Z)
Boot Mode: EFI
Manager Version: pve-manager/8.2.4/faa83925c9641325
KSM sharing: 0 B (Off)
IO delay: < 1%
Load average: 3.17, 2.35, 2.25

Benchmark:
Code:
# fio --filename=test --sync=1 --rw=write --bs=4k --numjobs=1   --iodepth=4 --group_reporting --name=test --filesize=10G --runtime=300 && rm test
test: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=4
fio-3.33
Starting 1 process
test: Laying out IO file (1 file / 10240MiB)
note: both iodepth >= 1 and synchronous I/O engine are selected, queue depth will be capped at 1
Jobs: 1 (f=1): [W(1)][100.0%][w=56KiB/s][w=14 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=250809: Thu Jun 20 17:11:08 2024
  write: IOPS=60, BW=241KiB/s (247kB/s)(70.7MiB/300007msec); 0 zone resets
    clat (msec): min=2, max=2232, avg=16.58, stdev=43.58
     lat (msec): min=2, max=2232, avg=16.58, stdev=43.58
    clat percentiles (msec):
     |  1.00th=[    4],  5.00th=[    4], 10.00th=[    4], 20.00th=[    4],
     | 30.00th=[    4], 40.00th=[    6], 50.00th=[    8], 60.00th=[   11],
     | 70.00th=[   15], 80.00th=[   22], 90.00th=[   37], 95.00th=[   52],
     | 99.00th=[  113], 99.50th=[  176], 99.90th=[  542], 99.95th=[  995],
     | 99.99th=[ 1787]
   bw (  KiB/s): min=    8, max=  912, per=100.00%, avg=249.30, stdev=195.70, samples=580
   iops        : min=    2, max=  228, avg=62.32, stdev=48.92, samples=580
  lat (msec)   : 4=30.94%, 10=28.59%, 20=18.42%, 50=16.74%, 100=4.10%
  lat (msec)   : 250=0.92%, 500=0.17%, 750=0.03%, 1000=0.03%, 2000=0.04%
  lat (msec)   : >=2000=0.01%
  cpu          : usr=0.02%, sys=0.18%, ctx=18518, majf=0, minf=10
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,18089,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=4

Run status group 0 (all jobs):
  WRITE: bw=277KiB/s (284kB/s), 277KiB/s-277KiB/s (284kB/s-284kB/s), io=81.2MiB (85.2MB), run=300039-300039msec
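
For comparison, a random-write variant of the same test would better match the mostly random I/O pattern of the VMs; this is only an illustrative sketch with the same placeholder file name, not something we have run yet:
Code:
# Same settings as above, but random writes instead of sequential:
fio --filename=test --sync=1 --rw=randwrite --bs=4k --numjobs=1 --iodepth=4 \
    --group_reporting --name=test-rand --filesize=10G --runtime=300 && rm test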
 

Attachments

  • Screenshot 2024-06-21 083226.png
RAIDz1 is a terrible choice for running VMs (as several threads on this forum explain in more detail), especially with only 3 drives. If those drives are QLC SSDs then you might as well give them away, as modern HDDs will perform better under load.
 
Hi, RAID-Z is great for backups but never use it for VMs.
With enterprise NVMe drives the performance is okay, but not good.
It is best to use a ZFS mirror, and if you have HDDs, to plan for two small, fast SSDs as a special device.

The system with 2x 24-core CPUs looks like an enterprise server; if it has a good RAID controller with battery-backed cache, then it is better to build a RAID5 with it and use an LVM thin pool, then HDDs will also perform sufficiently.
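
As a rough sketch of that layout (device names and the pool name are placeholders; with only three HDDs, a plain mirror would use two of them):
Code:
# Mirrored HDD pool with a mirrored special device on two small SSDs
zpool create -o ashift=12 tank \
    mirror /dev/sdb /dev/sdc \
    special mirror /dev/nvme0n1 /dev/nvme1n1
# Optionally let small blocks land on the special device as well
zfs set special_small_blocks=16K tank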
 
Okay, thanks. So I just tried to understand why that is. Is it correct that ZFS is highly IO-demanding for VMs because of its COW feature, and that running a RAID-Z makes it even worse?

When I follow that thought my solution would be:
Since our node is only for internal use and downtimes of a few hours because of disk failures shouldn't be problematic, would it be sensible to skip RAID and ZFS altogether and use some basic filesystem like ext4? Our backups are a maximum of one day old and are on a PBS, so restoring them is easy enough. It probably makes sense to manually balance the load of the VMs by creating multiple data pools.

Anything I'm missing?
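
For reference, the plain-filesystem route above would look roughly like this; the device name, mount point and storage ID are placeholders:
Code:
# Format one data disk with ext4, mount it, and add it as a directory storage
mkfs.ext4 /dev/sdb
mkdir -p /mnt/data1
mount /dev/sdb /mnt/data1    # plus an /etc/fstab entry to make it persistent
pvesm add dir data1 --path /mnt/data1 --content images,rootdir
A directory storage keeps the VM disks as plain files (e.g. qcow2), so snapshots would still be available.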
 
Okay, thanks. So I just tried to understand why that is. Is it correct that ZFS is highly IO-demanding for VMs because of its COW feature, and that running a RAID-Z makes it even worse?

When I follow that thought my solution would be:
Since our node is only for internal use and downtimes of a few hours because of disk failures shouldn't be problematic, would it be sensible to skip RAID and ZFS altogether and use some basic filesystem like ext4?
You do not need file systems for the VMs; it is better to use ZFS pools or LVM thin pools. You will have significantly less overhead.
Our backups are a maximum of one day old and are on a PBS, so restoring them is easy enough. It probably makes sense to manually balance the load of the VMs by creating multiple data pools.

Anything I'm missing?
Personally, I wouldn't work without RAID if I had the option.
Of course, you can also use the disks natively individually.

But then I would build 3 thin pools so that not everything is lost if a disk fails. You can of course also combine the 3 disks with LVM, but this triples the probability of failure.
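
A rough sketch of that per-disk thin-pool layout (device, volume group, pool and storage names are placeholders):
Code:
# One thin pool per data disk, registered as a Proxmox storage
pvcreate /dev/sdb
vgcreate vmdata1 /dev/sdb
lvcreate -l 95%FREE -T vmdata1/data    # leave some room for pool metadata
pvesm add lvmthin data1 --vgname vmdata1 --thinpool data --content images,rootdir
# Repeat for the remaining disks (vmdata2, vmdata3, ...)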
 
Thanks for your opinion. I'm really trying to wrap my head around this subject, but I have a few more questions:

You do not need file systems for the VMs; it is better to use ZFS pools or LVM thin pools. You will have significantly less overhead.
Yes, you are probably correct, but there would be some overhead nonetheless, correct? And I'm probably missing something, but I don't see the benefit of using LVM thin pools when I'm only using one drive per pool. Snapshotting of the VMs should work just fine, and the filesystem would only contain VMs, so there is no need to create a snapshot of the drive itself.
Also, some info that might help: we automate the creation and deletion of some VMs and need access using guestfish, so we need an actual file to read.

Personally, I wouldn't work without RAID if I had the option.
It is an enterprise server, but there is no RAID controller installed. So the only RAID I could do is MD or ZFS, and neither seems like a good option.

Of course, you can also use the disks natively individually.

But then I would build 3 thin pools so that not everything is lost if a disk fails. You can of course also combine the 3 disks with LVM, but this triples the probability of failure.
Seems like I am still missing something. How would a thin pool help protect data on disk failure? Or do you mean that only ~1/3 of the data would be lost?
 
Okay, thanks. So I just tried to understand why that is. Is it correct that ZFS is highly IO-demanding for VMs because of its COW feature, and that running a RAID-Z makes it even worse?
Yes, ZFS has much more overhead (because of its features), but also no: the point is that VMs are very demanding in terms of (random) IOPS, and (a single) RAIDz1 vdev has the lowest IOPS. It also has even more write amplification than ZFS stripes and mirrors. The volblocksize of your VMs is probably also a bad match, as you will have less usable space than you expect.

What kind of drives (or flash memory) are you using?
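
Regarding the volblocksize remark above, checking and adjusting it could look roughly like this; the storage ID and zvol name are only examples, and this assumes the pool is added to Proxmox as a zfspool storage. Existing zvols keep their volblocksize, so disks would have to be recreated or moved to pick up a new default:
Code:
# Check the volblocksize of an existing VM disk (zvol)
zfs get volblocksize rpool/data/vm-100-disk-0
# Change the default block size for newly created disks on that storage
pvesm set local-zfs --blocksize 16k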
 
Thanks for your opinion. I'm really trying to wrap my head around this subject, but I have a few more questions:


Yes, you are probably correct, but there would be some overhead nonetheless, correct? And I'm probably missing something, but I don't see the benefit of using LVM thin pools when I'm only using one drive per pool. Snapshotting of the VMs should work just fine, and the filesystem would only contain VMs, so there is no need to create a snapshot of the drive itself.
Also, some info that might help: we automate the creation and deletion of some VMs and need access using guestfish, so we need an actual file to read.
I don't know guestfish, but it may even be able to talk to the Proxmox API. If you absolutely need files, I would prefer XFS, as I have had the best experience with it.
But whenever possible, I skip the extra file system. With SQL Server VMs you get up to twice the performance with ZFS/LVM-thin compared to an extra file system.
It is an enterprise server, but there is no RAID controller installed. So the only RAID I could do is MD or ZFS, and neither seems like a good option.
If we could get more information on the hardware, then we would certainly have ideas on how to make the best use of it.
Seems like I am still missing something. How would a thin pool help protect data on disk failure? Or do you mean that only ~1/3 of the data would be lost?
Thin pools do not help to protect, but they are practical and simple.
With 3 pools there would simply be less loss.
 
The usable space is currently (and for the foreseeable future) not a problem. But sure, that probably could be optimized as well. The IOPS do seem relevant, though. There was a suggestion in the other replies that I will try on Monday. I hope it helps.

What kind of drives (or flash memory) are you using?
Oh, I totally forgot to include this info, sorry. I updated the original post; there are 3x 20 TB WD drives in the current RAID-Z.
 
Hello,

Based on the model number in the first post, the drives are:
Ultrastar DC HC570 CMR, 22TB HDD with 3.5" Drive Carrier
- available as SATA or SAS, with SED or TCG variants.

If I/O and responsive VMs are the main factor and the current space is largely unused, would it be possible to use SSDs (combined with the other advice)?

Hope it helps!
 
I don't know guestfish, but it may even be able to talk to the Proxmox API. If you absolutely need files, I would prefer XFS, as I have had the best experience with it.
But whenever possible, I skip the extra file system. With SQL Server VMs you get up to twice the performance with ZFS/LVM-thin compared to an extra file system.
We use the API to create the VMs, but you cannot extract information from a VM's filesystem with it, like copying files, so we have to use guestfish for that. It is optimized for libvirt, but it can also be used with plain files, so that's what we do.
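
As an illustration of that file-based guestfish workflow, something like this would copy a file out of a VM disk image (the image path is only an example):
Code:
# Open the disk image read-only, auto-mount its filesystems, copy a file out
guestfish --ro -a /var/lib/vz/images/100/vm-100-disk-0.qcow2 -i <<'EOF'
copy-out /etc/hostname /tmp
EOF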

If we could get more information on the hardware, then we would certainly have ideas on how to make the best use of it.
Sure thing, I've updated the first post to have it all in one place.
 
Hello,

Based on the model number in the first post, the drives are:
Ultrastar DC HC570 CMR, 22TB HDD with 3.5" Drive Carrier
- available as SATA or SAS, with SED or TCG variants.

If I/O and responsive VMs are the main factor and the current space is largely unused, would it be possible to use SSDs (combined with the other advice)?

Hope it helps!
Since the system is relatively new, I don't think the company wants to spend much more money on hardware at the moment. I could probably push for it, but I'd rather try to get the HDDs to perform as expected first and only upgrade later if that doesn't fix the actual problem.
 
RAID-Z is great for backups but never use it for VMs.
For decades, I thought so too, yet unfortunately it depends very much on your use case. If you have a ransomware attack and need to restore TBs worth of data while also continuing to back up to the pool, you WILL WANT SSDs for that. We had exactly this case this year at a customer site, and the restore speed was very, very poor. First they restored the critical stuff like DNS, wiki and firewall, and then came the bigger machines, which totally tanked the performance.
 
