ZFS on rotational drives: super bad performance

michele.scapini

New Member
Jan 11, 2025
Hi guys,
I deployed PVE on a new Dell R360 with an HBA355 and 2 rotational drives (Toshiba 4 TB) using ZFS.
This server is replacing an old VMware standalone server with 2 Windows VMs.
You will probably ask me why not use SSDs? Basically I have only 2 VMs and the cost of mixed-use SSD drives is huge, and I also don't need that much performance (RAID 1 on VMware guaranteed me 1 Gbps and maybe more).
So I said: I want to move to Proxmox (thanks, Broadcom) and switch from hardware RAID to ZFS, and everything will work without any issue!
Instead, during the VM migration I saw bad IOPS and disk performance, but I told myself it was maybe just the import procedure. It wasn't. My drive performance is at most 50-60 MB/s, and it is not constant.
The boot time of both VMs is huge, and Task Manager says disk usage is constantly at 100%.
The IO delay and IO pressure stall charts on Proxmox are always skyrocketing.
If I try to clone the smallest VM (50 GB), it takes 30 minutes to complete the task.
If I try to copy a file within the VM, it sometimes runs at 200 MB/s, but it stays at 0 MB/s for long stretches.
I tried to disable compression but it changes nothing.
I tried to give more RAM (12 GB) to ZFS but it is not giving me more performance.
Can you give me any advice to address this issue?
I would like to avoid moving to SSDs because the server is brand new.
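For context, the usual way to give ZFS a fixed amount of RAM on PVE is to cap the ARC via the zfs_arc_max module parameter; a minimal sketch, assuming that is how the 12 GB were assigned here (value in bytes):

# /etc/modprobe.d/zfs.conf -- cap the ARC at 12 GiB (12 * 1024^3 bytes)
options zfs zfs_arc_max=12884901888
# apply at runtime without a reboot
echo 12884901888 > /sys/module/zfs/parameters/zfs_arc_max
# on PVE with a ZFS root, also run: update-initramfs -u (so the limit survives a reboot)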
 
ZFS is known to have poor performance on HDDs. While it may work, it is often suggested to add 2 smaller SSDs as a special device to your ZFS HDD pool to help with performance; even better, use SSDs only. ZFS, being a CoW filesystem, needs more IOPS, and HDDs can't deliver them.

If you really don't want to get SSDs, you should use your hardware RAID controller, set the drives up as a RAID 1 and install PVE with LVM-Thin instead of ZFS.
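If you later do go the special-device route, the change itself is a one-liner. A minimal sketch, assuming the default PVE pool name rpool and two placeholder /dev/disk/by-id paths:

# add a mirrored special vdev; by default it only holds pool metadata
zpool add rpool special mirror /dev/disk/by-id/ata-SSD_ONE /dev/disk/by-id/ata-SSD_TWO
# optional: also send small blocks to the SSDs -- careful with zvols, any block
# smaller than or equal to this value lands on the SSDs, so keep it below the volblocksize
zfs set special_small_blocks=4K rpool/data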
 
Ciao Markusko,
OK, I will evaluate adding 2 SSDs as a special device (is 1 enough, or do I need 2?). I cannot use hardware RAID because it's an HBA card.
I know mdadm is not supported; can I ask if maybe Btrfs could help me?
 
Btrfs would be tricky: if you lose one drive it unexpectedly won't mount and boot, and that has to be fixed manually.
For mdadm, installing a plain Debian system first and then "upgrading" to PVE afterwards would be preferable.
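A rough sketch of that route, with placeholder device names (the array would be created from the Debian installer or a live system before PVE is layered on top):

# create a RAID 1 array from two empty disks (placeholders -- double-check the device names)
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda /dev/sdb
# watch the initial resync
cat /proc/mdstat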
 
It should be at least two in a mirror: if you lose your special device(s), all the data in that ZFS pool is gone, because all the metadata of the pool is stored there.

If you add SSDs and your budget allows it, then I would recommend using only SSDs for your ZFS pool. You could purchase used enterprise SSDs to keep the cost down. Don't use consumer-grade stuff with ZFS if you can avoid it. Depending on your workload you don't need mixed-use SSDs; read-intensive (but with PLP) could be enough.

I would not use Btrfs: while it's fine for single-drive desktop workloads, its RAID seems to be unstable and recovering from that is a real nightmare (maybe that has changed over the years?).
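As a side note, once a special vdev exists you can check where data is landing and how full the SSDs are with the per-vdev views (rpool is a placeholder for the pool name):

# capacity and allocation per vdev, including the special mirror
zpool list -v rpool
# live read/write activity per vdev
zpool iostat -v rpool 5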
 
Depending on your workload you don't need mixed-use SSDs; read-intensive (but with PLP) could be enough.
Is RI enough?
I'm more familiar with Nutanix, and they require only MU drives.
In your opinion, with 2 drives as cache, how much performance can I gain?
Is it normal to get less than the performance of a single normal HDD?
 
Depends on the workload: if your VMs don't write much you "can" get away with RI; those are often 1 DWPD, while MU are mostly 3 DWPD. Are you going to write more than 3.8 TB / day? Definitely check the spec sheet for the SSD before purchasing.

With ZFS there are different types of "devices" you can add to a pool. The most common is a special device, where the metadata is stored on SSDs for faster access while the bulk data stays on HDDs for cheap storage space. It's not a cache but a separation of data: if you lose it, you lose everything, which is why a mirror of SSDs as the special device for the HDD mirror is important.

I can't really tell how much impact a ZFS SSD special device has; I only know that if you use PBS it is highly recommended when you can't afford all-flash storage.

For ZFS it's normal to get less performance, because for every write from your VM, ZFS has to do multiple (4-8+ depending on settings) physical writes to the storage device (checksumming, metadata and other things). ZFS really needs IOPS and HDDs can't deliver them.

I'm not an expert with ZFS, but you should really read the official docs, the wiki entries and some forum posts explaining these things.
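To answer the DWPD question with data rather than a guess, it is worth measuring how much the VMs actually write per day before buying anything. A minimal sketch (rpool is a placeholder; iostat comes from the sysstat package):

# cumulative kB written per physical disk since boot -- divide by uptime for a rough daily rate
iostat -d -k
# live write bandwidth per vdev on the pool, sampled every 60 seconds
zpool iostat -v rpool 60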
 
A few things to correct: ZFS can go much faster. Since you are comparing to VMware/RAID: a hard drive can't go faster than 50-60 MB/s, and if you see GB/s you are no longer measuring the hard drive but RAM. And when the power goes out (yes, BBU RAID controllers exist, but the battery needs to be replaced every few years, else it will be dead when you need it), so do the last few minutes of written data.

So if you want the same risk and performance, just turn off sync writes and make everything asynchronous. Don't do it if you care about your data.

Now you can do something like a BBU for ZFS: use a pair of SSDs or NVRAM as a ZIL/SLOG (for writes) or as L2ARC (read cache). With NVMe you can even namespace a single drive into multiple sections. There are also "special" vdevs, which can for example send small files to an SSD and large files/streaming loads to HDD, but then you effectively have two storage tiers that can fill up independently. Many people are surprised by how many workloads use only one or the other and don't get the expected benefit.
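A sketch of those knobs (rpool, the NVMe partitions and namespaces are placeholders). The first command is the "same risk as a RAID controller without a working BBU" option and should only be used if losing the last few seconds of writes on a power cut is acceptable:

# asynchronous everything -- fast, but recent writes are lost on power failure
zfs set sync=disabled rpool
# safer alternatives: a mirrored SLOG for sync writes, and/or an L2ARC read cache
zpool add rpool log mirror /dev/nvme0n1p1 /dev/nvme1n1p1
zpool add rpool cache /dev/nvme0n1p2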
 
A hard drive can't go faster than 50-60 MB/s, and if you see GB/s you are no longer measuring the hard drive but RAM.
Ciao, I'm referring to a RAID 1 and gigabit speed, not gigabytes.
AFAIK on RAID 1 the write speed is close to half the performance of a single disk, while the read speed is close to double.
In this case I never get anywhere near those speeds.
 
If your VMs don't write much you "can" get away with RI; those are often 1 DWPD, while MU are mostly 3 DWPD. Are you going to write more than 3.8 TB / day?
There is a DC and a file server (1 TB of data and growing), so not a lot of writes per day are needed. I think 1 DWPD is enough.
 
Your "problem" is normal as you copy in the zpool zvol here (but it's the same for zfs,xfs,ext4/...) and do read and write to same disks at same time which isn't as problematic when using nvme's but for hdd's ... it's hard I/O.
 
When you do a real copy of files on an HDD or a RAID 1 virtual disk, you reach about 33% of its write performance.
So if your 4 TB HDD is able to write 180 MB/s and is in a RAID 1 or ZFS mirror, you can assume you'll reach about 60 MB/s, which is what you observed and is normal.
 
If I try to copy a file within the VM, it sometimes runs at 200 MB/s, but it stays at 0 MB/s for long stretches.
Sounds like you have issues beyond the disk subsystem. Guest type, virtual HW configuration and drivers can all play a part.

I'd begin troubleshooting on the host end. What kind of performance do you get with a similar IO pattern run by the host itself, directly against the storage?
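For example, a quick host-side baseline against the pool (a sketch only: the /rpool/data path and file name are assumptions, fio may need apt install fio, and the test file should be deleted afterwards):

# sequential 1M writes -- roughly the raw streaming speed of the mirror
fio --name=seq --filename=/rpool/data/fio.test --size=4G --bs=1M --rw=write --ioengine=psync --runtime=60 --time_based
# random 4k 70/30 read/write mix -- much closer to what the VMs generate
fio --name=rand --filename=/rpool/data/fio.test --size=4G --bs=4k --rw=randrw --rwmixread=70 --ioengine=psync --runtime=60 --time_based
rm /rpool/data/fio.test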
 
Your "problem" is normal as you copy in the zpool zvol here (but it's the same for zfs,xfs,ext4/...) and do read and write to same disks at same time which isn't as problematic when using nvme's but for hdd's ... it's hard I/O.
I never noticed this kind of error on HDD. I’m not expecting super performance, are rotative drives I expect decent performance.
 
I'd begin troubleshooting on the host end. What kind of performance do you get with a similar IO pattern run by the host itself, directly against the storage?
Can you explain that better?
If you tell me what you want me to check, I will post everything needed here.