Why Does ZFS Hate My Server?

No, you mean write back (unsafe). I am talking about the other write-back option, which basically lets the underlying OS decide whether a write should be synced or unsynced. Unless you cannot trust the guest OS (particularly an old OS, or Windows XP/2003 and prior), it should be safe to set it to write back.
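If you want to try it, the cache mode can be set per virtual disk from the GUI or the Proxmox CLI; a quick sketch (the VM ID 100 and the storage/disk names are examples, adjust to your setup):

```shell
# Set the virtual disk's cache mode to writeback.
qm set 100 --scsi0 local-zfs:vm-100-disk-0,cache=writeback

# Verify the resulting disk line in the VM config.
qm config 100 | grep scsi0
```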

Please read the documentation for the entire stack to understand why things work the way they do.

ZFS and Proxmox are set up to be very safe/conservative out of the box, so if you force a sync after every write (the default) you will get what you ask for, but you won't see, for example, write aggregation, which on a 12-wide RAIDZ will limit your throughput to the speed of a single disk (every disk has to respond before the next write). As others have pointed out, 12-wide RAIDZ is not recommended for that use case; I would recommend 2x 6-disk RAIDZ2 for data safety and maximum yield, or 2-way mirrors with PBS (or 3-way if you need high availability).
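To illustrate the first suggestion, two 6-disk RAIDZ2 vdevs striped into one pool would be created roughly like this (pool and device names are placeholders; in practice use /dev/disk/by-id paths so names survive reboots):

```shell
# One pool, two RAIDZ2 vdevs of 6 disks each; ZFS stripes writes
# across the two vdevs.
zpool create tank \
    raidz2 sda sdb sdc sdd sde sdf \
    raidz2 sdg sdh sdi sdj sdk sdl
```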
 
O_DIRECT semantics (cache=none) mean that a write call does not return until the I/O to disk completes and all caches (e.g. ARC) are flushed; that is how ZFS is currently implemented.

That being said, on spinning disks you probably won't see much of a difference if you are using a zvol, the block size in the VM is aligned, and you are using O_DIRECT/O_DSYNC writes (databases should do all of those things automatically), so feel free to test performance without it. But in many cases (outside of database design, programmers generally don't think about those things) you will see a huge performance boost. If your block size is different, you are amplifying each write.
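If you want to check that alignment yourself, a sketch (the dataset and device names are examples):

```shell
# On the host: block size of the zvol backing the VM disk.
zfs get volblocksize rpool/data/vm-100-disk-0

# Inside the guest: the filesystem's block size should be a multiple
# of (ideally equal to) the volblocksize, or each write is amplified.
blockdev --getbsz /dev/sda
```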

You should not lose data unless, again, your OS or database is not properly aware of how writes to disk should function (I remember this being an issue with SQL Server, SQL Server on Linux, and Windows Storage Spaces; not sure if they ever fixed their issues, since BBU HW RAID is not standard in the cloud).

But again, that is all just performance tuning; you should be able to test different metrics (throughput instead of IOPS) while keeping cache=none.
 
So firstly, no matter what system topology I use, both Windows and Proxmox on EXT4 will perfectly saturate the SATA bus in the back, pegging out at 450+ MB/s on the metal, and about 250 MB/s in a virtual machine (for writes; reads in a VM are also 450 MB/s).
So you've shown benchmarks with 35.9K IOPS @ 140 MB/s, which is FANTASTIC for these drives (I wouldn't expect that to last as the TBW grows). Where did you see 450 MB/s, and under what benchmark?
 
Typically with enterprise servers, SAS HDDs have their write cache disabled because they are expected to be used with HW RAID and a BBU. The HW RAID has a built-in RAM cache.

I too had terrible IOPS, especially with Ceph, when using an IT-mode disk controller, e.g. a Dell H310 flashed to IT mode or a Dell HBA330.

The solution for me was to enable write cache on the SAS HDDs. IOPS returned to "normal". I'm not lacking for IOPS on workloads ranging from DBs to DHCP servers.

As for SATA drives, you have to use the /etc/hdparm.conf file to set the write cache for each drive. I use this file on a Dell R530 which is a bare-metal Proxmox Backup Server.
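As a sketch, an /etc/hdparm.conf entry per drive looks like this (the device path is an example; a /dev/disk/by-id path is preferable since plain /dev/sdX names can change between reboots):

```shell
# /etc/hdparm.conf -- enable write cache for one SATA drive
/dev/disk/by-id/ata-EXAMPLE_MODEL_SERIAL {
    write_cache = on
}
```

You can also toggle and verify it at runtime with `hdparm -W1 /dev/sdX` and `hdparm -W /dev/sdX`.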

I use the following optimizations learned through trial-and-error. YMMV.

Code:
    Set SAS HDD Write Cache Enable (WCE) (sdparm -s WCE=1 -S /dev/sd[x])
    Set VM Disk Cache to None if clustered, Writeback if standalone
    Set VM Disk controller to VirtIO-Single SCSI controller and enable IO Thread & Discard option
    Set VM CPU Type to 'Host'
    Set VM CPU NUMA on servers with 2 or more physical CPU sockets
    Set VM Networking VirtIO Multiqueue to 1
    Set VM Qemu-Guest-Agent software installed
    Set VM IO Scheduler to none/noop on Linux
    Set Ceph RBD pool to use 'krbd' option
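A few of the settings above can be verified from the shell; a sketch (device names and the VM ID are examples):

```shell
# Confirm write cache is enabled on a SAS drive (WCE should read 1).
sdparm --get=WCE /dev/sda

# Confirm the in-guest IO scheduler (should print [none]).
cat /sys/block/sda/queue/scheduler

# Check the cache/iothread/discard flags on a VM disk.
qm config 100 | grep scsi0
```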
 
The reason I swapped in those cheap drives to run tests with: in the event that you would've told me they need some kind of low-level format to work correctly, doing that on 120 GB drives vs. the 1 TB ones that are actually going to be used would've been faster. There's no reason they shouldn't get to at least 75% of the SATA bus speed of 450 MB/s, not 25 MB/s as the test shows. Speaking of the SAS drives, here is what they look like when a VM is trying to write to them:

View attachment 74384
They're pegging out at 5 MB/s, so something is definitely foundationally wrong with this server's execution of ZFS.

Here is the command you had me execute:

Code:
root@r730xd-1:~# pveperf /usr/test
CPU BOGOMIPS:      120000.24
REGEX/SECOND:      3017756
HD SIZE:           84.97 GB (rpool/ROOT/pve-1)
FSYNCS/SECOND:     6790.02
DNS EXT:           53.18 ms
DNS INT:           34.42 ms (local)

That's irritating. FSYNCS/SECOND is pretty good, almost amazing, indicating good drives for "rpool/ROOT/pve-1".
That's very weird then.

Without knowing more, it almost looks like some sort of controller with memory caching... If yes, then that's the culprit.
 
That's irritating. FSYNCS/SECOND is pretty good, almost amazing, indicating good drives for "rpool/ROOT/pve-1".
That's very weird then.

Without knowing more, it almost looks like some sort of controller with memory caching... If yes, then that's the culprit.
These drives are connected directly to the system board; the RAID card does not see them. However, I see in the BIOS that the SATA write cache is disabled. My understanding of ZFS is that it SHOULD be disabled, because ZFS does its own caching, right?
 
As soon as I switch to ZFS, things completely fall apart. I have tried an ashift of 9, 12, and 13. I have tried turning compression off and on. I have tried an ARC of 64 MB to 16 GB. I have tried zfs_dirty_data_max/zfs_dirty_data_max_max of 0 to 32 GB. Absolutely NOTHING functions.
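For reference, the ARC cap is usually set via a modprobe config rather than at runtime; a sketch (the 16 GiB value is an example):

```shell
# /etc/modprobe.d/zfs.conf -- cap ARC at 16 GiB (value in bytes):
#   options zfs zfs_arc_max=17179869184

# Or change it at runtime without a reboot:
echo 17179869184 > /sys/module/zfs/parameters/zfs_arc_max

# After editing the modprobe file, refresh the initramfs and reboot:
update-initramfs -u -k all
```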
This is like trying to fly a commercial airplane by just testing all the knobs instead of learning how to fly the thing ;)

What are you trying to achieve?
Yes, ZFS may be slower than other filesystems, because it is CoW.
And yes, RAID controllers can offer better performance, but they are not supported by ZFS.

If you are still interested in ZFS, because of other advantages that ZFS offers (like Snapshots), there are a few basic things you should follow before we run any benchmarks or try to optimize stuff:
- What is the sector size of your disks? Probably 4k, so leave ashift at 12.
- Go with mirrors. RAIDZ can have horrible performance, storage efficiency, and r/w amplification problems if you don't know what you are doing.
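Checking the sector size and the pool's current ashift is quick; a sketch (the pool name is an example):

```shell
# Physical vs. logical sector size of every disk
# (4096 physical => ashift=12 is the right choice).
lsblk -o NAME,PHY-SEC,LOG-SEC

# ashift actually in use on the pool.
zpool get ashift tank
```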
 
Where's the rest of your drives? I thought you had 12? I only see 4.
I do. I created a RAID10 of 4 just for you :) You want to see a RAID10 of all 12? Wouldn't that be a little inefficient storage-wise?
 
I do. I created a RAID10 of 4 just for you :) You want to see a RAID10 of all 12? Wouldn't that be a little inefficient storage-wise?
Yes, it will cost 50% of your storage, but it will continue to increase the speed of your benchmarks. That's the way you scale up performance with ZFS. More vdevs = more speed.
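Concretely, "more vdevs" means striping more mirror pairs into the pool; a sketch with placeholder pool and device names:

```shell
# Six striped 2-way mirrors from 12 disks (RAID10-style):
zpool create tank \
    mirror sda sdb mirror sdc sdd mirror sde sdf \
    mirror sdg sdh mirror sdi sdj mirror sdk sdl

# Or grow an existing pool by adding another mirror vdev:
zpool add tank mirror sdm sdn
```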
 
You could also do 2 RAIDZ1 (RAID5-like) vdevs of 6 drives each and it would be about the same speed as this 4-drive mirrored-vdev setup. Please do go read up on ZFS performance setups, as it doesn't work the same as old-school hardware RAID cards. More vdevs, however you set them up, increases your speed.
 
You could also do 2 RAIDZ1 (RAID5-like) vdevs of 6 drives each and it would be about the same speed as this 4-drive mirrored-vdev setup. Please do go read up on ZFS performance setups, as it doesn't work the same as old-school hardware RAID cards. More vdevs, however you set them up, increases your speed.
So you think those 12Gb/s SAS drives doing 200 MB/s sequential writes on ZFS sounds about right?
 
you want to see a raid10 of all 12? wouldn't that be a little inefficient storage-wise?
Compared to what?
Depending on ashift and volblocksize, RAIDZ can have the same or even worse storage efficiency.
Of course, it can also have way better storage efficiency.

So you think those 12Gb/s SAS drives doing 200 MB/s sequential writes on ZFS sounds about right?
Depends, but more or less, yes.
Maybe I missed it, but do we know what drives you use?
 
