Terrible RAID-Z2 Performance on Sequential Writes

Ruklaw

Member
Jul 30, 2021
Hi

I've been doing a lot of reading and a bit of experimentation with Proxmox this week and have run into a rather strange situation in terms of performance.

I manage the network of a large school. We've been using Hyper-V for a few years, with Storage Spaces tiered storage local to our hypervisors and Hyper-V Replication for failover.

We have recently purchased a new server, a Dell R7515, to replace one of our older hypervisors, and while configuring it I've been taking the opportunity to have a play with Proxmox to see where it stands and whether we should move across.

It's loaded with 12x 3TB SAS drives and a couple of 1.2TB Intel P3520 NVMe SSDs. The server has 128GB of 3200 ECC RAM and an EPYC 7313P processor.

I spun it up with Windows to start with and got some baseline numbers using the PERC H740P in RAID 6 mode across the 12x 3TB drives, and the numbers are very solid - running a 10GB test in AS SSD Benchmark I get 2300 MB/s sequential writes.

I put the PERC card in HBA mode and installed Proxmox 7.0-8 on a RAID-Z2 ZFS pool spanning the 12x 3TB disks (with the default settings set by the Proxmox installer), then spun up a Windows VM (using a VirtIO SCSI disk with cache mode 'none') so that I could run AS SSD Benchmark and get some comparative testing done.
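
In case it helps anyone reproduce this, the pool layout and ashift the installer produced can be checked from the Proxmox shell - a minimal sketch, assuming the installer's default pool name rpool:

zpool status rpool                       # shows the raidz2 vdev and its member disks
zpool get ashift rpool                   # 12 means 4K sectors
zfs get recordsize,compression rpool     # dataset-level defaults (these apply to files, not to zvols)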

Running AS SSD Benchmark, it starts very quick, but the write speed then hits an absolute cliff and grinds to a near halt, giving a sequential write speed of 50 MB/s. Stranger still, the 4K write speed is actually faster at nearly 59 MB/s, and the 4K 64-thread write speed is 186 MB/s (I imagine this is because AS SSD writes 10GB for the sequential test but a smaller amount of data for the 4K tests, so those haven't hit the write speed cliff yet?)

If I run the test with a smaller amount of data you can see the speed before it falls off the cliff - with a 1GB AS SSD test the sequential write speed is 3300 MB/s (!), with a 3GB test the sequential write result is already down to 183 MB/s, and so on down to the 50 MB/s seen with the 10GB test.
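
One likely factor in that 'fast start then cliff' shape is ZFS buffering async writes in RAM up to its dirty data limit and then throttling new writes down to what the pool can actually sustain. The relevant module parameters can be read from sysfs to see what the node is working with - a quick sketch, assuming stock OpenZFS defaults:

cat /sys/module/zfs/parameters/zfs_dirty_data_max          # bytes of dirty data allowed before the write throttle kicks in
cat /sys/module/zfs/parameters/zfs_dirty_data_max_percent  # the same limit expressed as a percentage of RAM
cat /sys/module/zfs/parameters/zfs_txg_timeout             # seconds between transaction group flushes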

I will caveat here that I appreciate this is a somewhat unfair test, as the previous Windows install had native disk access rather than virtualized; however, testing against one of the Intel SSDs I found sequential write performance to be very comparable: 1333 MB/s native write speed in Windows against 1232 MB/s virtualized on Proxmox.

I've done a bit of benchmarking on Proxmox using dd to try and give ZFS a fairer shake, and performance is a fair bit better, but still well behind the RAID 6 numbers - so, with the command:
dd if=/dev/zero of=/root/testfile2 bs=10G count=1 iflag=fullblock oflag=dsync

I get 344 MB/s with ZFS compression turned off (2 GB/s with compression on, but this is obviously a bit of a poor test as zeros are very compressible!)
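
For reference, compression can be checked and toggled per dataset; the commands below are a sketch, assuming the installer's default dataset names (the /root test files live on the root dataset):

zfs get compression rpool/ROOT/pve-1      # dataset backing /, where the dd test files are written
zfs set compression=off rpool/ROOT/pve-1  # disable for the uncompressed runs
zfs set compression=on rpool/ROOT/pve-1   # restore the default afterwards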

Using the harder random-data-based test:
dd if=/dev/urandom of=/root/testfile4 bs=10G count=1 iflag=fullblock oflag=dsync

the resultant speed is 171 MB/s (compression still off).

For comparison, running those tests against one of the Intel SSDs I get 2.7 GB/s for the zeros and 294 MB/s for the urandom data (SSD controller compression no doubt coming into play with the zero data there, and the speed of urandom generation perhaps capping the other test, so neither test is perfect!)
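
Since /dev/zero flatters compression and /dev/urandom can bottleneck on the CPU, a tool like fio (apt install fio) sidesteps both by generating the test data in memory. Something along these lines gives a more comparable sequential and 4K picture - a sketch only, with the test file path and sizes chosen to mirror the dd runs above:

fio --name=seqwrite --filename=/root/fio.test --rw=write --bs=1M --size=10G --end_fsync=1
fio --name=randwrite4k --filename=/root/fio.test --rw=randwrite --bs=4k --size=4G --end_fsync=1
rm /root/fio.test    # remove the test file afterwards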



So, what's going on? I did anticipate RAID-Z2 would be a bit slower on writes, but these numbers seem painful - and random writes being faster than sequential seems very strange indeed!

Obviously read speeds are great, but this is hugely skewed by the dataset fitting easily within the ARC; I won't really know how reads behave until our production load is spun up.
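
When the production load does go on, ARC size and hit rates can be watched with the tools shipped in zfsutils-linux - a quick sketch:

arc_summary | head -n 40    # current ARC size, target size and hit ratios
arcstat 5                   # live ARC hits/misses, refreshed every 5 seconds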
 

Attachments

  • Raid z2 virtio 12x3tb compression on.png
  • Dell 12x3tb SAS RAID 6.PNG
Did you change the volblocksize? The default is 8K for zvols and that is very bad for any raidz2. Look here for a table that shows padding+parity overhead vs volblocksize. Here is the blog explaining in detail how raidz2 works at the block level.

In short: for a raidz2 of 6 drives using 4K blocksize (pool with ashift=12) you want a volblocksize of at least 16K. For a raidz2 of 12 drives, at least 256K, and for two raidz2 vdevs of 6 drives each striped together, a volblocksize of at least 32K.
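
To make that overhead concrete, here is the arithmetic for the 8K default on a 12-disk raidz2 with ashift=12, plus commands to check what an existing VM disk is actually using - a sketch, assuming the default rpool/data storage and an example zvol name:

# Each sector is 4K, so an 8K volblock is written as 2 data + 2 parity sectors = 4 sectors.
# Raidz allocations are rounded up to a multiple of (parity + 1) = 3, so 4 is padded to 6 sectors:
# 24K on disk for 8K of guest data, i.e. roughly 33% space efficiency.
zfs get volblocksize rpool/data/vm-100-disk-0    # vm-100-disk-0 is just an example name
zfs list -t volume -o name,volblocksize,used,refer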
 
Thanks for the reply - I did misunderstand it at first and had some interesting results which I will now report!

I dug into the settings: ashift=12 was already set, as I'd hoped. The recordsize was set to 128K. I bumped it up to 512K, created a new virtual disk and repeated the test on that new disk, and saw basically no difference in the results - very similar to those posted.

So I then tried another tack and varied the VirtIO disk settings - unsurprisingly 'write back (unsafe)' was a lot quicker but obviously sounded like a bad idea - I then tried 'write through' and found write performance was also hugely improved!

It makes me wonder if there is some kind of bug in the VirtIO write-through mode, as the numbers seem almost unbelievable - notably, the IO delay figures on the Proxmox node summary go through the roof while these benchmarks are running and the VM becomes very sluggish to respond - I did see some graphical glitches at the end of long runs, presumably where Windows timed out accessing storage.

Anyhow, I then had another read of your message and realised that VolBlockSize was rather different from what I'd been tweaking in the first place.

I managed to work out how to change this - for the benefit of any other newbies reading this, you set it in the Proxmox web GUI from Datacenter on the left, then Storage, then edit the ZFS storage and type in the desired block size - a new virtual disk will need to be created for it to take effect. I set this to 256K as suggested and created a new disk. I also formatted the new disk with an NTFS cluster size of 256K.
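
For anyone who prefers the command line, the same setting lives in the storage configuration - a sketch, assuming the default ZFS storage is named local-zfs; if pvesm doesn't accept the option on your version, the blocksize line in /etc/pve/storage.cfg can be edited directly:

pvesm set local-zfs --blocksize 256k                  # only applies to newly created zvols
grep -A4 'zfspool: local-zfs' /etc/pve/storage.cfg    # or edit the blocksize line here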

With the default VirtIO disk settings, the sequential write speed was considerably improved (400 MB/s), although 4K writes were, somewhat understandably, poor - a little over 10 MB/s. It also showed the same phenomenon of starting quick but tailing off as time went on, just somewhat less pronounced with the higher volblocksize.

Turning on write-through for this disk bumps the performance right up again, similar to what the 8kb volblocksize disks have shown.

Ultimately I feel like I'm not much further forward!
  • I've been able to improve sequential write speed by increasing the volblocksize, but it's still way short of what hardware RAID 6 achieves, and the 4K write speed is very poor.
  • I can get great performance out of almost any ZFS config I've tried by using 'write through' mode on my virtual disks.
  • But this comes with the caveat that the great benchmark performance seems to come at the cost of system responsiveness, and I've even seen VMs crash spontaneously while running long benchmarks.
It all seems like a bit of a headache to get right, and this is before I've even started experimenting with SLOG and L2ARC setups.
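
When it does come to that, adding the Intel NVMe drives as SLOG and/or L2ARC is a couple of zpool commands - a sketch with placeholder device names, and with the usual caveats that a SLOG only helps sync writes and an L2ARC spends ARC RAM on its index:

# the by-id paths below are placeholders - check ls -l /dev/disk/by-id/ for the real names
zpool add rpool log /dev/disk/by-id/nvme-INTEL_EXAMPLE_SERIAL_1
zpool add rpool cache /dev/disk/by-id/nvme-INTEL_EXAMPLE_SERIAL_2
zpool status rpool    # the log and cache vdevs should now appear in the pool layout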

Is it madness to just run a ZFS volume on a RAID 6 virtual disk? It seems Proxmox needs ZFS for replication.
 

Attachments

  • 512k recordsize 8k volblocksize virtio write through mode.png
  • 512k recordsize 8k volblocksize virtio write through mode 8gb crystal.png
  • 512k recordsize 8k volblocksize virtio write through mode 32gb crystal.PNG
  • 256k volblocksize and ntfs cluster size virtio defaults.png
  • 256kb volblocksize and ntfs clusters, virtio write through.png
  • 256kb volblocksize default ntfs clusters, virtio write through.png
Anyhow, I then had another read of your message and realised that VolBlockSize was rather different from what I'd been tweaking in the first place.
Recordsize is only used for LXCs (dataset/file level); volblocksize is for VMs (zvol/block level).
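
A quick way to see the distinction on the system itself - assuming the default rpool/data storage and an example VM disk name:

zfs get recordsize rpool/data                    # dataset property, used for containers and files
zfs get volblocksize rpool/data/vm-100-disk-0    # zvol property backing VM disks, fixed at creation time
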
Turning on write-through for this disk bumps the performance right up again, similar to what the 8kb volblocksize disks have shown.
Generally you only want your VirtIO cache mode to be "none" when using ZFS. You can use writeback or writethrough, but then data is cached twice in RAM: the writeback/writethrough cache holds it first, and ZFS then caches the same data in the same RAM via its ARC.
Writeback (unsafe) is faster because it lies to the guest and handles all sync writes as async writes. If a power outage or kernel crash then happens, you will lose data.
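
For reference, the cache mode is a per-disk option in the VM config - a sketch, with the VM ID, bus and volume name as examples only:

qm config 100 | grep scsi0                              # shows the current disk line, including any cache= setting
qm set 100 --scsi0 local-zfs:vm-100-disk-0,cache=none   # note: repeat any other drive options here, as this replaces the whole disk entry
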
Is it madness to just run a ZFS volume on a RAID 6 virtual disk? It seems Proxmox needs ZFS for replication.
You don't want to run ZFS on top of HW RAID. See here: https://openzfs.github.io/openzfs-d...uning/Hardware.html#hardware-raid-controllers
 
