ZFS zvol on HDD locks up VM

Dec 30, 2020
Hello,

Inside the Proxmox GUI I created a 4-member RAID10 HDD ZFS pool (4TB SAS HDDs).
I placed a VM disk inside this pool and wanted to copy a big repo from a physical machine to the VM (SCSI disk, discard=on, iothread=1, cache=none).

source (physical machine):
tar -cpf - repo | mbuffer -s 256k -m 2G -O 192.168.100.50:9090

destination (proxmox VM):
mbuffer -s 256k -m 2G -I 9090 | tar -xpf -


But this does not work at all!
  • the mbuffer quickly fills up at the destination
  • the VM becomes completely unresponsive - every command takes ages
  • the proxmox host load jumps to 40-50
  • the IOPs on one HDD (iostat) vary between 500 and 800 (way too much for a HDD)
I then attached a raw disk image file, stored in the same HDD ZFS pool, as a hard disk to this VM and repeated the test
  • the mbuffer at the destination stays at 0%
  • I constantly get 110 MiB/s transfer rate (1Gbit network connection)
  • the VM stays responsive
  • the host load increases only slightly
  • the IOPs on one HDD (iostat) vary between 60 and 130
It looks like ZFS zvols still have serious problems when used as backing storage for VM hard disks (you can find many reports on the internet that zvols are slow).
  • In the test above the zvol generates far too many IOPs for a HDD
  • Repeating the test on a SSD ZFS zvol works better, but generates 2000-3000 IOPs on the SSD
  • I've tried different volblocksize values and filesystems (xfs, ext4) - but that did not improve the results
Does anybody have suggestions for improving the zvol performance (drastically lowering the generated IOPs)?
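For reference, the per-disk IOPs figures in this thread can be watched live with iostat (from the sysstat package) and with zpool iostat; the pool name below is an example:

```shell
# Extended per-device statistics, refreshed every second:
# the w/s column shows write IOPs per disk, %util shows saturation.
iostat -x 1

# ZFS's own per-vdev view of the pool (pool name is an example):
zpool iostat -v tank 1
```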
 

tburger

Active Member
Oct 13, 2017
In the test above the zvol generates far too many IOPs for a HDD
That is likely part of the write amplification which is generated by ZFS.

It looks like ZFS zvols still have serious problems when used as backing storage for VM hard disks (you can find many reports on the internet that zvols are slow).
I don't think so. I think the main problem is that people see ZFS as an alternative RAID engine and apply their "RAID" knowledge to ZFS. That is simply wrong because the fundamentals work completely differently.
 
Dec 30, 2020
That is likely part of the write amplification which is generated by ZFS.

I don't think so. I think the main problem is that people see ZFS as an alternative RAID engine and apply their "RAID" knowledge to ZFS. That is simply wrong because the fundamentals work completely differently.

What makes me puzzled is not HW RAID vs ZFS zvol - it's ZFS raw image (ZFS dataset hosting a raw image file) vs ZFS zvol.

The difference between ZFS raw image and ZFS zvol is way too big - for the same test a zvol generates 20x to 30x the number of IOPs - this quickly saturates even a SATA SSD pool.
  • zvol on SSD-ZFS-mirror: 2000-3000 IOPs
  • raw disk image on 4-member-HDD-ZFS-RAID10: 60-130 IOPs
Data security should be the same - both the ZFS raw image and the ZFS zvol are running with sync=standard and the VM disk has cache=none.

Why is the ZFS raw image working like a charm but the ZFS zvol chokes up on this (real world) test?
 
Dec 30, 2020
I repeated the test several more times - the results did not change:

Storage backend                | IOPs to the host raw disks | VM behavior             | proxmox host load
ZFS zvol - HDD pool            | 500-800                    | completely unresponsive | > 40
ZFS zvol - SSD pool            | 1000-3500                  | responsive              | 10-16
ZFS raw image file - HDD pool  | 60-180 (with pauses)       | responsive              | 4-5

  • it looks like ZFS zvols generate huge numbers of IOPs that a ZFS dataset, on the other hand, absorbs (ZFS is good at serializing random ops)
  • running ZFS zvols on a ZFS HDD pool will not work for higher IO loads
  • ZFS zvols on a ZFS SSD pool can survive high IO loads - but they still generate huge numbers of IOPs and put a significant load on the proxmox host
  • I don't know if this is a general problem with ZFS zvols or if I could tune this behavior (write amplification) somehow
 

tburger

Which HDDs do you use? Or better: are they by any chance SMR?
Are they connected to the same controller as the SSDs?
Can you share more details about your system setup in general?
Am I right that you are actually generating the load via the network?
 
Dec 30, 2020
Software
pve-manager/6.3-3/eee5f901
Linux 5.4.78-2-pve #1 SMP PVE 5.4.78-2

Hardware
DL380p Gen8 2x E5-2650 v2 256GB RAM
LSI 2308 SAS HBA (IT mode)
4x 4TB SAS HDD (SEAGATE ST4000NM0023,SMEG4000S5 HGST HUS726040AL5210)
2x SATA SSD SAMSUNG 240GB (rpool)
4x SATA SSD Seagate/Intel (Nytro/4610) 480GB (zssd480)
2x SATA SSD Seagate Nytro 960GB (zssd960)
1Gbit and 2x 10Gbit Intel ethernet adapters

  • the SAS HDDs are 7200 rpm - CMR
  • these disks are on the same backplane as the SATA SSDs - while the test is running only the SAS HDDs are busy (there is very low IO on the SATA SSDs)
  • yes, the load is coming from the network - I'm using mbuffer instead of ssh to speed up the network traffic - mbuffer fully loads the 1Gbit NIC on the source host
on the destination proxmox VM (IP - 192.168.100.50):
mbuffer -s 256k -m 2G -I 9090 | tar -xpf -
(mbuffer is waiting on port 9090 for traffic and feeds the data to tar)

then on the source host (the source host has only 1Gbit nics):
tar -cpf - repo | mbuffer -s 256k -m 2G -O 192.168.100.50:9090
(tar feeds its stream to mbuffer, which transmits the data over the network to 192.168.100.50:9090 - the proxmox VM)

repo is a 1.5TB directory of backup files
the tar on the proxmox VM constantly gets data with 110MiB/sec over the (mbufferd) network

  • for me the problem is the huge number of (write) IOPs the zvol generates on this workload
  • if I "tar -xpf -" to a zvol on a ZFS SSD pool the IOPs increase up to 3500 - this means the workload generates and needs 3500 write IOPs to succeed
  • of course the HDD pool fails to provide 3500 IOPs and chokes - making the VM unresponsive and the proxmox host heavily loaded
As soon as I convert the VM disk (where tar -xpf - is writing) to a 2TB raw file on a dataset on the same HDD ZFS pool, everything works like a charm - I've just copied 1TB over the last two hours without a glitch - the VM is responsive and the load on the proxmox host is 3.5
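For anyone wanting to reproduce this zvol-to-raw-file conversion: it can be done via "Move disk" in the Proxmox GUI or, as a sketch, on the CLI (the VM ID, disk slot and storage name below are placeholders - the target must be a directory/dataset-backed storage that stores raw files):

```shell
# Move the scsi0 disk of VM 100 to a file-backed storage, keeping raw format.
# The old zvol can be deleted afterwards from the VM's hardware tab.
qm move_disk 100 scsi0 hdd-dir-storage --format raw
```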
 

mailinglists

Well-Known Member
Mar 14, 2012
It might be related to the volblocksize of zvols.
If you have the time, please match the volblocksize to the physical sector size of the disks, and also match the block size of the filesystem you use in your VM.
Then run the same tests.

But hopefully someone with more experience will join this conversation.
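As a side note, the physical sector size mentioned above can be checked like this (the device and pool names are examples):

```shell
# Logical and physical sector size of a pool member disk.
lsblk -o NAME,LOG-SEC,PHY-SEC /dev/sda

# The ashift the pool was created with (2^ashift = sector size ZFS assumes).
zdb -C tank | grep ashift
```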
 
Dec 30, 2020
It might be related to the volblocksize of zvols.
If you have the time, please match the volblocksize to the physical sector size of the disks, and also match the block size of the filesystem you use in your VM.
Then run the same tests.

I've already created a zvol with 4K volblocksize (8K is the default) - same results.

Additionally, I ran the test on the proxmox host directly:

  • created a 4K zvol
  • ran mkfs.xfs on it
  • mounted it

Running the test generated 500-700 IOPs on one HDD and a load of >40 on the proxmox host - just like inside the VM.
The proxmox host did not lock up (256GB memory), but the umount took over 5 minutes (the fs buffers in memory had to be synced to the disk).

Running the test on a ZFS dataset instead of a zvol generated 50-170 IOPs, and a sync returned immediately (no buffers in memory).
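The host-side test described above, sketched as commands (pool, zvol and mount-point names are placeholders; requires root):

```shell
# Create a test zvol with 4K volblocksize (the property is fixed at creation).
zfs create -V 100G -o volblocksize=4k tank/testvol

# Format and mount it, then receive the tar stream into it as before.
mkfs.xfs /dev/zvol/tank/testvol
mkdir -p /mnt/testvol
mount /dev/zvol/tank/testvol /mnt/testvol
cd /mnt/testvol && mbuffer -s 256k -m 2G -I 9090 | tar -xpf -

# The comparison run against a plain ZFS dataset:
zfs create tank/testfs
cd /tank/testfs && mbuffer -s 256k -m 2G -I 9090 | tar -xpf -
```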
 

tburger

Could you please check whether the behaviour is the same when you generate the IO locally in the VM, i.e. without the network involved?
I am asking myself whether the increase in IO is due to the fact that it is coming from the network.
Also, how much data do you actually write?
 

Dunuin

Active Member
Jun 30, 2020
It might be related to the volblocksize of zvols.
If you have the time, please match the volblocksize to the physical sector size of the disks, and also match the block size of the filesystem you use in your VM.
Then run the same tests.

But hopefully someone with more experience will join this conversation.
I also thought about that. If you access a raw image on a dataset, the recordsize of the dataset is used, which in most cases is 128K and not the 8K or 4K your zvol is using.
 
Dec 30, 2020
I also thought about that. If you access a raw image on a dataset, the recordsize of the dataset is used, which in most cases is 128K and not the 8K or 4K your zvol is using.

Yes - very good hint - 128k volblocksize makes a big difference!
  • I additionally ran some fio tests inside the VM - the 128k volblocksize fixes the problems with sequential IO (like the tar test above) and does not slow down the random read/write tests
  • the VM stays responsive during the tests
  • the IOPs on the proxmox host when using zvol VM disks are comparable to the ZFS raw images
  • on SSD pools fio IO is faster on the zvol than on the ZFS raw image
  • fio random writes on zvols on SSD pools still increase the load on the proxmox host to >40 - but the VM and the proxmox host stay responsive
  • fio random writes on ZFS raw images on SSD pools are 50% slower - the load on the proxmox host stays at 11
So for me it looks like creating zvols with 128k volblocksize should be the default - I will convert my existing zvols to 128k volblocksize (send/receive)
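Note that volblocksize can only be set at creation time, so simply receiving into an existing zvol keeps the old block size; one hedged way to convert is to create a fresh 128k zvol and block-copy the data (names and sizes below are placeholders, and the VM should be shut down first):

```shell
# New zvol with 128k volblocksize, same size as the old disk.
zfs create -V 32G -o volblocksize=128k tank/vm-100-disk-1-new

# Block-copy the old zvol onto the new one.
dd if=/dev/zvol/tank/vm-100-disk-1 of=/dev/zvol/tank/vm-100-disk-1-new \
   bs=1M status=progress

# After verifying the VM boots from the new zvol:
# zfs destroy tank/vm-100-disk-1
```

For newly created disks, the default can be set per storage via the "Block Size" field of the ZFS storage in the Proxmox GUI (the blocksize option in storage.cfg).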

Thanks to all for the input.
 

mailinglists

Hmm... if there was no downside, it would be the default setting, I think.
I guess you lose disk space then, if the block size in the PVE GUI for that datastore is set to 128k instead of 8k. Am I correct?
Does anyone see any other downsides?

I guess I will do some tests when I have the time. I might start setting a ZVOL block size higher than 8k on ZFS datastores then. :)
 

Dunuin

Hmm... if there was no downside, it would be the default setting, I think.
I guess you lose disk space then, if the block size in the PVE GUI for that datastore is set to 128k instead of 8k. Am I correct?
Most of the time it's the opposite. People lose capacity if they use raidz and don't increase the volblocksize first.
Does anyone see any other downsides?
Sure. If you don't need a higher volblocksize because of the padding overhead a raidz would produce, it is better to keep the volblocksize as small as possible. Otherwise you will lose capacity on really small writes and increase the overhead (it is bad if a guest with a smaller block size writes to a zvol with a larger volblocksize).
 
