ZFS zvol on HDD locks up VM

Dec 30, 2020
Hello,

Inside the Proxmox GUI I created a 4-member RAID10 HDD ZFS pool (4TB SAS HDDs).
I placed a VM disk inside this pool and wanted to copy a big repo from a physical machine to the VM (SCSI disk, discard=on, iothread=1, cache=none).

source (physical machine):
tar -cpf - repo | mbuffer -s 256k -m 2G -O 192.168.100.50:9090

destination (proxmox VM):
mbuffer -s 256k -m 2G -I 9090 | tar -xpf -


But this does not work at all!
  • the mbuffer quickly fills up at the destination
  • the VM becomes completely unresponsive - every command takes ages
  • the proxmox host load jumps to 40-50
  • the IOPs on one HDD (iostat) vary between 500 and 800 (way too much for a HDD)
I then attached a raw disk image file, stored in the same HDD ZFS pool, as a hard disk to this VM and repeated the test
  • the mbuffer at the destination stays at 0%
  • I constantly get 110 MiB/s transfer rate (1Gbit network connection)
  • the VM stays responsive
  • the host load increases only slightly
  • the IOPs on one HDD (iostat) vary between 60 and 130
It looks like ZFS zvols still have serious problems when used as backing storage for VM hard disks (you can find many reports on the internet that zvols are slow).
  • In the test above the zvol generates far too many IOPs for a HDD
  • Repeating the test on a SSD ZFS zvol works better, but generates 2000-3000 IOPs on the SSD
  • I've tried different volblocksize values and filesystems (xfs, ext4) - but that did not improve the results
Does anybody have suggestions for improving the zvol performance (drastically lowering the generated IOPs)?
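For reference, the per-disk IOPs figures in this thread can be watched live with iostat (from the sysstat package) and with zpool iostat; the pool name below is an example:

```shell
# Extended per-device statistics, refreshed every second:
# the w/s column shows write IOPs per disk, %util shows saturation.
iostat -x 1

# ZFS's own per-vdev view of the pool (pool name is an example):
zpool iostat -v tank 1
```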
 

tburger

Active Member
Oct 13, 2017
In the test above the zvol generates far too many IOPs for a HDD
That is likely part of the write amplification which is generated by ZFS.

It looks like ZFS zvols still have serious problems when used as backing storage for VM hard disks (you can find many reports on the internet that zvols are slow).
I don't think so. I think the main problem is that people see ZFS as an alternative RAID engine and apply their "RAID" knowledge to ZFS. That is simply wrong because the fundamentals work completely differently.
 
Dec 30, 2020
That is likely part of the write amplification which is generated by ZFS.

I don't think so. I think the main problem is that people see ZFS as an alternative RAID engine and apply their "RAID" knowledge to ZFS. That is simply wrong because the fundamentals work completely differently.

What makes me puzzled is not HW RAID vs ZFS zvol - it's ZFS raw image (ZFS dataset hosting a raw image file) vs ZFS zvol.

The difference between ZFS raw image and ZFS zvol is way too big - for the same test a zvol generates 20x to 30x the number of IOPs - this quickly saturates even a SATA SSD pool.
  • zvol on SSD-ZFS-mirror: 2000-3000 IOPs
  • raw disk image on 4-member-HDD-ZFS-RAID10: 60-130 IOPs
Data security should be the same - both the ZFS raw image and the ZFS zvol are running with sync=standard and the VM disk has cache=none.

Why is the ZFS raw image working like a charm but the ZFS zvol chokes up on this (real world) test?
 
Dec 30, 2020
I repeated the test several more times - the results did not change:

Storage backend                | IOPs to the host raw disks | VM behavior             | proxmox host load
ZFS zvol - HDD pool            | 500-800                    | completely unresponsive | > 40
ZFS zvol - SSD pool            | 1000-3500                  | responsive              | 10-16
ZFS raw image file - HDD pool  | 60-180 (with pauses)       | responsive              | 4-5

  • it looks like ZFS zvols generate huge numbers of IOPs that a ZFS dataset, on the other hand, absorbs (ZFS is good at serializing random ops)
  • running ZFS zvols on a ZFS HDD pool will not work for higher IO loads
  • ZFS zvols on a ZFS SSD pool can survive high IO loads - but they still generate huge numbers of IOPs and put a significant load on the proxmox host
  • I don't know if this is a general problem with ZFS zvols or if I could tune this behavior (write amplification) somehow
 

tburger

Which HDDs do you use? Or better: are they by any chance SMR?
Are they connected to the same controller as the SSDs?
Can you share more details about your system setup in general?
Am I right that you are actually generating the load via the network?
 
Dec 30, 2020
Software
pve-manager/6.3-3/eee5f901
Linux 5.4.78-2-pve #1 SMP PVE 5.4.78-2

Hardware
DL380p Gen8 2x E5-2650 v2 256GB RAM
LSI 2308 SAS HBA (IT mode)
4x 4TB SAS HDD (SEAGATE ST4000NM0023,SMEG4000S5 HGST HUS726040AL5210)
2x SATA SSD SAMSUNG 240GB (rpool)
4x SATA SSD Seagate/Intel (Nytro/4610) 480GB (zssd480)
2x SATA SSD Seagate Nytro 960GB (zssd960)
1Gbit and 2x 10Gbit Intel ethernet adapters

  • the SAS HDDs are 7200 rpm - CMR
  • these disks are on the same backplane as the SATA SSDs - while the test is running only the SAS HDDs are busy (there is very low IO on the SATA SSDs)
  • yes, the load is coming from the network - I'm using mbuffer instead of ssh to speed up the network traffic - mbuffer fully loads the 1Gbit NIC on the source host
on the destination proxmox VM (IP - 192.168.100.50):
mbuffer -s 256k -m 2G -I 9090 | tar -xpf -
(mbuffer is waiting on port 9090 for traffic and feeds the data to tar)

then on the source host (the source host has only 1Gbit nics):
tar -cpf - repo | mbuffer -s 256k -m 2G -O 192.168.100.50:9090
(tar feeds its stream to mbuffer, which transmits the data over the network to 192.168.100.50:9090 - the proxmox VM)

repo is a 1.5TB directory of backup files
the tar on the proxmox VM constantly gets data with 110MiB/sec over the (mbufferd) network

  • for me the problem is the huge number of (write) IOPs the zvol generates on this workload
  • if I "tar -xpf -" to a zvol on a ZFS SSD pool the IOPs increase up to 3500 - this means the workload generates and needs 3500 write IOPs to succeed
  • of course the HDD pool fails to provide 3500 IOPs and chokes - making the VM unresponsive and the proxmox host heavily loaded
As soon as I convert the VM disk (where tar -xpf - is writing) to a 2TB raw file on a dataset on the same HDD ZFS pool, everything works like a charm - I've just copied 1TB over the last two hours without a glitch - the VM is responsive and the load on the proxmox host is 3.5
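For anyone wanting to reproduce this zvol-to-raw-file conversion: it can be done via "Move disk" in the Proxmox GUI or, as a sketch, on the CLI (the VM ID, disk slot and storage name below are placeholders - the target must be a directory/dataset-backed storage that stores raw files):

```shell
# Move the scsi0 disk of VM 100 to a file-backed storage, keeping raw format.
# The old zvol can be deleted afterwards from the VM's hardware tab.
qm move_disk 100 scsi0 hdd-dir-storage --format raw
```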
 

mailinglists

Well-Known Member
Mar 14, 2012
It might be related to the volblocksize of zvols.
If you have the time, please match the volblocksize to the physical sector size of the disks, and also match the block size of the filesystem you use in your VM.
Then run the same tests.

But hopefully someone with more experience will join this conversation.
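As a side note, the physical sector size mentioned above can be checked like this (the device and pool names are examples):

```shell
# Logical and physical sector size of a pool member disk.
lsblk -o NAME,LOG-SEC,PHY-SEC /dev/sda

# The ashift the pool was created with (2^ashift = sector size ZFS assumes).
zdb -C tank | grep ashift
```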
 
Dec 30, 2020
It might be related to the volblocksize of zvols.
If you have the time, please match the volblocksize to the physical sector size of the disks, and also match the block size of the filesystem you use in your VM.
Then run the same tests.

I've already created a zvol with 4K volblocksize (8K is the default) - same results.

Additionally, I ran the test on the proxmox host directly:

  • created a 4K zvol
  • ran mkfs.xfs on it
  • mounted it

Running the test generated 500-700 IOPs on one HDD and a load of >40 on the proxmox host - just like inside the VM.
The proxmox host did not lock up (256GB memory), but the umount took over 5 minutes (the fs buffers in memory had to be synced to the disk).

Running the test on a ZFS dataset instead of a zvol generated 50-170 IOPs, and a sync returned immediately (no buffers in memory).
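The host-side test described above, sketched as commands (pool, zvol and mount-point names are placeholders; requires root):

```shell
# Create a test zvol with 4K volblocksize (the property is fixed at creation).
zfs create -V 100G -o volblocksize=4k tank/testvol

# Format and mount it, then receive the tar stream into it as before.
mkfs.xfs /dev/zvol/tank/testvol
mkdir -p /mnt/testvol
mount /dev/zvol/tank/testvol /mnt/testvol
cd /mnt/testvol && mbuffer -s 256k -m 2G -I 9090 | tar -xpf -

# The comparison run against a plain ZFS dataset:
zfs create tank/testfs
cd /tank/testfs && mbuffer -s 256k -m 2G -I 9090 | tar -xpf -
```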
 

tburger

Could you please check whether the behaviour is the same when you generate the IO locally in the VM, i.e. without the network involved?
I am asking myself whether the increase in IO is due to the fact that it is coming from the network.
Also, how much data do you actually write?
 

Dunuin

Active Member
Jun 30, 2020
It might be related to the volblocksize of zvols.
If you have the time, please match the volblocksize to the physical sector size of the disks, and also match the block size of the filesystem you use in your VM.
Then run the same tests.

But hopefully someone with more experience will join this conversation.
I also thought about that. If you access a raw image on a dataset, the recordsize of the dataset is used, which in most cases is 128K and not the 8K or 4K your zvol is using.
 
Dec 30, 2020
I also thought about that. If you access a raw image on a dataset, the recordsize of the dataset is used, which in most cases is 128K and not the 8K or 4K your zvol is using.

Yes - very good hint - 128k volblocksize makes a big difference!
  • I additionally ran some fio tests inside the VM - the 128k volblocksize fixes the problems with sequential IO (like the tar test above) and does not slow down the random read/write tests
  • the VM stays responsive during the tests
  • the IOPs on the proxmox host when using zvol VM disks are comparable to the ZFS raw images
  • on SSD pools fio IO is faster on the zvol than on the ZFS raw image
  • fio random writes on zvols on SSD pools still increase the load on the proxmox host to >40 - but the VM and the proxmox host stay responsive
  • fio random writes on ZFS raw images on SSD pools are 50% slower - the load on the proxmox host stays at 11
So for me it looks like creating zvols with 128k volblocksize should be the default - I will convert my existing zvols to 128k volblocksize (send/receive)
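Note that volblocksize can only be set at creation time, so simply receiving into an existing zvol keeps the old block size; one hedged way to convert is to create a fresh 128k zvol and block-copy the data (names and sizes below are placeholders, and the VM should be shut down first):

```shell
# New zvol with 128k volblocksize, same size as the old disk.
zfs create -V 32G -o volblocksize=128k tank/vm-100-disk-1-new

# Block-copy the old zvol onto the new one.
dd if=/dev/zvol/tank/vm-100-disk-1 of=/dev/zvol/tank/vm-100-disk-1-new \
   bs=1M status=progress

# After verifying the VM boots from the new zvol:
# zfs destroy tank/vm-100-disk-1
```

For newly created disks, the default can be set per storage via the "Block Size" field of the ZFS storage in the Proxmox GUI (the blocksize option in storage.cfg).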

Thanks to all for the input.
 

mailinglists

Hmm... if there was no downside, it would be the default setting, I think.
I guess you lose disk space then, if the block size in the PVE GUI for that datastore is set to 128k instead of 8k. Am I correct?
Does anyone see any other downsides?

I guess I will do some tests when I have the time. I might start setting a ZVOL block size higher than 8k on ZFS datastores then. :)
 

Dunuin

Hmm... if there was no downside, it would be the default setting, I think.
I guess you lose disk space then, if the block size in the PVE GUI for that datastore is set to 128k instead of 8k. Am I correct?
Most of the time it's the opposite. People lose capacity if they use raidz and don't increase the volblocksize first.
Does anyone see any other downsides?
Sure. If you don't need a higher volblocksize because of the padding overhead a raidz would produce, it is better to keep the volblocksize as small as possible. Otherwise you will lose capacity on really small writes and increase the overhead (it is bad if a guest with a smaller block size writes to a zvol with a larger volblocksize).
 
