zvol vs image on top of dataset

chrcoluk

Renowned Member
Oct 7, 2018
Some guys have been doing some benchmarking and have discovered that using a raw or qcow2 image on top of ZFS performs much better than a zvol. Unless they made some error, I consider this big news.

Some links here (I cannot post direct URLs); they are Reddit threads with the following titles.

benchmarking_raw_image_vs_qcow2_vs_zvol_with_kvm/
benchmarking_zvol_vs_qcow2_with_kvm/

Now, apparently qcow2 uses a 64k cluster size by default, so it is nowhere near optimal performance out of the box. From what I can tell, creating a qcow2 disk in Proxmox does not let you set fine-tuned options, and I have no idea whether Proxmox does anything specific here.
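For what it's worth, outside the GUI the cluster size can be chosen when the image is created with qemu-img. This is only a sketch; the path and size below are placeholders, not anything Proxmox generates for you.

```
# Create a qcow2 image with a 16k cluster size instead of the 64k default
# (path and size are placeholders; adjust to your directory storage)
qemu-img create -f qcow2 -o cluster_size=16384 \
    /tank/images/100/vm-100-disk-0.qcow2 32G

# Confirm the cluster size that was actually used
qemu-img info /tank/images/100/vm-100-disk-0.qcow2
```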

I thought I would post it here in case anyone has anything to dispute those tests, and so the staff can offer advice as well.

I was initially looking for information regarding ext4/qcow2 vs zvol performance and accidentally came across this instead.
 
A zvol does support O_DIRECT, but a ZFS filesystem does not, so you will not be able to use O_DIRECT, and that has a big impact on everything consistency-related. Normally, databases use O_DIRECT to force writes to disk, and this takes time.

Normally you also won't be able to create a KVM VM on top of a ZFS directory because of this limitation, without setting the cache mode to at least writeback, which does not honor the O_DIRECT mechanism.
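A quick way to see the difference is to ask fio for direct I/O against a file on a dataset versus a zvol. This is just a sketch: the pool, dataset and zvol names are placeholders, and on the ZFS-on-Linux releases of that era the first command simply fails because the filesystem rejects O_DIRECT.

```
# File on a ZFS dataset: O_DIRECT is expected to be refused (EINVAL) on older ZoL
fio --name=odirect-file --filename=/tank/data/testfile --size=1G \
    --rw=randwrite --bs=16k --direct=1 --ioengine=libaio

# The same flags against a zvol block device are accepted
fio --name=odirect-zvol --filename=/dev/zvol/tank/testvol --size=1G \
    --rw=randwrite --bs=16k --direct=1 --ioengine=libaio
```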
 
Now, apparently qcow2 uses a 64k cluster size by default, so it is nowhere near optimal performance out of the box,

There is no single block size that is optimal for every load.

I was able to read something on Reddit about this huge discovery ... with only some results based on fio. From my experience, any test like this (fio or whatever) without any details (host description, storage layout, and so on) has no value.
That Reddit thread was only about sync write performance with a database-like load. But even so, a database will need to read some data, not only write it.

In the end, your info is useless for most pmx/kvm/qcow2 users (in my own opinion).
 
I have now been running a random 16k I/O workload (torrents) for a few days with this config; it's not an out-of-the-box config, to say the least.

The comparison is to ESXi VMFS-backed storage on very similarly spec'd hardware.

I installed Proxmox on a mirrored ZFS pool.
After installation I did a lot of work (the process was documented) to migrate it to a 1000 GB mirrored ZFS pool. This is on two 3 TB drives, so each drive had 2 TB unallocated.
I then made an extra 2 TB partition on each drive, with a single-drive ZFS pool on each of those 2 TB partitions.

So at this point I have a redundant ZFS-on-root Proxmox installation, with 1 TB of space available for redundant VM storage.
The reason for the 2 TB standalone pools (4 TB total) is that the data on them is not important if lost to a drive failure, and capacity matters more.
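If anyone wants to reproduce a similar layout, the gist is two single-disk pools on the spare partitions; the device and pool names below are placeholders rather than my actual ones.

```
# One single-drive pool per spare 2 TB partition (device names are hypothetical)
zpool create -o ashift=12 bulk1 /dev/disk/by-id/ata-DRIVE1-part4
zpool create -o ashift=12 bulk2 /dev/disk/by-id/ata-DRIVE2-part4

# Sanity check the layout
zpool status bulk1 bulk2
```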

pfSense was installed in a VM with two virtual NICs: one on the default bridge, and another on a new bridge I made as a LAN switch.
Debian was installed in a second VM connected to that virtual LAN switch, with pfSense as its gateway; its boot drive is hosted on the 1 TB mirrored ZFS pool, and two 2 TB virtual drives were made for it on the 2 TB ZFS pools.

At this point I want to specify what I configured, based on lots of reading up on things and my existing knowledge of zfs.

I used raw images for the 2 TB drives, hosted on the ZFS pools as directory storage.
Initially I used a 16k recordsize to avoid read-modify-write overhead.
Sparse storage was used.
xattr was set to sa.
primarycache was set to metadata, to avoid wasting RAM on the host, and also because I believe the host OS should not be making cache decisions in this case.
Everything else was left at the defaults for the ZFS pools.
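Roughly, those property settings look like this; the pool/dataset names are placeholders, and the sparse raw image shown is just one way of creating it.

```
# Dataset properties for the directory storage (names are hypothetical)
zfs set recordsize=16K        bulk1/images
zfs set xattr=sa              bulk1/images
zfs set primarycache=metadata bulk1/images

# Raw images created with qemu-img are sparse by default
qemu-img create -f raw /bulk1/images/101/vm-101-disk-1.raw 2T
```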

In the QEMU settings, writeback cache was used since nocache isn't supported here; there are two writeback options, and I chose the one that doesn't say dangerous.
I used VirtIO SCSI.
4 GB of RAM.
CPU in host mode with the PCID box ticked.
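The equivalent CLI would be roughly the following; the VM ID and storage name are placeholders, and the GUI sets the same options.

```
# Hypothetical VM ID 101 on a directory storage called "bulk1-dir"
qm set 101 --scsihw virtio-scsi-pci
qm set 101 --scsi1 bulk1-dir:101/vm-101-disk-1.raw,cache=writeback
qm set 101 --memory 4096
qm set 101 --cpu host,flags=+pcid
```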

Debian 9.5 was the guest OS version. The two 2 TB drives were merged into one 4 TB volume using LVM, not striped but linear (one after the other).
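Inside the guest, that linear (concatenated) LVM setup is roughly as below; the device names are placeholders, and lvcreate allocates linearly unless striping is requested.

```
# Join the two virtual disks into one linear volume (no -i/--stripes = linear)
pvcreate /dev/sdb /dev/sdc
vgcreate data /dev/sdb /dev/sdc
lvcreate -n storage -l 100%FREE data
mkfs.ext4 /dev/data/storage
```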

After I set everything up on the guest OS and started using it, the results were impressive. On ESXi-backed storage I had I/O wait bottlenecks at about 40 MB/s of writes (after the guest OS RAM cache saturated); on Proxmox in this config it was about 60 MB/s. I was then curious, as I knew higher record sizes can optimise I/O a bit more on spinning disks and give a higher lz4 compression ratio, so I upped the recordsize to 32k (moving the images off the pool and back on to apply the new recordsize), and I am now able to saturate the gigabit connection without I/O bottlenecks, blowing ESXi VMFS completely out of the water. Remember this is with primarycache set to metadata as well, which to me proves the guest OS cache is sufficient for caching.
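One detail for anyone repeating this: recordsize only applies to blocks written after the change, which is why the images had to be moved off the dataset and back. A rough sketch, with placeholder names and a temporary location assumed to have enough space:

```
# recordsize affects only newly written blocks, so the image must be rewritten
zfs set recordsize=32K bulk1/images
mv /bulk1/images/101/vm-101-disk-1.raw /rpool/tmp/vm-101-disk-1.raw
mv /rpool/tmp/vm-101-disk-1.raw /bulk1/images/101/vm-101-disk-1.raw
```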

I plan to do a few more tests, as I have another spare server with the same-spec hardware.

1 - With the 128k default recordsize; I expect this to be worse.
2 - Using the default Proxmox ZFS volume (zvol) backed storage; I think this will be worse, but I want the peace of mind of seeing whether I am right and whether the guys on Reddit are right.

I didn't use qcow2, as the benefits it offers over a raw image are already provided by ZFS, and qcow2 seems to have performance issues compared to raw.

Proxmox itself is up to date on the no-subscription repo.

Also, prefetch is disabled in ZFS to improve IOPS performance.
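That is the zfs_prefetch_disable module parameter; a sketch of setting it at runtime and persistently (the modprobe.d file name is just a convention):

```
# Runtime: takes effect immediately
echo 1 > /sys/module/zfs/parameters/zfs_prefetch_disable

# Persistent across reboots
echo "options zfs zfs_prefetch_disable=1" >> /etc/modprobe.d/zfs.conf
```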

Also, as you pointed out lnxbil, there is a question mark over what writeback caching is doing in QEMU, even with primarycache set to metadata. The guys doing the Reddit tests stated that they used nocache when testing, but as I discovered and you pointed out, nocache cannot be configured on ZFS directory-backed storage.
 
There is no single block size that is optimal for every load.

I was able to read something on Reddit about this huge discovery ... with only some results based on fio. From my experience, any test like this (fio or whatever) without any details (host description, storage layout, and so on) has no value.
That Reddit thread was only about sync write performance with a database-like load. But even so, a database will need to read some data, not only write it.

In the end, your info is useless for most pmx/kvm/qcow2 users (in my own opinion).

One guy just tested with fio, the other tested with the test suite used by that Linux benchmarking site. I agree they are only testing limited scenarios, which may or may not be relevant to other workloads, but I still felt it was worth bringing up here for further discussion.

The ZFS on Linux GitHub has a lot of issue reports from server admins showing unexpected behaviour, which were resolved by doing things that are typically not considered recommended practice, such as reducing recordsize or using a ZFS dataset instead of a ZFS volume. Some of these issues were confirmed by the devs and have planned fixes; others were left unresolved (unknown cause).
 
One guy just tested with fio, the other tested with the test suite used by that Linux benchmarking site. I agree they are only testing limited scenarios

... and for sure only they know their scenarios, so nobody cares. If I want, I can show you many tests that outperform ext4, but they are useless because nobody will use those test scenarios in the real world. By the way, if you want to show some tests and then extrapolate the results, you need to show your test case and your test environment. Otherwise your test is a joke.
But I still felt it was worth bringing up here for further discussion.

It is like taking a joke and then wasting time trying to find whether there might be something useful in it ;)

The ZFS on Linux GitHub has a lot of issue reports from server admins showing unexpected behaviour, which were resolved by doing things that are typically not considered recommended practice, such as reducing recordsize or using a ZFS dataset instead of a ZFS volume. Some of these issues were confirmed by the devs and have planned fixes; others were left unresolved (unknown cause).

Sorry, but I have used ZFS for many years, and I read the ZFS bug list from time to time.
Your statement is unfair, to say the least. When you say that there are many bugs ... what is the point? You can have thousands of bugs, but this is not important at all if none of those bugs are critical/blocking, or if most users (98%) are unaffected.
I can say that in more than 5 years I have not seen ANY bug that affected me (tens of servers with ZFS on different OSes).

In my own opinion you make many mistakes and misleading statements that do not help anybody. My suggestion is to think twice, inform yourself about this subject, and then try to write up some ideas about it.
I know it is hard to do, but if you are a good guy you can at least try.

Good luck, and be informed ;)
 
How many have actually tested in the real world? Most people will go with "defaults", or go with what the "manual" says.

I am trying not to be rude, but I am taking your last post as calling me stupid and telling me to get informed, when you know nothing about me, my experience, or what I did before making the post. You haven't told me anything about why "you" are informed or why you think it's a joke. Do you have hard data, for example?

It's limited testing, but it is better than no testing. Go on the internet and look for data where people have specifically tested different configurations and published the results; there is not much data out there. Now I can tell you a bit about myself: I have been administering Linux and BSD servers for circa 13 years, and I have been using ZFS since FreeBSD 8.1, which is since late 2013, so that's 5 years of experience, and I have lots of real-world data comparing ext4, UFS, NTFS, ZFS and other filesystems for various workloads. Part of research is always reading bug reports, as they are very useful data. Bug reports originate from people using a product and reporting their findings to a developer, and you can learn from them what the outcome is: whether a developer can reproduce a problem, prove the report wrong, or find a fix.

When you say that there are many bugs ... what is the point? You can have thousands of bugs, but this is not important at all if none of those bugs are critical/blocking, or if most users (98%) are unaffected.

Now you sound like Microsoft. Are you trying to say that if it just works, what is the point of striving for more? There is often a point to maximising performance from an investment, so I most definitely do not agree with your thinking that unless an issue is critical it should be ignored. My piece of advice to you, since you decided to offer me advice: if you come across a discussion on the internet that you feel is pointless, then simply do not take part in it; don't waste time making a post simply to tell the person to stop what they are doing because you see no personal benefit in it. If no one else has any interest in this, i.e. they don't care whether improvements can be made, then this thread will die and we all move on. It's as simple as that, really.
 
as calling me stupid and telling me to get informed,

I am truly sorry that you think this. In my opinion, being uninformed is not the same as being stupid. Maybe this is true only in my own country, so please accept my apologies!


Now you sound like Microsoft. Are you trying to say that if it just works, what is the point of striving for more? There is often a point to maximising performance from an investment ...

No, it is not about this at all. As you know, any open source project has limited resources. Nobody can even think of resolving non-critical bugs for, let's say, 2% of the user base. These kinds of bugs need a lot of time to investigate and try to solve. Most of them are related to some unusual combination of hardware/software.
 
You haven't told me anything about why "you" are informed or why you think it's a joke

In my own opinion, a valuable test must be like this:
- a detailed description of the test methodology used, such as hardware info and OS settings if non-default values are used
- the parameters of the software tools used for the tests
- how many test runs were done (3 tries, ...)

Without this minimal info, the results are very hard for other users to confirm or refute, because they cannot run the same tests in their own environment.
Without it, many users will compare apples with oranges.
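As an illustration of that kind of reproducible report, here is a fio invocation with every parameter spelled out; the job parameters and test path are only an example, not the ones from the Reddit threads, and the hardware, pool layout and ZFS properties should be reported alongside it.

```
# 16k random write, async engine, fixed queue depth and runtime; repeat 3 times
fio --name=randwrite-16k --filename=/tank/test/fio.dat --size=4G \
    --rw=randwrite --bs=16k --ioengine=libaio --iodepth=16 \
    --numjobs=4 --runtime=120 --time_based --group_reporting
```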
 
then simply do not take part in it; don't waste time making a post simply to tell the person to stop what they are doing because you see no personal benefit in it.

I did not say anything like this; it is your own supposition. To such a person I would say: do your test better, so that others can replicate your test and results. If he can do this, then his own tests and results can be useful to any dev team. More than that, maybe some users will be able to avoid problems after they see the test ("I cannot use this setup in my environment because I get the same results using the same tests as John, who posted his test method").

By the way, I am pretty sure you already know this because you are not stupid.

I am the only stupid guy because I do not know BSD like you = uninformed guy ;)
 
Now I can tell you a bit about myself: I have been administering Linux and BSD servers for circa 13 years, and I have been using ZFS since FreeBSD 8.1, which is since late 2013, so that's 5 years of experience ...

Who cares ... if I tell you that I have used Linux since 2002 and that I used ZFS on Linux back when only zfs-fuse was available, are my remarks then more valuable than yours? Your ideas could be better than mine even with less experience.
What is important is that others can think about what anybody writes in a post, and then they can say ... X wrote something that helps me, or not.
 
