[SOLVED] Testing ZFS performance inside lxc container (mysql)

TwiX

Hi,

I just built a ZFS pool with one SSD drive for testing purposes.
Then I created 1 VM and 1 LXC container using this ZFS pool.

So, inside the KVM guest, I ran a simple test:
Code:
root@debian:~# dd if=/dev/zero of=here bs=4k count=10k oflag=direct
10240+0 records in
10240+0 records out
41943040 bytes (42 MB, 40 MiB) copied, 0.891124 s, 47.1 MB/s

OK, but the same test in the LXC container just fails:
Code:
dd if=/dev/zero of=/root/test bs=4k count=10k oflag=direct
dd: failed to open '/root/test': Invalid argument

Without oflag=direct it works:
Code:
dd if=/dev/zero of=/root/test bs=4k count=10k             
10240+0 records in
10240+0 records out
41943040 bytes (42 MB, 40 MiB) copied, 0.0830864 s, 505 MB/s

Same thing directly on the host :(

Is there any restriction regarding the disk cache when using ZFS with an LXC container?

I plan to create a MySQL Galera cluster; should I use KVM or LXC with ZFS?

Thanks a lot !
 
First: writing zeros to a zpool makes no sense as a benchmark (because of compression and ZFS optimizations).
Second: ZFS does not support O_DIRECT on subvolumes.
 
Hi,

ZFS does not support O_DIRECT at the filesystem level.
Also, dd is not a good benchmark; use fio instead.
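
For example, a rough fio sketch for random writes inside the container could look like this (the filename, size and runtime are just placeholders; buffered I/O with periodic fsync is used because O_DIRECT is not available on the ZFS dataset backing the container):
Code:
# buffered 16k random writes with an fsync every 32 writes
fio --name=mysql-sim --filename=/root/fio-test --size=1G \
    --rw=randwrite --bs=16k --ioengine=psync --fsync=32 \
    --iodepth=1 --numjobs=1 --runtime=60 --time_based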
 
Ok thanks,

So if ZFS doesn't support O_DIRECT, do you think it's a good idea to run a MySQL server on it?
 
For KVM, use plain ext4 in the VM and ensure the volblocksize is 4K.
Set the cache to none.
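
If you want to control the volblocksize PVE uses for new zvols, the zfspool storage has a blocksize option in /etc/pve/storage.cfg; a sketch (the storage name and pool are just examples):
Code:
# /etc/pve/storage.cfg -- "local-zfs" and "rpool/data" are example names
zfspool: local-zfs
        pool rpool/data
        blocksize 4k
        content images,rootdir
        sparse 1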
 
Thanks,
I noticed that disk performance is terrible when the cache is set to none... (with Ceph; I didn't test with ZFS).

I think I'll go for using a CT.
 
The cache configuration does not behave the same with every storage technology,
so please do not compare different storages; that makes no sense.
ZFS with block device emulation should use "no cache".
 
For KVM, use plain ext4 in the VM and ensure the volblocksize is 4K.
Set the cache to none.


Hi,


A 4k volblocksize is the wrong decision, if you ask me :)

If you want a minimal setup for mysql/percona with KVM and Proxmox, you need at least this (a rough zfs sketch follows after the list):
- use 2 different virtual disks for your VM: one for the OS only, with a 32-64k volblocksize for the zvol, and a second vdisk for /var/lib/mysql with a 16k volblocksize
- set primarycache=metadata on both vdisks (you do not want double caching at the zvol and OS/SQL level)
- move the MySQL logs outside of /var/lib/mysql
- disable compression on the mysql zvol (InnoDB uses compression), or disable compression in InnoDB and use ZFS compression (do not use both at the same time)
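
A rough sketch of what this could look like at the zfs level (pool and volume names are examples; in Proxmox the zvols are normally created by PVE, and volblocksize can only be set at creation time):
Code:
# create a dedicated 16k zvol for /var/lib/mysql (example name/size)
zfs create -V 50G -o volblocksize=16k rpool/data/vm-100-disk-1
# cache only metadata in ARC to avoid double caching with InnoDB
zfs set primarycache=metadata rpool/data/vm-100-disk-1
# disable ZFS compression if InnoDB compression stays enabled
zfs set compression=off rpool/data/vm-100-disk-1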


The same or equivalent applies to an LXC container:
- use 2 vdisks; the second one is for /var/lib/mysql, with a 16k block size (see the sketch below)
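
For the container, a separate mount point can be added roughly like this (VMID, storage, size and subvolume name are just examples); since containers get a ZFS dataset rather than a zvol, the equivalent of the 16k block size is the recordsize property:
Code:
# add a second volume to CT 101, mounted at /var/lib/mysql
pct set 101 -mp0 local-zfs:50,mp=/var/lib/mysql
# set the record size on the backing subvolume
zfs set recordsize=16k rpool/data/subvol-101-disk-1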

From my tests, LXC with mysql/mariadb/perconadb performs better compared with KVM. But each of them has advantages from different points of view.


In any case, any KVM VM works better with a zvol volblocksize > 8k, on any Linux/Windows OS. Other big disadvantages of the 4k size are:
- very long scrub times
- long vzdump backup times
- a huge amount of time to resilver when you need to replace a broken disk

For these reasons I do not know WHY Proxmox uses 4k by default instead of 16k ... :)

Also take into account that recent SSDs use internal compression and large page sizes > 128k (most of them), so 4k is very bad.
 
For these reasons I do not know WHY Proxmox uses 4k by default instead of 16k ... :)
Sorry for digging up that old post, but I am currently asking myself the same question.
My guess is that they want to cover the worst case, which would be a Postgres DB that uses 8k.
But on the other hand, if I get that right, the default size for Windows and Linux guests is 4k, so why not use 4k instead? A middle-ground compromise?
 
Sorry for digging up that old post, but I am currently asking myself the same question.
My guess is that they want to cover the worst case, which would be a Postgres DB that uses 8k.
But on the other hand, if I get that right, the default size for Windows and Linux guests is 4k, so why not use 4k instead? A middle-ground compromise?
Hi,

By default, on Linux guests you will have 512 bytes, and I guess it is the same on Windows.

In the case of a DB, let's say Postgres or whatever, because it uses 8k writes, 2 blocks will be written on ZFS (with a 4k volblocksize), mostly in different positions on the HDD. So when Postgres needs to re-write this 8k block, it will mostly consume 2 IOPS instead of 1 (with an 8k volblocksize, the data is written as one block, in the same region of the HDD).

Good luck!
 
Hi guletz

I don't know when they changed it, but Proxmox uses 8k by default for zvols.

Just to be sure I get this right, would you agree with:
Postgres uses 8k writes. If I use the default 8k, there is one write. If I use 4k, there are two writes. If I use 64k, there is only one write, but fragmentation and read amplification happen.

Windows seems to be using 4k. This is at least true for my Surface. According to this source, Windows has used 4k since Windows 7, and even 8k to 64k for bigger drives. But for bare-metal Windows, NTFS fragmentation is probably not a problem, because it can defrag, unlike ZFS?

What I don't understand is why TrueNAS chooses 128k as default and Proxmox chooses 8k. Is that because Windows and Linux guests have a lot of small writes compared to a TrueNAS that is mostly a NAS? Or would most guests be perfectly fine with 128k, but just to play it safe in case someone uses Postgres inside a VM, they choose 8k?
 
Hi,

but Proxmox uses 8k by default for zvol.
Yes, it is true.


Postgres uses 8k writes. If I use the default 8k, there is one write. If I use 4k, there are two writes. If I use 64k, there is only one write, but fragmentation and read amplification happen.
Yes.

The solution is to use different partitions with the necessary block size.

But for bare-metal Windows, NTFS fragmentation is probably not a problem, because it can defrag, unlike ZFS?
By default, ZFS writes blocks in advantageous positions, and most of the time fragmentation is not a problem that affects performance. Defragmentation at the guest OS filesystem level is not recommended, because the guest OS cannot know how the blocks are distributed by ZFS on the disks. For this reason, I disable disk optimisation on Windows guests.

If you want to reduce ZFS fragmentation, you can send your ZFS pool to an external pool on another system, destroy your pool, and then send it back to the original system.
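
A rough sketch of that send/destroy/send-back cycle, assuming a second machine reachable over SSH (pool, dataset, snapshot and host names are just examples, and you should have verified backups before destroying anything):
Code:
# snapshot everything and send it to a pool on another box
zfs snapshot -r rpool/data@rebalance
zfs send -R rpool/data@rebalance | ssh backuphost zfs receive -F tank/rpool-copy
# after destroying and recreating the local pool/dataset, send the data
# back; it is rewritten sequentially, which removes the fragmentation
ssh backuphost zfs send -R tank/rpool-copy@rebalance | zfs receive -F rpool/data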

What I don't understand is why TrueNAS chooses 128k as default and Proxmox chooses 8k. Is that because Windows and Linux guests have a lot of small writes compared to a TrueNAS that is mostly a NAS? Or would most guests be perfectly fine with 128k, but just to play it safe in case someone uses Postgres inside a VM, they choose 8k?

It is possible; a NAS usually has many disks, so a big block size makes sense.

Good luck / Bafta !
 
If you want to reduce ZFS fragmentation, you can send your ZFS pool to an external pool on another system, destroy your pool, and then send it back to the original system.
And it's best not to fill up your pool too much. The more filled your pool is, the faster it will fragment.

What I don't understand is why TrueNAS chooses 128k as default and Proxmox chooses 8k.
Are you sure you are not mixing this up with the 128K recordsize? As far as I know, TrueNAS defaults to different volblocksizes depending on your pool layout: https://www.truenas.com/docs/core/coretutorials/storage/pools/zvols/
Optimal Zvol Block Sizes
TrueNAS automatically recommends a space-efficient block size for new zvols. This table shows the minimum recommended volume block size values. To manually change this value, use the Block size dropdown list.
Configuration   Number of Drives   Optimal Block Size
Mirror          N/A                16k
Raidz-1         3                  16k
Raidz-1         4/5                32k
Raidz-1         6/7/8/9            64k
Raidz-1         10+                128k
Raidz-2         4                  16k
Raidz-2         5/6                32k
Raidz-2         7/8/9/10           64k
Raidz-2         11+                128k
Raidz-3         5                  16k
Raidz-3         6/7                32k
Raidz-3         8/9/10/11          64k
Raidz-3         12+                128k
Additional tuning might be required for optimal performance, depending on the workload.

PS: It would be great if PVE could do something similar when creating a new ZFSPool storage. It would reduce a lot of the support needed answering people's threads complaining about pools being smaller than expected, because they don't understand the padding overhead.
 
Defragmentation at the guest OS filesystem level is not recommended, because the guest OS cannot know how the blocks are distributed by ZFS on the disks. For this reason, I disable disk optimisation on Windows guests.
I am pretty sure that if you enable SSD emulation, Windows does not defrag but only TRIMs the disk to free up unused space on thin-provisioned disks.
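
If it helps, SSD emulation and discard can be enabled per disk on the VM side, roughly like this (the VMID, bus and volume name are just examples); with discard=on, a TRIM inside the guest frees the space on the thin-provisioned zvol:
Code:
# expose the disk as an SSD and pass discard/TRIM through to ZFS
qm set 100 -scsi0 local-zfs:vm-100-disk-0,ssd=1,discard=on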
Are you sure you are not mixing this up with the 128K recordsize?
Kinda :) What I mixed up was zvol and dataset. The zvol default is 16k and the dataset default is 128k for TrueNAS.
But if Windows and Linux use 4k as the default cluster size, would it not be even better to make 4k the default instead of the 8k PVE currently uses? Or was that done by PVE because I use a mirror?


It would be great if PVE could do something similar when creating a new ZFSPool storage
I have read about that padding problem, how it is hard to get right with compression, and how you are basically better off with a mirror for zvols. But I am unable to find it right now. Maybe a strong encouragement for mirrors in general would be good for PVE :)
 
