[SOLVED] Testing ZFS performance inside lxc container (mysql)

TwiX

Hi,

I just built a ZFS pool with one SSD drive for testing purposes.
Then I created 1 VM and 1 LXC container using this ZFS pool.

So, inside the KVM guest, I ran a simple test:
Code:
root@debian:~# dd if=/dev/zero of=here bs=4k count=10k oflag=direct
10240+0 records in
10240+0 records out
41943040 bytes (42 MB, 40 MiB) copied, 0.891124 s, 47.1 MB/s

OK, but the same test in the LXC container just fails:
Code:
dd if=/dev/zero of=/root/test bs=4k count=10k oflag=direct
dd: failed to open '/root/test': Invalid argument

Without oflag=direct it works:
Code:
dd if=/dev/zero of=/root/test bs=4k count=10k             
10240+0 records in
10240+0 records out
41943040 bytes (42 MB, 40 MiB) copied, 0.0830864 s, 505 MB/s

Same thing directly on the host :(

Is there any restriction regarding the disk cache when using ZFS with an LXC container?

I plan to create a MySQL Galera cluster; should I use KVM or LXC with ZFS?

Thanks a lot !
 
First: writing zeros to a zpool makes no sense as a benchmark (because of compression and ZFS optimizations).
Second: ZFS does not support O_DIRECT on subvolumes.
 
Hi,

ZFS does not support O_DIRECT at the filesystem level.
Also, dd is not a good benchmark; use fio instead.
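
For example, a rough fio sketch for random writes inside the container could look like this (the filename, size and runtime are just placeholders; buffered I/O with periodic fsync is used because O_DIRECT is not available on the ZFS dataset backing the container):
Code:
# buffered 16k random writes with an fsync every 32 writes
fio --name=mysql-sim --filename=/root/fio-test --size=1G \
    --rw=randwrite --bs=16k --ioengine=psync --fsync=32 \
    --iodepth=1 --numjobs=1 --runtime=60 --time_based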
 
Ok thanks,

So if ZFS doesn't support O_DIRECT, do you think it's a good idea to run a MySQL server on it?
 
For KVM, use plain ext4 in the VM and ensure the volblocksize is 4K.
Set the cache to none.
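
If you want to control the volblocksize PVE uses for new zvols, the zfspool storage has a blocksize option in /etc/pve/storage.cfg; a sketch (the storage name and pool are just examples):
Code:
# /etc/pve/storage.cfg -- "local-zfs" and "rpool/data" are example names
zfspool: local-zfs
        pool rpool/data
        blocksize 4k
        content images,rootdir
        sparse 1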
 
Thanks,
I noticed that disk performance is terrible when the cache is set to none... (with Ceph; I didn't test with ZFS).

I think I'll go for using a CT.
 
The cache configuration does not behave the same with every storage technology,
so please do not compare different storages; that makes no sense.
ZFS with block device emulation should use "no cache".
 
For KVM, use plain ext4 in the VM and ensure the volblocksize is 4K.
Set the cache to none.


Hi,


A 4k volblocksize is the wrong decision, if you ask me :)

If you want a minimal setup for mysql/percona with KVM and Proxmox, you need at least this (a rough zfs sketch follows after the list):
- use 2 different virtual disks for your VM: one for the OS only, with a 32-64k volblocksize for the zvol, and a second vdisk for /var/lib/mysql with a 16k volblocksize
- set primarycache=metadata on both vdisks (you do not want double caching at the zvol and OS/SQL level)
- move the MySQL logs outside of /var/lib/mysql
- disable compression on the mysql zvol (InnoDB uses compression), or disable compression in InnoDB and use ZFS compression (do not use both at the same time)
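
A rough sketch of what this could look like at the zfs level (pool and volume names are examples; in Proxmox the zvols are normally created by PVE, and volblocksize can only be set at creation time):
Code:
# create a dedicated 16k zvol for /var/lib/mysql (example name/size)
zfs create -V 50G -o volblocksize=16k rpool/data/vm-100-disk-1
# cache only metadata in ARC to avoid double caching with InnoDB
zfs set primarycache=metadata rpool/data/vm-100-disk-1
# disable ZFS compression if InnoDB compression stays enabled
zfs set compression=off rpool/data/vm-100-disk-1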


The same or equivalent applies to an LXC container:
- use 2 vdisks; the second one is for /var/lib/mysql, with a 16k block size (see the sketch below)
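
For the container, a separate mount point can be added roughly like this (VMID, storage, size and subvolume name are just examples); since containers get a ZFS dataset rather than a zvol, the equivalent of the 16k block size is the recordsize property:
Code:
# add a second volume to CT 101, mounted at /var/lib/mysql
pct set 101 -mp0 local-zfs:50,mp=/var/lib/mysql
# set the record size on the backing subvolume
zfs set recordsize=16k rpool/data/subvol-101-disk-1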

From my tests, LXC with mysql/mariadb/perconadb performs better compared with KVM. But each of them has advantages from different points of view.


In any case, any KVM VM works better with a zvol volblocksize > 8k, on any Linux/Windows OS. Other big disadvantages of the 4k size are:
- very long scrub times
- long vzdump backup times
- a huge amount of time to resilver when you need to replace a broken disk

For these reasons I do not know WHY Proxmox uses 4k by default instead of 16k ... :)

Also take into account that recent SSDs use internal compression and large page sizes > 128k (most of them), so 4k is very bad.
 
For these reasons I do not know WHY Proxmox uses 4k by default instead of 16k ... :)
Sorry for digging up that old post, but I am currently asking myself the same question.
My guess is that they want to cover the worst case, which would be a Postgres DB that uses 8k.
But on the other hand, if I get that right, the default size for Windows and Linux guests is 4k, so why not use 4k instead? A middle-ground compromise?
 
Sorry for digging up that old post, but I am currently asking myself the same question.
My guess is that they want to cover the worst case, which would be a Postgres DB that uses 8k.
But on the other hand, if I get that right, the default size for Windows and Linux guests is 4k, so why not use 4k instead? A middle-ground compromise?
Hi,

By default, on Linux guests you will have 512 bytes, and I guess it is the same on Windows.

In the case of a DB, let's say Postgres or whatever, because it uses 8k writes, 2 blocks will be written on ZFS (with a 4k volblocksize), mostly in different positions on the HDD. So when Postgres needs to re-write this 8k block, it will mostly consume 2 IOPS instead of 1 (with an 8k volblocksize, the data is written as one block, in the same region of the HDD).

Good luck!
 
Hi guletz

I don't know when they changed it, but Proxmox uses 8k by default for zvols.

Just to be sure I get this right, would you agree with:
Postgres uses 8k writes. If I use the default 8k, there is one write. If I use 4k, there are two writes. If I use 64k, there is only one write, but fragmentation and read amplification happen.

Windows seems to be using 4k. This is at least true for my Surface. According to this source, Windows has used 4k since Windows 7, and even 8k to 64k for bigger drives. But for bare-metal Windows, NTFS fragmentation is probably not a problem, because it can defrag, unlike ZFS?

What I don't understand is why TrueNAS chooses 128k as default and Proxmox chooses 8k. Is that because Windows and Linux guests have a lot of small writes compared to a TrueNAS that is mostly a NAS? Or would most guests be perfectly fine with 128k, but just to play it safe in case someone uses Postgres inside a VM, they choose 8k?
 
Hi,

but Proxmox uses 8k by default for zvol.
Yes, it is true.


Postgres uses 8k writes. If I use the default 8k, there is one write. If I use 4k, there are two writes. If I use 64k, there is only one write, but fragmentation and read amplification happen.
Yes.

The solution is to use different partitions with the necessary block size.

But for bare-metal Windows, NTFS fragmentation is probably not a problem, because it can defrag, unlike ZFS?
By default, ZFS writes blocks in advantageous positions, and most of the time fragmentation is not a problem that affects performance. Defragmentation at the guest OS filesystem level is not recommended, because the guest OS cannot know how the blocks are distributed by ZFS on the disks. For this reason, I disable disk optimisation on Windows guests.

If you want to reduce ZFS fragmentation, you can send your ZFS pool to an external pool on another system, destroy your pool, and then send it back to the original system.
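
A rough sketch of that send/destroy/send-back cycle, assuming a second machine reachable over SSH (pool, dataset, snapshot and host names are just examples, and you should have verified backups before destroying anything):
Code:
# snapshot everything and send it to a pool on another box
zfs snapshot -r rpool/data@rebalance
zfs send -R rpool/data@rebalance | ssh backuphost zfs receive -F tank/rpool-copy
# after destroying and recreating the local pool/dataset, send the data
# back; it is rewritten sequentially, which removes the fragmentation
ssh backuphost zfs send -R tank/rpool-copy@rebalance | zfs receive -F rpool/data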

What I don't understand is why TrueNAS chooses 128k as default and Proxmox chooses 8k. Is that because Windows and Linux guests have a lot of small writes compared to a TrueNAS that is mostly a NAS? Or would most guests be perfectly fine with 128k, but just to play it safe in case someone uses Postgres inside a VM, they choose 8k?

It is possible; a NAS usually has many disks, so a big block size makes sense.

Good luck / Bafta !
 
If you want to reduce ZFS fragmentation, you can send your ZFS pool to an external pool on another system, destroy your pool, and then send it back to the original system.
And it's best not to fill up your pool too much. The more filled your pool is, the faster it will fragment.

What I don't understand is why TrueNAS chooses 128k as default and Proxmox chooses 8k.
Are you sure you are not mixing this up with the 128K recordsize? As far as I know, TrueNAS defaults to different volblocksizes depending on your pool layout: https://www.truenas.com/docs/core/coretutorials/storage/pools/zvols/
Optimal Zvol Block Sizes
TrueNAS automatically recommends a space-efficient block size for new zvols. This table shows the minimum recommended volume block size values. To manually change this value, use the Block size dropdown list.
Configuration   Number of Drives   Optimal Block Size
Mirror          N/A                16k
Raidz-1         3                  16k
Raidz-1         4/5                32k
Raidz-1         6/7/8/9            64k
Raidz-1         10+                128k
Raidz-2         4                  16k
Raidz-2         5/6                32k
Raidz-2         7/8/9/10           64k
Raidz-2         11+                128k
Raidz-3         5                  16k
Raidz-3         6/7                32k
Raidz-3         8/9/10/11          64k
Raidz-3         12+                128k
Additional tuning might be required for optimal performance, depending on the workload.

PS: It would be great if PVE could do something similar when creating a new ZFSPool storage. It would reduce a lot of the support needed answering people's threads complaining about pools being smaller than expected, because they don't understand the padding overhead.
 
Defragmentation at the guest OS filesystem level is not recommended, because the guest OS cannot know how the blocks are distributed by ZFS on the disks. For this reason, I disable disk optimisation on Windows guests.
I am pretty sure that if you enable SSD emulation, Windows does not defrag but only TRIMs the disk to free up unused space on thin-provisioned disks.
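
If it helps, SSD emulation and discard can be enabled per disk on the VM side, roughly like this (the VMID, bus and volume name are just examples); with discard=on, a TRIM inside the guest frees the space on the thin-provisioned zvol:
Code:
# expose the disk as an SSD and pass discard/TRIM through to ZFS
qm set 100 -scsi0 local-zfs:vm-100-disk-0,ssd=1,discard=on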
Are you sure you are not mixing this up with the 128K recordsize?
Kinda :) What I mixed up was zvol and dataset. The zvol default is 16k and the dataset default is 128k for TrueNAS.
But if Windows and Linux use 4k as the default cluster size, would it not be even better to make 4k the default instead of the 8k PVE currently uses? Or was that done by PVE because I use a mirror?


It would be great if PVE could do something similar when creating a new ZFSPool storage
I have read about that padding problem, how it is hard to get right with compression, and how you are basically better off with a mirror for zvols. But I am unable to find it right now. Maybe a strong encouragement for mirrors in general would be good for PVE :)
 
