Not a problem - Ext4 block optimisation on ZVOLs

glaeken2

New Member
Jun 7, 2023
Hello!

I'm trying to optimise ext4 speed in the guest, and ZFS deduplication, speed, and write amplification on the host.
One of the things I'd like to achieve is aligning ext4 blocks (block groups?) to the ZVOL's 128k blocksize.

I tried mkfs.ext4 -O bigalloc -C 131072, which does give some advantage, but it's not quite what I need, AND the option is still in kernel development. Also, I sometimes use a 256kb blocksize, and that would mean every file, even a 1kb one, occupies a "virtual" 256kb cluster, which skews the "used space" on the guest side and the compression ratio on the host side pretty badly.
Compression has no problem with the "leftover" free space in the clusters, though.

What I would like to achieve:
- align files bigger than 128kb (or all files) on ext4 so that they start at a 128kb boundary. This would allow close to 100% deduplication, since the ZVOLs are set to 128kb blocksize and a file then wouldn't start "somewhere" within a 128kb block (ext4 uses 4k blocks by default, so a file can start at 0, 4k, 8k, 16k, ... relative to the ZVOL block, right? So the same file may look very different on the ZVOL side.)
I thought about the ext4 stride size. Can it be used to force ext4 to align files to 128kb blocks? The stride and stripe-width parameters are very poorly described all over the net. Most definitions are about RAID setups, and I'm not interested in how many drives my nonexistent RAID has.
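For reference, this is what passing those hints would look like (a sketch; /dev/sdX and the values are just illustrative, assuming 4k filesystem blocks and a 128k target):

Code:
```shell
# stride and stripe-width are given in filesystem blocks,
# so a 128k target with 4k blocks means 128k / 4k = 32.
# Note: this only hints the block allocator about alignment;
# it does not force individual files to start on 128k boundaries.
mkfs.ext4 -b 4096 -E stride=32,stripe-width=32 /dev/sdX
```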

I'm also thinking about non-compressed ZFS on top of a ZVOL, with recordsize set to 128kb. But the COW overhead of ZFS doesn't seem like a good choice, and I'd like to avoid the nonlinear reads/writes and gaps created by the ZFS COW system on the guest side.

I don't use partitions, so the ext4 starts right at the beginning of the block device.
 
Hey,

this is a complex task and the question really delves into the intricate mechanisms of file systems and block storage.

Starting with ext4: there is indeed a "-b" option in mke2fs to change the block size, but the Linux kernel cannot mount an ext4 filesystem whose block size exceeds the memory page size (4KB on x86), so in practice 4KB is the ceiling. You cannot go beyond that. This is a constraint of the ext4 design, which aims at general-purpose efficiency rather than large block sizes.
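To illustrate (device name is a placeholder):

Code:
```shell
# 4KB blocks: the largest size the kernel will actually mount on x86
mkfs.ext4 -b 4096 /dev/sdX

# mke2fs will create this, but only with a warning, and Linux will
# then refuse to mount it because 64KB exceeds the 4KB page size:
# mkfs.ext4 -b 65536 /dev/sdX
```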

The stride size could potentially help, but it is primarily intended for RAID configurations, where it aligns the filesystem's allocation to the RAID stripe size. It doesn't have a direct influence on the alignment of individual files.

Therefore, from the ext4 side, there are not many things you can do to achieve the goal of aligning files to start at 128KB boundaries. This is something which is generally handled at a higher level of the stack, typically by the application writing the files.

In regards to ZFS, you're correct that ZFS uses a Copy-On-Write mechanism that can introduce fragmentation over time, especially with small random writes. Deduplication in ZFS can also lead to a significant increase in RAM usage due to the necessity of storing the deduplication table in memory.
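If you do enable dedup, you can at least inspect what the dedup table costs (pool name is an example):

Code:
```shell
# Show the dedup table (DDT) histogram and its in-core/on-disk size
# for a pool that already has dedup enabled:
zpool status -D rpool

# Rough rule of thumb: on the order of ~320 bytes of RAM per unique
# block, so many unique 128k blocks can mean several GB of DDT.
```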

One workaround you might consider is the ZFS L2ARC (Level 2 Adaptive Replacement Cache) feature. It can help with read speed, but it does nothing for write speed. A SLOG can help with write speed for synchronous writes, but not asynchronous ones.
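As a sketch, adding such devices looks like this (device paths are hypothetical):

Code:
```shell
# L2ARC: extends the read cache onto a fast device
zpool add rpool cache /dev/nvme0n1

# SLOG: absorbs synchronous writes only; async writes still go
# through the normal transaction-group pipeline
zpool add rpool log /dev/nvme1n1
```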

You could consider XFS as an alternative to ext4, as it has more flexible options for controlling allocation and better large file support. However, it still doesn't have an explicit feature to align files at specific boundaries.

Overall, achieving file alignment at specific block boundaries is not a task that filesystems are generally designed to handle. It might be possible to write a custom tool that pads files to align them, or modify the source code of an open-source filesystem to do this, but it would be a non-trivial task. You should also consider whether the benefits of aligning files in this way will outweigh the complexity and potential issues it could introduce.
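One way to at least verify how a filesystem actually laid out a given file on the ZVOL is to dump its extents (file path is an example):

Code:
```shell
# filefrag reports physical offsets in filesystem blocks (4k here);
# an extent whose physical start is a multiple of 32 begins on a
# 128k boundary of the underlying device.
filefrag -v /mnt/ext4_128k/somefile
```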
 
I tried mkfs.ext4 -O bigalloc -C 131072, which does give some advantage, but it's not quite what I need, AND the option is still in kernel development. Also, I sometimes use a 256kb blocksize, and that would mean every file, even a 1kb one, occupies a "virtual" 256kb cluster, which skews the "used space" on the guest side and the compression ratio on the host side pretty badly.
Compression has no problem with the "leftover" free space in the clusters, though.
I am not convinced that the impact will be that big, and the leftover space should still be compressible, but I have no data to back that up yet. I'm just installing a VM to try it out. Which kernel did you use to get the new ext4 options?
 
I am not convinced that the impact will be that big, and the leftover space should still be compressible, but I have no data to back that up yet. I'm just installing a VM to try it out. Which kernel did you use to get the new ext4 options?
I'm using Kernel 6.1 on Debian SID and the command does not even return :(

Code:
root@dyn-031 ~ > mkfs.ext4 -O bigalloc -C 131072 /dev/sdc
mke2fs 1.47.0 (5-Feb-2023)

Warning: bigalloc file systems with a cluster size greater than
16 times the block size is considered experimental

^C^C^C^C
 
I used an almost minimal Debian SID (rsynced it, overwritten all free space with zeros) as a test and yeah, I see ...

Code:
root@guest ~ > df -PHT /mnt/ext4_4k /mnt/ext4_128k
Filesystem     Type  Size  Used Avail Use% Mounted on
/dev/sdb       ext4  8.4G  1.7G  6.3G  21% /mnt/ext4_4k
/dev/sdc       ext4  8.6G  5.7G  2.3G  72% /mnt/ext4_128k

root@storage ~ > zfs list -o name,volsize,volblocksize,used,lused,refer,lrefer,compressratio rpool/iscsi/vm-7777-disk-0 rpool/iscsi/vm-7777-disk-1
NAME                        VOLSIZE  VOLBLOCK   USED  LUSED     REFER  LREFER  RATIO
rpool/iscsi/vm-7777-disk-0       8G        4K   999M  1.55G      999M   1.55G  1.62x
rpool/iscsi/vm-7777-disk-1       8G      128K   852M  5.27G      852M   5.27G  6.34x

Overall the virtual disk needs less space on the host, but the guest disk needs to be much larger. I still have to analyze how big the deduplication benefit would be.
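For estimating that without actually enabling dedup, zdb can simulate it on the existing data (pool name as above):

Code:
```shell
# Simulated dedup: walks the pool, builds a throwaway DDT, and
# prints a histogram plus an estimated dedup ratio. Read-only,
# but it can take a long time and a lot of RAM on large pools.
zdb -S rpool
```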

I also tried XFS and played around with JFS; neither is suitable for this. XFS can't do anything here either, because the block size would exceed the page size. It seems that Linux filesystems are optimized around the default memory page size.
 
I have to say, I'm having problems with some SCSI calls to the 128K ZFS ZVOL; they sometimes block, e.g. fstrim hangs in D+ state.
 
I also tried XFS and played around with JFS; neither is suitable for this. XFS can't do anything here either, because the block size would exceed the page size. It seems that Linux filesystems are optimized around the default memory page size.
Oh... I just started to think about xfs, but I see there is no point.

ps. You can also use ext4 with the default block size on a 128k-blocksize ZVOL. It works great and reaches very high speeds, but of course it has pretty high write amplification (unless you write larger amounts of data, like more than 128kb per file per 5 seconds, linearly).
Increasing the txg write timeout helps a bit with this for log files (they have more time to accumulate data). The ext4 commit=X mount option also helps, giving ext4 more time to gather data before a write occurs. sync=disabled helps the most by allowing better aggregation, but it's a little dangerous.
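For reference, those knobs look like this (dataset name taken from the earlier zfs list output; the values are just examples):

Code:
```shell
# Host: give ZFS more time per transaction group (default is 5 s)
echo 30 > /sys/module/zfs/parameters/zfs_txg_timeout

# Host: riskiest but most effective for write aggregation
zfs set sync=disabled rpool/iscsi/vm-7777-disk-1

# Guest: let ext4 delay journal commits longer (default is 5 s)
mount -o remount,commit=30 /mnt/ext4_128k
```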
 
ps. You can also use ext4 with the default block size on a 128k-blocksize ZVOL. It works great and reaches very high speeds, but of course it has pretty high write amplification (unless you write larger amounts of data, like more than 128kb per file per 5 seconds, linearly).
Increasing the txg write timeout helps a bit with this for log files (they have more time to accumulate data). The ext4 commit=X mount option also helps, giving ext4 more time to gather data before a write occurs. sync=disabled helps the most by allowing better aggregation, but it's a little dangerous.
Yes, I know. I was interested in the better deduplicability, too.
 
