Help me to understand the used space on ZFS

jacklayne

Active Member
Oct 3, 2018
Hello All,

I'm using Proxmox with ZFS and I created a RAIDZ1 pool with 3x 4TB disks and ashift=12.

Code:
# cat /sys/block/sdb/queue/logical_block_size
512
# cat /sys/block/sdb/queue/physical_block_size
4096

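Not in the original post, but as a quick way to double-check what the pool is actually using, the ashift can be read from the pool configuration, which should report ashift: 12 here, e.g.:

Code:
# zdb -C lxpool | grep ashift
            ashift: 12
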
These are the options enabled on the pool:

Code:
NAME                          SYNC          DEDUP  COMPRESS  ATIME
lxpool                    disabled            off       lz4    off
lxpool/32k                disabled            off       lz4    off
lxpool/32k/vm-104-disk-0  disabled            off       lz4      -
lxpool/32k/vm-107-disk-0  disabled            off       lz4      -
lxpool/8k                 disabled            off       lz4    off
lxpool/8k/vm-104-disk-0   disabled            off       lz4      -
lxpool/8k/vm-107-disk-0   disabled            off       lz4      -

Can someone help me to understand this output:

Code:
NAME                       USED  LUSED  REFER  LREFER  RECSIZE  VOLBLOCK
lxpool                    1.69T  1.28T   139K     44K     128K         -
lxpool/32k                30.3G  31.1G   128K     40K      32K         -
lxpool/32k/vm-104-disk-0  15.8G  16.6G  15.8G   16.6G        -       32K
lxpool/32k/vm-107-disk-0  14.5G  14.5G  14.5G   14.5G        -       32K
lxpool/8k                 1.66T  1.25T   128K     40K       8K         -
lxpool/8k/vm-104-disk-0   1.64T  1.23T  1.64T   1.23T        -        8K
lxpool/8k/vm-107-disk-0   19.4G  14.6G  19.4G   14.6G        -        8K

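(As a side note, that column set looks like a custom zfs list invocation; something along these lines should reproduce it:)

Code:
# zfs list -r -o name,used,logicalused,referenced,logicalreferenced,recordsize,volblocksize lxpool
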
Why, on the 8K zvols, are the used and referenced values bigger than logicalused and logicalreferenced?

With a 32K block size (volblocksize) the used/referenced values match the logical ones, but with an 8K block size the used space is bigger than the logical one.

On top of these zvols there are a Linux VM and a Windows VM:
Linux = vm-104
Windows = vm-107

Is this a waste of space?

EDIT1:
I forgot to mention that these VMs are file servers (NAS with SMB).

EDIT2:

Using a 1M block size I got the opposite:

Code:
NAME                     PROPERTY           VALUE   SOURCE
lxpool/1M/vm-104-disk-0  used               84.0G   -
lxpool/1M/vm-104-disk-0  logicalused        90.0G   -
lxpool/1M/vm-104-disk-0  referenced         84.0G   -
lxpool/1M/vm-104-disk-0  logicalreferenced  90.0G   -

Is it an advantage to have a logical value bigger than used one?

Thanks to everyone in advance!
 
Is it an advantage to have a logical value bigger than used one?

The advantage is less space usage, but with snapshots it can end up using more space, and it will in general be much slower. Best is to have the same block size on your storage as in your VM - everything else can lead to read and write amplification.
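
Not part of the original reply, but a sketch of how to check which block size the guest filesystems actually use (assuming ext4 on /dev/vda1 in the Linux VM and NTFS on C: in the Windows VM):

Code:
# inside the Linux guest: ext4 block size (usually 4096)
tune2fs -l /dev/vda1 | grep "Block size"
# inside the Windows guest: NTFS "Bytes Per Cluster" (usually 4096)
fsutil fsinfo ntfsinfo C: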

Why in the 8k zvol, the used and refer space are bigger than logicalused and logicalreferenced?

You have to consider the RAIDz1 level. At the start your pool reports the sum of all your drives as free space. A dataset will always use more space than its logical size, because the redundancy also takes up space. Another story is if you send/receive data from an ashift=9 pool; there you will also see this kind of space inflation - e.g. here:

https://forum.proxmox.com/threads/zfs-space-inflation.25230/

You also have to consider the space used by snapshots when computing the overall space usage. This all gets very complicated and very hard to read and interpret. Often the only way to really measure space usage is to send/receive the data and count the actual bytes.
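
A sketch of that send/receive approach (the @measure snapshot name is just an example):

Code:
# dry run: print the estimated stream size
zfs snapshot lxpool/8k/vm-104-disk-0@measure
zfs send -nv lxpool/8k/vm-104-disk-0@measure
# or count the actual bytes in the stream
zfs send lxpool/8k/vm-104-disk-0@measure | wc -c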
 
The advantage is less space usage, but with snapshots it can end up using more space, and it will in general be much slower. Best is to have the same block size on your storage as in your VM - everything else can lead to read and write amplification.

So if my disks are 4K, should I use a recordsize of 4K with compression disabled?

You have to consider the RAIDz1 level. At the start your pool reports the sum of all your drives as free space. A dataset will always use more space than its logical size, because the redundancy also takes up space. Another story is if you send/receive data from an ashift=9 pool; there you will also see this kind of space inflation - e.g. here:

https://forum.proxmox.com/threads/zfs-space-inflation.25230/

I know that the available space is less in RAIDZ1 due to redundancy, but that is already accounted for (the total pool is 10.9 TB, but only 7.2 TB is usable).
I saw better read/write performance with a larger block size than with 4K or 8K.

As you can see, using a 1M block size I got a better result:

Code:
NAME                            USED  LUSED  REFER  LREFER  RECSIZE  VOLBLOCK  COMPRESS  REFRATIO
lxpool/xpestore/vm-100-disk-1  1.49T  1.56T  1.49T   1.56T        -        1M       lz4     1.04x

but if I understood correctly, this could impact performance, correct?
 
So if my disks are 4K, should I use a recordsize of 4K with compression disabled?

That is the best option for performance, but the worst option for space usage (assuming your VM is block-size aligned). The best option for space usage would be to use ZFS directly inside your VM with a recordsize appropriate for your content, e.g. big files mean a big recordsize. In the end, the default 8K volblocksize in PVE is a tradeoff between the two. It also helps to use LXC as much as possible, because then you don't have an additional filesystem layer. I also get the best results with ashift=9, because you'll have a better overall compression ratio, but you have to have disks suited for it or be willing to lose performance to write amplification. You need to know what you are optimizing for, because you cannot optimize for throughput and space usage at the same time with software options alone.
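
As an illustration of the block-size tradeoff (not from the original reply; the disk name is hypothetical), a zvol with a larger volblocksize has to be created with that size from the start, either manually or via the blocksize setting of a PVE zfspool storage:

Code:
# sparse 32G zvol with a 32K block size (volblocksize cannot be changed later)
zfs create -s -V 32G -o volblocksize=32K lxpool/32k/vm-104-disk-1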

As you can see, using a 1M block size I got a better result:

Yes, this means compression pays off, because larger uncompressed blocks compress at a better ratio. This always works, but it hurts data that is rewritten constantly: you will most probably see write amplification, and with multiple snapshots you can end up using more space, because a whole large block has to be rewritten if one bit changes.
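
For example (the dataset name is made up), a dataset meant for large, rarely rewritten files could be created like this:

Code:
zfs create -o recordsize=1M -o compression=lz4 lxpool/media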

I also performed my own tests a while ago for storing images and videos on ZFS with various recordsizes:

Code:
root@proxmox ~ > zfs list -r -o name,used,lrefer,compressratio,mountpoint zpool/bilder/test
NAME                         USED  LREFER  RATIO  MOUNTPOINT
zpool/bilder/test           45,0G     53K  1.06x  /zpool/bilder/test
zpool/bilder/test/bs-0004k  6,89G   4,75G  1.00x  /zpool/bilder/test/bs-0004k
zpool/bilder/test/bs-0008k  6,82G   4,72G  1.00x  /zpool/bilder/test/bs-0008k
zpool/bilder/test/bs-0016k  5,07G   4,71G  1.01x  /zpool/bilder/test/bs-0016k
zpool/bilder/test/bs-0032k  4,85G   4,71G  1.05x  /zpool/bilder/test/bs-0032k
zpool/bilder/test/bs-0064k  4,36G   4,71G  1.08x  /zpool/bilder/test/bs-0064k
zpool/bilder/test/bs-0128k  4,33G   4,72G  1.09x  /zpool/bilder/test/bs-0128k
zpool/bilder/test/bs-0256k  4,23G   4,75G  1.09x  /zpool/bilder/test/bs-0256k
zpool/bilder/test/bs-0512k  4,22G   4,80G  1.11x  /zpool/bilder/test/bs-0512k
zpool/bilder/test/bs-1024k  4,19G   4,86G  1.12x  /zpool/bilder/test/bs-1024k

The data was always rsynced onto the dataset, because a send/receive replication stream carries the original recordsize with it.
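
That rsync-based copy might look roughly like this (the source path is a placeholder):

Code:
rsync -a /path/to/testdata/ /zpool/bilder/test/bs-0128k/
zfs get used,logicalreferenced,compressratio zpool/bilder/test/bs-0128k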
 
OK, got it! In my case the files don't change much, except for the small files in my personal cloud, but those are only a few gigabytes.

I noticed that from 32K upwards the result is almost the same.
What are the pros/cons of using 32K/64K instead of 1M?
If I understood correctly, a smaller recordsize with big files (ISOs, movies, etc.) means more blocks are written, which means more checksums and IOPS during reads/writes, correct?

Thanks a lot!
 
If I understood correctly, a smaller recordsize with big files (ISOs, movies, etc.) means more blocks are written, which means more checksums and IOPS during reads/writes, correct?

Yes, and worse compression results (if the content is compressible).

The main problem with a higher record size and "real" VMs (as opposed to LX(C) containers) is that the guest data is not always written sequentially, so a file's blocks are not necessarily consecutive on disk.
 
Yesterday I did some tests with containers (Debian 9.4).
I cloned the containers onto datasets with different recordsizes and got similar results: a higher recordsize gives better compression:

Code:
NAME                                  USED  LUSED  REFER  LREFER  VOLSIZE  RECSIZE  VOLBLOCK  COMPRESS  REFRATIO
lxpool/vmdata/32k/subvol-109-disk-0   768M   829M   768M    829M        -      32K         -       off     1.48x
lxpool/vmdata/4k/subvol-111-disk-0   1.29G   853M  1.29G    853M        -       4K         -       off     1.00x
lxpool/vmdata/64k/subvol-110-disk-0   722M   823M   722M    823M        -      64K         -       off     1.54x
lxpool/vmdata/8k/subvol-201-disk-0   1.10G   841M  1.10G    841M        -       8K         -       off     1.17x
lxpool/vmdata/subvol-112-disk-0       705M   825M   705M    825M        -     128K         -       off     1.56x
lxpool/xpestore/subvol-113-disk-0     655M   826M   655M    826M        -       1M         -       off     1.70x

Then I ran some benchmarks with bonnie++ (compression disabled):

[Attachment: upload_2018-10-17_9-21-35.png - bonnie++ benchmark results]

The container serves as a DNS server for my home, so it has a very low workload (average disk I/O is 4M in the Proxmox statistics, and about 100M of RAM usage). Could I use a higher recordsize for this kind of application?

thanks
 
The larger the volblocksize, the worse the performance you will get on small transactions. That's the downside. Also, for things like random 16K reads/writes you will get read-modify-write overhead, as detailed in http://open-zfs.org/wiki/Performance_tuning.

There have been a fair few discussions on recordsize and volblocksize recently. If you're using SSD-backed storage, it's probably a good idea to either match the expected workload (e.g. 16K for torrents, 8K for InnoDB) or match the sector size of the storage medium (e.g. 4K for a 4K SSD).

For spindles, larger sizes are beneficial for datasets, which is why recordsize defaults to 128K, but as your benchmarks show, once you go above 128K the sequential gains are very small versus the downsides on random I/O. Generally I leave ZFS at its defaults unless I need to optimise for a specific usage pattern; e.g. on my new Proxmox server I have configured a ZFS dataset with a 16K recordsize for torrent reads/writes, as they happen 16K at a time. The bit I am unsure about is whether that still applies when ZFS is not the direct filesystem and the VM's filesystem sits in between - it's a constant learning game. Since containers write to ZFS directly, you don't have that problem.
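
A sketch of such workload-matched datasets (the dataset names are just examples):

Code:
# 16K records for a torrent download dataset
zfs create -o recordsize=16K zpool/torrents
# 8K records for an InnoDB data directory
zfs create -o recordsize=8K zpool/mysql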

ZFS compression compresses each block individually (each block here being one record of up to recordsize), which throws a spanner in the works: as you have discovered, a larger block of data can achieve a better compression ratio, and better compression reduces I/O and saves space.

Now, if that benchmark were representative of your expected workload I would consider 32K the optimal recordsize, but it probably is not. A DNS server is not going to be anywhere near I/O bound, so the recordsize is unlikely to have an impact; the default of 128K is fairly high anyway, and I would consider a 1M recordsize very excessive for a DNS server.
 

Hello,

thank you for the explanation. I'm starting to figure out how ZFS works.

Today I ran several space-efficiency tests; maybe they can help someone else:

I created 3 pools, 2x single vdev + 1x RAIDz1

1st pool
Name: hc0
RAID: single vdev ( 1x 1TB Hitachi )
Disk Block size: 512n
Ashift=9 ( aligned to the disk )

2nd pool

Name: wd0
RAID: single vdev ( 1x 1.5TB WD Green )
Disk Block size: 512e ( Physical=4k, logical=512 )
Ashift=9 and Ashift=12 (I did two tests, one for each ashift)

3rd pool
Name: lxpool
RAID: RAIDz1 ( 3x 4TB WD Red )
Disk Block size: 512e ( Physical=4k, logical=512 )
Ashift=12

Then I created 8 datasets on each ashift=9 pool and 6 datasets on each ashift=12 pool, with different recordsizes and compression settings.

Recordsize: 512b ( only for ashift=9 ), 4k, 32k, 128k
Compression: lz4, off
I created 2 datasets for each recordsize, one with compression=lz4 and the other one with compression=off
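
A sketch of how those pools and datasets could have been created (device paths are placeholders):

Code:
# single-vdev pools with an explicit ashift
zpool create -o ashift=9  hc0 /dev/disk/by-id/<hitachi-1tb>
zpool create -o ashift=12 wd0 /dev/disk/by-id/<wd-green-1.5tb>
# one dataset per recordsize/compression combination, e.g.:
zfs create -o recordsize=32K -o compression=lz4 hc0/rs32k-lz4
zfs create -o recordsize=32K -o compression=off hc0/rs32k-off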

I cloned a single container onto each dataset and calculated the RATIO between USED and LUSED (charts below):
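
One way to compute that ratio per dataset (the dataset name is an example):

Code:
# -H: scripted output, -p: exact byte values
zfs get -Hp -o value used,logicalused hc0/rs32k-lz4 | paste - - | awk '{printf "%.2f\n", $1/$2}'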

[Attachments: upload_2018-10-18_18-20-27.png, upload_2018-10-18_18-20-44.png - USED/LUSED ratio charts]

As you can see, there is no difference between 512n and 512e with ashift=9, with or without compression.

[Attachments: upload_2018-10-18_18-21-32.png, upload_2018-10-18_18-21-57.png - USED/LUSED ratio charts]

Here are the differences between the 512e disks with different ashift values and RAID levels:

[Attachments: upload_2018-10-18_18-24-1.png, upload_2018-10-18_18-24-8.png - USED/LUSED ratio charts]

So if I understood correctly, with 512e disks you save more space using ashift=9, even though the disks are physically 4K. With ashift=9 you get a RATIO close to 1 and better compression even with small recordsizes, whereas with ashift=12 you get a lot of wasted space.

Now I'm going to run some benchmarks, but I read that there shouldn't be a difference in performance between ashift=9 and ashift=12, correct?
 
I am not sure; there is some information on the disk usage aspect here:

https://github.com/zfsonlinux/zfs/issues/548.

If I am thinking about the performance side correctly, reads should not be impacted too badly with ashift=9: the disk has to read a full 4K sector to return e.g. 512 bytes of data, but that is not extra head movement; it just reads more data while it's there.
For writes smaller than 4K there is read-modify-write overhead, because the drive has to read the rest of the physical sector and write it all back at once rather than just writing the 512 bytes. I wouldn't expect a huge impact from this; in modern workloads it's probably a very small overhead that may not even be noticeable.

In my view, 4K alignment on AF spinning disks is not as important as 4K alignment on SSDs. You definitely want 4K-aligned partitions, but I think ashift=9 would probably work out OK.
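
A quick way to check the partition alignment (not from the original reply; assuming the pool member is /dev/sdb):

Code:
# start sectors divisible by 8 (with 512B sectors) are 4K-aligned; 2048 is the common 1 MiB default
parted /dev/sdb unit s print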
 
So if I understood correctly, with 512e disks you save more space using ashift=9, even though the disks are physically 4K.

What is your use case for recordsize=512? What files are typically that small and worth storing on persistent storage? I'd start at a minimum recordsize of 4K, as almost every filesystem does nowadays.

Your graphs show what I already tried to explain about ashift=9 versus ashift=12 with respect to compressibility at block sizes of 4K and above.
 
If I am thinking about the performance side correctly, reads should not be impacted too badly with ashift=9: the disk has to read a full 4K sector to return e.g. 512 bytes of data, but that is not extra head movement; it just reads more data while it's there.

This is not entirely true. Say you have a single 100 MB file: with ashift=9 you will need to read more blocks from disk than with ashift=12 or higher. The same goes for metadata (=> more ARC usage). And as pool fragmentation grows, your IOPS will drop. That said, if your workload consists mostly of small files, ashift=9 does not have a big impact.
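
Pool fragmentation, mentioned above, can be watched with the standard pool properties (pool name taken from this thread):

Code:
zpool get fragmentation lxpool
zpool list -v lxpool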

For writes smaller than 4K there is read-modify-write overhead, because the drive has to read the rest of the physical sector and write it all back at once rather than just writing the 512 bytes. I wouldn't expect a huge impact from this; in modern workloads it's probably a very small overhead that may not even be noticeable.

Maybe for SSD pools!
 
