zfs used size vs volsize

em.tie

Hi all,

I am running Proxmox on two hardware servers in a cluster, with a Raspberry Pi as quorum device. Each server has 4x 4TB SSDs for the data of VMs and containers. I was using mdadm to create a RAID-5 array from the disks, but migrating VMs or containers from one server to the other takes too long for me, so I wanted to switch to ZFS raidz. I created a zpool called rpool on one of the servers, ran some benchmarks, and the results looked good. So I started migrating volumes to the rpool and noticed that the used size of a volume is 1.5 times bigger than the actual volsize, so I can't migrate all my VMs there :-( . Is that normal behavior? Can I fix it somehow?

I created the rpool via the Proxmox web GUI and the settings look like this:

zpool status

Bash:
 pool: rpool
 state: ONLINE
config:

        NAME                                       STATE     READ WRITE CKSUM
        rpool                                      ONLINE       0     0     0
          raidz1-0                                 ONLINE       0     0     0
            ata-SanDisk_SDSSDH3_4T00_20438Q420294  ONLINE       0     0     0
            ata-SanDisk_SDSSDH3_4T00_20438Q421193  ONLINE       0     0     0
            ata-SanDisk_SDSSDH3_4T00_20438Q420164  ONLINE       0     0     0
            ata-SanDisk_SDSSDH3_4T00_20438Q421252  ONLINE       0     0     0

errors: No known data errors

zfs list
Bash:
NAME                  USED  AVAIL     REFER  MOUNTPOINT
rpool                5.94T  4.30T      140K  /rpool
rpool/vm-101-disk-1  5.94T  7.92T     2.32T  -

zfs get compressratio,used,available,volblocksize,volsize

Bash:
rpool                compressratio  1.02x     -
rpool                used           5.94T     -
rpool                available      4.30T     -
rpool                volblocksize   -         -
rpool                volsize        -         -
rpool/vm-101-disk-1  compressratio  1.02x     -
rpool/vm-101-disk-1  used           5.94T     -
rpool/vm-101-disk-1  available      7.92T     -
rpool/vm-101-disk-1  volblocksize   8K        default
rpool/vm-101-disk-1  volsize        4T        local

zfs get all rpool
Bash:
NAME   PROPERTY              VALUE                  SOURCE
rpool  type                  filesystem             -
rpool  creation              Thu Aug 26 22:06 2021  -
rpool  used                  5.94T                  -
rpool  available             4.30T                  -
rpool  referenced            140K                   -
rpool  compressratio         1.02x                  -
rpool  mounted               yes                    -
rpool  quota                 none                   default
rpool  reservation           none                   default
rpool  recordsize            128K                   default
rpool  mountpoint            /rpool                 default
rpool  sharenfs              off                    default
rpool  checksum              on                     default
rpool  compression           on                     local
rpool  atime                 on                     default
rpool  devices               on                     default
rpool  exec                  on                     default
rpool  setuid                on                     default
rpool  readonly              off                    default
rpool  zoned                 off                    default
rpool  snapdir               hidden                 default
rpool  aclmode               discard                default
rpool  aclinherit            restricted             default
rpool  createtxg             1                      -
rpool  canmount              on                     default
rpool  xattr                 on                     default
rpool  copies                1                      default
rpool  version               5                      -
rpool  utf8only              off                    -
rpool  normalization         none                   -
rpool  casesensitivity       sensitive              -
rpool  vscan                 off                    default
rpool  nbmand                off                    default
rpool  sharesmb              off                    default
rpool  refquota              none                   default
rpool  refreservation        none                   default
rpool  guid                  9110619683008062517    -
rpool  primarycache          all                    default
rpool  secondarycache        all                    default
rpool  usedbysnapshots       0B                     -
rpool  usedbydataset         140K                   -
rpool  usedbychildren        5.94T                  -
rpool  usedbyrefreservation  0B                     -
rpool  logbias               latency                default
rpool  objsetid              54                     -
rpool  dedup                 off                    default
rpool  mlslabel              none                   default
rpool  sync                  standard               default
rpool  dnodesize             legacy                 default
rpool  refcompressratio      1.00x                  -
rpool  written               140K                   -
rpool  logicalused           1.64T                  -
rpool  logicalreferenced     42K                    -
rpool  volmode               default                default
rpool  filesystem_limit      none                   default
rpool  snapshot_limit        none                   default
rpool  filesystem_count      none                   default
rpool  snapshot_count        none                   default
rpool  snapdev               hidden                 default
rpool  acltype               off                    default
rpool  context               none                   default
rpool  fscontext             none                   default
rpool  defcontext            none                   default
rpool  rootcontext           none                   default
rpool  relatime              off                    default
rpool  redundant_metadata    all                    default
rpool  overlay               on                     default
rpool  encryption            off                    default
rpool  keylocation           none                   default
rpool  keyformat             none                   default
rpool  pbkdf2iters           0                      default
rpool  special_small_blocks  0                      default

What further information can I provide?

Thanks for helping me out :)

cu emtie
 
If you use a volblocksize of 8K with raidz1 you will lose 25% of your raw capacity to parity and another 25% of your raw capacity to padding overhead (in other words, for every 2 blocks of data you get 1 block of padding, so everything ends up 50% bigger, which is the 1.5x factor you are seeing). If you want to minimize padding overhead you need to recreate all your zvols (restoring from backup or migrating should count as recreating and work) with a bigger volblocksize. Here is a table that shows the parity+padding overhead, and here is explained why there is padding overhead and how to calculate it. You can set the volblocksize for that pool using the WebUI: Datacenter -> Storage -> YourPool -> Edit -> Blocksize
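If you prefer the CLI, something like this should do the same (just a sketch, assuming your Proxmox storage entry is also named rpool, check /etc/pve/storage.cfg; existing zvols keep their old volblocksize until they are recreated):
Bash:
# hypothetical example: let new zvols on the storage "rpool" be created with a 16K volblocksize
pvesm set rpool --blocksize 16k
# after recreating/migrating a disk, verify it picked up the new block size
zfs get volblocksize rpool/vm-101-disk-1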
4 drives is a very bad number for raidz1 and I would really recommend using a striped mirror instead, especially as VM storage, because you get better IOPS that way.
If you still want to use raidz1, here is how much you will lose to parity + padding (the 8K case is worked through below the list):
4K-8K volblocksize = 50% raw capacity lost
16-32K = 33% raw capacity lost
64-128K = 27% raw capacity lost
256-512K = 26% raw capacity lost
1024K = 25% raw capacity lost
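To check the first row with the numbers from this thread (just back-of-the-envelope math, assuming ashift=12, i.e. 4K sectors):
Bash:
# 4-disk raidz1, 4K sectors, 8K volblocksize
data=2                                    # 8K / 4K = 2 data sectors per block
parity=1                                  # raidz1: 1 parity sector per stripe of up to 3 data sectors
total=$(( (data + parity + 1) / 2 * 2 ))  # allocations are padded to a multiple of parity+1 = 2 -> 4 sectors
echo "$(( 100 - data * 100 / total ))% of raw capacity lost"   # -> 50%; the padding share is roughly what
                                                               # makes zfs list report ~1.5x the volsize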

Higher volblocksizes will be very bad because of all the read and write amplification when writing smaller blocks. Let's say, for example, you want to run a MySQL DB that uses 16K blocks. In that case I wouldn't use a volblocksize bigger than 16K, or the read/write amplification would be terrible. But with a 16K volblocksize you lose a lot of capacity... so you need to decide between good performance + longer SSD life and high capacity.
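If you are not sure what block size your database actually uses, you can ask it; for MySQL/MariaDB it is the InnoDB page size (just an illustration, run inside the guest):
Bash:
# InnoDB uses 16K pages by default
mysql -e "SHOW VARIABLES LIKE 'innodb_page_size';"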
 
That's okay for the space overhead, but what are the performance implications of setting the volblocksize to something other than 8K?
 
That really depends on your workload. Let's say you have a 128K volblocksize and want to do a 16K sync write. Now every 16K write causes a 128K read + a 128K write, so your SSDs will die 8 times faster and writes are at least 8 times slower. And every 16K random read needs to read 128K too, so even all reads are 8 times slower. But if you only do 128K or bigger reads/writes there is no penalty at all.

And there is basically no point in using raidz1 if you don't increase your volblocksize to at least 16K. Otherwise you would lose 50% of raw capacity just like with a striped mirror, but a striped mirror would give you better IOPS and better redundancy.
 
Thanks for your fast reply and the very good explanation, that helped a lot. I need the space, so I am thinking about a RAID-0 and syncing the volumes to the second server. Not the best in terms of availability, but I don't wanna go back to mdadm...
 
You won't only lose the redundancy, you will also lose features like bit rot protection, which can't work without some form of parity. With a stripe, ZFS can still detect data corruption and tell you "your data got corrupted", but it can't repair it anymore because there is no parity. With a striped mirror or raidz, ZFS would easily auto-repair the corrupted data during the scheduled scrub.

If you don't care that much about IOPS, don't have many reads/writes smaller than 32K, and have a free disk slot, you could buy a fifth SSD. A raidz1 of 5 SSDs is way more efficient: with 5 SSDs as raidz1 and a 32K volblocksize you would only lose 20% of raw storage to parity/padding.
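If you want to check what a given layout and volblocksize really cost before migrating everything, one way is a throwaway test zvol (a sketch; the name overheadtest and the 32K value are just examples, and the dd will end with a harmless "no space left" error once the volume is full):
Bash:
# create a sparse test zvol with the candidate volblocksize
zfs create -s -V 10G -o volblocksize=32K rpool/overheadtest
# fill it with incompressible data so compression doesn't skew the result
dd if=/dev/urandom of=/dev/zvol/rpool/overheadtest bs=1M status=progress
# compare the logical size with what it really occupies on the pool
zfs get volsize,logicalused,used rpool/overheadtest
zfs destroy rpool/overheadtest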
 
Went with RAID10. I wanted to solve the space issue with replication, because I back up each server to the other, so I wanted to give replication a try. RAID10 with ZFS looks promising. But when I activated replication, the VM I wanted to replicate was twice the size. In my understanding of ZFS snapshots, the good thing is that only changed blocks are written to disk, so snapshots shouldn't need that much space. What is wrong with my VMs that they use twice the size?
 
Snapshots will grow over time and can reach a multiple of the size of the VM you snapshotted. ZFS is basically a big journal: changed data doesn't overwrite old data, it is appended to the end of the journal. So as long as a snapshot exists, none of the old data it still references can be deleted.

Let's say you have a 10GB VM, you create a snapshot, and then you create and delete 30x 1GB files inside that VM. Your VM is still only 10GB, but your snapshot is 30GB. So you don't want to keep snapshots for too long.
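You can check how much space is only being held by snapshots like this:
Bash:
# USEDSNAP is the space that would be freed if all snapshots of that dataset were destroyed
zfs list -o space -r rpool
# list the (replication) snapshots of a single disk
zfs list -t snapshot rpool/vm-101-disk-1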

Did you enable discard for your virtual disks, did you choose a storage controller like VirtIO SCSI that supports the TRIM command, and did you enable discard/TRIM inside every guest? If not, ZFS can't free up space and the virtual disks will grow and grow.
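As a sketch of what that looks like (assuming VM 101 with its disk attached as scsi0 and a Linux guest; adjust the IDs and options to your setup):
Bash:
# on the Proxmox host: attach the disk with discard enabled (needs the VirtIO SCSI controller)
qm set 101 --scsi0 rpool:vm-101-disk-1,discard=on,ssd=1
# inside the guest: trim once by hand ...
fstrim -av
# ... and/or enable the periodic trim job
systemctl enable --now fstrim.timer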
 
