Proxmox storage: questions and best practices

Tim Denis

Active Member
May 19, 2016
Hi all,

I have some questions about the storage model and what the best practices are for dealing with it. I hope this thread may become a reference for others.

So, what I have:
8 x 2 TB disks and an NVMe drive (one partition for the OS, and one partition as cache).

I configured a zpool:
Code:
 pool: tank
 state: ONLINE
  scan: none requested
config:

        NAME         STATE     READ WRITE CKSUM
        tank         ONLINE       0     0     0
          raidz2-0   ONLINE       0     0     0
            sda      ONLINE       0     0     0
            sdb      ONLINE       0     0     0
            sdc      ONLINE       0     0     0
            sdd      ONLINE       0     0     0
            sde      ONLINE       0     0     0
            sdf      ONLINE       0     0     0
            sdg      ONLINE       0     0     0
            sdh      ONLINE       0     0     0
        cache
          nvme0n1p4  ONLINE       0     0     0

So, using raidz2, I lose 3 disks of capacity. Fine. So I should get 5 x 2 TB = 10 TB of storage.
ashift = 12.


And yes, I get that.
Code:
root@pm13:~# zfs list
NAME                 USED  AVAIL  REFER  MOUNTPOINT
tank                9.33T   675G   222K  /tank

But... my disk space is being used up more quickly than I anticipated.

I have these VMs, using these disks:

Code:
root@pm13:~# pvesm list tank
tank:vm-100-disk-1   raw 17179869184 100
tank:vm-101-disk-1   raw 17179869184 101
tank:vm-101-disk-2   raw 1030792151040 101
tank:vm-101-disk-3   raw 1030792151040 101
tank:vm-101-disk-4   raw 1030792151040 101
tank:vm-102-disk-1   raw 17179869184 102
tank:vm-103-disk-1   raw 8589934592 103
tank:vm-104-disk-1   raw 17179869184 104
tank:vm-105-disk-1   raw 17179869184 105
tank:vm-106-disk-1   raw 17179869184 106
tank:vm-107-disk-1   raw 34359738368 107
tank:vm-107-disk-2   raw 2199023255552 107
tank:vm-110-disk-1   raw 214748364800 110

However, when I do

Code:
root@pm13:~# zfs list
NAME                 USED  AVAIL  REFER  MOUNTPOINT
backup              2.41T  4.61T  2.41T  /backup
tank                9.33T   675G   222K  /tank
tank/vm-100-disk-1  27.6G   675G  27.6G  -
tank/vm-101-disk-1  4.05G   675G  4.05G  -
tank/vm-101-disk-2  1.99T   675G  1.99T  -
tank/vm-101-disk-3  1.98T   675G  1.98T  -
tank/vm-101-disk-4  1.35T   675G  1.35T  -
tank/vm-102-disk-1  15.7G   675G  15.7G  -
tank/vm-103-disk-1  3.34G   675G  3.34G  -
tank/vm-104-disk-1  34.7G   675G  34.7G  -
tank/vm-105-disk-1  2.44G   675G  2.44G  -
tank/vm-106-disk-1  13.3G   675G  13.3G  -
tank/vm-107-disk-1  50.7G   675G  50.7G  -
tank/vm-107-disk-2  3.55T   675G  3.55T  -
tank/vm-110-disk-1   310G   675G   310G  -

I see that the volumes take up much more space than was allocated for them in the storage.

My questions:

1) How come? Why are the volumes bigger? When I add up the space I allocated, it comes to around 5 TB. However, almost all of my space is used up (just short of 10 TB).

I read some threads stating this has something to do with parity blocks ZFS stores. But that seems odd to me, since that is why I lose the 3 disks in the first place, right?

2) Could it have something to do with the filesystems used in the guests? VM 101 is an OpenMediaVault server with 3 data disks (virtual disks, stored on the ZFS pool), managed via LVM. Same story for VM 107, except there it is a single disk managed in the guest with LVM. Both have an ext4 filesystem on top. Could this be the culprit?

3) I know it isn't 'best practice' to run a storage server on virtualized volumes, but I have no separate storage server available, so I really need the storage server (for file sharing) to be virtualized too. Any recommendations on how to manage / use storage in this situation?

4) Could it be the file system? VM 110 is a Windows 10 VM with an allocated disk of 200 GB, yet 310 GB is reported as used. Or does the NTFS filesystem make the same 'assumptions' about the underlying block device as LVM / ext4 does?

If anyone could shed some light on this, I'd be very grateful!

If any further information about my setup would be helpful, please ask.

thanks in advance!
 
raidz2 with ashift=12 and a volblocksize of 4k or 8k means a lot of wasted space (because ZFS writes in blocks of 4k of data with ashift=12, so if you write a single 4k block of data it also needs to write a full 4k of parity!). you basically have four options
  • use stripe of mirrors instead of raidz (better performance, 50% of space used for redundancy, but different failure scenarios)
  • use ashift=9 (complete recreation of the pool needed, less space wasted, potentially worse performance depending on your disks, forever locked into ashift=9 for this pool)
  • use higher volblocksize (recreation of existing volumes needed, less space wasted, potentially worse performance depending on your workload) - see the sketch below this list
  • live with the wasted space
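
For the volblocksize option, something along these lines should work (assuming your zfspool storage entry in /etc/pve/storage.cfg is also called "tank" - adjust the names to your setup). Note that the new block size only applies to newly created volumes; existing ones have to be recreated (e.g. backup/restore or move disk).

Code:
# check what the existing zvols were created with (8k is the default unless changed)
zfs get volblocksize tank/vm-107-disk-2

# raise the default block size for volumes created on this storage, e.g. to 64k
pvesm set tank --blocksize 64k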
 
Hi Fabian, thanks for your reply.

Okay, this makes things a bit clearer now. So I assume that the filesystems / LVM used in the guests have no influence on this? Only ZFS is responsible?

Do I understand correctly that ZFS already 'costs' me 3 disks against failure, and that on top of that there is still additional parity calculated and written to the volume (i.e. relative to the reduced volume of 5 x 2 TB = 10 TB, and not to the raw disk space of 8 x 2 TB)?

If I exported the VMs to a backup volume, recreated the ZFS pool with ashift=9, and then restored the VMs, would the missing space 'be recovered'? Or would the information in the backup about the volumes set the ashift back to its original value? (Basically the question is: is ashift a property of the pool, or a property of the volume? I would think the former, but just to be sure...)
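
For reference, the cycle I have in mind would be roughly the following (assuming my "backup" dataset is set up as a directory storage in Proxmox, using VM 103 as an example - please correct me if this is nonsense):

Code:
# back up the VM to the backup storage
vzdump 103 --storage backup --mode stop --compress lzo

# destroy the pool and recreate it with the new ashift
zpool destroy tank
zpool create -o ashift=9 tank raidz2 sda sdb sdc sdd sde sdf sdg sdh

# restore the VM onto the recreated pool
qmrestore /backup/dump/vzdump-qemu-103-*.vma.lzo 103 --storage tank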

Another thing: I think I have 4k-sector disks. I deduce that from the fact that my pool (raidz2) is 9.33 TB, very close to the expected value of 5 x 2 TB. Is this a correct assumption? I read that for some pools, when the alignment of the pool and the physical disks differs, up to 40% of the expected space is lost.
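
If it helps, this is how I would double-check the sector sizes and the ashift actually recorded in the pool (assuming lsblk and zdb report what I think they do):

Code:
# logical vs. physical sector size of the member disks
lsblk -o NAME,LOG-SEC,PHY-SEC /dev/sd[a-h]

# ashift stored in the pool configuration
zdb -C tank | grep ashift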

You say:

> forever locked into ashift=9 for this pool

what are the disadvantages of this? (other than the potential performance issue you mentioned...)

thanks!
 
!!! It just occurred to me that I'm wrong about the expected disk space...
raidz2 should mean I lose 2 disks, so I should get 6 x 2 TB = 12 TB, and not 10 TB...
#feelinglikeanidiot
 
zfs / raidz does not have parity disks like some traditional raid implementations. it calculates parity blocks for data blocks based on the configured redundancy level - so for raidz2, if you write X data blocks you need to write enough parity blocks to be able to lose blocks stored on any two disks and still recover the data. the parity and data blocks are spread over all disks, you don't have "data disks" and "parity disks".

because of this, the actually usable space for data depends on your write pattern (volblocksize plays a role here), redundancy level (raidz1/2/3), ashift (/disk block size), number of vdevs / stripe width, .... the available space reported by zpool is just an estimate for raidz - if you write in a "bad" way, you can run out of space after writing far less data than that estimate suggests.
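
as a rough illustration for your pool (8-disk raidz2, ashift=12, i.e. 4k sectors) with the default 8k volblocksize - simplified, ignoring compression and metadata; the padding rule is described in the blog post linked below:

Code:
# one 8k volblock:
#   data sectors    : 8k / 4k = 2
#   parity sectors  : 2   (raidz2 parity for that row)
#   padding sectors : allocations are rounded up to a multiple of
#                     parity + 1 = 3, so 2 + 2 = 4 gets padded to 6
# -> 8k of guest data occupies 6 x 4k = 24k of raw disk, instead of the
#    "ideal" 8k x 8/6 ~= 10.7k the raidz2 ratio would suggest. a larger
#    volblocksize amortizes the parity and padding over many more data
#    sectors.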

see https://www.delphix.com/blog/delphi...or-how-i-learned-stop-worrying-and-love-raidz for some insight into what raidz is and how it works (written by one of the architects and main devs of ZFS)
 
> use mirror instead of raidz (better performance, less space wasted, but different failure scenarios)
Honestly, I do not understand how mirror with 8 drives (I suppose you mean something like raid10) can have less space wasted (that's 4 complete drives!) than raidz2 where just capacity equal to 2 out of 8 drives is used for parity...
 
> Honestly, I do not understand how mirror with 8 drives (I suppose you mean something like raid10) can have less space wasted (that's 4 complete drives!) than raidz2 where just capacity equal to 2 out of 8 drives is used for parity...

sorry, that was a typo. of course mirroring uses more space for redundancy than raidz2 (although not that much more than in the given scenario :p). edited original post for clarity!
 
> what are the disadvantages of this? (other than the potential performance issue you mentioned...)

The advantages outweigh everything else if your concern is space. If you have e.g. a 4K block size inside of your guest, you're going to compress that 4K block and store it inside of a 4K block, so compression only costs time and does not gain anything here. If you increase your block size to 8K, it could be good, because you can compress an 8K block into a 4K block if it fits.

It's a completely different calculation if you can compress a 4K block and store it in 0.5K blocks, because then you benefit a lot from compression. I also encountered a similar problem, which was discussed here:

https://forum.proxmox.com/threads/zfs-space-inflation.25230/

especially the very last entry with different block sizes
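
If you want to see what compression is actually buying you on your current pool, something like this should do (adjust the dataset names to yours):

Code:
# per-volume compression setting, achieved ratio and block size
zfs get -r compression,compressratio,volblocksize tank
# with ashift=12 every allocation is a multiple of 4K, so a 4K guest block
# that compresses down to 1K still occupies a full 4K sector; with ashift=9
# the same block could be stored in two 512-byte sectors.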
 
