[SOLVED] ZFS question: VM storage shows different usage on guest and host

killmasta93

Renowned Member
Aug 13, 2017
Hi,
I was wondering if someone could shed some light on an issue I'm having.
Currently I have Proxmox running on ZFS with a few VMs. One VM runs Ubuntu (Zentyal) with ext4 on its OS disk, but I created another virtual disk for it, and inside that VM I also use ZFS on that disk for snapshots.
The issue is that the storage usage inside the VM shows up differently on the host.
I have been troubleshooting for a while but cannot get it working; I think there is an alignment issue.
I first created the virtual disk with a volblocksize of 4K:

Code:
rpool/data/vm-145-disk-2                                        volblocksize  4K

On my Zentyal VM the usage is 39.4 GB, and I created the pool inside the VM with ashift=12:

Code:
zpool create -f -o ashift=12 -o autotrim=on data2 /dev/sdb

Code:
NAME    USED   AVAIL  REFER  MOUNTPOINT
data2   39.4G  83.6G  38.0G  /data2

but the host shows this:

Code:
NAME                      USED   AVAIL  REFER  MOUNTPOINT
rpool/data/vm-145-disk-2  63.9G  1.73T  63.9G  -


I then tried an 8K block size and ashift=13 inside the VM, but I get this.

On the VM:

Code:
NAME   USED   AVAIL  REFER  MOUNTPOINT
data3  40.5G  82.6G  38.5G  /data3

and on the host:

Code:
NAME                      USED   AVAIL  REFER
rpool/data/vm-145-disk-3  57.7G  1.68T  57.7G

Any ideas?
 
It is generally not a good idea to run ZFS on top of ZFS because the overhead will multiply.

Can you tell us more about what your rpool looks like?
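Something along these lines would show the layout and the relevant properties (assuming the pool is called rpool like in your output):

Code:
# pool layout: vdev type and member disks
zpool status rpool
# sector size the pool was created with
zpool get ashift rpool
# block size of the VM's zvol
zfs get volblocksize rpool/data/vm-145-disk-2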
 
Thanks for the reply. I did read a lot about it, but the snapshots are what really help inside of the VM.
My rpool is a RAID-Z1:
Code:
root@prometheus2:~# zfs get all rpool
NAME   PROPERTY              VALUE                  SOURCE
rpool  type                  filesystem             -
rpool  creation              Wed Aug 12 15:05 2020  -
rpool  used                  5.42T                  -
rpool  available             1.66T                  -
rpool  referenced            175K                   -
rpool  compressratio         1.12x                  -
rpool  mounted               yes                    -
rpool  quota                 none                   default
rpool  reservation           none                   default
rpool  recordsize            128K                   default
rpool  mountpoint            /rpool                 default
rpool  sharenfs              off                    default
rpool  checksum              on                     default
rpool  compression           on                     local
rpool  atime                 off                    local
rpool  devices               on                     default
rpool  exec                  on                     default
rpool  setuid                on                     default
rpool  readonly              off                    default
rpool  zoned                 off                    default
rpool  snapdir               hidden                 default
rpool  aclinherit            restricted             default
rpool  createtxg             1                      -
rpool  canmount              on                     default
rpool  xattr                 on                     default
rpool  copies                1                      default
rpool  version               5                      -
rpool  utf8only              off                    -
rpool  normalization         none                   -
rpool  casesensitivity       sensitive              -
rpool  vscan                 off                    default
rpool  nbmand                off                    default
rpool  sharesmb              off                    default
rpool  refquota              none                   default
rpool  refreservation        none                   default
rpool  guid                  2047715220241130019    -
rpool  primarycache          all                    default
rpool  secondarycache        all                    default
rpool  usedbysnapshots       0B                     -
rpool  usedbydataset         175K                   -
rpool  usedbychildren        5.42T                  -
rpool  usedbyrefreservation  0B                     -
rpool  logbias               latency                default
rpool  dedup                 off                    default
rpool  mlslabel              none                   default
rpool  sync                  disabled               local
rpool  dnodesize             legacy                 default
rpool  refcompressratio      1.00x                  -
rpool  written               175K                   -
rpool  logicalused           3.63T                  -
rpool  logicalreferenced     44K                    -
rpool  volmode               default                default
rpool  filesystem_limit      none                   default
rpool  snapshot_limit        none                   default
rpool  filesystem_count      none                   default
rpool  snapshot_count        none                   default
rpool  snapdev               hidden                 default
rpool  acltype               off                    default
rpool  context               none                   default
rpool  fscontext             none                   default
rpool  defcontext            none                   default
rpool  rootcontext           none                   default
rpool  relatime              off                    default
rpool  redundant_metadata    all                    default
rpool  overlay               off                    default
 
How many drives does your raidz1 consist of, and what ashift and volblocksize do you use on the host? With raidz it is normal that everything looks way too big if you don't increase the volblocksize.

For raidz1 the volblocksize should look something like this (a quick way to try a volblocksize out is sketched below the list):
3 disks: 4 * block size (set by ashift; so would be 16K for ashift of 12)
4 disks: 16 * block size (so would be 64K for ashift of 12)
5 disks: 8 * block size (so would be 32K for ashift of 12)
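If you want to verify the effect before touching the real disks, a small throwaway zvol with a bigger volblocksize can be created and compared; a rough sketch, where the name test-vol is just an example:

Code:
# create a sparse 10G test zvol with a 16K volblocksize
zfs create -s -V 10G -o volblocksize=16K rpool/data/test-vol
# write some data to it (e.g. attach it to a VM or use dd), then compare
zfs list -o name,volsize,volblocksize,used,referenced rpool/data/test-vol
# clean up afterwards
zfs destroy rpool/data/test-vol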
 
Thank you so much for the reply. Currently my host has 8 disks:

Code:
root@prometheus2:~# lsblk -o NAME,PHY-SeC
NAME     PHY-SEC
sda          512
├─sda1       512
├─sda2       512
└─sda3       512
sdb          512
├─sdb1       512
├─sdb2       512
└─sdb3       512
sdc          512
├─sdc1       512
├─sdc2       512
└─sdc3       512
sdd          512
├─sdd1       512
├─sdd2       512
└─sdd3       512
sde          512
├─sde1       512
├─sde2       512
└─sde3       512
sdf          512
├─sdf1       512
├─sdf2       512
└─sdf3       512
sdg          512
├─sdg1       512
├─sdg2       512
└─sdg3       512
sdh          512
├─sdh1       512
├─sdh2       512
└─sdh3       512

The volblocksize is the default 8K and the ashift is 12.
I tried 4K and 8K but can't get it to show the correct values:

Code:
rpool/data/vm-145-disk-1  volblocksize  8K  default
rpool/data/vm-145-disk-2  volblocksize  4K  -


Code:
root@prometheus2:~# zpool get all | grep ashift
rpool  ashift  12  local
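For reference, listing all zvols with their block sizes in one go can be done with something like this:

Code:
zfs list -t volume -o name,volsize,volblocksize,used,referenced -r rpool/data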
 
Look at this spreadsheet. With a raidz1 of 8 disks and an ashift of 12 you would lose:
volblocksize 4K/8K = 50% raw capacity lost
volblocksize 16K = 33% raw capacity lost
volblocksize 32K/64K = 20% raw capacity lost
volblocksize 128K = 16% raw capacity lost
volblocksize 256K/512K = 14% raw capacity lost
volblocksize 1M = 13% raw capacity lost

So right now, with a volblocksize of 8K, you lose 50% of your raw storage due to parity and padding overhead. So I would at least increase it to 32K, or instead use a striped mirror for better write IOPS but less capacity.
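For reference, the block size used for newly created disks on a ZFS storage can also be changed on the CLI; a rough sketch, assuming the storage entry is called local-zfs (use whatever your ZFS storage is named):

Code:
# volblocksize for zvols created on this storage from now on
pvesm set local-zfs --blocksize 32k
# existing zvols keep their old volblocksize and need to be recreated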
 
Thanks for the reply. So I would need to use a 32K block size for rpool/data/vm-145-disk-2 so that the VM shows the correct storage?
As for the ZFS pool inside of the VM, would it need to be ashift 12? And the same block size?
 
Thanks for the reply. So I would need to use a 32K block size for rpool/data/vm-145-disk-2 so that the VM shows the correct storage?
Correct. But the volblocksize can only be set at the creation of a zvol. So you need to destroy and recreate all virtual disks (which can easily be done by backing up the VMs and restoring them afterwards) after changing the volblocksize for your rpool storage (Datacenter -> Storage -> rpool -> Edit -> Block size). But it will always be a bit bigger unless you use a volblocksize of 1M, because with a volblocksize of 32K you will still lose 7% of raw storage due to bad padding. It should be way better than it is now, though.
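A rough sketch of that backup/restore cycle on the CLI (VMID 145, the storage names and the dump path are just examples):

Code:
# back up the VM to a file-based storage
vzdump 145 --storage local --mode stop
# restore it; the restored disks are created with the block size of the target storage
qmrestore /var/lib/vz/dump/vzdump-qemu-145-<timestamp>.vma 145 --storage local-zfs --force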
As for the ZFS pool inside of the VM, would it need to be ashift 12? And the same block size?
I think so.
 
Correct. But the volblocksize can only be set at the creation of a zvol. So you need to destroy and recreate all virtual disks (which can easily be done by backing up the VMs and restoring them afterwards) after changing the volblocksize for your rpool storage (Datacenter -> Storage -> rpool -> Edit -> Block size). But it will always be a bit bigger unless you use a volblocksize of 1M, because with a volblocksize of 32K you will still lose 7% of raw storage due to bad padding. It should be way better than it is now, though.

I think so.

So I think I solved this issue: I had to create the virtual disk with a 64K volblocksize, and inside the VM I had to use ashift=16, and now it seems to show the data correctly.
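For anyone finding this later, this is roughly how I compared the two sides afterwards (the disk and pool names are from my setup):

Code:
# on the host
zfs get volblocksize rpool/data/vm-145-disk-2
zfs list rpool/data/vm-145-disk-2
# inside the VM
zpool get ashift data2
zfs list data2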
 
Correct. But the volblocksize can only be set at the creation of a zvol. So you need to destroy and recreate all virtual disks (which can easily be done by backing up the VMs and restoring them afterwards) after changing the volblocksize for your rpool storage (Datacenter -> Storage -> rpool -> Edit -> Block size). But it will always be a bit bigger unless you use a volblocksize of 1M, because with a volblocksize of 32K you will still lose 7% of raw storage due to bad padding. It should be way better than it is now, though.

I think so.

I was curious: how come the default volblocksize on Proxmox is 8K when the VMs normally use 4K (Windows NTFS or ext4)? Wouldn't there always be an alignment issue for space? Or is 8K better?
 
I was curious: how come the default volblocksize on Proxmox is 8K when the VMs normally use 4K (Windows NTFS or ext4)? Wouldn't there always be an alignment issue for space? Or is 8K better?
ZFS was initially made for Solaris, and there 8K is the default block size. That's why 8K is the default blocksize for ZFS.
 
Do these block sizes matter for IOPS? What I have been seeing is that normally there is going to be an alignment issue, since Windows and Linux use 4K by default (except when installing MSSQL, where NTFS should be formatted with 64K).
 
Yes. If you, for example, use a volblocksize of 64K, everything with a lower block size than 64K will get terrible overhead. So 64K might be OK for stuff like videos and photos but terrible for VMs, databases and so on.
 
Thanks for the reply. But if the VM's data storage for MSSQL is 64K NTFS, shouldn't the volblocksize on Proxmox be the same?
So the lower the volblocksize on ZFS/Proxmox, the better?
 
If your VM only writes 64K blocks, that's fine. But if it, for example, also uses 4K for some stuff, you get a 16 times higher write amplification and 16 times the read overhead. So your read/write performance for 4K will be 16 times worse too, and your SSD will die 16 times faster.
 
Thanks for the reply. So if I understood correctly, it's always best to leave the 8K volblocksize on Proxmox and leave Windows NTFS and Linux ext4 at their default 4K block size, and if I change the Windows NTFS storage to 64K I would still leave the 8K volblocksize on Proxmox?
 
Thanks for the reply. So if I understood correctly, it's always best to leave the 8K volblocksize on Proxmox and leave Windows NTFS and Linux ext4 at their default 4K block size, and if I change the Windows NTFS storage to 64K I would still leave the 8K volblocksize on Proxmox?
You can't just choose any volblocksize. What is possible and what is not depends on your pool layout. With your 8-disk raidz1, everything below a 32K volblocksize basically wastes too much capacity. If you really want to use an 8K volblocksize you would need to create two separate striped mirrors (raid10) of 4 drives each; a sketch of such a pool is below. If you don't want 2 pools, the lowest useful volblocksize would be 16K with a striped mirror of 8 disks. All of this is of course only viable if you use an ashift of 12. As soon as you increase your ashift you also need to increase your volblocksize; increasing the volblocksize won't help if you also increase the ashift. If you increase your ashift from 12 to 16 you also need to increase the volblocksize by a factor of 4.
So your 64K volblocksize + ashift 16 raidz1 is basically the same as a 16K volblocksize with ashift 12, so you still lose 33% of raw capacity.
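A rough sketch of what creating such a striped mirror (raid10) of 4 drives would look like (pool and disk names are just placeholders; for real pools use stable /dev/disk/by-id/ paths):

Code:
# two mirror vdevs striped together
zpool create -f -o ashift=12 tank \
    mirror /dev/disk/by-id/ata-DISK1 /dev/disk/by-id/ata-DISK2 \
    mirror /dev/disk/by-id/ata-DISK3 /dev/disk/by-id/ata-DISK4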

I made a benchmark yesterday where I tested reads/writes with 4K/16K/32K/4096K block sizes against my 32K raidz1 pool. Here you can see, for example, how much the overhead increases if you try to read/write with a block size that is smaller than the volblocksize.
I need to do more benchmarks, but I will most likely delete my raidz1 pool and create a striped mirror of 4 drives with a volblocksize of 16K for normal VMs, plus a mirror of 2 drives with a volblocksize of 4K for my VMs that heavily utilize DBs, so the volblocksize is as low as possible. I hope that will decrease my high write amplification so the drives will live longer, and hopefully also give better performance.
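For reference, a minimal version of such a test can be done with fio; a sketch, where the mount point and file name are just examples:

Code:
# random 4K writes against a file on the pool; compare the bandwidth fio reports
# with what "zpool iostat -y <pool> 1" shows in a second shell to get a feeling
# for the write amplification (compression and caching will skew the numbers)
fio --name=randwrite4k --filename=/tank/fio.test --rw=randwrite \
    --bs=4k --size=1G --ioengine=psync --end_fsync=1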
 
Interesting, I did not know that with an 8-disk raidz1 everything below 32K is a waste, so it also depends on how many disks the server has and whether it is raid10 or raidz1.
Let's say I have a raid10 striped mirror (4 disks) and run an MSSQL VM; would the recommended block size for the virtual disk be 4K?
 
Interesting, I did not know that with an 8-disk raidz1 everything below 32K is a waste, so it also depends on how many disks the server has and whether it is raid10 or raidz1.
Let's say I have a raid10 striped mirror (4 disks) and run an MSSQL VM; would the recommended block size for the virtual disk be 4K?
For striped stuff it should be "data bearing disks * block size of your drives = volblocksize". So if using ashift=12 for that pool and 4 drives as a striped mirror, that would be 2 * 4K = 8K volblocksize.
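Spelled out with the numbers from above (a worked example, assuming ashift=12, i.e. 4K sectors):

Code:
# data-bearing disks * sector size (2^ashift) = volblocksize
# 4-disk striped mirror (2 mirror vdevs), ashift=12: 2 * 4K =  8K
# 8-disk striped mirror (4 mirror vdevs), ashift=12: 4 * 4K = 16K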
 
Quick question: when running SSDs, is there a rule of thumb when configuring Proxmox? As I see there is an option for SSD when configuring a VM?
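For context, the options I mean end up on the disk line of the VM config; on the CLI it would look something like this (the VMID, storage and disk names are just examples), with ssd=1 presenting the disk to the guest as an SSD and discard=on passing TRIM through to the zvol:

Code:
qm set 145 --scsi0 local-zfs:vm-145-disk-0,discard=on,ssd=1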
 
