VM storage conf for 3rd time

ieronymous

Well-Known Member
Apr 1, 2019
Multiple times I have configured Proxmox from scratch, and storage is my weak spot. I always try
to set the best possible options for the use case, but I still find it hard to determine what those
best options are when it comes to storage.

Once more, the disks are 512B native (10k SAS3/12G drives), so I have created the HH3VM pool with ashift 9.
I should mention that this pool is dedicated to VM storage.
Afterwards I enabled Thin provision from the GUI. The block size is 8K by default, but my personal notes
(compiled from long forum discussions, YouTube tutorials, etc.) say it would be better to set the block size
to 16K (it supposedly matches the NTFS file system of Windows, or something like that). On top of that, I have written down that this option can't be changed
afterwards (though it probably can; the point is that changing it makes no difference to already created VMs).
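For reference, my notes say a striped mirror ("raid10") pool with an explicit ashift would be created from the CLI roughly like this (disk paths are just placeholders, not my actual devices):
Code:
zpool create -o ashift=9 HH3VM \
  mirror /dev/disk/by-id/DISK1 /dev/disk/by-id/DISK2 \
  mirror /dev/disk/by-id/DISK3 /dev/disk/by-id/DISK4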

-Does this mean I had to create the raid10 ZFS storage from the CLI only, so that I could add the extra block size option to the command?
-Do I even need to change it, since the OSes will be 5-6 Windows Server VMs?
-Is there any other option I should consider?

Below follow all the options of the (still empty) pool I have created.

zpool get all HH3VM
Code:
HH3VM  size                           4.34T                          -
HH3VM  capacity                       0%                             -
HH3VM  altroot                        -                              default
HH3VM  health                         ONLINE                         -
HH3VM  guid                           10.......08....79...           -
HH3VM  version                        -                              default
HH3VM  bootfs                         -                              default
HH3VM  delegation                     on                             default
HH3VM  autoreplace                    off                            default
HH3VM  cachefile                      -                              default
HH3VM  failmode                       wait                           default
HH3VM  listsnapshots                  off                            default
HH3VM  autoexpand                     off                            default
HH3VM  dedupratio                     1.00x                          -
HH3VM  free                           4.34T                          -
HH3VM  allocated                      220K                           -
HH3VM  readonly                       off                            -
HH3VM  ashift                         9                              local
HH3VM  comment                        -                              default
HH3VM  expandsize                     -                              -
HH3VM  freeing                        0                              -
HH3VM  fragmentation                  0%                             -
HH3VM  leaked                         0                              -
HH3VM  multihost                      off                            default
HH3VM  checkpoint                     -                              -
HH3VM  load_guid                      9992307875857252571            -
HH3VM  autotrim                       off                            default
HH3VM  compatibility                  off                            default
HH3VM  feature@async_destroy          enabled                        local
HH3VM  feature@empty_bpobj            enabled                        local
HH3VM  feature@lz4_compress           active                         local
HH3VM  feature@multi_vdev_crash_dump  enabled                        local
HH3VM  feature@spacemap_histogram     active                         local
HH3VM  feature@enabled_txg            active                         local
HH3VM  feature@hole_birth             active                         local
HH3VM  feature@extensible_dataset     active                         local
HH3VM  feature@embedded_data          active                         local
HH3VM  feature@bookmarks              enabled                        local
HH3VM  feature@filesystem_limits      enabled                        local
HH3VM  feature@large_blocks           enabled                        local
HH3VM  feature@large_dnode            enabled                        local
HH3VM  feature@sha512                 enabled                        local
HH3VM  feature@skein                  enabled                        local
HH3VM  feature@edonr                  enabled                        local
HH3VM  feature@userobj_accounting     active                         local
HH3VM  feature@encryption             enabled                        local
HH3VM  feature@project_quota          active                         local
HH3VM  feature@device_removal         enabled                        local
HH3VM  feature@obsolete_counts        enabled                        local
HH3VM  feature@zpool_checkpoint       enabled                        local
HH3VM  feature@spacemap_v2            active                         local
HH3VM  feature@allocation_classes     enabled                        local
HH3VM  feature@resilver_defer         enabled                        local
HH3VM  feature@bookmark_v2            enabled                        local
HH3VM  feature@redaction_bookmarks    enabled                        local
HH3VM  feature@redacted_datasets      enabled                        local
HH3VM  feature@bookmark_written       enabled                        local
HH3VM  feature@log_spacemap           active                         local
HH3VM  feature@livelist               enabled                        local
HH3VM  feature@device_rebuild         enabled                        local
HH3VM  feature@zstd_compress          enabled                        local
HH3VM  feature@draid                  enabled                        local

-Why are listsnapshots and autoexpand off? Isn't it useful to know where these snapshots are and how many there are?
-As for autoexpand, the off setting doesn't seem to help anywhere. Is this option used for something else?
-Is autotrim off because no SSDs were detected? If that is the case, why is this option also off
for the rpool, which is based on mirrored SSDs?
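From what I understand, pool properties like these can at least be toggled later with zpool set, e.g.:
Code:
zpool set listsnapshots=on HH3VM
zpool set autotrim=on rpool
so nothing is locked in; the question is just which values make sense.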

.....and we move on to the datasets/zvols (zvols here, since as I already mentioned this storage is only for VMs).
Since every zpool creation also creates a corresponding root zvol/dataset, its properties are
set automatically and are not always the correct ones. Mine follow below.

zfs get all HH3VM
Code:
HH3VM  type                  filesystem             -
HH3VM  creation              Thu Mar 17 13:04 2022  -
HH3VM  used                  225K                   -
HH3VM  available             4.22T                  -
HH3VM  referenced            24K                    -
HH3VM  compressratio         1.00x                  -
HH3VM  mounted               yes                    -
HH3VM  quota                 none                   default
HH3VM  reservation           none                   default
HH3VM  recordsize            128K                   default
HH3VM  mountpoint            /HH3VM                 default
HH3VM  sharenfs              off                    default
HH3VM  checksum              on                     default
HH3VM  compression           on                     local
HH3VM  atime                 on                     default
HH3VM  devices               on                     default
HH3VM  exec                  on                     default
HH3VM  setuid                on                     default
HH3VM  readonly              off                    default
HH3VM  zoned                 off                    default
HH3VM  snapdir               hidden                 default
HH3VM  aclmode               discard                default
HH3VM  aclinherit            restricted             default
HH3VM  createtxg             1                      -
HH3VM  canmount              on                     default
HH3VM  xattr                 on                     default
HH3VM  copies                1                      default
HH3VM  version               5                      -
HH3VM  utf8only              off                    -
HH3VM  normalization         none                   -
HH3VM  casesensitivity       sensitive              -
HH3VM  vscan                 off                    default
HH3VM  nbmand                off                    default
HH3VM  sharesmb              off                    default
HH3VM  refquota              none                   default
HH3VM  refreservation        none                   default
HH3VM  guid                  5...01...8792037...    -
HH3VM  primarycache          all                    default
HH3VM  secondarycache        all                    default
HH3VM  usedbysnapshots       0B                     -
HH3VM  usedbydataset         24K                    -
HH3VM  usedbychildren        201K                   -
HH3VM  usedbyrefreservation  0B                     -
HH3VM  logbias               latency                default
HH3VM  objsetid              54                     -
HH3VM  dedup                 off                    default
HH3VM  mlslabel              none                   default
HH3VM  sync                  standard               default
HH3VM  dnodesize             legacy                 default
HH3VM  refcompressratio      1.00x                  -
HH3VM  written               24K                    -
HH3VM  logicalused           79K                    -
HH3VM  logicalreferenced     12K                    -
HH3VM  volmode               default                default
HH3VM  filesystem_limit      none                   default
HH3VM  snapshot_limit        none                   default
HH3VM  filesystem_count      none                   default
HH3VM  snapshot_count        none                   default
HH3VM  snapdev               hidden                 default
HH3VM  acltype               off                    default
HH3VM  context               none                   default
HH3VM  fscontext             none                   default
HH3VM  defcontext            none                   default
HH3VM  rootcontext           none                   default
HH3VM  relatime              off                    default
HH3VM  redundant_metadata    all                    default
HH3VM  overlay               on                     default
HH3VM  encryption            off                    default
HH3VM  keylocation           none                   default
HH3VM  keyformat             none                   default
HH3VM  pbkdf2iters           0                      default
HH3VM  special_small_blocks  0                      default

What confuses me is that the zvol HH3VM (the root dataset of the HH3VM zpool) has these attributes, since it is block level.

Other options here that might make a difference would be:

zfs set atime=off : Disables the Accessed attribute on every file that is accessed; this can double IOPS.
zfs set relatime=on : On the other hand, if some apps need that access time to work and you have it disabled,
the app will malfunction. In such a case, leave atime on along with relatime.
I don't know what to choose here. Apps will use the atime of the OS inside the VM, not that of the
underlying storage.
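If I do decide to change them, my notes say it would be set on the pool-root dataset roughly like this (child datasets then inherit it):
Code:
zfs set atime=off HH3VM
# or, if access times are still needed by something:
zfs set atime=on HH3VM
zfs set relatime=on HH3VM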

zfs set xattr=sa : According to this statement, which seems logical, <<zvols don't have an xattr property,
as there are no xattrs that could be stored>>, so why is there a value for this option at all, and why is it set to on?
Also, by definition, xattr=sa stores the Linux extended attributes in the inodes, which stops
the file system from writing tiny separate files for them.
Does it make a difference only for Linux VMs? As I mentioned above, only Windows VMs will be used.
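If it matters at all, my notes say it would just be set on the pool-root dataset:
Code:
zfs set xattr=sa HH3VM
(again only meaningful for files on datasets, not for zvols, if I understand the quote above correctly).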

zfs set recordsize=16K : The recordsize value is determined by the type of data on the file system:
16K for VM images and databases (or an exact match), or 1M for collections of 5-9MB JPG files,
GB+ movies, etc. If you are unsure, the default of 128K is good enough for all-around mixes of file sizes.
Should I change the default 128K to 16K before I start creating VMs? (Again, Windows VMs.)
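If so, my notes say changing and verifying it would be roughly:
Code:
zfs set recordsize=16K HH3VM
zfs get recordsize HH3VM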

acltype=posixacl (default is acltype=off) : I don't even know about that one. Anyone with further info?

primarycache=metadata | all | none : Controls what is cached in the primary cache (ARC).
secondarycache=metadata | all | none : Controls what is cached in the secondary cache (L2ARC).
If the property is set to all, both user data and metadata are cached.
If it is set to none, neither user data nor metadata is cached.
If it is set to metadata, only metadata is cached.
The default value is all.
The primarycache option does have an impact on performance, but not for every workload.
In some tests we see no difference at all, while in others it gives more than a 200% boost.
With all this information, you might be lost about whether it's good to touch the primary cache
and which option is better for you.
Here, as a rule of thumb: set all VMs and LXCs to primarycache=metadata,
and only for very, very specific workloads set primarycache=all.
What about secondarycache?
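Applying that rule of thumb to this pool would, if my notes are right, look roughly like:
Code:
zfs set primarycache=metadata HH3VM
# secondarycache only matters if an L2ARC/cache device is attached:
zfs set secondarycache=none HH3VM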

zfs set compression=lz4 : We are OK with that, since it is the default value.

What do you think?

Thank you
 
Once more, the disks are 512B native (10k SAS3/12G drives), so I have created the HH3VM pool with ashift 9.
I should mention that this pool is dedicated to VM storage.
Afterwards I enabled Thin provision from the GUI. The block size is 8K by default, but my personal notes
(compiled from long forum discussions, YouTube tutorials, etc.) say it would be better to set the block size
to 16K (it supposedly matches the NTFS file system of Windows, or something like that). On top of that, I have written down that it can't be changed
afterwards.

-Does this mean I had to create the raid10 ZFS storage from the CLI only, so that I could add
the extra block size option to the command?
You can change the volblocksize later for the pool, but not for already existing zvols. So you should go to Datacenter -> Storage -> YourZfsPool -> Edit -> Block Size and set a good value before restoring/migrating your first VM to that pool, because all new zvols will use the volblocksize you set there for the pool.
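The same thing can be done from the CLI if you prefer, something like this (assuming the storage ID in Proxmox is also called HH3VM):
Code:
pvesm set HH3VM --blocksize 4k
which should just put a matching "blocksize" line into the zfspool entry in /etc/pve/storage.cfg.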
And I wouldn't use 16K. That way stuff like a Postgres DB doing 8K reads/writes would be slow. When using an ashift of 9 (512B sectors) I would use a 4K volblocksize. That way it matches the 4K block size most filesystems are based on, and it is still 8 times your sector size, so it should be fine for block-level compression and a striped mirror of up to 16 disks.
It's never a problem to read/write with a bigger block size to/from a smaller block size, but it is a big problem to do the opposite. So, for example, doing 16K block operations on a zvol with an 8K volblocksize is absolutely fine. But doing an 8K operation on a zvol with a 16K volblocksize would be bad and cause double the overhead, so you only get half the performance.
If you go lower with the volblocksize than needed, your overhead will go up because the data-to-metadata ratio gets worse and compression won't be as effective. But if your volblocksize is higher than the block size of your workload, it is even worse. So I would rather use a smaller than a bigger volblocksize.
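Just to illustrate what happens under the hood: the volblocksize is baked in when the zvol is created, roughly like this (names and size are just an example):
Code:
zfs create -s -V 32G -o volblocksize=4K HH3VM/vm-100-disk-0
zfs get volblocksize HH3VM/vm-100-disk-0
and it can't be changed for that zvol afterwards, only for new ones.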
What confuses me is that the zvol HH3VM (the root dataset of the HH3VM zpool) has these attributes, since it is block level.
It's not a block-level device. The root of your pool is like a dataset, so a filesystem.
Should I change the default 128K to 16K before I start creating VMs? (Again, Windows VMs.)
For VMs you don't need to care about the recordsize, as it only affects datasets and VMs will only use zvols. It would only be useful when storing VMs as qcow2 images on top of a dataset.
 
You can change the volblocksize later for the pool
The default is 8K, as I noticed. By the way, which command would give me that 8K from the CLI? I tried zpool get volblocksize pool_name and
zfs get volblocksize pool_name.
Since that pool will be used by VMs, what would the correct command be (something with zpool get what?) in order to see that 8K value the GUI shows?

Does the recordsize, if it is going to be used for VM backups for example, need a better value than 128K?

Any thoughts about the other options like
primarycache=
secondarycache=
zfs set xattr=
zfs set atime=
zfs set relatime=

Thank you
 
The default is 8K, as I noticed. By the way, which command would give me that 8K from the CLI? I tried zpool get volblocksize pool_name and
zfs get volblocksize pool_name.
Since that pool will be used by VMs, what would the correct command be (something with zpool get what?) in order to see that 8K value the GUI shows?
The volblocksize isn't defined for the complete ZFS pool. It's defined for each zvol when the zvol is created. If you want to see the volblocksize of a specific zvol you can use zfs get volblocksize YourPool/YourZvol, or if you just want a list of all zvols, use zfs get volblocksize.
Does the recordsize, if it is going to be used for VM backups for example, need a better value than 128K?
"recordsize" is only used for datasets, so only LXCs will make use of it, not VMs. A 128K recordsize should be a good all-rounder.
Any thoughts about the other options like
primarycache=
secondarycache=
zfs set xattr=
zfs set atime=
zfs set relatime=
Those really depend on the hardware and software you are using.
 
The volblocksize isn't defined for the complete ZFS pool. It's defined for each zvol when the zvol is created. If you want to see the volblocksize of a specific zvol you can use zfs get volblocksize YourPool/YourZvol, or if you just want a list of all zvols, use zfs get volblocksize.
Three lines from the results show that you are absolutely right. The pool, on top of which zvols are created for each VM, doesn't have that attribute, while each VM's specific zvol does. This is probably why on the new server I can't get the volblocksize: I don't have any VMs created yet. Nice!!
Code:
HHVM volblocksize - -
HHVM/vm-100-disk-0 volblocksize 8K default
HHVM/vm-101-disk-0 volblocksize 8K default

"recordsize" is only used for datasets, so only LXCs will make use of it, not VMs. A 128K recordsize should be a good all-rounder.
I know (by now) that it is for datasets. That is why I mentioned VM backups, i.e. files instead of VMs.
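So if I later add a plain dataset on this pool just for backup files, my notes say the bigger recordsize would be set only on that dataset, e.g. (dataset name is just an example):
Code:
zfs create -o recordsize=1M HH3VM/backup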
 
