VM storage on ZFS pool vs ZFS dataset

zecas

Hi,

I'm planning the installation of a new Proxmox instance, and I'm thinking about the storage model to follow.

In my case, I will have 2x Samsung SSD 860 PRO 250GB disks in mirror mode for the Proxmox boot, then I'll create a ZFS pool of mirrors (raid10) for storing data.

My initial idea was to create the ZFS pool (say "zfsa") with thin provisioning, then the following datasets:
zfsa/vm
zfsa/iso
zfsa/backup

Their names reflect their contents :). The idea would be to better organize the data and, for instance, to be able to replicate the "zfsa/vm" dataset to a remote ZFS, but not the other datasets.
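For reference, the commands I have in mind are roughly these (the device names below are just placeholders; I'd use /dev/disk/by-id paths for the real thing):

Code:
# create the raid10 pool from two mirrored pairs (example device names)
zpool create zfsa mirror /dev/sda /dev/sdb mirror /dev/sdc /dev/sdd

# one dataset per type of content
zfs create zfsa/vm
zfs create zfsa/iso
zfs create zfsa/backup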

I found some information online regarding setting a dataset to store VMs vs storing the VMs directly on the zfs pool, but the advantages/disadvantages are not clear (assuming there are any).

I also did some testing and found, for instance, that when a VM disk "vm-100-disk-0" on the pool is moved to a dataset, it gives me the option of "raw" or "qcow2" (for instance), and selecting either of them will move the VM disk to that storage and add the appropriate file extension, ".raw" or ".qcow2". That did allow me to understand the mechanics of it, but gave me no insight into the best solution.

So I ended up with some questions that I'm hoping to clarify on this forum: what would be the best option, or best practice (if we can say so), for dealing with a ZFS storage?

1. Will VM disks stored directly on a ZFS pool always be raw?

2. Is there any advantage to putting the VM on the pool instead of on a specific dataset?

3. Are there any performance penalties I should consider when putting VM data on a ZFS pool vs. a dataset, and even between formats (raw, qcow2, ...)?

4. Is there any problem with either approach for storing VM disks if the ZFS storage is set to thin provisioning?

5. Or should I just use the ZFS pool, and store ISO and backup data somewhere else?

Thank you.


Best Regards.
zecas
 
My unsubstantiated thoughts on your questions:
0. Thin provisioning is not a setting of the ZFS pool itself but of the PVE storage definition; you can change it at any time and it will only affect newly created virtual disks (see the sketch below).
1. It will be block-based like raw, but ZFS is COW like qcow2, so...
2. Faster backups, snapshots and better (non-sequential) performance in general?
3. Yes, there are differences between formats and storage types. File-based is often slower than block-based. Fewer features and protections also often means faster.
4. With thin storage, you can always give out more storage than you actually have, which can become a problem if everybody wants to use all of it.
5. You can create filesystems on ZFS pools next to VM/CT disks, so you can do both with the same ZFS pool. I suggest storing backups far away from the actual VMs, for safety reasons.
There is a whole section in the manual about storage but ZFS can do it all and has good integration (if you don't obsess on raw disk performance).
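Regarding point 0: a minimal sketch of what I mean, assuming a zfspool storage named "zfsa" is already defined. The thin provisioning flag is just the "sparse" option of the storage definition in /etc/pve/storage.cfg:

Code:
zfspool: zfsa
        pool zfsa
        content images,rootdir
        sparse 1

It should also be possible to toggle it at any time with "pvesm set zfsa --sparse 1" (or 0); as said, only newly created disks are affected.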
 
I found some information online regarding setting a dataset to store VMs vs storing the VMs directly on the zfs pool, but the advantages/disadvantages are not clear (assuming there are any).
It's the same, so no disadvantages; you just get the benefit of easier management (recursive replication and inherited ZFS options).
But I wouldn't put the backups and the VMs on the same pool. In case your pool fails, you lose both your VMs and your backups.
I also did some testing and found, for instance, that when a VM disk "vm-100-disk-0" on the pool is moved to a dataset, it gives me the option of "raw" or "qcow2" (for instance), and selecting either of them will move the VM disk to that storage and add the appropriate file extension, ".raw" or ".qcow2". That did allow me to understand the mechanics of it, but gave me no insight into the best solution.
That depends on how you add the dataset to Proxmox. If you add it as a "directory storage", it is used as file level storage: your VM disks are stored as files, and then it's possible to store qcow2 files. But if you add the dataset as a "ZFS storage", it behaves like a normal ZFS pool created through the webUI, except that everything is stored on top of that dataset instead of on top of the pool's root; you get block level storage and VM disks will be stored as zvols. You can only store ISOs and backups on a file level storage, so I would add those two datasets as a "directory storage", and if you want less overhead (you save an additional filesystem layer and the expensive qcow2 format, which uses copy-on-write too) I would add the "vm" dataset as a "ZFS storage".
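Roughly, the two variants look like this on the CLI (the storage names are just examples, and I assume the pool is mounted at /zfsa; the webUI does the same thing):

Code:
# add the "vm" dataset as a block level "ZFS storage" (VM disks become zvols)
pvesm add zfspool zfsa-vm --pool zfsa/vm --content images,rootdir --sparse 1

# add the other datasets as file level "directory storage" for ISOs and backups
pvesm add dir zfsa-iso --path /zfsa/iso --content iso
pvesm add dir zfsa-backup --path /zfsa/backup --content backup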
So I ended up with some questions that I'm hoping to clarify on this forum: what would be the best option, or best practice (if we can say so), for dealing with a ZFS storage?

1. Will VM disks stored directly on a ZFS pool always be raw?
Yup, because they then use native ZFS block devices (zvols).
2. Is there any advantage to putting the VM on the pool instead of on a specific dataset?
No, not if you add the dataset as a "ZFS storage".
3. Are there any performance penalties I should consider when putting VM data on a ZFS pool vs. a dataset, and even between formats (raw, qcow2, ...)?
For the lowest overhead (and therefore best performance) I wouldn't store VMs on a "directory storage" and wouldn't use qcow2.
4. Is there any problem with either approach for storing VM disks if the ZFS storage is set to thin provisioning?
Thin provisioning should work with both options.
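If you want to check it: a thin (sparse) zvol simply has no refreservation, while a thick one reserves its full volsize. Something like this shows it (the disk name is just an example):

Code:
zfs get volsize,refreservation,used zfsa/vm/vm-100-disk-0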
 
Hi,

Thank you both for your replies; they really helped me to better understand this subject.

I will follow your indications and keep the backups off the same ZFS that will also contain VM data. I'll think of a solution for that, maybe with additional disks for that purpose, even though I intend to replicate important data to a remote ZFS instance.

There is a whole section in the manual about storage but ZFS can do it all and has good integration (if you don't obsess on raw disk performance).

The info on that manual is great, but on the available storage types table, it states zfspool as a file level storage? Shouldn't it be a block level?

That depends on how you add the dataset to Proxmox. If you add it as a "directory storage", it is used as file level storage: your VM disks are stored as files, and then it's possible to store qcow2 files. But if you add the dataset as a "ZFS storage", it behaves like a normal ZFS pool created through the webUI, except that everything is stored on top of that dataset instead of on top of the pool's root; you get block level storage and VM disks will be stored as zvols. You can only store ISOs and backups on a file level storage, so I would add those two datasets as a "directory storage", and if you want less overhead (you save an additional filesystem layer and the expensive qcow2 format, which uses copy-on-write too) I would add the "vm" dataset as a "ZFS storage".

I had no idea I could add a dataset as a ZFS storage. Many online sources would directly use the pool, or create a dataset and add it as a directory storage to put it to use.


I've just played around a bit more in my virtual test environment with that knowledge in mind:

Code:
# zfs list
NAME                          USED  AVAIL     REFER  MOUNTPOINT
hddmirror                    1.49M   255M       96K  /hddmirror
hddmirror/backup               96K   255M       96K  /hddmirror/backup
hddmirror/iso                  96K   255M       96K  /hddmirror/iso
hddmirror/vm                   96K   255M       96K  /hddmirror/vm
hddmirror/vm-100-disk-0        56K   255M       56K  -
hddmirror/vm-100-disk-1        56K   255M       56K  -
hddmirror/vm2                 152K   255M       96K  /hddmirror/vm2
hddmirror/vm2/vm-100-disk-0    56K   255M       56K  -

There I have 2 VM disks directly on the ZFS pool (named "hddmirror").

I've then created a new dataset:
Code:
# zfs create hddmirror/vm2

And on the web interface, I added a ZFS storage named "hddmirror-vm2", pointing to the "hddmirror/vm2" dataset. After that, I moved the 3rd VM disk to the "hddmirror-vm2" storage; it is the one listed above as "hddmirror/vm2/vm-100-disk-0".
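(For the record, I could apparently have done the same move from the CLI; if I'm reading the docs right it is something like this, with "scsi0" standing in for whichever slot the disk is attached to:)

Code:
# newer PVE versions; older ones use "qm move_disk" instead of "qm disk move"
qm disk move 100 scsi0 hddmirror-vm2 --delete 1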


Storage definitions are as follows:

Code:
# cat /etc/pve/storage.cfg
dir: local
        path /var/lib/vz
        content backup,vztmpl,iso

lvmthin: local-lvm
        thinpool data
        vgname pve
        content images,rootdir

dir: local-iso
        path /mny/iso
        content iso
        prune-backups keep-all=1
        shared 0

zfspool: hddmirror
        pool hddmirror
        content images,rootdir
        mountpoint /hddmirror
        sparse 1

dir: hddmirror-vm
        path /hddmirror/vm
        content images,rootdir
        prune-backups keep-all=1
        shared 0

dir: hddmirror-iso
        path /hddmirror/iso
        content iso
        prune-backups keep-all=1
        shared 0

dir: hddmirror-backup
        path /hddmirror/backup
        content backup
        prune-backups keep-all=1
        shared 0

zfspool: hddmirror-vm2
        pool hddmirror/vm2
        content rootdir,images
        mountpoint /hddmirror/vm2
        sparse 1


But this means that I can create the ZFS pool on the command line (as I'm thinking of doing, for better control) and don't need to add it (I mean the ZFS pool root) as a ZFS storage, correct? Then I would also create the datasets on the command line and add only the datasets themselves as ZFS storage, directory storage, etc...

The way Proxmox renames the disks as they are moved around is a bit confusing to me. As I see from above, it seems the same disk "vm-100-disk-0" exists on distinct storages, but in reality it was "vm-100-disk-2" when it was on the "hddmirror" storage. I bet it can get quite messy sometimes :) ...

Regarding thin provisioning: if I move a VM disk from a non-thin storage to a thin-provisioning-enabled storage, the migration will strip out the unused space, correct?
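(I guess I can compare the space accounting before and after such a move with something like this, using one of the test disks above:)

Code:
zfs get volsize,used,logicalused,refreservation hddmirror/vm2/vm-100-disk-0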


Thank You.
zecas
 
I will follow your indications and keep the backups off the same ZFS that will also contain VM data. I'll think of a solution for that, maybe with additional disks for that purpose, even though I intend to replicate important data to a remote ZFS instance.
You should have a look at the Proxmox Backup Server (PBS). With it, backups are much faster (you only need to back up data that has changed since the last backup) and they consume far less space (in contrast to vzdump backups, nothing needs to be stored twice thanks to deduplication). Sounds like you already have another ZFS host you can replicate to. In case it is something like a TrueNAS server, you could run PBS there in a VM.
The info on that manual is great, but on the available storage types table, it states zfspool as a file level storage? Shouldn't it be a block level?
ZFS can basically be both. There are datasets, which are file level, and zvols, which are block level. If you create a "ZFS storage", PVE will create datasets to store LXCs and zvols to store VM disks. So LXCs will use ZFS as a file level storage and VMs as a block level storage.
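You can see the difference directly with plain ZFS commands (the names here are just examples):

Code:
# a dataset = a filesystem (file level), mounted and browsable
zfs create hddmirror/somefiles

# a zvol = a virtual block device (block level), it shows up under /dev/zvol/
zfs create -V 5G hddmirror/somevolume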
 
You should have a look at the Proxmox Backup Server (PBS). With it, backups are much faster (you only need to back up data that has changed since the last backup) and they consume far less space (in contrast to vzdump backups, nothing needs to be stored twice thanks to deduplication). Sounds like you already have another ZFS host you can replicate to. In case it is something like a TrueNAS server, you could run PBS there in a VM.

I've already taken some time to read about PBS. Back then I discarded the idea of setting up another host for that purpose, but I'll migrate the VMs from an existing Proxmox to this one I'm planning, so maybe I'll end up with an available host to think about that ...

The idea was to add another local disk (or disks) to store backups, then replicate the ZFS pools/datasets to a remote server (direct ZFS replication). Seems you just gave me another possibility for the backup part.

But for remote replication (or, simply put, storing a copy at a remote location), do you see a better alternative to pure ZFS replication?

ZFS can basically be both. There are datasets, which are file level, and zvols, which are block level. If you create a "ZFS storage", PVE will create datasets to store LXCs and zvols to store VM disks. So LXCs will use ZFS as a file level storage and VMs as a block level storage.

Ah, so let me see if I understood.

Basically, I create a dataset on the zfs pool, then add that dataset as a zfs storage. For proxmox then, that dataset will be treated as a zfs pool root by itself (since it was added directly as a zfs storage).

Then:
1. If I create an LXC, it will create a dataset in there (a second level dataset), to store the container data;
2. If I create a VM, it will create a zvol (dataset that represents a block device) to store the VM disk.

So basically "hddmirror/vm2/vm-100-disk-0" is a zvol for that VM disk. I can find them listed under "/dev":
Code:
# find /dev/zvol -type l -ls
      529      0 lrwxrwxrwx   1 root     root           13 Apr 27 22:24 /dev/zvol/hddmirror/vm2/vm-100-disk-0 -> ../../../zd48
      485      0 lrwxrwxrwx   1 root     root           10 Apr 27 22:14 /dev/zvol/hddmirror/vm-100-disk-1 -> ../../zd16
      481      0 lrwxrwxrwx   1 root     root           10 Apr 27 22:14 /dev/zvol/hddmirror/vm-100-disk-0 -> ../../zd32
root@vmbox:~#

Code:
# lsblk
NAME               MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda                  8:0    0   16G  0 disk
├─sda1               8:1    0 1007K  0 part
├─sda2               8:2    0  512M  0 part
└─sda3               8:3    0 15.5G  0 part
  ├─pve-swap       253:0    0  1.9G  0 lvm  [SWAP]
  ├─pve-root       253:1    0  3.8G  0 lvm  /
  ├─pve-data_tmeta 253:2    0    1G  0 lvm 
  │ └─pve-data     253:4    0    6G  0 lvm 
  └─pve-data_tdata 253:3    0    6G  0 lvm 
    └─pve-data     253:4    0    6G  0 lvm 
sr0                 11:0    1 1024M  0 rom 
zd16               230:16   0    5G  0 disk
zd32               230:32   0    5G  0 disk
zd48               230:48   0    5G  0 disk

If I had created an LXC, would I be able to see the contents (if it is a file level and not a block level storage)?

This is fascinating!


Thank You.
zecas
 
Yes. And a dataset used to store the LXC data would be named "subvol-100-disk-0" instead of "vm-100-disk-0".
If I had created an LXC, would I be able to see the contents (if it is a file level and not a block level storage)?
Yup.
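For example (the container ID here is hypothetical), the contents of an LXC root disk can be browsed directly on the host:

Code:
zfs list -t filesystem
ls /hddmirror/vm2/subvol-101-disk-0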
 
2x Samsung SSD 860 PRO 250Gb disks on mirror
Hopefully not with ZFS. Those drives are not the best for server use.

Basically, I create a dataset on the zfs pool, then add that dataset as a zfs storage. For proxmox then, that dataset will be treated as a zfs pool root by itself (since it was added directly as a zfs storage).
You have to distinguish what a pool is and what the datasets and zvols are.

Storage definitions are as follows:

I find this extremely overcomplicated. What is wrong with the default structure that PVE creates on install?
What is your goal in having multiple VM storages?
Why do you have multiple iso storages?
As others have already said, having vzdump backups on the same ZFS is totally useless. I directly use zfs send/receive to get regular snapshots off the machine and store them as a backup.
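The basic pattern is something like this (the source dataset, target host and target pool are placeholders):

Code:
# first run: full replication of the VM dataset
zfs snapshot -r yourpool/vm@rep1
zfs send -R yourpool/vm@rep1 | ssh backuphost zfs receive -F tank/vm

# later runs: incremental, only blocks changed since the previous snapshot are sent
zfs snapshot -r yourpool/vm@rep2
zfs send -R -i @rep1 yourpool/vm@rep2 | ssh backuphost zfs receive -F tank/vm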
 
Hopefully not with ZFS. Those drives are not the best for server use.

You got me worried there. My intention was to use those Samsung SSD 860 PRO 250GB disks as a ZFS raid1 for the Proxmox installation.

Proxmox will create an LVM that can store VMs, but my intention for those SSDs is to use them purely for the Proxmox OS. No VMs will be stored there.

After that, I'll add 4x HGST SAS mechanical disks for a single zfs pool (raid10), where I'll store the VMs and ISOs (on distinct datasets).

From the information and tutorials I found, I understand that ZFS would greatly reduce the lifespan of a consumer-grade SSD, but from what I saw, enterprise SSDs cost a lot more; that's the main reason for the choice of those 860 PROs.

From that information and advice, it seemed that using this type of SSD solely for the Proxmox installation would not pose any problem.

Maybe I'm wrong after all?!?!??


You have to distinguish what a pool is and what the datasets and zvols are.


I find this extremely overcomplicated. What is wrong with the default structure that PVE creates on install?
What is your goal in having multiple VM storages?
Why do you have multiple iso storages?
As others have already said, having vzdump backups on the same ZFS is totally useless. I directly use zfs send/receive to get regular snapshots off the machine and store them as a backup.

The configurations I posted were just a means for me to test the storage configuration options available.

My intentions are to keep it as simple as possible:
1. 2x SSDs for proxmox OS only, 4x mechanical for zfs data VMs and ISOs;
2. The zfs pool for data will run on mechanical disks and will have: 1 dataset for VMs + 1 dataset for ISOs;
3. Backups will not be on OS disks nor on zfs pool that stores VMs and ISOs.

The Proxmox install will provide a storage where I can put VMs, and I can add a directory storage for ISOs (for example). But since that would be running on consumer-grade SSDs, my intention was to separate the data from the Proxmox install by adding a distinct ZFS pool (raid10) for the job. For the time being I'm going with mechanical disks, since I don't think they will be a bottleneck, until one day I need more space and take the chance to go with enterprise-grade SSDs (hopefully prices will help that upgrade by then :)).

As for backups, I thought of adding specific disks for that storage, but I'll look into the prospect of putting another machine to use with Proxmox Backup Server. Whatever solution I go with for backups, I'm still keeping the idea of replicating the ZFS to a remote ZFS instance (zfs send/receive of snapshots). I have no experience with it, but maybe zfs send/receive will be quicker (only changed/new blocks are sent) than vzdump backups.


Please keep posting your options; I learn the most from other people's ideas and experiences :).


Thank you.
zecas
 
From that information and advice, it seemed that using this type of SSD solely for the Proxmox installation would not pose any problem.
Unfortunately, especially for PVE itself, this poses a problem. PVE writes a lot, so you have to monitor your wearout. The Samsung Pros are better than any other consumer SSD, but still not as good as an enterprise SSD. You will also throw away 240 GB of space per disk (PVE itself is very small), so this is not a good "bang for the buck".

The better solution IMHO would be to buy two used enterprise SSDs (e.g. 120 GB or more) and use them as special devices for the one and only zpool over all the disks you have. The SSDs will store the metadata and your PVE install, and the spinners will hold the data (but only the data; the metadata goes on the SSDs). You can then also put small datasets on the SSDs if you want, for virtual OS disks or special-purpose stuff. If you have a lot of sync writes, and you don't have two PCIe slots available for using two 16GB Optane disks as SLOG, I recommend partitioning the enterprise SSDs so that you have e.g. 5GB at the end of both and putting a mirrored SLOG on those two partitions, so that sync writes will be very fast.
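A rough sketch of what I mean (the pool and device/partition names are placeholders; the small partitions at the end of the SSDs would be the SLOG ones):

Code:
# add the two SSD partitions as a mirrored special vdev (metadata) to the existing pool
zpool add tank special mirror /dev/disk/by-id/ssd1-part3 /dev/disk/by-id/ssd2-part3

# optional: a small mirrored SLOG from ~5GB partitions at the end of both SSDs
zpool add tank log mirror /dev/disk/by-id/ssd1-part4 /dev/disk/by-id/ssd2-part4

# optionally also let small blocks of a dataset land on the special vdev
zfs set special_small_blocks=16K tank/vm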
 
From that information and advice, it seemed that using this type of SSD solely for the Proxmox installation would not pose any problem.

Maybe I'm wrong after all?!?!??
Just for PVE as system disks they should be fine. But keep in mind that those SSDs are still consumer SSDs with no power-loss protection, so you'd better make sure not to run into any power outages if you don't want to risk corrupting your system disks (see the hundreds of threads here that people open with titles like "proxmox won't boot anymore after power outage"), because they will lose all async data in the RAM write cache.
The Proxmox install will provide a storage where I can put VMs, and I can add a directory storage for ISOs (for example). But since that would be running on consumer-grade SSDs, my intention was to separate the data from the Proxmox install by adding a distinct ZFS pool (raid10) for the job. For the time being I'm going with mechanical disks, since I don't think they will be a bottleneck, until one day I need more space and take the chance to go with enterprise-grade SSDs (hopefully prices will help that upgrade by then :)).
They can easily be a bottleneck. HDDs usually only have around 100 IOPS or something in that range. With 4 disks in a raid10 you get about 200 IOPS. On top of that, ZFS has a lot of write amplification. In general you really want enterprise SSDs for your VM storage.
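If you want a ballpark number for your own disks, a small random write test with fio gives an idea (the path and sizes are just examples, and results on ZFS are skewed by caching, so treat it as a rough indication only):

Code:
fio --name=randwrite --filename=/hddmirror/vm/fio-test --size=1G \
    --rw=randwrite --bs=4k --ioengine=libaio --iodepth=32 \
    --runtime=60 --time_based --end_fsync=1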
 
> "PVE writes a lot"

Please elaborate LnxBil, it would be great to have details.
Context of personal interest: "OS / data separation", especially for VMs.
See Debian Bookworm Root on ZFS for how ZFS can partition the filesystem; if we know where most of the writes go and use such a setup, then we can control disk wear.
/cc Dunuin
 
The two main locations where a lot of writes happen are
  • the PVE internal database in /var/lib/pve-cluster, and
  • RRD graph database (all graphs in PVE for cpu, ram, network, disk) in /var/lib/rrdcached

So, to compartmentalize the disk wear, it would make sense to separate
  • the PVE internal database in /var/lib/pve-cluster, and
  • RRD graph database (all graphs in PVE for cpu, ram, network, disk) in /var/lib/rrdcached
  • And the logs in /var/log
from the rest of the OS to separate zfs volumes on a separate disk as shown in Debian Bookworm Root on ZFS installation, right?
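Concretely, I imagine something like this (the pool and dataset names are placeholders, and the existing directories would have to be moved over with the services stopped):

Code:
zfs create -o mountpoint=/var/lib/pve-cluster rpool/pve-cluster
zfs create -o mountpoint=/var/lib/rrdcached   rpool/rrdcached
zfs create -o mountpoint=/var/log             rpool/var-log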

Related question https://forum.proxmox.com/threads/working-with-zfs-virtual-disks.120276/post-610889.
 
So, to compartmentalize the disk wear, it would make sense to separate
  • the PVE internal database in /var/lib/pve-cluster, and
  • RRD graph database (all graphs in PVE for cpu, ram, network, disk) in /var/lib/rrdcached
  • And the logs in /var/log
from the rest of the OS to separate zfs volumes on a separate disk as shown in Debian Bookworm Root on ZFS installation, right?
In theory, yes.
I'd never go down that route though, and would just buy two small enterprise SSDs.
 
