VM Snapshots: Ceph with ZFS?

Jan 13, 2025
I've tried searching the forum and found lots of similar questions, but nothing that directly answers my use case.

We have primarily been a Hyper-V shop, but our SAN recently died and we decided to upgrade our kit and go the Proxmox + Ceph route. We have a 3-node cluster, each node with the following setup:

CPU: 64 x Intel(R) Xeon(R) Gold 6246R CPU @ 3.40GHz (2 Sockets)
RAM: 1TB per node
Storage: 2x 960GB SSDs configured as a software ZFS RAID1 (mirror) during installation for the Proxmox boot drives
Ceph storage: Sonnet M.2 8x4 PCIe 4.0 cards, each with 8x 4TB Samsung 990 Pro NVMe drives installed (96TB in total across the cluster)
Network:
  • 10Gb for the cluster network
  • 100Gb dedicated network for Ceph storage
  • 25Gb network bridged to the VMs
We're still very much in the testing phase before we start moving production onto Proxmox. We've successfully converted a couple of Hyper-V VMs to RAW and imported them as Proxmox VMs (and dealt with the usual VirtIO pain for the Windows guests), and imported a couple of test VMs from ESXi (which was far easier). We are now thinking about the backup mechanism for the VMs, which led me down the RAW vs QCOW2 and Ceph vs ZFS rabbit hole, and I am a little confused. We will be using Bacula as our backup solution going forward.
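
(For reference, the conversion/import went something like the following; the VM ID, file names and storage name are placeholders rather than our exact ones.)

# Convert a Hyper-V disk to raw (qemu-img ships with Proxmox VE)
qemu-img convert -p -f vhdx -O raw server01.vhdx server01.raw
# Attach the raw image to an existing VM, placing it on the Ceph storage
qm importdisk 101 server01.raw datastore01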

My understanding is that RAW is better for performance but doesn't have the feature set of QCOW2, such as snapshotting, so the VM would have to be shut down to back it up? Is that correct? It seems mad in this day and age to have to take a server offline to back it up. RAW also wouldn't allow incremental backups, right?

This then led me to start reading about ZFS, which appears to be a better solution, but the more I read, the more confused I get. ZFS isn't a file format, so I still need to decide on RAW vs QCOW2 for the VM disk format, and I guess I need to decide between Ceph and ZFS for the file system the VM disks are stored on, as that's the only real comparison Google seems to offer.

Can you use ZFS with Ceph? Should it be done this way, or am I just losing my mind?

Could someone please help me understand the above in simple terms so I can wrap my head around this?
 
Please keep in mind that many storage options in Proxmox VE don't use files at all, but expose the disk as a block device (similar to a physical disk) which is then presented to the VM as its disk.

Therefore, the question "QCOW2 or not" doesn't even come up on these storage options (LVM, ZFS, Ceph, ...).

If the Proxmox VE integrated snapshots are used (qm snapshot), they will also trigger the fsfreeze & thaw commands in the guest agent, which tell the guest to flush its caches down to disk. This is the same as when the integrated backups are used.
The snapshots are created on the storage layer, so you can use the storage-specific tools to access them, and depending on the storage layer it is also possible to get an incremental diff against a previous snapshot.
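
As a rough illustration on an RBD-backed disk (VM ID, pool and snapshot names are just examples):

# Take a Proxmox VE snapshot (triggers fsfreeze/thaw via the guest agent if installed)
qm snapshot 101 before_update
# The snapshot also shows up on the Ceph layer and can be inspected with the rbd tools
rbd snap ls datastore01/vm-101-disk-0
# Export only the blocks changed between two snapshots (assumes a later snapshot "after_update" exists)
rbd export-diff --from-snap before_update datastore01/vm-101-disk-0@after_update diff.bin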

But instead of rolling your own backups, which will be brittle, check whether Bacula supports Proxmox VE. Otherwise, Proxmox VE comes with a simple backup option out of the box, and in combination with the Proxmox Backup Server you get fast incremental and deduplicated backups too, no matter what kind of storage the VM sits on.
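
For example, a single snapshot-mode backup from the CLI could look like this (VM ID and storage name are examples):

# Back up VM 101 to a configured backup storage while it keeps running
vzdump 101 --storage backup-nfs --mode snapshot --compress zstd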

I hope this helps to clear up some misunderstandings :)
 
Can you use ZFS with Ceph? Should it be done this way, or am I just losing my mind?
Hi @leedys90, welcome to the forum.

You have to be more specific about how you imagine combining these two solutions. For example:
Ceph/RBD (raw) > disk image (virtual disk) > VM > raw disk inside the VM > ZFS: this is fine. It does not matter which file system you use inside the VM.
Ceph/RBD (raw) > ZFS > ...: this will not work.

Based on your description, it seems that you are looking to build a PVE cluster with High Availability, storage included. In this case you should drop ZFS from your consideration. It is not suitable for your particular needs. Stick with Ceph and continue your research and education on it.

Good luck


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Please keep in mind that many storage options in Proxmox VE don't use files at all, but expose the disk as a block device (similar to a physical disk) which is then presented to the VM as its disk.

Therefore, the question "QCOW2 or not" doesn't even come up on these storage options (LVM, ZFS, Ceph, ...).

If the Proxmox VE integrated snapshots are used (qm snapshot), they will also trigger the fsfreeze & thaw commands in the guest agent, which tell the guest to flush its caches down to disk. This is the same as when the integrated backups are used.
The snapshots are created on the storage layer, so you can use the storage-specific tools to access them, and depending on the storage layer it is also possible to get an incremental diff against a previous snapshot.

But instead of rolling your own backups, which will be brittle, check whether Bacula supports Proxmox VE. Otherwise, Proxmox VE comes with a simple backup option out of the box, and in combination with the Proxmox Backup Server you get fast incremental and deduplicated backups too, no matter what kind of storage the VM sits on.

I hope this helps to clear up some misunderstandings :)
Damn, I'm amazed at how fast this thread got a reply!

Okay, so I think I understand. You're saying that if we're using Ceph, it doesn't matter whether the images are raw or qcow2?

Bacula offers two plugins for Proxmox VE. The Proxmox plugin states:

  • Only Full level backups are possible. This is a Proxmox limitation, as its API does not provide methods suitable for other backup levels. This limitation is described in detail in the Features chapter, which also describes another module, QEMU, that is free of that limitation.
and the QEMU plugin, which states:

  • Differential backup level is not yet supported. Only Full and Incremental backup levels are supported. This limitation will be removed in the future.
So by the sounds of it we'd be better off using the QEMU plugin to back up the VMs and their config, but does QEMU care whether the disk format is .qcow2 or .raw?

The business is pretty focused on using Bacula as our one-stop shop for backups and restores, but could you tell me: if I were to go down the Proxmox integrated backup route, do I need a Proxmox Backup Server, or could I just point Proxmox at our 1.8PB TrueNAS and have it run the backups? Also, does the Proxmox backup care about the VM disk format, or offer more options for either format?
 
Okay, so I think I understand. You're saying that if we're using Ceph, it doesn't matter whether the images are raw or qcow2?
Yes, because you cannot choose. It will always be shown as "raw", but it is not stored as a file; it lives in RBD, Ceph's block device layer, which supports snapshots and similar features.
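
For example (the storage and pool names here are just examples):

# Disk images on an RBD-backed storage always report the "raw" format
pvesm list ceph-vm
# The same disks exist as RBD images in the pool, e.g. vm-101-disk-0
rbd -p datastore01 ls
rbd -p datastore01 info vm-101-disk-0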

I cannot speak about the Bacula backup plugins due to lack of experience. We are working on a public backup API though that should make it a lot easier for 3rd parties to make use of the same backup methods as Proxmox VE itself does. But I cannot say when it will be ready.

The Proxmox Backup Server is its own standalone machine. Ideally, to get the best performance, it runs bare-metal with SSDs for storage (deduplication results in a lot of small random IO).
But we know of users/customers who run it in a VM and point it at a network share. You could run it as a VM on the TrueNAS, for example, so that it is still available even if the Proxmox VE cluster is dead.

Keep in mind that the Backup Server is also based on Debian Linux. Adding a network share isn't offered directly via its tooling, but you can mount one with the default Linux tools, such as /etc/fstab or systemd mount units.
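
A minimal sketch, assuming an NFS export on the TrueNAS (share path and datastore name are examples):

# /etc/fstab entry on the Backup Server
truenas.example.com:/mnt/tank/pbs  /mnt/pbs-store  nfs  defaults,_netdev  0  0
# Register the mounted path as a datastore
proxmox-backup-manager datastore create truenas-store /mnt/pbs-store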
 
Hi leedys90, welcome to the forum.

You have to be more specific about how you imagine combining these two solutions. For example:
Ceph/RBD (raw) > disk image (virtual disk) > VM > raw disk inside the VM > ZFS: this is fine. It does not matter which file system you use inside the VM.
Ceph/RBD (raw) > ZFS > ...: this will not work.
I'm not sure I understand your first example:

We have Ceph/RBD configured as a pool across all our OSDs, and present this as the storage location for our VM disks, which is similar to your first example, but I don't think we have "raw disk inside VM > ZFS" configured?

Ah wait, do you mean Ceph/RBD > VM storage on the RBD pool > VM guest OS running ZFS?

Based on your description, it seems that you are looking to build a PVE cluster with High Availability, storage included. In this case you should drop ZFS from your consideration. It is not suitable for your particular needs. Stick with Ceph and continue your research and education on it.

Thank you for a nice, direct answer. Our goal, as you point out, is to have an n+2 cluster (after we've added some more hosts) with HA storage, but also to be able to back up the VMs (ideally incrementally, to save space) to our 1.8PB NAS, which will then be replicated off-site.

Yes, because you cannot choose. It will always be shown as "raw", but it is not stored as a file; it lives in RBD, Ceph's block device layer, which supports snapshots and similar features.

I cannot speak about the Bacula backup plugins due to lack of experience. We are working on a public backup API though that should make it a lot easier for 3rd parties to make use of the same backup methods as Proxmox VE itself does. But I cannot say when it will be ready.

The Proxmox Backup Server is its own standalone machine. Ideally, to get the best performance, it runs bare-metal with SSDs for storage (deduplication results in a lot of small random IO).
But we know of users/customers who run it in a VM and point it at a network share. You could run it as a VM on the TrueNAS, for example, so that it is still available even if the Proxmox VE cluster is dead.

Keep in mind that the Backup Server is also based on Debian Linux. Adding a network share isn't offered directly via its tooling, but you can mount one with the default Linux tools, such as /etc/fstab or systemd mount units.

Okay, awesome, thank you again for your clear answers. I will stick with our current setup of Ceph/RBD and see what Bacula can offer in terms of support, knowing that if all else fails we can use the Proxmox Backup Server to get the flexibility to restore VMs/config. Our TrueNAS has a couple of NVMe SSDs for the OS, so I will see if we can get a Proxmox Backup Server hosted on it. Thanks again Aaron.
 
Yes. It may or may not be a good idea, but it really does not (and should not) matter to the hypervisor administrator what the tenant is doing inside their VM.


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
Slightly off-topic, but you both seem to know your stuff, so you're probably the best people to answer this. Is the following config optimal for my setup?

root@mepprox01:~# ceph df
--- RAW STORAGE ---
CLASS     SIZE     AVAIL    USED      RAW USED   %RAW USED
nvme      87 TiB   87 TiB   663 GiB   663 GiB         0.74
TOTAL     87 TiB   87 TiB   663 GiB   663 GiB         0.74

--- POOLS ---
POOL          ID   PGS   STORED    OBJECTS   USED      %USED   MAX AVAIL
datastore01    5    32   212 GiB    56.24k   635 GiB    0.75      27 TiB
.mgr           6     1   3.2 MiB         2   9.6 MiB       0      27 TiB

I get why I only have 27 TiB available, because we've chosen to keep 3 copies of our data in case of a single node failure, but my question is really about the number of PGs.
 
So you only have one pool for now (we can ignore the Ceph-internal .mgr). Set the "target_ratio" value to 1. If you edit the pool in the web UI, make sure the "Advanced" checkbox next to the OK button is enabled.

This tells the autoscaler (which calculates the number of PGs for you) how much space you expect the pool to consume in the end. It is a weight, so if you had another pool and set the target_ratio of both to 1, the autoscaler would calculate the PGs for each pool with the assumption that each is expected to consume roughly 50% of the space.

So if you have multiple pools, I would opt for values between 0.0 and 1.0, or between 0 and 100. This way you can map the ratios more closely to percentages.
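
For reference, the same can be done on the CLI; the pool name below is the one from your output, and the web UI "Target Ratio" maps to Ceph's target_size_ratio setting:

# Set the expected consumption weight for the pool
ceph osd pool set datastore01 target_size_ratio 1
# See what the autoscaler currently considers optimal
ceph osd pool autoscale-status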

Keep in mind that the autoscaler, when set to ON for the pool, will only change the PGs on its own if the current and optimal number of PGs differ by a factor of 3 or more. If the difference is smaller (e.g. 32 -> 64, only a factor of 2), it will warn you, but you have to change it manually in the pool settings.

One more thing: should you plan to use different device classes [0], those ratios are calculated within a device class. In that case, all pools (including .mgr) need to be assigned to a specific device class to avoid any overlap that would hinder the autoscaler in its calculation.
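
As a rough sketch, should you ever add a second device class (the rule name below is just an example):

# Create a CRUSH rule restricted to the nvme device class and assign pools to it
ceph osd crush rule create-replicated replicated_nvme default host nvme
ceph osd pool set datastore01 crush_rule replicated_nvme
ceph osd pool set .mgr crush_rule replicated_nvme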


If you want to do those calculations more manually, the Ceph docs have a calculator [1].

[0] https://docs.ceph.com/en/latest/rados/operations/pgcalc/
[1] https://docs.ceph.com/en/latest/rados/operations/pgcalc/
 
So you only have one pool for now (we can ignore the Ceph-internal .mgr). Set the "target_ratio" value to 1. If you edit the pool in the web UI, make sure the "Advanced" checkbox next to the OK button is enabled.

This tells the autoscaler (which calculates the number of PGs for you) how much space you expect the pool to consume in the end. It is a weight, so if you had another pool and set the target_ratio of both to 1, the autoscaler would calculate the PGs for each pool with the assumption that each is expected to consume roughly 50% of the space.

So if you have multiple pools, I would opt for values between 0.0 and 1.0, or between 0 and 100. This way you can map the ratios more closely to percentages.

Keep in mind that the autoscaler, when set to ON for the pool, will only change the PGs on its own if the current and optimal number of PGs differ by a factor of 3 or more. If the difference is smaller (e.g. 32 -> 64, only a factor of 2), it will warn you, but you have to change it manually in the pool settings.

One more thing: should you plan to use different device classes [0], those ratios are calculated within a device class. In that case, all pools (including .mgr) need to be assigned to a specific device class to avoid any overlap that would hinder the autoscaler in its calculation.


If you want to do those calculations more manually, the Ceph docs have a calculator [1].

[0] https://docs.ceph.com/en/latest/rados/operations/pgcalc/
[1] https://docs.ceph.com/en/latest/rados/operations/pgcalc/
Hey Aaron,

The PGs are now showing as 1024, which was the optimal number Proxmox suggested. I did have to keep going back into the UI and adjusting the value a few times before it actually reached 1024, but it got there eventually.

Now that everything is optimal, what is the best method for testing IO speeds on the NVMe Ceph storage? I've read of some people using fio and others simply using CrystalDiskMark, and wanted your and bbgeek17's opinions based on our setup. I don't know a great deal about the correct parameters to use with fio.

And to both of you, thank you for all the support thus far.
 
Depends on what layer you want to benchmark ;)
FIO has a Ceph RADOS backend AFAIK.

If you check the 2023 Ceph Benchmark whitepaper (pinned post), you can use `rados bench` to benchmark the Ceph object layer. If you run tests within a VM, I would opt for fio. Just keep in mind that you are then testing the combination of Ceph, QEMU->Ceph or kernel->Ceph (depending on whether KRBD is enabled in the storage config), the QEMU disk type (scsi, sata, ...) and the guest OS.
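
A minimal rados bench run against your pool could look like this (pool name taken from your output, durations are just examples):

# 30 second write test, keeping the objects for the read test
rados bench -p datastore01 30 write --no-cleanup
# Sequential read test against the objects written above
rados bench -p datastore01 30 seq
# Remove the benchmark objects afterwards
rados -p datastore01 cleanup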

So there are quite a few knobs to adjust in case you don't see results you are okay with ;)

If you can run some actual workloads and see whether they perform as expected, you will get the most reliable results. Because, as @alexskysilk mentioned, do you know how to replicate your workload in the settings of a synthetic benchmark? (almost no one does ;) )
 
almost no one does
Ah, I think this observation is for a specific audience. Among the "homelab" crowd, people are typically not aiming at an actual workload and are just looking to generate hero numbers, because reasons. Under those conditions there simply isn't a workload to simulate; just run a large-block sequential read test and call it a win ;)

Simulating a workload is actually pretty simple: if you understand the IO patterns of your applications (e.g. MySQL) you can either use fio to generate a similar pattern or use an app-specific benchmark, e.g. sqlbench. If you don't, just deploy the app and use a stopwatch, lol.
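
As a rough, hedged example of such a database-like pattern with fio (file path, size and the 70/30 mix are assumptions to adapt to your app):

# 70/30 random read/write, 16k blocks, direct I/O, 4 jobs for 60 seconds
fio --name=db-sim --filename=/mnt/test/fio.dat --size=10G --direct=1 \
    --rw=randrw --rwmixread=70 --bs=16k --ioengine=libaio --iodepth=16 \
    --numjobs=4 --runtime=60 --time_based --group_reporting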
 
Wrong question. You should be asking "how do I benchmark storage for MY USE?", which of course begs the next question: what do you intend to DO WITH IT?
Wow, still amazed at how active and responsive these forums are!

To answer your question, the storage will be used to host a variety of virtual machines, ranging from Windows DCs, SQL servers, SCCM and PAM auth servers for BeyondTrust ECMs, to a number of bespoke scientific application servers running on RHEL and Ubuntu Server that might use MySQL, Postgres, pm2, Docker or shiny-server to host applications internally.
 
Depends on what layer you want to benchmark ;)
FIO has a Ceph RADOS backend AFAIK.

If you check the 2023 Ceph Benchmark whitepaper (pinned post), you can use `rados bench` to benchmark the Ceph object layer. If you run tests within a VM, I would opt for fio. Just keep in mind that you are then testing the combination of Ceph, QEMU->Ceph or kernel->Ceph (depending on whether KRBD is enabled in the storage config), the QEMU disk type (scsi, sata, ...) and the guest OS.

So there are quite a few knobs to adjust in case you don't see results you are okay with ;)

If you can run some actual workloads and see whether they perform as expected, you will get the most reliable results. Because, as @alexskysilk mentioned, do you know how to replicate your workload in the settings of a synthetic benchmark? (almost no one does ;) )
Yeah, I think it was wishful thinking on my part that there might be a "one size fits all" command, but I didn't expect to be that lucky :p I'll give rados bench a go, assuming it can be run from any of the Proxmox hosts?

Every thread I found on fio used different parameters for the tests, which makes sense given what you've said. I'll dig into the man page and see if I can find more suitable parameters for our use cases.

We have migrated several Windows/Linux VMs over from our Hyper-V cluster + physical SAN (max IO is something like 150 Mbps, hence the upgrade to the HCI environment) and I'm extremely happy with the performance from a sysadmin point of view: all VMs are snappy, responsive and stable, and migrations to other nodes in the cluster are extremely quick. I guess I was hoping for a more formal way of proving "it's really fast" to my boss :D

Regarding your last question.... no, lol
 
Ah, I think this observation is for a specific audience. Among the "homelab" crowd, people are typically not aiming at an actual workload and are just looking to generate hero numbers, because reasons. Under those conditions there simply isn't a workload to simulate; just run a large-block sequential read test and call it a win ;)

Simulating a workload is actually pretty simple: if you understand the IO patterns of your applications (e.g. MySQL) you can either use fio to generate a similar pattern or use an app-specific benchmark, e.g. sqlbench. If you don't, just deploy the app and use a stopwatch, lol.
This is extremely useful, thank you. I will pick a few workloads and then research the correct fio parameters to test them effectively.
 
the storage will be used to host a variety of virtual machines
1. Do you have baseline performance metrics for your previous/existing deployment? In other words, do you have established acceptance criteria? Without one, how would you evaluate "fast/slow" on any results you achieve?
2. When benchmarking, consider the system in its totality: run sufficient threads on all nodes simultaneously to simulate a busy cluster (see the sketch after this list).
3. Consider your drive choice, both type and form factor. 4 per carrier makes drive replacement very difficult and time consuming, and requires taking down the node. Also, the 990 Pro is a consumer product without PLP; you CAN use them, but you really shouldn't. Consider the PM1700 instead, with proper enclosures, for production use. I imagine you're also aware of the limitations a 3-node deployment poses (no rebalance and limited self healing).
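
A rough sketch of point 2 with rados bench, run on every node at the same time (pool name, duration and thread count are examples):

# Distinct run names allow concurrent benchmarks from several clients
rados bench -p datastore01 60 write -t 32 --run-name $(hostname) --no-cleanup
# Afterwards, clean up on each node
rados -p datastore01 cleanup --run-name $(hostname)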
 
So, first test results using rados bench on the ceph RBD pool:

write:

  • Total time run: 30.0342
  • Total writes made: 12636
  • Write size: 4194304
  • Object size: 4194304
  • Bandwidth (MB/sec): 1682.88
  • Stddev Bandwidth: 69.7436
  • Max bandwidth (MB/sec): 1804
  • Min bandwidth (MB/sec): 1496
  • Average IOPS: 420
  • Stddev IOPS: 17.4359
  • Max IOPS: 451
  • Min IOPS: 374
  • Average Latency(s): 0.038002
  • Stddev Latency(s): 0.0177559
  • Max latency(s): 0.169057
  • Min latency(s): 0.0131105

read:

  • Total time run: 30.0369
  • Total reads made: 12161
  • Read size: 4194304
  • Object size: 4194304
  • Bandwidth (MB/sec): 1619.48
  • Average IOPS: 404
  • Stddev IOPS: 16.7454
  • Max IOPS: 435
  • Min IOPS: 360
  • Average Latency(s): 0.0388034
  • Max latency(s): 0.217903
  • Min latency(s): 0.00477545

This doesn't seem very fast given the maximum I/O figures quoted for the NVMe drives or the PCIe card. Again, I'm not trying to be one of those people posting their 30Gbps results; I just want to make sure I've not misconfigured something on my end.
 
This doesn't seem very fast given the maximum I/O figures quoted for the NVMe drives
I know you read the above comments, but I don't think you understood. What is this simulating? Why do you think it's fast, slow, or otherwise? What is the correlation to the "maximum I/O figures quoted for the NVMe drives"?

Understand that even when the marketing people publish numbers for the drives, those are for SPECIFIC BENCHMARKS ONLY, which may not (and probably don't) have any bearing on your application. To make matters worse, they hide the benchmark details in very small print.
 
1. Do you have baseline performance metrics for your previous/existing deployment? In other words, do you have established acceptance criteria? Without one, how would you evaluate "fast/slow" on any results you achieve?
2. When benchmarking, consider the system in its totality: run sufficient threads on all nodes simultaneously to simulate a busy cluster.
3. Consider your drive choice, both type and form factor. 4 per carrier makes drive replacement very difficult and time consuming, and requires taking down the node. Also, the 990 Pro is a consumer product without PLP; you CAN use them, but you really shouldn't. Consider the PM1700 instead, with proper enclosures, for production use. I imagine you're also aware of the limitations a 3-node deployment poses (no rebalance and limited self healing).
Hey Alex,

1. We did run some benchmarks, but only real-world copy speeds across VMs on the old cluster and CrystalDiskMark on a Windows host, so probably not accurate data by the sounds of it.
2. Is there a way to simulate a number of threads per node using the tools listed?
3. 4 per carrier? There are 8 drives in each PCIe card, one card per node. Sadly I learned about the benefits of PLP after we'd purchased the drives, but I will see whether we can get them returned and swapped for drives that do support PLP. I am aware of the limitations of a 3-node cluster in that we can only sustain a single node failure (I assume that's the self-healing part), but could you expand on the "no rebalance" comment? I was under the impression that if we lost a single node the Ceph storage would rebalance. We do intend to increase the cluster to 12 nodes to make use of n+2 in the near future.

Fortunately we're still in the PoC stage of this installation, so all of this information is extremely useful to get ironed out before we sign off on it as a replacement for our old cluster.
 