VM Snapshots: Ceph with ZFS?

leedys90

New Member
Jan 13, 2025
I've tried searching the forum and found lots of similar questions, but nothing that directly answers my use case.

We have primarily been a Hyper-V shop, but our SAN recently died and we decided to upgrade our kit and go the Proxmox + Ceph route. We have a 3-node cluster, each node having the following setup:

CPU: 64 x Intel(R) Xeon(R) Gold 6246R CPU @ 3.40GHz (2 Sockets)
RAM: 1TB per node
Storage: 2x 960GB SSDs configured as a software ZFS RAID1 (mirror) during installation for the Proxmox boot drives
Ceph storage: Sonnet M.2 8x4 PCIe 4.0 card with 8x 4TB Samsung 990 Pro NVMe drives per node (96TB in total)
Network:
  • 10 Gbit for the cluster network
  • 100 Gbit dedicated network for the Ceph storage
  • 25 Gbit network bridged to the VMs
We're still very much in the testing phase before we start moving production into Proxmox. We've successfully converted a couple of Hyper-V VMs to RAW and then imported them as Proxmox VMs (and dealt with the usual VirtIO pain for Windows VMs), and imported a couple of test VMs from ESXi (which was far easier). We are now thinking about the backup mechanism for the VMs, which led me down the RAW vs QCOW2, Ceph vs ZFS rabbit hole, and I am a little confused. We will be using Bacula as our backup solution going forward.
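For reference, the rough conversion steps we used were along these lines (VM ID, file names and target storage are just examples):

# convert the Hyper-V VHDX to a raw image with qemu-img
qemu-img convert -f vhdx -O raw server01.vhdx server01.raw

# attach the raw image to an already created Proxmox VM
qm importdisk 100 server01.raw <target-storage>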

My understanding is that RAW is better for performance but doesn't have the feature set of QCOW2, such as snapshotting, so the VM would have to be shut down to back it up? Is this correct? It seems mad in this day and age to have to take a server offline to back it up. RAW also wouldn't allow incremental backups, right?

This then led me to start reading into ZFS, which appears to be a better solution, but the more I read, the more confused I get. ZFS isn't a file format, so I still need to decide on RAW vs QCOW2 for the VM disk format, and I guess I also need to decide between Ceph and ZFS for the file system the VM files are stored on, as that's the only real comparison Google seems to offer.

Can you use ZFS with Ceph? Should it be done this way? Or am I just losing my mind?

Please could someone help me understand the above in simple terms so I can wrap my head around this.
 
Please keep in mind that many storage options in Proxmox VE don't use files at all, but expose a block device (similar to a physical disk) that is presented to the VM as its disk.

Therefore, the question of qcow2 or not doesn't even come up with these storage options (LVM, ZFS, Ceph, ...).

If the Proxmox VE integrated snapshots are used (qm snapshot), the fsfreeze & thaw commands are also triggered via the guest agent. They tell the guest to flush its caches down to disk. The same happens when the integrated backups are used.
The snapshots are created on the storage layer, so you can use the storage-specific tools to access them, and depending on the storage layer it is also possible to get an incremental diff against a previous snapshot.
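For example, on Ceph/RBD something along these lines works (the VM ID, snapshot names and pool are just placeholders):

# create a snapshot via Proxmox VE (triggers fsfreeze/thaw through the guest agent)
qm snapshot 100 before-update

# the same snapshot is then visible on the storage layer
rbd snap ls <pool>/vm-100-disk-0

# an incremental diff against a previous snapshot can be exported
# (assuming a later snapshot "after-update" exists)
rbd export-diff --from-snap before-update <pool>/vm-100-disk-0@after-update diff.bin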

But instead of rolling your own backups, which will be brittle, check if Bacula supports Proxmox VE. Otherwise, Proxmox VE comes with a simple backup option out of the box, and in combination with the Proxmox Backup Server you can get fast incremental and deduplicated backups too, no matter what kind of storage the VM is stored on.

I hope this helps to clear up some misunderstandings :)
 
Can you use ZFS with Ceph? Should it be done this way? Or am I just losing my mind?
Hi @leedys90, welcome to the forum.

You have to be more specific about how you are imagining combining these two solutions. For example:
Ceph/RBD (raw) > disk image (virtual disk) > VM > raw disk inside VM > ZFS: this is fine. It does not matter which filesystem you use inside the VM (see the example just below).
Ceph/RBD (raw) > ZFS > ...: this will not work.
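(Inside the guest that could be as simple as, for example, zpool create tank /dev/sdb, pointing at whatever device name the virtual disk gets; the device name here is just an example.)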

Based on your description, it seems that you are looking to build a PVE cluster with High Availability, storage included. In this case you should drop ZFS from your consideration. It is not suitable for your particular needs. Stick with Ceph and continue your research and education on it.

Good luck


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Please keep in mind that many storage options in Proxmox VE don't use files at all, but expose a block device (similar to a physical disk) that is presented to the VM as its disk.

Therefore, the question of qcow2 or not doesn't even come up with these storage options (LVM, ZFS, Ceph, ...).

If the Proxmox VE integrated snapshots are used (qm snapshot), the fsfreeze & thaw commands are also triggered via the guest agent. They tell the guest to flush its caches down to disk. The same happens when the integrated backups are used.
The snapshots are created on the storage layer, so you can use the storage-specific tools to access them, and depending on the storage layer it is also possible to get an incremental diff against a previous snapshot.

But instead of rolling your own backups, which will be brittle, check if Bacula supports Proxmox VE. Otherwise, Proxmox VE comes with a simple backup option out of the box, and in combination with the Proxmox Backup Server you can get fast incremental and deduplicated backups too, no matter what kind of storage the VM is stored on.

I hope this helps to clear up some misunderstandings :)
Damn, I'm amazed at how fast this thread got a reply!

Okay, so I think I understand, you're saying if we're using ceph, it doesn't matter whether the images are raw or qcow2?

Bacula provides two plugins for Proxmox VE: the Proxmox plugin, which states:

  • Only Full level backups are possible. This is a Proxmox limitation, as its API does not provide methods suitable for other backup levels. This limitation is described in detail in the Features chapter, which also describes another module, QEMU, that is free of that limitation.
and the Qemu plugin, which states:

  • Differential backup level is not yet supported. Only Full and Incremental backup levels are supported. This limitation will be removed in the future.
So by the sounds of it we'd be better off using the QEMU plugin to back up the VMs and their config, but does QEMU care whether the file format is .qcow2 or .raw?

The business is pretty focused on using Bacula as our one-stop shop for backups and restores, but could you tell me: if I were to go down the Proxmox integrated backup route, do I need a Proxmox Backup Server, or could I just point Proxmox at our 1.8PB TrueNAS and have it run the backups? Also, does the Proxmox backup care about the VM disk format, or does it offer more options for either format?
 
Okay, so I think I understand, you're saying if we're using ceph, it doesn't matter whether the images are raw or qcow2?
Yes, because you cannot choose. It will always be shown as "raw", but it is not stored as a file; it lives in RBD, Ceph's block device layer, which supports snapshots and such things.
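You can see this on both layers with something like the following (VM ID, disk slot and pool name are placeholders):

# the VM config references the RBD-backed storage, with the format shown as "raw"
# (assuming the disk is attached as scsi0)
qm config 100 | grep scsi0

# the disk itself exists as an RBD image in the pool, not as a file
rbd -p <pool> ls
rbd info <pool>/vm-100-disk-0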

I cannot speak about the Bacula backup plugins due to lack of experience. We are working on a public backup API though that should make it a lot easier for 3rd parties to make use of the same backup methods as Proxmox VE itself does. But I cannot say when it will be ready.

The Proxmox Backup Server is its own standalone machine. Ideally, for the best performance, it runs bare-metal with SSDs for storage (deduplication results in a lot of small random IO).
But we know of users/customers who run it in a VM and point it to a network share. You could run it as a VM on TrueNAS, for example, so that it is still available even if the Proxmox VE cluster is dead.

Keep in mind that the Backup Server is also based on Debian Linux. Adding a network share isn't offered directly via its tooling, but you can mount one via the standard Linux tools, such as /etc/fstab or systemd mount units.
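As a rough sketch (the NFS export, mount point and datastore name here are just examples), it could look like this:

# /etc/fstab entry for an NFS export from the NAS
192.168.1.50:/mnt/tank/pbs  /mnt/pbs-store  nfs  defaults,_netdev  0  0

# mount it and register it as a datastore on the Backup Server
mount /mnt/pbs-store
proxmox-backup-manager datastore create nas-store /mnt/pbs-store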
 
Hi leedys90, welcome to the forum.

You have to be more specific about how you are imagining combining these two solutions. For example:
Ceph/RBD (raw) > disk image (virtual disk) > VM > raw disk inside VM > ZFS: this is fine. It does not matter which filesystem you use inside the VM.
Ceph/RBD (raw) > ZFS > ...: this will not work.
I'm not sure I understand your first example:

We have Ceph/RBD configured as a pool across all our OSDs and present this as the storage location for our VM disks, which is similar to your first example, but I don't think we have the "raw disk inside VM > ZFS" part configured???

Ah wait, do you mean we have Ceph/RBD > VM storage on RBD Pool > VM Guest OS running ZFS?

Based on your description, it seems that you are looking to build a PVE cluster with High Availability, storage included. In this case you should drop ZFS from your consideration. It is not suitable for your particular needs. Stick with Ceph and continue your research and education on it.

Thank you for a nice, direct answer. Our goal, as you point out, is to have an n+2 cluster (after we've added some more hosts) with HA storage, but also to be able to back up the VMs (ideally incrementally, for space savings) to our 1.8PB NAS, which will then be replicated off-site.

Yes, because you cannot choose. It will always be shown as "raw", but it is not stored as a file; it lives in RBD, Ceph's block device layer, which supports snapshots and such things.

I cannot speak about the Bacula backup plugins due to lack of experience. We are working on a public backup API though that should make it a lot easier for 3rd parties to make use of the same backup methods as Proxmox VE itself does. But I cannot say when it will be ready.

The Proxmox Backup Server is its own standalone machine. Ideally, for the best performance, it runs bare-metal with SSDs for storage (deduplication results in a lot of small random IO).
But we know of users/customers who run it in a VM and point it to a network share. You could run it as a VM on TrueNAS, for example, so that it is still available even if the Proxmox VE cluster is dead.

Keep in mind that the Backup Server is also based on Debian Linux. Adding a network share isn't offered directly via its tooling, but you can mount one via the standard Linux tools, such as /etc/fstab or systemd mount units.

Okay, awesome, thank you again for your clear answers. I will stick with our current setup of Ceph/RBD and see what Bacula can offer in terms of support, knowing that if all else fails we can use the Proxmox Backup Server for the flexibility to restore VMs/config. Our TrueNAS has a couple of NVMe SSDs for the OS, so I will see if we can get a Proxmox Backup Server hosted on it. Thanks again, Aaron.
 
Ah wait, do you mean we have Ceph/RBD > VM storage on RBD Pool > VM Guest OS running ZFS?
Yes. It may or may not be a good idea, but it really does not (and should not) matter to the hypervisor administrator what the tenant is doing inside their VM.


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Yes. It may or may not be a good idea, but it really does not (and should not) matter to the hypervisor administrator what the tenant is doing inside their VM.

Slightly off-topic, but you both seem to know your stuff, so you're probably the best people to answer this. Is the following config optimal for my setup?

root@mepprox01:~# ceph df
--- RAW STORAGE ---
CLASS    SIZE     AVAIL    USED      RAW USED   %RAW USED
nvme     87 TiB   87 TiB   663 GiB   663 GiB    0.74
TOTAL    87 TiB   87 TiB   663 GiB   663 GiB    0.74

--- POOLS ---
POOL          ID   PGS   STORED    OBJECTS   USED      %USED   MAX AVAIL
datastore01    5    32   212 GiB    56.24k   635 GiB    0.75      27 TiB
.mgr           6     1   3.2 MiB         2   9.6 MiB       0      27 TiB

I get why I only have 27 TiB available, because we've chosen to keep 3 copies of our data in case of a single node failure, but I guess my question is really about the number of PGs.
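(If I understand the math correctly: 87 TiB raw x 0.95 full ratio / 3 replicas ≈ 27.5 TiB, which roughly matches the 27 TiB MAX AVAIL shown above.)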
 
So you only have one pool (we can ignore the Ceph internal .mgr) for now. Set the "target_ratio" value to 1. If you edit the pool in the web UI make sure that the "Advanced" checkbox next to the OK button is enabled.

This tells the autoscaler (which calculates the number of PGs for you) how much space you expect the pool to consume in the end. It is a weight, so if you had another pool and set the target_ratio of both to 1, the autoscaler would calculate the PGs for each pool assuming that both are expected to consume roughly 50% of the space.

So if you have multiple pools, I would opt to use values between 0.0 and 1.0, or between 0 and 100. This way, you can map the ratios closer to percentages.
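If you prefer the CLI, setting it for your pool would look roughly like this (using the pool name from your ceph df output):

# set the target ratio (equivalent to the advanced "Target Ratio" option in the web UI)
ceph osd pool set datastore01 target_size_ratio 1

# check what the autoscaler currently considers optimal
ceph osd pool autoscale-status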

Keep in mind that the autoscaler, when set to ON for the pool, will only change the PGs if the current and optimal number of PGs differ by a factor of 3 or more. If the difference is smaller, for example only a factor of 2 (32 -> 64), it will warn you, but you have to change it manually in the pool settings.

One more thing: should you plan to use different device classes [0], those ratios are calculated within a device class. Also, all pools (including the .mgr pool) need to be assigned to a specific device class to avoid any overlap that would hinder the autoscaler in its calculations.
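For example, a replicated rule limited to the nvme class and assigned to both pools could look roughly like this (the rule name is just an example):

# create a replicated CRUSH rule restricted to the nvme device class
ceph osd crush rule create-replicated replicated-nvme default host nvme

# assign it to all pools (including .mgr) so no pool spans multiple classes
ceph osd pool set datastore01 crush_rule replicated-nvme
ceph osd pool set .mgr crush_rule replicated-nvme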


If you want to do those calculations more manually, the Ceph docs have a calculator [1].

[0] https://docs.ceph.com/en/latest/rados/operations/crush-map/#device-classes
[1] https://docs.ceph.com/en/latest/rados/operations/pgcalc/
 