[SOLVED] Questions about ZFS and Migration/Backups

emilhozan

Member
Aug 27, 2019
Hey all,

I am writing this thread in regards to help understanding ZFS and its capabilities.

From my testing, I see that ZFS allows live migrations if you create a ZFS pool with the same name on all servers. Only on the first server do you check the "Add Storage:" option to allow cluster-wide use of this pool. You then go to UI > Datacenter > Storage > PoolName > Edit and add the corresponding servers to that pool.
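For anyone following along, the CLI equivalent would look something like this (the pool name "tank", the node names, and the device paths are just placeholders for my setup):

# on every node that should take part: create a local pool with the same name
zpool create tank mirror /dev/sdb /dev/sdc

# on one node only: register the storage cluster-wide, limited to the nodes that have the pool
pvesm add zfspool tank --pool tank --content images,rootdir --nodes node1,node2,node3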

Now, what I understand from this is that in order to migrate a VM with local disks, the storage pool must have the same name on both nodes. So this feature allows migration but also seems to limit the free storage capacity. For instance, I have 3x nodes in a cluster, each with a pool of the same name and its own disks. Each server offers ~10TB, but my overall free space is restricted.

Essentially I understand this as being a "mirrored RAID" of sorts, where we have server-level redundancy. Is this thought process correct so far?


I saw the option of using pve-zsync, but I don't see a way of migrating data that way, only syncing it. The issue is that I am unable to "Restore" the VM on the target server.

What exactly is the purpose of this, or is my understanding correct?


My end goal is twofold:
- to get a cluster set up that allows migration while not restricting my storage capacity.
- and to allow backups between two completely separate clusters
Any thoughts or ideas?
From my testing, migration is only possible the way I described above. Backups are done using pve-zsync, but still, how do we restore backed-up VMs?

I saw a good post about the migration process, and this is how I learned about not checking "Add Storage:" on the second and subsequent servers that the pool will be made up of. I tried a few other searches for an answer, but I need to cover my bases here and also reach out for help!
 
Also, I tried Ceph and loved its feature set, but it wasn't working as expected when used for production VMs.
 
Ceph has certain requirements to be usable, mainly its own fast network of 10 GBit or faster.

Live migration of VMs to a different storage is possible no matter what kind of storage it is. This is possible because the live copying of the disks is done via Qemu and is storage-type agnostic.
The target storage option in the GUI and CLI has to be set to use a different storage on the target node.
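For example, something along these lines should live-migrate a VM and copy its local disks onto a differently named storage on the target node (VM ID, node name, and storage name are placeholders):

qm migrate 100 node2 --online --with-local-disks --targetstorage otherpool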

If you want to have replication you need ZFS because the ZFS send/receive capability is used for it. In this case the storage needs to be named the same.
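For reference, the built-in replication can also be set up from the CLI; a minimal sketch (the job ID, target node, and schedule are only examples):

# replicate VM 100 to node2 every 15 minutes
pvesr create-local-job 100-0 node2 --schedule "*/15"
# check the state of all replication jobs
pvesr status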

Regarding redundancy I would recommend you read into how ZFS works. There are plenty of resources.

In short: at the lowest level ZFS is made of VDEVs. A VDEV is a collection of block devices. The redundancy of ZFS is done on this level. You can mirror block devices or have single to triple disk parity (raidz1-raidz3).

The VDEVs are collected in the pool. The pool will stripe the data equally among the VDEVs.

A pool can consist of different types of VDEVs but this is not recommended because different VDEV types have different characteristics regarding bandwidth and IOPS.

To have something resembling a RAID10 you would, for example, have a pool consisting of mirrored VDEVs.

ZFS has more specialized VDEVs to take over certain tasks and speed them up if placed on fast SSDs while the big bulk of data is stored on slow spinning HDDs.
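To make that concrete, a rough sketch of such a layout (pool name and device paths are placeholders):

# RAID10-like layout: the pool stripes across two mirror VDEVs
zpool create tank mirror /dev/sda /dev/sdb mirror /dev/sdc /dev/sdd

# optional special-purpose VDEVs on fast SSDs (log for sync writes, cache for reads)
zpool add tank log /dev/nvme0n1
zpool add tank cache /dev/nvme1n1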


Ceph, on the other hand, gets its redundancy at the cluster level. An analogy: instead of a mirrored RAID made of 3 disks, you have a cluster of 3 nodes which each store a copy of the data.

In a cluster of 3 or more nodes Ceph can be nice because it is a shared storage. If you live migrate a VM, only the RAM needs to be transferred to the new node.

Have a look at the official documentation regarding the requirements for each storage solution. Both Ceph and ZFS want the disks as raw as possible: direct SATA or through an HBA controller, ideally in IT mode.

If you want to run Ceph in a 3 node cluster and don't expect it to grow to more nodes you can omit a 10GBit or faster switch and build a meshed network. I think the wiki has an article about that.
 
Both ZFS and Ceph can be used to back up VM disks between clusters: ZFS with its send/receive mechanism; pve-zsync is a tool built around that.
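Under the hood that boils down to something like the following (dataset, host, and pool names are placeholders; pve-zsync automates these steps):

# snapshot the VM's dataset and send it to a host in the other cluster
zfs snapshot rpool/data/vm-100-disk-0@backup1
zfs send rpool/data/vm-100-disk-0@backup1 | ssh backuphost zfs recv backuppool/vm-100-disk-0

# later runs only need to send what changed since the last snapshot
zfs snapshot rpool/data/vm-100-disk-0@backup2
zfs send -i @backup1 rpool/data/vm-100-disk-0@backup2 | ssh backuphost zfs recv backuppool/vm-100-disk-0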

Ceph has the rados gateway which can mirror to another Ceph cluster. There should also be an article in the wiki on how to set this up.
 
Hey @narrateourale

Thanks for your response. Allow me to respond to your comments:

Ceph has certain requirements to be usable, mainly its own fast network of 10 GBit or faster.
We implemented a 10Gb backbone for Ceph specifically, using 5400 RPM drives.


Live migration of VMs to a different storage is possible no matter what kind of storage it is. This is possible because the live copying of the disks is done via Qemu and is storage-type agnostic.
The target storage option in the GUI and CLI has to be set to use a different storage on the target node.
I'm not sure I follow this. Can you give me an example of what you mean?
Migration worked fine for Ceph (we've done extensive testing and all worked well until we used production VMs), but we had to move away from Ceph due to our HDD pool making our entire cluster inaccessible at times. It was really weird and highly unexpected.

As for now, if I create a separate, uniquely named ZFS pool on each node (5x nodes) and create a VM on one, I cannot migrate that VM onto another node because the storage pool name is not the same. This is what prompted me to do some digging, and I then found a link describing that to migrate VMs, you need to create the same-named pool on each node but check "Add Storage:" on only one node to make it visible in UI > Datacenter > Storage. After configuring this on all five nodes with only one having that option enabled, I can then add the remaining four to that pool, and this permits migration.
The issue with this is that the total storage capacity doesn't aggregate; I am restricted to only one node's worth of capacity.


If you want to have replication you need ZFS because the ZFS send/receive capability is used for it. In this case the storage needs to be named the same.
You use replication via pve-zsync, correct?
I replicated a VM from one node that was permitting migrations (e.g., it was part of the UI > Datacenter > Storage pool) to another node that had its own, uniquely named ZFS pool. This worked fine, but I didn't see how I'd restore from this or anything to that effect. Is there a way that I can have my 5x nodes with unique ZFS pools (to allow full use of storage capacity), pve-zsync the data, and restore a VM?


Regarding redundancy I would recommend you read into how ZFS works. There are plenty of resources.

In short: at the lowest level ZFS is made of VDEVs. A VDEV is a collection of block devices. The redundancy of ZFS is done on this level. You can mirror block devices or have single to triple disk parity (raidz1-raidz3).

The VDEVs are collected in the pool. The pool will stripe the data equally among the VDEVs.

A pool can consist of different types of VDEVs but this is not recommended because different VDEV types have different characteristics regarding bandwidth and IOPS.

To have something resembling a RAID10 you would, for example, have a pool consisting of mirrored VDEVs.

ZFS has more specialized VDEVs to take over certain tasks and speed them up if placed on fast SSDs while the big bulk of data is stored on slow spinning HDDs.

All of this I am aware of. I did read quite a bit about ZFS when comparing it against Ceph. I have a ZFS pool with 4x HDDs in RAIDZ (RAID5-like, from what I understand) and a single-disk SSD pool. This is true for all 5 nodes.
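Roughly, the layout on each node looks like this (device paths are placeholders):

# 4x HDD with single parity, plus a separate single-disk SSD pool with no redundancy
zpool create hddpool raidz /dev/sda /dev/sdb /dev/sdc /dev/sdd
zpool create ssdpool /dev/sde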


Have a look at the official documentation regarding the requirements for each storage solution. Both Ceph and ZFS want the disks as raw as possible: direct SATA or through an HBA controller, ideally in IT mode.

This I have as well: disks passed through directly, with no hardware RAID.
 
Okay, wow, so I had this wrong.

For some reason I assumed my cluster-wide pool was limited in storage capacity, but my blunder was not realizing that each node individually manages its own pool; the storage entry allows cluster-wide access, but each node has its own space available.

In short, for whatever reason I expected the pool to aggregate the storage space of all nodes. In fact, each node manages its own pool and has that much space available.

Thanks for your answers and pardon my ignorance on this.
 
Ah ok, maybe I should have pointed out that Ceph is a clustered solution while ZFS is local only.
We implemented a 10Gb backbone for Ceph specifically, using 5400 RPM drives.
The slow HDDs might be a problem, not giving you enough IOPS to be performant.

If you did not configure the storage as shared you should be able to select the target storage when migrating a VM:
[Screenshot: the VM migration dialog with a "Target storage" selection]
You use replication via pve-zsync, correct?
You use the built-in Replication functionality. pve-zsync is useful if you want to regularly send a VM's disk to a node outside the cluster or to a different cluster / backup machine.
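A minimal pve-zsync sketch, in case it helps (host, pool, and job names are placeholders):

# recurring job: keep the last 7 snapshots of VM 100 on an external host
pve-zsync create --source 100 --dest backuphost:backuppool --name daily --maxsnap 7

# or a one-off run
pve-zsync sync --source 100 --dest backuphost:backuppool --verbose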
 
@narrateourale
No worries, thanks for following back up nonetheless.

The slow HDDs might be a problem, not giving you enough IOPS to be performant
I agree with the hard drives being the issue, but it was worse than that: the entire cluster slowed to a crawl, which affected managing VMs. Oddly, the VMs that we had solely on SSD pools were running fine, but we could only manage them (reboot, stop, etc.) sporadically.

In digging more, I wonder if this issue was perpetuated by the settings used when configuring the VMs themselves, and the use of templates.


If you did not configure the storage as shared you should be able to select the target storage when migrating a VM:
I'm not quite sure I follow this, nor the screenshot: how did you get those options? If I right-click or use the Migrate option at the top, I only get the option of selecting which node to migrate a VM onto.


You use the built-in Replication functionality. pve-zsync is useful if you want to regularly send a VM's disk to a node outside the cluster or to a different cluster / backup machine.
Ah, I see that they pretty much do the same thing, except that pve-zsync allows off-cluster replication whereas the built-in feature only supports syncing within the cluster.


My question still stands though. How do you go about restoring a synced VM?
- Assume VM 100 is running on Node 1
- VM 100's data is synced to Node 2 for backups
- Node 1 dies, how to restore VM 100 on Node 2?

I saw this link but was curious if there was additional insight. It seems kind of hacky too; not that it's bad, but it requires extra steps that I'm looking to avoid so the restoration process is quicker.
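If it helps, the hacky approach I mean looks roughly like this within a single cluster (node names and VM ID are only examples, and it assumes the replicated disk already exists on node2's pool):

# run on any node that still has quorum: hand the config over to node2
mv /etc/pve/nodes/node1/qemu-server/100.conf /etc/pve/nodes/node2/qemu-server/100.conf
# then start the VM on node2
qm start 100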
 
Oddly, the VMs that we had solely on SSD pools were running fine, but we could only manage them (reboot, stop, etc.) sporadically.
What is your network configuration? Do you have a dedicated physical interface for the Proxmox cluster communication (corosync)? Your problem kinda sounds like corosync might have issues. It does not need a lot of bandwidth but really likes low latency. If you have it on an interface that sees a lot of traffic, even in its own VLAN, it might suffer from congestion and the resulting higher latency.
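A quick way to sanity-check this is to look at the corosync link status and overall cluster state on one of the nodes, for example:

corosync-cfgtool -s
pvecm status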
- Node 1 dies, how to restore VM 100 on Node 2?
Have a look at HA. If you set it up with the built-in replication it should work as well.
I'm not quite sure I follow this, nor the screenshot: how did you get those options?
Did you try it on a container or a VM? It works only on VMs.
 
@narrateourale

What is your network configuration?
We have since moved away from that setup, but I believe we left corosync at its default. We used 10Gb for Ceph and migration, that much I remember for sure.


Have a look at HA. If you set it up with the built-in replication it should work as well.
This isn't really what I mean. I am familiar with HA, and that makes sense, cool. Here's my dilemma: say cluster 1 houses VMs that are regularly backed up to another cluster. If cluster 1 dies, so does HA, but cluster 2 still has my data. How do I restore that data?

I tried what that link suggested but no dice...


Did you try it on a container or a VM? It works only on VMs.
I tried this on a VM, we don't use LXCs.
 
Also, even for HA to work, I have to replicate the data first, correct?

I think maybe I'm just thinking about this all wrong. I see mentions of restoring VMs from backups but not from replications. Maybe this is my issue.
 
Hmm, let's talk about replication and HA within a cluster first. This is what Proxmox can do out of the box, either with shared storage (Ceph, NFS, Samba, iSCSI, ...) or with replication (ZFS), though the latter works best in a two-node cluster because AFAIK you can only replicate to one other node.

Live migration of a running VM that is replicated will not work, because AFAIU it cannot deal with an already replicated disk. From what I read, making that possible is on the roadmap though.

HA between two clusters is a manual step. I heard that multi-cluster management is on the roadmap, but I have no idea what that will include feature-wise.

Make sure you have the VM configs present in the second cluster as well. In case of failure you need to start them yourself.
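A rough sketch of what that manual failover could look like on the second cluster (paths and names are only examples; the disk lines in the config have to point at the storage that actually holds the synced dataset):

# copy the saved config into place on a node of the second cluster
cp /root/vm-configs/100.conf /etc/pve/qemu-server/100.conf
# adjust the storage names in the config if needed, then start the VM by hand
qm start 100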


Considering your need for HA and a cluster of more than two nodes, I would give Ceph another go and try to work out the kinks that keep your setup from running smoothly. Dealing with multiple nodes in a cluster and ZFS with replication is quite a hassle compared to having Ceph running. Ceph does not care which node takes over if one fails; with ZFS replication you will have one designated failover node for a VM...
 
@narrateourale
Hmm, let's talk about replication and HA within a cluster first. This is what Proxmox can do out of the box, either with shared storage (Ceph, NFS, Samba, iSCSI, ...) or with replication (ZFS), though the latter works best in a two-node cluster because AFAIK you can only replicate to one other node.
Correct, I tested HA with ceph when I initially got it set up. It worked great!
As for the ZFS two-node limitation, that's what I've read as well. You can only replicate data from the original host to one other node.


Live migration of a running VM that is replicated will not work, because AFAIU it cannot deal with an already replicated disk. From what I read, making that possible is on the roadmap though.
I just tested this. I was able to migrate only to another node, not to the node holding the replicated data. Bear in mind I have 5x nodes; node 4 had the VM and replicated to node 1. I was able to live migrate from node 4 to node 2:

qm migrate <VM ID> <Remote Hostname> --online --with-local-disks

**Note that live migration doesn't work with multiple disks on two different pools.

HA between two clusters is a manual step. I heard that multi-cluster management is on the roadmap, but I have no idea what that will include feature-wise.
We're working on an alternative solution for this.


Make sure you have the VM configs present in the second cluster as well. In case of failure you need to start them yourself.
What do you mean by this?
Will a sync suffice here for the "VM configs"? And how would I restore in such a case?

I can easily restore a VM from a backup via the GUI. In order to "restore" a VM from a replicated volume, the VM must first have been replicated, and then you mv the VM's config file within the cluster filesystem from one node's directory to another. It must work differently when looking at this across two separate clusters.


And this is the main point I am still trying to figure out. I'm currently working on finalizing some tests for this node. I'm going to move VMs onto it and get our second cluster configured with ZFS as well. That will allow for more testing.


Considering your need for HA and a cluster of more than two nodes, I would give Ceph another go and try to work out the kinks that keep your setup from running smoothly. Dealing with multiple nodes in a cluster and ZFS with replication is quite a hassle compared to having Ceph running. Ceph does not care which node takes over if one fails; with ZFS replication you will have one designated failover node for a VM...
If this were up to me, I would definitely reconsider ceph with adjustments.
 
