Proxmox offsite backup / cluster configuration

stevenwh
Mar 16, 2024
Hello all,
I have a small homelab server that I'm running Proxmox on. I also have an offsite machine that I use for backups of highly important data. That offsite machine was running ESXi for a while, but I recently did some work on it and switched it over to Proxmox as well. So I started looking into clustering for it. Right now I'm having problems getting the cluster to work. Every time I try to join the primary node's cluster, it adds the node but then deletes the node's directory and never finishes the job on the offsite server. And on the main server, an error pops up about not being able to find /etc/pve/nodes/offsite/pve-ssl.pem.

I'm just completely guessing here, but my guess is that it has to do with network communication. In my current test setup, the offsite machine is in the same location as the primary, but on a separate subnet that simulates the network it will be on offsite. Due to how it's configured, the offsite machine can reach the primary machine, but the primary machine can't see machines in that subnet. So I guess the question here is: in a traditional cluster configuration, are the nodes expected to have a site-to-site VPN established outside of the nodes, so they can communicate as if they are on the same network? Or are there certain ports / port forwarding I need to have configured to make it work? I haven't really deeply considered the networking side of things yet and need to do more research there. I'm also still figuring out what I would use, and how it would be configured, if I wanted, say, a domain that tries the primary server first and falls back to the offsite server if the primary is unreachable. Any pointers on where to start looking for that would of course be welcome as well.
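
For reference, from the little digging I've done in the docs so far, the ports involved seem to be roughly the below (please correct me if I'm wrong; the IP in the example rule is a placeholder):

Code:
# Ports PVE cluster nodes apparently need to reach each other on (double-check for your version):
#   UDP 5405-5412    corosync cluster traffic (very latency-sensitive)
#   TCP 22           SSH (node-to-node tasks, migration tunnel)
#   TCP 8006         web GUI / API
#   TCP 60000-60050  live migration
# e.g. a plain iptables rule letting the other node talk corosync to this one:
iptables -A INPUT -p udp --dport 5405:5412 -s 192.0.2.10 -j ACCEPT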

Next thing is, is this even the right setup for what I want to do?

I'll elaborate more on my goals for the offsite server. First and foremost, its primary duty is to back up my critical data. I currently have this configured via a scheduled cron job running an rsync pull, and this is working. On ESXi I had this set up by passing a SATA controller through to a TrueNAS VM, but since Proxmox supports ZFS directly, I figured I'd just do it as a cron job without the VM.
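
For concreteness, the cron job is just something along these lines (host, user, and paths here are made up):

Code:
# /etc/cron.d/offsite-pull -- pull critical data from the primary once an hour (all names are placeholders)
0 * * * * root rsync -aH --delete -e ssh backup@primary.example:/tank/critical/ /tank/backup/critical/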

A secondary task I would like to handle is backing up select VMs / containers from my primary server. The ability to spin up those VMs / containers on the offsite server if the main server is down / unreachable would be nice to have. But I don't really require it to be automated, as long as the backed-up data is at least somewhat current (within the last hour or so) and can be synced back to the primary server when it is available again. I'm not running any mission-critical VMs that require 99.999% uptime or anything like that, but if I could make it fully automated and transparent, that would be a nice bonus.

I may also run separate VMs / containers only on the primary or the offsite server, depending on various needs / desires. It's even possible I might one day find a reason to have a VM running on the offsite server with the primary server as its backup. For example, the offsite location has a faster upload speed, so if I wanted to run a public website or something, I might prefer to run it primarily off the offsite server to take advantage of that.

I had a test cluster set up previously, and one thing I noticed that I really did not like is my inability to do certain things on the primary server when the offsite server was unavailable, because of quorum. From what I've read so far, I think I can get around this by changing how many votes the primary machine gets, or by running a Raspberry Pi as a voting server as well. I haven't gotten back to playing with that yet since I'm having problems getting the cluster to actually function at the moment. Just figured I'd mention it here in case it's relevant.
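
(For anyone else hitting this: the workaround I've read about for doing things on the surviving node is to lower the expected votes, something like the below. I haven't verified it myself yet.)

Code:
# On the surviving node, temporarily tell corosync to expect only 1 vote so /etc/pve becomes writable again:
pvecm expected 1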

I don't think I have any real capability of running a reliable shared file system between both servers. I've started reading a little about Ceph to see if it would be useful for this configuration, but it's completely new to me and I haven't gotten into it much yet.

I'd welcome any advice on this setup. Just please keep in mind this is a homelab not an enterprise configuration. So some things might not be configured in an industry standard way. And I don't have a huge corporate budget to throw at this :)
 
Two-node clusters are always problematic, and latency to a remote location is also problematic for a cluster (especially with just two nodes). Maybe don't do clustering at all?
I run my main system on Proxmox and a second computer as well. Each has a PBS running in a container (which directly uses drives on its host) and they sync once or twice a week. I can easily restore a VM on either system, but I assume remote-migration would also work.
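
(If anyone wants to try it: I believe remote-migration means the qm remote-migrate command, which was still marked experimental last I checked. It looks roughly like this; the token, fingerprint, and IDs below are placeholders.)

Code:
# Migrate VM 100 to a node in another cluster, keeping ID 100 there (all values are placeholders):
qm remote-migrate 100 100 \
    'host=target.example,apitoken=PVEAPIToken=root@pam!mig=<secret>,fingerprint=<cert-fingerprint>' \
    --target-bridge vmbr0 --target-storage local-zfs --online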
 
So, I'm curious: what would it look like if I just did backups that are synced to the offsite machine to restore from? I was playing around with that some, and actual backups are way too large, as every backup is the entire disk. That is way too much data to transfer once an hour for this setup (it would be a few hundred gigs every hour)...

I looked at snapshots to see if incremental snapshots would work, and it does look a lot better, but it comes with a lot of limitations as well. A lot more configuration and scripting is required, since I would need to do a snapshot, then a zfs send and recv. And even then, that only copies the storage of the VM, none of the VM settings or anything.
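
Roughly what I mean, as far as I understand the zfs side (all names here are made up, and I haven't tested this end to end):

Code:
# Hypothetical names: VM 100 on rpool/data, offsite host reachable as "offsite".
SNAP="sync-$(date +%Y%m%d%H%M)"
zfs snapshot rpool/data/vm-100-disk-0@"$SNAP"
# Incremental send based on the last snapshot both sides already share:
zfs send -i rpool/data/vm-100-disk-0@last-sync rpool/data/vm-100-disk-0@"$SNAP" \
  | ssh offsite zfs recv -F rpool/data/vm-100-disk-0
# And the disk is only half of it -- the VM settings are a separate file:
scp /etc/pve/qemu-server/100.conf offsite:/etc/pve/qemu-server/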

I did move the offsite server onto the same network as the primary, just to see if I could get the cluster working. And while I did, it does not behave as I expected. I can set a replication schedule, which seems to automate the snapshot process, but it does not create a VM on the offsite server... I would've expected a container to be created there as well =/ I read that you can use High Availability to do this, so I tried playing around with it, but I guess I don't understand it either, because even though it says it's active, all it does is start the container automatically on the primary node. It still did not create the container on the offsite node, so I'm not sure how it would start the container on the offsite node if the primary were actually unreachable.

I suppose I could do a backup of the VMs / containers I want to sync over, restore the backup on the offsite server, and then start doing incremental snapshots and sending and receiving those. But that could create a mess for multiple reasons. For example, say the primary goes down while I'm out of town and I bring up the VM on the offsite server. If I have those incremental zfs snapshots automated, what happens when the primary comes back up? I could see it easily overwriting everything that was done on the offsite server during the downtime. It would require a much more sophisticated script that checks the status of the nodes and knows when it should do a snapshot and in which direction... essentially everything I would assume cluster replication already does haha. Then there is also the question of network differences that could get overwritten by the snapshots. I guess I can get around that by setting up identical virtual networks at both ends, maybe.

You mentioned using PBS in containers on both your servers. I had not looked at PBS yet. I'll try to look into it today to see what kind of solutions it provides that might be helpful here.
 
So, I'm curious: what would it look like if I just did backups that are synced to the offsite machine to restore from? I was playing around with that some, and actual backups are way too large, as every backup is the entire disk. That is way too much data to transfer once an hour for this setup (it would be a few hundred gigs every hour)...
PBS deduplicates the backups, so it will not be that much data, yet you would need to automatically restore the VMs on your recovery site daily to have them ready when you need them. ZFS would be better for that, yet it's a lot of manual work, as you already pointed out.
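
A rough sketch of such an automatic restore (storage name and IDs are invented, adapt to your setup):

Code:
# /etc/cron.d/nightly-restore -- restore the newest backup of VM 100 on the recovery node every night.
# "pbs-local" is an assumed PBS storage name; check the volid format with: pvesm list pbs-local
0 3 * * * root qmrestore "$(pvesm list pbs-local --vmid 100 | awk 'END{print $1}')" 100 --force 1 --storage local-zfs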

I looked at snapshots to see if incremental snapshots would work, and it does look a lot better, but it comes with a lot of limitations as well. A lot more configuration and scripting is required, since I would need to do a snapshot, then a zfs send and recv. And even then, that only copies the storage of the VM, none of the VM settings or anything.
There is ZFS replication built into PVE for this, but yes, it is a bit more setup and per-VM, yet you can do it over cluster boundaries.
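
On the CLI a replication job looks like this, with IDs assumed (the same can be configured in the GUI under Replication):

Code:
# Replicate guest 100 to node "offsite" every 15 minutes; "100-0" is the job id (<vmid>-<number>):
pvesr create-local-job 100-0 offsite --schedule '*/15'
pvesr status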
 
Just thinking through what would be required to script all the zfs stuff, and I think I understand why 2-node clusters are not very useful, and why HA warns that it might not work well without 3 voting nodes lol. You need that 3rd vote for times when one node is down due to internet issues or something, because a node never sees itself as down, so otherwise both nodes can think they should have the most recent copy of the data.

I know I've read of people just using a Raspberry Pi as a voting server, or something to that effect. But for that to really work and be reliable, it would also need to be on a 3rd separate network to properly handle the internet being down at one location or the other =/
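
(For reference, what I read about maps onto corosync's QDevice support; roughly the below, with the IP being a placeholder. The 3rd-network caveat above still applies.)

Code:
# On the Raspberry Pi (must NOT be a cluster node):
apt install corosync-qnetd
# On every PVE node:
apt install corosync-qdevice
# Then, from one node, register the Pi as the external tie-breaker vote:
pvecm qdevice setup 192.0.2.50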

I think if I write a script to do it, I would just figure out a way to create a lock file when a VM / container is started, and the zfs syncing would only send a snapshot if that lock file doesn't exist. That way, if my primary is down and I spin up the VM / container on the offsite server, I don't have to worry about a zfs sync overwriting new data on the offsite server if the primary comes back up before I'm aware. Then I just know I have to manually reconcile the data and remove the lock file when it's safe for syncing to resume. I'm guessing there are lock files created in the VM / container files (I know ESXi did this; I haven't looked at what Proxmox uses), but in this case I'd want a file that persists even if I shut the VM back down, until I manually remove it.
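
Something like this is what I have in mind (completely untested sketch, every name in it is made up):

Code:
#!/bin/bash
# Hypothetical sync guard: refuse to replicate while a failover lock exists.
LOCK=/var/lib/failover-locks/vm-100.lock
DS=rpool/data/vm-100-disk-0

if [ -e "$LOCK" ]; then
    echo "VM 100 was started here during a failover; not syncing until $LOCK is removed" >&2
    exit 1
fi

SNAP="sync-$(date +%Y%m%d%H%M)"
zfs snapshot "$DS@$SNAP"
zfs send -i "$DS@last-sync" "$DS@$SNAP" | ssh offsite zfs recv -F "$DS"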

But thinking about this even more, my use case doesn't seem that crazily unique. I'm surprised someone hasn't already written a plugin or something to handle this lol. Maybe they have and I just need to google better...
 
I think if I write a script to do it, I would just figure out a way to create a lock file when a VM / container is started, and the zfs syncing would only send a snapshot if that lock file doesn't exist. That way, if my primary is down and I spin up the VM / container on the offsite server, I don't have to worry about a zfs sync overwriting new data on the offsite server if the primary comes back up before I'm aware. Then I just know I have to manually reconcile the data and remove the lock file when it's safe for syncing to resume.
I love ZFS, yet I would not go down that road. Too much work, too many pitfalls to fall into. We just use PBS with our "default" cluster with dedicated shared storage, which requires no work at all after setting up. I cannot stress enough how awesome an "it just works" PVE cluster is.
 
I love ZFS, yet I would not go down that road. Too much work, too many pitfalls to fall into. We just use PBS with our "default" cluster with dedicated shared storage, which requires no work at all after setting up. I cannot stress enough how awesome an "it just works" PVE cluster is.
What do you mean by dedicated shared storage? Shared between the nodes on the cluster? Or shared between PVE and PBS? You also mention a "default" cluster, as in you have more than one cluster, not just more than one node? Because I don't think I can do shared storage between the nodes, if that is what you mean, since they are at different locations. I mean, I "could", but whichever node is remote to the storage is going to have some really bad storage IO, and if the location with the storage goes down, the remote node wouldn't work either with no access to the storage lol.

I agree that an "it just works" cluster would be awesome. But I'm not sure it's going to work in this case with only 2 nodes at separate locations. The quorum issues alone when one of the nodes is unavailable are going to cause headaches. Even if I have an additional voting server, unless I pay to host it at a 3rd location, it won't help achieve quorum if the location where it sits is the one that is down =/

Basically, I have to travel a lot, and I want access to some of my homelab stuff while traveling. So if my internet at home goes down, for example, and I can't do anything on the offsite box because I don't have quorum, having that cluster really isn't doing me any good.

It's entirely possible I have completely misunderstood some things about these clusters and am looking at it wrong though... And I also haven't had time yet to look at PBS. I did see when I looked briefly that it has some kind of remote synchronization feature, maybe that is the key piece I'm missing.
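
(Following up after skimming the PBS docs: the remote sync feature appears to boil down to defining the other PBS as a remote and adding a pull job, roughly like below. All names here are placeholders.)

Code:
# On the offsite PBS: define the primary PBS as a remote, then pull its datastore on a schedule.
proxmox-backup-manager remote create primary-pbs \
    --host primary.example --auth-id 'sync@pbs' --password 'xxx' --fingerprint '<cert-fingerprint>'
proxmox-backup-manager sync-job create pull-primary \
    --remote primary-pbs --remote-store datastore1 --store offsite-ds --schedule hourly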
 
What do you mean by dedicated shared storage? Shared between the nodes on the cluster? Or shared between PVE and PBS?
an "off-the-shelf" SAN, shared by all cluster nodes so with zero disk transmission live migration.

You also mention a "default" cluster, as in you have more than one cluster, not just more than one node?
One cluster with at least 3 nodes per datacenter / site. Everything else will not be fun due to the voting and disk sync issues you described.


So if my internet at home goes down, for example, and I can't do anything on the offsite box because I don't have quorum, having that cluster really isn't doing me any good.
Maybe redundant internet, e.g. via an LTE router (if those are available at your location). It's relatively cheap, and you only have traffic on it if your main line is not working.
 
an "off-the-shelf" SAN, shared by all cluster nodes so with zero disk transmission live migration.


One cluster with at least 3 nodes per datacenter / site. Everything else will not be fun due to the voting and disk sync issues you described.



Maybe redundant internet, e.g. via an LTE router (if those are available at your location). It's relatively cheap, and you only have traffic on it if your main line is not working.

Ah, yeah, this is over my budget for homelab stuff lol. I'm only running one node per "datacenter"; I definitely can't run 3 haha. A SAN is something I've never actually worked with, so I don't know much about the disk transfer side, but based on my limited knowledge, for it to work it would require physical storage servers at each "datacenter" as well.

I dunno what area you are in, but redundant internet is definitely not "relatively cheap" in my area haha. For one thing, there are not a lot of options; there is really only one ISP here. This is actually something I've looked into and considered for my own personal geekiness, because I definitely don't "need" it for my homelab lol. I have cable available here, and very slow DSL as the secondary option for hard lines. The DSL is ridiculously priced for the speeds you get (the cable is also very expensive compared to areas with more options / fiber, etc.). Then there is Starlink, but again, it's very expensive for a "backup" option and not fast enough to be my primary. T-Mobile Home Internet is also an option, which I have explored a bit; it can function as a backup for a not-terrible price. It actually gets pretty awesome download speeds in my area, almost as fast as my cable, but the upload is pretty bad and artificially limited like my cable. The biggest problem with it, and every other cellular service, is CGNAT: I would have to create a complicated tunneled network to an external server, adding the cost of the external server, latency, and speed loss due to tunnel encryption, making it not a great option either.
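
(For context, the tunnel I mean is something like this with WireGuard; the keys, addresses, and hostnames are all invented.)

Code:
# /etc/wireguard/wg0.conf on the box behind CGNAT -- it dials out to a cheap VPS with a public IP.
[Interface]
PrivateKey = <homelab-private-key>
Address = 10.99.0.2/24

[Peer]
# the relay VPS
PublicKey = <vps-public-key>
Endpoint = vps.example:51820
AllowedIPs = 10.99.0.0/24
# keepalives hold the NAT mapping open so the VPS can reach back in
PersistentKeepalive = 25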

I do have a router that supports multi-WAN and has a 5G modem in it as well. But I haven't found any cellular service (5G or LTE) that didn't suffer from these problems, or that didn't have extremely low data caps on top of them. Is there an LTE service you are referencing that I don't know about, maybe?

Again, keep in mind my earlier statement, this is for a homelab. I don't have enterprise space, power, or budget lol :)
 
Again, keep in mind my earlier statement, this is for a homelab. I don't have enterprise space, power, or budget lol :)
Yet you want to have a cluster? Why not keep it as simple as possible while having everything you need? For my home lab, I use systems with RAID and enterprise SSDs, yet beyond that it's just commodity hardware that runs 24/7 at only 11 W. If I need more bang, I start additional machines; they just provide services and start right away. Yes, it would be nice to have one GUI for everything, yet taking on all the problems of a two-node cluster in order to have that... no, not worth the hassle for me. Most of the time I just use the services that run on PVE and don't need to administer them often. And when I need to do some PVE stuff, I just log into the machine. Opening the GUI for one server or the other is just a URL away. Login is handled via OpenID, so I can log in wherever I want with my default credentials.
 
