Do I really need to cluster? (2-node DR solution options)

superj

New Member
Jul 26, 2024
Hi, I'm looking into using Proxmox to host a single Windows VM, ideally with 2 nodes. This is primarily for disaster recovery if the primary node fails.
I've got two Dell R350 servers, each with two 1 TB drives, and Proxmox installed on each server's onboard SD card.
Unfortunately (for ZFS) the servers were provided with the H755 PERC; I've set the disks to non-RAID and created a ZFS pool on each server.
This is more of a disaster recovery scenario than high availability (2-6 hours of downtime is acceptable if the primary node fails).
Initially I was going to cluster 2 nodes but NOT enable HA, but thought maybe I should avoid clustering all together.
I'm concerned about 2-node quorum issues that might only become apparent when running on the backup node.
I'm wondering if it's feasible to set up some automatic replication to the backup node without clustering, and have a solid, simple VM restore process (even if it's manual).
The DR SOP would be to restore the VM onto the backup node manually. I'm just not sure what mechanism I should use to create a replica/backup of the VM (ideally at 15-minute or shorter intervals).
  • Could cluster and use replication on the ZFS pool (but may run into Quorum issues if not executed carefully). No HA necessary.
  • Setup some sort of storage share from the backup node
    • Was wondering about running Proxmox VE, and BU Server (in a VM?) on the same 2nd (DR) node?
Any advice would be appreciated.
 
Was wondering about running Proxmox VE, and BU Server (in a VM?) on the same 2nd (DR) node?
Just install the second machine with PVE, standalone. Then add PBS "parallel" on the same system. (There are pros and cons to this. It is the simplest approach, without VM overhead and with direct access to the storage devices.)

Then run backups every few hours (depending on your needs) from the primary machine to this PBS.

When the primary machine dies you can use the secondary one to restore from the PBS and run these VMs.

Of course you really need to test this scenario with the primary system shut down: you might stumble over details like "the network bridges need to be equivalent" for an easy restore of a VM, and for its services to be accessible via the same IP/DNS entry. You do not have a cluster helping to keep things consistent...


To be honest: I am not really sure if this is a good idea; for me it would be too weak. It is an extremely basic approach without a cluster.

And: in the above construct there are two machines running 24/7 - for backups at short intervals. In my world I would actually cluster them, add a cheap quorum device, and enjoy functions like replication, live migration, possibly High Availability, a chance to run kernel upgrades without downtime, and so on.


Good luck :)
 
You can run ZFS replication without clustering the nodes together, though the PERC RAID is a strong argument against it; it would technically still work. Get rid of the PERC RAID and this is the way to go. It's officially integrated and works well.

@UdoB 's PBS solution also sounds nice. I would also recommend automatically restoring the VM continuously, to have a continuous test and low restore times (the VM is already there).
 
I'm concerned about 2-node quorum issues that might only become apparent when running on the backup node.

With this hardware, I would actually cluster them, but I would not use the HA stack at all. With a Qdevice (which can be remote to the nodes; there's no need for low latency/jitter) you would not have quorum issues. But what issues do you worry about? Since it's not HA, losing quorum is not a disaster, or is it? Also, it is possible to run corosync in a special two-node-only configuration (but a Qdevice is always better). The corosync option is two_node [1], but consider what it means (it implies wait_for_all). Another option for a master/slave-like setup is auto_tie_breaker.

[1] https://manpages.debian.org/unstable/corosync/votequorum.5.en.html
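For reference, the two_node option goes into the quorum section of /etc/corosync/corosync.conf. A minimal sketch of just that section (not a complete corosync.conf):

```
quorum {
  provider: corosync_votequorum
  two_node: 1
  # two_node implies wait_for_all: after a cold start the cluster only
  # becomes quorate once BOTH nodes have been seen at least once.
  # After that, one node alone stays quorate if the other goes down.
}
```

The wait_for_all implication is the detail to "consider what it means": a simultaneous power loss of both nodes means neither comes up quorate until both are back.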
 
But what issues do you worry about? Since it's not HA, losing quorum is not a disaster, or is it?
I would like to mention that you cannot administer the remaining node if the only other node is down. A two-node cluster is... problematic - you are outside of normal/expected/recommended operating parameters. A quorum device helps a lot.

Yes, there are workarounds (pvecm expected ...), but this is more for disaster recovery than for everyday usage.
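For completeness, the workaround looks like this. Run it on the surviving node only, and only once you are sure the other node is really down (not just unreachable):

```
# tell votequorum to expect only one vote, making this single node quorate
pvecm expected 1

# verify quorum state and vote counts
pvecm status
```

This lets you administer the node and start VMs again, but it defeats the protection quorum provides - hence "disaster recovery, not everyday usage".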

Again... just my 2€¢ :)
 
Yes, there are workarounds (pvecm expected ...), but this is more for disaster recovery than for everyday usage.

But that's what the OP stated this setup was for! :-D Also, two_node: 1 actually sets the quorum to 1 (but read the docs... as always). I just do not see any problems with it as long as it is not meant to be used for HA.
 
Just run two separate nodes and make regular backups with PBS (very fast on running VMs), which you would do anyway if you care about the VMs. Every weekend, restore all VMs from backup on the other node, shut down the current Proxmox node, and start the VMs on the other one - all of which, I think, can be automated with a Bash script. That way you test your backups, your DR procedure, and the hardware.
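A rough sketch of what such a drill script might look like for one VM, to be run on the standby node. The VM ID, the PBS storage name (pbs-store) and the target storage (local-zfs) are placeholders for your own setup, and the "newest backup" selection is simplified:

```
#!/bin/bash
set -e

VMID=100                       # placeholder VM ID
PBS_STORAGE=pbs-store          # placeholder PBS storage entry in PVE
TARGET_STORAGE=local-zfs       # placeholder target storage on this node

# pick a backup volume for this VM from the PBS storage
# (last line of the listing; refine the selection for real use)
BACKUP=$(pvesm list "$PBS_STORAGE" --vmid "$VMID" | tail -n 1 | awk '{print $1}')

# remove a previous test restore, if any, then restore and boot
qm stop "$VMID" 2>/dev/null || true
qmrestore "$BACKUP" "$VMID" --force 1 --storage "$TARGET_STORAGE"
qm start "$VMID"
```

Fencing the restored VM off from production (or shutting the primary down first, as suggested above) is essential, otherwise the test copy answers on the production IP.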
 
Thanks for the tips, everybody (sorry I was MIA for a day or two). I wasn't sure if I'd be able to start a VM on the 2nd node if quorum is lost in a 2-node cluster (even without HA). I'll definitely have to run some test scenarios so the restoration procedure is certain.
 
You can run ZFS replication without clustering the nodes together, though the PERC RAID is a strong argument against it; it would technically still work. Get rid of the PERC RAID and this is the way to go. It's officially integrated and works well.

@UdoB 's PBS solution also sounds nice. I would also recommend automatically restoring the VM continuously, to have a continuous test and low restore times (the VM is already there).
If I set the disks to non-RAID, is it really an issue to run ZFS? They're passing through the PERC, but it's not abstracting the disk layer anymore.
 
If I set the disks to non-RAID, is it really an issue to run ZFS?
One fundamental test is to run "smartctl -i" and "smartctl -A". If you see the information of the actual physical drive, you're okay. While this setup would still not be recommended, I would probably accept it.
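For example (the device name /dev/sda is a placeholder; a drive hidden behind an abstracting RAID volume would instead report the controller's virtual disk, or no SMART data at all):

```
smartctl -i /dev/sda    # should show the real vendor, model and serial of the disk
smartctl -A /dev/sda    # should show the drive's actual SMART attribute table
```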

Just my two cents...
 
Thanks for the tips, everybody (sorry I was MIA for a day or two). I wasn't sure if I'd be able to start a VM on the 2nd node if quorum is lost in a 2-node cluster (even without HA). I'll definitely have to run some test scenarios so the restoration procedure is certain.
You really do not want to run a 2-node cluster; look into a Qdevice - even if it's a Raspberry Pi or a VM.
 
One fundamental test is to run "smartctl -i" and "smartctl -A". If you see the information of the actual physical drive, you're okay. While this setup would still not be recommended, I would probably accept it.
Exactly this. Technically it will work, yet you still have the risks concerning the controller. Maybe it can be flashed to IT mode instead of RAID mode. That would also work. This heavily depends on the hardware, and I've only done it with LSI SAS2008-based RAID controllers - you technically need "just" a SAS HBA.
 