Replication + Heartbeat / HA

MH_MUC

Active Member
May 24, 2019
Hi everyone.
I have a "standard" Proxmox 6 server with zfs in default config (rpool with system + data on it)
I am afraid of hardware failure resulting in a long downtime. So I would like to run a second node with storage replication and heartbeat-configuration that would take over in case of failure.

I have some questions that I wasn't able to resolve myself.
1) Is replication possible with my current rpool ZFS setup, or do I have to reinstall node 1? (I guess it should work, as it is VM/CT based and not on the storage level, so the disks will just be saved in the target node's storage.)
2) I read about quorum. I am wondering how high the risk would be if I ran the cluster in a two-node config. I am not using shared storage; the idea would be to sync the VM/CT storage to the second node by replication so that the nodes can run independently. If I understood it correctly, the issue with a two-node config is that there is a risk of both nodes trying to bring the resource online, with the risk of running the same VM twice. Can I solve this with a hardware watchdog?

Thank you very much for your help.
 
2) I read about quorum. I am wondering how high the risk would be if I ran the cluster in a two-node config. I am not using shared storage; the idea would be to sync the VM/CT storage to the second node by replication so that the nodes can run independently. If I understood it correctly, the issue with a two-node config is that there is a risk of both nodes trying to bring the resource online, with the risk of running the same VM twice. Can I solve this with a hardware watchdog?

That is already solved by our HA stack, which uses a watchdog to fence a failed node.
Your actual problem is that you only have two nodes: if one loses the connection to the other, it cannot tell whether the other one failed and it is OK to continue, or whether there's a network outage or something completely different.

To ensure that, a node needs quorum, that is, more than 50% of the votes. With two nodes and one down, there's only one vote left, which is exactly 50%, but not more than 50%.

I'd suggest either adding a third node or setting up an external voter on some third device (can be an already running server, a VM outside the cluster, a Raspberry Pi, or the like); check our documentation for details:
https://pve.proxmox.com/pve-docs/chapter-pvecm.html#_corosync_external_vote_support
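
For reference, the basic steps from that chapter look roughly like this (the address 10.0.0.50 is just a placeholder for your third device):

    # On the external third device (e.g. a Debian-based server or Raspberry Pi):
    apt install corosync-qnetd

    # On all Proxmox VE cluster nodes:
    apt install corosync-qdevice

    # On one cluster node, register the QDevice:
    pvecm qdevice setup 10.0.0.50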
 
Thank you very much for the quick reply. This looks like an efficient solution.
Is my expectation concerning #1 correct?
 
Yes. While it can make sense to separate the PVE OS boot disk and the VM/CT data onto different storages (decoupling), it is certainly not a must; as long as the VMs/CTs have their disks on a ZFS pool, replication to another host with a ZFS pool can work.
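
As a rough sketch, once both nodes are in a cluster, such a replication job can be set up in the web interface or with pvesr; the guest ID 100, the target node name nodeB, and the 15-minute schedule below are example values:

    # Create replication job 100-0: sync guest 100 to nodeB every 15 minutes:
    pvesr create-local-job 100-0 nodeB --schedule "*/15"
    # Check the state of all configured replication jobs:
    pvesr status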
 
Hi Thomas. Thank you for helping me with this project.

Yes. While it can make sense to separate the PVE OS boot disk and the VM/CT data onto different storages (decoupling), it is certainly not a must; as long as the VMs/CTs have their disks on a ZFS pool, replication to another host with a ZFS pool can work.
I think this should be the default setup for the Proxmox installer.

Right now I just have two nodes and a hard time finding a third one in my setup. I guess a quorum device outside my datacenter wouldn't make sense because of the latency. So for now I will run a two-node setup.
I have a follow-up question that I couldn't quite figure out from the docs.

If I run storage replication only (without HA), there is no live migration, because there is no shared storage. So in case node A fails, I would have to bring up the VMs/CTs manually on node B after noticing. So far so good.
So what happens when node A is fixed and brought online again? Will I end up in a split-brain situation, or will node A try to establish a connection with node B in the cluster again and find that the VMs were manually transferred according to the manual (https://pve.proxmox.com/wiki/Storage_Replication, section "Migrating a guest in case of error")?

Another question, related to this and to shared storage:
If I run the cluster with shared storage: isn't the shared storage another single point of failure, the very thing I am trying to eliminate? If the storage server fails, all servers in the cluster are down.

Thank you very much for your help!
 
I think this should be the default setup for the Proxmox installer.
You can already do that now: just select a separate OS boot disk in the installer and create the VM/CT ZFS (or whatever) storage afterwards over the web interface.
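
For example, a separate data pool can also be created and registered from the shell; the pool name tank and the device paths below are placeholders:

    # Create a mirrored ZFS pool on two spare disks (example devices):
    zpool create tank mirror /dev/sdb /dev/sdc
    # Register it as a Proxmox VE storage for VM and CT disks:
    pvesm add zfspool tank-data --pool tank --content images,rootdir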

Right now I just have two nodes and a hard time finding a third one in my setup. I guess a quorum device outside my datacenter wouldn't make sense because of the latency. So for now I will run a two-node setup.
I have a follow-up question that I couldn't quite figure out from the docs.
Quorum devices can easily cope with 100 ms+ latencies; the quorum device sits outside the general cluster communication and by default is only polled on partition changes (a node going offline/online) and every 20 seconds.

If I run storage replication only (without HA), there is no live migration, because there is no shared storage. So in case node A fails, I would have to bring up the VMs/CTs manually on node B after noticing. So far so good.
There is live migration also for VMs with local storage; the disk is then live-migrated to the other node as well.
If you use Proxmox VE replication with ZFS, then only the delta since the last replication is synced live.
Naturally, that works only as long as both nodes are online and working.
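
For illustration, such a migration can be triggered with qm; the VM ID 100 and node name nodeB are example values:

    # Live-migrate VM 100 to nodeB, moving its local disks along:
    qm migrate 100 nodeB --online --with-local-disks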

So what happens when node A is fixed and brought online again? Will I end up in a split-brain situation, or will node A try to establish a connection with node B in the cluster again and find that the VMs were manually transferred according to the manual (https://pve.proxmox.com/wiki/Storage_Replication, section "Migrating a guest in case of error")?
Split brain can only happen if you manually set a node quorate and it then alters resources still belonging to the dead node, because when that node comes up again the same resource has been altered on both sides. If there's an outage and you have confirmed that the other node is not online, with no VMs still running and possibly writing data there, you can normally move the guests over and you will be fine. If the dead node comes up again, it will try to update its state and will use the newer state if there are conflicts (so a working NTP/time sync is always good to have).
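
The manual move from the linked wiki section boils down to roughly the following, assuming the dead node is nodeA, the surviving node is nodeB, and the guest is VM 100 (all example names). Only do this after confirming that nodeA is really off:

    # On nodeB: make the single remaining node quorate again:
    pvecm expected 1
    # Move the guest's config from the dead node to the surviving one:
    mv /etc/pve/nodes/nodeA/qemu-server/100.conf /etc/pve/nodes/nodeB/qemu-server/
    # Start the guest on nodeB:
    qm start 100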

If I run the cluster with shared storage: isn't the shared storage another single point of failure, the very thing I am trying to eliminate? If the storage server fails, all servers in the cluster are down.
Depends: if it's a single NAS/NFS box, then yes, you are at the mercy of that box's redundancy and of the connection between that box and your PVE servers.
But if you set up a three-node PVE+Ceph cluster, then you have no single point of failure any more. Ceph can have multiple OSDs (disks) per node, so some of those can fail, and a whole node can also fail, as two others are left to take over the work (which they still can, as they have quorum).
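
Bootstrapping such a hyper-converged setup looks roughly like this on current PVE; the network and device names are placeholders:

    # On each node: install the Ceph packages:
    pveceph install
    # On the first node: initialize Ceph, ideally on a dedicated network:
    pveceph init --network 10.10.10.0/24
    # On each node: create a monitor and add a disk as an OSD:
    pveceph mon create
    pveceph osd create /dev/sdb
    # Create a replicated pool for VM/CT disks:
    pveceph pool create vmpool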
 
