High Availability + Replication = Disaster

Hi, I've configured HA and replication, but when I actually reboot a server, the failover fails stating that the volume already exists at the target node.

Some helpful info:
  • I have 8x nodes
  • my goal is to have replication run every 15 minutes
  • when an event causes a failover, either send a quick delta to get the latest state, or just boot the copy that's up to 15 minutes old
  • I have each VM set to fail over and replicate to 2x other nodes (just in case)
  • I have HA groups configured with priorities
  • I separately have replications configured from Source to Destination A and Destination B (a rough CLI equivalent is sketched below)
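For reference, the same kind of jobs could also be created with the pvesr CLI instead of the GUI. A minimal sketch, with made-up job IDs and node names (VM 110 replicating to two other nodes every 15 minutes):

  # replicate VM 110 to two different target nodes, every 15 minutes
  pvesr create-local-job 110-0 nodeA --schedule '*/15'
  pvesr create-local-job 110-1 nodeB --schedule '*/15'

  # check job state and last sync result for the guest
  pvesr status --guest 110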
 
@Ashkaan Hassan
1. What's the status of replication?
2. Are you using traditional storage replication or pve-zsync?
3. Maybe you can delete your replication tasks and remove the dangling vmdisk from the other nodes and do another clean replication
 
Wouldn't it be easier to create shared storage with Ceph on the 8 nodes instead of asynchronously replicating the VM disks?
Or is that a non-local cluster with latencies too high to allow Ceph?
Ceph is too slow for our needs.
 
@Ashkaan Hassan
1. What's the status of replication?
2. Are you using traditional storage replication or pve-zsync?
3. Maybe you can delete your replication tasks and remove the dangling vmdisk from the other nodes and do another clean replication
  1. Error. "2024-03-11 08:54:04 110-0: volume 'rpool/data/vm-110-disk-0' already exists"
  2. I just went to Replication under Datacenter and configured each one there.
  3. Then replication is happy again, but the next time I need to fail over, it errors out again (see the check sketched after this list).
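The check I mean (a sketch; the VM ID and dataset name are taken from the error above):

  # on the node the VM fails over to, the old copy of the disk is still present:
  zfs list -r rpool/data | grep vm-110

  # and the replication job status can be checked from the source node:
  pvesr status --guest 110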
 
I'm still having this issue. Can anyone help with replication not working (as expected)?

Again, the goal is just to have replication jobs running and successful so that when we reboot a server, the VM migrates QUICKLY without needing to send a full replication at that moment. This is possible with VMware and Hyper-V. It's got to be possible here.
 
Am I the only person that needs to do updates to Proxmox hosts?

How are you guys managing these things?
 
I'll be happy to pay a $100 bounty to the first person that solves this issue for me. Please DM me or post here. I'm happy to run a Zoom meeting so you can see clearly where I'm having trouble with this platform.
 
  1. Error. "2024-03-11 08:54:04 110-0: volume 'rpool/data/vm-110-disk-0' already exists"
  2. I just went to Replication under Datacenter and configured each one there.
  3. Then replication is happy again, but the next time I need to fail over, it errors out again.

The whole HA and ZFS replication combination is a bit of a joke. All it took for me was one response [1] from the PVE project lead and I got the message that it is some sort of leftover and no one pays attention to how well it works. You may also look at the reply I got on the Bugzilla ticket I filed. Sooner or later you will run into issues where you have to intervene manually; I made just one test and ran into it. It certainly does not scale well to 8 hosts.

[1] https://forum.proxmox.com/threads/what-is-wrong-with-high-availability.139056/#post-620923
 
I see. Thank you for this. I guess it's time to move back to VMware.

I suppose you are attempting to get some sort of reaction from PVE staff. Sometimes (during EU business hours) you get it here, but you will soon notice there are some setups they prefer to support over others (e.g. shared storage instead of replicated ZFS).

If another solution works better for you, and without meaning this as a snide remark, sure, go for it. I would argue there are other options besides Broadcom's and PVE; with 8 nodes, you have a good choice. As for PVE, the HA implementation is far from perfect: the scheduler, for example, is very primitive.
 
... the failover fails stating that the volume already exists at the target node.
For my setups (mostly dual-node clusters for now), this means there was an older VM with the same ID (e.g. ID 100) that already had replication set up.

But when that older VM was deleted, the replicated disk on "the other node" wasn't cleaned up for some reason. That leftover volume on the other node can block replication for a new VM (with the same ID) unless it is cleared out first.

If you want to do this in an automated way, then the pvesr CLI tool is what you want. That's used for setting up replication jobs, and for removing them.

You can also just directly run the appropriate zfs destroy command on the target node to clear out old VMs too. Both ways will work.

Anyway, once you've cleared out any leftover volumes (for the same VM ID) on the target node, your replication setup should work ok. In theory. :)
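To make that concrete, a rough sketch of the cleanup (job ID, VM ID and dataset name are examples matching the error earlier in the thread; double-check with zfs list before destroying anything):

  # 1. remove the replication job for the affected guest (on the source node)
  pvesr delete 110-0

  # 2. on the target node, destroy the leftover volume (and its snapshots) that blocks replication
  zfs list -r rpool/data | grep vm-110
  zfs destroy -r rpool/data/vm-110-disk-0

  # 3. re-create the job and trigger it; the first run will be a full sync
  pvesr create-local-job 110-0 targetnode --schedule '*/15'
  pvesr schedule-now 110-0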
 
pvesr cli tool is what you want. That's used for setting up replication jobs, and for removing them.

You can also just directly run the appropriate zfs destroy command on the target node to clear out old VMs too. Both ways will work.

I just noticed the OP was having yet another issue, which is far worse:
https://forum.proxmox.com/threads/high-availability-with-local-zfs-storage.122922/#post-684207

Also, with 8 nodes I don't want to imagine manually managing 3-way replication setups for potentially hundreds of VMs that way ...
 
Ouch. Something is causing from-scratch replication (i.e. complete volumes) to occur instead of deltas. I've not seen that before.

Personally I'd definitely double-check that this is indeed what's happening (just in case), because it seems really unusual.

If that is indeed what's going wrong, then that's a bad bug which will need tracking down and fixing.
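One way to double-check (a sketch; the VM ID is an example, and the __replicate_* naming is what PVE storage replication uses for its snapshots): if incremental sends are working, the source and target nodes should share a recent replication snapshot for the volume.

  # run on both the source and the target node and compare the names:
  zfs list -t snapshot -o name rpool/data/vm-110-disk-0 | grep __replicate_

If the target never has a matching __replicate_* snapshot, every run will fall back to a full send.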
 
IMHO, if distributed storage (Ceph or DRBD) is too slow for the application, then the application has to do the replication itself.
E.g. run a Galera cluster on local storage. It does not matter if a node (and its VM) goes down, because the other members of the Galera cluster run on other Proxmox nodes.
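For example, a minimal MariaDB Galera config for three VMs (one per Proxmox node) might look roughly like this; the addresses, names, and file/provider paths are placeholders and vary by distribution:

  # /etc/mysql/mariadb.conf.d/99-galera.cnf  (one per VM, adjust the node address)
  [galera]
  wsrep_on                 = ON
  wsrep_provider           = /usr/lib/galera/libgalera_smm.so
  wsrep_cluster_name       = app-cluster
  wsrep_cluster_address    = gcomm://10.0.0.11,10.0.0.12,10.0.0.13
  wsrep_node_address       = 10.0.0.11
  binlog_format            = ROW
  default_storage_engine   = InnoDB
  innodb_autoinc_lock_mode = 2
  bind-address             = 0.0.0.0

Bootstrap the first VM with galera_new_cluster and start MariaDB normally on the other two; if a Proxmox node (and its VM) goes down, the remaining members keep serving and the rejoining node resyncs itself.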
 
I think I just have a tough use case.
Can you describe that in more detail? It's uncommon for Ceph NOT to be the way to go with that many nodes; it usually just means the hardware isn't at the required level. ZFS is a great filesystem, but not for a cluster, and HA with ZFS is IMHO not real HA, due to all the problems you're describing. Dedicated or distributed shared storage is the only way to have a fast and easily maintainable cluster.
 
