What is wrong with High Availability?

esi_y

Active Member
Nov 29, 2023
Test scenario: 3 nodes - pve{3,4,5} - default PVE install (LVM, so no ZFS), 1 container set up as an HA resource, started on pve3.
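For reference, the CT was added as an HA resource roughly like this (CLI equivalent of what I clicked in the GUI, from memory, so take the exact flags with a grain of salt):
Code:
# add CT 101 as an HA resource and ask the HA stack to keep it started
ha-manager add ct:101 --state started
# sanity check of what the HA stack thinks
ha-manager status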

Figured out that replication is not possible without ZFS; alright, never mind. Then pve3 went down, HA attempted to restart the CT on pve4, no volume was available there; still understood.

Now when pve3 comes back up, HA does nothing. To some extent this could be tolerated; after all, as far as it is concerned, the last thing it knew was that it had failed to restart that CT, for an unknown reason.

Since the HA migration failed, it is strange that the CT got stuck showing as migrated (and failed) on pve4, rather than stuck on the dead pve3, so manually requesting to "migrate back" ends up in:
Code:
Requesting HA migration for CT 101 to node pve3
service 'ct:101' in error state, must be disabled and fixed first
TASK ERROR: command 'ha-manager migrate ct:101 pve3' failed: exit code 255

Ok, this is getting annoying. Went to HA, disabled the resource, and retried:

Code:
2024-01-02 01:30:33 starting migration of CT 101 to node 'pve3' (10.67.10.203)
2024-01-02 01:30:33 found local volume 'local:101/vm-101-disk-0.raw' (in current VM config)
failed to stat '/var/lib/vz/images/101/vm-101-disk-0.raw'
Use of uninitialized value $format in string eq at /usr/share/perl5/PVE/Storage/Plugin.pm line 1615.
2024-01-02 01:30:34 ERROR: storage migration for 'local:101/vm-101-disk-0.raw' to storage 'local' failed - volume 'local:101/vm-101-disk-0.raw' does not exist
2024-01-02 01:30:34 aborting phase 1 - cleanup resources
2024-01-02 01:30:34 ERROR: found stale volume copy 'local:101/vm-101-disk-0.raw' on node 'pve3'
2024-01-02 01:30:34 start final cleanup
2024-01-02 01:30:34 ERROR: migration aborted (duration 00:00:01): storage migration for 'local:101/vm-101-disk-0.raw' to storage 'local' failed - volume 'local:101/vm-101-disk-0.raw' does not exist
TASK ERROR: migration aborted

Yes, of course the local volume does not exist on pve4; that was the very reason the "migration" from pve3 to pve4 had failed in the first place. And it even goes on to "clean up" the "stale volume" on pve3?

So seriously, what is wrong here? I do understand the ZFS part (actually I do not, as in it should then be supported on BTRFS too, but fair enough, one could use Ceph or a LUN, etc.). What I do not understand is:

1. How can it half-migrate without first checking that the volume will be available on the target?
2. How can it recognise that there was a copy during the manual migration back, yet completely ignore it?

If this happens in production (because replication jobs could also fail in other ways) with hundreds of CTs, what is the mitigation strategy?

EDIT: So apparently a simple mv /etc/pve/nodes/pve4/lxc/101.conf /etc/pve/nodes/pve3/lxc/ could "fix" this one instance, but the question is: why does it not auto-clean up after a botched HA migration by itself? Or better yet, not move on with the migration at all?
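For completeness, the whole manual recovery boiled down to roughly this (a sketch of the CLI steps; the disable/enable is the same thing I did via the GUI):
Code:
# stop the HA stack from retrying while fixing things up
ha-manager set ct:101 --state disabled
# move the config back to the node that still holds the volume
mv /etc/pve/nodes/pve4/lxc/101.conf /etc/pve/nodes/pve3/lxc/
# bring it back under HA control
ha-manager set ct:101 --state started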
 
Last edited:
First, I want to note that the recommended way is to use shared storage for HA, for example Ceph.
Please note that none of the things you describe above can happen in such a (recommended) setup.

EDIT: So apparently a simple mv /etc/pve/nodes/pve4/lxc/101.conf /etc/pve/nodes/pve3/lxc/ could "fix" this one instance, but the question is: why does it not auto-clean up after a botched HA migration by itself? Or better yet, not move on with the migration at all?
It is simply considered too dangerous to auto-clean up, because it involves deleting volumes.
 
First, I want to note that the recommended way is to use shared storage for HA, for example Ceph.

Thank you for the reply. I was not aware there was anything wrong with having e.g. ZFS-replicated datasets present on the potential stand-in nodes.

Please note that none of the things you describe above can happen in such a (recommended) setup.

I did not test that scenario, but I suspect that if the said volume (even on shared storage) is not available at the very moment the HA migration happens, it would not auto-start (of course) and would end up stuck in that state until manual intervention (making an inference here based on what happened above).

It is simply considered too dangerous to auto-clean up, because it involves deleting volumes.

What about a pre-migration check before HA goes on to auto-migrate? As in, checking whether the target node has the volume available before moving the config file?
 
What about a pre-migration check before HA goes on to auto-migrate? As in, checking whether the target node has the volume available before moving the config file?
AFAIK there is already such a check, based on the storage configuration. Of course this fails if you have a wrong storage configuration.
Anyway, I guess this check could be further improved...
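In the meantime, one can verify manually that a target node actually sees the volume, roughly like this (just a sketch, using the storage 'local' and CT 101 from your logs):
Code:
# run on the intended target node
# is the storage active there at all?
pvesm status --storage local
# does it contain any volumes belonging to CT 101?
pvesm list local --vmid 101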
 
Code:
failed to stat '/var/lib/vz/images/101/vm-101-disk-0.raw'
Use of uninitialized value $format in string eq at /usr/share/perl5/PVE/Storage/Plugin.pm line 1615.
Are Perl warnings enabled for any purpose on a standard PVE host install (without debug mode)? Is this a no-subscription thing? I am not being OCD, it just felt confusing in this context for a brief moment.

Code:
2024-01-02 01:30:34 ERROR: storage migration for 'local:101/vm-101-disk-0.raw' to storage 'local' failed - volume 'local:101/vm-101-disk-0.raw' does not exist
2024-01-02 01:30:34 aborting phase 1 - cleanup resources
What was it cleaning?

Code:
2024-01-02 01:30:34 ERROR: found stale volume copy 'local:101/vm-101-disk-0.raw' on node 'pve3'
I just wonder, since it already checks for this, why not just restart it there from that "stale" copy? How does it determine it is stale when it has no other reference (there is no more recent copy available at all in this case)?

Code:
2024-01-02 01:30:34 start final cleanup
I misunderstood here, I thought it was going to clean up the "stale" volume. I still do not know what "final cleanup" means.



The last thing I found a bit strange: why even allow setting HA on anything that does not run on shared volumes or ZFS?
 
Last edited:
Of course this fails if you have a wrong storage configuration.

So I understand that anything other than shared storage for HA is discouraged, but is it really "wrong" to have a ZFS replica and expect to spin the guest up there? I mean, what other reason is there to have replicas if not HA? If one wanted just a backup, they would take backups, not replicas.
 
When you have working ZFS replication, you can also use it with HA. The only downside compared to shared storage is that the replication is async, with the shortest interval being one minute. So in the case of a node failing and the HA guest being started on the other node, you might have some data loss, as the disk image only contains the data up to the last successful replication run.

Another use-case is faster live migrations between nodes, as only the last changes in the disk image need to be transferred.
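Setting up such a replication job can be done in the GUI or on the CLI, for example (a rough sketch; the job ID, target node and schedule are just examples):
Code:
# replicate CT 101 to pve4 every 15 minutes
pvesr create-local-job 101-0 pve4 --schedule "*/15"
# check when the jobs last ran and whether they succeeded
pvesr status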
 
Last edited:
I just want to make it clear that I was doing all of this as a test, in the course of which I realised that replication would not be possible because there is no ZFS (which I was aware of, I simply installed the nodes wrong), and then I let it fail anyway to see how it would behave. But it is not great, because this test scenario can also occur in otherwise legitimate setups, e.g. when replication fails. The other thing I wonder about now: in the replication job I have to manually set the target, so it will only go to one other node? I suppose I can set it up multiple times, but it is a bummer, because if someone wanted to do that but forgot, HA does not (?) have any way of knowing where it might migrate to.

And the checks in this scenario did not happen; it went on to move the config file even though there clearly was no volume present. Suppose I have the volume on a LUN and it is inaccessible only on the one node picked for auto-migration. Shouldn't it try elsewhere, or at least abandon the migration before botching the originating node (the config relating to it in the respective cluster filesystem path)?
 
What I wanted to make clear (also for other people who might read this thread) is that ZFS + Replication is a valid use case for HA if one can live with the downside of async replication.

The main requirement for HA to work is that all resources a VM needs are present on the node it is recovered to. Mainly the disk images, but it could also include passed through PCI or USB devices.
And yes, the replication needs to be configured manually and can be done multiple times to include more nodes. Would I use it in a 5-node cluster? Probably not, but in smaller 2 or maybe even 3-node clusters it can be useful to avoid relying on some external storage for shared storage.

If you have a setup where the cluster nodes are different, for example one node does not have the needed storage configured, you could limit the nodes on which a guest is allowed to run via HA groups. A selection of nodes and the restricted option are what you would need.
Alternatively, the "Max Relocate" option could be increased. If you do so, the HA stack will move the guest to a different node if it cannot start it on the first.
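On the CLI, both options could look roughly like this (a sketch; the group name and the values are only examples):
Code:
# restrict CT 101 to the nodes that actually have its storage
ha-manager groupadd has-storage --nodes "pve3,pve4" --restricted 1
ha-manager set ct:101 --group has-storage
# and/or allow up to two relocation attempts to other nodes
ha-manager set ct:101 --max_relocate 2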

Regarding the checks, feel free to open a bug report in our bugtracker. This way we can keep track of it and fix it.
 
What I wanted to make clear (also for other people who might read this thread) is that ZFS + Replication is a valid use case for HA if one can live with the downside of async replication.
That's what I thought: as long as the volume is available there, it makes no difference to the HA stack anyhow. There are lots of use cases where even an hourly replica is enough, sometimes even a weekly one, if the config is not changing and logs are shipped out anyhow.

The main requirement for HA to work is that all resources a VM needs are present on the node it is recovered to. Mainly the disk images, but it could also include passed through PCI or USB devices.
And yes, the replication needs to be configured manually and can be done multiple times to include more nodes. Would I use it in a 5-node cluster? Probably not, but in smaller 2 or maybe even 3-node clusters it can be useful to avoid relying on some external storage for shared storage.

All clear. I basically see it as a better alternative to e.g. Ceph in a 3-node cluster with a slow network.

If you have a setup where the cluster nodes are different, for example one node does not have the needed storage configured, you could limit the nodes on which a guest is allowed to run via HA groups. A selection of nodes and the restricted option are what you would need.

Thank you for this!

Alternatively, the "Max Relocate" option could be increased. If you do so, the HA stack will move the guest to a different node if it cannot start it on the first.

Is this round robin (by node number?) or random when it comes to picking the next one?

Regarding the checks, feel free to open a bug report in our bugtracker. This way we can keep track of it and fix it.

I will. I think it should check before moving the config file; at least that way, if the dead node comes back up, the CT can be started there.

Is there anything one can do for HA to migrate it back? As in, the preferred node for this guest is pve3, it went down so the guest migrated to pve4, but now that we notice pve3 is up again, we move it back to pve3.

One last thing: is there nothing within HA that would "rebalance" the guests based on some other metric? For instance, one node went down, all its guests migrated elsewhere, and some other node is now experiencing excessive load as a result.
 
Last edited:
