Test scenario, 3 nodes - pve{3,4,5} - default PVE install (LVM, so no ZFS), 1 container set as HA, started on pve3.
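(For anyone reproducing this, roughly the commands behind that setup, a sketch; the group part is optional and the group name is made up:)
Code:
# register the container as an HA resource and request it started
ha-manager add ct:101 --state started
# optional: prefer pve3 via an HA group
# ha-manager groupadd prefer_pve3 --nodes "pve3:2,pve4:1,pve5:1"
# ha-manager set ct:101 --group prefer_pve3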
Figured out replication is not possible without ZFS, alright, never mind. pve3 went down, HA attempted to restart the CT on pve4, no volume available there, still understood.
Now when pve3 starts up again, HA does nothing; to some extent this could be tolerated, after all, as far as it is concerned, the last thing it knew was that it had failed to restart that CT, for an unknown reason.
Now since the HA migration failed, it is strange that it got stuck showing the CT as failed on the migration target pve4 (and not stuck on the dead pve3), so manually requesting to "migrate back" ends up in:
Code:
Requesting HA migration for CT 101 to node pve3
service 'ct:101' in error state, must be disabled and fixed first
TASK ERROR: command 'ha-manager migrate ct:101 pve3' failed: exit code 255
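(For reference, the CLI equivalent of the disable step the error demands, assuming the resource ID ct:101 from the message above:)
Code:
# take the resource out of error state by disabling it
ha-manager set ct:101 --state disabled
# confirm the new state
ha-manager status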
Ok, this is getting annoying; went to HA, disabled the resource, and retried:
Code:
2024-01-02 01:30:33 starting migration of CT 101 to node 'pve3' (10.67.10.203)
2024-01-02 01:30:33 found local volume 'local:101/vm-101-disk-0.raw' (in current VM config)
failed to stat '/var/lib/vz/images/101/vm-101-disk-0.raw'
Use of uninitialized value $format in string eq at /usr/share/perl5/PVE/Storage/Plugin.pm line 1615.
2024-01-02 01:30:34 ERROR: storage migration for 'local:101/vm-101-disk-0.raw' to storage 'local' failed - volume 'local:101/vm-101-disk-0.raw' does not exist
2024-01-02 01:30:34 aborting phase 1 - cleanup resources
2024-01-02 01:30:34 ERROR: found stale volume copy 'local:101/vm-101-disk-0.raw' on node 'pve3'
2024-01-02 01:30:34 start final cleanup
2024-01-02 01:30:34 ERROR: migration aborted (duration 00:00:01): storage migration for 'local:101/vm-101-disk-0.raw' to storage 'local' failed - volume 'local:101/vm-101-disk-0.raw' does not exist
TASK ERROR: migration aborted
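At this point, a quick sanity check of where the disk actually lives (a sketch; paths taken from the log above, and root SSH between cluster nodes comes with PVE):
Code:
# which disk does the stuck config reference?
grep rootfs /etc/pve/nodes/pve4/lxc/101.conf
# the file exists on pve3 only, not on pve4
ssh pve3 'ls -l /var/lib/vz/images/101/'
ssh pve4 'ls -l /var/lib/vz/images/101/'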
Yes, of course, the local volume does not exist (on pve4), that was the very reason the "migration" from pve3 onto pve4 had failed in the first place, and it even goes on to "clean up" the "stale volume" on pve3?
So seriously, what is wrong here? I do understand the ZFS part (actually I do not, as in replication should then be supported on BTRFS too, but fair enough, one could use Ceph or a LUN, etc.). What I do not understand is:
1. How can it half-migrate without first checking that the volume will be available over at the target?
2. How can it recognise there was a copy on the manual migration back, yet completely ignore it?
If this happens in production (because replication jobs could also fail in other ways) with hundreds of CTs, what's the mitigation strategy?
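I assume the real mitigation is to not put HA resources on node-local storage at all, i.e. use shared storage, be it Ceph, NFS or an iSCSI LUN. A made-up /etc/pve/storage.cfg entry for the NFS case, server and export invented:
Code:
nfs: shared-nfs
        path /mnt/pve/shared-nfs
        server 10.67.10.250
        export /export/pve
        content rootdir,images
With the rootfs on storage visible from every node, HA recovery would not depend on replication jobs at all.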
EDIT: So apparently a simple
mv /etc/pve/nodes/pve4/lxc/101.conf /etc/pve/nodes/pve3/lxc/
could "fix" this one instance, but the question is: why does it not auto-clean up after a botched HA migration by itself? Or better yet, not proceed with the migration at all?