Error migrating CT under HA - Proxmox 6.1

POL CRIOLLO

New Member
Feb 4, 2020
Greetings.
I am setting up a three-node cluster with high availability. The cluster itself is working very well, and on each node I created a CT for my tests.
I created an HA group with the three nodes, added one of the CTs as an HA resource, and then tested the failure of the node on which that CT was running.
The CT does get moved to another node; initially it shows up as started, but then the service is stopped and the CT ends up powered off.
I am attaching the log.
################
task started by HA resource agent
Job for pve-container@100.service failed because the control process exited with error code.
See "systemctl status pve-container@100.service" and "journalctl -xe" for details.
TASK ERROR: command 'systemctl start pve-container@100' failed: exit code 1
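
For anyone hitting the same message: the error output itself already points at the next diagnostic step. On the node that tried to start the container, the checks would be (container ID 100 taken from the task output above):

# show why the unit failed to start
systemctl status pve-container@100.service
# full journal context around the failure
journalctl -xe
# or only the messages of this container's unit
journalctl -u pve-container@100.service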

log
//////////

task started by HA resource agent
2020-02-17 18:25:10 starting migration of CT 100 to node 'node2' (192.168.1.38)
2020-02-17 18:25:10 found local volume 'local-zfs:subvol-100-disk-1' (in current VM config)
cannot open 'rpool/data/subvol-100-disk-1': dataset does not exist
usage:
snapshot [-r] [-o property=value] ... <filesystem|volume>@<snap> ...
For the property list, run: zfs set|get
2020-02-17 18:25:10 ERROR: zfs error: For the delegated permission list, run: zfs allow|unallow
2020-02-17 18:25:10 aborting phase 1 - cleanup resources
2020-02-17 18:25:10 ERROR: found stale volume copy 'local-zfs:subvol-100-disk-1' on node 'node2'
2020-02-17 18:25:10 start final cleanup
2020-02-17 18:25:10 ERROR: migration aborted (duration 00:00:01): zfs error: For the delegated permission list, run: zfs allow|unallow
TASK ERROR: migration aborted
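
As a side note on the log above: it complains both that 'rpool/data/subvol-100-disk-1' does not exist and that a stale copy of 'local-zfs:subvol-100-disk-1' sits on node 'node2', so before retrying it is worth comparing the datasets on both nodes. A quick, read-only check (dataset name taken from the log) would be:

# run on the source node and on node2: list container subvolumes in the local pool
zfs list -r rpool/data
# show that specific dataset plus any snapshots it may have
zfs list -t all -r rpool/data/subvol-100-disk-1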

Paul Criollo
 

It seems you are using local ZFS as backing storage for those containers.
Did you also set up replication? Otherwise this cannot work: the node that recovers the service does not have the storage available, it's local to the dead node.

Either go for shared storage; Ceph can be a good fit for a three-node setup, reliable and redundant.
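
Just to illustrate that path, a hyper-converged Ceph setup on a PVE 6.x cluster roughly comes down to the following per node (the network and disk below are placeholders for your environment, see the pveceph docs for details):

# install the Ceph packages (on every node)
pveceph install
# initialize the Ceph configuration once, on the first node only
pveceph init --network 10.10.10.0/24
# create a monitor (on each of the three nodes)
pveceph mon create
# create an OSD from an empty disk (on each node)
pveceph osd create /dev/sdb
# create a pool and add it as a storage entry in PVE
pveceph pool create ceph-ct --add_storages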

As a workaround you could use storage replication, but that means some progress is lost on failover, as replications happen at a fixed interval, i.e., the data is not shared in real time.
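
A minimal sketch of that workaround, assuming CT 100 should be replicated from its current node to 'node2' every 15 minutes (job ID and schedule are just example values):

# create a replication job for guest 100 towards node2, running every 15 minutes
pvesr create-local-job 100-0 node2 --schedule "*/15"
# check the state of the replication jobs on this node
pvesr status

The same can be done in the GUI under Datacenter -> Replication; after a node failure, HA then recovers the CT from the last replicated state on the target node.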