Container ended up on the wrong node and now I cannot move it from there

danielo515

Member
Nov 20, 2020
Hello.
I have an HA cluster with three nodes.
Two of those nodes each have a ZFS pool, and I have replication rules set up between those two for HA.
The third node is older and does not have ZFS set up, so I don't use it as a replication target.
Today one of the ZFS nodes died, and one of the containers somehow ended up on the third node.
Now I cannot migrate this CT to the correct remaining node.

When I try to migrate the CT to the correct node I get this error:


Code:
Replication Log

2022-03-21 17:53:01 105-0: start replication job
2022-03-21 17:53:01 105-0: guest => CT 105, running => 0
2022-03-21 17:53:01 105-0: volumes => rpool:subvol-105-disk-0
2022-03-21 17:53:01 105-0: create snapshot '__replicate_105-0_1647881581__' on rpool:subvol-105-disk-0
2022-03-21 17:53:01 105-0: end replication job with error: zfs error: For the delegated permission list, run: zfs allow|unallow


Obviously the third node doesn't have the rpool storage, so I don't know why Proxmox decided to migrate that CT there. How can I start that container on the remaining node? The appropriate disk already exists on its ZFS pool, yet I can't migrate the CT to it.
 
Here is a screenshot of my current cluster state and the replication jobs. As you can see, replication only runs between nodes proxmox-2 and proxmox-3, which are the ones with ZFS storage.

screenshot-2022-03-21 at 18-26-21.jpg
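For reference, per-guest replication jobs like the ones in the screenshot are stored in /etc/pve/replication.cfg. A two-node setup such as this one would look roughly like the sketch below (the job ID and schedule are illustrative, not copied from my cluster):

Code:
local: 105-0
        target proxmox-3
        schedule */15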
 
By default HA will use any node as a target. Under "Datacenter --> HA --> Groups" you can define a group of nodes consisting of NodeX and NodeY. HA will then ignore the presence of NodeA, and migration will only happen between those specifically named nodes.
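Such a group ends up in /etc/pve/ha/groups.cfg. A sketch for this cluster might look like the following (the group name is made up; optionally a priority can be appended to each node as node:priority):

Code:
group: zfs-nodes
        nodes proxmox-2,proxmox-3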
 
I did that already, and the CT still ends up on the wrong node. Take a look at these screenshots:
1668926413271.png
1668926423793.png
 
Hi,

while HA should prefer the nodes in the group, to make sure that the other node is never used, you should set restricted to Yes.
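In terms of /etc/pve/ha/groups.cfg, that corresponds to adding the restricted flag to the group entry (sketch; the group name is made up):

Code:
group: zfs-nodes
        nodes proxmox-2,proxmox-3
        restricted 1

With restricted set, services in the group can only run on the listed nodes; if all of them are down, the service stays stopped rather than failing over to an unlisted node.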
Ok, thank you for your answer, I'm going to try that out. But then I don't know what the purpose of the HA group is, because in my case there were two group nodes available and it still chose the only node that was not in the group.
 
The HA manager should only choose non-members of a non-restricted group if all group members are offline, so that is strange. Can you share the output of journalctl --since <date before the issue happened> -u pve-ha-crm.service on the node where the CRM was master at the time (I guess you have to check all nodes to find out)?
 
