Hi all. Any chance anyone can help a noob here?
I've got a 3-node cluster that I was running failover tests on. No Ceph, so I'm doing HA via ZFS storage replication every 5 minutes.
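For context, each CT has a storage replication job, roughly equivalent to something like this on the CLI (job ID and target are just from my setup for CT 180):
# replicate CT 180 to node3 every 5 minutes
pvesr create-local-job 180-0 node3 --schedule '*/5'
# confirm the job is running
pvesr status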
Yanked power on node1, then realized that in all my tweaking I'd forgotten to add the CTs to an HA group, so I figured I'd set that up from node2 while node1 was offline. I should mention at this point that the problem CT is only replicated to node3.
The CT tried to start on node3, but got:
TASK ERROR: zfs error: cannot open 'local-nvme/subvol-180-disk-0': dataset does not exist
I checked, and it does exist there:
zfs list | grep subvol-180
shows it. But then the CT somehow migrated, successfully, to node2, where the dataset doesn't exist.
Now I can't start it or migrate it, and I'd really love some help as I'm a bit stuck.
The ZFS volume is present on nodes 1 and 3, but I can't get the CT to migrate to either of those. I'd previously done a live migration from the GUI with no issues; this was the first time simulating a node failure.
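To be concrete about what I checked: the CT's config is now owned by node2 in the cluster filesystem, while the dataset only exists on nodes 1 and 3 (commands assume my CT ID of 180 and the default pmxcfs layout):
# which node currently owns the CT config
ls /etc/pve/nodes/*/lxc/180.conf
# on each node: does the replicated dataset exist?
zfs list | grep subvol-180-disk-0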
I'm aware I'll need to create a restricted HA group for this CT so it only uses nodes 1 and 3 in future; I'm just not sure how to get it back up at present. I have a backup, but I'd like to know how to recover from this scenario for future reference.
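For reference, the restricted group I'm planning should look something like this on the CLI (group name is just a placeholder):
# restrict the CT to the nodes that actually receive the replica
ha-manager groupadd ct180-zfs --nodes node1,node3 --restricted 1
# manage the CT under HA within that group
ha-manager add ct:180 --group ct180-zfs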
Edit:
For anyone coming across this in the future, I was able to manually move the config file to node1 and start it again:
On node2:
cd /etc/pve/lxc
nano 180.conf (copy the contents)
mv 180.conf notactanymore.conf
On node1:
cd /etc/pve/lxc
nano 180.conf (paste the contents)
Looks like the permissions got applied automatically, the CT appeared in the UI, and I could start it perfectly fine.
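In hindsight, a one-step variant of the same fix (I haven't re-tested this, and the node names are just my hostnames) would be to move the config directly between the per-node directories, since /etc/pve/lxc is only a symlink to the local node's folder:
# run from any quorate node; this changes which node owns CT 180
mv /etc/pve/nodes/node2/lxc/180.conf /etc/pve/nodes/node1/lxc/180.conf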