I have a 3-node cluster running a variety of VMs and LXCs. I noticed this afternoon that my Nextcloud instance was no longer available. I logged into Proxmox and found the task log being spammed with 'Error: migration aborted' messages for three of my containers. One of them happens to be the reverse proxy for my DMZ, hence the inability to access a number of important services, including Nextcloud.
All three of these services had been running merrily on Node 1, which is my most powerful node and where I run all of my important services, but when I checked the list I found that Proxmox is reporting them as present on Node 2 and stopped. If I try to start any of them I get this error, using CT 2002, which runs on a ZFS pool called zfsNVMeVol1, as an example:
Code:
TASK ERROR: zfs error: cannot open 'zfsNVMeVol1/subvol-2002-disk-0': dataset does not exist
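In case it helps, I assume the dataset itself can be checked from a shell on each node with something like this (using my pool name and CT ID):
Code:
# should print the subvolume on whichever node actually has the data
zfs list -r zfsNVMeVol1 | grep subvol-2002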
If I try to migrate any of them to Node 1 or Node 3, I get this error:
Code:
2023-03-21 20:31:21 starting migration of CT 2002 to node 'pve1' (10.212.11.231)
2023-03-21 20:31:21 found local volume 'zfsNVMeVol1:subvol-2002-disk-0' (in current VM config)
2023-03-21 20:31:21 start replication job
2023-03-21 20:31:21 guest => CT 2002, running => 0
2023-03-21 20:31:21 volumes => zfsNVMeVol1:subvol-2002-disk-0
2023-03-21 20:31:21 end replication job with error: zfs error: cannot open 'zfsNVMeVol1/subvol-2002-disk-0': dataset does not exist
2023-03-21 20:31:21 ERROR: zfs error: cannot open 'zfsNVMeVol1/subvol-2002-disk-0': dataset does not exist
2023-03-21 20:31:21 aborting phase 1 - cleanup resources
2023-03-21 20:31:21 start final cleanup
2023-03-21 20:31:21 ERROR: migration aborted (duration 00:00:00): zfs error: cannot open 'zfsNVMeVol1/subvol-2002-disk-0': dataset does not exist
TASK ERROR: migration aborted
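As far as I understand, container configs live in the clustered /etc/pve filesystem under the directory of whichever node currently owns them, so I assume something like this (run from any node) would show where CT 2002's config ended up versus where its data is:
Code:
# configs sit under /etc/pve/nodes/<nodename>/lxc/ in the clustered pmxcfs
ls -l /etc/pve/nodes/*/lxc/2002.conf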
My first thought was that I must have a drive failure somewhere, but my ZFS pools are reporting perfect health, and I can migrate other services on the same pool between all three nodes with no issue; they start and run just fine. Based on the logs, it appears that my cluster decided it wanted to move them to Node 2, screwed them up in the process, and can't figure out what to do now. All three nodes have been running 24/7 with no reboots or other hiccups in the logs that I can find.
I haven't found anything that seems particularly relevant by searching this forum. What could have caused this decision to move these particular LXCs to another node and how can I get them running again, on any node?
Edited to add:
I am running PVE 7.3-6 with an active pve-enterprise subscription.
Edited to add more:
When I check the 'CT Volumes' for zfsNVMeVol1 on Node 2 I don't see anything there, although the 4 messed-up containers show up in the sidebar as being on this node.
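(I believe that GUI view corresponds to pvesm list, so the same check from a shell on Node 2 would be something like this:)
Code:
# run on the node in question; output should match the GUI 'CT Volumes' view
pvesm list zfsNVMeVol1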
When I check the 'CT Volumes' for zfsNVMeVol1 on Node 1, I see all 4 containers listed there, along with all the other containers running on this node. Do I need to manually move something from Node 1 to Node 2 so I can get the containers working and migrate them back to Node 1 where they belong? If so, what?
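Or, given where the config files live, is it actually the other way around: move each container's config file from Node 2's directory back to Node 1's while the container is stopped? Something like the below, assuming Node 2 is named 'pve2' (the migration log shows Node 1 as 'pve1'), though I don't want to try it without confirmation:
Code:
# /etc/pve is cluster-wide, so this can be run from any node while CT 2002 is stopped;
# 'pve2' is my guess at Node 2's name -- adjust to match what is under /etc/pve/nodes/
mv /etc/pve/nodes/pve2/lxc/2002.conf /etc/pve/nodes/pve1/lxc/2002.conf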
Edited a third time to add:
I had daily backups of three of the four containers on Node 1. I chose the one that would be least painful to lose and experimented with it. I found that I could restore it using a different CT ID, and it showed up on Node 1 and functioned correctly. After confirming all the backups worked, I deleted the faulty versions that were showing up on Node 2 and was then able to re-restore those backups using the original CT IDs. However, there are now duplicate entries under Node 1's 'CT Volumes': what I am guessing are the restored backups are listed as 'disk-1', and the old volumes that were left behind when Proxmox decided to shuffle my containers are 'disk-0'.
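I'm assuming the leftover 'disk-0' datasets can be cleaned up once I've confirmed each restored container really references its 'disk-1' volume, roughly like this (CT 2002 as the example again), but I'll hold off until someone confirms:
Code:
# confirm the restored container's rootfs points at disk-1, not disk-0
pct config 2002 | grep rootfs
# only then destroy the orphaned dataset left over from the shuffle
zfs destroy zfsNVMeVol1/subvol-2002-disk-0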
I have one container left which was not backed up, and while it is not critical, the fact that this could happen at all certainly is, so I will wait to hear what the official response is on how to restore it to a functional container.