Proxmox cluster randomly decided to migrate several LXCs and broke them

kocherjj

Member
I have a 3-node cluster running a variety of VMs and LXCs. I noticed this afternoon that my Nextcloud instance was no longer available. I logged into Proxmox and found the task log getting spammed with 'Error: migration aborted' messages for three of my containers. One of them happens to be the reverse proxy for my DMZ, hence my inability to access a number of important services, including Nextcloud.

All three of these services had been running merrily on Node 1, which is my most powerful node and where I run all of my important services, but when I checked the list I found that Proxmox now reports them as present on Node 2 and stopped. If I try to start any of them I get this error (using CT 2002, which runs on a ZFS pool called zfsNVMeVol1, as an example):

Code:
TASK ERROR: zfs error: cannot open 'zfsNVMeVol1/subvol-2002-disk-0': dataset does not exist

If I try to migrate any of them to Node 1 or Node 3, I get this error:


Code:
2023-03-21 20:31:21 starting migration of CT 2002 to node 'pve1' (10.212.11.231)
2023-03-21 20:31:21 found local volume 'zfsNVMeVol1:subvol-2002-disk-0' (in current VM config)
2023-03-21 20:31:21 start replication job
2023-03-21 20:31:21 guest => CT 2002, running => 0
2023-03-21 20:31:21 volumes => zfsNVMeVol1:subvol-2002-disk-0
2023-03-21 20:31:21 end replication job with error: zfs error: cannot open 'zfsNVMeVol1/subvol-2002-disk-0': dataset does not exist
2023-03-21 20:31:21 ERROR: zfs error: cannot open 'zfsNVMeVol1/subvol-2002-disk-0': dataset does not exist
2023-03-21 20:31:21 aborting phase 1 - cleanup resources
2023-03-21 20:31:21 start final cleanup
2023-03-21 20:31:21 ERROR: migration aborted (duration 00:00:00): zfs error: cannot open 'zfsNVMeVol1/subvol-2002-disk-0': dataset does not exist
TASK ERROR: migration aborted

My first thought was that I must have a drive failure somewhere, but my ZFS pools are reporting perfect health, and I can migrate other services on the same pool between all three nodes with no issue; they start and run just fine. Based on the logs, it appears that my cluster decided it wanted to move them to Node 2, screwed them up in the process, and can't figure out what to do now. All three nodes have been running 24/7 with no reboots or other hiccups in the logs that I can find.
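
For reference, pool health can also be double-checked from the shell on each node with standard ZFS commands (pool name taken from the error above):

Code:
# report only pools with problems; prints "all pools are healthy" when everything is fine
zpool status -x
# detailed status of the pool backing the affected containers
zpool status zfsNVMeVol1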

I haven't found anything that seems particularly relevant by searching this forum. What could have caused this decision to move these particular LXCs to another node and how can I get them running again, on any node?

Edited to add:

I am running PVE 7.3-6 with an active pve-enterprise subscription.

Edited to add more:

When I check the 'CT Volumes' for zfsNVMeVol1 on Node 2 I don't see anything there, although the 4 messed-up containers show up in the sidebar as being on this node.
When I check the 'CT Volumes' for zfsNVMeVol1 on Node 1, I see all 4 containers listed there, along with all the other containers running on this node. Do I need to manually move something from Node 1 to Node 2 so I can get the containers working and migrate them back to Node 1 where they belong? If so, what?
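
For reference, the same check can be done from the shell on each node; the subvol dataset should only show up on the node that actually holds the data (CT 2002 used as the example again):

Code:
# run on each node: list the pool's datasets and look for the container's subvol
zfs list -r zfsNVMeVol1 | grep subvol-2002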


Edited a third time to add:

I had daily backups of three of the four containers on Node 1. I chose the one that would be least painful to lose and experimented with it. I found that I could restore it using a different CT ID, and it showed up on Node 1 and functioned correctly. After confirming all backups worked correctly, I deleted the faulty versions that were showing up on Node 2 and was then able to re-restore those backups using the original CT IDs. However, there are now duplicate entries under Node 1 'CT Volumes': what I am guessing are the restored backups are listed as 'disk-1', and the old volumes that were left behind when Proxmox decided to shuffle my containers are 'disk-0'.
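
For reference, the equivalent restore from the CLI looks roughly like this; the temporary CT ID and the archive path below are placeholders, not the actual values used:

Code:
# restore a vzdump archive to a new CT ID (9002 here) on the ZFS-backed storage
pct restore 9002 /var/lib/vz/dump/vzdump-lxc-2002-2023_03_21-02_00_00.tar.zst --storage zfsNVMeVol1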

I have one container left which was not backed up, and while it is not critical, the fact that this could even happen certainly is, so I will wait to hear what the official response is on how to restore this to a functional container again.
 
it sounds like you enabled HA with local storage, without (properly) setting up replication and configuring HA to restrict the guests to the replicated nodes? the only thing that ever moves guests or their configs like that without admin interaction is HA.
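
as a rough sketch of what a correct setup looks like (node names, CT ID and job/group names below are just examples, adjust to your cluster): replicate the guest to the nodes it may be recovered on, and put it in a restricted HA group containing only those nodes:

Code:
# replicate CT 2002 from its current node to pve2 every 15 minutes (job id 2002-0)
pvesr create-local-job 2002-0 pve2 --schedule "*/15"
# create an HA group containing only the nodes that hold replicas, and mark it restricted
ha-manager groupadd replicated --nodes pve1,pve2 --restricted 1
# put the container under HA management inside that group
ha-manager add ct:2002 --group replicated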

anyhow, in case HA (in local storage+replication setups) accidentally recovers a guest onto a node where it doesn't belong, simply moving back the config to the original node (in /etc/pve) should allow you to start it again.

regarding the duplicate disks, if you are sure you don't need them anymore, you can rescan the corresponding guest (it will add the volumes to the guest config as unusedX) and then delete them via the guest hardware tab in the GUI.
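
the rescan can also be done on the CLI, e.g. (CT 2002 as an example):

Code:
# add any volumes found on storage but missing from the config as unusedX entries
pct rescan --vmid 2002
# the unusedX entries can then be removed via the GUI as described above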
 
Thank you for the reply Fabian,

I spent hours reading through the HA documentation to set up my HA system, but based on your response and my symptoms it sounds like I still managed to get it wrong.

To recover from this specific situation, when you say to simply move the config back to the original node do you mean I should just move the 2002.conf file from /etc/pve/lxc/ on Node 2 to /etc/pve/lxc/ on Node 1?
 
from /etc/pve/nodes/node2/.. to /etc/pve/nodes/node1/.. , assuming node2 is the "wrong" node and node1 the "correct" one ;) (/etc/pve/lxc is just a symlink pointing to the CT config folder for the local node)
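
concretely, something along these lines (assuming the node directory names match your hostnames, e.g. pve1/pve2, and CT 2002 from the example):

Code:
# move the CT config from the wrong node's directory back to the correct one;
# /etc/pve is the cluster-wide config filesystem, so this takes effect everywhere
mv /etc/pve/nodes/pve2/lxc/2002.conf /etc/pve/nodes/pve1/lxc/2002.conf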
 
Thank you, this worked!

However, I noticed I have another issue now which must be related. I have an hourly backup job scheduled for several containers, and one of these containers is now throwing an error. This is NOT one of the 4 containers that got moved to Node 2, but each backup fails with a similar (though opposite) dataset error message:

Code:
ERROR: Backup of VM 3032 failed - zfs error: cannot create snapshot 'zfsNVMeVol1/subvol-3032-disk-0@vzdump': dataset already exists

This particular container is still on Node 1, where it has always been. Attempting a manual backup results in the same message. All other containers that are part of the backup job seem to be backing up fine, and this error started at 5 PM yesterday, the same time everything else came apart due to the issue in the original post.

Shouldn't there be a dataset for this container since it exists?
 
could you post the full contents of that container's config, as well as the output of zfs list -t all -r zfsNVMeVol1/subvol-3032-disk-0?
 
cat /etc/pve/lxc/3032.conf

Code:
#SSH Regen process%3A
#
#/bin/rm -v /etc/ssh/ssh_host_*
#
#dpkg-reconfigure openssh-server
#
#systemctl restart ssh
arch: amd64
cores: 2
hostname: mqtt
memory: 512
nameserver: 192.168.17.1
net0: name=eth0,bridge=vmbr1,gw=192.168.17.1,hwaddr=2A:36:23:BD:F4:75,ip=192.168.17.55/24,tag=11,type=veth
onboot: 1
ostype: ubuntu
parent: Snap20230322A
rootfs: zfsNVMeVol1:subvol-3032-disk-0,size=8G
searchdomain: localdomain.net
startup: order=3,up=2
swap: 0
unprivileged: 1

[Snap20221202A]
#Everything working
arch: amd64
cores: 2
hostname: mqtt
memory: 512
nameserver: 192.168.17.1
net0: name=eth0,bridge=vmbr1,gw=192.168.17.1,hwaddr=2A:36:23:BD:F4:75,ip=192.168.17.55/24,tag=11,type=veth
onboot: 1
ostype: ubuntu
rootfs: zfsNVMeVol1:subvol-3032-disk-0,size=8G
searchdomain: localdomain.net
snaptime: 1670043242
startup: order=3,up=2
swap: 0
unprivileged: 1

[Snap20230122A]
arch: amd64
cores: 2
hostname: mqtt
memory: 512
nameserver: 192.168.17.1
net0: name=eth0,bridge=vmbr1,gw=192.168.17.1,hwaddr=2A:36:23:BD:F4:75,ip=192.168.17.55/24,tag=11,type=veth
onboot: 1
ostype: ubuntu
parent: Snap20221202A
rootfs: zfsNVMeVol1:subvol-3032-disk-0,size=8G
searchdomain: localdomain.net
snaptime: 1674395156
startup: order=3,up=2
swap: 0
unprivileged: 1

[Snap20230122B]
#Logging to file enabled, curl installed
arch: amd64
cores: 2
hostname: mqtt
memory: 512
nameserver: 192.168.17.1
net0: name=eth0,bridge=vmbr1,gw=192.168.17.1,hwaddr=2A:36:23:BD:F4:75,ip=192.168.17.55/24,tag=11,type=veth
onboot: 1
ostype: ubuntu
parent: Snap20230122A
rootfs: zfsNVMeVol1:subvol-3032-disk-0,size=8G
searchdomain: localdomain.net
snaptime: 1674402241
startup: order=3,up=2
swap: 0
unprivileged: 1

[Snap20230122C]
#Crowdsec installed
arch: amd64
cores: 2
hostname: mqtt
memory: 512
nameserver: 192.168.17.1
net0: name=eth0,bridge=vmbr1,gw=192.168.17.1,hwaddr=2A:36:23:BD:F4:75,ip=192.168.17.55/24,tag=11,type=veth
onboot: 1
ostype: ubuntu
parent: Snap20230122B
rootfs: zfsNVMeVol1:subvol-3032-disk-0,size=8G
searchdomain: localdomain.net
snaptime: 1674416768
startup: order=3,up=2
swap: 0
unprivileged: 1

[Snap20230322A]
arch: amd64
cores: 2
hostname: mqtt
memory: 512
nameserver: 192.168.17.1
net0: name=eth0,bridge=vmbr1,gw=192.168.17.1,hwaddr=2A:36:23:BD:F4:75,ip=192.168.17.55/24,tag=11,type=veth
onboot: 1
ostype: ubuntu
parent: Snap20230122C
rootfs: zfsNVMeVol1:subvol-3032-disk-0,size=8G
searchdomain: localdomain.net
snaptime: 1679489923
startup: order=3,up=2
swap: 0
unprivileged: 1


zfs list -t all -r zfsNVMeVol1/subvol-3032-disk-0

Code:
NAME                                                             USED  AVAIL     REFER  MOUNTPOINT
zfsNVMeVol1/subvol-3032-disk-0                                  3.11G  5.90G     2.10G  /zfsNVMeVol1/subvol-3032-disk-0
zfsNVMeVol1/subvol-3032-disk-0@Snap20221202A                     374M      -     1.54G  -
zfsNVMeVol1/subvol-3032-disk-0@Snap20230122A                     115M      -     1.68G  -
zfsNVMeVol1/subvol-3032-disk-0@Snap20230122B                    98.4M      -     1.68G  -
zfsNVMeVol1/subvol-3032-disk-0@Snap20230122C                     215M      -     1.86G  -
zfsNVMeVol1/subvol-3032-disk-0@vzdump                           16.4M      -     2.10G  -
zfsNVMeVol1/subvol-3032-disk-0@__replicate_3032-0_1679486408__  6.12M      -     2.10G  -
zfsNVMeVol1/subvol-3032-disk-0@Snap20230322A                    1008K      -     2.10G  -
zfsNVMeVol1/subvol-3032-disk-0@__replicate_3032-1_1679490793__   992K      -     2.10G  -
 
I would suggest removing the ZFS snapshot zfsNVMeVol1/subvol-3032-disk-0@vzdump on all nodes:

Code:
zfs destroy zfsNVMeVol1/subvol-3032-disk-0@vzdump
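
For reference, you can first check on each node whether the stale snapshot is actually present before destroying it:

Code:
# run on each node: look for the leftover vzdump snapshot
zfs list -t snapshot -r zfsNVMeVol1 | grep vzdump
# if it shows up, remove it
zfs destroy zfsNVMeVol1/subvol-3032-disk-0@vzdump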
 
Thank you Fabian and Google Translate. :)

This worked, and my backup jobs are running again without error.

Thanks again for your support!
 
sorry for switching language there ;)
 
