Migrating VMs between clusters erased some VM disks - also, how to qmrestore while keeping disks that were NOT backed up originally

jsalas424

I recently ran into some odd issues with clustering. I migrated a disk over PRIOR to actually making the backend storage available to the new node. This caused some wonky things to happen, including some disks being erased from their original location. Disks in the original storage directory that were never migrated were deleted, and I'm not sure what's going on.

I also realized that I could NOT migrate back because it kept throwing all kinds of "cannot activate storage" errors, which I now understand is normal behavior.

So I went to manually restore the VM on the first node, for which I had backups of the main disks. Aside from the OS disks, I have some disks attached as just large data drives that I don't back up (e.g. Zoneminder NVR storage).

I went to qmrestore from the command line, pointing it at the respective backup, but was told that I can't restore if a VM with that ID already exists, so I first deleted the VM on the new node so I could restore it to the old node.
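
For reference, the command I was running was roughly like this (the archive name and target storage here are just placeholders, not the exact ones I used):
Code:
# restore a vzdump archive into VM ID 600 on a chosen storage
qmrestore /Nextcloud.Storage/NC.VM.Backups/dump/vzdump-qemu-600-2021_06_20-00_00_00.vma.zst 600 --storage new_ssd_dir
# this refuses to run if VM 600 already exists, unless --force is passed
# (which would overwrite the existing VM and its disks)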

What I discovered is that the disks that aren't marked for backup (Zoneminder NVR) were deleted when I destroyed the VM, so when I went to restore the VM on the old node, I had the OS disks but the main storage disks were gone for good. I've since reprovisioned the disks, and this was only a minor hiccup since I don't really care to store NVR data long term.

I now have another VM that I have to restore onto the old node. The OS disk is backed up, but the storage disk is not, and I would rather avoid backing it up. The storage drive is located exactly where it should be; I'm only missing the OS currently (these are different disks).

----------
Should I move my storage disk somewhere else before deleting the VM and restoring OS from backup?
-----------
 
Hi,
I recently ran into some odd issues with clustering. I migrated a disk over PRIOR to actually making the backend storage available to the new node. This caused some wonky things to happen, including some disks being erased from their original location. Disks in the original storage directory that were never migrated were deleted, and I'm not sure what's going on.
please share the output of pveversion -v, the output of cat /etc/pve/storage.cfg, and qm config <ID>, replacing <ID> with the ID of the (now restored) VM.

I also realized that I could NOT migrate back because it kept throwing all kinds of "cannot activate storage" errors, which I now understand is normal behavior.

So I went to manually restore the VM on the first node, for which I had backups of the main disks. Aside from the OS disks, I have some disks attached as just large data drives that I don't back up (e.g. Zoneminder NVR storage).

I went to qmrestore from the command line, pointing it at the respective backup, but was told that I can't restore if a VM with that ID already exists, so I first deleted the VM on the new node so I could restore it to the old node.

What I discovered is that the disks that aren't marked for backup (Zoneminder NVR) were deleted when I destroyed the VM, so when I went to restore the VM on the old node, I had the OS disks but the main storage disks were gone for good. I've since reprovisioned the disks, and this was only a minor hiccup since I don't really care to store NVR data long term.
Deleting a VM will delete its disks. Just to clarify: had the Zoneminder NVR disk been successfully migrated, or was it on shared storage? It shouldn't have been deleted if it were still on the old node, or is that what happened?

I now have another VM that I have to restore onto the old node. The OS disk is backed up, but the storage disk is not, and I would rather avoid backing it up. The storage drive is located exactly where it should be; I'm only missing the OS currently (these are different disks).

----------
Should I move my storage disk somewhere else before deleting the VM and restoring OS from backup?
-----------
You can also restore with a fresh ID and only remove the old VM later. If you need to keep a volume around, currently you need to rename it and remove its entry from the configuration file. In a future version of PVE, it will be possible to reassign a disk to a different VM.
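
For a file-based disk on one of the dir storages here, a rough, untested sketch of that workaround could look like this (VM 600 and its NC.Disks.Dir disk are only used as an example; adapt paths and IDs to your setup):
Code:
# move the to-be-kept disk image out of the VM's image directory first,
# so it is no longer treated as owned by the VM and cannot be cleaned up
mkdir -p /Nextcloud.Storage/keep
mv /Nextcloud.Storage/images/600/vm-600-disk-0.qcow2 /Nextcloud.Storage/keep/
# drop the now-dangling reference from the VM configuration
qm set 600 --delete scsi1
# destroy the old VM and restore the backup (only backed-up disks come back)
qm destroy 600
qmrestore <archive> 600
# move the kept image back and re-attach it
mkdir -p /Nextcloud.Storage/images/600
mv /Nextcloud.Storage/keep/vm-600-disk-0.qcow2 /Nextcloud.Storage/images/600/
qm set 600 --scsi1 NC.Disks.Dir:600/vm-600-disk-0.qcow2,backup=0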
 
Hi,

please share the output of pveversion -v, the output of cat /etc/pve/storage.cfg, and qm config <ID>, replacing <ID> with the ID of the (now restored) VM.

root@TracheServ:~# pveversion -v
proxmox-ve: 6.4-1 (running kernel: 5.4.106-1-pve)
pve-manager: 6.4-8 (running version: 6.4-8/185e14db)
pve-kernel-5.4: 6.4-2
pve-kernel-helper: 6.4-2
pve-kernel-5.3: 6.1-6
pve-kernel-5.4.114-1-pve: 5.4.114-1
pve-kernel-5.4.106-1-pve: 5.4.106-1
pve-kernel-5.4.103-1-pve: 5.4.103-1
pve-kernel-5.4.101-1-pve: 5.4.101-1
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.18-2-pve: 5.3.18-2
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.1.2-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.1.0
libproxmox-backup-qemu0: 1.0.3-1
libpve-access-control: 6.4-1
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-3
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-3
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
openvswitch-switch: 2.12.3-1
proxmox-backup-client: 1.1.8-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.5-6
pve-cluster: 6.4-1
pve-container: 3.3-5
pve-docs: 6.4-2
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-4
pve-firmware: 3.2-4
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-6
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-2
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.4-pve1
root@TracheServ:~# cat /etc/pve/storage.cfg
zfspool: local-zfs
pool rpool/data
content rootdir,images
nodes TracheServ
sparse 0

dir: local
path /var/lib/vz
content vztmpl,iso,snippets,rootdir,images,backup
prune-backups keep-last=1
shared 0

zfspool: Storage.1
pool Storage.1
content rootdir,images
mountpoint /Storage.1
nodes TracheServ
sparse 0

zfs: Storage.1_iscsi
blocksize 4k
target iqn.2010-08.org.illumos:02:b00c9870-6a97-6f0b-847e-bbfb69d2e581:tank1
pool Storage.1
iscsiprovider comstar
portal 192.168.1.129
content images

zfspool: Nextcloud.Storage
pool Nextcloud.Storage
content images,rootdir
mountpoint /Nextcloud.Storage
nodes TracheServ
sparse 0

zfs: Nextcloud.Storage_iscsi
blocksize 4k
target iqn.2010-08.org.illumos:02:b00c9870-6a97-6f0b-847e-bbfb69d2e581:tank1
pool Nextcloud.Storage
iscsiprovider comstar
portal 192.168.1.129
content images

dir: NC.VM.Backups.dir
path /Nextcloud.Storage/NC.VM.Backups
content vztmpl,rootdir,backup,images,iso,snippets
is_mountpoint 1
prune-backups keep-last=18
shared 1

dir: NC.Disks.Dir
path /Nextcloud.Storage
content snippets,iso,images,backup,rootdir,vztmpl
is_mountpoint 1
prune-backups keep-last=10
shared 1

dir: User.Data.Backups
path /Nextcloud.Storage/User.Data.Backups
content vztmpl,rootdir,images,backup,iso,snippets
is_mountpoint 1
prune-backups keep-last=10
shared 1

nfs: Proxmox_backups
disable
export /data/backups/proxmox
path /mnt/pve/Proxmox_backups
server 192.168.1.139
content rootdir,backup,images,iso,snippets,vztmpl
options vers=4.2
prune-backups keep-last=6

pbs: PBS
disable
datastore 8048956_TracheSave
server pbs.tuxis.nl
content backup
encryption-key 1
fingerprint 2d:40:eb:b3:52:30:ea:29:70:cf:87:34:e0:97:19:74:93:be:46:d9:3d:42:c5:f4:85:6c:0e:06:9a:df:76:e1
prune-backups keep-last=3
username 8048956@pbs

zfspool: new_ssd
pool new_ssd
content rootdir,images
mountpoint /new_ssd
nodes TracheServ
sparse 0

dir: new_ssd_dir
path /new_ssd
content vztmpl,iso,snippets,rootdir,backup,images
is_mountpoint 1
prune-backups keep-last=4
shared 1

zfspool: media
pool new_ssd/media
content images,rootdir
mountpoint /new_ssd
nodes TracheServ
sparse 0

dir: media_dir
path /new_ssd/media
content images,backup,rootdir,snippets,iso,vztmpl
is_mountpoint 1
prune-backups keep-last=4
shared 1

zfs: new_ssd_iscsi
blocksize 4k
target iqn.2010-08.org.illumos:02:b00c9870-6a97-6f0b-847e-bbfb69d2e581:tank1
pool new_ssd
iscsiprovider comstar
portal 192.168.1.129
content images
root@TracheServ:~# qm config 600
agent: 1
boot: cdn
bootdisk: scsi0
cores: 2
cpu: kvm64,flags=+aes
description: scsi1%3A NC.Disks.Dir%3A600/vm-600-disk-0.qcow2,backup=0,discard=on,size=150G%0Ascsi1%3A NC.Disks.Dir%3A600/vm-600-disk-0.qcow2,backup=0,size=150G
ide2: none,media=cdrom
localtime: 1
memory: 3072
name: ZoneMinder
net0: virtio=0E:3F:EA:81:F7:09,bridge=vmbr0,tag=69
numa: 0
onboot: 1
ostype: l26
scsi0: new_ssd_dir:600/vm-600-disk-0.qcow2,discard=on,size=25G,ssd=1
scsi1: NC.Disks.Dir:600/vm-600-disk-0.qcow2,backup=0,size=150G
scsi2: new_ssd_dir:600/vm-600-disk-1.qcow2,discard=on,size=10G,ssd=1
scsihw: virtio-scsi-pci
smbios1: uuid=0bc0afc5-1318-42aa-b897-67f0d587def2
sockets: 3
startup: order=4
tablet: 0
vmgenid: f475edaa-d573-4664-9968-0e661b3f3644
vmstatestorage: User.Data.Backups


Deleting a VM will delete its disks. Just to clarify: had the Zoneminder NVR disk been successfully migrated, or was it on shared storage? It shouldn't have been deleted if it were still on the old node, or is that what happened?
The Zoneminder VM had migrated and was still running, but I had NOT shared the storage yet, so I guess it was just running from memory? This was a big oops, but perhaps there's a way for your team to introduce a sanity check into the migration process, preventing migration if the storage isn't available on the remote node.

You can also restore with a fresh ID and only remove the old VM later. If you need to keep a volume around, currently you need to rename it and remove its entry from the configuration file. In a future version of PVE, it will be possible to reassign a disk to a different VM.
Great!
 
The Zoneminder VM had migrated and was still running, but I had NOT shared the storage yet, so I guess it was just running from memory? This was a big oops, but perhaps there's a way for your team to introduce a sanity check into the migration process, preventing migration if the storage isn't available on the remote node.
So the NC.Disks.Dir did not have the shared 1 flag set? Or was the storage not mounted on the other node yet?

We do have such checks, but there might be some edge case that isn't covered. Could you also provide the migration task log (see the VM's Task History)?

Code:
zfspool: Nextcloud.Storage
       pool Nextcloud.Storage
       content images,rootdir
       mountpoint /Nextcloud.Storage
       nodes TracheServ
       sparse 0

zfs: Nextcloud.Storage_iscsi
       blocksize 4k
       target iqn.2010-08.org.illumos:02:b00c9870-6a97-6f0b-847e-bbfb69d2e581:tank1
       pool Nextcloud.Storage
       iscsiprovider comstar
       portal 192.168.1.129
       content images

dir: NC.Disks.Dir
       path /Nextcloud.Storage
       content snippets,iso,images,backup,rootdir,vztmpl
       is_mountpoint 1
       prune-backups keep-last=10
       shared 1
One shouldn't add the same backing storage with the same content types multiple times. PVE does not (and in general cannot easily) check for that. It might work for the zfspool+dir combo, just because of how the subdirectories are laid out, but the zfs+zfspool entries most likely clash.
 
So the NC.Disks.Dir did not have the shared 1 flag set? Or was the storage not mounted on the other node yet?
The shared flag was set but the storage was not mounted.

We do have such checks, but there might be some edge case that isn't covered. Could you also provide the migration task log (see the VM's Task History)?
Where do I find this task history? An interesting observation I made: when I tried to migrate a VM where the shared flag wasn't declared, sanity checks prevented me from doing so. But if I tried to migrate a VM where the shared flag is set but the storage isn't mounted, it "works" and then gets weird.
Code:
zfspool: Nextcloud.Storage
       pool Nextcloud.Storage
       content images,rootdir
       mountpoint /Nextcloud.Storage
       nodes TracheServ
       sparse 0

zfs: Nextcloud.Storage_iscsi
       blocksize 4k
       target iqn.2010-08.org.illumos:02:b00c9870-6a97-6f0b-847e-bbfb69d2e581:tank1
       pool Nextcloud.Storage
       iscsiprovider comstar
       portal 192.168.1.129
       content images

dir: NC.Disks.Dir
       path /Nextcloud.Storage
       content snippets,iso,images,backup,rootdir,vztmpl
       is_mountpoint 1
       prune-backups keep-last=10
       shared 1
One shouldn't add the same backing storage with the same content types multiple times. PVE does not (and in general cannot easily) check for that. It might work for the zfspool+dir combo, just because of how the subdirectories are laid out, but the zfs+zfspool entries most likely clash.
Thanks for addressing this; I'm new to clustering and was curious about this. So, the storage.cfg file we're looking at seems to be shared in the cluster. After joining the new node to the cluster, when I looked at storage.cfg on the node, it was a replica of the storage.cfg file on the host! So the zfspool entry you see is the initial storage declaration from the host, and the zfs iSCSI entry was then added on the new node to make that original storage accessible.

As for the directory mounted at the ZFS pool, this was because I wanted branching snapshot ability, which is only available for directory-backed qcow2 disks as per the discussion here. If this will cause more issues than it's worth in the long run, I will reorganize.

Thanks for the input!
 
Where do I find this task history? An interesting observation I made: when I tried to migrate a VM where the shared flag wasn't declared, sanity checks prevented me from doing so. But if I tried to migrate a VM where the shared flag is set but the storage isn't mounted, it "works" and then gets weird.
In the UI, select the source node of the migration and then the Task History entry in the left panel. The task entry should be called VM <ID> - Migrate.

Thanks for addressing this; I'm new to clustering and was curious about this. So, the storage.cfg file we're looking at seems to be shared in the cluster. After joining the new node to the cluster, when I looked at storage.cfg on the node, it was a replica of the storage.cfg file on the host! So the zfspool entry you see is the initial storage declaration from the host, and the zfs iSCSI entry was then added on the new node to make that original storage accessible.
The problem is when content types and backing paths clash, e.g. images for zfspool: Nextcloud.Storage and images for zfs: Nextcloud.Storage_iscsi, because the same volume exists in two different (from PVE's perspective) storages, and there are problems with locking for concurrent operations.

I'd suggest cleanly separating the storages using something like
Code:
zfs create Nextcloud.Storage/local
zfs create Nextcloud.Storage/shared
zfs create Nextcloud.Storage/shared/dir
and
Code:
zfspool: Nextcloud.Storage_local
       pool Nextcloud.Storage/local
       content images,rootdir
       mountpoint /Nextcloud.Storage/local
       nodes TracheServ
       sparse 0

zfs: Nextcloud.Storage_iscsi
       blocksize 4k
       target iqn.2010-08.org.illumos:02:b00c9870-6a97-6f0b-847e-bbfb69d2e581:tank1
       pool Nextcloud.Storage/shared
       iscsiprovider comstar
       portal 192.168.1.129
       content images

dir: NC.Disks.Dir
       path /Nextcloud.Storage/shared/dir
       content snippets,iso,images,backup,rootdir,vztmpl
       is_mountpoint 1
       prune-backups keep-last=10
       shared 1
so there's a 1-1 mapping between storages and backing paths. And if you just use the dir storage for rootdir and the shared storage for images, you wouldn't even need the local entry (and the shared prefix for the others).
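
If any existing images should follow that layout change, something like this might be all that's needed (using VM 600 from above purely as an example, and assuming the VM is stopped first):
Code:
# NC.Disks.Dir now points at /Nextcloud.Storage/shared/dir, so move the
# qcow2 files into the matching images/<vmid> directory under the new path;
# the volume IDs (NC.Disks.Dir:600/vm-600-disk-0.qcow2) stay the same,
# so the VM configuration does not need to change
mkdir -p /Nextcloud.Storage/shared/dir/images/600
mv /Nextcloud.Storage/images/600/*.qcow2 /Nextcloud.Storage/shared/dir/images/600/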
 
In the UI, select the source node of the migration and then the Task History entry in the left panel. The task entry should be called VM <ID> - Migrate.


The problem is when content types and backing paths clash, e.g. images for zfspool: Nextcloud.Storage and images for zfs: Nextcloud.Storage_iscsi, because the same volume exists in two different (from PVE's perspective) storages, and there are problems with locking for concurrent operations.

I'd suggest cleanly separating the storages using something like
Code:
zfs create Nextcloud.Storage/local
zfs create Nextcloud.Storage/shared
zfs create Nextcloud.Storage/shared/dir
and
Code:
zfspool: Nextcloud.Storage_local
       pool Nextcloud.Storage/local
       content images,rootdir
       mountpoint /Nextcloud.Storage/local
       nodes TracheServ
       sparse 0

zfs: Nextcloud.Storage_iscsi
       blocksize 4k
       target iqn.2010-08.org.illumos:02:b00c9870-6a97-6f0b-847e-bbfb69d2e581:tank1
       pool Nextcloud.Storage/shared
       iscsiprovider comstar
       portal 192.168.1.129
       content images

dir: NC.Disks.Dir
       path /Nextcloud.Storage/shared/dir
       content snippets,iso,images,backup,rootdir,vztmpl
       is_mountpoint 1
       prune-backups keep-last=10
       shared 1
so there's a 1-1 mapping between storages and backing paths. And if you just use the dir storage for rootdir and the shared storage for images, you wouldn't even need the local entry (and the shared prefix for the others).

So for the dir, if I'm going to share it in the cluster with NFS, would I still add an nfs entry to /etc/pve/storage.cfg? Or would I just add it to /etc/fstab and be done with it?
 
So for the dir, if I'm going to share it in the cluster with NFS, would I still add an nfs entry to /etc/pve/storage.cfg? Or would I just add it to /etc/fstab and be done with it?
You can do either, but adding an NFS entry is preferable. PVE should mount it automatically in /mnt/pve/<storage ID>, so you don't even need an fstab entry.
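
For reference, a minimal sketch of such an entry, modeled on the Proxmox_backups entry earlier in the thread (the storage ID, export path and content types here are only illustrative; pvesm add nfs can create the same thing from the CLI):
Code:
nfs: NC.Shared.Dir
       export /Nextcloud.Storage/shared/dir
       path /mnt/pve/NC.Shared.Dir
       server 192.168.1.129
       content images,backup
       options vers=4.2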
 
You can do either, but adding an NFS entry is preferable. PVE should mount it automatically in /mnt/pve/<storage ID>, so you don't even need an fstab entry.
If adding an NFS entry to storage.cfg auto-mounts everything in /mnt/pve/<storage ID>, then it seems to me that this is the less preferable method. From what I can tell, when you share a storage in a cluster, the node expects the folder to be mounted in the same place on both the node and the host. In other words, my node expects the /Nextcloud.Storage directory from my host to also be present on the node, so I put in an fstab entry to specify where it mounts, as such:

Code:
192.168.1.129:/Nextcloud.Storage /Nextcloud.Storage nfs defaults 0 0

Here, 192.168.1.129 is the host, and this is the fstab entry on the node.

Am I missing something fundamental here about how clustering and shared file systems should work in PVE?
 
If adding an NFS entry to storage.cfg auto-mounts everything in /mnt/pve/<storage ID>, then it seems to me that this is the less preferable method. From what I can tell, when you share a storage in a cluster, the node expects the folder to be mounted in the same place on both the node and the host.
PVE will mount the storage on the node exporting the NFS at the same place too. In that sense, it doesn't really matter if the shared storage is external or residing on a cluster node. Although, if you're going for an HA setup, the node exporting the share is a single point of failure, which is not ideal.
 
