Cannot migrate between ZFS and Ceph

kenneth_vkd

Hi
We have some larger virtual servers running on a ZFS mirror of NVMe drives. These virtual servers run various applications, but mostly Microsoft Windows Server with Microsoft SQL Server on top.
In terms of resources, they have 500-900 GB of disk space on a single virtual hard drive and 10-60 GB of RAM.
We are now in the process of migrating these to run on Ceph, but we are running into some issues.
When we shut down a VM and try an offline migration, we get an error about not being able to migrate from zfspool to rbd:
Code:
2021-11-14 11:04:49 starting migration of VM 314 to node 'ns7771' (172.16.0.16)
2021-11-14 11:04:49 found local disk 'local-zfs:vm-314-disk-0' (in current VM config)
2021-11-14 11:04:50 copying local disk images
2021-11-14 11:04:50 using a bandwidth limit of 104857600 bps for transferring 'local-zfs:vm-314-disk-0'
2021-11-14 11:04:50 ERROR: storage migration for 'local-zfs:vm-314-disk-0' to storage 'tier1' failed - cannot migrate from storage type 'zfspool' to 'rbd'
2021-11-14 11:04:50 aborting phase 1 - cleanup resources
2021-11-14 11:04:50 ERROR: migration aborted (duration 00:00:01): storage migration for 'local-zfs:vm-314-disk-0' to storage 'tier1' failed - cannot migrate from storage type 'zfspool' to 'rbd'
TASK ERROR: migration aborted

Then we tried live/online migration, which works for small virtual servers, but not for the larger ones.
Here are the last few lines from an 8 vCore, 40 GB RAM, 800 GB virtual hard drive system where the live migration failed just before completing:
Code:
2021-11-14 14:47:17 migration active, transferred 42.8 GiB of 40.0 GiB VM-state, 152.4 MiB/s
2021-11-14 14:47:17 xbzrle: send updates to 21567 pages in 4.5 MiB encoded memory, cache-miss 92.97%, overflow 262
query migrate failed: VM 314 qmp command 'query-migrate' failed - client closed connection

2021-11-14 14:47:18 query migrate failed: VM 314 qmp command 'query-migrate' failed - client closed connection
query migrate failed: VM 314 not running

2021-11-14 14:47:20 query migrate failed: VM 314 not running
query migrate failed: VM 314 not running

2021-11-14 14:47:21 query migrate failed: VM 314 not running
query migrate failed: VM 314 not running

2021-11-14 14:47:22 query migrate failed: VM 314 not running
query migrate failed: VM 314 not running

2021-11-14 14:47:23 query migrate failed: VM 314 not running
query migrate failed: VM 314 not running

2021-11-14 14:47:24 query migrate failed: VM 314 not running
2021-11-14 14:47:24 ERROR: online migrate failure - too many query migrate failures - aborting
2021-11-14 14:47:24 aborting phase 2 - cleanup resources
2021-11-14 14:47:24 migrate_cancel
2021-11-14 14:47:24 migrate_cancel error: VM 314 not running
drive-scsi0: Cancelling block job
2021-11-14 14:47:24 ERROR: VM 314 not running
2021-11-14 14:52:40 ERROR: migration finished with problems (duration 03:47:17)
TASK ERROR: migration problems

Before migrating, we verified that there was enough space on the Ceph cluster and that the target node had enough free RAM to hold the VM.
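Roughly along these lines (just a sketch; the exact commands we ran may have differed slightly):
Code:
# overall and per-pool usage of the Ceph cluster
ceph df
# free/used space of the storages as Proxmox sees them (including 'tier1')
pvesm status
# free RAM on the target node
free -h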

Is there a way to offline-migrate VMs of this size when no zfspool storage is available on the target node?
 
Backup & restore should always work; just restore on the other node to the new storage type.
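Roughly like this (sketch only; the backup storage name 'backup-nfs' and the exact archive filename are placeholders, adjust them to your setup):
Code:
# on the source node: back up the stopped VM to a backup storage
vzdump 314 --storage backup-nfs --mode stop --compress zstd
# on the target node: restore the archive onto the Ceph storage 'tier1'
qmrestore /mnt/pve/backup-nfs/dump/vzdump-qemu-314-<timestamp>.vma.zst 314 --storage tier1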
Backup & restore does not really solve it, as it increases service downtime a lot. This method would require us to take the servers out of service so that data does not change, then make the backup, delete the VM, and then restore it on the target node. If the backup turns out to be corrupt, all data is lost unless we keep the "old" VM until we have verified that it actually was restored properly.

Surely there must be a way to do an actual migration.
 
Assuming your Ceph storage is available on the current node as well, you can just use 'Move disk' (while the VM is running). Why the live migration is failing might be worth investigating as well, but that is likely an unrelated issue.
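On the CLI that would be something like this (sketch; VM ID, disk and storage names taken from your log):
Code:
# move the running VM's disk from local-zfs to the Ceph RBD storage 'tier1';
# --delete removes the source volume once the move has completed
qm move_disk 314 scsi0 tier1 --delete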
 
Assuming your Ceph storage is available on the current node as well, you can just use 'Move disk' (while the VM is running). Why the live migration is failing might be worth investigating as well, but that is likely an unrelated issue.
The Ceph storage is not locally available on the node, but enabling it in the cluster storage configuration for the node does make it show up.
However, since the current host only has a "slow" 2 Gbps link, Ceph will perform poorly, assuming that it will even migrate to the Ceph cluster when the cluster disks are not attached directly to the local node.
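For reference, we enabled it roughly like this (sketch; 'sourcenode' is a placeholder for our local node's name, and the node list is shortened for the example):
Code:
# allow the Ceph RBD storage 'tier1' to be used on the source node as well
# (list every node that should see the storage)
pvesm set tier1 --nodes ns7771,sourcenode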
 
It should work, but yes, performance will be a bit limited. It might allow you to move the VM offline to the other node, though, in case you can't fix the live migration issue. Alternatively, you could also move to some other non-ZFS local storage and then offline-migrate the VM (e.g., raw image files/volumes on directory/NFS/LVM-Thin to Ceph should all work), as sketched below.
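A rough sketch of that route (the directory storage name 'local' is an assumption; the rest is taken from the log above):
Code:
# 1. move the disk off ZFS onto a directory storage as a raw image
qm move_disk 314 scsi0 local --format raw --delete
# 2. offline-migrate the stopped VM, placing the disk on the Ceph storage 'tier1'
qm migrate 314 ns7771 --targetstorage tier1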
 
Likely found the issue with live migration combined with storage migration. It's already fixed upstream, and I just sent a patch to include the fix in our next QEMU package release (>= 6.1.0-2). Once that hits the repos and you have upgraded, you'll need to cold-restart (power off, then start again) your VMs so that they run the updated, fixed binary.

Migration should work again then. Note that this only affects live migration with local disks/changing storage, so any VMs that only use shared disks without switching storage should be unaffected (and consequently don't need to be restarted either).
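Roughly (sketch; package/version as mentioned above, VM ID taken from your log):
Code:
# check the installed QEMU package version on the node
pveversion -v | grep pve-qemu-kvm
# once it is >= 6.1.0-2: cold-restart the VM so it runs the fixed binary
qm shutdown 314
qm start 314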
 
