PVE Online Migration Fails After Update to PVE 5.4-3

HE_Cole

Member
Oct 25, 2018
Miami, FL
Hello Everyone!

My PVE cluster has been running great for many months, but I just got around to updating all the PVE nodes to the latest version, PVE 5.4-3.

I started by live-migrating all my VMs to another node, and that worked great.

I then set node-out on the first node, live-migrated all the VMs from it, then set each OSD on that node to out and waited for backfill.
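
For reference, the OSD part was just the usual out-and-wait sequence, roughly like this (the OSD IDs below are only examples; use the ones that live on the node being emptied):

Code:
ceph osd out 3
ceph osd out 4
ceph -s    # watch until backfill/recovery finishes and health is OK again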

Once backfill was complete, I updated all the nodes and rebooted the node that was now empty.

The node came back online fine, quorum is good, and the cluster is healthy.

BUT now, when I try to live-migrate the VMs back to the updated node, I receive an error and the live migration fails.

Here is the error:

Code:
2019-04-12 16:21:20 starting migration of VM 104 to node 'he-s08-r01-pve02' (23.136.0-hidden)
2019-04-12 16:21:21 copying disk images
2019-04-12 16:21:21 starting VM 104 on remote node 'he-s08-r01-pve02'
2019-04-12 16:21:22 error with cfs lock 'storage-VM-STOR2-PVE02': rbd create vm-104-cloudinit' error: rbd: create error: (17) File exists
2019-04-12 16:21:22 ERROR: online migrate failure - command '/usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=he-s08-r01-pve02' root@23.136.0.11 qm start 104 --skiplock --migratedfrom he-s07-r01-pve02 --migration_type secure --stateuri unix --machine pc-i440fx-2.12' failed: exit code 255
2019-04-12 16:21:22 aborting phase 2 - cleanup resources
2019-04-12 16:21:22 migrate_cancel
2019-04-12 16:21:23 ERROR: migration finished with problems (duration 00:00:03)
TASK ERROR: migration problems

I have tried live migration with other VMs on the other nodes and every one of them fails with the same error.
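
Since the error complains that the rbd image already exists, the pool can be checked for a leftover cloudinit image with something like the following (I'm assuming the Ceph pool name matches the storage ID here, which may not be the case on your setup):

Code:
rbd -p VM-STOR2-PVE02 ls | grep cloudinit     # list any leftover cloudinit images
rbd -p VM-STOR2-PVE02 info vm-104-cloudinit   # inspect the image the error is about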

Any ideas on how to correct this?


I am running the latest PVE version on ALL nodes. 2 of the 3 have NOT been rebooted since the update.

Code:
# pveversion --verbose
proxmox-ve: 5.4-1 (running kernel: 4.15.18-12-pve)
pve-manager: 5.4-3 (running version: 5.4-3/0a6eaa62)
pve-kernel-4.15: 5.3-3
pve-kernel-4.15.18-12-pve: 4.15.18-35
pve-kernel-4.15.18-10-pve: 4.15.18-32
pve-kernel-4.15.18-9-pve: 4.15.18-30
pve-kernel-4.15.17-1-pve: 4.15.17-9
ceph: 12.2.11-pve1
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
 
Hi, some new information on this error.

I found that if I remove the cloud-init drive from the VM in question, I can then live-migrate it to any node.

BUT if I re-add the cloud-init drive to the VM after migration, the VM will NOT start and gives this error:

Code:
rbd: create error: (17) File exists2019-04-12 17:36:31.736199 7f696d44f0c0 -1 librbd: rbd image vm-104-cloudinit already exists
TASK ERROR: error with cfs lock 'storage-VM-STOR2-PVE02': rbd create vm-104-cloudinit' error: rbd: create error: (17) File exists2019-04-12 17:36:31.736199 7f696d44f0c0 -1 librbd: rbd image vm-104-cloudinit already exists

This error occurs with any VM on any node in the cluster: if you delete the cloud-init drive, migrate the VM, then re-ADD the cloud-init drive, the VM WON'T start and gives the same error as above.
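
In CLI terms the reproduce sequence is roughly this (I'm assuming the cloud-init drive sits on ide2, which is where mine is; adjust the VMID, slot, storage, and target node for your setup):

Code:
qm set 104 --delete ide2                      # remove the cloud-init drive
qm migrate 104 he-s08-r01-pve02 --online      # live migration now works
qm set 104 --ide2 VM-STOR2-PVE02:cloudinit    # re-adding it fails with "File exists"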

Seems to be related to the cloud-init drive.

As a note, both the VM's disk and the cloud-init drive are stored on Ceph shared storage.

This error has never happened before; I have done plenty of live migrations. But after the update to PVE 5.4-3 I can no longer migrate my VMs without them erroring out.
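
If it really is just a leftover image on the pool, I'm guessing a manual cleanup along these lines might work; I have NOT tried it yet, so treat it as a guess, and double-check the pool name and image name before removing anything:

Code:
rbd -p VM-STOR2-PVE02 info vm-104-cloudinit   # confirm the stale cloudinit image exists
rbd -p VM-STOR2-PVE02 rm vm-104-cloudinit     # remove the leftover image (careful!)
qm set 104 --ide2 VM-STOR2-PVE02:cloudinit    # re-add the cloud-init drive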

I hope this new info helps.
 
Live migration under Proxmox really is hit and miss. Never run any live migration after an update before testing it in a lab environment. Stuff like this has happened way too often since PVE 5. Don't get me wrong, I love Proxmox and the work they're doing, but live migration has become kind of unreliable since v5, and it seems like having a subscription doesn't help either.

Hope this will be fixed soon!

Regards