Live migration with local disks doesn't copy any data from the HDDs anymore

Sralityhe

Hi Guys,

I just wanted to upgrade my Debian systems for the latest systemd CVEs, so, as always, I migrated them via
"qm migrate vmid pveid --online --with-local-disks"

What was strange was the time: it was way faster than usual - and indeed the VM instantly crashed on the new host.
It seems like it didn't copy ANY data from the qcow2 files. It's just a big empty file!

I just spent the last 2 hours recovering my stuff - shit happens - but then I tried to reproduce this behavior, and it worked.

I created a new VM, used grml to create a filesystem, touched a file, moved the VM to another host, and tried to mount the HDD on the new host (see screenshot). Nothing... everything went down to hell :D

It seems like live migration no longer copies any real data besides the metadata of the virtual HDDs.
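For anyone who wants to check this on their own setup, the allocation of the copied image can be inspected with qemu-img after the migration (the paths below are only examples, adjust them to your storage layout):

Code:
# on the target node: a large virtual size but almost no "disk size" / no
# allocated ranges means the image is effectively empty
qemu-img info /var/lib/vz/images/<vmid>/vm-<vmid>-disk-0.qcow2
qemu-img map /var/lib/vz/images/<vmid>/vm-<vmid>-disk-0.qcow2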

Feedback would be highly appreciated.

Thanks!
 

Attachments

  • 1f252d3b4c50e1a8e563.png

Please post your `/etc/pve/storage.cfg`.

Hi Stoiko,

thanks for your reply.

Code:
dir: local
        path /var/lib/vz
        content rootdir,images,backup,vztmpl
        maxfiles 0
        shared 0

nfs: nfs
        export /var/nfs
        path /mnt/pve/nfs
        server 10.14.0.9
        content images,iso,vztmpl
        maxfiles 1
        options vers=3

It's the same on all nodes.

Kind regards!
 
I had a similar experience yesterday when I added a pve-manager/5.3-7/e8ed1e22 node to a cluster running pve-manager/5.2-1/0fcd7879 with ZFS as storage.
I live migrated a VM with local disks from PVE 5.2 over to 5.3 and it worked. Then I migrated it back to 5.2 and started getting FS errors, so I tried to reboot, and it could not boot anymore.
I thought it had something to do with migrating back to a lower PVE version, as that is probably not tested/supported, and left it at that. But seeing this post made me think it might be a bug and that it should live migrate back without problems. Sadly, I do not have two nodes with 5.3 to test with.
When I get a chance, I will test it out again and submit the logs.
 
I can confirm that it worked with 5.2 but not with 5.3 (kernel and every other package up to date).

Regards
 
@freakshow10000 : could I ask you for the output of `pveversion -v` with the working config (5.2) and the one where you have problems?
(helps narrow down the potential regression) - Thanks!
 
Here is what I found testing live migration. I think the problem comes from the starting suffix number for disks.

Migration from pve-manager/5.2-1/0fcd7879 to pve-manager/5.2-1/0fcd7879 works as expected.
Code:
root@p27:~# qm migrate 125 p25 --online --with-local-disks
2019-01-16 18:04:13 starting migration of VM 125 to node 'p25' (10.31.1.25)
2019-01-16 18:04:13 found local disk 'local-zfs:vm-125-disk-1' (in current VM config)
2019-01-16 18:04:13 copying disk images
2019-01-16 18:04:13 starting VM 125 on remote node 'p25'
2019-01-16 18:04:16 start remote tunnel
2019-01-16 18:04:17 ssh tunnel ver 1
2019-01-16 18:04:17 starting storage migration
2019-01-16 18:04:17 scsi0: start migration to nbd:10.31.1.25:60000:exportname=drive-scsi0
drive mirror is starting for drive-scsi0
drive-scsi0: transferred: 0 bytes remaining: 16106127360 bytes total: 16106127360 bytes progression: 0.00 % busy: 1 ready: 0
drive-scsi0: transferred: 542113792 bytes remaining: 15564013568 bytes total: 16106127360 bytes progression: 3.37 % busy: 1 ready: 0
drive-scsi0: transferred: 1224736768 bytes remaining: 14881390592 bytes total: 16106127360 bytes progression: 7.60 % busy: 1 ready: 0
drive-scsi0: transferred: 1865416704 bytes remaining: 14240710656 bytes total: 16106127360 bytes progression: 11.58 % busy: 1 ready: 0
drive-scsi0: transferred: 2502950912 bytes remaining: 13603176448 bytes total: 16106127360 bytes progression: 15.54 % busy: 1 ready: 0
drive-scsi0: transferred: 2762997760 bytes remaining: 13343129600 bytes total: 16106127360 bytes progression: 17.15 % busy: 1 ready: 0
drive-scsi0: transferred: 2861563904 bytes remaining: 13244563456 bytes total: 16106127360 bytes progression: 17.77 % busy: 1 ready: 0
drive-scsi0: transferred: 3382706176 bytes remaining: 12723421184 bytes total: 16106127360 bytes progression: 21.00 % busy: 1 ready: 0
drive-scsi0: transferred: 4062183424 bytes remaining: 12043943936 bytes total: 16106127360 bytes progression: 25.22 % busy: 1 ready: 0
drive-scsi0: transferred: 4701814784 bytes remaining: 11404312576 bytes total: 16106127360 bytes progression: 29.19 % busy: 1 ready: 0
drive-scsi0: transferred: 4863295488 bytes remaining: 11242831872 bytes total: 16106127360 bytes progression: 30.20 % busy: 1 ready: 0
drive-scsi0: transferred: 4880072704 bytes remaining: 11226054656 bytes total: 16106127360 bytes progression: 30.30 % busy: 1 ready: 0
drive-scsi0: transferred: 4880072704 bytes remaining: 11226054656 bytes total: 16106127360 bytes progression: 30.30 % busy: 1 ready: 0
drive-scsi0: transferred: 5114953728 bytes remaining: 10991173632 bytes total: 16106127360 bytes progression: 31.76 % busy: 1 ready: 0
drive-scsi0: transferred: 5738856448 bytes remaining: 10367270912 bytes total: 16106127360 bytes progression: 35.63 % busy: 1 ready: 0
drive-scsi0: transferred: 6311378944 bytes remaining: 9794748416 bytes total: 16106127360 bytes progression: 39.19 % busy: 1 ready: 0
drive-scsi0: transferred: 7051673600 bytes remaining: 9054453760 bytes total: 16106127360 bytes progression: 43.78 % busy: 1 ready: 0
drive-scsi0: transferred: 7758413824 bytes remaining: 8347713536 bytes total: 16106127360 bytes progression: 48.17 % busy: 1 ready: 0
drive-scsi0: transferred: 8426356736 bytes remaining: 7679770624 bytes total: 16106127360 bytes progression: 52.32 % busy: 1 ready: 0
drive-scsi0: transferred: 8902410240 bytes remaining: 7203717120 bytes total: 16106127360 bytes progression: 55.27 % busy: 1 ready: 0
drive-scsi0: transferred: 8902410240 bytes remaining: 7203717120 bytes total: 16106127360 bytes progression: 55.27 % busy: 1 ready: 0
drive-scsi0: transferred: 8902410240 bytes remaining: 7203717120 bytes total: 16106127360 bytes progression: 55.27 % busy: 1 ready: 0
drive-scsi0: transferred: 8902410240 bytes remaining: 7203717120 bytes total: 16106127360 bytes progression: 55.27 % busy: 1 ready: 0
drive-scsi0: transferred: 8902410240 bytes remaining: 7203717120 bytes total: 16106127360 bytes progression: 55.27 % busy: 1 ready: 0
drive-scsi0: transferred: 9414115328 bytes remaining: 6692208640 bytes total: 16106323968 bytes progression: 58.45 % busy: 1 ready: 0
drive-scsi0: transferred: 10020192256 bytes remaining: 6086131712 bytes total: 16106323968 bytes progression: 62.21 % busy: 1 ready: 0
drive-scsi0: transferred: 10634657792 bytes remaining: 5471666176 bytes total: 16106323968 bytes progression: 66.03 % busy: 1 ready: 0
drive-scsi0: transferred: 10872684544 bytes remaining: 5233639424 bytes total: 16106323968 bytes progression: 67.51 % busy: 1 ready: 0
drive-scsi0: transferred: 10872684544 bytes remaining: 5233639424 bytes total: 16106323968 bytes progression: 67.51 % busy: 1 ready: 0
drive-scsi0: transferred: 10872684544 bytes remaining: 5233639424 bytes total: 16106323968 bytes progression: 67.51 % busy: 1 ready: 0
drive-scsi0: transferred: 10872684544 bytes remaining: 5233639424 bytes total: 16106323968 bytes progression: 67.51 % busy: 1 ready: 0
drive-scsi0: transferred: 10872684544 bytes remaining: 5233639424 bytes total: 16106323968 bytes progression: 67.51 % busy: 1 ready: 0
drive-scsi0: transferred: 10872684544 bytes remaining: 5233639424 bytes total: 16106323968 bytes progression: 67.51 % busy: 1 ready: 0
drive-scsi0: transferred: 11012145152 bytes remaining: 5094178816 bytes total: 16106323968 bytes progression: 68.37 % busy: 1 ready: 0
drive-scsi0: transferred: 11012145152 bytes remaining: 5094178816 bytes total: 16106323968 bytes progression: 68.37 % busy: 1 ready: 0
drive-scsi0: transferred: 11245977600 bytes remaining: 4860346368 bytes total: 16106323968 bytes progression: 69.82 % busy: 1 ready: 0
drive-scsi0: transferred: 11856248832 bytes remaining: 4250075136 bytes total: 16106323968 bytes progression: 73.61 % busy: 1 ready: 0
drive-scsi0: transferred: 12481200128 bytes remaining: 3625123840 bytes total: 16106323968 bytes progression: 77.49 % busy: 1 ready: 0
drive-scsi0: transferred: 13074694144 bytes remaining: 3031629824 bytes total: 16106323968 bytes progression: 81.18 % busy: 1 ready: 0
drive-scsi0: transferred: 13778288640 bytes remaining: 2328035328 bytes total: 16106323968 bytes progression: 85.55 % busy: 1 ready: 0
drive-scsi0: transferred: 14431551488 bytes remaining: 1674772480 bytes total: 16106323968 bytes progression: 89.60 % busy: 1 ready: 0
drive-scsi0: transferred: 14991491072 bytes remaining: 1114832896 bytes total: 16106323968 bytes progression: 93.08 % busy: 1 ready: 0
drive-scsi0: transferred: 15694036992 bytes remaining: 412286976 bytes total: 16106323968 bytes progression: 97.44 % busy: 1 ready: 0
drive-scsi0: transferred: 16106323968 bytes remaining: 0 bytes total: 16106323968 bytes progression: 100.00 % busy: 1 ready: 0
drive-scsi0: transferred: 16106323968 bytes remaining: 0 bytes total: 16106323968 bytes progression: 100.00 % busy: 1 ready: 0
drive-scsi0: transferred: 16106323968 bytes remaining: 0 bytes total: 16106323968 bytes progression: 100.00 % busy: 0 ready: 1
all mirroring jobs are ready
2019-01-16 18:05:03 starting online/live migration on unix:/run/qemu-server/125.migrate
2019-01-16 18:05:03 migrate_set_speed: 8589934592
2019-01-16 18:05:03 migrate_set_downtime: 0.1
2019-01-16 18:05:03 set migration_caps
2019-01-16 18:05:03 set cachesize: 134217728
2019-01-16 18:05:03 start migrate command to unix:/run/qemu-server/125.migrate
2019-01-16 18:05:04 migration speed: 21.79 MB/s - downtime 26 ms
2019-01-16 18:05:04 migration status: completed
drive-scsi0: transferred: 16106323968 bytes remaining: 0 bytes total: 16106323968 bytes progression: 100.00 % busy: 0 ready: 1
all mirroring jobs are ready
drive-scsi0: Completing block job...
drive-scsi0: Completed successfully.
drive-scsi0 : finished
2019-01-16 18:05:09 migration finished successfully (duration 00:00:57)

Migration from pve-manager/5.2-1/0fcd7879 to pve-manager/5.3-7/e8ed1e22 also works, almost as expected.
However, the ZFS disk device name changed from vm-125-disk-1 to vm-125-disk-0.
Code:
root@p25:~# qm migrate 125 p28  --online --with-local-disks
2019-01-16 18:06:50 starting migration of VM 125 to node 'p28' (10.31.1.28)
2019-01-16 18:06:50 found local disk 'local-zfs:vm-125-disk-1' (in current VM config)
2019-01-16 18:06:50 copying disk images
2019-01-16 18:06:50 starting VM 125 on remote node 'p28'
2019-01-16 18:06:53 start remote tunnel
2019-01-16 18:06:53 ssh tunnel ver 1
2019-01-16 18:06:53 starting storage migration
2019-01-16 18:06:53 scsi0: start migration to nbd:10.31.1.28:60000:exportname=drive-scsi0
drive mirror is starting for drive-scsi0
drive-scsi0: transferred: 0 bytes remaining: 16106127360 bytes total: 16106127360 bytes progression: 0.00 % busy: 1 ready: 0
drive-scsi0: transferred: 954204160 bytes remaining: 15151923200 bytes total: 16106127360 bytes progression: 5.92 % busy: 1 ready: 0
drive-scsi0: transferred: 1264582656 bytes remaining: 14841544704 bytes total: 16106127360 bytes progression: 7.85 % busy: 1 ready: 0
drive-scsi0: transferred: 2006974464 bytes remaining: 14099152896 bytes total: 16106127360 bytes progression: 12.46 % busy: 1 ready: 0
drive-scsi0: transferred: 2678063104 bytes remaining: 13428064256 bytes total: 16106127360 bytes progression: 16.63 % busy: 1 ready: 0
drive-scsi0: transferred: 3695181824 bytes remaining: 12410945536 bytes total: 16106127360 bytes progression: 22.94 % busy: 1 ready: 0
drive-scsi0: transferred: 4696571904 bytes remaining: 11409555456 bytes total: 16106127360 bytes progression: 29.16 % busy: 1 ready: 0
drive-scsi0: transferred: 5680136192 bytes remaining: 10425991168 bytes total: 16106127360 bytes progression: 35.27 % busy: 1 ready: 0
drive-scsi0: transferred: 6637486080 bytes remaining: 9468641280 bytes total: 16106127360 bytes progression: 41.21 % busy: 1 ready: 0
drive-scsi0: transferred: 7356809216 bytes remaining: 8749318144 bytes total: 16106127360 bytes progression: 45.68 % busy: 1 ready: 0
drive-scsi0: transferred: 7356809216 bytes remaining: 8749318144 bytes total: 16106127360 bytes progression: 45.68 % busy: 1 ready: 0
drive-scsi0: transferred: 7837057024 bytes remaining: 8269070336 bytes total: 16106127360 bytes progression: 48.66 % busy: 1 ready: 0
drive-scsi0: transferred: 7885291520 bytes remaining: 8220835840 bytes total: 16106127360 bytes progression: 48.96 % busy: 1 ready: 0
drive-scsi0: transferred: 8519680000 bytes remaining: 7586447360 bytes total: 16106127360 bytes progression: 52.90 % busy: 1 ready: 0
drive-scsi0: transferred: 9334423552 bytes remaining: 6771703808 bytes total: 16106127360 bytes progression: 57.96 % busy: 1 ready: 0
drive-scsi0: transferred: 10162798592 bytes remaining: 5943328768 bytes total: 16106127360 bytes progression: 63.10 % busy: 1 ready: 0
drive-scsi0: transferred: 11140071424 bytes remaining: 4966055936 bytes total: 16106127360 bytes progression: 69.17 % busy: 1 ready: 0
drive-scsi0: transferred: 11626610688 bytes remaining: 4479516672 bytes total: 16106127360 bytes progression: 72.19 % busy: 1 ready: 0
drive-scsi0: transferred: 12583960576 bytes remaining: 3522166784 bytes total: 16106127360 bytes progression: 78.13 % busy: 1 ready: 0
drive-scsi0: transferred: 13614710784 bytes remaining: 2491416576 bytes total: 16106127360 bytes progression: 84.53 % busy: 1 ready: 0
drive-scsi0: transferred: 14286848000 bytes remaining: 1819279360 bytes total: 16106127360 bytes progression: 88.70 % busy: 1 ready: 0
drive-scsi0: transferred: 15258877952 bytes remaining: 847249408 bytes total: 16106127360 bytes progression: 94.74 % busy: 1 ready: 0
drive-scsi0: transferred: 16106127360 bytes remaining: 0 bytes total: 16106127360 bytes progression: 100.00 % busy: 1 ready: 0
drive-scsi0: transferred: 16106127360 bytes remaining: 0 bytes total: 16106127360 bytes progression: 100.00 % busy: 0 ready: 1
all mirroring jobs are ready
2019-01-16 18:07:17 starting online/live migration on unix:/run/qemu-server/125.migrate
2019-01-16 18:07:17 migrate_set_speed: 8589934592
2019-01-16 18:07:17 migrate_set_downtime: 0.1
2019-01-16 18:07:17 set migration_caps
2019-01-16 18:07:17 set cachesize: 134217728
2019-01-16 18:07:17 start migrate command to unix:/run/qemu-server/125.migrate
2019-01-16 18:07:18 migration speed: 40.96 MB/s - downtime 21 ms
2019-01-16 18:07:18 migration status: completed
drive-scsi0: transferred: 16106127360 bytes remaining: 0 bytes total: 16106127360 bytes progression: 100.00 % busy: 0 ready: 1
all mirroring jobs are ready
drive-scsi0: Completing block job...
drive-scsi0: Completed successfully.
drive-scsi0 : finished
2019-01-16 18:07:24 migration finished successfully (duration 00:00:35)
root@p25:~#

Migration back is reported as successful, but in reality it failed.
Code:
root@p28:~# qm migrate 125 p27 --online --with-local-disks
2019-01-16 18:13:33 starting migration of VM 125 to node 'p27' (10.31.1.27)
2019-01-16 18:13:33 found local disk 'local-zfs:vm-125-disk-0' (in current VM config)
2019-01-16 18:13:33 copying disk images
2019-01-16 18:13:33 starting VM 125 on remote node 'p27'
2019-01-16 18:13:36 start remote tunnel
2019-01-16 18:13:37 ssh tunnel ver 1
2019-01-16 18:13:37 starting online/live migration on unix:/run/qemu-server/125.migrate
2019-01-16 18:13:37 migrate_set_speed: 8589934592
2019-01-16 18:13:37 migrate_set_downtime: 0.1
2019-01-16 18:13:37 set migration_caps
2019-01-16 18:13:37 set cachesize: 134217728
2019-01-16 18:13:37 start migrate command to unix:/run/qemu-server/125.migrate
2019-01-16 18:13:38 migration speed: 1024.00 MB/s - downtime 8 ms
2019-01-16 18:13:38 migration status: completed
2019-01-16 18:13:41 migration finished successfully (duration 00:00:09)
We can still see the disk on the source:
Code:
root@p28:~# zfs list -t all | grep 125
rpool/data/vm-125-disk-0    891M   892G   891M  -
This is the new disk on destination, still empty:
Code:
rpool/data/vm-125-disk-1                                              56K  4.62T    56K  -
The VM is still running but spitting out FS errors, because the data is gone.
I can only stop it, and then it obviously won't start again, because:
Code:
Could not open '/dev/zvol/rpool/data/vm-125-disk-0': No such file or directory
I guess I could zfs send that drive from 5.3 back to 5.2, fix the naming in ZFS or in the VM config file, and get it running again.
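Something along these lines should do it (untested sketch; hostnames and dataset names are taken from the logs above, the snapshot name is just an example):

Code:
# on p28 (5.3 node, where the real data still lives): snapshot and send the zvol to p27
zfs snapshot rpool/data/vm-125-disk-0@rescue
zfs send rpool/data/vm-125-disk-0@rescue | ssh p27 zfs receive rpool/data/vm-125-disk-0

# on p27: remove the empty image left behind by the failed migration
zfs destroy rpool/data/vm-125-disk-1

# the 125.conf on p27 already references vm-125-disk-0, so the VM should start again;
# alternatively rename the dataset and adjust scsi0 in /etc/pve/qemu-server/125.conf:
# zfs rename rpool/data/vm-125-disk-0 rpool/data/vm-125-disk-1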

So it seems that the problem might be related to the index number at the end of the drive name: 5.3 starts with -0 while 5.2 starts with -1.
I guess if that were fixed so that 5.3 also started with the -1 suffix, all might be good and we could upgrade our clusters.
 
@freakshow10000 : could I ask you for the output of `pveversion -v` with the working config (5.2) and the one where you have problems?
(helps narrow down the potential regression) - Thanks!

I'm sorry, but I don't have any 5.2 nodes in production anymore - I just remember that it worked back then.
But I can say that my uptime was ~60 days with the latest kernel. I remember migrating VMs in order to update the kernel, so the bug probably appeared within the last 60 days.

Sorry that I can't tell you anything more.

Regards!

The non-working version is:
Code:
pveversion -v
proxmox-ve: 5.3-1 (running kernel: 4.15.18-9-pve)
pve-manager: 5.3-7 (running version: 5.3-7/e8ed1e22)
pve-kernel-4.15: 5.2-12
pve-kernel-4.15.18-9-pve: 4.15.18-30
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-3
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-43
libpve-guest-common-perl: 2.0-19
libpve-http-server-perl: 2.0-11
libpve-storage-perl: 5.0-35
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-1
lxcfs: 3.0.2-2
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-22
pve-cluster: 5.0-33
pve-container: 2.0-33
pve-docs: 5.3-1
pve-edk2-firmware: 1.20181023-1
pve-firewall: 3.0-16
pve-firmware: 2.0-6
pve-ha-manager: 2.0-6
pve-i18n: 1.0-9
pve-libspice-server1: 0.14.1-1
pve-qemu-kvm: 2.12.1-1
pve-xtermjs: 1.0-5
qemu-server: 5.0-44
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
 
Not yet - please open a bug report (referencing this thread) at https://bugzilla.proxmox.com.

I didn't get around to reproducing it yesterday, but I will try to do so today.
Thanks for catching this!
 
Please let us know when the patch reaches the PVE no-subscription repo.

Also, on a related note: disk suffixes, e.g. vm-ID-disk-X where X is the suffix, should not change with live migration, right?
 
With local storage (--with-local-disks flag) they do change, because a new disk image gets created on the target node, which is then filled via qemu drive-mirror.
 
When doing a live migration from 5.2 to 5.2 they do not change; they only change when doing a live migration from 5.2 to 5.3. See my detailed report above.
Why would 5.3 change this? Shouldn't we strive to keep VM disk names consistent?
 
The change from local-zfs:vm-125-disk-1 to local-zfs:vm-125-disk-0 is due to a change in pve-storage, with which we started counting disks from 0 instead of 1.
The image names would change in a 5.2 -> 5.2 migration as well if you had an old, leftover image on the target node (e.g. when using `qm move_disk` without the delete flag).
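If you want to rule that out, you can check the target node for leftover images of the VM before migrating, for example (storage name and VM ID taken from this thread):

Code:
# on the target node, before the migration
pvesm list local-zfs | grep vm-125
zfs list -t all | grep vm-125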
 
I see. Seems reasonable. I will have to fix my backup plans, but only once after the upgrade, so it's fine. :-)
Now we are just eagerly awaiting the patch to hit the repos. :-)
 
Has this hit the no-subscription repos yet? I would really like to upgrade and extend my cluster, but I cannot without working live migration.
I'm going to upgrade and test it now, and will report back.
 
