Hi,
could you provide the VM config (and, if you still have it, the config from before the migration) and the migration command you used?
Could you also provide the storage configuration and tell us which storages the disks were on before the migration?
If you started the migration via the GUI, which 'Target storage' did you select?
Hello, I'm a coworker of Conacious. Our setup is as follows:
- Thin LVs for all VM disks on all nodes (usually one disk, i.e. one thin LV, per VM), on both the source and the target storages. The underlying disks are NVMe or fast SSDs on all nodes, in a software RAID 1 setup. On top of the RAID 1 sits the PV for the VG that hosts the thin LVs (a rough sketch of this stack follows after the migration command below).
- We are in the process of upgrading all our cluster nodes from PVE 5.4 to PVE 6, but this behaviour is observed both in live migrations from 5.4 to 6 and from 6 to 6.
- We start the VM live migration with this command on the source node:
Code:
# qm migrate <vmid> <target_node> --online --with-local-disks
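For context, the storage stack on each node roughly corresponds to the sketch below; the device, VG and pool names here are only placeholders to illustrate the layering, not our exact names:
Code:
# software RAID 1 -> LVM PV -> VG -> thin pool -> one thin LV per VM disk
# (illustrative names only: /dev/md0, VG 'vg_vms', pool 'thinpool')
pvcreate /dev/md0
vgcreate vg_vms /dev/md0
lvcreate --type thin-pool -l 95%FREE -n thinpool vg_vms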
When we run the migration command above, we see the following on the console:
Code:
# qm migrate 108 i11 --online --with-local-disks
2019-10-03 10:49:11 starting migration of VM 108 to node 'i11' (192.168.100.11)
2019-10-03 10:49:11 found local disk 'thin:vm-108-disk-0' (in current VM config)
2019-10-03 10:49:11 copying disk images
2019-10-03 10:49:11 starting VM 108 on remote node 'i11'
2019-10-03 10:49:14 start remote tunnel
2019-10-03 10:49:15 ssh tunnel ver 1
2019-10-03 10:49:15 starting storage migration
2019-10-03 10:49:15 scsi0: start migration to nbd:192.168.100.11:60001:exportname=drive-scsi0
drive mirror is starting for drive-scsi0
The console output freezes at this point. Meanwhile, on the target node, "lvs" shows the used Data% of the new LV growing slowly, while our managed switch monitor shows almost no network usage.
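For reference, this is roughly how we watch that allocation on the target node during this phase (replace <vg> with the VG backing the 'thin' storage; the interval is arbitrary):
Code:
# on the target node, while the source console appears frozen
watch -n 5 "lvs -o lv_name,data_percent,lv_size <vg>"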
After some time (several minutes, depending on the disk size), the thin LV for the migrating VM on the target node reaches almost 100% used Data%, and only then does the source node console show the usual disk migration progress log:
Code:
2019-10-03 10:49:15 scsi0: start migration to nbd:192.168.100.11:60001:exportname=drive-scsi0
drive mirror is starting for drive-scsi0
drive-scsi0: transferred: 53477376 bytes remaining: 32158777344 bytes total: 32212254720 bytes progression: 0.17 % busy: 1 ready: 0
drive-scsi0: transferred: 359661568 bytes remaining: 31852593152 bytes total: 32212254720 bytes progression: 1.12 % busy: 1 ready: 0
drive-scsi0: transferred: 704643072 bytes remaining: 31507611648 bytes total: 32212254720 bytes progression: 2.19 % busy: 1 ready: 0
[...]
drive-scsi0: transferred: 31529631744 bytes remaining: 683016192 bytes total: 32212647936 bytes progression: 97.88 % busy: 1 ready: 0
drive-scsi0: transferred: 31936479232 bytes remaining: 276168704 bytes total: 32212647936 bytes progression: 99.14 % busy: 1 ready: 0
drive-scsi0: transferred: 32212647936 bytes remaining: 0 bytes total: 32212647936 bytes progression: 100.00 % busy: 1 ready: 0
drive-scsi0: transferred: 32212647936 bytes remaining: 0 bytes total: 32212647936 bytes progression: 100.00 % busy: 1 ready: 0
drive-scsi0: transferred: 32212647936 bytes remaining: 0 bytes total: 32212647936 bytes progression: 100.00 % busy: 1 ready: 0
drive-scsi0: transferred: 32212647936 bytes remaining: 0 bytes total: 32212647936 bytes progression: 100.00 % busy: 1 ready: 0
drive-scsi0: transferred: 32212647936 bytes remaining: 0 bytes total: 32212647936 bytes progression: 100.00 % busy: 1 ready: 0
drive-scsi0: transferred: 32212647936 bytes remaining: 0 bytes total: 32212647936 bytes progression: 100.00 % busy: 1 ready: 0
drive-scsi0: transferred: 32212647936 bytes remaining: 0 bytes total: 32212647936 bytes progression: 100.00 % busy: 1 ready: 0
drive-scsi0: transferred: 32212647936 bytes remaining: 0 bytes total: 32212647936 bytes progression: 100.00 % busy: 1 ready: 0
drive-scsi0: transferred: 32212647936 bytes remaining: 0 bytes total: 32212647936 bytes progression: 100.00 % busy: 1 ready: 0
drive-scsi0: transferred: 32212647936 bytes remaining: 0 bytes total: 32212647936 bytes progression: 100.00 % busy: 1 ready: 0
drive-scsi0: transferred: 32212647936 bytes remaining: 0 bytes total: 32212647936 bytes progression: 100.00 % busy: 1 ready: 0
drive-scsi0: transferred: 32212647936 bytes remaining: 0 bytes total: 32212647936 bytes progression: 100.00 % busy: 1 ready: 0
drive-scsi0: transferred: 32212647936 bytes remaining: 0 bytes total: 32212647936 bytes progression: 100.00 % busy: 1 ready: 0
drive-scsi0: transferred: 32212647936 bytes remaining: 0 bytes total: 32212647936 bytes progression: 100.00 % busy: 1 ready: 0
drive-scsi0: transferred: 32212647936 bytes remaining: 0 bytes total: 32212647936 bytes progression: 100.00 % busy: 1 ready: 0
drive-scsi0: transferred: 32212647936 bytes remaining: 0 bytes total: 32212647936 bytes progression: 100.00 % busy: 1 ready: 0
drive-scsi0: transferred: 32212779008 bytes remaining: 0 bytes total: 32212779008 bytes progression: 100.00 % busy: 0 ready: 1
all mirroring jobs are ready
2019-10-03 10:52:27 starting online/live migration on tcp:192.168.100.11:60000
2019-10-03 10:52:27 migrate_set_speed: 8589934592
2019-10-03 10:52:27 migrate_set_downtime: 0.1
2019-10-03 10:52:27 set migration_caps
2019-10-03 10:52:27 set cachesize: 536870912
2019-10-03 10:52:27 start migrate command to tcp:192.168.100.11:60000
2019-10-03 10:52:28 migration status: active (transferred 408198861, remaining 1423462400), total 4312604672)
2019-10-03 10:52:28 migration xbzrle cachesize: 536870912 transferred 0 pages 0 cachemiss 0 overflow 0
2019-10-03 10:52:29 migration status: active (transferred 892438118, remaining 431419392), total 4312604672)
2019-10-03 10:52:29 migration xbzrle cachesize: 536870912 transferred 0 pages 0 cachemiss 0 overflow 0
2019-10-03 10:52:30 migration speed: 21.01 MB/s - downtime 128 ms
2019-10-03 10:52:30 migration status: completed
drive-scsi0: transferred: 32212779008 bytes remaining: 0 bytes total: 32212779008 bytes progression: 100.00 % busy: 0 ready: 1
all mirroring jobs are ready
drive-scsi0: Completing block job...
drive-scsi0: Completed successfully.
drive-scsi0 : finished
Logical volume "vm-108-disk-0" successfully removed
2019-10-03 10:52:36 migration finished successfully (duration 00:03:25)
This second stage runs at high speed (several Gb/s), as expected.
We do not understand what is happening in the first stage, while the target thin LV's used Data% grows slowly and almost no network traffic is visible. This stage takes by far the largest share of the total migration time. It looks as if no data is flowing over the network and the target node is only reserving/provisioning the disk space, in an inefficient way.
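If it helps with debugging, we can also query the QEMU block job on the source node while the storage migration appears stalled; vmid 108 below is just our example VM:
Code:
# on the source node, during the 'silent' phase
qm monitor 108
# then, at the qm> prompt:
info block-jobs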
The VM config file is this:
Code:
agent: 1
balloon: 1536
boot: cdn
bootdisk: scsi0
cores: 2
hotplug: disk,network,usb
memory: 4096
name: jenkins
net0: virtio=00:04:00:00:00:24,bridge=vmbr2
numa: 0
onboot: 1
ostype: l26
scsi0: thin:vm-108-disk-0,cache=writeback,discard=on,format=raw,size=30G
scsihw: virtio-scsi-pci
smbios1: uuid=872d4b6e-321f-4f3f-8efa-0b25486756f3
sockets: 1
Another problem we have found is that, although the live migration works, after a few minutes of normal operation the migrated VM becomes unstable, panics/freezes, and we have to reboot it. We are using different Debian versions (8, 9, 10) in our guest VMs (see attached image).
Thanks for your help.