About live migrations

Conacious

Hello,

We have some virtual machines in our Proxmox environment, a cluster of about 8 servers. Each server is configured with LVM-thin for the VM disks.
When we live migrate a VM from one server to another we see three phases:

1st phase: the migration is slow, around 4-8 Mbps, while the progress goes from 0 to 100% copying the disk.
2nd phase: it looks like the disk copy starts again, but now much faster, around 2-6 Gbps.
3rd phase: the RAM migration, which is as fast as expected.

We would like to know why the 1st phase is not as fast as the 2nd phase.
We have the QEMU guest agent package installed in our Debian guests.

Thanks for your help.
 
Is this a migration with local disks?

You should have 2 phases:

1)
The disk migration. (If you have multiple disks, it begins with the first disk, then the second disk, while keeping the first disk in sync at the same time.)
When all disks are synced to the target VM, the disks are switched between source VM and target VM (the source VM reads/writes the disks over the network through the target VM).

2) The memory migration occurs (like a classic migration with shared storage).
 
Yes, within the first phase we can see two sub-phases. First there is a transfer without any log output; we monitor it via our switch and see 4 to 8 Mbps. Then comes the second sub-phase, where we can see logs and a progress bar, and it goes very fast, in the Gbps range.

I don't know if this is the normal behaviour.

Thanks.
 
Hi,
could you provide the VM config (and in case you have it, the config from before the migration) and the migration command you used?
Could you also provide the storage configuration and tell us which storages the disks were on before the migration?
If you started the migration via the GUI, which 'Target storage' did you select?
 
Hello, I'm a coworker of Conacious. Our setup is as follows:

- LVM-thin for all VM disks on all nodes (usually one disk in one thin LV per VM), on both source and target storages. The underlying disks are NVMe or fast SSDs on all nodes, in a software RAID 1 setup. On top of the RAID 1 sits the PV used by the VG that hosts the thin pool.
- We are in the process of updating all our cluster nodes from PVE 5.4 to PVE 6, but this behaviour is observed in VM live migrations both from 5.4 to 6 and from 6 to 6.
- We start the VM live migration using this command on the source node:

Code:
  # qm migrate <vmid> <target_node> --online --with-local-disks

When we do that, we see the following in the console:

Code:
# qm migrate 108 i11 --online --with-local-disks
2019-10-03 10:49:11 starting migration of VM 108 to node 'i11' (192.168.100.11)
2019-10-03 10:49:11 found local disk 'thin:vm-108-disk-0' (in current VM config)
2019-10-03 10:49:11 copying disk images
2019-10-03 10:49:11 starting VM 108 on remote node 'i11'
2019-10-03 10:49:14 start remote tunnel
2019-10-03 10:49:15 ssh tunnel ver 1
2019-10-03 10:49:15 starting storage migration
2019-10-03 10:49:15 scsi0: start migration to nbd:192.168.100.11:60001:exportname=drive-scsi0
drive mirror is starting for drive-scsi0

The console output stalls at this point. At the same time, on the target node, using the command "lvs" we see the LV's used Data% grow slowly, while our managed switch monitor shows almost no network traffic.
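
For anyone who wants to watch that allocation on the target node, something like this works (the VG/pool name "thin" is just from our setup, adjust it to yours):

Code:
# on the target node: refresh the thin LV allocation every few seconds
watch -n 5 "lvs -o lv_name,lv_size,data_percent thin"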

After some time (several minutes, depending on the disk size), the thin LV for the migrating VM on the target node reaches almost 100% used Data%, and then the source host console shows the usual disk migration logs with the copy progress:

Code:
2019-10-03 10:49:15 scsi0: start migration to nbd:192.168.100.11:60001:exportname=drive-scsi0
drive mirror is starting for drive-scsi0
drive-scsi0: transferred: 53477376 bytes remaining: 32158777344 bytes total: 32212254720 bytes progression: 0.17 % busy: 1 ready: 0
drive-scsi0: transferred: 359661568 bytes remaining: 31852593152 bytes total: 32212254720 bytes progression: 1.12 % busy: 1 ready: 0
drive-scsi0: transferred: 704643072 bytes remaining: 31507611648 bytes total: 32212254720 bytes progression: 2.19 % busy: 1 ready: 0
[...]
drive-scsi0: transferred: 31529631744 bytes remaining: 683016192 bytes total: 32212647936 bytes progression: 97.88 % busy: 1 ready: 0
drive-scsi0: transferred: 31936479232 bytes remaining: 276168704 bytes total: 32212647936 bytes progression: 99.14 % busy: 1 ready: 0
drive-scsi0: transferred: 32212647936 bytes remaining: 0 bytes total: 32212647936 bytes progression: 100.00 % busy: 1 ready: 0
drive-scsi0: transferred: 32212647936 bytes remaining: 0 bytes total: 32212647936 bytes progression: 100.00 % busy: 1 ready: 0
drive-scsi0: transferred: 32212647936 bytes remaining: 0 bytes total: 32212647936 bytes progression: 100.00 % busy: 1 ready: 0
drive-scsi0: transferred: 32212647936 bytes remaining: 0 bytes total: 32212647936 bytes progression: 100.00 % busy: 1 ready: 0
drive-scsi0: transferred: 32212647936 bytes remaining: 0 bytes total: 32212647936 bytes progression: 100.00 % busy: 1 ready: 0
drive-scsi0: transferred: 32212647936 bytes remaining: 0 bytes total: 32212647936 bytes progression: 100.00 % busy: 1 ready: 0
drive-scsi0: transferred: 32212647936 bytes remaining: 0 bytes total: 32212647936 bytes progression: 100.00 % busy: 1 ready: 0
drive-scsi0: transferred: 32212647936 bytes remaining: 0 bytes total: 32212647936 bytes progression: 100.00 % busy: 1 ready: 0
drive-scsi0: transferred: 32212647936 bytes remaining: 0 bytes total: 32212647936 bytes progression: 100.00 % busy: 1 ready: 0
drive-scsi0: transferred: 32212647936 bytes remaining: 0 bytes total: 32212647936 bytes progression: 100.00 % busy: 1 ready: 0
drive-scsi0: transferred: 32212647936 bytes remaining: 0 bytes total: 32212647936 bytes progression: 100.00 % busy: 1 ready: 0
drive-scsi0: transferred: 32212647936 bytes remaining: 0 bytes total: 32212647936 bytes progression: 100.00 % busy: 1 ready: 0
drive-scsi0: transferred: 32212647936 bytes remaining: 0 bytes total: 32212647936 bytes progression: 100.00 % busy: 1 ready: 0
drive-scsi0: transferred: 32212647936 bytes remaining: 0 bytes total: 32212647936 bytes progression: 100.00 % busy: 1 ready: 0
drive-scsi0: transferred: 32212647936 bytes remaining: 0 bytes total: 32212647936 bytes progression: 100.00 % busy: 1 ready: 0
drive-scsi0: transferred: 32212647936 bytes remaining: 0 bytes total: 32212647936 bytes progression: 100.00 % busy: 1 ready: 0
drive-scsi0: transferred: 32212779008 bytes remaining: 0 bytes total: 32212779008 bytes progression: 100.00 % busy: 0 ready: 1
all mirroring jobs are ready
2019-10-03 10:52:27 starting online/live migration on tcp:192.168.100.11:60000
2019-10-03 10:52:27 migrate_set_speed: 8589934592
2019-10-03 10:52:27 migrate_set_downtime: 0.1
2019-10-03 10:52:27 set migration_caps
2019-10-03 10:52:27 set cachesize: 536870912
2019-10-03 10:52:27 start migrate command to tcp:192.168.100.11:60000
2019-10-03 10:52:28 migration status: active (transferred 408198861, remaining 1423462400), total 4312604672)
2019-10-03 10:52:28 migration xbzrle cachesize: 536870912 transferred 0 pages 0 cachemiss 0 overflow 0
2019-10-03 10:52:29 migration status: active (transferred 892438118, remaining 431419392), total 4312604672)
2019-10-03 10:52:29 migration xbzrle cachesize: 536870912 transferred 0 pages 0 cachemiss 0 overflow 0
2019-10-03 10:52:30 migration speed: 21.01 MB/s - downtime 128 ms
2019-10-03 10:52:30 migration status: completed
drive-scsi0: transferred: 32212779008 bytes remaining: 0 bytes total: 32212779008 bytes progression: 100.00 % busy: 0 ready: 1
all mirroring jobs are ready
drive-scsi0: Completing block job...
drive-scsi0: Completed successfully.
drive-scsi0 : finished
Logical volume "vm-108-disk-0" successfully removed
2019-10-03 10:52:36 migration finished successfully (duration 00:03:25)

This part goes at high speed (several Gb/s) as expected.

We do not understand what is happening in the first stage, when the target thin LV's used Data% grows slowly with almost no network usage. This is the part of the live migration that takes most of the total time. It looks like no data is flowing through the network and only disk space reservation/provisioning is being done on the target node, in an inefficient way.

The VM config file is this:

Code:
agent: 1
balloon: 1536
boot: cdn
bootdisk: scsi0
cores: 2
hotplug: disk,network,usb
memory: 4096
name: jenkins
net0: virtio=00:04:00:00:00:24,bridge=vmbr2
numa: 0
onboot: 1
ostype: l26
scsi0: thin:vm-108-disk-0,cache=writeback,discard=on,format=raw,size=30G
scsihw: virtio-scsi-pci
smbios1: uuid=872d4b6e-321f-4f3f-8efa-0b25486756f3
sockets: 1

Another problem we found is that, although the live migration works, after some minutes of normal operation the migrated VM becomes unstable and panics/freezes, and we have to reboot it. We are using different Debian versions (8, 9, 10) in our guest VMs (see attached image).

(Attachment: Screenshot_20191003_112129.png)

Thanks for your help.
 
The first stage is allocating the full-size thin LV and filling it with zeroes; the second stage then actually sends the contents.
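
If the target LV ends up fully thick because of that, one thing that might help afterwards (untested here, and assuming the disk keeps discard=on as in your posted config and the guest filesystem supports TRIM) is trimming inside the guest:

Code:
# inside the guest, once the migration has finished:
fstrim -av

# on the target node, to check whether Data% dropped again:
lvs -o lv_name,data_percent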
 
Some users reported better migration behaviour on LVM-thin with virtio-blk instead of virtio-scsi - maybe that is worth a try for you as well?
 
Is it possible to change VirtIO SCSI to virtio-blk? Does virtio-blk honor DISCARD? I can't find how to do it in the web interface. I also don't understand why the live disk migration has to fill the target disk with zeroes; it goes against the thin LV approach of not consuming disk resources for zeroed/discarded blocks. Could that step be skipped from the process?

Also, do you have any idea about the second problem with guest freezing after live migration?

Thank you very much.
 
For testing, you could just create a new VM with virtio-blk?
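
Something along these lines should do as a rough sketch (the VM ID, storage name "thin", bridge and disk size are placeholders to adapt):

Code:
# create a throwaway test VM with a virtio-blk (virtio0) disk on the thin storage
qm create 999 --name virtio-blk-test --memory 2048 --cores 2 \
  --net0 virtio,bridge=vmbr2 --ostype l26 \
  --virtio0 thin:30,format=raw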
 
>>The console output stalls at this point. At the same time, on the target node, using the command "lvs" we see the LV's used Data% grow slowly, while our managed switch monitor shows almost no network traffic.

Normally, it shouldn't hang here. I have seen this bug in the past, but with a specific storage, I don't remember exactly which. Maybe it's specific to LVM. (I'm 100% sure it works instantly with Ceph RBD, for example.)

Can you reproduce it with "move disk" too? (It's almost the same mechanism, but done locally between 2 storages.)
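
For example, something like this (assuming a second local storage, here called "thin2", exists on the same node):

Code:
# online "move disk" of the running VM's disk to another local storage
qm move_disk 108 scsi0 thin2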
 
A related issue: LVM-thin storage(s), online migration with local disks.

VM with VirtIO SCSI - 1st phase: thickening the LV, 2nd phase: actually copying the disk data.
Old VM with VirtIO (virtio-blk) - single phase: just copying the disk data (so in total roughly half the time).
(I'm only talking about the disk copy; memory is copied the same way in both cases.)

VirtIO SCSI is listed as the recommended disk interface type; maybe it's better for "normal" use, but for live migration (at least over LVM) plain VirtIO seems better (in my 6.x tests ZFS seemed smart enough to speed up the initial zeroing).
What is the "official" recommendation on this matter? Does plain VirtIO support discard? Why is it better to use virtio-scsi?
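
As a side note on the discard question, a quick way to check from inside a Linux guest whether the virtual disk actually advertises discard/TRIM (just a sketch, not an authoritative answer):

Code:
# non-zero DISC-GRAN / DISC-MAX columns mean the block device advertises discard (TRIM)
lsblk --discard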
 