Offline VM migration causing I/O delays

Hi all,

We are noticing that offline migrations are a lot more resource intensive than live migrations.
Because of this, when we perform an offline migration, the running VMs on the destination node experience I/O delays. We are using lvm-thin storage; the disks are 6 Intel D3-S4620 SSDs in RAID10.
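
For reference, this is roughly how we watch the destination node while a migration runs (a minimal sketch; iostat comes from the sysstat package, the 2-second interval is just an example, and /proc/pressure/io needs a kernel with PSI enabled):

Code:
apt install sysstat
# extended per-device stats in MB, skipping the since-boot summary, refreshed every 2 seconds
iostat -x -m -y 2
# pressure-stall view of how long tasks are blocked waiting on I/O
cat /proc/pressure/io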

I have the following log of a VM migration:

Code:
2024-04-29 12:27:24 use dedicated network address for sending migration traffic (10.40.4.28)
2024-04-29 12:27:24 starting migration of VM 251 to node 'vmh009' (10.40.4.28)
2024-04-29 12:27:25 found local disk 'local-lvm:vm-251-disk-0' (attached)
2024-04-29 12:27:25 found local disk 'local-lvm:vm-251-disk-1' (attached)
2024-04-29 12:27:25 found local disk 'local-lvm:vm-251-disk-2' (attached)
2024-04-29 12:27:25 copying local disk images
2024-04-29 12:27:30 2148466688 bytes (2.1 GB, 2.0 GiB) copied, 3 s, 716 MB/s
2024-04-29 12:27:33 4294901760 bytes (4.3 GB, 4.0 GiB) copied, 6 s, 716 MB/s
2024-04-29 12:27:36 6343294976 bytes (6.3 GB, 5.9 GiB) copied, 9 s, 705 MB/s
2024-04-29 12:27:57 9127788544 bytes (9.1 GB, 8.5 GiB) copied, 30 s, 304 MB/s
2024-04-29 12:28:00 11252203520 bytes (11 GB, 10 GiB) copied, 33 s, 341 MB/s
2024-04-29 12:28:03 13327859712 bytes (13 GB, 12 GiB) copied, 36 s, 370 MB/s
2024-04-29 12:28:06 15372582912 bytes (15 GB, 14 GiB) copied, 39 s, 394 MB/s
2024-04-29 12:28:09 17514364928 bytes (18 GB, 16 GiB) copied, 42 s, 417 MB/s
2024-04-29 12:28:43 19582615552 bytes (20 GB, 18 GiB) copied, 45 s, 435 MB/s
2024-04-29 12:28:47 22812753920 bytes (23 GB, 21 GiB) copied, 80 s, 285 MB/s
2024-04-29 12:28:50 360448+0 records in
2024-04-29 12:28:50 360448+0 records out
2024-04-29 12:28:50 23622320128 bytes (24 GB, 22 GiB) copied, 83.7643 s, 282 MB/s
2024-04-29 12:28:50 [vmh009]   Logical volume "vm-251-disk-0" created.
2024-04-29 12:29:03 [vmh009] 356391+8903 records in
2024-04-29 12:29:03 [vmh009] 356391+8903 records out
2024-04-29 12:29:03 [vmh009] 23622320128 bytes (24 GB, 22 GiB) copied, 95.8912 s, 246 MB/s
2024-04-29 12:29:03 [vmh009] successfully imported 'local-lvm:vm-251-disk-0'
2024-04-29 12:29:03 volume 'local-lvm:vm-251-disk-0' is 'local-lvm:vm-251-disk-0' on the target
2024-04-29 12:29:08 1132920832 bytes (1.1 GB, 1.1 GiB) copied, 3 s, 377 MB/s
2024-04-29 12:29:11 2103902208 bytes (2.1 GB, 2.0 GiB) copied, 6 s, 351 MB/s
2024-04-29 12:29:14 3392077824 bytes (3.4 GB, 3.2 GiB) copied, 9 s, 377 MB/s
[...]
2024-04-29 12:29:47 14458617856 bytes (14 GB, 13 GiB) copied, 42 s, 344 MB/s
2024-04-29 12:29:50 15572336640 bytes (16 GB, 15 GiB) copied, 45 s, 346 MB/s
2024-04-29 12:30:25 18103730176 bytes (18 GB, 17 GiB) copied, 80 s, 226 MB/s
2024-04-29 12:30:45 21986476032 bytes (22 GB, 20 GiB) copied, 100 s, 220 MB/s
2024-04-29 12:30:55 26314473472 bytes (26 GB, 25 GiB) copied, 110 s, 239 MB/s
2024-04-29 12:31:07 491520+0 records in
2024-04-29 12:31:07 491520+0 records out
2024-04-29 12:31:07 32212254720 bytes (32 GB, 30 GiB) copied, 122.622 s, 263 MB/s
2024-04-29 12:31:07 [vmh009]   Logical volume "vm-251-disk-1" created.
2024-04-29 12:31:32 [vmh009] 452405+82849 records in
2024-04-29 12:31:32 [vmh009] 452405+82849 records out
2024-04-29 12:31:32 [vmh009] 32212254720 bytes (32 GB, 30 GiB) copied, 147.362 s, 219 MB/s
2024-04-29 12:31:32 [vmh009] successfully imported 'local-lvm:vm-251-disk-1'
2024-04-29 12:31:32 volume 'local-lvm:vm-251-disk-1' is 'local-lvm:vm-251-disk-1' on the target
2024-04-29 12:31:38 2012151808 bytes (2.0 GB, 1.9 GiB) copied, 3 s, 670 MB/s
2024-04-29 12:31:41 3997630464 bytes (4.0 GB, 3.7 GiB) copied, 6 s, 666 MB/s
2024-04-29 12:31:44 6174801920 bytes (6.2 GB, 5.8 GiB) copied, 9 s, 686 MB/s
2024-04-29 12:32:01 7344619520 bytes (7.3 GB, 6.8 GiB) copied, 27 s, 267 MB/s
2024-04-29 12:32:01 7344685056 bytes (7.3 GB, 6.8 GiB) copied, 27 s, 267 MB/s
2024-04-29 12:32:01 7344750592 bytes (7.3 GB, 6.8 GiB) copied, 27 s, 267 MB/s
2024-04-29 12:32:01 7344816128 bytes (7.3 GB, 6.8 GiB) copied, 27 s, 267 MB/s
2024-04-29 12:32:01 7344881664 bytes (7.3 GB, 6.8 GiB) copied, 27 s, 267 MB/s
2024-04-29 12:32:01 7344947200 bytes (7.3 GB, 6.8 GiB) copied, 27 s, 267 MB/s
[...]
2024-04-29 12:40:05 138736500736 bytes (139 GB, 129 GiB) copied, 510 s, 272 MB/s
2024-04-29 12:40:15 145722769408 bytes (146 GB, 136 GiB) copied, 520 s, 280 MB/s
2024-04-29 12:41:15 155170308096 bytes (155 GB, 145 GiB) copied, 580 s, 268 MB/s
2024-04-29 12:41:35 159309561856 bytes (159 GB, 148 GiB) copied, 600 s, 266 MB/s
2024-04-29 12:41:54 2457600+0 records in
2024-04-29 12:41:54 2457600+0 records out
2024-04-29 12:41:54 161061273600 bytes (161 GB, 150 GiB) copied, 620.585 s, 260 MB/s
2024-04-29 12:41:54 [vmh009]   Logical volume "vm-251-disk-2" created.
2024-04-29 12:42:00 [vmh009] 2424463+72024 records in
2024-04-29 12:42:00 [vmh009] 2424463+72024 records out
2024-04-29 12:42:00 [vmh009] 161061273600 bytes (161 GB, 150 GiB) copied, 625.816 s, 257 MB/s
2024-04-29 12:42:00 [vmh009] successfully imported 'local-lvm:vm-251-disk-2'
2024-04-29 12:42:00 volume 'local-lvm:vm-251-disk-2' is 'local-lvm:vm-251-disk-2' on the target
  Logical volume "vm-251-disk-0" successfully removed.
  Logical volume "vm-251-disk-1" successfully removed.
  Logical volume "vm-251-disk-2" successfully removed.
2024-04-29 12:42:02 migration finished successfully (duration 00:14:38)
TASK OK

Screenshot 2024-04-29 at 13.01.21.png

In this screenshot, you can see the disk writes and I/O operations performed during the migrations.
Note that there's also another migration running just before 12:27:24, as well as around 12:51. These are live migrations. The migration between 12:27:24 and 12:42:02 is an offline migration.

As you can see, the offline migration generates far more IOPS while not actually writing that much more data per second. In fact, the network interface utilization is nowhere near being capped:
Screenshot 2024-04-29 at 13.11.01.png
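
As a rough sanity check on the numbers (assuming the copy really is done in 64 KiB blocks, which is what the record counts in the log above suggest):

Code:
23622320128 bytes / 360448 records = 65536 bytes, i.e. 64 KiB per block
at the ~250-280 MB/s seen in the log, that is roughly 4,000 write ops/s from the copy stream alone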

However, because of these IOPS spikes, the other running VMs are seeing a lot of IO-wait:

Screenshot 2024-04-29 at 13.05.07.png

For completeness' sake, here is the log of the VM migration between 12:51:10 and 12:55:41, which is visible in the screenshots above and which did not generate any I/O issues on the VMs:

Code:
2024-04-29 12:51:10 use dedicated network address for sending migration traffic (10.40.4.28)
2024-04-29 12:51:10 starting migration of VM 140 to node 'vmh009' (10.40.4.28)
2024-04-29 12:51:10 found local disk 'local-lvm:vm-140-disk-0' (attached)
2024-04-29 12:51:10 found local disk 'local-lvm:vm-140-disk-1' (attached)
2024-04-29 12:51:10 found local disk 'local-lvm:vm-140-disk-2' (attached)
2024-04-29 12:51:10 starting VM 140 on remote node 'vmh009'
2024-04-29 12:51:13 volume 'local-lvm:vm-140-disk-0' is 'local-lvm:vm-140-disk-0' on the target
2024-04-29 12:51:13 volume 'local-lvm:vm-140-disk-1' is 'local-lvm:vm-140-disk-1' on the target
2024-04-29 12:51:13 volume 'local-lvm:vm-140-disk-2' is 'local-lvm:vm-140-disk-2' on the target
2024-04-29 12:51:13 start remote tunnel
2024-04-29 12:51:14 ssh tunnel ver 1
2024-04-29 12:51:14 starting storage migration
2024-04-29 12:51:14 scsi0: start migration to nbd:10.40.4.28:60001:exportname=drive-scsi0
drive mirror is starting for drive-scsi0
drive-scsi0: transferred 0.0 B of 50.0 GiB (0.00%) in 0s
drive-scsi0: transferred 1.0 GiB of 50.0 GiB (2.07%) in 1s
drive-scsi0: transferred 2.1 GiB of 50.0 GiB (4.13%) in 2s
drive-scsi0: transferred 3.1 GiB of 50.0 GiB (6.20%) in 3s
[...]
drive-scsi0: transferred 46.6 GiB of 50.0 GiB (93.13%) in 46s
drive-scsi0: transferred 47.6 GiB of 50.0 GiB (95.13%) in 47s
drive-scsi0: transferred 48.6 GiB of 50.0 GiB (97.14%) in 48s
drive-scsi0: transferred 49.6 GiB of 50.0 GiB (99.15%) in 49s
drive-scsi0: transferred 50.0 GiB of 50.0 GiB (100.00%) in 50s, ready
all 'mirror' jobs are ready
2024-04-29 12:52:04 scsi2: start migration to nbd:10.40.4.28:60001:exportname=drive-scsi2
drive mirror is starting for drive-scsi2
drive-scsi2: transferred 0.0 B of 100.0 GiB (0.00%) in 0s
drive-scsi2: transferred 1.0 GiB of 100.0 GiB (1.03%) in 1s
drive-scsi2: transferred 2.1 GiB of 100.0 GiB (2.09%) in 2s
drive-scsi2: transferred 3.1 GiB of 100.0 GiB (3.10%) in 3s
drive-scsi2: transferred 4.1 GiB of 100.0 GiB (4.13%) in 4s
drive-scsi2: transferred 5.2 GiB of 100.0 GiB (5.16%) in 5s
drive-scsi2: transferred 6.2 GiB of 100.0 GiB (6.18%) in 6s
[...]
drive-scsi2: transferred 96.7 GiB of 100.0 GiB (96.74%) in 1m 38s
drive-scsi2: transferred 97.7 GiB of 100.0 GiB (97.73%) in 1m 39s
drive-scsi2: transferred 98.7 GiB of 100.0 GiB (98.72%) in 1m 40s
drive-scsi2: transferred 99.7 GiB of 100.0 GiB (99.70%) in 1m 41s
drive-scsi2: transferred 100.0 GiB of 100.0 GiB (100.00%) in 1m 42s, ready
all 'mirror' jobs are ready
2024-04-29 12:53:46 scsi1: start migration to nbd:10.40.4.28:60001:exportname=drive-scsi1
drive mirror is starting for drive-scsi1
drive-scsi1: transferred 0.0 B of 100.0 GiB (0.00%) in 0s
drive-scsi1: transferred 1020.0 MiB of 100.0 GiB (1.00%) in 1s
drive-scsi1: transferred 2.0 GiB of 100.0 GiB (1.99%) in 2s
[...]
drive-scsi1: transferred 96.2 GiB of 100.4 GiB (95.79%) in 1m 34s
drive-scsi1: transferred 97.2 GiB of 100.4 GiB (96.76%) in 1m 35s
drive-scsi1: transferred 98.2 GiB of 100.4 GiB (97.75%) in 1m 36s
drive-scsi1: transferred 99.2 GiB of 100.4 GiB (98.73%) in 1m 37s
drive-scsi1: transferred 100.1 GiB of 100.4 GiB (99.72%) in 1m 38s
drive-scsi1: transferred 100.4 GiB of 100.4 GiB (100.00%) in 1m 39s, ready
all 'mirror' jobs are ready
2024-04-29 12:55:25 starting online/live migration on tcp:10.40.4.28:60000
2024-04-29 12:55:25 set migration capabilities
2024-04-29 12:55:25 migration downtime limit: 100 ms
2024-04-29 12:55:25 migration cachesize: 1.0 GiB
2024-04-29 12:55:25 set migration parameters
2024-04-29 12:55:25 start migrate command to tcp:10.40.4.28:60000
2024-04-29 12:55:26 migration active, transferred 1.1 GiB of 8.0 GiB VM-state, 1002.4 MiB/s
2024-04-29 12:55:27 migration active, transferred 2.2 GiB of 8.0 GiB VM-state, 1.2 GiB/s
2024-04-29 12:55:28 migration active, transferred 3.4 GiB of 8.0 GiB VM-state, 1.2 GiB/s
2024-04-29 12:55:29 migration active, transferred 4.5 GiB of 8.0 GiB VM-state, 1.2 GiB/s
2024-04-29 12:55:30 migration active, transferred 5.6 GiB of 8.0 GiB VM-state, 1.2 GiB/s
2024-04-29 12:55:31 migration active, transferred 6.8 GiB of 8.0 GiB VM-state, 1.1 GiB/s
2024-04-29 12:55:32 migration active, transferred 8.0 GiB of 8.0 GiB VM-state, 691.2 MiB/s
2024-04-29 12:55:32 xbzrle: send updates to 1898 pages in 917.2 KiB encoded memory, overflow 21
2024-04-29 12:55:33 auto-increased downtime to continue migration: 200 ms
2024-04-29 12:55:33 average migration speed: 1.0 GiB/s - downtime 192 ms
2024-04-29 12:55:33 migration status: completed
all 'mirror' jobs are ready
drive-scsi0: Completing block job_id...
drive-scsi0: Completed successfully.
drive-scsi1: Completing block job_id...
drive-scsi1: Completed successfully.
drive-scsi2: Completing block job_id...
drive-scsi2: Completed successfully.
drive-scsi0: mirror-job finished
drive-scsi1: mirror-job finished
drive-scsi2: mirror-job finished
2024-04-29 12:55:34 stopping NBD storage migration server on target.
2024-04-29 12:55:35 issuing guest fstrim
  Logical volume "vm-140-disk-0" successfully removed.
  Logical volume "vm-140-disk-1" successfully removed.
  Logical volume "vm-140-disk-2" successfully removed.
2024-04-29 12:55:41 migration finished successfully (duration 00:04:32)
TASK OK

I am aware of the options to throttle migrations, but I am not sure they apply here. With live migrations we can cap the network interface without problems. However, the offline migrations generate so much more I/O that it becomes a problem even though the network interface is nowhere near saturated.

Is there anything we can do here to improve the offline VM migrations?
Thanks!
 
Migration, like cloning, just generates a lot of writes. Faster drives might help, or a slower network. I don't know if you can apply a bandwidth limit on offline migrations.
Alternatively, use shared storage (so no data needs to be moved) or a storage that supports replication (so only the last few changes need to be copied). Or maybe move the VM via a backup & restore through PBS (which can be throttled)?
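
For example (the PBS storage name "pbs" and the limits are placeholders; bwlimit values are in KiB/s, so 204800 is roughly 200 MiB/s):

Code:
# back up the VM with an I/O limit
vzdump 251 --storage pbs --bwlimit 204800
# restores can be limited per target storage as well
pvesm set local-lvm --bwlimit restore=204800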
 
Hi,
I don't know if you can apply a bandwidth limit on offline migrations.
Yes, you can set it in the target storage's definition with pvesm set <storage ID> --bwlimit <options>:
Code:
--bwlimit [clone=<LIMIT>] [,default=<LIMIT>] [,migration=<LIMIT>] [,move=<LIMIT>] [,restore=<LIMIT>]
           Set I/O bandwidth limit for various operations (in KiB/s).
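
For example, to limit migration traffic onto the local-lvm storage to roughly 200 MiB/s (the value is just an illustration; the limit is given in KiB/s):

Code:
pvesm set local-lvm --bwlimit migration=204800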
 
