PM 6.2 KVM Live migration failed (bug or ?)

mailinglists

Live migration of a VM with two disks failed.
The VM also died on the source side.
Offline migration (it was dead anyway) worked, and the VM recovered afterwards.
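For reference, the rough CLI equivalents of the two migrations (just a sketch; VM ID and target node taken from the log below, and depending on the storage setup the online case may additionally need --with-local-disks):

Code:
# live migration attempt (the one that failed and also killed the VM)
qm migrate 142 p37 --online

# plain offline migration afterwards (the VM was already stopped; this recovered it)
qm migrate 142 p37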

I'm attaching the live migration log.
Should I report a bug, or..?

Code:
Proxmox Virtual Environment 6.2-12
Virtual Machine 142 (XYZ) on node 'p37'
2020-12-09 19:04:06 starting migration of VM 142 to node 'p37' (10.31.1.37)
2020-12-09 19:04:06 found local, replicated disk 'local-zfs:vm-142-disk-0' (in current VM config)
2020-12-09 19:04:06 found local, replicated disk 'local-zfs:vm-142-disk-1' (in current VM config)
2020-12-09 19:04:06 scsi0: start tracking writes using block-dirty-bitmap 'repl_scsi0'
2020-12-09 19:04:06 scsi1: start tracking writes using block-dirty-bitmap 'repl_scsi1'
2020-12-09 19:04:06 replicating disk images
2020-12-09 19:04:06 start replication job
2020-12-09 19:04:06 guest => VM 142, running => 9635
2020-12-09 19:04:06 volumes => local-zfs:vm-142-disk-0,local-zfs:vm-142-disk-1
2020-12-09 19:04:08 create snapshot '__replicate_142-0_1607537046__' on local-zfs:vm-142-disk-0
2020-12-09 19:04:08 create snapshot '__replicate_142-0_1607537046__' on local-zfs:vm-142-disk-1
2020-12-09 19:04:08 using secure transmission, rate limit: none
2020-12-09 19:04:08 incremental sync 'local-zfs:vm-142-disk-0' (__replicate_142-0_1607536986__ => __replicate_142-0_1607537046__)
2020-12-09 19:04:10 rpool/data/vm-142-disk-0@__replicate_142-0_1607536986__    name    rpool/data/vm-142-disk-0@__replicate_142-0_1607536986__    -
2020-12-09 19:04:11 send from @__replicate_142-0_1607536986__ to rpool/data/vm-142-disk-0@__replicate_142-0_1607537046__ estimated size is 28.6M
2020-12-09 19:04:11 total estimated size is 28.6M
2020-12-09 19:04:11 TIME        SENT   SNAPSHOT rpool/data/vm-142-disk-0@__replicate_142-0_1607537046__
2020-12-09 19:04:11 successfully imported 'local-zfs:vm-142-disk-0'
2020-12-09 19:04:11 incremental sync 'local-zfs:vm-142-disk-1' (__replicate_142-0_1607536986__ => __replicate_142-0_1607537046__)
2020-12-09 19:04:13 rpool/data/vm-142-disk-1@__replicate_142-0_1607536986__    name    rpool/data/vm-142-disk-1@__replicate_142-0_1607536986__    -
2020-12-09 19:04:14 send from @__replicate_142-0_1607536986__ to rpool/data/vm-142-disk-1@__replicate_142-0_1607537046__ estimated size is 4.14M
2020-12-09 19:04:14 total estimated size is 4.14M
2020-12-09 19:04:14 TIME        SENT   SNAPSHOT rpool/data/vm-142-disk-1@__replicate_142-0_1607537046__
2020-12-09 19:04:14 successfully imported 'local-zfs:vm-142-disk-1'
2020-12-09 19:04:14 delete previous replication snapshot '__replicate_142-0_1607536986__' on local-zfs:vm-142-disk-0
2020-12-09 19:04:14 delete previous replication snapshot '__replicate_142-0_1607536986__' on local-zfs:vm-142-disk-1
2020-12-09 19:04:15 (remote_finalize_local_job) delete stale replication snapshot '__replicate_142-0_1607536986__' on local-zfs:vm-142-disk-0
2020-12-09 19:04:15 (remote_finalize_local_job) delete stale replication snapshot '__replicate_142-0_1607536986__' on local-zfs:vm-142-disk-1
2020-12-09 19:04:15 end replication job
2020-12-09 19:04:16 copying local disk images
2020-12-09 19:04:16 starting VM 142 on remote node 'p37'
2020-12-09 19:04:19 start remote tunnel
2020-12-09 19:04:20 ssh tunnel ver 1
2020-12-09 19:04:20 starting storage migration
2020-12-09 19:04:20 scsi1: start migration to nbd:unix:/run/qemu-server/142_nbd.migrate:exportname=drive-scsi1
drive mirror re-using dirty bitmap 'repl_scsi1'
drive mirror is starting for drive-scsi1
drive-scsi1: transferred: 0 bytes remaining: 3145728 bytes total: 3145728 bytes progression: 0.00 % busy: 1 ready: 0 
drive-scsi1: transferred: 3211264 bytes remaining: 0 bytes total: 3211264 bytes progression: 100.00 % busy: 0 ready: 1 
all mirroring jobs are ready 
2020-12-09 19:04:21 volume 'local-zfs:vm-142-disk-1' is 'local-zfs:vm-142-disk-1' on the target
2020-12-09 19:04:21 scsi0: start migration to nbd:unix:/run/qemu-server/142_nbd.migrate:exportname=drive-scsi0
drive mirror re-using dirty bitmap 'repl_scsi0'
drive mirror is starting for drive-scsi0
drive-scsi0: transferred: 524288 bytes remaining: 14745600 bytes total: 15269888 bytes progression: 3.43 % busy: 1 ready: 0 
drive-scsi1: transferred: 3211264 bytes remaining: 0 bytes total: 3211264 bytes progression: 100.00 % busy: 0 ready: 1 
drive-scsi0: transferred: 16056320 bytes remaining: 0 bytes total: 16056320 bytes progression: 100.00 % busy: 0 ready: 1 
drive-scsi1: transferred: 3211264 bytes remaining: 0 bytes total: 3211264 bytes progression: 100.00 % busy: 0 ready: 1 
all mirroring jobs are ready 
2020-12-09 19:04:22 volume 'local-zfs:vm-142-disk-0' is 'local-zfs:vm-142-disk-0' on the target
2020-12-09 19:04:22 starting online/live migration on unix:/run/qemu-server/142.migrate
2020-12-09 19:04:22 set migration_caps
2020-12-09 19:04:22 migration speed limit: 8589934592 B/s
2020-12-09 19:04:22 migration downtime limit: 100 ms
2020-12-09 19:04:22 migration cachesize: 2147483648 B
2020-12-09 19:04:22 set migration parameters
2020-12-09 19:04:22 start migrate command to unix:/run/qemu-server/142.migrate
2020-12-09 19:04:23 migration status: active (transferred 123341959, remaining 14559850496), total 14697963520)
2020-12-09 19:04:23 migration xbzrle cachesize: 2147483648 transferred 0 pages 0 cachemiss 0 overflow 0
2020-12-09 19:04:24 migration status: active (transferred 339052571, remaining 14338920448), total 14697963520)
2020-12-09 19:04:24 migration xbzrle cachesize: 2147483648 transferred 0 pages 0 cachemiss 0 overflow 0
2020-12-09 19:04:25 migration status: active (transferred 514497418, remaining 14158819328), total 14697963520)
...
2020-12-09 19:05:52 migration status: active (transferred 15892510128, remaining 19333120), total 14697963520)
2020-12-09 19:05:52 migration xbzrle cachesize: 2147483648 transferred 138148789 pages 130000 cachemiss 385936 overflow 4091
2020-12-09 19:05:53 migration status: active (transferred 15901339942, remaining 29462528), total 14697963520)
2020-12-09 19:05:53 migration xbzrle cachesize: 2147483648 transferred 141392157 pages 138455 cachemiss 387292 overflow 4094
2020-12-09 19:05:53 migration status: active (transferred 15910220375, remaining 22675456), total 14697963520)
2020-12-09 19:05:53 migration xbzrle cachesize: 2147483648 transferred 146480920 pages 146102 cachemiss 388206 overflow 4104
2020-12-09 19:05:53 migration status: active (transferred 15917693927, remaining 4939776), total 14697963520)
2020-12-09 19:05:53 migration xbzrle cachesize: 2147483648 transferred 151067342 pages 154090 cachemiss 388898 overflow 4113
2020-12-09 19:05:53 migration status: active (transferred 15920411695, remaining 5074944), total 14697963520)
2020-12-09 19:05:53 migration xbzrle cachesize: 2147483648 transferred 151815218 pages 157656 cachemiss 389372 overflow 4115
query migrate failed: VM 142 not running

2020-12-09 19:05:53 query migrate failed: VM 142 not running
query migrate failed: VM 142 not running

2020-12-09 19:05:54 query migrate failed: VM 142 not running
query migrate failed: VM 142 not running

2020-12-09 19:05:55 query migrate failed: VM 142 not running
query migrate failed: VM 142 not running

2020-12-09 19:05:56 query migrate failed: VM 142 not running
query migrate failed: VM 142 not running

2020-12-09 19:05:57 query migrate failed: VM 142 not running
query migrate failed: VM 142 not running

2020-12-09 19:05:59 query migrate failed: VM 142 not running
2020-12-09 19:05:59 ERROR: online migrate failure - too many query migrate failures - aborting
2020-12-09 19:05:59 aborting phase 2 - cleanup resources
2020-12-09 19:05:59 migrate_cancel
2020-12-09 19:05:59 migrate_cancel error: VM 142 not running
drive-scsi0: Cancelling block job
drive-scsi1: Cancelling block job
2020-12-09 19:05:59 ERROR: VM 142 not running
2020-12-09 19:05:59 scsi1: removing block-dirty-bitmap 'repl_scsi1'
2020-12-09 19:05:59 ERROR: VM 142 not running
2020-12-09 19:06:01 ERROR: migration finished with problems (duration 00:01:55)
TASK ERROR: migration problems
 
Small update: I live migrated 5 more VMs, and all worked.
The only one that died was the WHM/cPanel VM with two disks.
Maybe it is related to the number of disks...
 
A couple of weeks/months ago I had a similar issue: errors during live migration.
I never found the cause, but some VMs were more susceptible to live migration failures, even though their usage (including I/O) was very light.
I tried running "fstrim -av" in the guests just before the live migration, and I've had no errors since. That said, the guest kernel and the kvm/qemu packages were also updated twice in the meantime, so I can't say the fstrim command made any difference, but I'm still doing it; some housekeeping doesn't hurt.
You could try it too (if supported by the host storage and the guests). It should do no harm, only some I/O (which can indeed become heavy in some scenarios, so use with caution); see the sketch below.
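A minimal sketch of that pre-migration trim, assuming Linux guests and storage that supports discard/TRIM; the second variant additionally assumes the QEMU guest agent is installed and enabled for the VM:

Code:
# inside the guest: trim all mounted filesystems that support it, verbosely
fstrim -av

# or from the PVE host, via the guest agent (adjust the VM ID)
qm guest exec 142 -- fstrim -av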
 
FYI, I live migrated another 50 KVM VMs without issue, including WHM/cPanel/CloudLinux ones.
Live migration only failed for that one VM with two disks.
Some time in the future, after I upgrade both nodes to the most recent Proxmox VE version, I will test again.
If it fails again, then it is reproducible, and I will open a bug report.
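If it comes to a report, the usual details to attach can be gathered with standard commands, roughly like this (a sketch; run on both the source and the target node, VM ID from this thread, time window adjusted to the failed migration):

Code:
# installed package versions (pve-manager, pve-qemu-kvm, zfs, kernel, ...)
pveversion -v

# configuration of the affected VM
qm config 142

# host logs around the time of the failure
journalctl --since "2020-12-09 19:00" --until "2020-12-09 19:10"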
 
