What could be the reason a VM migration failed?

Helmo

(I know this is too broad a question ... I'll try to narrow it down)

I often migrate VMs, but sometimes a migration fails.

E.g. I migrated a VM from node 1 to node 2, and it failed after the memory had been transferred (log below).

A second attempt between the same hosts worked OK.

node 1: proxmox-ve: 6.2-1 (running kernel: 5.3.18-3-pve)
node 2: proxmox-ve: 6.2-1 (running kernel: 5.4.44-1-pve)

Both nodes use local storage on LVM. I found no relevant messages in syslog.

The worst thing is that after such a failure the VM is not running any more. The disks are cleaned up on the target node, and I can just start it again on node 1, but it would be nice if the VM were started again (or restored) on the source node automatically.

Code:
...
2020-07-13 15:26:10 migration xbzrle cachesize: 536870912 transferred 0 pages 0 cachemiss 113245 overflow 0
2020-07-13 15:26:10 migration status: active (transferred 4220021110, remaining 7905280), total 4312604672)
2020-07-13 15:26:10 migration xbzrle cachesize: 536870912 transferred 0 pages 0 cachemiss 116150 overflow 0
query migrate failed: VM 115 qmp command 'query-migrate' failed - client closed connection

2020-07-13 15:26:10 query migrate failed: VM 115 qmp command 'query-migrate' failed - client closed connection
query migrate failed: VM 115 not running

2020-07-13 15:26:11 query migrate failed: VM 115 not running
query migrate failed: VM 115 not running

2020-07-13 15:26:13 query migrate failed: VM 115 not running
query migrate failed: VM 115 not running

2020-07-13 15:26:14 query migrate failed: VM 115 not running
query migrate failed: VM 115 not running

2020-07-13 15:26:15 query migrate failed: VM 115 not running
query migrate failed: VM 115 not running

2020-07-13 15:26:16 query migrate failed: VM 115 not running
2020-07-13 15:26:16 ERROR: online migrate failure - too many query migrate failures - aborting
2020-07-13 15:26:16 aborting phase 2 - cleanup resources
2020-07-13 15:26:16 migrate_cancel
2020-07-13 15:26:16 migrate_cancel error: VM 115 not running
drive-scsi0: Cancelling block job
2020-07-13 15:26:16 ERROR: VM 115 not running
2020-07-13 15:26:19 ERROR: migration finished with problems (duration 00:04:24)
TASK ERROR: migration problems
 
Is there anything in the journal around the failed migration? It looks like the VM is crashing. If this happens often / you can reproduce it, you can also try starting the VM in the foreground ('qm showcmd XXX', then remove '-daemonize' and run the resulting command line) and then attempt the migration; maybe it prints an error message.
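
A minimal sketch of that foreground procedure, assuming VMID 115 from the log above (the temp file path and the sed edit are just one way to do it):

Code:
# dump the full KVM command line Proxmox would use for this VM
qm showcmd 115 > /tmp/vm115-cmd.sh
# drop '-daemonize' so QEMU stays attached to the terminal
sed -i 's/ -daemonize//' /tmp/vm115-cmd.sh
# start the VM in the foreground; crash output then lands on this terminal
bash /tmp/vm115-cmd.sh

Then trigger the migration from the GUI or another shell while watching this terminal for QEMU errors.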
 
No, journalctl just shows the same lines as I reviewed in syslog.

Do you mean to start it in the foreground on node 1? Wouldn't the migration process start its own instance on node 2 (and that's where it fails)?
It's probably less than 1 in 20 migrations that fail, so I'll have to plan some time for dummy migrations.
 
No, the failure is on the source node; possibly the actual cause is on the other end, though. Could you post the full migration log and the syslog/journalctl output from the same time?
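
For grabbing the journal around the failure window, something like this should do (timestamps taken from the failed migration above; widen the range as needed):

Code:
# collect journal entries around the 15:26 failure on 2020-07-13
journalctl --since "2020-07-13 15:20:00" --until "2020-07-13 15:30:00" > migration-fail-journal.log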
 
Okay, nothing out of the ordinary visible there. Please try the foreground variant if it is reproducible!
 
When I tried starting a test VM in the foreground I got 'Could not open '/dev/baressd/vm-209-disk-0': No such file or directory'. The LVM volume was not active, so I manually activated it with 'lvchange --activate y baressd/vm-209-disk-0'.

Unfortunately this test VM then migrated flawlessly. ;) I'll try again later.
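
For reference, a quick way to check which LVs in that volume group are active before starting the VM (the 'baressd' VG name is taken from the error above; adjust to your storage):

Code:
# list LVs with their attribute string; the fifth character is 'a' when active
lvs -o lv_name,lv_attr baressd
# activate a specific volume so its /dev node appears
lvchange --activate y baressd/vm-209-disk-0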
 
Had the same issue today migrating from 6.2 to 6.3. The instance was stopped after the failure and I had to start it manually on the source node, but a second attempt succeeded. On the destination node I got this in the log after the failed transfer:

Code:
Dec  3 07:12:11 pve2 pvesm[3374]: zfs error: cannot destroy 'storage/vm-125-disk-0': dataset is busy
Dec  3 07:12:11 pve2 pvesm[63790]: <root@pam> end task UPID:pve2:00000D2E:00589D9E:5FC881B5:imgdel:125@storage:root@pam: zfs error: cannot destroy 'storage/vm-125-disk-0': dataset is busy

But this was after the failure occurred; I could later delete the volume manually without any error.
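
If the dataset stays busy, a rough way to see what is still holding the zvol open before retrying the cleanup (device path assumed from the pool/volume names in the log above):

Code:
# show processes that still have the zvol block device open
fuser -v /dev/zvol/storage/vm-125-disk-0
# once nothing holds it, remove the leftover volume
zfs destroy storage/vm-125-disk-0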
 
