PVE 4 KVM live migration problem

I have exactly the same issue as alitvak69. HA VMs migrate but sometimes fail to resume on the target system. Manual resume after that works. I also have the same errors in the migration log. I'm willing to perform any test required to get to the bottom of this.

For me this happens on a cluster of 3 identical HP ProLiant DL360 Gen9 servers.
 
Thom, Dietmar, do you have any feedback on this? If this is not fixed any time soon, perhaps I need to migrate back to 3.4, which would be a huge waste of time for me as I have invested considerable time into getting the cluster to this stage. Please let me know if you need me to test a fix; I would be happy to do it.
 
Is anything available from git that I could install today / tomorrow?
I hate to push, but this thing is a showstopper for me.

 

Can you send your vmid.conf and the failing migration task log?

Sounds like a race between the config file move and the VM resume.
 
Here you go

Code:
cat 101.conf 
bootdisk: virtio0
cores: 2
hotplug: disk,network
ide2: none,media=cdrom
localtime: 1
memory: 2048
name: monitor1n1-la.siptalk.com
net0: virtio=AE:5E:BD:28:73:51,bridge=vmbr0,firewall=1,queues=4
net1: virtio=CE:A2:78:60:F3:C5,bridge=vmbr1,firewall=1
numa: 1
ostype: l26
scsihw: virtio-scsi-pci
smbios1: uuid=d7de6a65-fd2b-415c-bada-bb23a6ed136c
sockets: 2
virtio0: ceph:vm-101-disk-1,size=32G

And the log:

Code:
task started by HA resource agent
Oct 11 10:24:05 starting migration of VM 101 to node 'virt2n2-la' (38.102.250.229)
Oct 11 10:24:05 copying disk images
Oct 11 10:24:05 starting VM 101 on remote node 'virt2n2-la'
Oct 11 10:24:07 starting ssh migration tunnel
Oct 11 10:24:08 starting online/live migration on localhost:60000
Oct 11 10:24:08 migrate_set_speed: 8589934592
Oct 11 10:24:08 migrate_set_downtime: 0.1
Oct 11 10:24:10 migration status: active (transferred 233192395, remaining 98045952), total 2156601344)
Oct 11 10:24:10 migration xbzrle cachesize: 134217728 transferred 0 pages 0 cachemiss 0 overflow 0
Oct 11 10:24:12 migration speed: 512.00 MB/s - downtime 11 ms
Oct 11 10:24:12 migration status: completed
Oct 11 10:24:12 ERROR: unable to find configuration file for VM 101 - no such machine
Oct 11 10:24:12 ERROR: command '/usr/bin/ssh -o 'BatchMode=yes' root@38.102.250.229 qm resume 101 --skiplock' failed: exit code 2
Oct 11 10:24:15 ERROR: migration finished with problems (duration 00:00:10)
TASK ERROR: migration problems
 
Can you reproduce it?

If yes, can you try to edit
/usr/share/perl5/PVE/QemuMigrate.pm

and add a sleep at line 584:
Code:
    die "Failed to move config to node '$self->{node}' - rename failed: $!\n"
        if !rename($conffile, $newconffile);

    sleep 1;


    if ($self->{livemigration}) {
        # now that config file is move, we can resume vm on target if livemigrate
        my $cmd = [@{$self->{rem_ssh}}, 'qm', 'resume', $vmid, '--skiplock'];
        eval{ PVE::Tools::run_command($cmd, outfunc => sub {},
                errfunc => sub {
                    my $line = shift;
                    $self->log('err', $line);
                });
        };

Do this on both the source and target node, then restart pvedaemon on each:

/etc/init.d/pvedaemon restart

and then try the migration again.
 
Can you add some log statements?

Code:
    $self->log('info', "moving vm conf file");

    die "Failed to move config to node '$self->{node}' - rename failed: $!\n"
        if !rename($conffile, $newconffile);

    $self->log('info', "sleep 1");
    sleep 1;

    $self->log('info', "resume vm");

    if ($self->{livemigration}) {
        # now that config file is move, we can resume vm on target if livemigrate
        my $cmd = [@{$self->{rem_ssh}}, 'qm', 'resume', $vmid, '--skiplock'];
        eval{ PVE::Tools::run_command($cmd, outfunc => sub {},
                errfunc => sub {
                    my $line = shift;
                    $self->log('err', $line);
                });
        };


(and restart pvedaemon as before)
 
First of all, I have exactly the same problem as alitvak69, with the same logging/error messages. I'm running 2 Proxmox instances (virtual) on top of a single Proxmox box. As CPU type I have no option other than to use the host CPU (otherwise I get a "no hardware acceleration found" error), but since both nodes see the same CPU I guess that is not a problem.

I have added the suggested "sleep" code to QemuMigrate.pm on both nodes and restarted pvedaemon on both, but the problem still exists with the same errors.
 
Added the logging; here is the task log:

Code:
task started by HA resource agent
Oct 13 07:24:03 starting migration of VM 101 to node 'virt2n2-la' (38.102.250.229)
Oct 13 07:24:03 copying disk images
Oct 13 07:24:03 starting VM 101 on remote node 'virt2n2-la'
Oct 13 07:24:05 starting ssh migration tunnel
Oct 13 07:24:05 starting online/live migration on localhost:60000
Oct 13 07:24:05 migrate_set_speed: 8589934592
Oct 13 07:24:05 migrate_set_downtime: 0.1
Oct 13 07:24:07 migration status: active (transferred 237514970, remaining 67153920), total 2156601344)
Oct 13 07:24:07 migration xbzrle cachesize: 134217728 transferred 0 pages 0 cachemiss 0 overflow 0
Oct 13 07:24:09 migration speed: 512.00 MB/s - downtime 11 ms
Oct 13 07:24:09 migration status: completed
Oct 13 07:24:09 moving vm conf file
Oct 13 07:24:09 sleep 1
Oct 13 07:24:10 resume vm
Oct 13 07:24:11 ERROR: unable to find configuration file for VM 101 - no such machine
Oct 13 07:24:11 ERROR: command '/usr/bin/ssh -o 'BatchMode=yes' root@38.102.250.229 qm resume 101 --skiplock' failed: exit code 2
Oct 13 07:24:13 ERROR: migration finished with problems (duration 00:00:10)
TASK ERROR: migration problems
 
OK, so the resume really is the problem.

Can you check on the target node whether the vmid.conf file is correctly located in /etc/pve/nodes/nodeX/qemu-server/?

Can you also try increasing the sleep to a higher value (sleep 10)?

I would like to know whether the file is not moved correctly, or whether it does arrive but takes some time.
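
If it helps, here is a small throwaway Perl helper (not part of PVE; the VM id and node name defaults are only placeholders) that you could run on the target node while a migration is in progress, to measure how long the config file takes to show up in /etc/pve there:

Code:
#!/usr/bin/perl
# Hypothetical helper, NOT part of PVE: run on the *target* node while a
# migration is running, to time how long vmid.conf takes to appear locally.
use strict;
use warnings;
use Time::HiRes qw(time sleep);

my $vmid = shift // 101;             # adjust: id of the VM being migrated
my $node = shift // 'virt2n2-la';    # adjust: name of this (target) node
my $conf = "/etc/pve/nodes/$node/qemu-server/$vmid.conf";

my $start = time();
while (time() - $start < 30) {       # give up after 30 seconds
    if (-e $conf) {
        printf "config visible after %.2f s\n", time() - $start;
        exit 0;
    }
    sleep 0.1;                       # Time::HiRes sleep accepts fractions
}
die "$conf still not visible after 30 s\n";

If it prints a time well above zero, we know the move works but replication takes a while; if it times out, the file never arrives.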
 
The file is / was getting there every time, because I can / could always resume manually on the proper target machine.

I rebooted one of the boxes and kept sleep 1 on both nodes. Now I cannot reproduce the problem, at least not between the two nodes I am experimenting with. Does that mean the pvedaemon restart was not enough? What bothers me is that even though the workaround works now, I don't know why it did not work until I rebooted.

If this works on all 4 nodes, perhaps it can serve as a workaround until the actual fix is applied. Having 10 seconds on resume may not be a good solution, though.

I will test more and update you.
 
I rebooted one of the boxes and kept sleep 1 on both nodes. Now I cannot reproduce the problem, at least not between the two nodes I am experimenting with. Does that mean the pvedaemon restart was not enough?
It's enough, as you can see from the new log messages.

What bothers me is that even though the workaround works now, I don't know why it did not work until I rebooted.

If this works on all 4 nodes, perhaps it can serve as a workaround until the actual fix is applied. Having 10 seconds on resume may not be a good solution, though.
Yes, sure, it should work without any sleep (it works for me).
If a reboot solves the problem, maybe it's a pve-cluster/corosync/multicast problem, or maybe HA. It's very difficult to debug.

You can try removing the sleep and restarting pvedaemon to be sure.
 
Spirit,
I really appreciate your help.

Since it works for you, and per this thread I am not the only one with this issue, can you post your cluster configuration? I am also interested in your firewall settings, which could perhaps affect multicast communication.

Thanks again,

 

I have a simple cluster config, but I'm using a dedicated VLAN for the KVM hosts' IPs (so only multicast traffic, live migration and VM management traffic occur on this network).

Code:
logging {
  debug: off
  to_syslog: yes
}


nodelist {
  node {
    nodeid: 3
    quorum_votes: 1
    ring0_addr: node1
  }


  node {
    nodeid: 2
    quorum_votes: 1
    ring0_addr: node2
  }


  node {
    nodeid: 1
    quorum_votes: 1
    ring0_addr: node3
  }


  node {
    nodeid: 4
    quorum_votes: 1
    ring0_addr: node4
  }


  node {
    nodeid: 5
    quorum_votes: 1
    ring0_addr: node5
  }


}


quorum {
  provider: corosync_votequorum
}


totem {
  cluster_name: mycluster
  config_version: 5
  ip_version: ipv4
  secauth: on
  version: 2
  interface {
    bindnetaddr: 10.0.0.10
    ringnumber: 0
  }


}

For corosync, you can check the logs with:

cat /var/log/daemon.log | grep corosync

It could also be a software bug in pve-cluster (the management of the /etc/pve fuse fs), but that's really hard to debug remotely.
 
No, sleep 1 second is not a solution, or at least not a reliable workaround.

1. I copied all the files to nodes 3 and 4, then restarted pvedaemon, but tasks with the logged sleep did not show up until I rebooted nodes 3 and 4.
2. The error still remains when I migrate from 3 to 4 and back.

Code:
task started by HA resource agent
Oct 13 18:34:24 starting migration of VM 101 to node 'virt2n3-la' (38.102.250.231)
Oct 13 18:34:24 copying disk images
Oct 13 18:34:24 starting VM 101 on remote node 'virt2n3-la'
Oct 13 18:34:26 starting ssh migration tunnel
Oct 13 18:34:27 starting online/live migration on localhost:60000
Oct 13 18:34:27 migrate_set_speed: 8589934592
Oct 13 18:34:27 migrate_set_downtime: 0.1
Oct 13 18:34:29 migration status: active (transferred 233059413, remaining 97058816), total 2156601344)
Oct 13 18:34:29 migration xbzrle cachesize: 134217728 transferred 0 pages 0 cachemiss 0 overflow 0
Oct 13 18:34:31 migration speed: 512.00 MB/s - downtime 57 ms
Oct 13 18:34:31 migration status: completed
Oct 13 18:34:31 moving vm conf file
Oct 13 18:34:31 sleep 1
Oct 13 18:34:32 resume vm
Oct 13 18:34:33 ERROR: unable to find configuration file for VM 101 - no such machine
Oct 13 18:34:33 ERROR: command '/usr/bin/ssh -o 'BatchMode=yes' root@38.102.250.231 qm resume 101 --skiplock' failed: exit code 2
Oct 13 18:34:36 ERROR: migration finished with problems (duration 00:00:12)
TASK ERROR: migration problems

I am getting ready to roll back to 3.4, which is not something I want to do, but so far I am not getting any closer to understanding what causes this.
 
Spirit,

One of the errors comes from a different module.
Can you help debug that?

QemuServer.pm: die "unable to find configuration file for VM $vmid - no such machine\n"
 

The problem is:

Source node: the config file is moved to the target node. (This is a local rename on the source node, and the move is then replicated to the target node through corosync/pmxcfs.)

Target node: resume is called. In QemuServer.pm (sub vm_resume) the config file is read locally on the target node. If the file move has not been replicated yet, you get exactly that error.

That's why I asked you to add a sleep for the test, to be sure the config file has been moved.
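
Just as an idea (an untested, rough sketch of the same block of QemuMigrate.pm quoted earlier, with an arbitrary retry count and delay): instead of a fixed sleep, the resume could be retried until the moved config becomes visible on the target:

Code:
    if ($self->{livemigration}) {
        # sketch only: retry the remote resume a few times instead of a
        # fixed sleep, in case the moved config file is not yet visible
        # in the target node's /etc/pve view
        my $cmd = [@{$self->{rem_ssh}}, 'qm', 'resume', $vmid, '--skiplock'];
        for my $try (1 .. 10) {
            eval {
                PVE::Tools::run_command($cmd, outfunc => sub {},
                    errfunc => sub {
                        my $line = shift;
                        $self->log('err', $line);
                    });
            };
            last if !$@;    # run_command dies on failure, so no error = resumed
            $self->log('info', "resume attempt $try failed, retrying in 1s");
            sleep 1;
        }
    }

That way the migration only waits as long as the replication actually takes, instead of always paying a full sleep.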

Do you have this problem only when HA is enabled, or also without HA?
 
