PVE 4 KVM live migration problem

in QemuServer.pm,

can you edit "sub check_running",

and change the line

die "unable to find configuration file for VM $vmid - no such machine\n"
if !$nocheck && ! -f $filename;

to

die "unable to find configuration file for VM $vmid - no such machine $nocheck $filename\n"
if !$nocheck && ! -f $filename;
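If it helps to find the right spot, a quick grep gives the line number of the die to edit. The stand-in file below only approximates the sub's shape (the real file is /usr/share/perl5/PVE/QemuServer.pm on a PVE node, and the exact signature/body may differ):

```shell
# Sketch: locate the die line inside sub check_running.
# /tmp/QemuServer.pm is a stand-in for /usr/share/perl5/PVE/QemuServer.pm.
pm=/tmp/QemuServer.pm
cat > "$pm" <<'EOF'
sub check_running {
    my ($vmid, $nocheck, $node) = @_;
    # ...
    die "unable to find configuration file for VM $vmid - no such machine\n"
        if !$nocheck && ! -f $filename;
EOF
# -n prints the matching line with its line number
grep -n 'no such machine' "$pm"
```

On the real file the reported number should match the one in the task-log error (line 2243 in this thread).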


Then restart (or maybe stop/start) pvedaemon.
 
Here is the task log

Code:
Oct 13 20:50:19 starting migration of VM 101 to node 'virt2n2-la' (38.102.250.229)
Oct 13 20:50:19 copying disk images
Oct 13 20:50:19 starting VM 101 on remote node 'virt2n2-la'
Oct 13 20:50:21 starting ssh migration tunnel
Oct 13 20:50:21 starting online/live migration on localhost:60000
Oct 13 20:50:21 migrate_set_speed: 8589934592
Oct 13 20:50:21 migrate_set_downtime: 0.1
Oct 13 20:50:23 migration status: active (transferred 237958471, remaining 94547968), total 2156601344)
Oct 13 20:50:23 migration xbzrle cachesize: 134217728 transferred 0 pages 0 cachemiss 0 overflow 0
Oct 13 20:50:25 migration speed: 512.00 MB/s - downtime 43 ms
Oct 13 20:50:25 migration status: completed
Oct 13 20:50:25 moving vm conf file
Oct 13 20:50:25 sleep 1
Oct 13 20:50:27 resume vm
Oct 13 20:50:28 ERROR: Use of uninitialized value $nocheck in concatenation (.) or string at /usr/share/perl5/PVE/QemuServer.pm line 2243.
Oct 13 20:50:28 ERROR: unable to find configuration file for VM 101 - no such machine  /etc/pve/nodes/virt2n2-la/qemu-server/101.conf
Oct 13 20:50:28 ERROR: command '/usr/bin/ssh -o 'BatchMode=yes' root@38.102.250.229 qm resume 101 --skiplock' failed: exit code 2
Oct 13 20:50:31 ERROR: migration finished with problems (duration 00:00:12)
TASK ERROR: migration problems

Answering your previous post: I remember that if no previous HA operation was done on the resource, I couldn't reproduce the problem. I am probably repeating my earlier posts, but I had to stop the machine, remove it from HA, migrate it offline to another node, and start it; after that I could migrate it 10 times between all 4 nodes, back and forth, online, with no problems.
 
Look at the file time stamp on the remote node.

Code:
ls -l --time-style=full-iso  /etc/pve/nodes/virt2n2-la/qemu-server/101.conf
-rw-r----- 1 root www-data 392 2015-10-13 20:50:31.000000000 -0700 /etc/pve/nodes/virt2n2-la/qemu-server/101.conf


Now look at the time stamp on the task log errors

Code:
Oct 13 20:50:27 resume vm
Oct 13 20:50:28 ERROR: Use of uninitialized value $nocheck in concatenation (.) or string at /usr/share/perl5/PVE/QemuServer.pm line 2243.
Oct 13 20:50:28 ERROR: unable to find configuration file for VM 101 - no such machine  /etc/pve/nodes/virt2n2-la/qemu-server/101.conf
Oct 13 20:50:28 ERROR: command '/usr/bin/ssh -o 'BatchMode=yes' root@38.102.250.229 qm resume 101 --skiplock' failed: exit code 2
Oct 13 20:50:31 ERROR: migration finished with problems (duration 00:00:12)
TASK ERROR: migration problems

The file's creation/modification finished 2 seconds later than the actual resume operation happened?
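One way to check this mechanically is to compare the conf file's mtime (in epoch seconds) against the moment the resume was attempted. A sketch with stand-in path and times, since the real values come from pmxcfs and the task log:

```shell
# Sketch: does the conf file's mtime postdate the resume attempt?
# /tmp/101.conf stands in for /etc/pve/nodes/virt2n2-la/qemu-server/101.conf.
resume=$(date +%s)            # stand-in for the logged "resume vm" time
conf=/tmp/101.conf
touch "$conf"                 # stand-in for pmxcfs writing the file
mtime=$(stat -c %Y "$conf")   # modification time in epoch seconds (GNU stat)
echo "mtime - resume = $((mtime - resume))s"
```

A positive difference on the real files would mean the conf file appeared on the target node only after `qm resume` had already looked for it.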
 
>>Answering your previous post: I remember that if no previous HA operation was done on the resource, I couldn't reproduce the problem. I am probably repeating my earlier posts, but I had to stop the machine, remove it from HA, migrate it offline to another node, and start it; after that I could migrate it 10 times between all 4 nodes, back and forth, online, with no problems.

Mmm, interesting... I'll dig into the HA code.



Can you edit the line and only add $filename to the die message:

die "unable to find configuration file for VM $vmid - no such machine $filename\n"
if !$nocheck && ! -f $filename;
 
>>Look at the file time stamp on the remote node.

>>The file's creation/modification finished 2 seconds later than the actual resume operation happened?

Yes, this is strange. The date should be the date at the time of the move.
Like I said, it seems to replicate too late on the target node.

I really don't know how HA is related here; I will do tests to try to reproduce it.
 
Maybe you can try to move vmid.conf manually, while HA is enabled, to see if the date is correct on the target node.

on source node:

date && mv /etc/pve/nodes/sourcenode/qemu-server/vmid.conf /etc/pve/nodes/targetnode/qemu-server/



on target node :

ls -l --time-style=full-iso /etc/pve/nodes/targetnode/qemu-server/vmid.conf
 
Responding to a time check first:

Source Node

Code:
root@virt2n2-la:~# date && mv /etc/pve/nodes/virt2n2-la/qemu-server/101.conf /etc/pve/nodes/virt2n1-la/qemu-server/
Tue Oct 13 21:29:23 PDT 2015

Target node

Code:
root@virt2n1-la:~# ls -l --time-style=full-iso /etc/pve/nodes/virt2n1-la/qemu-server/101.conf 
-rw-r----- 1 root www-data 392 2015-10-13 21:29:23.000000000 -0700 /etc/pve/nodes/virt2n1-la/qemu-server/101.conf
 
Changed the printout as requested:

Code:
task started by HA resource agent
Oct 13 21:39:41 starting migration of VM 101 to node 'virt2n2-la' (38.102.250.229)
Oct 13 21:39:41 copying disk images
Oct 13 21:39:41 starting VM 101 on remote node 'virt2n2-la'
Oct 13 21:39:42 starting ssh migration tunnel
Oct 13 21:39:43 starting online/live migration on localhost:60000
Oct 13 21:39:43 migrate_set_speed: 8589934592
Oct 13 21:39:43 migrate_set_downtime: 0.1
Oct 13 21:39:45 migration status: active (transferred 238479671, remaining 94027776), total 2156601344)
Oct 13 21:39:45 migration xbzrle cachesize: 134217728 transferred 0 pages 0 cachemiss 0 overflow 0
Oct 13 21:39:47 migration speed: 512.00 MB/s - downtime 24 ms
Oct 13 21:39:47 migration status: completed
Oct 13 21:39:47 moving vm conf file
Oct 13 21:39:47 sleep 1
Oct 13 21:39:48 resume vm
Oct 13 21:39:48 ERROR: unable to find configuration file for VM 101 - no such machine /etc/pve/nodes/virt2n2-la/qemu-server/101.conf
Oct 13 21:39:48 ERROR: command '/usr/bin/ssh -o 'BatchMode=yes' root@38.102.250.229 qm resume 101 --skiplock' failed: exit code 2
Oct 13 21:39:51 ERROR: migration finished with problems (duration 00:00:10)
TASK ERROR: migration problems

root@virt2n2-la:~# ls -l --time-style=full-iso /etc/pve/nodes/virt2n2-la/qemu-server/101.conf
-rw-r----- 1 root www-data 392 2015-10-13 21:39:51.000000000 -0700 /etc/pve/nodes/virt2n2-la/qemu-server/101.conf



A 3-second difference.
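For reference, the gap can be computed directly from the two logged timestamps (values copied from the task log and ls output above; GNU date):

```shell
# Sketch: epoch difference between the conf file's mtime (21:39:51)
# and the logged "resume vm" line (21:39:48).
resume=$(date -d '2015-10-13 21:39:48' +%s)
mtime=$(date -d '2015-10-13 21:39:51' +%s)
echo "$((mtime - resume))s"   # prints 3s
```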

However

Code:
root@virt2n2-la:~# date && ssh -t virt2n1-la date
Tue Oct 13 21:43:11 PDT 2015
Tue Oct 13 21:43:11 PDT 2015
Connection to virt2n1-la closed.




 
Hi,

I'm still working on it,

could you look at

cat /var/log/daemon.log | grep pve-ha-lrm

when the problem occurs?




(BTW, the right command to restart the service if you update the code is

systemctl restart pve-ha-lrm.service

This way, you don't need to reboot to get the modifications applied.)
 
On source node I see:

Oct 14 12:11:58 virt2n2-la pve-ha-lrm[158659]: Task still active, waiting
Oct 14 12:11:59 virt2n2-la pve-ha-lrm[158659]: Task still active, waiting
Oct 14 12:12:00 virt2n2-la pve-ha-lrm[158659]: Task still active, waiting
Oct 14 12:12:01 virt2n2-la pve-ha-lrm[158659]: Task still active, waiting
Oct 14 12:12:02 virt2n2-la pve-ha-lrm[158659]: Task still active, waiting
Oct 14 12:12:03 virt2n2-la pve-ha-lrm[158659]: Task still active, waiting
Oct 14 12:12:04 virt2n2-la pve-ha-lrm[158659]: Task still active, waiting
Oct 14 12:12:05 virt2n2-la pve-ha-lrm[158659]: Task still active, waiting
Oct 14 12:12:06 virt2n2-la pve-ha-lrm[158659]: Task still active, waiting
Oct 14 12:12:07 virt2n2-la pve-ha-lrm[158659]: Task still active, waiting
Oct 14 12:12:08 virt2n2-la pve-ha-lrm[158659]: Task still active, waiting
Oct 14 12:12:09 virt2n2-la pve-ha-lrm[158660]: migration problems

No new entries on the target.
 
Hi,

I have made a fix for this bug.
It has been applied today in the proxmox git repository;
packages should be available soon.

Thank you very much Spirit. How soon is soon? Can I build a specific package from git in the meantime?

 
I installed it and ran, but now I have a problem that looks even worse: the HA Migrate task executed, but nothing else happens.

The daemon log on the source node continuously prints

Code:
Oct 15 07:08:59 virt2n1-la pve-ha-lrm[119275]: service 'vm:101' not on this node at /usr/share/perl5/PVE/HA/Env/PVE2.pm line 389.
Oct 15 07:09:09 virt2n1-la pve-ha-lrm[119299]: service 'vm:101' not on this node at /usr/share/perl5/PVE/HA/Env/PVE2.pm line 389.
Oct 15 07:09:19 virt2n1-la pve-ha-lrm[119330]: service 'vm:101' not on this node at /usr/share/perl5/PVE/HA/Env/PVE2.pm line 389.
Oct 15 07:09:30 virt2n1-la pve-ha-lrm[119354]: service 'vm:101' not on this node at /usr/share/perl5/PVE/HA/Env/PVE2.pm line 389.
Oct 15 07:09:39 virt2n1-la pve-ha-lrm[119363]: service 'vm:101' not on this node at /usr/share/perl5/PVE/HA/Env/PVE2.pm line 389.
Oct 15 07:09:49 virt2n1-la pve-ha-lrm[119393]: service 'vm:101' not on this node at /usr/share/perl5/PVE/HA/Env/PVE2.pm line 389.
Oct 15 07:09:59 virt2n1-la pve-ha-lrm[119417]: service 'vm:101' not on this node at /usr/share/perl5/PVE/HA/Env/PVE2.pm line 389.

Here is another clue

Code:
root@virt2n1-la:~/patches# ha-manager status
quorum OK
master virt2n2-la (active, Thu Oct 15 07:13:13 2015)
lrm virt2n1-la (active, Thu Oct 15 07:13:19 2015)
lrm virt2n2-la (active, Thu Oct 15 07:13:16 2015)
lrm virt2n3-la (active, Thu Oct 15 07:13:16 2015)
lrm virt2n4-la (active, Thu Oct 15 07:13:16 2015)
service vm:101 (virt2n1-la, migrate)
 
>>I installed it and ran, but now I have a problem that looks even worse: the HA Migrate task executed, but nothing else happens.

Did you have a previous migration failure? I have seen this sometimes with the resume bug.

Stopping and starting the VM should fix it.
 
I now recall it happened to me once before today's update. Fixing it this way may be acceptable during testing, like now, but it is definitely a problem in production. Do you know what is happening there?
 
>>I now recall it happened to me once before today's update. Fixing it this way may be acceptable during testing, like now, but it is definitely a problem in production. Do you know what is happening there?

The proxmox devs are currently working to fix this bug.

But basically, the migration task is in a failed state because of the resume error, while the VM has actually been migrated correctly to the target node.
HA, however, thinks the VM is still on the source node (because the migration task ended with an error).


Do you still have the resume error with the new .deb file ?
 
>>Do you still have the resume error with the new .deb file?
Yes, it is still happening with the new deb.

 
