64-bit VM freeze after live migration

resoli
Mar 9, 2010
Hi,

I'm running a PVE cluster of two bare-metal installs[1]. I'm testing online migration of Windows and Linux VMs, and found that a Karmic 64-bit machine often freezes after a live migration.

The migration itself (the machines are LVM-based, on a SAN-provided disk presented to both cluster nodes) succeeds, but after some time the machine freezes.
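For context, the migrations are triggered from the command line along these lines; a minimal sketch, with 101 as a placeholder vmid and the target node name illustrative:

# live-migrate VM 101 (placeholder vmid) from the current node to the target node
qm migrate 101 <targetnode> -online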

Windows (Server 2003, 32-bit) machines migrate without problems, and the same seems to hold for Karmic 32-bit.

In particular, the freeze happens only when migrating back from node to master, not from master to node.

Furthermore, I found that there is a significant clock drift, which appears only after a migration from node to master. This shows up when running ping[2] during the migration: the "Warning: time of day goes back (-<microseconds>us), taking countermeasures." message appears just after the migration, and only when migrating from node to master.

Both cluster nodes are running ntp, configured against our reference NTP server on the internal LAN.
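For completeness, this is the kind of check run on each node to verify that NTP is actually synchronised (plain ntpq, not Proxmox-specific):

# on each cluster node: list NTP peers; the line starting with * is the selected server
ntpq -p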

Any hint?

bye,
rob

[1] # pveversion -v
pve-manager: 1.5-8 (pve-manager/1.5/4674)
running kernel: 2.6.18-2-pve
proxmox-ve-2.6.18: 1.5-5
pve-kernel-2.6.18-2-pve: 2.6.18-5
qemu-server: 1.1-11
pve-firmware: 1.0-3
libpve-storage-perl: 1.0-10
vncterm: 0.9-2
vzctl: 3.0.23-1pve8
vzdump: 1.2-5
vzprocps: 2.0.11-1dso2
vzquota: 3.0.11-1
pve-qemu-kvm-2.6.18: 0.9.1-5

[2] $ ping 192.168.10.48
PING 192.168.10.48 (192.168.10.48) 56(84) bytes of data.
64 bytes from 192.168.10.48: icmp_seq=1 ttl=64 time=359 ms
64 bytes from 192.168.10.48: icmp_seq=2 ttl=64 time=0.311 ms
Warning: time of day goes back (-2158us), taking countermeasures.
Warning: time of day goes back (-2046us), taking countermeasures.
64 bytes from 192.168.10.48: icmp_seq=3 ttl=64 time=0.000 ms
64 bytes from 192.168.10.48: icmp_seq=4 ttl=64 time=0.854 ms
64 bytes from 192.168.10.48: icmp_seq=5 ttl=64 time=2.66 ms
Warning: time of day goes back (-2144us), taking countermeasures.
64 bytes from 192.168.10.48: icmp_seq=6 ttl=64 time=0.000 ms
 
In the end, I found that the problem is not related to the architecture, but to the number of CPU sockets configured for the VM.

If I simply configure a single CPU socket, I can migrate flawlessly, even with clocksource=kvm-clock.
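For reference, a minimal sketch of the relevant part of the VM definition after the change, assuming the usual /etc/qemu-server/<vmid>.conf layout and a placeholder vmid of 101 (all unrelated lines omitted):

# /etc/qemu-server/101.conf (101 is a placeholder vmid)
sockets: 1

Note that a change to the socket count only takes effect once the VM is fully stopped and started again, since the qemu process has to be restarted; a reboot from inside the guest is not enough.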

bye,
rob
 
resoli said:
If I simply configure a single CPU socket, I can migrate flawlessly, even with clocksource=kvm-clock.

I just ran into the same problem. Thanks for posting your experience and saving me a bunch of troubleshooting. Does anyone know why multiple vCPUs trigger this problem on Karmic?