VM migration problems (networking does not work correctly)

Hivane

Hi all,

I have a problem with online migration: the VM loses its network connectivity when it is migrated.
Here is my setup:

Two hypervisors configured as a cluster (let's call them A and B); each has two physical network connections (eth0 + eth1).
-> On eth0, I have 3 tagged VLANs (274 + 275 + 277), carried over a bond setup (with only eth0 in it).
-> On eth1, I only have one VLAN (untagged), which is used for the iSCSI shared storage.

I have the following /etc/network/interfaces on both hypervisors:
Code:
auto lo
iface lo inet loopback

auto bond0
iface bond0 inet manual
        slaves eth0
        bond_miimon 100
        bond_mode 4


## Vlan 274 -- Xenserver
auto bond0.274
iface bond0.274 inet manual
        vlan-raw-device bond0

auto vmbr0
iface vmbr0 inet static
        address 193.17.x.x
        netmask 255.255.255.240
        gateway 193.17.x.x
        bridge_ports bond0.274
        bridge_stp off
        bridge_fd 0

## Vlan 90 - Untag - BO
auto vmbr1
iface vmbr1 inet static
        address 172.16.192.4
        netmask 255.255.0.0
        bridge_ports eth1
        bridge_stp off
        bridge_fd 0


## Vlan 275 -- Proxmox IVR
auto bond0.275
iface bond0.275 inet manual
        vlan-raw-device bond0

auto vmbr2
iface vmbr2 inet manual
        bridge_ports bond0.275
        bridge_stp off
        bridge_fd 0


## Vlan 277 -- Proxmox GN
auto bond0.277
iface bond0.277 inet manual
        vlan-raw-device bond0

auto vmbr3
iface vmbr3 inet manual
        bridge_ports bond0.277
        bridge_stp off
        bridge_fd 0


The problem I encounter is the following:
When I create a VM on node A and boot it, everything works perfectly.
-> If I migrate it to node B, the migration succeeds, but I can no longer ping the VM.
I also cannot get a console connection through the web interface (it is unresponsive), nor shut it down (it times out).
The only working options for me are either to migrate it back to node A (it pings again, and I can see it did not reboot), or to reset it from the web interface.
However, on the switch side I can see that the VM's MAC address is learned at layer 2 on the correct node.
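On the hypervisor side, the bridge's own MAC table can be checked the same way; vmbr0 here is just the example bridge from the config above, substitute the one the VM is plugged into:
Code:
# MAC addresses the bridge has learned (look for the VM's MAC)
brctl showmacs vmbr0

# same information via iproute2
bridge fdb show br vmbr0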

-> If I reset the VM while it is running on node B (and unresponsive), it reboots and pings again.
After that, I can migrate it from B to A (and vice versa) and everything is fine.

I can reproduce the problem on VLAN 274 and VLAN 275. I haven't tried VLAN 277 yet, because I do not have any available IPs on it.


Conclusion: if the VM has been booted on node B, it can be migrated back and forth smoothly. If it has been booted on node A, it only works on node A.


Did I miss something in my configuration? Or is there maybe a bug? The VM's network settings have been tested with both e1000 and virtio.
 
Other information:

- I have checked with brctl show and tcpdump that the tap interface is attached to the correct bridge, and that traffic arrives at the VM: OK (commands sketched below).
- I have noticed that the VM uses 100% of its CPU while it is unresponsive on node B.

This can be confirmed with top on the hypervisor node:
Code:
73185 root      20   0 2347m 161m 3308 S 399.1  0.5  72:33.61 kvm
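
Roughly the checks referred to in the first point above (tap101i0 and the MAC address are only examples; Proxmox names the tap of VM 101's first NIC like that, substitute the real VM ID and bridge):
Code:
# confirm which tap devices are attached to which bridge
brctl show

# confirm that traffic actually reaches the VM's tap device
tcpdump -ni tap101i0

# and that it shows up on the bridge / VLAN side as well
tcpdump -ni vmbr0 ether host 00:11:22:33:44:55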

Comparing nodes A and B at the CPU level: A is a dual quad-core machine and B a dual dual-core. The VM does not have more than 2 sockets / 2 cores in its configuration.
However, the CPU frequency and flags differ between the two nodes:

Node A flags (Frequency is 2.83GHz):
Code:
flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 lahf_lm dts tpr_shadow vnmi flexpriority

Node B flags (Frequency is 3GHz):
Code:
flags        : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall lm constant_tsc arch_perfmon pebs bts rep_good aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca lahf_lm dts tpr_shadow
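
For reference, the flag lines above come straight from /proc/cpuinfo; a quick way to dump and diff them between the two nodes (nodeA/nodeB are placeholder hostnames):
Code:
# grab the flag list of the first CPU on each node, one flag per line, then diff
ssh nodeA "grep -m1 '^flags' /proc/cpuinfo | tr ' ' '\n' | sort" > /tmp/flags-A
ssh nodeB "grep -m1 '^flags' /proc/cpuinfo | tr ' ' '\n' | sort" > /tmp/flags-B
diff /tmp/flags-A /tmp/flags-B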

Compared to A, node B lacks the sep and nx flags.
Is that a problem?
 
Hi, it seems that the sep flag is something important for pve/qemu, since we can see "-cpu kvm64,+x2apic,+sep" in the kvm command line arguments.

Is there any clean way to disable it without ugly hacks?
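
One thing that could be tried, sketched here with example values only (VM ID 101 and CPU type qemu64 are placeholders): Proxmox lets you set the emulated CPU type per VM, and qm showcmd prints the kvm command line it would generate, so it is possible to check whether a different CPU type changes the -cpu arguments.
Code:
# current CPU-related setting of the example VM
qm config 101 | grep ^cpu

# try another emulated CPU type, then inspect the generated kvm command line
qm set 101 --cpu qemu64
qm showcmd 101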
 