Proxmox live migration: no network after migration is done

Mitterhuemer

New Member
Jan 29, 2018
Hello,

German post on the Hetzner board:
https://forum.hetzner.com/thread/25240-vswitch-proxmox-live-migration-kein-netzwerk-nach-abschluss/

I have 2 dedicated Hetzner servers.

One is in Helsinki and one is in Falkenstein.

They share a public /29 IP subnet on VLAN 4000. (All VMs need to use an MTU of 1400.)

The first usable IP of the subnet is Hetzner's gateway router.

I bridged vlan4000 (vmbr4000).
I added this interface to the VMs.

Code:
source /etc/network/interfaces.d/*

auto lo
iface lo inet loopback

iface lo inet6 loopback

auto enp0s31f6
iface enp0s31f6 inet manual

auto vmbr0
iface vmbr0 inet static
        address  PUBLIC IP OF SERVER
        netmask  NETMASK OF SERVER
        gateway  HETZNER GW
        bridge-ports enp0s31f6
        bridge-fd 0
        bridge_hello 2
        bridge_maxage 12
        bridge_stp off

iface vmbr0 inet6 static
        address  PUBLIC IP OF SERVER
        netmask  64
        gateway  fe80::1

auto vmbr4000
iface vmbr4000 inet manual
        bridge-ports enp0s31f6.4000
        bridge-fd 0
        mtu 1400
        bridge_hello 2
        bridge_maxage 12
        bridge_stp off
#Public

auto vmbr4001
iface vmbr4001 inet static
        address  10.100.30.10
        netmask  255.255.255.0
        bridge-ports enp0s31f6.4001
        bridge-fd 0
        mtu 1400
        bridge_hello 2
        bridge_maxage 12
        bridge_stp off
#Cluster
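
As noted above, every VM also needs MTU 1400 on its vSwitch-facing NIC; the bridge MTU alone does not change the guest. A minimal sketch of the in-guest side, assuming a Debian-style guest whose vSwitch NIC is eth0 (interface name and the placeholders are just examples):

Code:
# temporary, inside the guest
ip link set dev eth0 mtu 1400

# persistent, in the guest's /etc/network/interfaces
auto eth0
iface eth0 inet static
        address  VM PUBLIC IP FROM THE /29
        netmask  255.255.255.248
        gateway  FIRST USABLE IP OF THE /29 (Hetzner gateway)
        mtu 1400

If I remember correctly, newer PVE versions also let you force this from the host with an mtu= option on the VM's netX line, but the in-guest setting is the safe bet.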

Everything works. All VMs are reachable via a public IP from the subnet.

When I do a live migration from server A to server B, the network of the migrated VM is no longer reachable after the migration finishes.

Sometimes the network comes back about 20-30 minutes after the migration.

I tried this:
ip -s -s neigh flush all

but it did not help.

The proxy_arp settings did not solve it either.

Hetzner asked whether Proxmox sends a gratuitous ARP after migration.

I enabled these features in the kernel:

Code:
net.ipv4.conf.all.arp_ignore = 1
net.ipv4.conf.all.arp_announce = 2
net.ipv4.conf.default.arp_ignore = 1
net.ipv4.conf.default.arp_announce = 2
net.ipv4.conf.vmbr0.arp_ignore = 1
net.ipv4.conf.vmbr0.arp_announce = 2
net.ipv4.conf.vmbr0.arp_accept = 1
net.ipv4.conf.default.arp_accept = 1
net.ipv4.conf.all.arp_accept = 1
net.ipv4.conf.vmbr0.arp_notify = 1
net.ipv4.conf.default.arp_notify = 1
net.ipv4.conf.all.arp_notify = 1
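
For reference, one way to make these settings persistent across reboots (a sketch, assuming a standard Debian-based PVE host; the file name is arbitrary):

Code:
# put the settings above into a sysctl drop-in file
cat > /etc/sysctl.d/99-arp-garp.conf <<'EOF'
net.ipv4.conf.all.arp_notify = 1
net.ipv4.conf.all.arp_accept = 1
# ... rest of the settings from above ...
EOF
sysctl --system    # reload all sysctl configuration files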

But live migration from Falkenstein to Helsinki still does not work properly. (The migration itself succeeds, but the VM has no internet access afterwards.)

The strange thing:
Live migration from Helsinki to Falkenstein works every time, and the network is reachable immediately.

I tried the migration with the integrated networking (Linux bridge) and also with Open vSwitch.

It makes no difference.

If I use Open vSwitch, arp -a shows me the MAC addresses of all VMs and the IP addresses they are using (on all nodes).

However, I no longer have any idea how to solve this.

Maybe someone can help me and the Hetzner community?
There are quite a few customers with the same problem and no solution.
 
Hi there,

Did you ever figure this out?

I have the exact same issue, and no matter what I have done (and all the forum surfing as well), I cannot get it to work properly. It's as if the vSwitch is hanging on to the MAC address of the old hypervisor, and no matter what I do I cannot get it to update (even manually via gratuitous ARP commands), yet the vSwitch gateway answers.

The traffic does not come through, but if I move the VM back to the other DC it works fine.

Any pointers?

---

Aside:

I know for sure that the vSwitch is passing traffic, as I have another (test) VM in a different DC, moving between three DCs, and it will continue to (internally) ping the VMs on the vSwitch regardless of where they are. It's just an issue getting traffic to and from the vSwitch's gateway.
 
Hi,

I'm currently working on adding VXLAN / BGP-EVPN to Proxmox; it should be ready in the coming months (something like VMware NSX).

This will make an anycast gateway possible on Proxmox (the same VM gateway IP on each Proxmox node); after that, routing will work, even across data centers.
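
To give a rough idea of the kind of overlay this is about, a minimal manual VXLAN tunnel between two nodes could look like the sketch below. This is plain iproute2, not the coming SDN feature; the VNI, the node IPs and the choice of vmbr4000 as the VM bridge are placeholders only:

Code:
# on node A (mirror the local/remote addresses on node B)
ip link add vxlan100 type vxlan id 100 dstport 4789 \
    local NODE_A_IP remote NODE_B_IP
ip link set vxlan100 mtu 1400          # VXLAN adds ~50 bytes of overhead
ip link set vxlan100 master vmbr4000   # enslave the tunnel to a VM bridge
ip link set vxlan100 up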

I have begun to write some documentation:

https://git.proxmox.com/?p=pve-docs...5;hb=9f400be21b58701daefd8083e441f2c6c8f9ab39

I'm currently polishing the GUI integration and the Proxmox code to make the setup easy.
 
Hi spirit,

That sounds like an interesting setup, and it seems like something I may end up looking at when it is complete. However, I'm stuck for time and really need to figure out the issue I have right now. According to Hetzner everything should work fine, but I have to wonder whether there is some setting missing on their side (on their vSwitches) or on mine, because this really should be a simple configuration.

Simply put, it does work, just not all the time, and at times I'm waiting a long time (minutes/hours/days) for the ARP entry to be updated on the vSwitch side, i.e. to point to the correct hypervisor/Proxmox host on the vSwitch/VLAN.

It's either the case that I do not have the network/guest configured correctly and the Proxmox node the VM was on (or moves to) is not able to send out an ARP update because it somehow does not know about the IP (even though I have the guest tools installed and can see the IP in the GUI), or the vSwitch is not honouring the update sent by the node.

Any ideas how I can watch the VLAN traffic to see whether the gratuitous ARPs are actually being sent by the Proxmox systems? (Or are they meant to be sent by the guest?) Or do they have to be enabled in the first place, and if so, how?

Another item of note is that I am not hot-migrating the guests; I am first shutting them down and migrating offline, as I am unable to migrate online at this time.
 
Hello spirit,

Thanks for coming back to me; it is much appreciated.

Ok, so to comment on your comment:
  • "if you use virtio-net nic, yes,"
    • I am using this on all guests.
    • I presume CentOS 7 guests are fine for this and will send it; though do I need to configure them to send it?
  • "the gratuitous arp is sent directly by the vm after the live migration."
    • Note that I am not live-migrating, only doing offline migration.
    • Does the guest still send this once re-started on the new node?
    • (Maybe it has to be forced in this scenario?)
 
  • "if you use virtio-net nic, yes,"
    • I am using this on all guests.
    • I presume CentOS 7 guests are fine for this, and will send; though do I need to configure them to send this?
  • "the gratuitous arp is sent directly by the vm after the live migration."
    • Note that I am not Live Migrating, only offline migration.
    • Does the guest still send this once re-started on the new node?
    • (Maybe it has to be forced in this scenario?)

The gratuitous ARP is sent after a live migration. (It works out of the box with virtio; there is nothing to configure.)


If you do an "offline" migration, the guest OS of the VM should send an ARP request at boot, for example to find its gateway. (You can run tcpdump -e -i vmbr.. to see them.)
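
To watch specifically for gratuitous ARP on the public VLAN bridge, something like the following should work. This is a sketch: the interface name is taken from the config earlier in the thread, and the byte offsets in the second filter assume IPv4 ARP over Ethernet (sender IP at offset 14, target IP at offset 24 of the ARP payload):

Code:
# all ARP traffic on the public bridge, with MAC addresses shown
tcpdump -e -n -i vmbr4000 arp

# only gratuitous ARP (sender IP == target IP inside the ARP packet)
tcpdump -e -n -i vmbr4000 'arp and arp[14:4] = arp[24:4]'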
 
Hello Spirit,

So, I did that and got the following - amended slightly:

tcpdump -nnti eth0 arp or icmp and host X.X.X.211

ARP, Request who-has X.X.X.209 tell X.X.X.211, length 28
ARP, Reply X.X.X.209 is-at X:X:X:X:X:40, length 42

So the guest is sending the request, upon boot, and is getting a response from the gateway IP.

That got me thinking, so I created a clone of the working guest and started to move the two of them around between DCs (Nuremberg and Falkenstein), keeping a ping running from where I moved them from, starting in Falkenstein:

- Every time, the second a guest started in Nuremberg it was pinging away.
- Move it back to Falkenstein: no ping.

Now, at one point, randomly, after moving there and back again, one of the guests in Falkenstein started pinging, but it took a few minutes; the other was still not pinging (both were moved at the same time and started at the same time).

So I did it again, back to Nuremberg (both work instantly) and then back to Falkenstein, and neither is working, even though I left them sitting for quite some time.

And what is more interesting: in Falkenstein, after the move there is no ARP response from the gateway IP.

So to me it seems to be a routing issue, or maybe a timing/refresh issue, i.e. the vSwitch eventually releasing its hold on the source in Nuremberg and then allowing another DC to send traffic.

[That is what I think, but I could be off base]

ASIDE:

I should also point out that I have two nodes in Falkenstein, in different data centers, and they have the same issue: guests that are moved to them cannot get ARP responses between them either.

But I know for sure they can communicate, as I had one guest in each DC (in Falkenstein) and they were pinging each other; it is only when they were moved to another DC that they hit the wall.

The thing is, if I change the IP of a guest to another one in the subnet, it pings instantly (also between nodes), even if I only change one at a time; then I move the IP-changed one and hit the wall again.
 
We were using the vSwitches just fine for some time, but then we started facing issues with offline nodes. Nodes were no longer able to talk to each other, and migrating anything to a new host or DC did not help either.
The only thing that made it work again was to remove all hosts from the vSwitch and add them again in the Robot. This is not good, since everything is offline for some minutes. I opened a request with Hetzner to see how we can solve this, in case it's something on their side.
It was working just fine for some time, so I don't understand what changed to make it stop working. Without vSwitches we would be back to some failover-IP setup, which requires a more complex failover mechanism.

Anyone got up-to-date experience?
 
We are running a three-node Proxmox VE HA cluster at Hetzner as well.
All three nodes are located in different DCs in Falkenstein.
The servers are connected via VLANs using the Hetzner vSwitch feature.
Our KVM hosts are connected to the internet via a pfSense cluster. The pfSense cluster is configured to hold our public IPs using CARP on the WAN interface.
Moving a VM from host 1 to 3 works seamlessly; not a single ping is lost. Moving the same VM to host 2 causes pings from the VM to some resource on the internet to fail once it is active. It does not get the MAC address for its default gateway, which is a CARP IP on the LAN interface of the firewall.
First we assumed that it might be related to the second firewall node running on that host.
To verify this, we moved the 2nd firewall node to host 2 and ran the same tests again.
This time, the ping again stopped when moving the VM to host 2 and started again instantly when moving the VM to host 3, where no firewall node was now running.
From my current point of view the issue seems to be related to the DC where host 2 is located.
I will go through the interface config of all three hosts again and verify that they are the same. Sometimes, when moving the VM to host 2, the ping came back after 254 to 300 lost packets. This is the result of only two tests, though.

Best regards
Sebastian
 
Hi.
I'm also using Hetzner, and I would like to set up the same configuration, adding 2 more hosts.
I run some servers for distributing large files.
Can you explain how you set up the vSwitch/VLAN to get automatic failover of the IP with Hetzner? :)
 
Hi,

Same problem with our Hetzner Proxmox cluster installation.

In our case, we have 2 servers in Falkenstein:

FSN1 (Falkenstein)
FSN1-DC1 (Node-1)
FSN1-DC5 (Node-2)

The network fails on both live migration and offline migration of any VM from node-1 to node-2 and vice versa.

Can anyone contribute some knowledge about this? We are talking to Hetzner support and they haven't given us any solution.

Regards.
 
Hi,
I'm still working on the BGP-EVPN SDN for Proxmox (targeted for 6.2); I think it could help with this problem.
(It will replace the Hetzner vSwitch, doing the routing between Proxmox nodes, with an anycast VM gateway: the same IP on the vmbr of each Proxmox node.)

I'll try to get some Hetzner and OVH servers to test it.
 
Hi openaspace,

I'm not sure if I understood your question properly.

What I did was create a vSwitch via the Hetzner Robot and order a public IP address range within the vSwitch menu.
I'm using VLAN 4000 for my WAN connectivity.
Go to vSwitches, click on the VLAN that faces the public internet, and there you will find four menu items:
Virtual Switch, IPs, Cancellation and Monitoring.

Go to IPs and choose to order additional IPs and networks at the bottom left.

I hope this is the answer you were looking for.

Best regards
Sebastian
 
Hi spirit,

many thanks for the efforts you are putting into this.

I'll have a cluster for testing purposes available by the end of next week, using two servers at Hetzner.
I'd be willing to share it with you to perform tests if you like.
My plan is to keep this cluster until the end of January 2020.
I'm not sure if this fits your schedule and your availability.

Just let me know if this is of any help to you.

2019-12-20: During the migration of my current VMs to the Hetzner Proxmox cluster it turned out that I can run fewer of them per host, as I'm also using Ceph on the Proxmox cluster; before, I was using a plain KVM solution. Thus, I unfortunately won't be able to provide you with the test cluster, as I needed to add the servers to my existing cluster. Sorry for that.

Best regards
Sebastian
 
Hi all,

today I was moving a VM from one host to the other, which worked fine.
Later on I moved another one, and this ended in a loss of connectivity for that virtual machine.

I issued the following command from the command prompt of the virtual machine and the ping came back instantly.

arping -c 4 -A -I <network interface> <IP of the virtual machine>

In my case I issued

arping -c 4 -A -I eth0 192.168.0.100

This is, as far as I understand, a manually issued gratuitous ARP.

I'd be glad if some of you could verify this.
It might still be that in my case the ping just happened to come back at that moment by accident.

If this works for you too, then it seems that the VMs sometimes do not send out their gratuitous ARP packets properly, or the switches at Hetzner do not handle them properly every now and then.
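
If the manual arping reliably fixes it, one possible way to automate it inside a guest (for the offline-migration case, where the interface comes up again at boot) would be a NetworkManager dispatcher hook. A sketch for a CentOS 7 guest; eth0 and 192.168.0.100 are placeholder values, and note this would not help for live migration, where the interface never goes down:

Code:
# /etc/NetworkManager/dispatcher.d/99-garp   (make it executable with chmod +x)
#!/bin/sh
# NetworkManager passes the interface name as $1 and the action as $2
if [ "$1" = "eth0" ] && [ "$2" = "up" ]; then
    # announce our own address a few times (gratuitous ARP)
    arping -c 4 -A -I eth0 192.168.0.100
fi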

Many thanks and best regards
Sebastian
 
Hi SebastianS

We have performed the tests you mention, but they have not worked. When we stop receiving pings from the migrated VM, right after the migration has finished we launch the arping command you suggest from the console of the migrated VM, but the VM still cannot be reached by ping.

Anyway, we have seen with tcpdump that the arping is sent automatically by the VM when the migration finishes, so (in our case) it is not necessary to launch the command manually.
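
For anyone else who wants to double-check this on their own cluster, the VM's tap device on the destination node can be watched directly; a sketch, assuming the usual tapVMIDi0 naming (VMID 100 is just an example):

Code:
# on the destination Proxmox node, during/after the migration
tcpdump -e -n -i tap100i0 arp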

Thank you very much anyway for your suggestions, but in our case the problems continue.

Kind Regards.
 
Yes, but when we do a live migration between Proxmox servers, the VM needs at least 5 minutes to come back online. Anyway, I don't know whether Hetzner changed anything last weekend, since I'm now having problems with outbound traffic.

Thanks for your questions. I can run any test if you need.

Regards.
 
