Issue with Migration network only working for manual migrations

mlazarin

Member
Dec 10, 2023
Hi! First, sorry for the long text. I wanted to provide as much detail as I could.

I'm facing an issue setting up my Migration network and would like to check whether someone has encountered a similar issue or can offer guidance.

I built my Proxmox 9 cluster with 3 servers (3x HP DL360G9), each with one dual-port 10Gb card, one dual-port 2.5Gb card, and 4 onboard 1Gb ports. I set up the network as follows (I'm describing the first server; the others are set up the same way, just changing the last octet of the IP addresses to 20 and 30, respectively).

- nic0: first 1Gb onboard NIC, set as Corosync Ring 0 (for redundancy), IP 10.25.36.10/24
- nic1: second 1Gb onboard NIC, set as Corosync Ring 1 (for redundancy), IP 10.25.37.10/24
- bond0: the 2x 2.5Gb NICs bonded for Management and VM access
- vmbr0: set as the main gateway, with IP 10.25.35.10/24 (bridge for management and VM access), set as VLAN aware, and VLAN IDs 30 (VMs), 35 (Mgmt), 36 (Corosync Ring 0), and 37 (Corosync Ring 1)
- bond1: the 2x 10Gb NICs bonded for Ceph (public and private) and Migration
- vmbr1: set as the Ceph and Migration bridge, set as VLAN aware, and VLAN IDs 31 (Ceph Pub), 32 (Ceph Priv), and 33 (Migration), without IPs, divided as:
- vmbr1.31: Linux VLAN set as Ceph Public, IP 10.25.31.10/24, VLAN 31
- vmbr1.32: Linux VLAN set as Ceph Private, IP 10.25.32.10/24, VLAN 32
- vmbr1.33: Linux VLAN set as Migration, IP 10.25.33.10/24, VLAN 33
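For context, the layout above corresponds roughly to the following /etc/network/interfaces fragment. This is a sketch for node 1 only: the physical NIC names (ens1f0/ens1f1) and the bond mode are assumptions on my part, and bond0/vmbr0 are omitted for brevity.

```
# Sketch of /etc/network/interfaces on node 1 (NIC names and bond mode assumed)
auto bond1
iface bond1 inet manual
    bond-slaves ens1f0 ens1f1     # the 2x 10Gb ports (names assumed)
    bond-mode 802.3ad             # assumed; depends on switch configuration
    bond-miimon 100

auto vmbr1
iface vmbr1 inet manual
    bridge-ports bond1
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 31 32 33

auto vmbr1.31
iface vmbr1.31 inet static
    address 10.25.31.10/24        # Ceph Public, VLAN 31

auto vmbr1.32
iface vmbr1.32 inet static
    address 10.25.32.10/24        # Ceph Private, VLAN 32

auto vmbr1.33
iface vmbr1.33 inet static
    address 10.25.33.10/24        # Migration, VLAN 33
```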

The 2 Corosync rings (VLANs 36 and 37) are working as expected, management VLAN 35 is working fine, VM access is working fine on VLAN 30 (for each VM, I set their IP and VLAN properly on VLAN 30), and Ceph is also properly set using Public (VLAN 31) and Private (VLAN 32) networks, healthy and working fine.

Now I'm setting up HA for my VMs. With the default configuration (Datacenter -> Options -> Migration Settings unset), everything works well: manual migrations, and also automatic migrations based on affinity/HA rules (for example, if a VM on server 1 has an affinity rule to stay on server 1 and I move it manually to server 2, the system detects that server 1 is still up and running and automatically moves it back to server 1, without any issues).

But when I set the Migration network (Datacenter -> Options -> Migration Settings) to 10.25.33.10/24 on the first node (following the rule of last octet 10 on the first node), my problem starts. I can still do manual migrations (moving, for example, the same VM that worked before from server 1 to server 2), but automatic migrations (same example as above: the VM moved to server 2 with affinity to server 1, so the cluster tries to move it back automatically) fail with the following error message:

task started by HA resource agent
could not get migration ip: multiple, different, IP address configured for network '10.25.33.10/24'
TASK ERROR: command '/usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=vmserver01' -o 'UserKnownHostsFile=/etc/pve/nodes/vmserver01/ssh_known_hosts' -o 'GlobalKnownHostsFile=none' root@10.25.35.10 pvecm mtunnel -migration_network 10.25.33.10/24 -get_migration_ip' failed: exit code 255


If I run this command manually, I get the following error:

pvecm mtunnel -migration_network 10.25.33.10/24 -get_migration_ip
could not get migration ip: multiple, different, IP address configured for network '10.25.33.10/24'


Pings from server 2 (source) to server 1 (target) on both VLANs 35 and 33 work fine (I can ping server 1 from server 2 on both 10.25.35.10 and 10.25.33.10).
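The wording of the error suggests pvecm looks for exactly one local address inside the given migration network and bails out when several different ones match. A rough Python sketch of that selection logic (a hypothetical re-implementation for illustration, not the actual Proxmox code):

```python
import ipaddress

def get_migration_ip(local_ips, migration_net):
    """Pick the single local IP inside the migration network.

    Raises if several different local addresses fall inside the
    network, mimicking the 'multiple, different, IP address' error.
    """
    net = ipaddress.ip_network(migration_net, strict=False)
    matches = {ip for ip in local_ips if ipaddress.ip_address(ip) in net}
    if len(matches) > 1:
        raise RuntimeError(
            "could not get migration ip: multiple, different, "
            f"IP address configured for network '{migration_net}'")
    if not matches:
        raise RuntimeError(f"no local IP in network '{migration_net}'")
    return matches.pop()

# Exactly one address in VLAN 33: selection succeeds
print(get_migration_ip(["10.25.35.10", "10.25.33.10"], "10.25.33.10/24"))
# -> 10.25.33.10

# A second address in the same subnet (e.g. a stray VIP) would raise:
# get_migration_ip(["10.25.33.10", "10.25.33.100"], "10.25.33.10/24")
```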

Additional info:
- My DNS runs on 3 VMs (one on each server) with Pi-hole/Unbound, load-balanced via keepalived and synchronized via nebula-sync (working properly)
- To avoid problems with the cluster if the 3 VMs running Pi-hole go down (rare, but possible if I need to shut down the whole cluster), I have the IPs (management VLAN 35) set in /etc/hosts on the 3 Proxmox servers (so the servers can talk to each other)
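Those /etc/hosts entries look something like this (only the vmserver01 name is confirmed by the error log above; the other hostnames are assumptions):

```
# /etc/hosts on each Proxmox node (sketch; vmserver02/03 names assumed)
10.25.35.10  vmserver01
10.25.35.20  vmserver02
10.25.35.30  vmserver03
```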

Things that I need help with:
- Any suggestions/criticisms (constructive, please! :) ) on my architecture? The idea behind splitting the networks is to avoid impact on the Ceph and Migration traffic, and to use the fastest (bonded) NICs for them.
- Is setting Datacenter -> Options -> Migration Settings to my first server's value (10.25.33.10/24, taken from the selection options) the correct way to configure it? Do I need to change any other configuration?
- Is there a way to be 100% sure which network/VLAN the manual and automatic migrations are running on?
- Why does manual migration work but automatic migration fail when I set the Migration network? (Maybe it is still using the default network; that's why the previous question is important :) )

Any suggestions/support are really appreciated. For now, I'm keeping the Migration Settings at the default (which works, but over the slower 2.5Gb network).
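For reference, the value selected under Datacenter -> Options -> Migration Settings is stored in /etc/pve/datacenter.cfg; the entry looks roughly like this (a sketch, assuming the default 'secure' migration type):

```
# /etc/pve/datacenter.cfg (sketch)
migration: secure,network=10.25.33.10/24
```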

Thanks in advance,

Marcelo
 
what does "/sbin/ip address show to 10.25.33.10/24 up" print on each of your nodes?
 
Your suggestion helped me to fix it!

When I checked the output of ip address, the addresses were correct on each node (interfaces showing as UP), but there was also a second IP in the same VLAN, created by keepalived (a virtual IP shared between the 3 nodes), which my automation script had configured. I removed that configuration, and now only one IP shows up on that VLAN.

I'm not at home to do a full test, but from the command line I did an online migration (the same example mentioned above) and saw the VM move to the 2nd node and come back without errors. I will do a full test later when I'm back home, but I think this fixed the issue.

Thanks for your help, Fabian! Much appreciated!