[SOLVED] Use secondary network for PVE commands

Jan 17, 2024
Is there a way to use the secondary network in a cluster for command (ssh) communication?
I have two interfaces on all hosts in a cluster: a local one on 192.168.x.x and a public one with public IPs, set as ring0 and ring1 respectively in corosync. Due to a switch failure, the ring0 interfaces are not working on some hosts.

If I try to migrate from such a host, qm tries to ssh using the failed network, even though I set the migration network to be the ring1 one. For example:
Code:
qm migrate 520 pve4 --online
ssh: connect to host 192.168.0.23 port 22: No route to host
command '/usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=pve4' -o 'UserKnownHostsFile=/etc/pve/nodes/pve4/ssh_known_hosts' -o 'GlobalKnownHostsFile=none' root@192.168.0.23 pvecm mtunnel -migration_network xxx.xxx.xxx.xxx/yy -get_migration_ip' failed: exit code 255
I already set knet_link_priority in corosync, but that does not fix this problem.

How can I coax the system into using the ring1 address for the target host? Preferably without changing the primary IP address of any host.
 
My understanding is that the initial SSH connection is made to the node IP address that Proxmox VE recognizes for pve4, which in this case is 192.168.0.23.

The migration_network option is passed only after that SSH connection has been established.

In other words, migration_network is used to select the network for the migration tunnel or data transfer, but it does not help if the target node is not reachable during the initial SSH connection phase.
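To tell which phase is failing, it can help to test the control address separately from the migration network. The following is only a minimal sketch: first_reachable and probe_tcp22 are hypothetical helper names, and it assumes nc is installed on the source node. It prints the first candidate address that accepts a TCP connection on port 22.

```shell
# Hypothetical helper, not part of Proxmox VE.
# Default probe: plain TCP check of the SSH port (assumes nc is available).
probe_tcp22() {
    nc -z -w 2 "$1" 22
}

# Print the first address for which the probe succeeds.
# Returns non-zero if none of the candidates respond.
# Set PROBE to the name of another command to swap the check.
first_reachable() {
    probe=${PROBE:-probe_tcp22}
    for addr in "$@"; do
        if "$probe" "$addr" >/dev/null 2>&1; then
            printf '%s\n' "$addr"
            return 0
        fi
    done
    return 1
}
```

For example, first_reachable 192.168.0.23 203.0.113.23 would try the ring0 address from the error message first and then a second address (203.0.113.23 is a placeholder here, not the real ring1 address). If only the second one answers, the initial SSH step is what blocks the migration, regardless of the migration_network setting.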

Therefore, the simplest solution would be to restore or repair the ring0 network.

It might also be possible to solve this by modifying the network configuration or the SSH client configuration, but I believe that would involve more complex steps and could potentially affect the cluster.

If live migration is not strictly required, but the VM needs to run on another PVE node, it may be more practical to consider alternatives such as backup and restore.
 
That's pretty much correct. Regular communication (SSH or API proxying) happens on the IP that the node's hostname resolves to. Only the actual data streams use the migration network, and the IP to connect to on that network is itself discovered over the regular network.
 
"regular communication (SSH or API proxying) will happen on the IP that the hostname of the node resolves to"
That did not seem to be the case. The first thing I tried was changing the IP for the host in /etc/hosts, and the behaviour was the same.
What I ended up doing was to change the get_ssh_info function in /usr/share/perl5/PVE/SSHInfo.pm, so it returns the other IP for that host. That way, I could migrate away all VMs.
Of course, I managed to get a replacement switch eventually.

If I understand correctly, the only long-term resilient solution seems to be using multiple switches on the ring0 network redundantly.

Thank you to everyone for the replies!
 
For some parts, you also need to restart pve-cluster, since the IP is resolved there at startup and broadcast to the rest of the cluster.
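As a rough sketch of that sequence (an untested assumption on my side, not an official procedure): after editing /etc/hosts on the node whose address should change, check what the hostname resolves to, restart pve-cluster, and then look at the addresses the cluster advertises in /etc/pve/.members. The refresh_node_ip wrapper and the RUN variable are hypothetical. By default it only prints the commands, so nothing is restarted by accident.

```shell
# Hypothetical wrapper, not part of Proxmox VE.
# Dry run by default: each step is printed via echo instead of executed.
# Set RUN=env to really execute the steps on the node.
refresh_node_ip() {
    node=${1:-$(hostname)}
    run=${RUN:-echo}
    # 1. Show what the node name currently resolves to (after the /etc/hosts edit).
    $run getent hosts "$node"
    # 2. Restart the cluster filesystem service so the IP is resolved again.
    $run systemctl restart pve-cluster
    # 3. Show the member list, including the IPs known to the cluster.
    $run cat /etc/pve/.members
}
```

Calling refresh_node_ip pve4 lists the three steps. Running RUN=env refresh_node_ip on the affected node would execute them. On a cluster that is already degraded it is probably worth doing this one node at a time.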