[SOLVED] VM live migration not working despite cluster being in good condition

Larion

New Member
Mar 6, 2021
I'm testing Proxmox and was surprised by the following error:

Running a 3-node hyperconverged cluster with Ceph (pve-manager/6.3-4/0a38c56f, running kernel 5.4.101-1-pve; last updated on 04.03.2021 from the no-subscription repo).

I can clone VMs to the other nodes.

I can offline migrate VMs to the other nodes.

But live migration fails.
After 15 minutes it errors out with: broken pipe.
See attached screenshot.

All nodes have identical hardware.
Cluster quorum OK - no apparent errors.
Date and time on all nodes in sync.
Firewall off.
All nodes can ping each other on all interfaces.
25GBase direct-attached network for Ceph
20GBase network (2x 10GBase) for management/migration/backup
1GBase network for Corosync
20GBase network (2x 10GBase) for VM traffic, but not assigned to any VMs at the time of testing

VM hardware and BIOS settings make no difference: machine q35 or i440; CPU kvm or EPYC; EFI or legacy boot; no network adapter assigned at the time of testing.
Windows VM (20H2) with guest agent installed, but the operating system type makes no difference (tested Linux and Win10).



Any ideas?
 

Attachments

  • live_migration_broken_pipe.png (130.9 KB)
Thanks for your reply.
Will look into it, but I don't think so.
I installed Proxmox fresh and haven't removed or added a node since.
At the moment I don't understand why cloning and offline migration work (is that not done over SSH too?) but live migration doesn't.
 
At the moment I don't understand why cloning and offline migration work (is that not done over SSH too?) but live migration doesn't.
If the disks are on shared storage (Ceph in your case), there is nothing to copy/move when cloning (Ceph does that) or offline migrating (that's just a config move).
For live migration, we have to move the memory, though.

I'd check the syslog/journal or the network in between (firewall maybe?).
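For example, something along these lines while (re)starting the migration (unit names and time range are just suggestions):

# follow the journal live on source and target node during the migration attempt
journalctl -f

# or look back after a failed attempt
journalctl --since "-30min" -u pvedaemon -u pveproxy -u ssh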
 
Thanks, Dominik.
Yes, I know that about shared storage, but the config file is copied over SSH, isn't it?
Firewall is off on all nodes.
No external firewall should be able to interfere, because the traffic does not leave the network, so no gateway is used.
Since I can connect from my PC to each cluster node over SSH using the same network interface (the 20GBase network (2x 10GBase) for management/migration/backup) and the same root user/password, I would be surprised if a firewall or switch were responsible.
Will check the logs again anyway.
Any further help is appreciated.
 
Yes, I know that about shared storage, but the config file is copied over SSH, isn't it?
Actually no, the config is copied locally within /etc/pve, which is our cluster filesystem (pmxcfs), synced across the cluster via corosync.
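You can see this on any node: /etc/pve looks the same everywhere, and a guest's config simply lives under the node that currently owns it (VMID 100 below is only an example):

# pmxcfs is mounted at /etc/pve on every node
ls /etc/pve/nodes/                              # one directory per cluster node
cat /etc/pve/nodes/node1/qemu-server/100.conf   # config of VM 100 on its current node

# an offline migration essentially just moves that .conf file to
# /etc/pve/nodes/<target>/qemu-server/ - no disk data is touched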

No external firewall should be able to interfere, because the traffic does not leave the network, so no gateway is used.
Since I can connect from my PC to each cluster node over SSH using the same network interface (the 20GBase network (2x 10GBase) for management/migration/backup) and the same root user/password, I would be surprised if a firewall or switch were responsible.
Will check the logs again anyway.
OK, hopefully there are some useful messages in there.
 
Since I can connect from my PC to each cluster node over SSH using the same network interface (the 20GBase network (2x 10GBase) for management/migration/backup) and the same root user/password, I would be surprised if a firewall or switch were responsible.

Can you connect from each node to the other nodes via SSH? I.e. log in on leiptir2 and ssh to 172.16.11.62: that's what will very likely fail.
 
Just tested it again.
The live migration didn't work this time either.

I can go into Datacenter/node1/shell and enter:
ssh 172.16.11.62 -l root
and it connects to node2 without a problem

I can go into Datacenter/node2/shell and enter:
ssh 172.16.11.61 -l root
and it connects to node1 without a problem

Therefore, wrong SSH keys are out of the question.
Still checking the logs, but I haven't found anything shouting "look at me, I'm your gremlin" yet.
 
Just solved the problem.
The MTU was set to 9000 on the migration network interface instead of the default 1500.
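For anyone else hitting this, a quick way to check and (temporarily) fix it - the interface name below is just an example, adjust to your setup:

# show the MTU each interface is currently using
ip link show | grep mtu

# set the migration interface back to 1500 for now; make it permanent
# via the mtu option of that interface in /etc/network/interfaces
ip link set dev bond1 mtu 1500
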
What kind of program is used to copy the memory state over the SSH tunnel?
dd?

Is there a recommended MTU size for the migration traffic, to avoid an issue like the one I had?

If you would be so kind as to answer the above questions and then close this thread as solved.
Thanks for your help.
 
What kind of program is used to copy the memory state over the SSH tunnel?
dd?
QEMU does this itself; we just tunnel the traffic over an SSH/unix socket.
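For reference, you can also pick which network that memory stream uses, per migration or cluster-wide (VMID, node name and CIDR below are only examples):

# one-off, run on the source node
qm migrate 100 node2 --online --migration_network 172.16.11.0/24

# or cluster-wide, in /etc/pve/datacenter.cfg
migration: type=secure,network=172.16.11.0/24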

Is there a recommended MTU size for the migration traffic, to avoid an issue like the one I had?
No, anything your network can handle. But note that for a working MTU >1500, every network device must be configured for it and must actually handle it correctly. While an MTU of 9000 can offer some performance improvement, there are often devices which do not handle it correctly.
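A simple way to verify that jumbo frames actually make it end-to-end between two nodes is a ping with fragmentation forbidden (the address below is just an example from this thread):

# 8972 bytes of payload + 28 bytes of ICMP/IP headers = 9000
ping -M do -s 8972 172.16.11.62

# the equivalent test for a standard 1500 MTU path
ping -M do -s 1472 172.16.11.62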

If you would be so kind as to answer the above questions and then close this thread as solved.
You can do that yourself with the thread tools at the top of the page.
 
