Hi,

We have a 3-node cluster with relatively large nodes (2 × 64-core EPYC 7763 CPUs and 1 TiB of RAM each), and we are stress testing Proxmox in large-node configurations with a large number of VMs. In our test scenario there are around 440 test VMs on the first node, with the other two nodes empty. Proxmox 7.4 is fully up to date, using the non-production repository and the 6.2 opt-in kernel.

When testing live migration with 10 parallel workers (datacenter.cfg has max_workers set to 30), we often see this task error:

2023-05-09 22:18:26 use dedicated network address for sending migration traffic (172.26.0.2)
2023-05-09 22:18:26 starting migration of VM 244 to node 'proxmox002' (172.26.0.2)
2023-05-09 22:18:26 starting VM 244 on remote node 'proxmox002'
2023-05-09 22:18:53 [proxmox002] kvm: -incoming tcp:172.26.0.2:60000: Failed to find an available port: Address already in use
2023-05-09 22:19:08 [proxmox002] start failed: QEMU exited with code 1
2023-05-09 22:19:08 ERROR: online migrate failure - remote command failed with exit code 255
2023-05-09 22:19:08 aborting phase 2 - cleanup resources
2023-05-09 22:19:08 migrate_cancel
2023-05-09 22:19:14 ERROR: migration finished with problems (duration 00:00:49)
TASK ERROR: migration problems

It appears that Proxmox can use TCP ports 60000 through 60050 for live migration connections. With only 10 parallel transfers, I can't imagine all of those ports being in use.
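
For what it's worth, here is a quick check that could be run on the target node while the migrations are going (just a hypothetical little Python helper, not anything from Proxmox) to count how many of those ports are actually occupied at that moment:

import socket

busy = []
for port in range(60000, 60051):          # the live-migration port range
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind(("0.0.0.0", port))         # only succeeds if nothing holds the port
    except OSError:
        busy.append(port)                 # e.g. an incoming QEMU is listening here
    finally:
        s.close()

print(f"{len(busy)} of 51 migration ports in use: {busy}")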

While migrating all 440 VMs, this happened 58 times, mostly near the end of the run.

I thought I'd bring it up in case there is a bug in the port allocation, where an "already in use" port is being selected incorrectly.
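
To illustrate what I suspect (a rough sketch in Python for readability; the actual Proxmox code is Perl and may well work differently): if the "free" port is found with a bind test and then released again before the incoming QEMU binds it, two parallel migrations can be handed the same port:

import socket

MIGRATION_PORTS = range(60000, 60051)    # the port range seen in the task logs

def probe_free_port(host="127.0.0.1"):
    # Bind-test each port and return the first one that looks free right now.
    # The test socket is closed again, so nothing actually reserves the port.
    for port in MIGRATION_PORTS:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            s.bind((host, port))
            return port
        except OSError:
            continue                     # already in use, try the next one
        finally:
            s.close()
    raise RuntimeError("no free migration port in range")

# Two migration workers probe before either incoming QEMU has bound its port,
# so both can be handed 60000; the second "-incoming tcp:...:60000" then fails
# with "Address already in use".
print(probe_free_port(), probe_free_port())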

After 3 rounds of "migrate all", every VM does eventually migrate successfully.

Any ideas would be useful.

Thanks!

Eric
 
Hi,
there is a bug report for this: https://bugzilla.proxmox.com/show_bug.cgi?id=4501 but it only happens with the insecure migration setting. Please read the bug report and re-open it if you really need this and can't/don't want to use the default, secure migration type.
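
For reference, the migration type is set cluster-wide in /etc/pve/datacenter.cfg, e.g. (the network is just an example value based on this thread):

migration: secure,network=172.26.0.0/24

Replacing "secure" with "insecure" opts into the insecure type.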
 
Hello Fiona,

If you use secure, it already limits the network throughput a lot.
In my test cluster at home, I get a maximum of 18 GBit/s throughput with secure migrations.
If I switch to insecure, I get the full 40 GBit/s line speed.

I often have 100 GBit networks at my customers' sites, so you don't want to migrate at less than 20 GBit/s.
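
To put those numbers in perspective (rough math, ignoring dirty-page re-transfers): a node with 1 TiB of RAM in use is roughly 8.8 Tbit to move, so about 8 minutes at 18 GBit/s, under 4 minutes at 40 GBit/s, and well under 2 minutes at the full 100 GBit/s.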
 
I think this is a bug that has to be fixed quickly. Normally you don't run into this situation, but with the new maintenance mode Proxmox itself triggers an operation where many workloads are migrated at once. And of course I want to use the maximum bandwidth of my migration link. This will only become more relevant in the future, because CPU power keeps increasing and you can host many more workloads on a single node.
 
This thread is only the second time this issue was reported. I don't have any numbers on how many people are using mass migration with insecure, but it seems to also depend on some other factor or at least it might be very racy. Otherwise, I'd expect more people to complain. If you are experiencing the issue, feel free to re-open the bug and maybe also link back to this thread. In the bug report, @fabian thought that ssh overhead would not be a big deal.
 
At high bandwidths the SSH overhead does play a role. I guess only one core is used, and that's why it limits the bandwidth so much.
But I have not had any problems with mass migrations and insecure.
 
I also can't imagine you need secure migration at all. In a production setup you run in closed environments, and when you build local-to-remote clusters you use a VPN. But that is an open question, and it's also an important one.
 
Yes, at high bandwidths the problem with secure migration is the SSH tunnel, which is sadly not multi-threaded. Parallel migration can mitigate this problem to some extent (for example, at the ~18 GBit/s per tunnel reported above, two to three parallel migrations can already fill a 40 GBit link).
 
@nightowl,
You might find this technote helpful:
The technote discusses insecure vs. secure migration and provides guidance on addressing spurious failures related to connection drops when using the default secure migration mode. A simple sshd tunable resolved the issues. We also advise avoiding insecure migration for various reasons.
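
One example of such an sshd tunable (whether it is the exact one covered in the technote, please check there; it is only an assumption here for illustration) is MaxStartups in /etc/ssh/sshd_config, which rate-limits concurrent, not-yet-authenticated SSH connections and defaults to 10:30:100. Raising it looks roughly like this:

# /etc/ssh/sshd_config -- allow more concurrent unauthenticated connections,
# since the default of 10:30:100 can drop connections during mass migrations
MaxStartups 100:30:200

followed by a reload of sshd (systemctl reload ssh).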


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 