Hi,

We have a 3-node cluster with relatively large nodes (2 x 64-core EPYC 7763s and 1 TiB of RAM each), and we are stress testing Proxmox in large-node configurations with a large number of VMs. In our test scenario, there are around 440 test VMs on the first node, with the other two nodes empty. Proxmox 7.4 is fully up to date, using the non-production repo and the 6.2 opt-in kernel.
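For anyone who wants to reproduce the scenario, the load we generate is roughly equivalent to the sketch below (the VMID range, target node, and worker count are placeholders for our setup; bulk migration from the GUI behaves similarly, and qm migrate is just the stock CLI way to start an online migration):

#!/usr/bin/env python3
# Rough sketch of the test load: N parallel online migrations using the stock
# `qm migrate` CLI. VMID range, target node, and worker count are placeholders.
import subprocess
from concurrent.futures import ThreadPoolExecutor

TARGET = "proxmox002"
VMIDS = range(100, 540)   # roughly 440 test VMs
WORKERS = 10              # parallel migrations

def migrate(vmid):
    # Live-migrate one VM; returns the VMID and the qm exit code (non-zero on failure).
    result = subprocess.run(["qm", "migrate", str(vmid), TARGET, "--online"],
                            capture_output=True, text=True)
    return vmid, result.returncode

with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    failed = [vmid for vmid, rc in pool.map(migrate, VMIDS) if rc != 0]

print(f"{len(failed)} of {len(VMIDS)} migrations failed: {failed}")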

When testing live migration with 10 parallel workers (datacenter.cfg has max-workers set to 30), we often see this task error:

2023-05-09 22:18:26 use dedicated network address for sending migration traffic (172.26.0.2)
2023-05-09 22:18:26 starting migration of VM 244 to node 'proxmox002' (172.26.0.2)
2023-05-09 22:18:26 starting VM 244 on remote node 'proxmox002'
2023-05-09 22:18:53 [proxmox002] kvm: -incoming tcp:172.26.0.2:60000: Failed to find an available port: Address already in use
2023-05-09 22:19:08 [proxmox002] start failed: QEMU exited with code 1
2023-05-09 22:19:08 ERROR: online migrate failure - remote command failed with exit code 255
2023-05-09 22:19:08 aborting phase 2 - cleanup resources
2023-05-09 22:19:08 migrate_cancel
2023-05-09 22:19:14 ERROR: migration finished with problems (duration 00:00:49)
TASK ERROR: migration problems

It appears that Proxmox can use TCP ports 60000 through 60050 for live migration connections. With only 10 parallel transfers, I can't imagine all of those ports being in use.

While migrating all 440 VMs, this happened 58 times, but mostly near the end of the migration.

I thought I'd bring it up in case there is a bug in the port allocation, where an "already in use" port is being selected incorrectly.
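To illustrate what I mean by that, here is a toy sketch of a generic check-then-bind race on a small port range; this is not the actual Proxmox or QEMU allocation code, just a demonstration of how a non-atomic "is it free? then bind it" scheme can hand two workers the same port:

#!/usr/bin/env python3
# Toy illustration only: if "is the port free?" and "bind the port" are not
# atomic, two workers can be handed the same port from a small range
# (e.g. 60000-60050), and the second bind then fails with
# "Address already in use", like the QEMU error above.
import socket

def port_looks_free(port):
    # Probe only: nothing stays bound after this returns, so the port is
    # not actually reserved for the caller.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        return s.connect_ex(("127.0.0.1", port)) != 0

def pick_port(start=60000, end=60050):
    for port in range(start, end + 1):
        if port_looks_free(port):
            return port
    raise RuntimeError("no free port in range")

a = pick_port()   # worker 1 picks a port...
b = pick_port()   # ...worker 2 picks before worker 1 has bound: same port
print("picked:", a, b)

s1 = socket.socket()
s1.bind(("127.0.0.1", a))
s1.listen()
try:
    s2 = socket.socket()
    s2.bind(("127.0.0.1", b))
except OSError as err:
    print("second bind failed:", err)   # [Errno 98] Address already in use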

After 3 rounds of "migrate all", all VMs finally migrate successfully.

Any ideas would be useful.

Thanks!

Eric
 
Hi,
there is a bug report for this: https://bugzilla.proxmox.com/show_bug.cgi?id=4501 but it only happens with the insecure migration setting. Please read the bug report and re-open it if you really need this and can't/don't want to use the default, secure migration type.
 
Hello Fiona,

if you use secure migration, it already limits network throughput quite a lot.
In my test cluster at home, I get a maximum of 18 GBit of throughput with secure migrations.
If I switch to insecure, I get the full 40 GBit line speed.

My customers often have 100 GBit networks, so you don't want to migrate at less than 20 GBit.
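For context, switching between the two modes is a one-line change in /etc/pve/datacenter.cfg; the CIDR below is only an example matching the migration address in the log above and needs to be adapted to your own migration network:

migration: type=insecure,network=172.26.0.0/24

Setting type=secure (or removing the line) returns to the default tunneled mode.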
 
I think this is a bug that has to be fixed quickly. Normally you don't get into this kind of trouble, but with the new maintenance mode Proxmox itself uses an operation where many workloads are migrated at once. And of course I want to use the maximum bandwidth of my migration link. In the future this will become even more relevant, because CPU power keeps increasing and you can host many more workloads on a single node.
 
This thread is only the second time this issue has been reported. I don't have any numbers on how many people are using mass migration with insecure, but it seems to also depend on some other factor, or at least it might be very racy; otherwise, I'd expect more people to complain. If you are experiencing the issue, feel free to re-open the bug and maybe also link back to this thread. In the bug report, @fabian thought that the SSH overhead would not be a big deal.
 
At high bandwidths this does play a role. I guess only one core is used, and that's why it limits the bandwidth so much.
But I have not had any problems with mass migrations and insecure.
 
I also can't imagine you need secure migration at all. In production you run in closed environments, and when you build local-remote clusters you use a VPN. But that is a question where input would be welcome, and it's also an important one.
 
Yes, at high bandwidths the problem with secure migration is the SSH tunnel, which sadly is not multi-threaded. Parallel migration can mitigate this problem to some extent.
 
@nightowl,
You might find this technote helpful:
The technote discusses insecure vs. secure migration and provides guidance on addressing the spurious failures, related to connection drops, that can occur with the default secure migration mode. With a simple sshd tunable, the issues were resolved. We also advise avoiding insecure migration for various reasons.
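As a general illustration of the kind of knob involved (the technote documents the exact setting and value; the parameter below is only a plausible example, not a quote from it): sshd limits concurrent unauthenticated connections via MaxStartups, which defaults to 10:30:100 and can drop new connections when many migrations open SSH sessions at the same time.

# /etc/ssh/sshd_config on the migration target -- illustrative values only;
# see the technote for the actual recommendation.
MaxStartups 100:30:200

Reload the SSH daemon on the target node after changing it.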


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
