Hi,
We have a 3-node cluster of relatively large nodes (2 x 64-core EPYC 7763s with 1 TiB of RAM), and we are stress testing Proxmox in large-node configurations with a large number of VMs. In our test scenario, around 440 test VMs run on the first node, and the other two nodes are empty. Proxmox 7.4 is fully updated from the non-production repo and runs the 6.2 opt-in kernel.
When testing live migration with 10 parallel workers (datacenter.cfg has max_workers set to 30), we often see this task error:
2023-05-09 22:18:26 use dedicated network address for sending migration traffic (172.26.0.2)
2023-05-09 22:18:26 starting migration of VM 244 to node 'proxmox002' (172.26.0.2)
2023-05-09 22:18:26 starting VM 244 on remote node 'proxmox002'
2023-05-09 22:18:53 [proxmox002] kvm: -incoming tcp:172.26.0.2:60000: Failed to find an available port: Address already in use
2023-05-09 22:19:08 [proxmox002] start failed: QEMU exited with code 1
2023-05-09 22:19:08 ERROR: online migrate failure - remote command failed with exit code 255
2023-05-09 22:19:08 aborting phase 2 - cleanup resources
2023-05-09 22:19:08 migrate_cancel
2023-05-09 22:19:14 ERROR: migration finished with problems (duration 00:00:49)
TASK ERROR: migration problems
It appears that Proxmox can use TCP ports 60000 through 60050 for live migration connections. With only 10 parallel transfers, I can't imagine all of those ports being in use at once.
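In case it helps anyone reproduce this, here is a minimal Python sketch for checking which ports in that range cannot be bound on the target node at a given moment. The helper name busy_ports is my own, and the 60000-60050 range is only what I inferred above. This is not how Proxmox or QEMU allocate ports, just a bind test run from outside.

```python
import socket
import sys


def busy_ports(host, first, last):
    """Return the ports in [first, last] that cannot be bound on host right now."""
    busy = []
    for port in range(first, last + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            try:
                # No SO_REUSEADDR on purpose: we want bind() to fail
                # whenever something already holds the port.
                s.bind((host, port))
            except OSError:
                busy.append(port)
    return busy


if __name__ == "__main__":
    # Assumed usage: pass the migration address of the target node,
    # e.g. 172.26.0.2; defaults to all local addresses.
    host = sys.argv[1] if len(sys.argv) > 1 else "0.0.0.0"
    print(busy_ports(host, 60000, 60050))
```

Running this in a loop on the target node during a "migrate all" should show whether the range really fills up or whether only a few ports are ever taken when the error appears.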
While migrating all 440 VMs, this happened 58 times, mostly near the end of the migration.
I thought I'd bring it up in case there is a bug in the port allocation where a port that is already in use gets selected anyway.
After 3 rounds of "migrate all", every VM eventually migrated successfully.
Any ideas would be useful.
Thanks!
Eric