Optimization of CPU usage during live migration.

Jun 13, 2023
Hello,
We need to migrate machines between nodes, and we need to do it quickly.
Is there any room for optimization? Are there any magic parameters?
In our case the processor seems to be the bottleneck. When the migration starts it uses the full CPU, which means we cannot saturate the disk and network interface (only ~50%), and the migration takes relatively long. Each of the VM disks is only a single TB. The server is relatively new, but we still think this is not optimal. We would like both the disks and the network to be fully saturated.
What can we do to make it faster?
Something like RDMA?
Screenshot attached.
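For rough context on "relatively long", a back-of-envelope calculation helps (a hypothetical Python sketch; the 1 TB disk size is from the post above, the throughput figure is an illustrative assumption, not a measurement from this system):

```python
# Back-of-envelope: time to move one disk at a given sustained throughput.
# The 1 TB disk size comes from the post above; the 12.5 GB/s figure is an
# illustrative assumption (a fully saturated 100 Gbit/s link), not measured.
def transfer_seconds(size_bytes: float, throughput_bytes_per_s: float) -> float:
    """Seconds needed to copy size_bytes at a constant rate."""
    return size_bytes / throughput_bytes_per_s

TB = 10**12
print(transfer_seconds(1 * TB, 12.5e9))  # -> 80.0 (seconds per disk)
```

So every GB/s left unsaturated by the CPU bottleneck adds noticeable wall-clock time per disk.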
 

Attachments

  • 20240801_proxmox_slow_migration.png (544.9 KB)
High memory and CPU utilization.
Do you have deduplication on?

Code:
zfs get dedup

Show your pool configuration, source and destination:
Code:
zpool status
 
root@s1prod:~# zfs get dedup
NAME                          PROPERTY  VALUE  SOURCE
main                          dedup     off    default
main/vm-101-disk-0            dedup     off    default
main/vm-101-disk-1            dedup     off    default
main/vm-102-disk-0            dedup     off    default
main/vm-104-disk-0            dedup     off    default
main/vm-107-disk-0            dedup     off    default
main/vm-107-disk-1            dedup     off    default
main/vm-107-disk-2            dedup     off    default
main/vm-109-disk-0            dedup     off    default
main/vm-109-disk-1            dedup     off    default
main/vm-109-disk-2            dedup     off    default
main/vm-110-disk-0            dedup     off    default
main/vm-110-disk-1            dedup     off    default
main/vm-110-disk-2            dedup     off    default
main/vm-110-disk-3            dedup     off    default
main/vm-110-disk-4            dedup     off    default
main/vm-110-disk-5            dedup     off    default
main/vm-110-disk-6            dedup     off    default
main/vm-110-disk-7            dedup     off    default
main/vm-111-disk-0            dedup     off    default
main/vm-111-disk-1            dedup     off    default
main/vm-111-disk-2            dedup     off    default
main/vm-111-disk-3            dedup     off    default
main/vm-112-disk-0            dedup     off    default
main/vm-112-disk-1            dedup     off    default
main/vm-114-disk-0            dedup     off    default
main/vm-120-disk-0            dedup     off    default
main/vm-120-disk-1            dedup     off    default
main/vm-120-disk-2            dedup     off    default
main/vm-120-disk-3            dedup     off    default
main/vm-120-disk-4            dedup     off    default
main/vm-120-disk-5            dedup     off    default
main/vm-122-disk-0            dedup     off    default
main/vm-122-disk-1            dedup     off    default
main/vm-122-disk-2            dedup     off    default
rpool                         dedup     off    default
rpool/ROOT                    dedup     off    default
rpool/ROOT/pve-1              dedup     off    default
rpool/data                    dedup     off    default
rpool/data/subvol-801-disk-0  dedup     off    default


root@s1prod:~# zpool status
  pool: main
 state: ONLINE
  scan: scrub repaired 0B in 04:26:22 with 0 errors on Sun Jul 14 04:50:23 2024
config:

        NAME                                             STATE     READ WRITE CKSUM
        main                                             ONLINE       0     0     0
          raidz1-0                                       ONLINE       0     0     0
            nvme-SAMSUNG_MZQL27T6HBLA-00A07_XXX_1-part1  ONLINE       0     0     0
            nvme-SAMSUNG_MZQL27T6HBLA-00A07_YYY-part1    ONLINE       0     0     0
            nvme-SAMSUNG_MZQL27T6HBLA-00A07_ZZZ_1-part1  ONLINE       0     0     0
            nvme-eui.36434b305750038600200000001-part1   ONLINE       0     0     0
            nvme-eui.36434b305750039700200000001-part1   ONLINE       0     0     0

errors: No known data errors

  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:18 with 0 errors on Sun Jul 14 00:24:20 2024
config:

        NAME                                             STATE     READ WRITE CKSUM
        rpool                                            ONLINE       0     0     0
          mirror-0                                       ONLINE       0     0     0
            nvme-eui.3636363054b0315800200000001-part3   ONLINE       0     0     0
            nvme-SAMSUNG_MZ1L21T9HCLS-00A07_XXX-part3    ONLINE       0     0     0

errors: No known data errors
root@s1prod:~#
 
I set the insecure option but it still starts SSH - how do I apply it correctly? We do have a direct cable connection (separate network) between the nodes.

2024-08-01 22:45:38 use dedicated network address for sending migration traffic (10.255.255.252)
2024-08-01 22:45:38 starting migration of VM 101 to node 's2back' (10.255.255.252)
2024-08-01 22:45:38 found local disk 'main:vm-101-disk-1' (attached)
2024-08-01 22:45:38 starting VM 101 on remote node 's2back'
2024-08-01 22:45:41 volume 'main:vm-101-disk-1' is 'main:vm-101-disk-0' on the target
2024-08-01 22:45:41 start remote tunnel
2024-08-01 22:45:42 ssh tunnel ver 1
2024-08-01 22:45:42 starting storage migration
2024-08-01 22:45:42 scsi0: start migration to nbd:10.255.255.252:60001:exportname=drive-scsi0
drive mirror is starting for drive-scsi0
[...]
all 'mirror' jobs are ready
2024-08-01 22:46:04 starting online/live migration on tcp:10.255.255.252:60000
2024-08-01 22:46:04 set migration capabilities
2024-08-01 22:46:04 migration downtime limit: 100 ms
2024-08-01 22:46:04 migration cachesize: 512.0 MiB
2024-08-01 22:46:04 set migration parameters
2024-08-01 22:46:04 start migrate command to tcp:10.255.255.252:60000
2024-08-01 22:46:05 average migration speed: 4.9 GiB/s - downtime 88 ms
2024-08-01 22:46:05 migration status: completed
all 'mirror' jobs are ready
drive-scsi0: Completing block job...
drive-scsi0: Completed successfully.
drive-scsi0: mirror-job finished
2024-08-01 22:46:07 stopping NBD storage migration server on target.
2024-08-01 22:46:12 migration finished successfully (duration 00:00:34)
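As an aside, the summary line in such a log can be parsed mechanically, e.g. to track speeds across runs. A minimal Python sketch (the line format is copied from the log above; the field names are my own):

```python
import re

# Extract the reported speed and downtime from a Proxmox migration log line.
# The sample line is copied verbatim from the log above.
line = "2024-08-01 22:46:05 average migration speed: 4.9 GiB/s - downtime 88 ms"

m = re.search(r"average migration speed: ([\d.]+) (\S+)/s - downtime (\d+) ms", line)
if m:
    speed = float(m.group(1))      # 4.9
    unit = m.group(2)              # "GiB"
    downtime_ms = int(m.group(3))  # 88
    print(speed, unit, downtime_ms)
```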
 
Code:
qm migrate VMxxx nodexxx --migration_type insecure


raidz1-0 ONLINE 0 0 0
  nvme-SAMSUNG_MZQL27T6HBLA-00A07_XXX_1-part1 ONLINE 0 0 0
  nvme-SAMSUNG_MZQL27T6HBLA-00A07_YYY-part1 ONLINE 0 0 0
  nvme-SAMSUNG_MZQL27T6HBLA-00A07_ZZZ_1-part1 ONLINE 0 0 0
  nvme-eui.36434b305750038600200000001-part1 ONLINE 0 0 0
  nvme-eui.36434b305750039700200000001-part1 ONLINE 0 0 0

Is nvme-eui.36434b305750039700200000001 the same disk model as the SAMSUNG_MZQL27T6HBLA drives?
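A note on applying the insecure option permanently: besides passing `--migration_type` per command, Proxmox VE reads a cluster-wide default migration type and network from `/etc/pve/datacenter.cfg`, which also makes GUI-triggered migrations use it. A sketch (the CIDR below is a guess based on the 10.255.255.x addresses in the log - adjust it to your actual migration network):

```
# /etc/pve/datacenter.cfg
migration: type=insecure,network=10.255.255.252/30
```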
 
Thank you for the advice so far - it's better: with insecure mode the total goes from 10 GB/s to 18 GB/s.
CPU is +30% - still a lot, but better.
Maybe there are some more tricks :)? I would be happy reaching 25 GB/s.

Yes, the disks are identical, but they are displayed strangely.
 

Attachments

  • 20240801_proxmox_slow_migration2.png (348.7 KB)
In my tests, a powered-off machine migrated much faster. If you can afford to shut the VM down, try that method.
 
Sorry - I did another test the same way, from S2 to S1, and the CPU consumption is similar (insecure mode is enabled); the transfer is better.
The question is still open: is there any further way to optimize the migration?
I cannot turn off the VMs.
 

Attachments

  • 20240801_proxmox_slow_migration3.png (365.9 KB)
