Specifying a one-off migration network

mort47

Hi,

I've read this is the way to migrate over a specific network:

qm migrate 100 targethost --online --migration_network 192.168.2.1/24

It doesn't work, though. PVE creates the job and then migrates over whatever the currently configured migration network is, ignoring the --migration_network option. What am I doing wrong here? Is there another way I should be doing this?

Thanks.
 
Hi,
as far as I understand it, the migration_network option expects the network address, not the address of a single host within that network. In your case that would be:
Code:
--migration_network 192.168.2.0/24
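For example, applying that to the command from your first post:
Code:
qm migrate 100 targethost --online --migration_network 192.168.2.0/24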
 
For migration to work, both nodes need an IP in the given network (yeah, no brainer) and SSH must be possible. Can you connect with SSH from the node running the VM to the target node using that network's addresses?
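For example (assuming the target node's address on that network is 192.168.2.3; substitute the real one):
Code:
ssh root@192.168.2.3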
 
All good, it's always worth checking the basics when I'm looking for something I might have missed. Yes, they're both on that network and I can SSH to the target node on that network.
 
Does it work if you use pvesh? It then doesn't matter which node you run the command on.
Code:
pvesh create /nodes/<source_node>/qemu/<vmid>/migrate -target <target_node> -online 1 -migration_network 192.168.2.0/24 -migration_type insecure
Would you like to share the network settings of the cluster nodes and the log of the migration task?
 
I tried it and it looks like it's doing the same thing:
Code:
task started by HA resource agent
2024-12-06 21:34:03 use dedicated network address for sending migration traffic (192.168.1.3)
2024-12-06 21:34:04 starting migration of VM 109 to node 'humboldt' (192.168.1.3)
2024-12-06 21:34:04 found local, replicated disk 'zfs:vm-109-disk-0' (attached)
2024-12-06 21:34:04 found local, replicated disk 'zfs:vm-109-disk-1' (attached)
2024-12-06 21:34:04 scsi0: start tracking writes using block-dirty-bitmap 'repl_scsi0'
2024-12-06 21:34:04 scsi1: start tracking writes using block-dirty-bitmap 'repl_scsi1'
2024-12-06 21:34:04 replicating disk images
2024-12-06 21:34:04 start replication job
2024-12-06 21:34:04 guest => VM 109, running => 82760
2024-12-06 21:34:04 volumes => zfs:vm-109-disk-0,zfs:vm-109-disk-1
2024-12-06 21:34:06 freeze guest filesystem
2024-12-06 21:34:06 create snapshot '__replicate_109-1_1733517244__' on zfs:vm-109-disk-0
2024-12-06 21:34:07 create snapshot '__replicate_109-1_1733517244__' on zfs:vm-109-disk-1
2024-12-06 21:34:07 thaw guest filesystem
2024-12-06 21:34:07 using secure transmission, rate limit: none
2024-12-06 21:34:07 incremental sync 'zfs:vm-109-disk-0' (__replicate_109-1_1733517002__ => __replicate_109-1_1733517244__)
2024-12-06 21:34:08 send from @__replicate_109-1_1733517002__ to zfs/vm-109-disk-0@__replicate_109-2_1733517012__ estimated size is 1.30M
2024-12-06 21:34:08 send from @__replicate_109-2_1733517012__ to zfs/vm-109-disk-0@__replicate_109-1_1733517244__ estimated size is 2.69M
2024-12-06 21:34:08 total estimated size is 3.98M
2024-12-06 21:34:08 TIME        SENT   SNAPSHOT zfs/vm-109-disk-0@__replicate_109-2_1733517012__
2024-12-06 21:34:08 TIME        SENT   SNAPSHOT zfs/vm-109-disk-0@__replicate_109-1_1733517244__
2024-12-06 21:34:08 successfully imported 'zfs:vm-109-disk-0'
2024-12-06 21:34:08 incremental sync 'zfs:vm-109-disk-1' (__replicate_109-1_1733517002__ => __replicate_109-1_1733517244__)
2024-12-06 21:34:09 send from @__replicate_109-1_1733517002__ to zfs/vm-109-disk-1@__replicate_109-2_1733517012__ estimated size is 1.38M
2024-12-06 21:34:09 send from @__replicate_109-2_1733517012__ to zfs/vm-109-disk-1@__replicate_109-1_1733517244__ estimated size is 28.3M
2024-12-06 21:34:09 total estimated size is 29.7M
2024-12-06 21:34:09 TIME        SENT   SNAPSHOT zfs/vm-109-disk-1@__replicate_109-2_1733517012__
2024-12-06 21:34:09 TIME        SENT   SNAPSHOT zfs/vm-109-disk-1@__replicate_109-1_1733517244__
2024-12-06 21:34:09 successfully imported 'zfs:vm-109-disk-1'
2024-12-06 21:34:09 delete previous replication snapshot '__replicate_109-1_1733517002__' on zfs:vm-109-disk-0
2024-12-06 21:34:09 delete previous replication snapshot '__replicate_109-1_1733517002__' on zfs:vm-109-disk-1
2024-12-06 21:34:11 (remote_finalize_local_job) delete stale replication snapshot '__replicate_109-1_1733517002__' on zfs:vm-109-disk-0
2024-12-06 21:34:11 (remote_finalize_local_job) delete stale replication snapshot '__replicate_109-1_1733517002__' on zfs:vm-109-disk-1
2024-12-06 21:34:11 end replication job
2024-12-06 21:34:11 starting VM 109 on remote node 'humboldt'
2024-12-06 21:34:14 volume 'zfs:vm-109-disk-0' is 'zfs:vm-109-disk-0' on the target
2024-12-06 21:34:14 volume 'zfs:vm-109-disk-1' is 'zfs:vm-109-disk-1' on the target
2024-12-06 21:34:14 start remote tunnel
2024-12-06 21:34:15 ssh tunnel ver 1
2024-12-06 21:34:15 starting storage migration
2024-12-06 21:34:15 scsi1: start migration to nbd:unix:/run/qemu-server/109_nbd.migrate:exportname=drive-scsi1
drive mirror re-using dirty bitmap 'repl_scsi1'
drive mirror is starting for drive-scsi1
drive-scsi1: transferred 0.0 B of 1.9 MiB (0.00%) in 0s
drive-scsi1: transferred 2.2 MiB of 2.2 MiB (100.00%) in 1s, ready
all 'mirror' jobs are ready
2024-12-06 21:34:16 scsi0: start migration to nbd:unix:/run/qemu-server/109_nbd.migrate:exportname=drive-scsi0
drive mirror re-using dirty bitmap 'repl_scsi0'
drive mirror is starting for drive-scsi0
drive-scsi0: transferred 0.0 B of 2.6 MiB (0.00%) in 0s
drive-scsi0: transferred 2.6 MiB of 2.6 MiB (100.00%) in 1s, ready
all 'mirror' jobs are ready
2024-12-06 21:34:17 switching mirror jobs to actively synced mode
drive-scsi0: switching to actively synced mode
drive-scsi1: switching to actively synced mode
drive-scsi0: successfully switched to actively synced mode
drive-scsi1: successfully switched to actively synced mode
2024-12-06 21:34:18 starting online/live migration on unix:/run/qemu-server/109.migrate
2024-12-06 21:34:18 set migration capabilities
2024-12-06 21:34:18 migration downtime limit: 100 ms
2024-12-06 21:34:18 migration cachesize: 128.0 MiB
2024-12-06 21:34:18 set migration parameters
2024-12-06 21:34:18 start migrate command to unix:/run/qemu-server/109.migrate
2024-12-06 21:34:20 migration active, transferred 107.5 MiB of 1.0 GiB VM-state, 151.9 MiB/s
2024-12-06 21:34:21 migration active, transferred 213.8 MiB of 1.0 GiB VM-state, 111.2 MiB/s
2024-12-06 21:34:22 migration active, transferred 321.2 MiB of 1.0 GiB VM-state, 95.5 MiB/s
2024-12-06 21:34:23 migration active, transferred 431.1 MiB of 1.0 GiB VM-state, 109.2 MiB/s
2024-12-06 21:34:24 migration active, transferred 542.7 MiB of 1.0 GiB VM-state, 115.1 MiB/s
2024-12-06 21:34:25 migration active, transferred 652.2 MiB of 1.0 GiB VM-state, 112.2 MiB/s
2024-12-06 21:34:26 migration active, transferred 762.2 MiB of 1.0 GiB VM-state, 112.5 MiB/s
2024-12-06 21:34:27 migration active, transferred 895.8 MiB of 1.0 GiB VM-state, 111.7 MiB/s
2024-12-06 21:34:27 average migration speed: 115.6 MiB/s - downtime 67 ms
2024-12-06 21:34:27 migration status: completed
all 'mirror' jobs are ready
drive-scsi0: Completing block job...
drive-scsi0: Completed successfully.
drive-scsi1: Completing block job...
drive-scsi1: Completed successfully.
drive-scsi0: mirror-job finished
drive-scsi1: mirror-job finished
2024-12-06 21:34:29 # /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=humboldt' -o 'UserKnownHostsFile=/etc/pve/nodes/humboldt/ssh_known_hosts' -o 'GlobalKnownHostsFile=none' root@192.168.1.3 pvesr set-state 109 \''{"local/chinstrap":{"last_sync":1733517244,"last_iteration":1733517244,"last_try":1733517244,"storeid_list":["zfs"],"last_node":"chinstrap","duration":7.04853,"fail_count":0},"local/magellanic":{"last_sync":1733517012,"last_iteration":1733517002,"last_node":"chinstrap","last_try":1733517012,"storeid_list":["zfs"],"fail_count":0,"duration":10.935445}}'\'
2024-12-06 21:34:30 stopping NBD storage migration server on target.
2024-12-06 21:34:32 issuing guest fstrim
2024-12-06 21:34:38 migration finished successfully (duration 00:00:35)
TASK OK
I've got three nodes, each with two network devices: 192.168.1.0/24 and 192.168.2.0/24. The .1 network is faster, so I've set it as my migration network, and .2 is the management network. The .1 network uses USB ethernet devices and .2 uses the internal ethernet device in each node.

One of my nodes has a kind of flaky USB controller. It disconnects maybe once a week and needs a reboot to fix it. Most of my services run on the faster .1 network, so that's kind of annoying. Unfortunately HA doesn't help, because the management network is on .2, and as long as the nodes can see one another on the management network they're happy to keep running, even though the VMs are basically unreachable because all my client devices are on .1.

I wrote a script to reboot the node when it loses its network. However, the reboot policy is set to migrate, so of course it tries to migrate the VM over the .1 network, fails repeatedly, and never reboots. That's the problem I'm trying to solve. Yes, there absolutely are workarounds that involve changing my network configuration, but I'd like to call those plan B for now.
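A minimal sketch of such a watchdog (the peer address 192.168.1.1 is a placeholder for another host on the .1 network):
Code:
#!/bin/bash
# reboot this node if the .1 network stays unreachable
PEER=192.168.1.1              # placeholder: another host on the .1 network
if ! ping -c 3 -W 2 "$PEER" > /dev/null 2>&1; then
    logger "nic-watchdog: $PEER unreachable on the .1 network, rebooting"
    systemctl reboot
fi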
 
The migration initiated by the HA manager, as seen in the log above, will always use the network configured in datacenter.cfg.
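For reference, that is the migration line in /etc/pve/datacenter.cfg, which in this setup presumably looks something like:
Code:
migration: secure,network=192.168.1.0/24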
Your goal is to initiate a manual migration over the still-working network interface and, once it has finished, trigger the reboot of the node?
 
Exactly. And whatever command I use, including the one from the docs I found, it always uses the configured migration network.
 
Is HA configured for this VM? If so, it will not work the way you want, because the migration is not performed directly; the HA resource agent is triggered instead, and it uses the configuration from datacenter.cfg.
So either disable HA for this VM, which lets you choose the network, or update your datacenter.cfg before (and after) the migration.
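A rough sketch of the first option, using VM 109 and node humboldt from the log above (groups/priorities would need to be set again when re-adding the resource):
Code:
# take the VM out of HA management so the HA resource agent no longer handles it
ha-manager remove vm:109
# migrate manually over the still-working network
qm migrate 109 humboldt --online --migration_network 192.168.2.0/24
# put it back under HA control afterwards
ha-manager add vm:109 --state started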
 
Ah ha. Yes, HA is enabled. That'll be it. Thanks. I can't try it right now because I'm doing other things but next time I look at it I'll play around with temporarily disabling HA in the script. Hopefully it'll be as simple as stopping the manager service on that one node.
 
Watch out when enabling HA again if you use groups/priorities for the nodes: the VM might be migrated straight back to the "wrong" node.
 
I'm glad to have found this post while searching for solutions to network migration issues. My situation is similar to that of mort47, involving 4 PVE nodes, which I have named Alpha, Bravo, Cobar, and NUC. The first three nodes each have 3 network cards:

1. A USB to 2.5G network card: 192.168.31.0/24
2. A built-in network card: 192.168.21.0/24
3. A dual-port 40G network card

The problem arises with the NUC, which, in addition to the above 3 network cards, also has a dual-port 10G network card connected to both the 31.0/24 and 21.0/24 networks. Since datacenter.cfg defaults to using the 31.0/24 network for migration, attempting to migrate an LXC to the NUC fails with the error could not get migration ip: multiple, different ..., because the NUC has two network cards on the 31.0/24 network (the 2.5G and the 10G).

My current workaround is to set IP aliases on the network cards intended for migration (the 31.0/24 network segment), such as:

- Alpha: 1.0.0.119
- Bravo: 1.0.0.118
- Cobar: 1.0.0.117
- NUC: 1.0.0.9

Then, I modify the datacenter.cfg to: migration: secure,network=1.0.0.0/24, which allows for normal LXC migration.
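For illustration, the alias part of that workaround can be reproduced non-persistently with iproute2 (vmbr0 is an assumed interface name; substitute the bridge/NIC that sits in the 31.0/24 network on each node):
Code:
# on Alpha: add the migration alias to the interface in 192.168.31.0/24
ip addr add 1.0.0.119/24 dev vmbr0
# repeat with 1.0.0.118 on Bravo, 1.0.0.117 on Cobar and 1.0.0.9 on the NUC,
# then point datacenter.cfg at the alias network:
#   migration: secure,network=1.0.0.0/24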

However, I am not entirely satisfied with this solution. My goal is to utilize the 40G network for migrations. Since I haven't purchased a 40G switch, I have set up a ring network and used FRRouting to connect and configure these 4 nodes:

[Attached image: diagram of the FRRouting ring network connecting the four nodes' 40G ports]

In this scenario, I can no longer use IP aliases to achieve the aforementioned setup. I’m unsure if there’s a better way to make these four dual-port 40G network cards work effectively.