Specifying a one-off migration network

Hi,

I've read this is the way to migrate over a specific network:

qm migrate 100 targethost --online --migration_network 192.168.2.1/24

It doesn't work, though: PVE creates the job and then migrates over whatever the currently configured migration network is, ignoring the --migration_network option. What am I doing wrong here? Is there another way I should be doing this?

Thanks.
 
Hi,
as far as I understand it, the migration_network option expects the network address, not the address of a single host within that network. In your case that would be:
Code:
--migration_network 192.168.2.0/24
 
For migration to work, both nodes need an IP in the given network (yeah, no-brainer) and SSH must be possible between them. Can you connect with SSH from the node running the VM to the target node using that network's addresses?
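For example, something like this from the node hosting the VM should succeed without a password prompt (192.168.2.3 is just a placeholder for the target node's address on that network, substitute the real one):
Code:
# 192.168.2.3 is a placeholder for the target node's address on the migration network
ssh root@192.168.2.3 hostname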
 
All good, it's always worth checking the basics when I'm looking for something I might have missed. Yes, they're both on that network and I can SSH to the target node on that network.
 
Does it work if you use pvesh? It then doesn't matter which node you run the command on.
Code:
pvesh create /nodes/<source_node>/qemu/<vmid>/migrate -target <target_node> -online 1 -migration_network 192.168.2.0/24 -migration_type insecure
Would you like to share the network settings of the cluster nodes and the log of the migration task?
 
I tried it and it looks like it's doing the same thing:
Code:
task started by HA resource agent
2024-12-06 21:34:03 use dedicated network address for sending migration traffic (192.168.1.3)
2024-12-06 21:34:04 starting migration of VM 109 to node 'humboldt' (192.168.1.3)
2024-12-06 21:34:04 found local, replicated disk 'zfs:vm-109-disk-0' (attached)
2024-12-06 21:34:04 found local, replicated disk 'zfs:vm-109-disk-1' (attached)
2024-12-06 21:34:04 scsi0: start tracking writes using block-dirty-bitmap 'repl_scsi0'
2024-12-06 21:34:04 scsi1: start tracking writes using block-dirty-bitmap 'repl_scsi1'
2024-12-06 21:34:04 replicating disk images
2024-12-06 21:34:04 start replication job
2024-12-06 21:34:04 guest => VM 109, running => 82760
2024-12-06 21:34:04 volumes => zfs:vm-109-disk-0,zfs:vm-109-disk-1
2024-12-06 21:34:06 freeze guest filesystem
2024-12-06 21:34:06 create snapshot '__replicate_109-1_1733517244__' on zfs:vm-109-disk-0
2024-12-06 21:34:07 create snapshot '__replicate_109-1_1733517244__' on zfs:vm-109-disk-1
2024-12-06 21:34:07 thaw guest filesystem
2024-12-06 21:34:07 using secure transmission, rate limit: none
2024-12-06 21:34:07 incremental sync 'zfs:vm-109-disk-0' (__replicate_109-1_1733517002__ => __replicate_109-1_1733517244__)
2024-12-06 21:34:08 send from @__replicate_109-1_1733517002__ to zfs/vm-109-disk-0@__replicate_109-2_1733517012__ estimated size is 1.30M
2024-12-06 21:34:08 send from @__replicate_109-2_1733517012__ to zfs/vm-109-disk-0@__replicate_109-1_1733517244__ estimated size is 2.69M
2024-12-06 21:34:08 total estimated size is 3.98M
2024-12-06 21:34:08 TIME        SENT   SNAPSHOT zfs/vm-109-disk-0@__replicate_109-2_1733517012__
2024-12-06 21:34:08 TIME        SENT   SNAPSHOT zfs/vm-109-disk-0@__replicate_109-1_1733517244__
2024-12-06 21:34:08 successfully imported 'zfs:vm-109-disk-0'
2024-12-06 21:34:08 incremental sync 'zfs:vm-109-disk-1' (__replicate_109-1_1733517002__ => __replicate_109-1_1733517244__)
2024-12-06 21:34:09 send from @__replicate_109-1_1733517002__ to zfs/vm-109-disk-1@__replicate_109-2_1733517012__ estimated size is 1.38M
2024-12-06 21:34:09 send from @__replicate_109-2_1733517012__ to zfs/vm-109-disk-1@__replicate_109-1_1733517244__ estimated size is 28.3M
2024-12-06 21:34:09 total estimated size is 29.7M
2024-12-06 21:34:09 TIME        SENT   SNAPSHOT zfs/vm-109-disk-1@__replicate_109-2_1733517012__
2024-12-06 21:34:09 TIME        SENT   SNAPSHOT zfs/vm-109-disk-1@__replicate_109-1_1733517244__
2024-12-06 21:34:09 successfully imported 'zfs:vm-109-disk-1'
2024-12-06 21:34:09 delete previous replication snapshot '__replicate_109-1_1733517002__' on zfs:vm-109-disk-0
2024-12-06 21:34:09 delete previous replication snapshot '__replicate_109-1_1733517002__' on zfs:vm-109-disk-1
2024-12-06 21:34:11 (remote_finalize_local_job) delete stale replication snapshot '__replicate_109-1_1733517002__' on zfs:vm-109-disk-0
2024-12-06 21:34:11 (remote_finalize_local_job) delete stale replication snapshot '__replicate_109-1_1733517002__' on zfs:vm-109-disk-1
2024-12-06 21:34:11 end replication job
2024-12-06 21:34:11 starting VM 109 on remote node 'humboldt'
2024-12-06 21:34:14 volume 'zfs:vm-109-disk-0' is 'zfs:vm-109-disk-0' on the target
2024-12-06 21:34:14 volume 'zfs:vm-109-disk-1' is 'zfs:vm-109-disk-1' on the target
2024-12-06 21:34:14 start remote tunnel
2024-12-06 21:34:15 ssh tunnel ver 1
2024-12-06 21:34:15 starting storage migration
2024-12-06 21:34:15 scsi1: start migration to nbd:unix:/run/qemu-server/109_nbd.migrate:exportname=drive-scsi1
drive mirror re-using dirty bitmap 'repl_scsi1'
drive mirror is starting for drive-scsi1
drive-scsi1: transferred 0.0 B of 1.9 MiB (0.00%) in 0s
drive-scsi1: transferred 2.2 MiB of 2.2 MiB (100.00%) in 1s, ready
all 'mirror' jobs are ready
2024-12-06 21:34:16 scsi0: start migration to nbd:unix:/run/qemu-server/109_nbd.migrate:exportname=drive-scsi0
drive mirror re-using dirty bitmap 'repl_scsi0'
drive mirror is starting for drive-scsi0
drive-scsi0: transferred 0.0 B of 2.6 MiB (0.00%) in 0s
drive-scsi0: transferred 2.6 MiB of 2.6 MiB (100.00%) in 1s, ready
all 'mirror' jobs are ready
2024-12-06 21:34:17 switching mirror jobs to actively synced mode
drive-scsi0: switching to actively synced mode
drive-scsi1: switching to actively synced mode
drive-scsi0: successfully switched to actively synced mode
drive-scsi1: successfully switched to actively synced mode
2024-12-06 21:34:18 starting online/live migration on unix:/run/qemu-server/109.migrate
2024-12-06 21:34:18 set migration capabilities
2024-12-06 21:34:18 migration downtime limit: 100 ms
2024-12-06 21:34:18 migration cachesize: 128.0 MiB
2024-12-06 21:34:18 set migration parameters
2024-12-06 21:34:18 start migrate command to unix:/run/qemu-server/109.migrate
2024-12-06 21:34:20 migration active, transferred 107.5 MiB of 1.0 GiB VM-state, 151.9 MiB/s
2024-12-06 21:34:21 migration active, transferred 213.8 MiB of 1.0 GiB VM-state, 111.2 MiB/s
2024-12-06 21:34:22 migration active, transferred 321.2 MiB of 1.0 GiB VM-state, 95.5 MiB/s
2024-12-06 21:34:23 migration active, transferred 431.1 MiB of 1.0 GiB VM-state, 109.2 MiB/s
2024-12-06 21:34:24 migration active, transferred 542.7 MiB of 1.0 GiB VM-state, 115.1 MiB/s
2024-12-06 21:34:25 migration active, transferred 652.2 MiB of 1.0 GiB VM-state, 112.2 MiB/s
2024-12-06 21:34:26 migration active, transferred 762.2 MiB of 1.0 GiB VM-state, 112.5 MiB/s
2024-12-06 21:34:27 migration active, transferred 895.8 MiB of 1.0 GiB VM-state, 111.7 MiB/s
2024-12-06 21:34:27 average migration speed: 115.6 MiB/s - downtime 67 ms
2024-12-06 21:34:27 migration status: completed
all 'mirror' jobs are ready
drive-scsi0: Completing block job...
drive-scsi0: Completed successfully.
drive-scsi1: Completing block job...
drive-scsi1: Completed successfully.
drive-scsi0: mirror-job finished
drive-scsi1: mirror-job finished
2024-12-06 21:34:29 # /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=humboldt' -o 'UserKnownHostsFile=/etc/pve/nodes/humboldt/ssh_known_hosts' -o 'GlobalKnownHostsFile=none' root@192.168.1.3 pvesr set-state 109 \''{"local/chinstrap":{"last_sync":1733517244,"last_iteration":1733517244,"last_try":1733517244,"storeid_list":["zfs"],"last_node":"chinstrap","duration":7.04853,"fail_count":0},"local/magellanic":{"last_sync":1733517012,"last_iteration":1733517002,"last_node":"chinstrap","last_try":1733517012,"storeid_list":["zfs"],"fail_count":0,"duration":10.935445}}'\'
2024-12-06 21:34:30 stopping NBD storage migration server on target.
2024-12-06 21:34:32 issuing guest fstrim
2024-12-06 21:34:38 migration finished successfully (duration 00:00:35)
TASK OK
I've got three nodes, each with two network devices: 192.168.1.0/24 and 192.168.2.0/24. The .1 network is faster, so I've set it as my migration network, and .2 is the management network. The .1 network uses USB Ethernet devices and .2 uses the internal Ethernet device in each node.
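For reference, the migration setting in my /etc/pve/datacenter.cfg looks roughly like this (quoting from memory, and the type may be secure or insecure depending on your settings):
Code:
# /etc/pve/datacenter.cfg
migration: secure,network=192.168.1.0/24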

One of my nodes has a somewhat flaky USB controller: it disconnects maybe once a week and needs a reboot to fix it. Most of my services run on the faster .1 network, so that's annoying. Unfortunately HA doesn't help, because the management network is on .2, and as long as the nodes can see one another there they're happy to keep running, even though the VMs are effectively unreachable since all my client devices are on .1.

I wrote a script to reboot the node when it loses that network (rough sketch below). However, the shutdown policy is set to migrate, so of course it tries to migrate the VM over the .1 network, fails repeatedly, and never reboots. That's the problem I'm trying to solve. Yes, there are absolutely workarounds that involve changing my network configuration, but I'd like to call those plan B for now.
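The script is basically just a cron-driven connectivity check along these lines (interface name and ping target are illustrative placeholders, not my exact values):
Code:
#!/bin/bash
# rough sketch: reboot the node if the .1 NIC loses connectivity
# IFACE and PEER are illustrative placeholders
IFACE=eth1            # the USB NIC on the 192.168.1.0/24 network
PEER=192.168.1.1      # an address on .1 that should always answer

if ! ping -c 3 -W 2 -I "$IFACE" "$PEER" > /dev/null 2>&1; then
    logger "net-watchdog: $IFACE lost connectivity on 192.168.1.0/24, rebooting"
    reboot
fi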
 
A migration initiated by the HA manager, as seen in the log above ("task started by HA resource agent"), will always use the network configured in datacenter.cfg.
Your goal is to initiate a manual migration via the still-working network interface and, once it has finished, trigger the reboot of the node?
 
Exactly. And whatever command I use, including the one from the docs I found, it always uses the configured migration network.
 
Is HA configured for this VM? If yes, it will not work as you want, because the migration is not performed directly; instead, the HA resource agent is triggered, and it uses the configuration from datacenter.cfg.
So either temporarily disable HA for this VM, which lets you choose the network, or update your datacenter.cfg before (and revert it after) the migration. A sketch of the first option is below.
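Roughly like this, assuming the HA resource ID is vm:109 as in your log (please verify the exact request states against the ha-manager documentation first):
Code:
# take the VM out of HA management temporarily (keeps it running)
ha-manager set vm:109 --state ignored

# migrate manually over the other network
qm migrate 109 humboldt --online --migration_network 192.168.2.0/24

# hand the VM back to HA afterwards
ha-manager set vm:109 --state started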
 
Ah ha. Yes, HA is enabled, so that'll be it. Thanks. I can't try it right now because I'm doing other things, but next time I look at it I'll play around with temporarily disabling HA in the script. Hopefully it'll be as simple as stopping the manager service on that one node.
 
Watch out when enabling HA again if you use groups/priorities for the nodes: the VM might immediately be migrated back to the "wrong" node.
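You can check the current HA resource and group settings beforehand with something like:
Code:
ha-manager config
ha-manager groupconfig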
 
