VM Migration stuck

danr

New Member
Mar 2, 2021
Hi everyone,
my current setup consists of 3 nodes in a cluster, two with 160 GB of RAM and one with 192 GB. HA is not enabled yet because I'm waiting for the disks to arrive to set up Ceph.
I'm starting to migrate some VMs to free up RAM on the first node, but the migration hangs..
Code:
2021-03-01 09:43:54 starting migration of VM 109 to node 'pve2' (192.168.1.109)
2021-03-01 09:43:55 found local, replicated disk 'local-zfs:vm-109-disk-0' (in current VM config)
2021-03-01 09:43:55 found local, replicated disk 'local-zfs:vm-109-disk-1' (in current VM config)
2021-03-01 09:43:55 replicating disk images
2021-03-01 09:43:55 start replication job
2021-03-01 09:43:55 guest => VM 109, running => 0
2021-03-01 09:43:55 volumes => local-zfs:vm-109-disk-0,local-zfs:vm-109-disk-1
2021-03-01 09:43:56 create snapshot '__replicate_109-0_1614588235__' on local-zfs:vm-109-disk-0
2021-03-01 09:43:57 create snapshot '__replicate_109-0_1614588235__' on local-zfs:vm-109-disk-1
2021-03-01 09:43:59 using secure transmission, rate limit: none
2021-03-01 09:43:59 incremental sync 'local-zfs:vm-109-disk-0' (__replicate_109-0_1614359062__ => __replicate_109-0_1614588235__)
2021-03-01 09:44:01 send from @__replicate_109-0_1614359062__ to rpool/data/vm-109-disk-0@__replicate_109-0_1614588235__ estimated size is 624B
2021-03-01 09:44:01 total estimated size is 624B
2021-03-01 09:44:02 TIME        SENT   SNAPSHOT rpool/data/vm-109-disk-0@__replicate_109-0_1614588235__
2021-03-01 09:44:03 successfully imported 'local-zfs:vm-109-disk-0'
2021-03-01 09:44:03 incremental sync 'local-zfs:vm-109-disk-1' (__replicate_109-0_1614359062__ => __replicate_109-0_1614588235__)
2021-03-01 09:44:05 send from @__replicate_109-0_1614359062__ to rpool/data/vm-109-disk-1@__replicate_109-0_1614588235__ estimated size is 624B
2021-03-01 09:44:05 total estimated size is 624B
2021-03-01 09:44:06 TIME        SENT   SNAPSHOT rpool/data/vm-109-disk-1@__replicate_109-0_1614588235__
2021-03-01 09:44:08 successfully imported 'local-zfs:vm-109-disk-1'
2021-03-01 09:44:08 delete previous replication snapshot '__replicate_109-0_1614359062__' on local-zfs:vm-109-disk-0
2021-03-01 09:44:09 delete previous replication snapshot '__replicate_109-0_1614359062__' on local-zfs:vm-109-disk-1
2021-03-01 09:44:12 (remote_finalize_local_job) delete stale replication snapshot '__replicate_109-0_1614359062__' on local-zfs:vm-109-disk-0
2021-03-01 09:44:12 (remote_finalize_local_job) delete stale replication snapshot '__replicate_109-0_1614359062__' on local-zfs:vm-109-disk-1
2021-03-01 09:44:12 end replication job
It's stuck here.. I don't see anything in the log..
I also tried to create a VM on the second node and migrate it to the third (which run the exact same version), but still the same..
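For reference, this is where I looked for clues (standard PVE log locations, as far as I know):
Code:
# list the currently running PVE tasks; the stuck migration shows up here
cat /var/log/pve/tasks/active

# follow the daemon log while the migration is hanging
journalctl -f -u pvedaemon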
Versions:
Code:
root@pve:~# pveversion
pve-manager/6.3-2/22f57405 (running kernel: 5.4.73-1-pve)
root@pve2:~# pveversion
pve-manager/6.3-4/0a38c56f (running kernel: 5.4.98-1-pve)
root@pve3:~# pveversion
pve-manager/6.3-4/0a38c56f (running kernel: 5.4.98-1-pve)

Cluster status
Code:
root@pve:~# pvecm status
Cluster information
-------------------
Name:             Cluster
Config Version:   7
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Tue Mar  2 09:26:07 2021
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000001
Ring ID:          1.20a
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.1.32 (local)
0x00000002          1 192.168.1.109
0x00000003          1 192.168.1.124
 
First of all, it would be good to have exactly the same versions running.
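Something along these lines on each node should do it (assuming your package repositories are configured correctly):
Code:
# bring all packages up to the current level
apt update
apt dist-upgrade

# verify afterwards
pveversion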

The next message should be "copying local disk images", and in between PVE performs a scan of the storages. Could you please post the output of the following?
Code:
pvesm status
 
Thanks for the quick reply!
The 3 nodes were running the same version; I updated the two unused nodes to see if updating does the trick..
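To compare the nodes in more detail I also used the verbose output (if that's the right way to do it):
Code:
# run on every node and compare the package lists line by line
pveversion -v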
Here is the pvesm status output from the source node and the destination node; the migration is happening between the local-zfs storages..
Code:
root@pve:~# pvesm status
Name             Type     Status           Total            Used       Available        %
Pool1             rbd     active      1235782272               0      1235782272    0.00%
local             dir     active      3384137472        36129664      3348007808    1.07%
local-zfs     zfspool     active      3734008228       386000392      3348007836   10.34%

root@pve2:~# pvesm status
Name             Type     Status           Total            Used       Available        %
Pool1             rbd     active      1235782272               0      1235782272    0.00%
local             dir     active       420196608         1531008       418665600    0.36%
local-zfs     zfspool     active       421036952         2371316       418665636    0.56%
 
Can you try to do the migration on one of your other storages?
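For example, you could move the disks to the 'local' directory storage first and then retry (assuming one of the disks is scsi0; check with qm config 109):
Code:
# move one of the VM's disks to another storage
qm move_disk 109 scsi0 local

# then retry the offline migration
qm migrate 109 pve2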
 
I was trying to migrate from pve -> local-zfs to pve2 -> local-zfs, but it wouldn't migrate. I tried disabling the Pool1 storage (Ceph, not fully configured) and then it worked.. I don't understand why migrating from one local storage to another local storage wouldn't work....
For the moment it's solved because I disabled the Ceph storage..
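In case someone else runs into this: the scan PVE performs before "copying local disk images" apparently waits on every enabled storage, so the unreachable RBD pool blocked it. This is what I did to disable it (and how to re-enable it once Ceph is ready), if I got the flags right:
Code:
# disable the not-yet-configured RBD storage so the migration scan skips it
pvesm set Pool1 --disable 1

# re-enable it once Ceph is fully set up
pvesm set Pool1 --disable 0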
 
