[SOLVED] Replication doesn't speed up migration (6.2 community edition)

ianr-ks

Member
Aug 2, 2020
Hello all, I just created this account to ask this question as I can't find the answer elsewhere.

I have a VM set up on a 6.2 community edition cluster. It's not using HA, the cluster is quorate and all seems well. The VM has replication set up to the two other nodes. The documentation suggests that replication makes migration much faster, as it only needs to push the changes since the last replication run; however, this isn't the case for me. The VM's disk is on local, thin-provisioned ZFS storage. When I migrate it, the whole 60 gig is sent over rather than just the smattering of bytes I was expecting (the VM is offline).

So far I've tried removing the copies of the VM's disc left behind on the other nodes by previous migrations, deleting the replication and waiting for that deletion to sync, then recreating the replication. I've also updated all nodes to the latest community edition of Proxmox (6.2).

Am I wrong in thinking that replication should aid migration? The docs say "Guests with replication enabled can currently only be migrated offline. Only changes since the last replication (so-called deltas) need to be transferred if the guest is migrated to a node to which it already is replicated. This reduces the time needed significantly."

Thanks

IanR
 
Well… it seems logical, but only if you perform a non-live migration. Once the guest has been shut down, it all boils down to a delta migration and a restart of the guest on the new host. However, a live migration is only possible on shared storage. You can estimate the time for such a migration to be more or less the same as the latest durations of the replications. Usually, the replications of my guests take anywhere from a couple of seconds to a few minutes, but YMMV, of course.
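
If you want to check what those last replication runs actually took from the CLI, something along these lines should do it (the --guest filter is from the pvesr docs as I recall it, so double-check with pvesr help status):

Code:
# list all configured replication jobs with their last sync time and duration
pvesr status
# or only the jobs for a single guest, e.g. VMID 101
pvesr status --guest 101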
 
actually, PVE now also supports live migration on top of replicated ZFS volumes. the process looks like this:
  1. add a dirty bitmap tracking writes on each replicated volume
  2. replicate
  3. do a regular migration, but only transfer the blocks marked as changed in the bitmap from step 1
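under the hood, step 2 is essentially an incremental ZFS send of each replicated volume, roughly like this (dataset and snapshot names are placeholders here, pvesr does all of this for you):

Code:
# snapshot the replicated volume, then send only the delta since the
# previous replication snapshot to the target node
zfs snapshot rpool/data/vm-101-disk-0@__replicate_101-0_NEW__
zfs send -i @__replicate_101-0_OLD__ rpool/data/vm-101-disk-0@__replicate_101-0_NEW__ \
  | ssh root@TARGETNODE zfs receive -F rpool/data/vm-101-disk-0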
if this is not the case for you, please post
  • pveversion -v
  • VM config
  • migration log
 
Hello both. The VM is being migrated offline; I did mention that in the original post, although only in parentheses.

Here's the pveversion output:

Code:
root@d01c02n01f:~# pveversion -v
proxmox-ve: 6.2-1 (running kernel: 5.4.44-2-pve)
pve-manager: 6.2-10 (running version: 6.2-10/a20769ed)
pve-kernel-5.4: 6.2-4
pve-kernel-helper: 6.2-4
pve-kernel-5.3: 6.1-6
pve-kernel-5.0: 6.0-11
pve-kernel-5.4.44-2-pve: 5.4.44-2
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.18-1-pve: 5.3.18-1
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve2
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.4
libpve-access-control: 6.1-2
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.1-5
libpve-guest-common-perl: 3.1-1
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.2-5
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.2-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-9
pve-cluster: 6.1-8
pve-container: 3.1-12
pve-docs: 6.2-5
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-2
pve-firmware: 3.1-1
pve-ha-manager: 3.0-9
pve-i18n: 2.1-3
pve-qemu-kvm: 5.0.0-11
pve-xtermjs: 4.3.0-1
qemu-server: 6.2-11
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.4-pve1

Here's the VM config file (this doesn't seem to include replication data):
Code:
root@d01c02n01f:/etc/pve/qemu-server# cat 101.conf
agent: 1
balloon: 2048
bootdisk: virtio0
cores: 4
ide2: none,media=cdrom
memory: 8192
name: kali2019
net0: virtio=FE:1F:41:8E:62:76,bridge=vmbr0
net1: virtio=8E:51:9F:C1:B3:DB,bridge=vmbr1
numa: 0
ostype: l26
parent: preupdate20200801
scsihw: virtio-scsi-pci
serial0: socket
smbios1: uuid=d0742553-7161-4d8d-a7e6-229c75234c46
sockets: 1
spice_enhancements: foldersharing=1
vga: qxl,memory=24
virtio0: local-zfs:vm-101-disk-0,discard=on,size=60G
vmgenid: 2c0e9b06-37d7-4796-a2ff-789c4bb9a458

[preupdate20200801]
#About to update whole of kali
agent: 1
balloon: 2048
bootdisk: virtio0
cores: 4
ide2: none,media=cdrom
memory: 8192
name: kali2019
net0: virtio=FE:1F:41:8E:62:76,bridge=vmbr0
net1: virtio=8E:51:9F:C1:B3:DB,bridge=vmbr1
numa: 0
ostype: l26
scsihw: virtio-scsi-pci
serial0: socket
smbios1: uuid=d0742553-7161-4d8d-a7e6-229c75234c46
snaptime: 1596302525
sockets: 1
spice_enhancements: foldersharing=1
vga: qxl,memory=24
virtio0: local-zfs:vm-101-disk-0,discard=on,size=60G
vmgenid: 2c0e9b06-37d7-4796-a2ff-789c4bb9a458

Here's the migration log (migrating to the node the above details were taken from):
Code:
2020-08-02 10:11:44 starting migration of VM 101 to node 'd01c02n01f' (172.16.188.121)
2020-08-02 10:11:44 found local, replicated disk 'local-zfs:vm-101-disk-0' (in current VM config)
2020-08-02 10:11:44 found local disk 'local-zfs:vm-101-disk-1' (via storage)
2020-08-02 10:11:44 replicating disk images
2020-08-02 10:11:44 start replication job
2020-08-02 10:11:44 guest => VM 101, running => 0
2020-08-02 10:11:44 volumes => local-zfs:vm-101-disk-0
2020-08-02 10:11:45 create snapshot '__replicate_101-0_1596359504__' on local-zfs:vm-101-disk-0
2020-08-02 10:11:45 using secure transmission, rate limit: none
2020-08-02 10:11:45 incremental sync 'local-zfs:vm-101-disk-0' (__replicate_101-0_1596358804__ => __replicate_101-0_1596359504__)
2020-08-02 10:11:46 rpool/data/vm-101-disk-0@__replicate_101-0_1596358804__    name    rpool/data/vm-101-disk-0@__replicate_101-0_1596358804__    -
2020-08-02 10:11:46 send from @__replicate_101-0_1596358804__ to rpool/data/vm-101-disk-0@__replicate_101-0_1596359504__ estimated size is 624B
2020-08-02 10:11:46 total estimated size is 624B
2020-08-02 10:11:46 TIME        SENT   SNAPSHOT rpool/data/vm-101-disk-0@__replicate_101-0_1596359504__
2020-08-02 10:11:46 successfully imported 'local-zfs:vm-101-disk-0'
2020-08-02 10:11:46 delete previous replication snapshot '__replicate_101-0_1596358804__' on local-zfs:vm-101-disk-0
2020-08-02 10:11:47 (remote_finalize_local_job) delete stale replication snapshot '__replicate_101-0_1596358804__' on local-zfs:vm-101-disk-0
2020-08-02 10:11:47 end replication job
2020-08-02 10:11:47 copying local disk images
2020-08-02 10:11:48 full send of rpool/data/vm-101-disk-1@__replicate_101-0_1593783001__ estimated size is 59.5G
2020-08-02 10:11:48 send from @__replicate_101-0_1593783001__ to rpool/data/vm-101-disk-1@__migration__ estimated size is 30.3G
2020-08-02 10:11:48 total estimated size is 89.8G
2020-08-02 10:11:48 TIME        SENT   SNAPSHOT rpool/data/vm-101-disk-1@__replicate_101-0_1593783001__
2020-08-02 10:11:49 10:11:49   90.9M   rpool/data/vm-101-disk-1@__replicate_101-0_1593783001__
2020-08-02 10:11:50 10:11:50    201M   rpool/data/vm-101-disk-1@__replicate_101-0_1593783001__
2020-08-02 10:11:51 10:11:51    311M   rpool/data/vm-101-disk-1@__replicate_101-0_1593783001__
2020-08-02 10:11:52 10:11:52    422M   rpool/data/vm-101-disk-1@__replicate_101-0_1593783001__
2020-08-02 10:11:53 10:11:53    532M   rpool/data/vm-101-disk-1@__replicate_101-0_1593783001__
2020-08-02 10:11:54 10:11:54    636M   rpool/data/vm-101-disk-1@__replicate_101-0_1593783001__
2020-08-02 10:11:55 10:11:55    746M   rpool/data/vm-101-disk-1@__replicate_101-0_1593783001__
2020-08-02 10:11:56 10:11:56    854M   rpool/data/vm-101-disk-1@__replicate_101-0_1593783001__
2020-08-02 10:11:57 10:11:57    965M   rpool/data/vm-101-disk-1@__replicate_101-0_1593783001__

...

2020-08-02 10:45:50 10:45:50   59.9G   rpool/data/vm-101-disk-1@__replicate_101-0_1593783001__
2020-08-02 10:45:51 10:45:51   60.0G   rpool/data/vm-101-disk-1@__replicate_101-0_1593783001__
2020-08-02 10:45:52 10:45:52   60.1G   rpool/data/vm-101-disk-1@__replicate_101-0_1593783001__
2020-08-02 10:45:52 TIME        SENT   SNAPSHOT rpool/data/vm-101-disk-1@__migration__
2020-08-02 10:45:53 10:45:53   3.02M   rpool/data/vm-101-disk-1@__migration__
2020-08-02 10:45:54 10:45:54   3.02M   rpool/data/vm-101-disk-1@__migration__
2020-08-02 10:45:55 10:45:55   3.02M   rpool/data/vm-101-disk-1@__migration__

...

2020-08-02 11:02:44 11:02:44   30.4G   rpool/data/vm-101-disk-1@__migration__
2020-08-02 11:02:45 11:02:45   30.5G   rpool/data/vm-101-disk-1@__migration__
2020-08-02 11:03:06 successfully imported 'local-zfs:vm-101-disk-1'
2020-08-02 11:03:06 volume 'local-zfs:vm-101-disk-1' is 'local-zfs:vm-101-disk-1' on the target
2020-08-02 11:03:07 # /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=d01c02n01f' root@172.16.188.121 pvesr set-state 101 \''{"local/d01c02n03":{"last_try":1596358800,"storeid_list":["local-zfs"],"last_sync":1596358800,"last_iteration":1596358800,"last_node":"d01c02n02f","fail_count":0,"duration":4.243232},"local/d01c02n02f":{"fail_count":0,"duration":2.699149,"last_iteration":1596359504,"last_node":"d01c02n02f","last_sync":1596359504,"last_try":1596359504,"storeid_list":["local-zfs"]}}'\'
2020-08-02 11:03:08 migration finished successfully (duration 00:51:24)
TASK OK

I wasn't quite expecting the changeover halfway through the migration log:

Code:
2020-08-02 10:45:52 10:45:52   60.1G   rpool/data/vm-101-disk-1@__replicate_101-0_1593783001__
2020-08-02 10:45:52 TIME        SENT   SNAPSHOT rpool/data/vm-101-disk-1@__migration__
2020-08-02 10:45:53 10:45:53   3.02M   rpool/data/vm-101-disk-1@__migration__

I don't know if there's any significance to that. There is a replication of a 60 gig image and then a migration of a 30 gig image, yet the VM itself only has one disc and it's 60 gig.

At the very top of the migration log the following two lines appear (I hadn't previously noticed these):

Code:
2020-08-02 10:11:44 found local, replicated disk 'local-zfs:vm-101-disk-0' (in current VM config)
2020-08-02 10:11:44 found local disk 'local-zfs:vm-101-disk-1' (via storage)

I suspect this might have something to do with it, but I'd already deleted the local storage discs, so it seems something is adding them back in. Do you think this is significant, and if so, how do I fix it? I only want the one 60 gig image. I think there was a previous vm-101, and the 30 gig image may be a leftover from that. I just tried to delete the second disc (disk-1) from the local storage, but was told I had to delete it from the hardware tab for the VM; however, it is not on the hardware tab for that VM or in the config files.

Thanks for any insights that can be provided.
 
if you are sure you don't need the vm-101-disk-1 disk anymore, delete it and your migration should be faster ;) migration picks up all disks associated with the VM, but replication only those referenced in the config file. so the -1 disk will not be replicated, and migration will always need to transfer it in full.
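a quick way to see the difference on the source node (commands are just illustrative, adjust storage name and VMID to yours):

Code:
# every volume the storage layer associates with VMID 101 (what migration moves)
pvesm list local-zfs --vmid 101
# only the disks actually referenced in the config (what replication syncs)
grep 'local-zfs:vm-101-disk' /etc/pve/qemu-server/101.conf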
 
if you are sure you don't need the vm-101-disk-1 disk anymore, delete it and your migration should be faster ;) migration picks up all disks associated with the VM, but replication only those referenced in the config file. so the -1 disk will not be replicated, and migration will always need to transfer it in full.

Thanks, but it wouldn't delete it. I think I have found the source of the problem, however: googling for the error message I got when attempting to delete the phantom disk prompted me to run "qm rescan --vmid 101", which added not just the phantom disc but another two discs to the machine's config. These were leftover images from when I imported the VM into Proxmox from VirtualBox as a qcow image, so I think some phantom files have been moved around as a result of those.

I'll give replication a chance to do its work then will try migration again and report back.
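
For reference, the cleanup I'm attempting looks roughly like this (the unusedN keys are whatever the rescan assigns, and the unlink syntax is from the qm manpage as I read it, so double-check before destroying anything):

Code:
# make orphaned volumes show up in the config as unusedN entries
qm rescan --vmid 101
qm config 101 | grep ^unused
# remove an unused entry and physically delete the underlying image (destructive!)
qm unlink 101 --idlist unused0 --force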
 

OK, that seems to have sorted it. The migration was picking up a bunch of old images and moving them with the VM despite those images not being in the VM's config. Once I did the qm rescan, those old images were associated with the VM config, making them visible and allowing me to delete them. The latest migration took less than 5 seconds, which is much better. I think this one can be ticked off as "done" now.
 
