[solved] Migrating a VM fails if it's running, but works when shut down

Elleni
I get the following error when I try to migrate a VM to another node while it is running. It migrates successfully if done while the VM is shut down.
Code:
2020-11-05 01:23:15 starting migration of VM 106 to node 'srv2' (192.168.57.61)
2020-11-05 01:23:15 found local, replicated disk 'disks:vm-106-disk-0' (in current VM config)
2020-11-05 01:23:15 scsi0: start tracking writes using block-dirty-bitmap 'repl_scsi0'
2020-11-05 01:23:15 replicating disk images
2020-11-05 01:23:15 start replication job
2020-11-05 01:23:15 guest => VM 106, running => 3269
2020-11-05 01:23:15 volumes => disks:vm-106-disk-0
2020-11-05 01:23:16 freeze guest filesystem
2020-11-05 01:23:16 create snapshot '__replicate_106-0_1604535795__' on disks:vm-106-disk-0
2020-11-05 01:23:16 thaw guest filesystem
2020-11-05 01:23:16 using secure transmission, rate limit: none
2020-11-05 01:23:16 incremental sync 'disks:vm-106-disk-0' (__replicate_106-0_1604532743__ => __replicate_106-0_1604535795__)
2020-11-05 01:23:16 send from @__replicate_106-0_1604532743__ to rea_daten/vm-106-disk-0@__replicate_106-0_1604535795__ estimated size is 20.5M
2020-11-05 01:23:16 total estimated size is 20.5M
2020-11-05 01:23:16 TIME        SENT   SNAPSHOT rea_daten/vm-106-disk-0@__replicate_106-0_1604535795__
2020-11-05 01:23:16 rea_daten/vm-106-disk-0@__replicate_106-0_1604532743__    name    rea_daten/vm-106-disk-0@__replicate_106-0_1604532743__    -
2020-11-05 01:23:17 01:23:17   14.2M   rea_daten/vm-106-disk-0@__replicate_106-0_1604535795__
2020-11-05 01:23:18 successfully imported 'disks:vm-106-disk-0'
2020-11-05 01:23:18 delete previous replication snapshot '__replicate_106-0_1604532743__' on disks:vm-106-disk-0
2020-11-05 01:23:18 (remote_finalize_local_job) delete stale replication snapshot '__replicate_106-0_1604532743__' on disks:vm-106-disk-0
2020-11-05 01:23:19 end replication job
2020-11-05 01:23:19 copying local disk images
2020-11-05 01:23:19 starting VM 106 on remote node 'srv2'
2020-11-05 01:23:19 start remote tunnel
2020-11-05 01:23:20 ssh tunnel ver 1
2020-11-05 01:23:20 starting storage migration
2020-11-05 01:23:20 scsi0: start migration to nbd:unix:/run/qemu-server/106_nbd.migrate:exportname=drive-scsi0
drive mirror re-using dirty bitmap 'repl_scsi0'
drive mirror is starting for drive-scsi0
drive-scsi0: transferred: 0 bytes remaining: 1114112 bytes total: 1114112 bytes progression: 0.00 % busy: 1 ready: 0
drive-scsi0: transferred: 1114112 bytes remaining: 0 bytes total: 1114112 bytes progression: 100.00 % busy: 0 ready: 1
all mirroring jobs are ready
2020-11-05 01:23:21 volume 'disks:vm-106-disk-0' is 'disks:vm-106-disk-0' on the target
2020-11-05 01:23:21 starting online/live migration on unix:/run/qemu-server/106.migrate
2020-11-05 01:23:21 set migration_caps
2020-11-05 01:23:21 migration speed limit: 8589934592 B/s
2020-11-05 01:23:21 migration downtime limit: 100 ms
2020-11-05 01:23:21 migration cachesize: 134217728 B
2020-11-05 01:23:21 set migration parameters
2020-11-05 01:23:21 start migrate command to unix:/run/qemu-server/106.migrate
channel 4: open failed: connect failed: open failed

channel 3: open failed: connect failed: open failed

2020-11-05 01:23:22 migration status error: failed
2020-11-05 01:23:22 ERROR: online migrate failure - aborting
2020-11-05 01:23:22 aborting phase 2 - cleanup resources
2020-11-05 01:23:22 migrate_cancel
drive-scsi0: Cancelling block job
drive-scsi0: Done.
2020-11-05 01:23:22 scsi0: removing block-dirty-bitmap 'repl_scsi0'
2020-11-05 01:23:24 ERROR: migration finished with problems (duration 00:00:09)
TASK ERROR: migration problems

I would love to know how this can be troubleshot and fixed. I tried to SSH from one node to the other by the hostnames srv1/srv2, which are resolvable by IP via resolv.conf, and by node1/node2, which are entered in the hosts file. What else can I provide or check in order to fix this?
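For reference, this is roughly how I tested the connection (a minimal sketch; hostnames as described above, run from srv1):

Bash:
# verify passwordless root SSH works from this node to the other
ssh root@srv2 true && echo "ssh to srv2 ok"
# confirm the hostnames resolve to the expected addresses
getent hosts srv1 srv2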
 
Hello,

Please post the output of the following commands:

Bash:
pveversion -v 
pvecm status
qm config <VMID>
 
pveversion -v:
Code:
proxmox-ve: 6.2-2 (running kernel: 5.4.65-1-pve)
pve-manager: 6.2-15 (running version: 6.2-15/48bd51b6)
pve-kernel-5.4: 6.2-7
pve-kernel-helper: 6.2-7
pve-kernel-5.4.65-1-pve: 5.4.65-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.5
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.2-2
libpve-guest-common-perl: 3.1-3
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.2-9
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-backup-client: 0.9.4-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.3-6
pve-cluster: 6.2-1
pve-container: 3.2-2
pve-docs: 6.2-6
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.1.0-4
pve-xtermjs: 4.7.0-2
qemu-server: 6.2-18
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.4-pve2
pvecm status:
Code:
Cluster information
-------------------
Name:             reacluster
Config Version:   3
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Thu Nov  5 08:43:33 2020
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000001
Ring ID:          1.12
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2 
Flags:            Quorate Qdevice

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
0x00000001          1    A,V,NMW 172.16.57.60 (local)
0x00000002          1    A,V,NMW 172.16.57.61
0x00000000          1            Qdevice

qm config 106:
Code:
agent: 1
balloon: 512
boot: c
bootdisk: scsi0
cores: 1
description: qmdump#map%3Ascsi0%3Adrive-scsi0%3Adisks%3A%3A
ide2: none,media=cdrom
memory: 1024
name: piHole
net0: virtio=CE:FB:22:1C:A9:1E,bridge=vmbr0
numa: 0
onboot: 1
ostype: l26
scsi0: disks:vm-106-disk-0,cache=writeback,discard=on,format=raw,size=16G,ssd=1
scsihw: virtio-scsi-pci
smbios1: uuid=bac1d50d-a5ae-4203-b9e4-9245585a9e36
sockets: 1
startup: order=3
vmgenid: e448233b-238a-4ee2-af6d-b9e72f601001
 
What I did was the following: initially I had created the nodes with an encrypted data pool where the VMs resided. When I then found out that replication does not work with encrypted pools, I separated a node without reinstalling, destroyed the encrypted pools, and re-created the cluster. Before creating the cluster I removed the other node's entry from authorized_keys, and I also ran pvecm updatecerts because on one node I could not access the web interface.

Finally, I recreated the cluster without encrypted pools and added a QDevice on a Proxmox Backup Server. Replication and migration of VMs that are shut down now work fine, but as soon as a VM is booted, the migration fails. Thanks for your assistance.
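For reference, these are roughly the commands I used for the cleanup and for checking replication afterwards (a sketch from memory):

Bash:
# redistribute SSH keys and regenerate the node certificates
pvecm updatecerts
# show the configured replication jobs and their last sync state
pvesr status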
 
Hi Fabian, thanks for your reply. I was afraid you'd tell me it would only work if I completely reinstalled Proxmox VE, because I had removed the nodes from the cluster and recreated a new one. :)

When will this be available? As I don't know how to patch the system myself, I should perhaps just wait, but if it's easy and there is a howto, I could test it.
 
It's already packaged and available on pvetest.
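For testing, enabling pvetest looks roughly like this on PVE 6.x (a sketch; adjust if you keep repositories in a sources.list.d file instead):

Bash:
# add the pvetest repository (PVE 6.x is based on Debian Buster)
echo "deb http://download.proxmox.com/debian/pve buster pvetest" >> /etc/apt/sources.list
# refresh the package index and pull in the updated packages
apt update
apt full-upgrade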
 
I tried it, but it does not work for me. A reboot is not necessary, right?
Code:
pveversion -v
proxmox-ve: 6.2-2 (running kernel: 5.4.65-1-pve)
pve-manager: 6.2-15 (running version: 6.2-15/48bd51b6)
pve-kernel-5.4: 6.2-7
pve-kernel-helper: 6.2-7
pve-kernel-5.4.65-1-pve: 5.4.65-1
pve-kernel-5.4.44-2-pve: 5.4.44-2
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.5
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.2-2
libpve-guest-common-perl: 3.1-3
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.2-9
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-backup-client: 0.9.4-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.3-6
pve-cluster: 6.2-1
pve-container: 3.2-2
pve-docs: 6.2-6
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.1.0-5
pve-xtermjs: 4.7.0-2
qemu-server: 6.2-18
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.4-pve2

Code:
2020-11-05 09:32:13 starting migration of VM 106 to node 'srv1' (192.168.57.60)
2020-11-05 09:32:13 found local, replicated disk 'disks:vm-106-disk-0' (in current VM config)
2020-11-05 09:32:13 scsi0: start tracking writes using block-dirty-bitmap 'repl_scsi0'
2020-11-05 09:32:13 replicating disk images
2020-11-05 09:32:13 start replication job
2020-11-05 09:32:13 guest => VM 106, running => 25459
2020-11-05 09:32:13 volumes => disks:vm-106-disk-0
2020-11-05 09:32:14 freeze guest filesystem
2020-11-05 09:32:14 create snapshot '__replicate_106-0_1604565133__' on disks:vm-106-disk-0
2020-11-05 09:32:14 thaw guest filesystem
2020-11-05 09:32:14 using secure transmission, rate limit: none
2020-11-05 09:32:14 incremental sync 'disks:vm-106-disk-0' (__replicate_106-0_1604565005__ => __replicate_106-0_1604565133__)
2020-11-05 09:32:15 send from @__replicate_106-0_1604565005__ to rea_daten/vm-106-disk-0@__replicate_106-0_1604565133__ estimated size is 1.85M
2020-11-05 09:32:15 total estimated size is 1.85M
2020-11-05 09:32:15 TIME        SENT   SNAPSHOT rea_daten/vm-106-disk-0@__replicate_106-0_1604565133__
2020-11-05 09:32:15 rea_daten/vm-106-disk-0@__replicate_106-0_1604565005__    name    rea_daten/vm-106-disk-0@__replicate_106-0_1604565005__    -
2020-11-05 09:32:15 successfully imported 'disks:vm-106-disk-0'
2020-11-05 09:32:15 delete previous replication snapshot '__replicate_106-0_1604565005__' on disks:vm-106-disk-0
2020-11-05 09:32:15 (remote_finalize_local_job) delete stale replication snapshot '__replicate_106-0_1604565005__' on disks:vm-106-disk-0
2020-11-05 09:32:15 end replication job
2020-11-05 09:32:15 copying local disk images
2020-11-05 09:32:15 starting VM 106 on remote node 'srv1'
2020-11-05 09:32:16 start remote tunnel
2020-11-05 09:32:17 ssh tunnel ver 1
2020-11-05 09:32:17 starting storage migration
2020-11-05 09:32:17 scsi0: start migration to nbd:unix:/run/qemu-server/106_nbd.migrate:exportname=drive-scsi0
drive mirror re-using dirty bitmap 'repl_scsi0'
drive mirror is starting for drive-scsi0
drive-scsi0: transferred: 65536 bytes remaining: 720896 bytes total: 786432 bytes progression: 8.33 % busy: 1 ready: 0
drive-scsi0: transferred: 786432 bytes remaining: 0 bytes total: 786432 bytes progression: 100.00 % busy: 0 ready: 1
all mirroring jobs are ready
2020-11-05 09:32:18 volume 'disks:vm-106-disk-0' is 'disks:vm-106-disk-0' on the target
2020-11-05 09:32:18 starting online/live migration on unix:/run/qemu-server/106.migrate
2020-11-05 09:32:18 set migration_caps
2020-11-05 09:32:18 migration speed limit: 8589934592 B/s
2020-11-05 09:32:18 migration downtime limit: 100 ms
2020-11-05 09:32:18 migration cachesize: 134217728 B
2020-11-05 09:32:18 set migration parameters
2020-11-05 09:32:18 start migrate command to unix:/run/qemu-server/106.migrate
channel 4: open failed: connect failed: open failed

2020-11-05 09:32:19 migration status error: failed
2020-11-05 09:32:19 ERROR: online migrate failure - aborting
2020-11-05 09:32:19 aborting phase 2 - cleanup resources
2020-11-05 09:32:19 migrate_cancel
drive-scsi0: Cancelling block job
channel 3: open failed: connect failed: open failed

drive-scsi0: Done.
2020-11-05 09:32:19 scsi0: removing block-dirty-bitmap 'repl_scsi0'
2020-11-05 09:32:20 ERROR: migration finished with problems (duration 00:00:07)
TASK ERROR: migration problems
 
The VM needs to be powered down and started again (so that it's running the new/fixed code).
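Concretely, for VM 106 that means (a minimal sketch):

Bash:
# stop and start the VM so a fresh QEMU process picks up the fix;
# a reboot from inside the guest keeps the old process running
qm shutdown 106
qm start 106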
 
Great, I can confirm it works. Sorry for not having known that; I am still quite new to Proxmox. Finally, how can I change back to the non-pvetest repository? Is it enough to just put a # before the pvetest line in sources.list?
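For illustration, I mean changing the entry like this (a sketch of the relevant line):

Bash:
# in /etc/apt/sources.list, the pvetest entry commented out again:
# deb http://download.proxmox.com/debian/pve buster pvetest
# afterwards, refresh the package index:
apt update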

Thanks for your quick and valuable support!
 