[solved] Migrating a VM fails if it's running, but works when shut down

Elleni
I get the following error when I try to migrate a VM to another node while it is running. It migrates successfully if done while the VM is shut down.
Code:
2020-11-05 01:23:15 starting migration of VM 106 to node 'srv2' (192.168.57.61)
2020-11-05 01:23:15 found local, replicated disk 'disks:vm-106-disk-0' (in current VM config)
2020-11-05 01:23:15 scsi0: start tracking writes using block-dirty-bitmap 'repl_scsi0'
2020-11-05 01:23:15 replicating disk images
2020-11-05 01:23:15 start replication job
2020-11-05 01:23:15 guest => VM 106, running => 3269
2020-11-05 01:23:15 volumes => disks:vm-106-disk-0
2020-11-05 01:23:16 freeze guest filesystem
2020-11-05 01:23:16 create snapshot '__replicate_106-0_1604535795__' on disks:vm-106-disk-0
2020-11-05 01:23:16 thaw guest filesystem
2020-11-05 01:23:16 using secure transmission, rate limit: none
2020-11-05 01:23:16 incremental sync 'disks:vm-106-disk-0' (__replicate_106-0_1604532743__ => __replicate_106-0_1604535795__)
2020-11-05 01:23:16 send from @__replicate_106-0_1604532743__ to rea_daten/vm-106-disk-0@__replicate_106-0_1604535795__ estimated size is 20.5M
2020-11-05 01:23:16 total estimated size is 20.5M
2020-11-05 01:23:16 TIME        SENT   SNAPSHOT rea_daten/vm-106-disk-0@__replicate_106-0_1604535795__
2020-11-05 01:23:16 rea_daten/vm-106-disk-0@__replicate_106-0_1604532743__    name    rea_daten/vm-106-disk-0@__replicate_106-0_1604532743__    -
2020-11-05 01:23:17 01:23:17   14.2M   rea_daten/vm-106-disk-0@__replicate_106-0_1604535795__
2020-11-05 01:23:18 successfully imported 'disks:vm-106-disk-0'
2020-11-05 01:23:18 delete previous replication snapshot '__replicate_106-0_1604532743__' on disks:vm-106-disk-0
2020-11-05 01:23:18 (remote_finalize_local_job) delete stale replication snapshot '__replicate_106-0_1604532743__' on disks:vm-106-disk-0
2020-11-05 01:23:19 end replication job
2020-11-05 01:23:19 copying local disk images
2020-11-05 01:23:19 starting VM 106 on remote node 'srv2'
2020-11-05 01:23:19 start remote tunnel
2020-11-05 01:23:20 ssh tunnel ver 1
2020-11-05 01:23:20 starting storage migration
2020-11-05 01:23:20 scsi0: start migration to nbd:unix:/run/qemu-server/106_nbd.migrate:exportname=drive-scsi0
drive mirror re-using dirty bitmap 'repl_scsi0'
drive mirror is starting for drive-scsi0
drive-scsi0: transferred: 0 bytes remaining: 1114112 bytes total: 1114112 bytes progression: 0.00 % busy: 1 ready: 0
drive-scsi0: transferred: 1114112 bytes remaining: 0 bytes total: 1114112 bytes progression: 100.00 % busy: 0 ready: 1
all mirroring jobs are ready
2020-11-05 01:23:21 volume 'disks:vm-106-disk-0' is 'disks:vm-106-disk-0' on the target
2020-11-05 01:23:21 starting online/live migration on unix:/run/qemu-server/106.migrate
2020-11-05 01:23:21 set migration_caps
2020-11-05 01:23:21 migration speed limit: 8589934592 B/s
2020-11-05 01:23:21 migration downtime limit: 100 ms
2020-11-05 01:23:21 migration cachesize: 134217728 B
2020-11-05 01:23:21 set migration parameters
2020-11-05 01:23:21 start migrate command to unix:/run/qemu-server/106.migrate
channel 4: open failed: connect failed: open failed

channel 3: open failed: connect failed: open failed

2020-11-05 01:23:22 migration status error: failed
2020-11-05 01:23:22 ERROR: online migrate failure - aborting
2020-11-05 01:23:22 aborting phase 2 - cleanup resources
2020-11-05 01:23:22 migrate_cancel
drive-scsi0: Cancelling block job
drive-scsi0: Done.
2020-11-05 01:23:22 scsi0: removing block-dirty-bitmap 'repl_scsi0'
2020-11-05 01:23:24 ERROR: migration finished with problems (duration 00:00:09)
TASK ERROR: migration problems

I would love to know how this can be troubleshot and fixed. I tried to SSH from one node to the other by the hostnames srv1/srv2, which are resolvable by IP via resolv.conf, and by node1/node2, which are entered in the hosts file. What else can I provide or check in order to fix this?
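For reference, this is roughly how I tested the connection (a minimal sketch; hostnames as described above, run from srv1):

Bash:
# verify passwordless root SSH works from this node to the other
ssh root@srv2 true && echo "ssh to srv2 ok"
# confirm the hostnames resolve to the expected addresses
getent hosts srv1 srv2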
 
Hello,

Please post the output of the following commands:

Bash:
pveversion -v 
pvecm status
qm config <VMID>
 
pveversion -v:
Code:
proxmox-ve: 6.2-2 (running kernel: 5.4.65-1-pve)
pve-manager: 6.2-15 (running version: 6.2-15/48bd51b6)
pve-kernel-5.4: 6.2-7
pve-kernel-helper: 6.2-7
pve-kernel-5.4.65-1-pve: 5.4.65-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.5
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.2-2
libpve-guest-common-perl: 3.1-3
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.2-9
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-backup-client: 0.9.4-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.3-6
pve-cluster: 6.2-1
pve-container: 3.2-2
pve-docs: 6.2-6
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.1.0-4
pve-xtermjs: 4.7.0-2
qemu-server: 6.2-18
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.4-pve2
pvecm status:
Code:
Cluster information
-------------------
Name:             reacluster
Config Version:   3
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Thu Nov  5 08:43:33 2020
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000001
Ring ID:          1.12
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2 
Flags:            Quorate Qdevice

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
0x00000001          1    A,V,NMW 172.16.57.60 (local)
0x00000002          1    A,V,NMW 172.16.57.61
0x00000000          1            Qdevice

qm config 106:
Code:
agent: 1
balloon: 512
boot: c
bootdisk: scsi0
cores: 1
description: qmdump#map%3Ascsi0%3Adrive-scsi0%3Adisks%3A%3A
ide2: none,media=cdrom
memory: 1024
name: piHole
net0: virtio=CE:FB:22:1C:A9:1E,bridge=vmbr0
numa: 0
onboot: 1
ostype: l26
scsi0: disks:vm-106-disk-0,cache=writeback,discard=on,format=raw,size=16G,ssd=1
scsihw: virtio-scsi-pci
smbios1: uuid=bac1d50d-a5ae-4203-b9e4-9245585a9e36
sockets: 1
startup: order=3
vmgenid: e448233b-238a-4ee2-af6d-b9e72f601001
 
What I did was the following: initially I had created the nodes with an encrypted data pool where the VMs resided. When I then found out that replication does not work with encrypted pools, I separated a node without reinstalling, destroyed the encrypted pools, and re-created the cluster. Before creating the cluster I removed the other node's entry from authorized_keys, and I also ran pvecm updatecerts because on one node I could not access the web interface.

Finally, I recreated the cluster without encrypted pools and added a QDevice on a Proxmox Backup Server. Replication and migration of VMs that are shut down now work fine, but as soon as a VM is booted, the migration fails. Thanks for your assistance.
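For reference, these are roughly the commands I used for the cleanup and for checking replication afterwards (a sketch from memory):

Bash:
# redistribute SSH keys and regenerate the node certificates
pvecm updatecerts
# show the configured replication jobs and their last sync state
pvesr status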
 
Hi Fabian, thanks for your reply. I was afraid you'd tell me it would only work if I completely reinstalled Proxmox VE, because I had removed the nodes from the cluster and recreated a new one. :)

When will this be available? As I don't know how to patch the system myself, I should perhaps just wait, but if it's easy and there is a howto, I could test it.
 
It's already packaged and available on pvetest.
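For testing, enabling pvetest looks roughly like this on PVE 6.x (a sketch; adjust if you keep repositories in a sources.list.d file instead):

Bash:
# add the pvetest repository (PVE 6.x is based on Debian Buster)
echo "deb http://download.proxmox.com/debian/pve buster pvetest" >> /etc/apt/sources.list
# refresh the package index and pull in the updated packages
apt update
apt full-upgrade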
 
I tried it, but it does not work for me. A reboot is not necessary, right?
Code:
pveversion -v
proxmox-ve: 6.2-2 (running kernel: 5.4.65-1-pve)
pve-manager: 6.2-15 (running version: 6.2-15/48bd51b6)
pve-kernel-5.4: 6.2-7
pve-kernel-helper: 6.2-7
pve-kernel-5.4.65-1-pve: 5.4.65-1
pve-kernel-5.4.44-2-pve: 5.4.44-2
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.5
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.2-2
libpve-guest-common-perl: 3.1-3
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.2-9
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-backup-client: 0.9.4-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.3-6
pve-cluster: 6.2-1
pve-container: 3.2-2
pve-docs: 6.2-6
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.1.0-5
pve-xtermjs: 4.7.0-2
qemu-server: 6.2-18
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.4-pve2

Code:
2020-11-05 09:32:13 starting migration of VM 106 to node 'srv1' (192.168.57.60)
2020-11-05 09:32:13 found local, replicated disk 'disks:vm-106-disk-0' (in current VM config)
2020-11-05 09:32:13 scsi0: start tracking writes using block-dirty-bitmap 'repl_scsi0'
2020-11-05 09:32:13 replicating disk images
2020-11-05 09:32:13 start replication job
2020-11-05 09:32:13 guest => VM 106, running => 25459
2020-11-05 09:32:13 volumes => disks:vm-106-disk-0
2020-11-05 09:32:14 freeze guest filesystem
2020-11-05 09:32:14 create snapshot '__replicate_106-0_1604565133__' on disks:vm-106-disk-0
2020-11-05 09:32:14 thaw guest filesystem
2020-11-05 09:32:14 using secure transmission, rate limit: none
2020-11-05 09:32:14 incremental sync 'disks:vm-106-disk-0' (__replicate_106-0_1604565005__ => __replicate_106-0_1604565133__)
2020-11-05 09:32:15 send from @__replicate_106-0_1604565005__ to rea_daten/vm-106-disk-0@__replicate_106-0_1604565133__ estimated size is 1.85M
2020-11-05 09:32:15 total estimated size is 1.85M
2020-11-05 09:32:15 TIME        SENT   SNAPSHOT rea_daten/vm-106-disk-0@__replicate_106-0_1604565133__
2020-11-05 09:32:15 rea_daten/vm-106-disk-0@__replicate_106-0_1604565005__    name    rea_daten/vm-106-disk-0@__replicate_106-0_1604565005__    -
2020-11-05 09:32:15 successfully imported 'disks:vm-106-disk-0'
2020-11-05 09:32:15 delete previous replication snapshot '__replicate_106-0_1604565005__' on disks:vm-106-disk-0
2020-11-05 09:32:15 (remote_finalize_local_job) delete stale replication snapshot '__replicate_106-0_1604565005__' on disks:vm-106-disk-0
2020-11-05 09:32:15 end replication job
2020-11-05 09:32:15 copying local disk images
2020-11-05 09:32:15 starting VM 106 on remote node 'srv1'
2020-11-05 09:32:16 start remote tunnel
2020-11-05 09:32:17 ssh tunnel ver 1
2020-11-05 09:32:17 starting storage migration
2020-11-05 09:32:17 scsi0: start migration to nbd:unix:/run/qemu-server/106_nbd.migrate:exportname=drive-scsi0
drive mirror re-using dirty bitmap 'repl_scsi0'
drive mirror is starting for drive-scsi0
drive-scsi0: transferred: 65536 bytes remaining: 720896 bytes total: 786432 bytes progression: 8.33 % busy: 1 ready: 0
drive-scsi0: transferred: 786432 bytes remaining: 0 bytes total: 786432 bytes progression: 100.00 % busy: 0 ready: 1
all mirroring jobs are ready
2020-11-05 09:32:18 volume 'disks:vm-106-disk-0' is 'disks:vm-106-disk-0' on the target
2020-11-05 09:32:18 starting online/live migration on unix:/run/qemu-server/106.migrate
2020-11-05 09:32:18 set migration_caps
2020-11-05 09:32:18 migration speed limit: 8589934592 B/s
2020-11-05 09:32:18 migration downtime limit: 100 ms
2020-11-05 09:32:18 migration cachesize: 134217728 B
2020-11-05 09:32:18 set migration parameters
2020-11-05 09:32:18 start migrate command to unix:/run/qemu-server/106.migrate
channel 4: open failed: connect failed: open failed

2020-11-05 09:32:19 migration status error: failed
2020-11-05 09:32:19 ERROR: online migrate failure - aborting
2020-11-05 09:32:19 aborting phase 2 - cleanup resources
2020-11-05 09:32:19 migrate_cancel
drive-scsi0: Cancelling block job
channel 3: open failed: connect failed: open failed

drive-scsi0: Done.
2020-11-05 09:32:19 scsi0: removing block-dirty-bitmap 'repl_scsi0'
2020-11-05 09:32:20 ERROR: migration finished with problems (duration 00:00:07)
TASK ERROR: migration problems
 
The VM needs to be powered down and started again (so that it's running the new/fixed code).
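Concretely, for VM 106 that means (a minimal sketch):

Bash:
# stop and start the VM so a fresh QEMU process picks up the fix;
# a reboot from inside the guest keeps the old process running
qm shutdown 106
qm start 106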
 
Great, I can confirm it works. Sorry for not having known that; I am still quite new to Proxmox. Finally, how can I change back to the non-pvetest repository? Is it enough to just put a # before the pvetest line in sources.list?
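For illustration, I mean changing the entry like this (a sketch of the relevant line):

Bash:
# in /etc/apt/sources.list, the pvetest entry commented out again:
# deb http://download.proxmox.com/debian/pve buster pvetest
# afterwards, refresh the package index:
apt update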

Thanks for your quick and valuable support!
 