VM Live Migration Failure

Feb 20, 2022
I'm attempting to live migrate a Windows 10 VM from one cluster node to another, and I consistently receive the following error:

drive-ide0: Cancelling block job
drive-ide0: Done.
2022-02-18 22:35:21 ERROR: online migrate failure - block job (mirror) error: drive-ide0: 'mirror' has been cancelled
2022-02-18 22:35:21 aborting phase 2 - cleanup resources
2022-02-18 22:35:21 migrate_cancel
2022-02-18 22:35:23 ERROR: migration finished with problems (duration 00:10:26)
TASK ERROR: migration problems

VM config:

bootdisk: ide0
cores: 1
ide0: VM-Disk:vm-105-disk-0,size=80G
memory: 8192
name: Win10Test
net0: e1000=4A:86:A9:E9:EB:5F,bridge=vmbr0,firewall=1
numa: 0
ostype: win10
rng0: source=/dev/urandom
scsihw: virtio-scsi-pci
smbios1: uuid=573ae1e1-659d-43a4-837d-06a38cec1b0e
sockets: 1
vmgenid: 703ad3d2-2c63-4f3e-abcb-50445d78ae04

pveversion

proxmox-ve: 7.1-1 (running kernel: 5.13.19-3-pve)
pve-manager: 7.1-10 (running version: 7.1-10/6ddebafe)
pve-kernel-helper: 7.1-8
pve-kernel-5.13: 7.1-6
pve-kernel-5.4: 6.4-12
pve-kernel-5.13.19-3-pve: 5.13.19-7
pve-kernel-5.4.162-1-pve: 5.4.162-2
pve-kernel-5.4.143-1-pve: 5.4.143-1
pve-kernel-5.4.128-1-pve: 5.4.128-2
pve-kernel-5.4.106-1-pve: 5.4.106-1
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph: 16.2.7
ceph-fuse: 16.2.7
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: 0.8.36+pve1
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.1
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-6
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.1-2
libpve-guest-common-perl: 4.0-3
libpve-http-server-perl: 4.1-1
libpve-storage-perl: 7.0-15
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.11-1
lxcfs: 4.0.11-pve1
novnc-pve: 1.3.0-1
proxmox-backup-client: 2.1.5-1
proxmox-backup-file-restore: 2.1.5-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-5
pve-cluster: 7.1-3
pve-container: 4.1-3
pve-docs: 7.1-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.3-4
pve-ha-manager: 3.3-3
pve-i18n: 2.6-2
pve-qemu-kvm: 6.1.1-1
pve-xtermjs: 4.16.0-1
qemu-server: 7.1-4
smartmontools: 7.2-pve2
spiceterm: 3.2-2
swtpm: 0.7.0~rc1+2
vncterm: 1.7-1
zfsutils-linux: 2.1.2-pve1
 
Is your cluster using shared storage or Ceph? Can you list some storage and physical setup details (NIC link speed, iSCSI or Ceph performance, configuration)?

Cheers,


Tmanok
Thanks for your response, Tmanok! As requested:

storage.cfg:
path /var/lib/vz
content iso,backup,vztmpl

lvmthin: local-lvm
thinpool data
vgname pve
content rootdir,images

lvm: VM-Disk
vgname VM-Disk
content images,rootdir
nodes pve
shared 0

lvm: VM-Disk-2
vgname VM-Disk-2
content images,rootdir
nodes pve
shared 0

lvm: Guest-Volumes
vgname Guest-Volumes
content rootdir,images
nodes pveNUC
shared 1

dir: backup
path /backup
content backup,images
prune-backups keep-all=1
shared 0

iperf:

iperf -s
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 128 KByte (default)
------------------------------------------------------------
[ 4] local xxx.xxx.xxx.xxx port 5001 connected with zzz.zzz.zzz.zzz port 53156
[ ID] Interval Transfer Bandwidth
[ 4] 0.0000-10.0616 sec 544 MBytes 453 Mbits/sec
[ 5] local xxx.xxx.xxx.xxx port 5001 connected with yyy.yyy.yyy.yyy port 58046
[ ID] Interval Transfer Bandwidth
[ 5] 0.0000-10.0172 sec 1.09 GBytes 935 Mbits/sec
[ 4] local xxx.xxx.xxx.xxx port 5001 connected with uuu.uuu.uuu.uuu port 52120
[ ID] Interval Transfer Bandwidth
[ 4] 0.0000-10.0332 sec 1.10 GBytes 940 Mbits/sec

iperf -c xxx.xxx.xxx.xxx
------------------------------------------------------------
Client connecting to xxx.xxx.xxx.xxx, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[ 3] local yyy.yyy.yyy.yyy port 58046 connected with xxx.xxx.xxx.xxx port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0000-10.0020 sec 1.09 GBytes 936 Mbits/sec

fio node1 (migrate VM from):

fio --name=seqwrite --filename=seqwrite.fio --refill_buffers --rw=write --direct=1 --loops=3 --ioengine=libaio --bs=1m --size=5G --runtime=60 --group_reporting
seqwrite: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=1
fio-3.25
Starting 1 process
seqwrite: Laying out IO file (1 file / 5120MiB)
Jobs: 1 (f=0): [f(1)][100.0%][w=427MiB/s][w=427 IOPS][eta 00m:00s]
seqwrite: (groupid=0, jobs=1): err= 0: pid=792965: Sun Feb 20 12:23:34 2022
write: IOPS=428, BW=428MiB/s (449MB/s)(15.0GiB/35885msec); 0 zone resets
slat (usec): min=11, max=679, avg=19.82, stdev= 6.99
clat (usec): min=1996, max=23138, avg=2193.78, stdev=735.75
lat (usec): min=2014, max=23156, avg=2213.79, stdev=735.66
clat percentiles (usec):
| 1.00th=[ 2024], 5.00th=[ 2057], 10.00th=[ 2073], 20.00th=[ 2114],
| 30.00th=[ 2114], 40.00th=[ 2147], 50.00th=[ 2147], 60.00th=[ 2180],
| 70.00th=[ 2180], 80.00th=[ 2180], 90.00th=[ 2212], 95.00th=[ 2212],
| 99.00th=[ 2311], 99.50th=[ 2999], 99.90th=[15664], 99.95th=[15795],
| 99.99th=[21103]
bw ( KiB/s): min=423936, max=454656, per=100.00%, avg=438502.76, stdev=6936.90, samples=71
iops : min= 414, max= 444, avg=428.23, stdev= 6.77, samples=71
lat (msec) : 2=0.02%, 4=99.64%, 10=0.04%, 20=0.29%, 50=0.01%
cpu : usr=6.23%, sys=1.18%, ctx=15401, majf=0, minf=11
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,15360,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
WRITE: bw=428MiB/s (449MB/s), 428MiB/s-428MiB/s (449MB/s-449MB/s), io=15.0GiB (16.1GB), run=35885-35885msec

Disk stats (read/write):
dm-5: ios=0/15500, merge=0/0, ticks=0/33264, in_queue=33264, util=99.79%, aggrios=16/15462, aggrmerge=0/144, aggrticks=56/33404, aggrin_queue=33524, aggrutil=99.63%
sda: ios=16/15462, merge=0/144, ticks=56/33404, in_queue=33524, util=99.63%

fio node2 (migrate VM to):

fio --name=seqwrite --filename=seqwrite.fio --refill_buffers --rw=write --direct=1 --loops=3 --ioengine=libaio --bs=1m --size=5G --runtime=60 --group_reporting
seqwrite: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=1
fio-3.25
Starting 1 process
seqwrite: Laying out IO file (1 file / 5120MiB)
Jobs: 1 (f=1): [W(1)][100.0%][w=2112MiB/s][w=2112 IOPS][eta 00m:00s]
seqwrite: (groupid=0, jobs=1): err= 0: pid=1598776: Sun Feb 20 12:22:06 2022
write: IOPS=2095, BW=2095MiB/s (2197MB/s)(15.0GiB/7331msec); 0 zone resets
slat (usec): min=9, max=221, avg=13.05, stdev= 2.83
clat (usec): min=303, max=3393, avg=332.01, stdev=141.42
lat (usec): min=316, max=3406, avg=345.15, stdev=141.49
clat percentiles (usec):
| 1.00th=[ 306], 5.00th=[ 306], 10.00th=[ 306], 20.00th=[ 310],
| 30.00th=[ 310], 40.00th=[ 310], 50.00th=[ 314], 60.00th=[ 326],
| 70.00th=[ 334], 80.00th=[ 338], 90.00th=[ 355], 95.00th=[ 363],
| 99.00th=[ 412], 99.50th=[ 578], 99.90th=[ 3064], 99.95th=[ 3195],
| 99.99th=[ 3359]
bw ( MiB/s): min= 1994, max= 2136, per=100.00%, avg=2096.14, stdev=41.16, samples=14
iops : min= 1994, max= 2136, avg=2096.14, stdev=41.16, samples=14
lat (usec) : 500=99.39%, 750=0.22%, 1000=0.10%
lat (msec) : 2=0.01%, 4=0.27%
cpu : usr=28.47%, sys=2.17%, ctx=15455, majf=0, minf=13
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,15360,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
WRITE: bw=2095MiB/s (2197MB/s), 2095MiB/s-2095MiB/s (2197MB/s-2197MB/s), io=15.0GiB (16.1GB), run=7331-7331msec

Disk stats (read/write):
dm-1: ios=0/15098, merge=0/0, ticks=0/4764, in_queue=4764, util=98.67%, aggrios=4/30754, aggrmerge=0/99, aggrticks=9/8058, aggrin_queue=8067, aggrutil=98.11%
nvme0n1: ios=4/30754, merge=0/99, ticks=9/8058, in_queue=8067, util=98.11%
 
Hi Wbanta99,

Thanks for sharing further details. To better understand, can you answer the following questions? I'm a little confused based on the config file alone.
  1. What are the names of your nodes?
  2. I did not see any external network attached storage (e.g. iSCSI), can you describe how you are attempting to migrate your VMs?
Your storage appears to be more than capable for this process, as does your networking. Perhaps I'm missing something more obvious about your shared storage, or you may not have any shared storage, which would prevent live VM migration.

This may already be obvious to you, but you need a common storage location hosted by a non-cluster member (such as a NAS) for PVE to move a VM between nodes without rebooting it. Shared storage lets the hypervisor hand over the processing and memory of a VM to another hypervisor without also having to transfer the entire disk, which would otherwise require pausing I/O to prevent data loss. Look at the table in this section of the documentation and examine the "Shared" column to identify appropriate file system types.
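As a rough illustration (the storage name, server address and export path below are hypothetical, not taken from your setup), a shared NFS storage entry in /etc/pve/storage.cfg would look something like this; NFS is treated as shared by default, so every node in the cluster sees the same disk images:

nfs: shared-vmstore
path /mnt/pve/shared-vmstore
server 192.168.0.50
export /export/vmstore
content images,rootdir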

Thanks, WBanta99. If the final paragraph was already known to you (pardon me), just send me the node names and perhaps a screenshot of your "Datacentre > Storage" page with a brief description of what you are trying to do.


Tmanok
 
Edit because of misinformation; see next post!

I only see LVM storage in your storage.cfg. AFAIK, live migration doesn't work with LVM.

For live migration without shared storage, the source and target storages for the VM need to be ZFS on both nodes and named exactly the same.

To avoid copying the full disks between the nodes every time, you can set up replication for the specific VMs.
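As a rough sketch (the storage name, pool and schedule below are assumptions, not taken from this thread), both nodes would carry a zfspool entry with an identical name in /etc/pve/storage.cfg, and a replication job keeps a current copy on the target so a migration only has to send the latest delta:

zfspool: local-zfs
pool rpool/data
content images,rootdir
nodes pve,pveNUC
sparse 1

# replicate VM 105 to pveNUC every 15 minutes (hypothetical schedule)
pvesr create-local-job 105-0 pveNUC --schedule '*/15'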
 
Live migration with local disks works for most storage types (there are some restrictions: disks not currently used by the VM have to be offline-migrated, and snapshots are not supported for live migration). Please post the full task log and journal contents for the timespan of the migration from both sides, as well as pveversion -v from both the source and target node.
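For reference, and assuming VMID 105, the node names from this thread and Guest-Volumes as an example target storage, such a migration can also be started from the CLI roughly as follows, and the journal for the relevant window can be collected with journalctl:

qm migrate 105 pveNUC --online --with-local-disks --targetstorage Guest-Volumes
journalctl --since "2022-02-18 22:24" --until "2022-02-18 22:36" > migration-journal-source.txt
pveversion -v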
 
No worries, just wanted to clarify it - local disk support for live migration was experimental and limited for quite a while ;)
 
Thanks again, Tmanok!

The confusing aspect of this is that I'd already live migrated several VMs (Ubuntu, between the same cluster members), so I thought my config was solid. At any rate, here's the info you requested:

nodes: pve and pveNUC

[Screenshot of the Datacenter > Storage page]
 
Can you migrate this Test VM offline?
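For example (assuming VMID 105), an offline migration from the CLI would be roughly the following; the same can be done from the GUI Migrate button with the VM powered off:

qm shutdown 105
qm migrate 105 pveNUC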
 
