VM Live Migration Failure

Feb 20, 2022
I'm attempting to live migrate a Windows 10 VM from one cluster node to another, and I consistently receive the following error:

drive-ide0: Cancelling block job
drive-ide0: Done.
2022-02-18 22:35:21 ERROR: online migrate failure - block job (mirror) error: drive-ide0: 'mirror' has been cancelled
2022-02-18 22:35:21 aborting phase 2 - cleanup resources
2022-02-18 22:35:21 migrate_cancel
2022-02-18 22:35:23 ERROR: migration finished with problems (duration 00:10:26)
TASK ERROR: migration problems

VM config:

bootdisk: ide0
cores: 1
ide0: VM-Disk:vm-105-disk-0,size=80G
memory: 8192
name: Win10Test
net0: e1000=4A:86:A9:E9:EB:5F,bridge=vmbr0,firewall=1
numa: 0
ostype: win10
rng0: source=/dev/urandom
scsihw: virtio-scsi-pci
smbios1: uuid=573ae1e1-659d-43a4-837d-06a38cec1b0e
sockets: 1
vmgenid: 703ad3d2-2c63-4f3e-abcb-50445d78ae04

pveversion

proxmox-ve: 7.1-1 (running kernel: 5.13.19-3-pve)
pve-manager: 7.1-10 (running version: 7.1-10/6ddebafe)
pve-kernel-helper: 7.1-8
pve-kernel-5.13: 7.1-6
pve-kernel-5.4: 6.4-12
pve-kernel-5.13.19-3-pve: 5.13.19-7
pve-kernel-5.4.162-1-pve: 5.4.162-2
pve-kernel-5.4.143-1-pve: 5.4.143-1
pve-kernel-5.4.128-1-pve: 5.4.128-2
pve-kernel-5.4.106-1-pve: 5.4.106-1
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph: 16.2.7
ceph-fuse: 16.2.7
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: 0.8.36+pve1
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.1
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-6
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.1-2
libpve-guest-common-perl: 4.0-3
libpve-http-server-perl: 4.1-1
libpve-storage-perl: 7.0-15
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.11-1
lxcfs: 4.0.11-pve1
novnc-pve: 1.3.0-1
proxmox-backup-client: 2.1.5-1
proxmox-backup-file-restore: 2.1.5-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-5
pve-cluster: 7.1-3
pve-container: 4.1-3
pve-docs: 7.1-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.3-4
pve-ha-manager: 3.3-3
pve-i18n: 2.6-2
pve-qemu-kvm: 6.1.1-1
pve-xtermjs: 4.16.0-1
qemu-server: 7.1-4
smartmontools: 7.2-pve2
spiceterm: 3.2-2
swtpm: 0.7.0~rc1+2
vncterm: 1.7-1
zfsutils-linux: 2.1.2-pve1
 
Is your cluster using shared storage or Ceph? Can you list some storage and physical setup details (NIC link speed, iSCSI or Ceph performance, configuration)?

Cheers,


Tmanok
Thanks for your response, Tmanok! As requested:

storage.cfg:
path /var/lib/vz
content iso,backup,vztmpl

lvmthin: local-lvm
thinpool data
vgname pve
content rootdir,images

lvm: VM-Disk
vgname VM-Disk
content images,rootdir
nodes pve
shared 0

lvm: VM-Disk-2
vgname VM-Disk-2
content images,rootdir
nodes pve
shared 0

lvm: Guest-Volumes
vgname Guest-Volumes
content rootdir,images
nodes pveNUC
shared 1

dir: backup
path /backup
content backup,images
prune-backups keep-all=1
shared 0

iperf:

iperf -s
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 128 KByte (default)
------------------------------------------------------------
[ 4] local xxx.xxx.xxx.xxx port 5001 connected with zzz.zzz.zzz.zzz port 53156
[ ID] Interval Transfer Bandwidth
[ 4] 0.0000-10.0616 sec 544 MBytes 453 Mbits/sec
[ 5] local xxx.xxx.xxx.xxx port 5001 connected with yyy.yyy.yyy.yyy port 58046
[ ID] Interval Transfer Bandwidth
[ 5] 0.0000-10.0172 sec 1.09 GBytes 935 Mbits/sec
[ 4] local xxx.xxx.xxx.xxx port 5001 connected with uuu.uuu.uuu.uuu port 52120
[ ID] Interval Transfer Bandwidth
[ 4] 0.0000-10.0332 sec 1.10 GBytes 940 Mbits/sec

iperf -c xxx.xxx.xxx.xxx
------------------------------------------------------------
Client connecting to xxx.xxx.xxx.xxx, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[ 3] local yyy.yyy.yyy.yyy port 58046 connected with xxx.xxx.xxx.xxx port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0000-10.0020 sec 1.09 GBytes 936 Mbits/sec

fio node1 (migrate VM from):

fio --name=seqwrite --filename=seqwrite.fio --refill_buffers --rw=write --direct=1 --loops=3 --ioengine=libaio --bs=1m --size=5G --runtime=60 --group_reporting
seqwrite: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=1
fio-3.25
Starting 1 process
seqwrite: Laying out IO file (1 file / 5120MiB)
Jobs: 1 (f=0): [f(1)][100.0%][w=427MiB/s][w=427 IOPS][eta 00m:00s]
seqwrite: (groupid=0, jobs=1): err= 0: pid=792965: Sun Feb 20 12:23:34 2022
write: IOPS=428, BW=428MiB/s (449MB/s)(15.0GiB/35885msec); 0 zone resets
slat (usec): min=11, max=679, avg=19.82, stdev= 6.99
clat (usec): min=1996, max=23138, avg=2193.78, stdev=735.75
lat (usec): min=2014, max=23156, avg=2213.79, stdev=735.66
clat percentiles (usec):
| 1.00th=[ 2024], 5.00th=[ 2057], 10.00th=[ 2073], 20.00th=[ 2114],
| 30.00th=[ 2114], 40.00th=[ 2147], 50.00th=[ 2147], 60.00th=[ 2180],
| 70.00th=[ 2180], 80.00th=[ 2180], 90.00th=[ 2212], 95.00th=[ 2212],
| 99.00th=[ 2311], 99.50th=[ 2999], 99.90th=[15664], 99.95th=[15795],
| 99.99th=[21103]
bw ( KiB/s): min=423936, max=454656, per=100.00%, avg=438502.76, stdev=6936.90, samples=71
iops : min= 414, max= 444, avg=428.23, stdev= 6.77, samples=71
lat (msec) : 2=0.02%, 4=99.64%, 10=0.04%, 20=0.29%, 50=0.01%
cpu : usr=6.23%, sys=1.18%, ctx=15401, majf=0, minf=11
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,15360,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
WRITE: bw=428MiB/s (449MB/s), 428MiB/s-428MiB/s (449MB/s-449MB/s), io=15.0GiB (16.1GB), run=35885-35885msec

Disk stats (read/write):
dm-5: ios=0/15500, merge=0/0, ticks=0/33264, in_queue=33264, util=99.79%, aggrios=16/15462, aggrmerge=0/144, aggrticks=56/33404, aggrin_queue=33524, aggrutil=99.63%
sda: ios=16/15462, merge=0/144, ticks=56/33404, in_queue=33524, util=99.63%

fio node2 (migrate VM to):

fio --name=seqwrite --filename=seqwrite.fio --refill_buffers --rw=write --direct=1 --loops=3 --ioengine=libaio --bs=1m --size=5G --runtime=60 --group_reporting
seqwrite: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=1
fio-3.25
Starting 1 process
seqwrite: Laying out IO file (1 file / 5120MiB)
Jobs: 1 (f=1): [W(1)][100.0%][w=2112MiB/s][w=2112 IOPS][eta 00m:00s]
seqwrite: (groupid=0, jobs=1): err= 0: pid=1598776: Sun Feb 20 12:22:06 2022
write: IOPS=2095, BW=2095MiB/s (2197MB/s)(15.0GiB/7331msec); 0 zone resets
slat (usec): min=9, max=221, avg=13.05, stdev= 2.83
clat (usec): min=303, max=3393, avg=332.01, stdev=141.42
lat (usec): min=316, max=3406, avg=345.15, stdev=141.49
clat percentiles (usec):
| 1.00th=[ 306], 5.00th=[ 306], 10.00th=[ 306], 20.00th=[ 310],
| 30.00th=[ 310], 40.00th=[ 310], 50.00th=[ 314], 60.00th=[ 326],
| 70.00th=[ 334], 80.00th=[ 338], 90.00th=[ 355], 95.00th=[ 363],
| 99.00th=[ 412], 99.50th=[ 578], 99.90th=[ 3064], 99.95th=[ 3195],
| 99.99th=[ 3359]
bw ( MiB/s): min= 1994, max= 2136, per=100.00%, avg=2096.14, stdev=41.16, samples=14
iops : min= 1994, max= 2136, avg=2096.14, stdev=41.16, samples=14
lat (usec) : 500=99.39%, 750=0.22%, 1000=0.10%
lat (msec) : 2=0.01%, 4=0.27%
cpu : usr=28.47%, sys=2.17%, ctx=15455, majf=0, minf=13
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,15360,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
WRITE: bw=2095MiB/s (2197MB/s), 2095MiB/s-2095MiB/s (2197MB/s-2197MB/s), io=15.0GiB (16.1GB), run=7331-7331msec

Disk stats (read/write):
dm-1: ios=0/15098, merge=0/0, ticks=0/4764, in_queue=4764, util=98.67%, aggrios=4/30754, aggrmerge=0/99, aggrticks=9/8058, aggrin_queue=8067, aggrutil=98.11%
nvme0n1: ios=4/30754, merge=0/99, ticks=9/8058, in_queue=8067, util=98.11%
 
Hi Wbanta99,

Thanks for sharing further details. To better understand, can you answer the following questions? I'm a little confused based on the config file alone.
  1. What are the names of your nodes?
  2. I did not see any external network attached storage (e.g. iSCSI), can you describe how you are attempting to migrate your VMs?
Your storage appears to be more than capable for this process, as does your networking. Perhaps I'm missing something more obvious about your shared storage, or you may not have any shared storage, which would prevent live VM migration.

This may already be obvious to you, but you need a common storage location hosted by a non-cluster member (such as a NAS) for PVE to move a VM between nodes without rebooting it. Shared storage lets the hypervisor hand over the processing and memory of a VM to another hypervisor without also having to transfer the entire disk, which would otherwise require pausing I/O to prevent data loss. Look at the table in this section of the documentation and examine the "Shared" column to identify appropriate file system types.
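As a rough illustration (the storage name, server address and export path below are hypothetical, not taken from your setup), a shared NFS storage entry in /etc/pve/storage.cfg would look something like this; NFS is treated as shared by default, so every node in the cluster sees the same disk images:

nfs: shared-vmstore
path /mnt/pve/shared-vmstore
server 192.168.0.50
export /export/vmstore
content images,rootdir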

Thanks, WBanta99. If the final paragraph was already known to you (pardon me), just send me the node names and perhaps a screenshot of your "Datacentre > Storage" page with a brief description of what you are trying to do.


Tmanok
 
Edit because of misinformation; see next post!

I only see LVM storage in your storage.cfg. AFAIK, live migration doesn't work with LVM.

For live migration without shared storage, the source and target storages for the VM need to be ZFS on both nodes and named exactly the same.

To avoid copying the full disks between the nodes every time, you can set up replication for the specific VMs.
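As a rough sketch (the storage name, pool and schedule below are assumptions, not taken from this thread), both nodes would carry a zfspool entry with an identical name in /etc/pve/storage.cfg, and a replication job keeps a current copy on the target so a migration only has to send the latest delta:

zfspool: local-zfs
pool rpool/data
content images,rootdir
nodes pve,pveNUC
sparse 1

# replicate VM 105 to pveNUC every 15 minutes (hypothetical schedule)
pvesr create-local-job 105-0 pveNUC --schedule '*/15'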
 
Live migration with local disks works for most storage types (there are some restrictions: disks not currently used by the VM have to be offline-migrated, and snapshots are not supported for live migration). Please post the full task log and journal contents for the timespan of the migration from both sides, as well as pveversion -v from both the source and target node.
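For reference, and assuming VMID 105, the node names from this thread and Guest-Volumes as an example target storage, such a migration can also be started from the CLI roughly as follows, and the journal for the relevant window can be collected with journalctl:

qm migrate 105 pveNUC --online --with-local-disks --targetstorage Guest-Volumes
journalctl --since "2022-02-18 22:24" --until "2022-02-18 22:36" > migration-journal-source.txt
pveversion -v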
 
No worries, just wanted to clarify it - local disk support for live migration was experimental and limited for quite a while ;)
 
Thanks again, Tmanok!

The confusing aspect of this is that I'd already live migrated several VMs (Ubuntu, between the same cluster members), so I thought my config was solid. At any rate, here's the info you requested:

nodes: pve and pveNUC

[Screenshot of the Datacenter > Storage page]
 
Can you migrate this Test VM offline?
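For example (assuming VMID 105), an offline migration from the CLI would be roughly the following; the same can be done from the GUI Migrate button with the VM powered off:

qm shutdown 105
qm migrate 105 pveNUC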
 
