VM xxx qmp command 'block-job-cancel' failed - Block job 'drive-xxx' not found

peanut42

I keep getting this error when I try to migrate a VM and I'm stuck. I've noticed that there is something fishy about the hard disk in question, "drive-virtio3". In /var/lib/vz/images/ I found the corresponding hard disk; it is 87.8 GB large. However, according to the VM's .conf, the disk is "204801M" large. For fun I edited the VM config (/etc/pve/nodes/XXX/qemu-server/) to correct the size, but after a few minutes it changed back to the strange "204801M". Where does this config get its size information from? Can somebody please guide me in the right direction?

Kind regards,
Peanut42
 
Hi,
I keep getting this error when I try to migrate a VM and I'm stuck. I've noticed that there is something fishy about the hard disk in question, "drive-virtio3". In /var/lib/vz/images/ I found the corresponding hard disk; it is 87.8 GB large.
Please post the output of
Code:
qemu-img info --output=json /var/lib/vz/images/<ID>/<disk>
stat /var/lib/vz/images/<ID>/<disk>
pveversion -v

However, according to the VM's .conf, the disk is "204801M" large. For fun I edited the VM config (/etc/pve/nodes/XXX/qemu-server/) to correct the size, but after a few minutes it changed back to the strange "204801M". Where does this config get its size information from? Can somebody please guide me in the right direction?
It's dependent on the storage, but for (most) file-based storages, the size is queried via qemu-img info.
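For reference, a quick way to pull out just that value from the JSON (a sketch; assumes jq is installed, and uses the same placeholder path as above):
Code:
qemu-img info --output=json /var/lib/vz/images/<ID>/<disk> | jq '."virtual-size"'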
 
Thank you Fabian_E!

qemu-img info --output=json /var/lib/vz/images/<ID>/<disk>:

{
"virtual-size": 214749413376,
"filename": "/var/lib/vz/images/178/vm-178-disk-1.raw",
"format": "raw",
"actual-size": 94325362688,
"dirty-flag": false
}




stat /var/lib/vz/images/<ID>/<disk>:

File: /var/lib/vz/images/178/vm-178-disk-1.raw
Size: 214749413376 Blocks: 184229224 IO Block: 4096 regular file
Device: fd01h/64769d Inode: 3410234 Links: 1
Access: (0640/-rw-r-----) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2022-02-01 10:40:20.518910772 +0100
Modify: 2021-08-09 11:34:39.038855786 +0200
Change: 2021-08-09 11:34:39.038855786 +0200
Birth: -




pveversion -v:

proxmox-ve: 6.4-1 (running kernel: 5.4.128-1-pve)
pve-manager: 6.4-13 (running version: 6.4-13/9f411e79)
pve-kernel-5.4: 6.4-11
pve-kernel-helper: 6.4-11
pve-kernel-5.0: 6.0-11
pve-kernel-5.4.157-1-pve: 5.4.157-1
pve-kernel-5.4.128-1-pve: 5.4.128-2
pve-kernel-4.15: 5.4-9
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-4.15.18-21-pve: 4.15.18-48
pve-kernel-4.15.18-10-pve: 4.15.18-32
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.1.5-pve2~bpo10+1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.22-pve2~bpo10+1
libproxmox-acme-perl: 1.1.0
libproxmox-backup-qemu0: 1.1.0-1
libpve-access-control: 6.4-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-4
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-3
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
openvswitch-switch: 2.12.3-1
proxmox-backup-client: 1.1.13-2
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.6-1
pve-cluster: 6.4-1
pve-container: 3.3-6
pve-docs: 6.4-2
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-4
pve-firmware: 3.3-2
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-6
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-2
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.6-pve1~bpo10+1
 
This is a sparse file, so it doesn't actually occupy all the space it has available.
Thank you Fabian_E!

qemu-img info --output=json /var/lib/vz/images/<ID>/<disk>:

{
"virtual-size": 214749413376,
This is the size the VM sees (and 214749413376/1024/1024=204801 MiB) and the image can grow up to this size.
"filename": "/var/lib/vz/images/178/vm-178-disk-1.raw",
"format": "raw",
"actual-size": 94325362688,
This is the size on the file system, i.e. the 87.8G you observed.
"dirty-flag": false
}
So the size handling doesn't seem to be the problem here.
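As a side note, if you want to verify the sparseness yourself, you can compare the apparent file size with the space actually allocated on disk (a sketch using GNU coreutils; the path is the one from your output):
Code:
# apparent (virtual) size vs. allocated blocks
stat --format='apparent=%s bytes, allocated=%b blocks of %B bytes' /var/lib/vz/images/178/vm-178-disk-1.raw
# apparent size in bytes
du --block-size=1 --apparent-size /var/lib/vz/images/178/vm-178-disk-1.raw
# space actually used on the file system, in bytes
du --block-size=1 /var/lib/vz/images/178/vm-178-disk-1.raw
In your case 184229224 blocks × 512 B ≈ 87.8 GiB, which matches the actual-size reported by qemu-img.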

Please post the full migration log, the VM configuration (qm config <ID>) and the storage configuration (/etc/pve/storage.cfg).
 
qm config <ID>:

agent: 1
boot: cdn
bootdisk: virtio0
cores: 2
cpu: Westmere
description:
memory: 4096
name: 1.2.3.4-BDC
net0: virtio=00:0c:29:ef:0c:f6,bridge=vmbr0,tag=1
net1: virtio=00:0c:29:ef:0c:00,bridge=vmbr0,tag=110
numa: 0
onboot: 1
ostype: wxp
smbios1: uuid=7b8d996d-47a0-43dc-82ab-41cc4b815d39
sockets: 2
virtio0: local:178/vm-178-disk-0.raw,discard=on,format=raw,size=75G
virtio3: local:178/vm-178-disk-1.raw,discard=on,format=raw,size=204801M
vmgenid: 6c4bdcee-9f52-44e9-9ba5-c807c9d574d0



/etc/pve/storage.cfg:

dir: local
path /var/lib/vz
content images,vztmpl
prune-backups keep-last=3
shared 0

cifs: srv048_Proxmox_Backup
disable
path /mnt/pve/srv048_Proxmox_Backup
server 10.10.10.48
share srv048_Proxmox_Backup
content backup
prune-backups keep-last=2
username backup

cifs: srv048_ISO
disable
path /mnt/pve/srv048_ISO
server 10.10.10.48
share iso
content iso
username iso

lvm: local-sdd
vgname local-sdd
content rootdir,images
nodes ml155
shared 0

lvm: local-sdb
vgname local-sdb
content images,rootdir
nodes ml135,ml139
shared 0

cifs: backup252
path /mnt/pve/backup252
server 10.10.252.48
share srv048_Proxmox_Backup
content backup
prune-backups keep-last=2
username toot

cifs: iso252
path /mnt/pve/iso252
server 10.10.252.48
share iso
content iso
username toot

lvm: ml101
vgname ml101
content images,rootdir
nodes ml155,ml163,ml151
shared 1
 
Please also post the full migration log, as that most likely contains more information about the actual error. Also please indicate if you are migrating to a different target storage.
 
I apologize.


I am migrating from local storage to lvm: ml101. The UI won't let me copy the entire log.


2022-02-03 13:54:43 starting migration of VM 178 to node 'ml155' (10.10.10.155)
2022-02-03 13:54:43 found local disk 'local:178/vm-178-disk-0.raw' (in current VM config)
2022-02-03 13:54:43 found local disk 'local:178/vm-178-disk-1.raw' (in current VM config)
2022-02-03 13:54:43 starting VM 178 on remote node 'ml155'
2022-02-03 13:54:47 volume 'local:178/vm-178-disk-0.raw' is 'ml101:vm-178-disk-0' on the target
2022-02-03 13:54:47 volume 'local:178/vm-178-disk-1.raw' is 'ml101:vm-178-disk-1' on the target
2022-02-03 13:54:47 start remote tunnel
2022-02-03 13:54:49 ssh tunnel ver 1
2022-02-03 13:54:49 starting storage migration
2022-02-03 13:54:49 virtio0: start migration to nbd:10.10.10.155:60001:exportname=drive-virtio0
drive mirror is starting for drive-virtio0 with bandwidth limit: 40000 KB/s
drive-virtio0: transferred 16.0 MiB of 75.0 GiB (0.02%) in 4m 13s
drive-virtio0: transferred 64.0 MiB of 75.0 GiB (0.08%) in 4m 14s
drive-virtio0: transferred 96.0 MiB of 75.0 GiB (0.12%) in 4m 15s
drive-virtio0: transferred 144.0 MiB of 75.0 GiB (0.19%) in 4m 16s
.
.
.
drive-virtio0: transferred 74.5 GiB of 75.0 GiB (99.31%) in 35m 51s
drive-virtio0: transferred 74.5 GiB of 75.0 GiB (99.35%) in 35m 52s
drive-virtio0: transferred 74.6 GiB of 75.0 GiB (99.41%) in 35m 53s
drive-virtio0: transferred 74.6 GiB of 75.0 GiB (99.45%) in 35m 54s
drive-virtio0: transferred 74.6 GiB of 75.0 GiB (99.50%) in 35m 55s
drive-virtio0: transferred 74.7 GiB of 75.0 GiB (99.56%) in 35m 56s
drive-virtio0: transferred 74.7 GiB of 75.0 GiB (99.62%) in 35m 57s
drive-virtio0: transferred 74.8 GiB of 75.0 GiB (99.66%) in 35m 58s
drive-virtio0: transferred 74.8 GiB of 75.0 GiB (99.72%) in 35m 59s
drive-virtio0: transferred 74.8 GiB of 75.0 GiB (99.76%) in 36m
drive-virtio0: transferred 74.9 GiB of 75.0 GiB (99.83%) in 36m 1s
drive-virtio0: transferred 74.9 GiB of 75.0 GiB (99.87%) in 36m 2s
drive-virtio0: transferred 75.0 GiB of 75.0 GiB (99.93%) in 36m 3s
drive-virtio0: transferred 75.0 GiB of 75.0 GiB (99.97%) in 36m 4s
drive-virtio0: transferred 75.0 GiB of 75.0 GiB (100.00%) in 36m 5s, ready
all 'mirror' jobs are ready
2022-02-03 14:30:54 virtio3: start migration to nbd:10.10.10.155:60001:exportname=drive-virtio3
drive mirror is starting for drive-virtio3 with bandwidth limit: 40000 KB/s
drive-virtio3: Cancelling block job
drive-virtio0: Cancelling block job
drive-virtio3: Done.
drive-virtio0: Done.
2022-02-03 14:30:54 ERROR: online migrate failure - block job (mirror) error: drive-virtio3: 'mirror' has been cancelled
2022-02-03 14:30:54 aborting phase 2 - cleanup resources
2022-02-03 14:30:54 migrate_cancel
2022-02-03 14:30:59 ERROR: migration finished with problems (duration 00:36:17)
TASK ERROR: migration problems
 
Ok, so I guess this is another instance of bug 3227. The issue is that the size of the newly allocated disk on the LVM storage is aligned (rounded up) to 4 MiB, so the original disk is a bit smaller than the target, which QEMU's drive mirror doesn't like. A workaround should be qm resize 178 virtio3 +3M.
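In case it helps others hitting this with different sizes, the "+3M" simply rounds the configured size up to the next 4 MiB boundary so that source and target volumes end up the same size (a sketch of the arithmetic; VM ID and disk name are the ones from this thread):
Code:
# 204801 MiB is not 4M-aligned; the next boundary is 204804 MiB
SIZE_MIB=204801
ALIGNED=$(( (SIZE_MIB + 3) / 4 * 4 ))        # -> 204804
echo "grow by $(( ALIGNED - SIZE_MIB ))M"    # -> grow by 3M
qm resize 178 virtio3 +3M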
 
Sorry to necro this, but I see bug 3227 is still open, and I'm still having this issue as well. I've got three newly-set-up nodes in a cluster, all running 8.3.2. Migrating a live VM from node 1 to 3 works, and from 3 to 1 works, but from 1 to 2 doesn't work and 3 to 2 doesn't work. I tried the qm resize trick mentioned above but that didn't seem to help. I can shut down the VM and then it migrates without issue. I believe I've seen this happen with all my VMs so it's not just a single one with the issue.

Here's the live migration log, let me know if I can provide any more detail:


Code:
2025-01-08 09:31:37 starting migration of VM 100 to node 'pve3' (10.1.1.25)
2025-01-08 09:31:37 found local disk 'local-lvm:vm-100-disk-0' (attached)
2025-01-08 09:31:37 drive 'virtio0': size of disk 'local-lvm:vm-100-disk-0' updated from 10243M to 10244M
2025-01-08 09:31:37 starting VM 100 on remote node 'pve3'
2025-01-08 09:31:40 volume 'local-lvm:vm-100-disk-0' is 'local-lvm:vm-100-disk-0' on the target
2025-01-08 09:31:40 start remote tunnel
2025-01-08 09:31:41 ssh tunnel ver 1
2025-01-08 09:31:41 starting storage migration
2025-01-08 09:31:41 virtio0: start migration to nbd:unix:/run/qemu-server/100_nbd.migrate:exportname=drive-virtio0
drive mirror is starting for drive-virtio0 with bandwidth limit: 92160 KB/s
drive-virtio0: transferred 0.0 B of 10.0 GiB (0.00%) in 0s
drive-virtio0: transferred 89.0 MiB of 10.0 GiB (0.87%) in 1s
drive-virtio0: transferred 178.0 MiB of 10.0 GiB (1.74%) in 2s
...
drive-virtio0: transferred 9.9 GiB of 10.0 GiB (98.48%) in 1m 53s
drive-virtio0: transferred 9.9 GiB of 10.0 GiB (99.34%) in 1m 54s
drive-virtio0: transferred 10.0 GiB of 10.0 GiB (100.00%) in 1m 55s, ready
all 'mirror' jobs are ready
2025-01-08 09:33:36 switching mirror jobs to actively synced mode
drive-virtio0: switching to actively synced mode
drive-virtio0: successfully switched to actively synced mode
2025-01-08 09:33:37 starting online/live migration on unix:/run/qemu-server/100.migrate
2025-01-08 09:33:37 set migration capabilities
2025-01-08 09:33:37 migration speed limit: 90.0 MiB/s
2025-01-08 09:33:37 migration downtime limit: 100 ms
2025-01-08 09:33:37 migration cachesize: 512.0 MiB
2025-01-08 09:33:37 set migration parameters
2025-01-08 09:33:37 start migrate command to unix:/run/qemu-server/100.migrate
2025-01-08 09:33:38 migration active, transferred 92.6 MiB of 4.0 GiB VM-state, 90.0 MiB/s
2025-01-08 09:33:38 xbzrle: send updates to 2973 pages in 461.2 KiB encoded memory, cache-miss 91.77%, overflow 3
...
2025-01-08 09:34:10 migration active, transferred 3.0 GiB of 4.0 GiB VM-state, 90.0 MiB/s
2025-01-08 09:34:10 xbzrle: send updates to 2973 pages in 461.2 KiB encoded memory, cache-miss 6.58%, overflow 3
2025-01-08 09:34:12 average migration speed: 117.5 MiB/s - downtime 33 ms
2025-01-08 09:34:12 migration status: completed
all 'mirror' jobs are ready
drive-virtio0: Completing block job...
drive-virtio0: Completed successfully.
drive-virtio0: Cancelling block job
drive-virtio0: Done.
2025-01-08 09:34:13 ERROR: online migrate failure - Failed to complete storage migration: block job (mirror) error: drive-virtio0: Input/output error (io-status: ok)
2025-01-08 09:34:13 aborting phase 2 - cleanup resources
2025-01-08 09:34:13 migrate_cancel
2025-01-08 09:34:16 ERROR: migration finished with problems (duration 00:02:39)
TASK ERROR: migration problems
 
Hi,
That is a different issue. The issue from bug 3227 already happens at the start of the mirror job, not at the end. Please share the output of qm config 100 as well as, from both source and target node of the migration, pveversion -v and the relevant part of the system logs/journal.

Since it only doesn't work with node 2 as the target, please check if there are any relevant differences compared to the other nodes. I'd also check the disk health, e.g. using smartctl
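For example (a sketch; adjust the time window to the failed migration and the device node to your NVMe disk), something like this on both source and target node would cover the journal excerpt and a SMART check:
Code:
journalctl --since "2025-01-08 09:31:00" --until "2025-01-08 09:35:00"
smartctl -a /dev/nvme0n1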
 
Thanks for the quick response! This was the top result while googling; not many other people seem to have had the same migration failure. Good to know the OP is a different issue.

I'm doing some more reading and I'm wondering if I should set the CPU to something other than "host" as the CPU is different between PVE nodes. All Intel, but different years/models. I don't know why that would fail at the disk mirror step, but maybe that would help?

VM config:
Code:
root@pve1:~# qm config 100
agent: 1,fstrim_cloned_disks=1
boot: order=virtio0;net0
cores: 1
cpu: host
description:
memory: 4096
meta: creation-qemu=9.0.2,ctime=1735685246
name: pfsense
net0: virtio=BC:24:11:2C:DB:D5,bridge=vmbr0
numa: 0
onboot: 1
ostype: other
scsihw: virtio-scsi-single
smbios1: uuid=90b21c6e-c79f-49ec-aa55-1760e4c28815
sockets: 1
startup: order=1,up=120
virtio0: local-lvm:vm-100-disk-0,format=raw,iothread=1,size=10244M
vmgenid: 03105849-c7a3-49e1-8974-3d661ce479a9
root@pve1:~#

pveversion -v was identical on both machines:

Code:
root@pve1:~# pveversion -v
proxmox-ve: 8.3.0 (running kernel: 6.8.12-5-pve)
pve-manager: 8.3.2 (running version: 8.3.2/3e76eec21c4a14a7)
proxmox-kernel-helper: 8.1.0
proxmox-kernel-6.8: 6.8.12-5
proxmox-kernel-6.8.12-5-pve-signed: 6.8.12-5
proxmox-kernel-6.8.12-4-pve-signed: 6.8.12-4
ceph-fuse: 17.2.7-pve3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx11
intel-microcode: 3.20241112.1~deb12u1
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-5
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.4
libpve-access-control: 8.2.0
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.10
libpve-cluster-perl: 8.0.10
libpve-common-perl: 8.2.9
libpve-guest-common-perl: 5.1.6
libpve-http-server-perl: 5.1.2
libpve-network-perl: 0.10.0
libpve-rs-perl: 0.9.1
libpve-storage-perl: 8.3.3
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.5.0-1
proxmox-backup-client: 3.3.2-1
proxmox-backup-file-restore: 3.3.2-2
proxmox-firewall: 0.6.0
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.3.1
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.7
proxmox-widget-toolkit: 4.3.3
pve-cluster: 8.0.10
pve-container: 5.2.3
pve-docs: 8.3.1
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.2
pve-firewall: 5.1.0
pve-firmware: 3.14-2
pve-ha-manager: 4.0.6
pve-i18n: 3.3.2
pve-qemu-kvm: 9.0.2-4
pve-xtermjs: 5.3.0-3
qemu-server: 8.3.3
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.6-pve1

I'm using NVMe SSDs on both PVE nodes; their SMART values in the UI look A-OK.
 
I'm doing some more reading and I'm wondering if I should set the CPU to something other than "host" as the CPU is different between PVE nodes. All Intel, but different years/models. I don't know why that would fail at the disk mirror step, but maybe that would help?
Yes, you cannot use type host then, please see: https://pve.proxmox.com/pve-docs/chapter-qm.html#_cpu_type
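For example (a sketch; x86-64-v2-AES is the default model for newly created VMs on Proxmox VE 8 and is usually a safe baseline on modern Intel CPUs, but check the linked documentation for what fits your hardware):
Code:
qm set 100 --cpu x86-64-v2-AES
The new CPU type only takes effect after the VM has been powered off and started again.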

If the problem persists please share the system logs/journal from both source and target of the migration.
 