TASK ERROR: storage migration failed: block job (mirror) error: drive-efidisk0: 'mirror' has been cancelled

AngryAdm

Member
Sep 5, 2020
Dead PVE.... I wanted to move the EFI disk, hence I clicked MOVE.... I did not ask for any cancel... why do you behave like an infantile AI and cancel my job? To annoy me? Success!!

WHY?
I don't know how many times I've had to shut down a VM to move its EFI disk... as if it's in use at all... This is not acceptable!
However, moving the 1.25 TB C: drive... nooooo problem..

Can you please fix this? Just remove the feature that cancels the block job for no reason... so it can complete!





Task viewer: VM 201 - Move disk

create full clone of drive efidisk0 (PVE02-STORAGE2:201/vm-201-disk-2.raw)
drive mirror is starting for drive-efidisk0
drive-efidisk0: Cancelling block job
drive-efidisk0: Done.
Removing image: 100% complete...done.
TASK ERROR: storage migration failed: block job (mirror) error: drive-efidisk0: 'mirror' has been cancelled
 
.................





Task viewer: VM 107 - Move disk

create full clone of drive efidisk0 (PVE02-STORAGE2:107/vm-107-disk-0.raw)
drive mirror is starting for drive-efidisk0
drive-efidisk0: Cancelling block job
drive-efidisk0: Done.
Removing image: 100% complete...done.
TASK ERROR: storage migration failed: block job (mirror) error: drive-efidisk0: 'mirror' has been cancelled
 
The charade with migration continues...

Cancel? Why? Watchers? Huh? Shoot those watchers and get on!

This time the VM in question crashed as a bonus. Impressive. /s


drive-sata0: transferred 351.7 GiB of 1.2 TiB (28.36%) in 10m 19s
drive-sata0: transferred 352.3 GiB of 1.2 TiB (28.18%) in 10m 20s
drive-sata0: transferred 354.1 GiB of 1.2 TiB (28.35%) in 10m 21s
drive-sata0: transferred 354.6 GiB of 1.2 TiB (28.85%) in 10m 22s
drive-sata0: transferred 354.9 GiB of 1.2 TiB (29.27%) in 10m 23s
drive-sata0: Cancelling block job
2021-12-29T16:00:40.405+0100 7f4a277fe700 -1 librbd::image::preRemoveRequest: 0x56550702fa10 check_image_watchers: image has watchers - not removing
Removing image: 0% complete...failed.
rbd: error: image still has watchers
rbd rm 'vm-107-disk-0' error: rbd: error: image still has watchers
TASK ERROR: storage migration failed: block job (mirror) error: VM 107 qmp command 'query-block-jobs' failed - got timeout
 
Watchers? Huh? Shoot those watchers and get on!
I don't think shooting anything is wanted in a production environment.
I agree, though, it's annoying that EFI disks (and TPM state disks) cannot be moved at runtime right now, for whatever reason.

But in general, Proxmox's reliable and careful handling of resources is much appreciated, as opposed to forcefully executing whatever the user requests (or whatever they *think* they request, not knowing all the background consequences).

Would you mind sharing your VM configs and some details about your storage(s)? Source and target.
I have no such problems on my Ceph cluster, using PVE 7.1. Except for the EFI disks, of course.
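
Regarding the "image still has watchers" part of your log: before removing anything you can check what is still holding the RBD image. A rough sketch, assuming the Ceph pool behind your SSD01 storage is also named SSD01 (check /etc/pve/storage.cfg for the real pool name):

Code:
rbd status SSD01/vm-107-disk-0   # lists the watchers (typically the still-running QEMU process)
rbd rm SSD01/vm-107-disk-0       # only once no watchers are listed and the VM no longer references the image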
 
The cluster consists of 6 online nodes:
p1-p2 for VMs
p3-p4 for Ceph
Each Ceph node currently has two 2 TB Kingston SEDC500M enterprise SSDs; more to be added later.
It pushes out around 800 MB/s sequential read and around 550 MB/s sequential write at 32K.
The network consists of two sets of stacked 10GbE switches, one redundant set for public and one set for cluster traffic, i.e. each PVE storage node has 4x 10GbE dedicated to Ceph.

The storage nodes are based on AM4 and the Asus WS x570 PRO and have 1x quad 10GbE NIC and 1x SM 8-port SAS controller, in a 16-bay enclosure.
The SATA ports connect to the SAS/SATA backplane via reverse 4x SATA -> SFF-8087, and the U.2 connector is connected to a single 4-bay backplane via SFF-8643 -> SFF-8087. The last 8 bays are connected to the SM controller. The nodes have 32 GB RAM and are expected to pack 5 OSDs initially; if more OSDs are added, more RAM will be needed.

#1 has 64 TB of raidz2 WD Gold rust, and #2 has an 8 TB SSD "raid10000" setup of 5 three-way ZFS mirrors on consumer-grade SSDs with a Radian Memory Systems RMS-300 SLOG device.
The VM disks to be moved are on the SSD setup.

201.conf
agent: 1,fstrim_cloned_disks=1
balloon: 0
bios: ovmf
boot: order=scsi0;ide2
cores: 8
cpu: host
efidisk0: PVE02-STORAGE2:201/vm-201-disk-2.raw,size=128K
ide2: none,media=cdrom
machine: pc-q35-6.0
memory: 65536
name: XXXXXXXXX
net0: virtio=46:C3:6F:1C:A3:65,bridge=vmbr0,firewall=1
numa: 0
onboot: 1
ostype: win10
scsi0: SSD01:vm-201-disk-0,cache=writeback,discard=on,size=2000G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=1b6d31fa-f735-43a4-8451-6c70ac2578e9
sockets: 1


###storage dir points to zfspool mounted /storage2/
dir: PVE02-STORAGE2
path /storage2/vm
content images
nodes pve02,pve01
prune-backups keep-all=1
shared 0
vmgenid:


scsi0 moved from storage2 to SSD01 (Ceph) quite fine when asked to do so.
But scsi1 refuses, and it should not be in use at all except at boot time.

PS: I have never seen the highlighted line (the agent one) in a config file before.

PPS: never mind, it's the agent trim settings. :D
 
the reason why tpm state and efi disks can't be moved online is that both are 'writable' from the guest, so we can't touch them behind the guest's back, but are not accessible for the qemu process like regular block devices - so it's not possible to do a block mirror that intercepts/redirects writes. tpmstate is handled via a second process running next to the VM, and EFI disks are exposed like a flash chip - in both cases we just use our existing 'disk' layer on the PVE side to make management easier and flexible, they are not 'disks' as far as the VM is concerned.

it could (and should) be handled better/earlier and with a clear error message, but when some limitation like that is visible in PVE it's usually not because the devs are lazy/don't care, but because there is a good reason for that limitation.
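
In the meantime, the workaround is to do the move with the guest powered off. A rough sketch using VM 201 and the SSD01 storage from the config above (double-check the exact option names with 'qm help move-disk' on your version; older releases spell it 'qm move_disk'):

Code:
qm shutdown 201                              # power the guest off - efidisk0 cannot be mirrored live
qm wait 201                                  # wait until it is actually stopped
qm move-disk 201 efidisk0 SSD01 --delete 1   # copy the EFI vdisk to the target storage and drop the source
qm start 201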
 
the reason why tpm state and efi disks can't be moved online is that both are 'writable' from the guest, so we can't touch them behind the guest's back, but are not accessible for the qemu process like regular block devices...
@fabian something we do in other circumstances is to preserve a copy of such crucial info on occasion, e.g. during reboot etc.
EFI config info doesn't change often... so while it is writable, it is rarely written, correct?

This is truly a killer issue. :(
 
yes, that is correct. it is writable albeit rarely written - but there is no way to ensure that no writes happen during the migration (which, depending on circumstances can take a while!), so it's not safe to move the disk (which could mean losing writes altogether, or transferring an inconsistent state).
 
yes, that is correct. it is writable albeit rarely written - but there is no way to ensure that no writes happen during the migration...
Is there no way to detect that it has been written? (If nothing else, do a binary compare before and after the copy; see the sketch at the end of this post ;) )

Since EFI storage is small (typically a few hundred MB), and writes to EFI storage are incredibly rare, why not:
1) Have the migration fail on an EFI write... or better:
2) Do the EFI move last (quick, since it's tiny), and restart the EFI copy on a write (once or twice, then fail).

As it is, I'm wanting to move my VMs away from UEFI simply because of this risk.
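
Just to illustrate the binary-compare idea, something like this would do for a file-based EFI disk (the path follows the dir-storage layout posted earlier; purely a sketch of the idea, not something PVE does today):

Code:
# hash the EFI vdisk before and after the copy
SRC=/storage2/vm/images/201/vm-201-disk-2.raw
sha256sum "$SRC" > /tmp/efidisk.before
# ... do the actual copy here ...
sha256sum "$SRC" > /tmp/efidisk.after
cmp -s /tmp/efidisk.before /tmp/efidisk.after && echo "no writes during the copy" || echo "EFI vdisk changed, redo the copy"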
 
I just got burned by this. I've just migrated away from ESXi, where I never had to do this. IMO, bringing down a VM just to move the trivially small EFI disk is less than optimal...
 
At the very least, a more user-friendly message? Please? :)
Code:
create full clone of drive efidisk0 (SSD-secure:vm-118-disk-2)
drive mirror is starting for drive-efidisk0
drive-efidisk0: Cancelling block job
drive-efidisk0: Done.
TASK ERROR: storage migration failed: block job (mirror) error: drive-efidisk0: 'mirror' has been cancelled
 
Code:
create full clone of drive efidisk0 (SSD-secure:vm-118-disk-2)
drive mirror is starting for drive-efidisk0
drive-efidisk0: Cancelling block job
drive-efidisk0: Done.
TASK ERROR: storage migration failed: block job (mirror) error: drive-efidisk0: 'mirror' has been cancelled

Well, I dunno about you, but seeing the above, the first words that pop into my head are NOT "gosh, that is intuitively obvious!" How hard is it to print "EFI disk cannot be moved while the VM is running"?
 
I was able to move the EFI disk with the Move Disk button from local-lvm storage to a shared, directory-based storage in qcow2 format.
But moving it back from the shared storage to LVM in raw format got cancelled too.
I'm using PVE 6.3.

EDIT:
The VM was running during the process.
 
I'm a bit confused by this error message too. I just moved a Windows Server 2022 VM from one host to another (LVM storage to LVM storage) with no problems, so it seems like EFI disks can move.

But now, when I try to move the EFI disk from a local LVM volume to a local ZFS volume, I get the storage migration failed: block job (mirror) error: drive-efidisk0: 'mirror' has been cancelled error.

Is this the same problem, or am I running into something different?

For what it's worth, I'm on Proxmox 7.4-1/7.4-3:
proxmox-ve: 7.4-1 (running kernel: 5.15.102-1-pve)
pve-manager: 7.4-3 (running version: 7.4-3/9002ab8a)
 
the issue is that EFI disks *must* have a very specific (small) size, and some storages *have to round up* since they don't support such small volumes, and moving a disk live has to keep the exact size, so moving from one storage type to another may or may not work while the VM is running.
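
To see whether that rounding applies in a given case, compare how each storage actually sized the EFI volume. The VM id and storage names below are just examples; substitute your own:

Code:
qm config 118 | grep ^efidisk0    # what PVE requested (e.g. size=128K)
pvesm list local-lvm --vmid 118   # actual volume size on the LVM source
pvesm list local-zfs --vmid 118   # actual (possibly rounded-up) size on the ZFS target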
 
Thanks @fabian, that explains it! LVM to LVM and ZFS to ZFS worked.

+1 vote for more descriptive error messages, even just a link to a wiki page containing 2 or 3 common reasons why the operation might fail.
 
the issue is that EFI disks *must* have a very specific (small) size, and some storages *have to round up* since they don't support such small volumes, and moving a disk live has to keep the exact size, so moving from one storage type to another may or may not work while the VM is running.
Do you know of ANY situation where the EFI disk MUST be small? I've never seen that.

In such situations, what makes sense is to do the rounding up front, as well as it can be done.

We've long had the technology to know about such things in advance. Honestly, I don't think anybody will complain about making EFI disks automagically a good size across the board. (If you don't like it, don't use UEFI boot!)
Even having a "Fix EFI Size" patch (requiring reboot) would be reasonable.

Assuming a non-removable drive, with rare exceptions, the following policy would fit most cases. These sizes are defined partly by the OS, partly by the hardware:

ALL
* Round up, as if it were a 4k-per-sector drive
* Consider a 65527 * 4 kB minimum, since FAT32 needs 65527 clusters and 4 kB sectors are quite popular. That's really not that costly today.

512B/sector drive
* 100 MiB for Windows and the vast majority of Linux/Unix (per-bootable OS and/or bootable copy)
* 200 MiB for macOS

4k/sector drive
* 263 MiB (due to how FAT32 works)

Then, allow a specific setting. (Some Unixes want 550 MiB, mostly to give room for multiple OS copies.)

DETAILED TECH NOTE:
* For non-removable drives, the actual bare minimum is defined by FAT32: 65527 clusters, so at 512 B clusters that's roughly 32.7 MB, and at 4 kB clusters roughly 262.1 MB.
* AFAIK, there's no defined maximum, other than the limits of FAT32 (the EFI System Partition is based on FAT32 for internal drives).
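
For reference, plugging in the actual power-of-two sector sizes (512 B and 4096 B, one cluster per sector) gives slightly larger exact values than the decimal estimates above:

Code:
# FAT32 minimum of 65527 clusters, one cluster per sector
echo $(( 65527 * 512 ))    # 33549824 bytes  (about 33.5 MB, i.e. 32 MiB)
echo $(( 65527 * 4096 ))   # 268398592 bytes (about 268.4 MB, i.e. 256 MiB)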
 
that's a misunderstanding. the EFI disk is not the ESP :) it's the equivalent of the flash chip your motherboard uses to store its UEFI configuration.
 
