ERROR: online migrate failure - Failed to complete storage migration: block job (mirror) error: drive-efidisk0: Input/output error (io-status: ok)

herzkerl

I've been giving "remote migration" a try for the first time today, moving machines live from a single host running ZFS to a new cluster running Ceph. It worked tremendously well, without issues and on the first try, for all VMs but one, which always fails with the errors below.

I tried quite a few things:
• Migrating to a different remote host
• Migrating to local-zfs instead of Ceph
• Changing the machine version from 7.1 to 8.2

I've read quite a few threads regarding these issues, but to no avail. Looking forward to any suggestions you might have!
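
For reference, the call I used was along these lines (just a sketch of qm remote-migrate; the API token secret and fingerprint are placeholders, while the target IP, bridge, and storage names match the log further down):

Code:
# experimental remote migration via the qm CLI; placeholder token/fingerprint values
qm remote-migrate 101 101 \
  'host=192.168.100.12,apitoken=PVEAPIToken=root@pam!migration=<SECRET>,fingerprint=<TARGET-FINGERPRINT>' \
  --target-bridge vmbr0 --target-storage data --online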

Here's the config from that VM:

Code:
agent: 1,fstrim_cloned_disks=1
bios: ovmf
boot: order=ide2;scsi0
cores: 8
cpu: x86-64-v3
efidisk0: local-zfs:vm-101-disk-0,efitype=4m,pre-enrolled-keys=1,size=528K
ide2: none,media=cdrom
machine: pc-i440fx-8.1
memory: 16384
name: W2019-DC
net0: virtio=7A:48:81:5E:B1:14,bridge=vmbr0
numa: 1
onboot: 1
ostype: win10
protection: 1
scsi0: local-zfs:vm-101-disk-1,discard=on,iothread=1,size=150G,ssd=1
scsi1: local-zfs:vm-101-disk-2,discard=on,iothread=1,size=300G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=80426df5-91a8-4be1-b1d1-99fd144cfda0
sockets: 1
vmgenid: c4d0dd5a-ac6a-4009-ae34-3cd2cf455626

For comparison, here are the configs of very similar machines (all running Windows Server 2019) that I migrated successfully:

Code:
agent: 1,fstrim_cloned_disks=1
bios: ovmf
boot: order=ide2;scsi0
cores: 6
efidisk0: local-zfs:vm-102-disk-2,efitype=4m,pre-enrolled-keys=1,size=528K
ide2: none,media=cdrom
lock: migrate
machine: pc-i440fx-8.1
memory: 65536
name: W2019-MX
net0: virtio=D2:C0:FE:A5:43:65,bridge=vmbr0
numa: 1
onboot: 1
ostype: win10
protection: 1
scsi0: local-zfs:vm-102-disk-0,discard=on,iothread=1,size=250G,ssd=1
scsi1: local-zfs:vm-102-disk-1,discard=on,iothread=1,size=150G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=83c92ef2-f8e2-4cc9-9c43-024c5380f0a7
sockets: 2
vmgenid: 6d0ec721-dbba-4a37-99b1-6fcafa9152e3

Code:
agent: 1,fstrim_cloned_disks=1
balloon: 32768
bios: ovmf
boot: order=ide2;scsi0
cores: 8
efidisk0: local-zfs:vm-103-disk-0,efitype=4m,pre-enrolled-keys=1,size=528K
ide2: none,media=cdrom
lock: migrate
machine: pc-i440fx-8.1
memory: 131072
name: W2019-TS
net0: virtio=BA:2D:CA:68:77:CC,bridge=vmbr0
numa: 1
onboot: 1
ostype: win10
protection: 1
scsi0: local-zfs:vm-103-disk-1,discard=on,iothread=1,size=200G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=0eb103f1-1096-4c54-b112-d1779b3116d3
sockets: 2
vmgenid: 792b8cc8-3fd4-49bd-89ad-c9c2cdb554b9

And here's the error log:

Code:
2024-12-27 18:52:44 remote: started tunnel worker 'UPID:pve-r6415-2:000301A6:002701C8:676EE96C:qmtunnel:101:root@pam!migration:'
tunnel: -> sending command "version" to remote
tunnel: <- got reply
2024-12-27 18:52:44 local WS tunnel version: 2
2024-12-27 18:52:44 remote WS tunnel version: 2
2024-12-27 18:52:44 minimum required WS tunnel version: 2
websocket tunnel started
2024-12-27 18:52:44 starting migration of VM 101 to node 'pve-r6415-2' (192.168.100.12)
tunnel: -> sending command "bwlimit" to remote
tunnel: <- got reply
tunnel: -> sending command "bwlimit" to remote
tunnel: <- got reply
tunnel: -> sending command "bwlimit" to remote
tunnel: <- got reply
2024-12-27 18:52:44 found local disk 'local-zfs:vm-101-disk-0' (attached)
2024-12-27 18:52:44 found local disk 'local-zfs:vm-101-disk-1' (attached)
2024-12-27 18:52:44 found local disk 'local-zfs:vm-101-disk-2' (attached)
2024-12-27 18:52:44 mapped: net0 from vmbr0 to vmbr0
2024-12-27 18:52:44 Allocating volume for drive 'scsi0' on remote storage 'data'..
tunnel: -> sending command "disk" to remote
tunnel: <- got reply
2024-12-27 18:52:44 volume 'local-zfs:vm-101-disk-1' is 'data:vm-101-disk-0' on the target
2024-12-27 18:52:44 Allocating volume for drive 'scsi1' on remote storage 'data'..
tunnel: -> sending command "disk" to remote
tunnel: <- got reply
2024-12-27 18:52:44 volume 'local-zfs:vm-101-disk-2' is 'data:vm-101-disk-1' on the target
2024-12-27 18:52:44 Allocating volume for drive 'efidisk0' on remote storage 'data'..
tunnel: -> sending command "disk" to remote
tunnel: <- got reply
2024-12-27 18:52:45 volume 'local-zfs:vm-101-disk-0' is 'data:vm-101-disk-2' on the target
tunnel: -> sending command "config" to remote
tunnel: <- got reply
tunnel: -> sending command "start" to remote
tunnel: <- got reply
2024-12-27 18:52:46 Setting up tunnel for '/run/qemu-server/101.migrate'
2024-12-27 18:52:46 Setting up tunnel for '/run/qemu-server/101_nbd.migrate'
2024-12-27 18:52:46 starting storage migration
2024-12-27 18:52:46 scsi1: start migration to nbd:unix:/run/qemu-server/101_nbd.migrate:exportname=drive-scsi1
drive mirror is starting for drive-scsi1
tunnel: accepted new connection on '/run/qemu-server/101_nbd.migrate'
tunnel: requesting WS ticket via tunnel
tunnel: established new WS for forwarding '/run/qemu-server/101_nbd.migrate'
drive-scsi1: transferred 87.0 MiB of 300.0 GiB (0.03%) in 1s
[...]
drive-scsi1: transferred 300.1 GiB of 300.1 GiB (100.00%) in 50m 16s, ready
all 'mirror' jobs are ready
2024-12-27 19:43:02 efidisk0: start migration to nbd:unix:/run/qemu-server/101_nbd.migrate:exportname=drive-efidisk0
drive mirror is starting for drive-efidisk0
tunnel: accepted new connection on '/run/qemu-server/101_nbd.migrate'
tunnel: requesting WS ticket via tunnel
tunnel: established new WS for forwarding '/run/qemu-server/101_nbd.migrate'
drive-efidisk0: transferred 0.0 B of 528.0 KiB (0.00%) in 0s
drive-efidisk0: transferred 528.0 KiB of 528.0 KiB (100.00%) in 1s, ready
all 'mirror' jobs are ready
2024-12-27 19:43:03 scsi0: start migration to nbd:unix:/run/qemu-server/101_nbd.migrate:exportname=drive-scsi0
drive mirror is starting for drive-scsi0
tunnel: accepted new connection on '/run/qemu-server/101_nbd.migrate'
tunnel: requesting WS ticket via tunnel
tunnel: established new WS for forwarding '/run/qemu-server/101_nbd.migrate'
drive-scsi0: transferred 79.0 MiB of 150.0 GiB (0.05%) in 1s
[...]
drive-scsi0: transferred 150.3 GiB of 150.3 GiB (100.00%) in 25m 1s, ready
all 'mirror' jobs are ready
2024-12-27 20:08:04 switching mirror jobs to actively synced mode
drive-efidisk0: switching to actively synced mode
drive-scsi0: switching to actively synced mode
drive-scsi1: switching to actively synced mode
drive-efidisk0: successfully switched to actively synced mode
drive-scsi0: successfully switched to actively synced mode
drive-scsi1: successfully switched to actively synced mode
2024-12-27 20:08:05 starting online/live migration on unix:/run/qemu-server/101.migrate
2024-12-27 20:08:05 set migration capabilities
tunnel: -> sending command "bwlimit" to remote
tunnel: <- got reply
2024-12-27 20:08:05 migration downtime limit: 100 ms
2024-12-27 20:08:05 migration cachesize: 2.0 GiB
2024-12-27 20:08:05 set migration parameters
2024-12-27 20:08:05 start migrate command to unix:/run/qemu-server/101.migrate
tunnel: accepted new connection on '/run/qemu-server/101.migrate'
tunnel: requesting WS ticket via tunnel
tunnel: established new WS for forwarding '/run/qemu-server/101.migrate'
2024-12-27 20:08:06 migration active, transferred 79.0 MiB of 16.0 GiB VM-state, 122.9 MiB/s
2024-12-27 20:08:06 xbzrle: send updates to 373916 pages in 190.0 MiB encoded memory, cache-miss 17.56%, overflow 31529
[...]
2024-12-27 20:10:55 auto-increased downtime to continue migration: 800 ms
2024-12-27 20:10:56 migration active, transferred 16.6 GiB of 16.0 GiB VM-state, 86.9 MiB/s, VM dirties lots of memory: 128.5 MiB/s
2024-12-27 20:10:56 xbzrle: send updates to 551775 pages in 211.4 MiB encoded memory, cache-miss 33.71%, overflow 32568
tunnel: done handling forwarded connection from '/run/qemu-server/101.migrate'
2024-12-27 20:10:56 average migration speed: 95.9 MiB/s - downtime 303 ms
2024-12-27 20:10:56 migration status: completed
all 'mirror' jobs are ready
drive-efidisk0: Completing block job...
drive-efidisk0: Completed successfully.
drive-scsi0: Completing block job...
drive-scsi0: Completed successfully.
drive-scsi1: Completing block job...
tunnel: done handling forwarded connection from '/run/qemu-server/101_nbd.migrate'
tunnel: done handling forwarded connection from '/run/qemu-server/101_nbd.migrate'
tunnel: done handling forwarded connection from '/run/qemu-server/101_nbd.migrate'
drive-scsi1: Completed successfully.
drive-efidisk0: Cancelling block job
drive-scsi1: Cancelling block job
drive-scsi0: Cancelling block job
drive-efidisk0: Done.
WARN: drive-scsi1: Input/output error (io-status: ok)
drive-scsi1: Done.
drive-scsi0: Done.
2024-12-27 20:10:59 ERROR: online migrate failure - Failed to complete storage migration: block job (mirror) error: drive-efidisk0: Input/output error (io-status: ok)
2024-12-27 20:10:59 aborting phase 2 - cleanup resources
2024-12-27 20:10:59 migrate_cancel
tunnel: -> sending command "stop" to remote
tunnel: <- got reply
tunnel: -> sending command "quit" to remote
tunnel: <- got reply
tunnel: thread 'main' panicked at 'failed printing to stdout: Broken pipe (os error 32)', library/std/src/io/stdio.rs:1009:9
tunnel: note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
CMD websocket tunnel died: command 'proxmox-websocket-tunnel' failed: exit code 101

2024-12-27 20:11:45 ERROR: no reply to command '{"cleanup":1,"cmd":"quit"}': reading from tunnel failed: got timeout
print() on closed filehandle GEN24 at /usr/share/perl5/PVE/Tunnel.pm line 99.
readline() on closed filehandle GEN21 at /usr/share/perl5/PVE/Tunnel.pm line 71.
Use of uninitialized value $res in concatenation (.) or string at /usr/share/perl5/PVE/Tunnel.pm line 117.
2024-12-27 20:12:15 tunnel still running - terminating now with SIGTERM
2024-12-27 20:12:25 tunnel still running - terminating now with SIGKILL
2024-12-27 20:12:26 ERROR: tunnel child process (PID 3022180) couldn't be collected
2024-12-27 20:12:26 ERROR: failed to decode tunnel reply '' (command '{"cleanup":0,"cmd":"quit"}') - malformed JSON string, neither tag, array, object, number, string or atom, at character offset 0 (before "(end of string)") at /usr/share/perl5/PVE/Tunnel.pm line 116.
2024-12-27 20:12:26 ERROR: migration finished with problems (duration 01:19:42)

TASK ERROR: migration problems
 
There may be a read error inside the win10 VM's virtual disk.

What I would recommend is using something like the free Veeam Agent to do a bare-metal backup from inside the VM, and restoring that. Note that some files/dirs may be unrecoverable. You might also want to run chkdsk /f and sfc /scannow in-VM and see how a defrag runs.
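
Those in-guest checks would look something like this, run from an elevated Command Prompt (assuming C: is the system drive):

Code:
:: check and repair the NTFS filesystem (schedules on next reboot if C: is in use)
chkdsk C: /f
:: verify and repair protected Windows system files
sfc /scannow
:: analyze fragmentation on the volume
defrag C: /A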
 
It might have been a network issue after all: we had set up a bond (LACP, hash policy layer3+4), and after switching the old host to a single-NIC config, the migration worked just fine.
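
For anyone comparing setups, the bond was configured roughly like this in /etc/network/interfaces (a sketch; the NIC names and address are placeholders, not our exact values):

Code:
auto bond0
iface bond0 inet manual
        bond-slaves eno1 eno2
        bond-miimon 100
        bond-mode 802.3ad
        bond-xmit-hash-policy layer3+4

auto vmbr0
iface vmbr0 inet static
        address 192.168.100.11/24
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0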

EDIT: It could also be due to the different sizes of the EFI image. When trying to move the disk from local-zfs to Ceph I'm still seeing an error, albeit a different one:

Code:
create full clone of drive efidisk0 (local-zfs:vm-101-disk-2)
drive mirror is starting for drive-efidisk0
drive-efidisk0: Cancelling block job
drive-efidisk0: Done.
Removing image: 100% complete...done.
TASK ERROR: storage migration failed: block job (mirror) error: drive-efidisk0: Source and target image have different sizes (io-status: ok)

I'll try to move the disk with the machine turned off, as suggested here: https://forum.proxmox.com/threads/t...-mirror-has-been-cancelled.102202/post-550688
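
If the offline move works, it would be something along these lines (a sketch; 'data' is the Ceph storage from the log above, and --delete drops the source volume after a successful copy):

Code:
# shut down the VM, move the EFI disk offline, then start it again
qm shutdown 101
qm disk move 101 efidisk0 data --delete
qm start 101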
 
