job failed with err -5 - Input/output error (BTRFS) transfer/backup/clone

harmonyp

Member
Nov 26, 2020
After testing BTRFS for a few days, it seems that every day more virtual machines are hitting this same issue when trying to create a backup, transfer, or clone.

Transfer (BTRFS > ZFS)

Code:
2021-09-25 22:16:24 starting migration of VM 520 to node 'NODE1' (192.168.0.120)
2021-09-25 22:16:24 found local disk 'btrfs:520/vm-520-disk-0.raw' (in current VM config)
2021-09-25 22:16:24 starting VM 520 on remote node 'NODE1'
2021-09-25 22:16:28 volume 'btrfs:520/vm-520-disk-0.raw' is 'zfs:vm-520-disk-0' on the target
2021-09-25 22:16:28 start remote tunnel
2021-09-25 22:16:29 ssh tunnel ver 1
2021-09-25 22:16:29 starting storage migration
2021-09-25 22:16:29 scsi0: start migration to nbd:unix:/run/qemu-server/520_nbd.migrate:exportname=drive-scsi0
drive mirror is starting for drive-scsi0
drive-scsi0: transferred 0.0 B of 50.0 GiB (0.00%) in 23s
drive-scsi0: transferred 262.0 MiB of 50.0 GiB (0.51%) in 24s
drive-scsi0: transferred 407.0 MiB of 50.0 GiB (0.79%) in 25s
drive-scsi0: transferred 545.0 MiB of 50.0 GiB (1.06%) in 26s
drive-scsi0: transferred 631.5 MiB of 50.0 GiB (1.23%) in 27s
drive-scsi0: transferred 718.6 MiB of 50.0 GiB (1.40%) in 28s
drive-scsi0: transferred 832.0 MiB of 50.0 GiB (1.62%) in 29s
drive-scsi0: transferred 958.1 MiB of 50.0 GiB (1.87%) in 30s
drive-scsi0: transferred 1.0 GiB of 50.0 GiB (2.06%) in 31s
drive-scsi0: transferred 1.2 GiB of 50.0 GiB (2.37%) in 32s
drive-scsi0: transferred 1.3 GiB of 50.0 GiB (2.61%) in 33s
drive-scsi0: transferred 1.4 GiB of 50.0 GiB (2.89%) in 34s
drive-scsi0: transferred 1.5 GiB of 50.0 GiB (3.06%) in 35s
drive-scsi0: transferred 1.6 GiB of 50.0 GiB (3.29%) in 36s
drive-scsi0: transferred 1.7 GiB of 50.0 GiB (3.45%) in 37s
drive-scsi0: transferred 1.8 GiB of 50.0 GiB (3.64%) in 38s
drive-scsi0: transferred 1.9 GiB of 50.0 GiB (3.89%) in 39s
drive-scsi0: transferred 2.1 GiB of 50.0 GiB (4.15%) in 40s
drive-scsi0: transferred 2.3 GiB of 50.0 GiB (4.51%) in 41s
drive-scsi0: transferred 2.5 GiB of 50.0 GiB (4.92%) in 42s
drive-scsi0: transferred 2.7 GiB of 50.0 GiB (5.32%) in 43s
drive-scsi0: transferred 2.8 GiB of 50.0 GiB (5.70%) in 44s
drive-scsi0: transferred 3.0 GiB of 50.0 GiB (6.10%) in 45s
drive-scsi0: transferred 3.2 GiB of 50.0 GiB (6.46%) in 46s
drive-scsi0: transferred 3.4 GiB of 50.0 GiB (6.81%) in 47s
drive-scsi0: transferred 3.6 GiB of 50.0 GiB (7.12%) in 48s
drive-scsi0: transferred 3.8 GiB of 50.0 GiB (7.54%) in 49s
drive-scsi0: transferred 3.9 GiB of 50.0 GiB (7.87%) in 50s
drive-scsi0: transferred 4.1 GiB of 50.0 GiB (8.26%) in 51s
drive-scsi0: transferred 4.3 GiB of 50.0 GiB (8.67%) in 52s
drive-scsi0: transferred 4.5 GiB of 50.0 GiB (9.04%) in 53s
drive-scsi0: transferred 4.7 GiB of 50.0 GiB (9.37%) in 54s
drive-scsi0: transferred 4.9 GiB of 50.0 GiB (9.74%) in 55s
drive-scsi0: transferred 5.0 GiB of 50.0 GiB (10.05%) in 56s
drive-scsi0: transferred 5.2 GiB of 50.0 GiB (10.46%) in 57s
drive-scsi0: transferred 5.5 GiB of 50.0 GiB (10.93%) in 58s
drive-scsi0: transferred 5.7 GiB of 50.0 GiB (11.36%) in 59s
drive-scsi0: transferred 5.9 GiB of 50.0 GiB (11.74%) in 1m
drive-scsi0: transferred 6.1 GiB of 50.0 GiB (12.16%) in 1m 1s
drive-scsi0: transferred 6.3 GiB of 50.0 GiB (12.60%) in 1m 2s
drive-scsi0: transferred 6.5 GiB of 50.0 GiB (13.06%) in 1m 3s
drive-scsi0: transferred 6.8 GiB of 50.0 GiB (13.55%) in 1m 4s
drive-scsi0: transferred 7.0 GiB of 50.0 GiB (13.95%) in 1m 5s
drive-scsi0: transferred 7.1 GiB of 50.0 GiB (14.30%) in 1m 6s
drive-scsi0: transferred 7.4 GiB of 50.0 GiB (14.73%) in 1m 7s
drive-scsi0: transferred 7.6 GiB of 50.0 GiB (15.18%) in 1m 8s
drive-scsi0: transferred 7.8 GiB of 50.0 GiB (15.58%) in 1m 9s
drive-scsi0: transferred 8.1 GiB of 50.0 GiB (16.11%) in 1m 10s
drive-scsi0: transferred 8.3 GiB of 50.0 GiB (16.57%) in 1m 11s
drive-scsi0: transferred 8.5 GiB of 50.0 GiB (17.08%) in 1m 12s
drive-scsi0: transferred 8.8 GiB of 50.0 GiB (17.61%) in 1m 13s
drive-scsi0: transferred 9.1 GiB of 50.0 GiB (18.12%) in 1m 14s
drive-scsi0: transferred 9.3 GiB of 50.0 GiB (18.65%) in 1m 15s
drive-scsi0: transferred 9.5 GiB of 50.0 GiB (19.00%) in 1m 16s
drive-scsi0: transferred 9.6 GiB of 50.0 GiB (19.26%) in 1m 17s
drive-scsi0: transferred 9.8 GiB of 50.0 GiB (19.66%) in 1m 18s
drive-scsi0: Cancelling block job
drive-scsi0: Done.
2021-09-25 22:17:49 ERROR: online migrate failure - block job (mirror) error: drive-scsi0: 'mirror' has been cancelled
2021-09-25 22:17:49 aborting phase 2 - cleanup resources
2021-09-25 22:17:49 migrate_cancel
2021-09-25 22:17:53 ERROR: migration finished with problems (duration 00:01:29)
TASK ERROR: migration problems



Cloning

Code:
create full clone of drive scsi0 (btrfs:520/vm-520-disk-0.raw)
drive mirror is starting for drive-scsi0
drive-scsi0: transferred 1.1 GiB of 50.0 GiB (2.11%) in 5s
drive-scsi0: transferred 1.8 GiB of 50.0 GiB (3.55%) in 9s
drive-scsi0: transferred 2.0 GiB of 50.0 GiB (4.06%) in 10s
drive-scsi0: transferred 2.3 GiB of 50.0 GiB (4.65%) in 11s
drive-scsi0: transferred 2.6 GiB of 50.0 GiB (5.18%) in 12s
drive-scsi0: transferred 2.9 GiB of 50.0 GiB (5.72%) in 13s
drive-scsi0: transferred 3.2 GiB of 50.0 GiB (6.41%) in 14s
drive-scsi0: transferred 3.5 GiB of 50.0 GiB (7.09%) in 16s
drive-scsi0: transferred 3.9 GiB of 50.0 GiB (7.73%) in 17s
drive-scsi0: transferred 4.2 GiB of 50.0 GiB (8.47%) in 18s
drive-scsi0: transferred 4.5 GiB of 50.0 GiB (9.04%) in 19s
drive-scsi0: transferred 4.8 GiB of 50.0 GiB (9.52%) in 20s
drive-scsi0: transferred 5.0 GiB of 50.0 GiB (10.08%) in 21s
drive-scsi0: transferred 5.4 GiB of 50.0 GiB (10.81%) in 22s
drive-scsi0: transferred 5.8 GiB of 50.0 GiB (11.57%) in 23s
drive-scsi0: transferred 6.0 GiB of 50.0 GiB (12.06%) in 24s
drive-scsi0: transferred 6.3 GiB of 50.0 GiB (12.56%) in 25s
drive-scsi0: transferred 6.6 GiB of 50.0 GiB (13.19%) in 26s
drive-scsi0: transferred 7.0 GiB of 50.0 GiB (14.01%) in 27s
drive-scsi0: transferred 7.4 GiB of 50.0 GiB (14.82%) in 28s
drive-scsi0: transferred 7.8 GiB of 50.0 GiB (15.69%) in 29s
drive-scsi0: transferred 8.3 GiB of 50.0 GiB (16.51%) in 30s
drive-scsi0: transferred 8.7 GiB of 50.0 GiB (17.32%) in 31s
drive-scsi0: transferred 9.1 GiB of 50.0 GiB (18.13%) in 32s
drive-scsi0: transferred 9.5 GiB of 50.0 GiB (18.90%) in 33s
drive-scsi0: transferred 9.7 GiB of 50.0 GiB (19.48%) in 34s
drive-scsi0: Cancelling block job
drive-scsi0: Done.
TASK ERROR: clone failed: block job (mirror) error: drive-scsi0: 'mirror' has been cancelled


Backup:

Code:
INFO: starting new backup job: vzdump 520 --mode snapshot --storage PBS --node NODE2 --remove 0
INFO: Starting Backup of VM 520 (qemu)
INFO: Backup started at 2021-09-25 22:20:20
INFO: status = running
INFO: VM Name: hg-win753
INFO: include disk 'scsi0' 'btrfs:520/vm-520-disk-0.raw' 50G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/520/2021-09-25T21:20:20Z'
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
INFO: started backup task '5a224b36-9f5a-40b6-88c0-e65f03e0099c'
INFO: resuming VM again
INFO: scsi0: dirty-bitmap status: existing bitmap was invalid and has been cleared
INFO:   2% (1.2 GiB of 50.0 GiB) in 3s, read: 408.0 MiB/s, write: 161.3 MiB/s
INFO:   3% (1.7 GiB of 50.0 GiB) in 6s, read: 160.0 MiB/s, write: 88.0 MiB/s
INFO:   4% (2.5 GiB of 50.0 GiB) in 9s, read: 269.3 MiB/s, write: 108.0 MiB/s
INFO:   7% (3.9 GiB of 50.0 GiB) in 12s, read: 496.0 MiB/s, write: 64.0 MiB/s
INFO:  10% (5.0 GiB of 50.0 GiB) in 15s, read: 386.7 MiB/s, write: 57.3 MiB/s
INFO:  11% (5.9 GiB of 50.0 GiB) in 18s, read: 286.7 MiB/s, write: 54.7 MiB/s
INFO:  12% (6.3 GiB of 50.0 GiB) in 21s, read: 142.7 MiB/s, write: 56.0 MiB/s
INFO:  15% (7.6 GiB of 50.0 GiB) in 24s, read: 450.7 MiB/s, write: 28.0 MiB/s
INFO:  17% (8.7 GiB of 50.0 GiB) in 27s, read: 366.7 MiB/s, write: 81.3 MiB/s
INFO:  18% (9.5 GiB of 50.0 GiB) in 30s, read: 269.3 MiB/s, write: 36.0 MiB/s
INFO:  19% (9.7 GiB of 50.0 GiB) in 33s, read: 86.7 MiB/s, write: 72.0 MiB/s
INFO:  19% (9.9 GiB of 50.0 GiB) in 36s, read: 65.3 MiB/s, write: 65.3 MiB/s
ERROR: job failed with err -5 - Input/output error
INFO: aborting backup job
INFO: resuming VM again
ERROR: Backup of VM 520 failed - job failed with err -5 - Input/output error
INFO: Failed at 2021-09-25 22:21:00
INFO: Backup job finished with errors
TASK ERROR: job errors
 
To add, the syslog contains:

Code:
Sep 26 20:15:02 HOSTNAME vzdump[3822900]: INFO: Starting Backup of VM 152 (qemu)
Sep 26 20:15:32 HOSTNAME kernel: [272661.568777] BTRFS warning (device nvme0n1): csum failed root 445 ino 257 off 10338160640 csum 0xd2eda394 expected csum 0x8ba91653 mirror 1
Sep 26 20:15:32 HOSTNAME kernel: [272661.595833] BTRFS error (device nvme0n1): bdev /dev/nvme2n1 errs: wr 0, rd 0, flush 0, corrupt 3526, gen 0
Sep 26 20:15:32 HOSTNAME vzdump[3822900]: ERROR: Backup of VM 152 failed - job failed with err -5 - Input/output error

pveversion -v
Code:
proxmox-ve: 7.0-2 (running kernel: 5.11.22-4-pve)
pve-manager: 7.0-11 (running version: 7.0-11/63d82f4e)
pve-kernel-5.11: 7.0-7
pve-kernel-helper: 7.0-7
pve-kernel-5.4: 6.4-6
pve-kernel-5.11.22-4-pve: 5.11.22-8
pve-kernel-5.4.140-1-pve: 5.4.140-1
ceph-fuse: 14.2.21-1
corosync: 3.1.5-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve1
libproxmox-acme-perl: 1.3.0
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.0-4
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-6
libpve-guest-common-perl: 4.0-2
libpve-http-server-perl: 4.0-2
libpve-storage-perl: 7.0-11
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.9-4
lxcfs: 4.0.8-pve2
novnc-pve: 1.2.0-3
proxmox-backup-client: 2.0.9-2
proxmox-backup-file-restore: 2.0.9-2
proxmox-mini-journalreader: 1.2-1
proxmox-widget-toolkit: 3.3-6
pve-cluster: 7.0-3
pve-container: 4.0-9
pve-docs: 7.0-5
pve-edk2-firmware: 3.20200531-1
pve-firewall: 4.2-3
pve-firmware: 3.3-1
pve-ha-manager: 3.3-1
pve-i18n: 2.5-1
pve-qemu-kvm: 6.0.0-4
pve-xtermjs: 4.12.0-1
pve-zsync: 2.2
qemu-server: 7.0-13
smartmontools: 7.2-pve2
spiceterm: 3.2-2
vncterm: 1.7-1
zfsutils-linux: 2.0.5-pve1

BTRFS was added through:

Code:
apt-get install btrfs-progs
(this was not installed after a fresh upgrade to 7?)
Code:
mkfs.btrfs -m raid1 -d single -L My-Storage /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1

nano /etc/fstab
Code:
UUID=5c51b01a-8053-4fff-8549-669d6594ef5a  /my-storage btrfs defaults 0 0

Code:
mkdir /my-storage
mount /dev/nvme0n1 /my-storage
pvesm add btrfs btrfs --path /my-storage
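For reference, a quick sanity check after these steps (a sketch, assuming the mountpoint and storage name above):

Code:
# show the multi-device filesystem and the UUID used in /etc/fstab
btrfs filesystem show /my-storage
# confirm Proxmox sees the new storage
pvesm status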
 
That sounds like a hardware error; I'd check the RAM and the NVMe devices.
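A rough sketch of how one might check that (smartmontools is already installed per the pveversion output above; nvme-cli would need to be installed separately):

Code:
# SMART health and error counters for each NVMe device
smartctl -a /dev/nvme0n1
smartctl -a /dev/nvme1n1
smartctl -a /dev/nvme2n1
# optionally, with nvme-cli: media errors and wear counters
nvme smart-log /dev/nvme0n1
# RAM is best tested offline, e.g. by booting memtest86+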
 
I'll run some tests, but I don't think that's the case; I have been reading that this is very common with BTRFS.

Also, from trying to copy all the virtual machine disks by running:

Code:
find /my-storage -type f -exec cp {} /dev/null \;

Code:
cp: error reading '/my-storage/images/152/vm-152-disk-0/disk.raw': Input/output error

I can see this only seems to be happening to Windows virtual machines. It has happened to every Windows machine I have copied from ZFS to BTRFS; new Windows VMs created on this node don't have any errors so far.


btrfs scrub status /my-storage
Code:
UUID:             8db0407b-1950-4a74-acc6-63bf1cc2c137
Scrub started:    Mon Sep 27 11:04:46 2021
Status:           finished
Duration:         0:01:13
Total to scrub:   394.79GiB
Rate:             5.40GiB/s
Error summary:    csum=42
  Corrected:      0
  Uncorrectable:  42
  Unverified:     0
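To map the kernel csum warnings (e.g. "root 445 ino 257" in the syslog above) back to a file path, something like this should work; the exact subvolume path is a guess based on the Proxmox btrfs storage layout:

Code:
# find the subvolume with id 445
btrfs subvolume list /my-storage
# resolve inode 257 inside that subvolume to a path
btrfs inspect-internal inode-resolve 257 /my-storage/images/152/vm-152-disk-0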
 
You could try resilvering on the BTRFS level.
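On btrfs that would be a scrub, which rewrites bad copies from a good mirror when a redundant profile exists; a sketch, using the mountpoint from this thread:

Code:
# start a scrub in the background and watch its progress
btrfs scrub start /my-storage
btrfs scrub status /my-storage
# note: with -d single there is no second data copy, so data
# csum errors can only be detected, not repaired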
 
It's a technology preview, as it says at the top of the page ;) Your setup has no redundancy for the data, so BTRFS can't recover from the checksum errors.
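The profiles can be confirmed with btrfs filesystem df; the annotations below are illustrative for the setup in this thread:

Code:
btrfs filesystem df /my-storage
# Data, single: ...     <- one copy only, csum errors are fatal
# Metadata, RAID1: ...  <- two copies, errors here can be repaired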
 
mkfs.btrfs -m raid1 -d single -L My-Storage /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1

The -d single part here looks like a bad idea. It is basically the equivalent of raid0 for data: you're using raid1 only for the metadata, so the data simply cannot be recovered (unless your VM internally has multiple copies and can recover by itself; for example, if you're running databases inside, chances are those are robust enough to recover from within).
Personally, at this point I'd opt for making backups from within the affected VMs where cp of the disks causes I/O errors and recreating them on another storage, and, well, wiping and recreating the btrfs with redundancy for data ;-)
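As an aside, on a healthy filesystem the data profile could also be converted to raid1 in place instead of wiping; a hypothetical sketch, and it would not repair data that is already corrupted:

Code:
# convert the data profile to raid1 across the existing devices
# (metadata is already raid1); needs free space for a second copy
btrfs balance start -dconvert=raid1 /my-storage
btrfs balance status /my-storage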
 
I understand the risks; I need the extra disk space and the possible performance. I will try running a repair first, or try telling btrfs to ignore checksum errors.

From looking around, a writeback or writethrough cache is required on COW filesystems. Direct I/O isn't supported on btrfs for files that have checksums; btrfs should not be letting me open these files.

https://btrfs.wiki.kernel.org/index.php/Gotchas#Direct_IO_and_CRCs
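A sketch of both workarounds, using the VM and paths from this thread (cache=writeback is a standard Proxmox drive option; chattr +C disables COW and checksums, and only affects files created after the flag is set):

Code:
# switch the disk from the default cache=none (which uses O_DIRECT)
# to writeback
qm set 520 --scsi0 btrfs:520/vm-520-disk-0.raw,cache=writeback
# or mark the images directory nodatacow so new images skip checksums
chattr +C /my-storage/images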

For anyone else who has this issue: after shutting down all virtual machines, I did the following to restore the data successfully.

Unmount the current btrfs filesystem
Code:
umount /my-storage

Restore a copy of all current virtual machine images into the /var/lib/vz/restore directory
Code:
btrfs restore /dev/nvme0n1 /var/lib/vz/restore

Remount the storage, then remove each affected disk image and replace it with the restored copy
Code:
mount /dev/nvme0n1 /my-storage
cd /my-storage/images/VMID/vm-VMID-disk-0
rm disk.raw
cp /var/lib/vz/restore/images/VMID/vm-VMID-disk-0/disk.raw disk.raw
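To verify, re-running the read test from earlier in this thread should now complete without errors; any remaining I/O error means that image is still bad:

Code:
find /my-storage -type f -exec cp {} /dev/null \;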
 