Backup of one VM suddenly failing: job failed with err -5 - Input/output error

adamreece.webbox

Hello,

One particular VM on one of our nodes has suddenly begun failing to back up as of yesterday. The VM itself runs and functions perfectly fine, and works after a full shutdown and restart. The error in question is the infamous "job failed with err -5 - Input/output error".

There is no shortage of space on the destination storage. Backing up to another destination also fails. The destination "/backup-shared/" is an SSHFS mount to backup storage on another server; a remount has already been attempted.
  • Volume "scsi0" is the VM's primary disk stored within a ZFS pool.
  • Volume "scsi1" is the VM's swap disk stored on a local NVMe drive.
  • When removing "scsi1" and "tpmstate0" from the VM, the backup still fails, so the issue appears to be with "scsi0".
A ZFS scrub reports no errors, and none of the physical volumes (four SSDs) in the ZFS pool have any SMART warnings. Is `qemu-img check` able to check an image on a zvol block device?
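For reference, since "scsi0" is a raw zvol I'm not sure how much `qemu-img check` would actually be able to verify; as an alternative, streaming the whole device should at least surface any read errors. The path below assumes the default `/dev/zvol/<pool>/<dataset>` layout:
Code:
# Read the entire zvol and discard the data; any read error should surface here.
dd if=/dev/zvol/dev-server1/vm-101-disk-0 of=/dev/null bs=1M status=progress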

Thanks,

Adam

---- Command outputs ----

Backup job output while the VM is fully off:
Code:
INFO: starting new backup job: vzdump 101 --remove 0 --mode snapshot --storage backup-shared --notification-mode auto --notes-template '{{guestname}}' --node dev-server1 --compress zstd
INFO: Starting Backup of VM 101 (qemu)
INFO: Backup started at 2024-09-05 11:58:51
INFO: status = stopped
INFO: backup mode: stop
INFO: ionice priority: 7
INFO: VM Name: virtdev-example
INFO: include disk 'scsi0' 'dev-server1:vm-101-disk-0' 128G
INFO: include disk 'scsi1' 'local-lvm:vm-101-disk-0' 8G
INFO: include disk 'tpmstate0' 'dev-server1:vm-101-disk-1' 4M
INFO: creating vzdump archive '/backup-shared/dump/vzdump-qemu-101-2024_09_05-11_58_51.vma.zst'
INFO: starting kvm to execute backup task
swtpm_setup: Not overwriting existing state file.
INFO: attaching TPM drive to QEMU for backup
INFO: started backup task '1f8849ef-73f1-49f2-8f5f-154fe5dadfcc'
INFO:   6% (8.3 GiB of 136.0 GiB) in 3s, read: 2.8 GiB/s, write: 47.5 MiB/s
INFO:   6% (8.7 GiB of 136.0 GiB) in 6s, read: 145.0 MiB/s, write: 114.4 MiB/s
ERROR: job failed with err -5 - Input/output error
INFO: aborting backup job
INFO: stopping kvm after backup task
ERROR: Backup of VM 101 failed - job failed with err -5 - Input/output error
INFO: Failed at 2024-09-05 11:58:59
INFO: Backup job finished with errors
INFO: notified via target `mail-to-root`
TASK ERROR: job errors

`qm config 101` output:
Code:
agent: 1,fstrim_cloned_disks=1
boot: order=scsi0;sata1;net0
cores: 4
memory: 4096
meta: creation-qemu=6.2.0,ctime=1660567839
name: virtdev-example
net0: virtio=08:00:27:51:79:ef,bridge=vmbr0,firewall=1,queues=4,tag=10
numa: 0
onboot: 1
ostype: l26
sata1: none,media=cdrom
scsi0: dev-server1:vm-101-disk-0,discard=on,iothread=1,size=128G,ssd=1
scsi1: local-lvm:vm-101-disk-0,discard=on,iothread=1,size=8G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=50bd1888-004d-4503-8799-971596e12eea
sockets: 1
tpmstate0: dev-server1:vm-101-disk-1,size=4M,version=v2.0
vmgenid: 4527ccc2-2da4-42b5-bbbc-50b1e4e1ad8a

`pveversion -v` output:
Code:
proxmox-ve: 8.2.0 (running kernel: 6.8.4-3-pve)
pve-manager: 8.2.4 (running version: 8.2.4/faa83925c9641325)
proxmox-kernel-helper: 8.1.0
pve-kernel-5.15: 7.4-11
proxmox-kernel-6.8: 6.8.12-1
proxmox-kernel-6.8.12-1-pve-signed: 6.8.12-1
proxmox-kernel-6.8.8-4-pve-signed: 6.8.8-4
proxmox-kernel-6.8.4-3-pve-signed: 6.8.4-3
proxmox-kernel-6.5.13-6-pve-signed: 6.5.13-6
proxmox-kernel-6.5: 6.5.13-6
pve-kernel-5.15.143-1-pve: 5.15.143-1
pve-kernel-5.15.30-2-pve: 5.15.30-3
ceph-fuse: 16.2.11+ds-2
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx9
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.1.4
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.7
libpve-cluster-perl: 8.0.7
libpve-common-perl: 8.2.2
libpve-guest-common-perl: 5.1.4
libpve-http-server-perl: 5.1.0
libpve-network-perl: 0.9.8
libpve-rs-perl: 0.8.9
libpve-storage-perl: 8.2.3
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.2.7-1
proxmox-backup-file-restore: 3.2.7-1
proxmox-firewall: 0.5.0
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.6
proxmox-widget-toolkit: 4.2.3
pve-cluster: 8.0.7
pve-container: 5.1.12
pve-docs: 8.2.3
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.1
pve-firewall: 5.0.7
pve-firmware: 3.13-1
pve-ha-manager: 4.0.5
pve-i18n: 3.2.2
pve-qemu-kvm: 9.0.2-2
pve-xtermjs: 5.3.0-3
qemu-server: 8.2.4
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.4-pve1

`zpool list -v && zpool status && zfs list` output:
Code:
NAME                                  SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
dev-server1                          1.81T   698G  1.13T        -         -    14%    37%  1.00x    ONLINE  -
  raidz2-0                           1.81T   698G  1.13T        -         -    14%  37.6%      -    ONLINE
    ata-CT500MX500SSD1_2210E615D037   466G      -      -        -         -      -      -      -    ONLINE
    ata-CT500MX500SSD1_2210E6166B1B   466G      -      -        -         -      -      -      -    ONLINE
    ata-CT500MX500SSD1_2210E6166B0D   466G      -      -        -         -      -      -      -    ONLINE
    ata-CT500MX500SSD1_2210E616684E   466G      -      -        -         -      -      -      -    ONLINE
  pool: dev-server1
 state: ONLINE
  scan: scrub repaired 0B in 00:10:16 with 0 errors on Wed Sep  4 20:10:17 2024
config:

        NAME                                 STATE     READ WRITE CKSUM
        dev-server1                          ONLINE       0     0     0
          raidz2-0                           ONLINE       0     0     0
            ata-CT500MX500SSD1_2210E615D037  ONLINE       0     0     0
            ata-CT500MX500SSD1_2210E6166B1B  ONLINE       0     0     0
            ata-CT500MX500SSD1_2210E6166B0D  ONLINE       0     0     0
            ata-CT500MX500SSD1_2210E616684E  ONLINE       0     0     0

errors: No known data errors
NAME                        USED  AVAIL  REFER  MOUNTPOINT
dev-server1                 567G   304G   256K  /dev-server1
dev-server1/vm-100-disk-0   142G   430G  16.4G  -
dev-server1/vm-100-disk-1  6.36M   304G   134K  -
dev-server1/vm-101-disk-0   142G   321G   125G  -
dev-server1/vm-101-disk-1  6.36M   304G   128K  -
dev-server1/vm-102-disk-0   142G   366G  80.0G  -
dev-server1/vm-102-disk-1  6.36M   304G   134K  -
dev-server1/vm-104-disk-0   142G   330G   116G  -
dev-server1/vm-104-disk-1  6.36M   304G   157K  -
 
Hi,

Thank you for the outputs!

To troubleshoot, I would:
- Test a backup to a different storage.
- Check whether the issue occurs only on VM 101.
- Check that the SSHFS mount is stable and accessible. You can test this by manually copying a large file to `/backup-shared/` and monitoring the network usage (see the sketch below).
- Check the network connection during the backup of VM 101.
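For example, something along these lines (the file name and size here are just placeholders):
Code:
# Write a ~10 GiB test file over the SSHFS mount and watch the transfer rate.
dd if=/dev/zero of=/backup-shared/sshfs-test.bin bs=1M count=10240 status=progress
# Remove the test file afterwards.
rm /backup-shared/sshfs-test.bin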
 
Thanks for your response.
  • Already tested with a different destination. The -5 error occurs at approximately the same percentage through the job.
  • Yes, just VM 101. The 3 other VMs on the same host back up fine.
  • The SSHFS mount seems stable. Other VMs also use it for backups, and touching a new file over it works.
  • Oh interesting:
On that last point, I noticed these entries are regularly written to `dmesg`:
Code:
[8052379.394801] vmbr0: port 3(fwpr101p0) entered blocking state
[8052379.394802] vmbr0: port 3(fwpr101p0) entered listening state
[8052379.407454] fwbr101i0: port 1(fwln101i0) entered blocking state
[8052379.407457] fwbr101i0: port 1(fwln101i0) entered disabled state
[8052379.407471] fwln101i0: entered allmulticast mode
[8052379.407498] fwln101i0: entered promiscuous mode
[8052379.407520] fwbr101i0: port 1(fwln101i0) entered blocking state
[8052379.407521] fwbr101i0: port 1(fwln101i0) entered listening state
[8052379.413076] fwbr101i0: port 2(tap101i0) entered blocking state
[8052379.413079] fwbr101i0: port 2(tap101i0) entered disabled state
[8052379.413090] tap101i0: entered allmulticast mode
[8052379.413130] fwbr101i0: port 2(tap101i0) entered blocking state
[8052379.413132] fwbr101i0: port 2(tap101i0) entered listening state
[8052394.891418] fwbr101i0: port 2(tap101i0) entered learning state
[8052394.891433] fwbr101i0: port 1(fwln101i0) entered learning state
[8052394.891438] vmbr0: port 3(fwpr101p0) entered learning state
[8052410.251592] vmbr0: port 3(fwpr101p0) entered forwarding state
[8052410.251598] vmbr0: topology change detected, sending tcn bpdu
[8052410.251649] fwbr101i0: port 1(fwln101i0) entered forwarding state
[8052410.251651] fwbr101i0: topology change detected, sending tcn bpdu
[8052410.251674] fwbr101i0: port 2(tap101i0) entered forwarding state
[8052410.251675] fwbr101i0: topology change detected, sending tcn bpdu
[8052410.251685] vmbr0: port 3(fwpr101p0) received tcn bpdu
[8052410.251686] vmbr0: topology change detected, sending tcn bpdu
Using the VMs and the host, it doesn't feel like there are any network connectivity problems. The VMs feel very smooth over VS Code, for example.
(I'm not seeing these messages on the other host in the cluster.)

`journalctl -b` shows nothing of interest currently.
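(In case it's useful, this is roughly how I'm checking it around a backup attempt; the time window below is just an example:)
Code:
# Warnings and errors from the current boot around the failed backup run.
journalctl -b -p warning --since "2024-09-05 11:58" --until "2024-09-05 12:05"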
 
Hi,

Thank you for the output!

The above entries look normal to me.

The error `err -5 - Input/output error` usually points to a problem with the disk. Since you have already tested backing up to a different storage and the other VMs back up fine, I would test the backup mode set to Snapshot or Suspend if possible.
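For example, something like this (using the same storage name as in your backup job; adjust as needed):
Code:
# Try the same backup explicitly in snapshot and suspend modes.
vzdump 101 --mode snapshot --storage backup-shared --compress zstd
vzdump 101 --mode suspend --storage backup-shared --compress zstd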
 
Thanks for your suggestions.

The error `err -5 - Input/output error` usually points to a problem with the disk
I too thought this would be related to the source storage of the VM's primary disk, though wouldn't ZFS be responsible for ensuring that it is corruption-free, particularly as the pool is RAID-Z2 based?

I would test the backup mode set to Snapshot or Suspend if possible.
The backup mode was already set to snapshot. Using the suspend and stop modes produces the same error at the same position.
 
After the weekend I've gone through my sysadmin mailbox and found that Friday, Saturday, and Sunday backups ran completely successfully. I can't be sure what I did on Friday (6th Sep) to resolve this, as all I did was run a few ZFS scrubs and remount the SSHFS destination that the other VMs were already backing up to without issue (roughly the commands sketched below).
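(The SSHFS remote host and path below are placeholders, not the real ones.)
Code:
# Re-run a scrub on the pool and check the result.
zpool scrub dev-server1
zpool status dev-server1
# Remount the SSHFS backup target (remote host/path illustrative).
fusermount -u /backup-shared
sshfs backup-user@backup-host:/srv/backups /backup-shared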

Something is mugging me off here. I'll come back if I find anything useful.
 
