IO Error in Guest VM after error during backup to PBS

reto

Situation:
This weekend I installed TrueNAS in a Proxmox VM for some tests and attached a 1 TB disk to the VM. Unfortunately I forgot to exclude this disk from backups, and my nightly backup2pbs job filled up the backup storage a few hours later.

At the moment the backup failed, the guest VM reported an I/O error and the ZFS pool inside the guest VM went into the DEGRADED state.

Host logs:

Code:
May 11 00:51:54 saturn pvescheduler[1850224]: INFO: Starting Backup of VM 109 (qemu)
  [ some more pvestatd ignored ]
May 11 00:56:57 saturn pvestatd[1421]: status update time (11.885 seconds)
May 11 00:57:25 saturn pvescheduler[1850224]: ERROR: Backup of VM 109 failed - backup write data failed: command error: write_data upload error: pipelined request failed: >
May 11 00:57:25 saturn pvescheduler[1850224]: INFO: Starting Backup of VM 110 (qemu)
May 11 00:57:25 saturn pvestatd[1421]: status update time (6.756 seconds)
May 11 00:57:33 saturn pvescheduler[1850224]: ERROR: Backup of VM 110 failed - backup write data failed: command error: write_data upload error: pipelined request failed: >

Guest VM logs (QEMU VM 109):

Code:
May 11 00:51:54 truenas qemu-ga[2002]: info: guest-ping called
May 11 00:51:54 truenas qemu-ga[2002]: info: guest-fsfreeze called
May 11 00:57:25 truenas kernel: I/O error, dev vda, sector 55762872 op 0x1:(WRITE) flags 0x100 phys_seg 13 prio class 2
May 11 00:57:25 truenas kernel: zio pool=boot-pool vdev=/dev/vda3 error=5 type=2 offset=28010573824 size=53248 flags=1074267264
May 11 00:57:25 truenas zed[203802]: eid=22 class=io pool='boot-pool' vdev=vda3 size=53248 offset=28010573824 priority=3 err=5 flags=0x40080480 delay=545ms
May 11 00:57:25 truenas zed[203808]: eid=24 class=io pool='boot-pool' vdev=vda3 size=4096 offset=28010618880 priority=3 err=5 flags=0x384080 bookmark=0:0:0:50
May 11 00:57:25 truenas zed[203810]: eid=23 class=io pool='boot-pool' vdev=vda3 size=4096 offset=28010622976 priority=3 err=5 flags=0x384080 bookmark=0:0:0:4
May 11 00:57:25 truenas zed[203814]: eid=26 class=io pool='boot-pool' vdev=vda3 size=12288 offset=28010602496 priority=3 err=5 flags=0x384080 bookmark=0:139:0:0
May 11 00:57:25 truenas zed[203815]: eid=25 class=io pool='boot-pool' vdev=vda3 size=4096 offset=28010614784 priority=3 err=5 flags=0x384080 bookmark=0:1626:0:0
May 11 00:57:25 truenas zed[203822]: eid=28 class=io pool='boot-pool' vdev=vda3 size=1536 offset=28010594304 priority=3 err=5 flags=0x384080 bookmark=0:52:0:0
May 11 00:57:25 truenas zed[203823]: eid=27 class=io pool='boot-pool' vdev=vda3 size=2048 offset=28010598400 priority=3 err=5 flags=0x384080 bookmark=0:69:0:0
May 11 00:57:25 truenas zed[203827]: eid=30 class=io pool='boot-pool' vdev=vda3 size=4096 offset=28010586112 priority=3 err=5 flags=0x380080 bookmark=0:0:-1:0
May 11 00:57:25 truenas zed[203826]: eid=29 class=io pool='boot-pool' vdev=vda3 size=4096 offset=28010590208 priority=3 err=5 flags=0x384080 bookmark=0:72:0:0
May 11 00:57:25 truenas zed[203832]: eid=31 class=io pool='boot-pool' vdev=vda3 size=4096 offset=28010582016 priority=3 err=5 flags=0x380080 bookmark=0:0:1:0
May 11 00:57:25 truenas zed[203835]: eid=32 class=io pool='boot-pool' vdev=vda3 size=4096 offset=28010577920 priority=3 err=5 flags=0x380080 bookmark=0:0:0:1
May 11 00:57:25 truenas zed[203833]: eid=33 class=io pool='boot-pool' vdev=vda3 size=4096 offset=28010573824 priority=3 err=5 flags=0x380080 bookmark=0:0:0:2
May 11 00:57:40 truenas zed[204236]: eid=34 class=statechange pool='boot-pool' vdev=vda3 vdev_state=FAULTED
May 11 00:57:40 truenas zed[204246]: eid=35 class=statechange pool='boot-pool' vdev=vda3 vdev_state=DEGRADED
May 11 00:57:41 truenas find_alias_for_smtplib.py[204279]: sending mail to
                                                           To: root
                                                           Subject: ZFS device fault for pool boot-pool on truenas

The guest VM is for tests only, so this is no big problem.

However, this is the second time I have noticed I/O errors after a failed backup. The previous time, a firewall misconfiguration caused a network outage (all packets were dropped), and the guest VM no longer booted afterwards. Fortunately, this happened during a migration window and I had a valid snapshot from shortly before the problem.

I don't know how hard this kind of problem is to reproduce; maybe my case (or cases) was just a fluke. But I wanted to let you know that there might be some issues below the surface.

In both cases the VM was being backed up to a PBS, and I think I used mode=snapshot both times. Both times the guest was a Linux VM. I think the guest filesystem was ext4 the other time.

Versions:
pve-manager/8.1.10/4b06efb5db453f29 (running kernel: 6.5.13-5-pve)
Proxmox Backup Server 3.2-2
 
During a snapshot backup, new writes (inside the VM) need to be delayed or stored elsewhere. If something goes wrong with the backup and those writes are not performed, then the virtual disk does not have the contents that the system inside the VM expects. Other people have reported similar issues, where the writes were not applied after a failed backup.
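
To make the failure mode described above concrete, here is a minimal toy model in Python. It is only a sketch under stated assumptions, not how QEMU or Proxmox actually implement backups: `ToyBackupTarget`, `BackupTargetFull` and `guest_write` are hypothetical names, and the "disk" is just a dictionary of blocks. It shows a guest write being rejected because the old block could not be saved to a full backup target first.

```python
class BackupTargetFull(Exception):
    """Raised by the toy backup target when it has no space left."""


class ToyBackupTarget:
    """Hypothetical stand-in for a backup datastore with limited space."""

    def __init__(self, capacity_blocks):
        self.capacity_blocks = capacity_blocks
        self.blocks = {}

    def store(self, index, data):
        # Refuse new blocks once the target is full.
        if index not in self.blocks and len(self.blocks) >= self.capacity_blocks:
            raise BackupTargetFull("no space left on backup target")
        self.blocks[index] = data


def guest_write(disk, target, index, new_data):
    """Apply a guest write while a snapshot-style backup is running.

    The old block must reach the backup target before it is overwritten.
    If that fails, the write is not applied and the guest sees an I/O error.
    """
    if index not in target.blocks:
        try:
            target.store(index, disk[index])
        except BackupTargetFull:
            return "EIO"
    disk[index] = new_data
    return "ok"


disk = {0: "old-0", 1: "old-1"}
target = ToyBackupTarget(capacity_blocks=1)

first = guest_write(disk, target, 0, "new-0")   # old block saved, write applied
second = guest_write(disk, target, 1, "new-1")  # target full, write rejected

print(first, second, disk)
# ok EIO {0: 'new-0', 1: 'old-1'}
```

In this model, block 1 still holds its old contents even though the guest tried to write it, which is the kind of mismatch a filesystem inside the VM would report as an I/O error.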

Proxmox recently added support for fleecing, where new writes always go to the virtual disk but the previous data is first copied to some other place (so the original can still be backed up): https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_vm_backup_fleecing
Be aware that there are reports about storage getting full when backups fail, as the fleeced data is not removed immediately.
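
The fleecing idea and the storage caveat can be sketched with a similar toy model. Again, this is only an illustration under assumptions: `guest_write_with_fleecing` and `finish_backup` are hypothetical helpers, the `fleece` dictionary stands in for a local fleecing image, and the cleanup behaviour on failure is modelled from the reports mentioned above rather than from the actual implementation.

```python
def guest_write_with_fleecing(disk, fleece, already_backed_up, index, new_data):
    """Guest write while a backup with a local fleecing area is running.

    The old block is copied to the local fleecing area (not to the remote
    backup target), so the write does not depend on the backup target.
    """
    if index not in already_backed_up and index not in fleece:
        fleece[index] = disk[index]
    disk[index] = new_data


def finish_backup(fleece, succeeded):
    """Return how many fleeced blocks are still occupying local storage.

    Assumption for this sketch: fleeced data is dropped right away only
    when the backup succeeded.
    """
    if succeeded:
        fleece.clear()
    return len(fleece)


disk = {0: "old-0", 1: "old-1", 2: "old-2"}
fleece = {}
already_backed_up = {2}  # block 2 was already copied by the backup job

for index in (0, 1, 2):
    guest_write_with_fleecing(disk, fleece, already_backed_up, index, f"new-{index}")

leftover = finish_backup(fleece, succeeded=False)

print(disk)
print(fleece, leftover)
```

Every guest write lands on the disk regardless of what happens to the backup, but after a failed run the fleeced copies of blocks 0 and 1 are still present, which is where the extra storage usage would come from.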
 