some backups fail after upgrade

RobFantini

We upgraded on Fri 5/6/22. After that, some backup issues started to occur.

We have 5 nodes, and each has a mix of KVM and LXC guests.

On Friday, all 7 LXC backups on one node failed. The backup target was local storage. [Also on Friday, all PBS backups worked.]

I tried to back up one of the LXCs manually and the failure repeated:
Code:
INFO: starting new backup job: vzdump 126 --remove 0 --mode snapshot --notes-template '{{guestname}}' --compress zstd --storage z-local-nvme --node pve2
INFO: Starting Backup of VM 126 (lxc)
INFO: Backup started at 2022-05-07 15:32:16
INFO: status = running
INFO: CT Name: ona
INFO: including mount point rootfs ('/') in backup
INFO: found old vzdump snapshot (force removal)
2022-05-07T15:32:16.339-0400 7fe0c67fc700 -1 librbd::object_map::InvalidateRequest: 0x7fe0c000f520 should_complete: r=0
Removing snap: 100% complete...done.
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: create storage snapshot 'vzdump'
/dev/rbd7
mount: /mnt/vzsnap0: special device /dev/rbd-pve/220b9a53-4556-48e3-a73c-28deff665e45/nvme-4tb/vm-126-disk-0@vzdump does not exist.
umount: /mnt/vzsnap0/: not mounted.
command 'umount -l -d /mnt/vzsnap0/' failed: exit code 32
ERROR: Backup of VM 126 failed - command 'mount -o ro,noload /dev/rbd-pve/220b9a53-4556-48e3-a73c-28deff665e45/nvme-4tb/vm-126-disk-0@vzdump /mnt/vzsnap0//' failed: exit code 32
INFO: Failed at 2022-05-07 15:32:17
INFO: Backup job finished with errors
TASK ERROR: job errors
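
In case it helps narrow this down, these are the checks I plan to run next. The pool/image names are copied from the error above, and the /dev/rbd-pve path layout is just what the log shows, so treat this as a rough sketch rather than an exact procedure:
Code:
# does the stale 'vzdump' snapshot still exist on the image?
rbd snap ls nvme-4tb/vm-126-disk-0

# which RBD devices are currently mapped on this node?
rbd showmapped

# does the device node the mount error refers to actually exist?
ls -l /dev/rbd-pve/220b9a53-4556-48e3-a73c-28deff665e45/nvme-4tb/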

On Saturday 5/7, 4 nodes had no issues with PBS backups. The same node that had an issue on Friday had issues with two LXCs; 5 LXC backups worked.

Code:
# pveversion -v                      
proxmox-ve: 7.2-1 (running kernel: 5.15.30-2-pve)
pve-manager: 7.2-3 (running version: 7.2-3/c743d6c1)
pve-kernel-helper: 7.2-2
pve-kernel-5.15: 7.2-1
pve-kernel-5.13: 7.1-9
pve-kernel-5.11: 7.0-10
pve-kernel-5.15.30-2-pve: 5.15.30-3
pve-kernel-5.13.19-6-pve: 5.13.19-15
pve-kernel-5.13.19-2-pve: 5.13.19-4
pve-kernel-5.11.22-7-pve: 5.11.22-12
pve-kernel-5.11.22-1-pve: 5.11.22-2
ceph: 15.2.16-pve1
ceph-fuse: 15.2.16-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-8
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.1-6
libpve-guest-common-perl: 4.1-2
libpve-http-server-perl: 4.1-1
libpve-storage-perl: 7.2-2
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.12-1
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.1.8-1
proxmox-backup-file-restore: 2.1.8-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-10
pve-cluster: 7.2-1
pve-container: 4.2-1
pve-docs: 7.2-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.4-1
pve-ha-manager: 3.3-4
pve-i18n: 2.7-1
pve-qemu-kvm: 6.2.0-5
pve-xtermjs: 4.16.0-1
pve-zsync: 2.2.2
qemu-server: 7.2-2
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.4-pve1

If more info is needed let me know.
 
Interesting you should raise this. I upgraded yesterday and was just coming to the forum to ask about backup issues.

I have a 3.6TB guest which was backing up fine (it runs on Ceph). We upgraded our nodes yesterday and now I'm having to cancel the backup because it makes the guest unresponsive while it runs. I can see it looks like it's having to rebuild from scratch, but I thought that wouldn't be an issue.


Code:
INFO: starting new backup job: vzdump 111 --storage office --remove 0 --mode snapshot --node pve02 --notes-template '{{guestname}}'
INFO: Starting Backup of VM 111 (qemu)
INFO: Backup started at 2022-05-08 06:55:18
INFO: status = running
INFO: VM Name: shared2
INFO: include disk 'scsi0' 'ceph_data:vm-111-disk-3' 100G
INFO: include disk 'scsi1' 'ceph_data:vm-111-disk-0' 100G
INFO: include disk 'scsi2' 'ceph_data:vm-111-disk-1' 3500G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: snapshots found (not included into backup)
INFO: creating Proxmox Backup Server archive 'vm/111/2022-05-08T05:55:18Z'
INFO: started backup task 'bd1779ec-4357-4929-9bd7-5180d059a2b0'
INFO: resuming VM again
INFO: scsi0: dirty-bitmap status: created new
INFO: scsi1: dirty-bitmap status: created new
INFO: scsi2: dirty-bitmap status: created new
INFO:   0% (244.0 MiB of 3.6 TiB) in 3s, read: 81.3 MiB/s, write: 12.0 MiB/s
ERROR: interrupted by signal
INFO: aborting backup job
INFO: resuming VM again
ERROR: Backup of VM 111 failed - interrupted by signal
INFO: Failed at 2022-05-08 06:59:21
ERROR: Backup job failed - interrupted by signal
TASK ERROR: interrupted by signal
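
Since all three disks report 'dirty-bitmap status: created new', I assume the whole 3.6 TiB is being read again instead of just the changed blocks. As a possible stopgap I'm thinking about throttling the job; if I read the vzdump options correctly, something like this caps the read rate (the 100000 KiB/s value is only an example, not a recommendation):
Code:
# one-off run with a bandwidth cap (value in KiB/s)
vzdump 111 --storage office --mode snapshot --bwlimit 100000

# or set a default for all jobs in /etc/vzdump.conf
# bwlimit: 100000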

Here is the IOWait of our guest during backup:

[screenshot: guest IOWait graph during the backup]

And the strain isn't particularly excessive on Ceph either:

[screenshot: Ceph load graph during the backup]

I thought the whole idea of the Proxmox backup is that it doesn't have to talk to the guest and thus doesn't interfere?

Anyway, I wonder if this is related to a wider backup issue?

Chris.
 
pve2 - the node that had the backup issues some hours earlier - also had some trouble when it was rebooted to pick up the upgraded kernel.

After pve2 restarted, the 4 LXCs that use HA and were trying to migrate back to pve2 had this error:
Code:
task started by HA resource agent
run_buffer: 321 Script exited with status 32
lxc_init: 847 Failed to run lxc.hook.pre-start for container "607"
__lxc_start: 2008 Failed to initialize container "607"
TASK ERROR: startup for container '607' failed
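
Before the reboot we'll probably try starting one of the affected containers with debug logging to see why lxc.hook.pre-start fails; as far as I know either of these should give more detail (container 607 from the log above is just the example):
Code:
# start with verbose debug output via Proxmox
pct start 607 --debug

# or via lxc-start directly, writing a debug log
lxc-start -n 607 -F -l DEBUG -o /tmp/lxc-607.log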

We'll try another reboot of that node to see if that helps tonight's PBS backups.
 