Proxmox backup hangs pruning older backups

kyriazis

Hello,

I have a problem when backing up to a ceph cluster of spinning disks.

I have a cluster of 27 server-class nodes with 60 OSDs on a 10-gig network. If I back up ~10 VMs/CTs, it works fine. Upping that number to ~20, the backup grinds to a halt (write bandwidth in the KB/s range) but eventually finishes.

Doing a full backup of ~70 CTs/VMs hangs. All backup tasks are stuck pruning older backups, and the bandwidth to the disks is (again) in the KB/s range.

A (temporary) solution is to restart the MDS server. That gets some of the pruning unstuck, but it gets stuck again soon thereafter.
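For reference, the restart is just the MDS systemd unit on the node running the active MDS; the instance name below is an assumption (on my setup it matches the hostname):

Code:
# restart the active MDS (instance name assumed to match the hostname)
systemctl restart ceph-mds@$(hostname).service
# then watch the filesystem come back
ceph fs status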

Here's the log of one of the backup jobs:

Code:
INFO: starting new backup job: vzdump 163 100 102 103 104 105 106 107 108 109 110 112 113 117 119 120 122 123 124 125 126 127 128 129 131 133 134 135 136 137 138 139 140 141 143 144 147 148 149 152 153 154 155 156 157 159 160 161 162 164 165 132 167 158 166 169 168 172 173 193 171 178 180 182 130 174 175 183 185 186 188 189 190 191 200 201 202 181 195 196 199 197 192 194 101 111 115 142 145 --all 0 --compress zstd --prune-backups 'keep-daily=3,keep-monthly=3,keep-weekly=3' --quiet 1 --storage ceph --mailnotification failure --mailto george.kyriazis@intel.com --node vis-clx-00 --mode snapshot
INFO: filesystem type on dumpdir is 'ceph' -using /var/tmp/vzdumptmp3772435_106 for temporary files
INFO: Starting Backup of VM 106 (lxc)
INFO: Backup started at 2022-12-01 17:32:33
INFO: status = running
INFO: CT Name: vis-ct-clx-00
INFO: including mount point rootfs ('/') in backup
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: create storage snapshot 'vzdump'
  Logical volume "snap_vm-106-disk-0_vzdump" created.
INFO: creating vzdump archive '/mnt/pve/ceph/dump/vzdump-lxc-106-2022_12_01-17_32_33.tar.zst'
INFO: Total bytes written: 168008232960 (157GiB, 63MiB/s)
INFO: archive file size: 68.53GB
INFO: prune older backups with retention: keep-daily=3, keep-monthly=3, keep-weekly=3

This used to work earlier this year; the problem started a few months ago, after some upgrade. The current Ceph version is 17.2.5.
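In case it helps isolate the problem: the prune step can be reproduced without running a backup. A sketch, assuming I have the pvesm syntax right (storage name and VMID are from my setup):

Code:
# dry-run the same retention policy against the 'ceph' storage for one guest
pvesm prune-backups ceph --vmid 106 --keep-daily 3 --keep-weekly 3 --keep-monthly 3 --dry-run 1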

Thanks!!
 
Hi,
are you backing up guests from multiple nodes at the same time? If so, maybe you should split the job into multiple, time-shifted ones. It could be that CephFS can't handle all the load. There is a feature request open to automatically limit the number of parallel backups: https://bugzilla.proxmox.com/show_bug.cgi?id=3086
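For example, two time-shifted cron entries instead of one big --all job; a rough sketch, where the VMID lists and times are placeholders you would adapt:

Code:
# /etc/cron.d/vzdump-split (sketch; VMIDs and schedule are placeholders)
0 1 * * * root vzdump 100 101 102 --storage ceph --compress zstd --mode snapshot --quiet 1 --prune-backups 'keep-daily=3,keep-weekly=3,keep-monthly=3'
0 4 * * * root vzdump 103 104 105 --storage ceph --compress zstd --mode snapshot --quiet 1 --prune-backups 'keep-daily=3,keep-weekly=3,keep-monthly=3'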

If not, please post the output of pveversion -v. Do you see any errors or warnings in the Ceph (MDS or OSD) logs?
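Some places to look (the MDS instance name is a placeholder):

Code:
ceph status
ceph health detail          # reports e.g. slow MDS requests
ceph fs status
journalctl -u ceph-mds@<mds-name> --since "1 hour ago"
grep -iE 'slow|warn|err' /var/log/ceph/ceph-mds.<mds-name>.log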
 
Yes, I am backing up multiple nodes at the same time. This used to work, however: things would grind to a halt but eventually succeed. Now they just hang.

Output of pveversion -v below:

Code:
# pveversion -v
proxmox-ve: 7.2-1 (running kernel: 5.15.64-1-pve)
pve-manager: 7.2-11 (running version: 7.2-11/b76d3178)
pve-kernel-5.15: 7.2-13
pve-kernel-helper: 7.2-13
pve-kernel-5.11: 7.0-10
pve-kernel-5.15.64-1-pve: 5.15.64-1
pve-kernel-5.15.60-1-pve: 5.15.60-1
pve-kernel-5.11.22-7-pve: 5.11.22-12
pve-kernel-5.11.22-1-pve: 5.11.22-2
ceph: 17.2.5-pve1
ceph-fuse: 17.2.5-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve1
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-4
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.2-3
libpve-guest-common-perl: 4.1-4
libpve-http-server-perl: 4.1-4
libpve-storage-perl: 7.2-10
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.0-3
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.2.7-1
proxmox-backup-file-restore: 2.2.7-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.1
pve-cluster: 7.2-2
pve-container: 4.2-3
pve-docs: 7.2-2
pve-edk2-firmware: 3.20220526-1
pve-firewall: 4.2-6
pve-firmware: 3.5-6
pve-ha-manager: 3.4.0
pve-i18n: 2.7-2
pve-qemu-kvm: 7.0.0-4
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.6-pve1

Thank you!
 
OK, understood. However, to make the GUI more user-friendly, I would recommend the following:

Currently, in the edit-backup-job popup, there is a tick box in the left column that lets you select all guests in one click. That gives the impression that it's OK to do that.

If this is not the preferred way, then there should be an easy way for Proxmox to "do the right thing": either by throttling backup jobs so no more than N run at the same time, or by some method to automatically split/partition backups. It's much easier to have one place where you can check which backups you want to run, as opposed to manually splitting them into several backup jobs and mentally keeping track of which job has which checkboxes ticked, not to mention doing trial runs to figure out the best time to run said jobs so they don't overlap.
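For what it's worth, the closest stop-gap I can come up with is serializing manually split jobs on a shared lock. A crude sketch, assuming flock works on the shared CephFS mount (the lock path and VMID lists are made up):

Code:
# both entries fire at the same time; the lock lets only one vzdump run at once,
# across nodes too, since the lock file lives on the shared CephFS mount
0 1 * * * root flock /mnt/pve/ceph/vzdump.lock vzdump 100 101 --storage ceph --mode snapshot --quiet 1
0 1 * * * root flock /mnt/pve/ceph/vzdump.lock vzdump 102 103 --storage ceph --mode snapshot --quiet 1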

Additionally, it would be nice to have a search box for choosing which VMs/LXCs are displayed in the window below. That way a user could partition backups by (say) name, node, IDs, etc.

Thank you

George
 
OK, understood. However, to make the GUI more user-friendly, I would recommend the following:

Currently, in the edit-backup-job popup, there is a tick box in the left column that lets you select all guests in one click. That gives the impression that it's OK to do that.
It can be okay if you don't have too many nodes.

If this is not the preferred way, then there should be an easy way for Proxmox to "do the right thing": either by throttling backup jobs so no more than N run at the same time, or by some method to automatically split/partition backups. It's much easier to have one place where you can check which backups you want to run, as opposed to manually splitting them into several backup jobs and mentally keeping track of which job has which checkboxes ticked, not to mention doing trial runs to figure out the best time to run said jobs so they don't overlap.
That's what the mentioned feature request is about: https://bugzilla.proxmox.com/show_bug.cgi?id=3086

Additionally, it would be nice to have a search box for choosing which VMs/LXCs are displayed in the window below. That way a user could partition backups by (say) name, node, IDs, etc.
The list is already filtered if you configure the job only for a single node rather than all nodes. Feel free to open a feature request on our bugtracker for the others.
 
It can be okay if you don't have too many nodes.
But if you do have many nodes ...

That's what the mentioned feature request is about: https://bugzilla.proxmox.com/show_bug.cgi?id=3086
That feature request is about communicating with a backup server (PBS). Not sure if things are different when using the built-in backup utility to back up to local Ceph storage.

The list is already filtered if you configure the job only for a single node rather than all nodes. Feel free to open a feature request on our bugtracker for the others.
Oh yeah, about that: it's nice that there is an option to select a single node, but it would be even more useful if you could select multiple nodes per backup job. Thanks for the suggestion of filing a feature request; I will probably do that.

Cheers!

George
 
That feature request is about communicating with a backup server (PBS). Not sure if things are different when using the built-in backup utility to back up to local Ceph storage.
It started out as a PBS-specific feature request, but much of the later discussion is about PVE and other backup storages too ;)
 
