Backup job locks up PVE 7.0

jermudgeon

6-node cluster, all running latest PVE and updated. Underlying VM/LXC storage is ceph. Backups -> cephfs.

In the backup job, syncfs fails, and then the following things happen.

• The node and container icons in the GUI have a grey question mark, but no functions of the UI itself appear to fail
• The backup job cannot complete or be aborted
• The VMs cannot be migrated
• The node cannot be safely restarted and must be hard reset
• pct functions fail (cannot shutdown, migrate, enter)
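When it gets into this state, it may be worth checking whether the vzdump/tar processes (or the container's own processes) are stuck in uninterruptible sleep on Ceph I/O, which would also explain why the node can't be cleanly rebooted. A rough diagnostic sketch using standard Linux tooling, nothing Proxmox-specific:

Code:
# Processes stuck in uninterruptible sleep (D state) and what they are waiting on
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /^D/'

# Kernel-side view of blocked tasks (requires sysrq to be enabled)
echo w > /proc/sysrq-trigger
dmesg | tail -n 100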

This is new behavior in Proxmox 7.0. It's happening repeatedly on multiple nodes. There does not always appear to be a clear pattern of specific containers being affected, but I will investigate this more thoroughly.

Code:
INFO: starting new backup job: vzdump --quiet 1 --pool backup --compress zstd --mailnotification failure --mailto jeremy@idealoft.net --mode snapshot --storage cephfs
INFO: filesystem type on dumpdir is 'ceph' -using /var/tmp/vzdumptmp1254910_102 for temporary files
INFO: Starting Backup of VM 102 (lxc)
INFO: Backup started at 2021-07-30 00:00:02
INFO: status = running
INFO: CT Name: xxxx
INFO: including mount point rootfs ('/') in backup
INFO: excluding bind mount point mp0 ('/mnt/xxxx') from backup (not a volume)
INFO: found old vzdump snapshot (force removal)
rbd error: error setting snapshot context: (2) No such file or directory
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: create storage snapshot 'vzdump'
Creating snap: 10% complete...
Creating snap: 100% complete...done.
/dev/rbd4
INFO: creating vzdump archive '/mnt/pve/cephfs/dump/vzdump-lxc-102-2021_07_30-00_00_02.tar.zst'
INFO: Total bytes written: 2504898560 (2.4GiB, 6.4MiB/s)
INFO: archive file size: 686MB
INFO: prune older backups with retention: keep-daily=7, keep-last=22, keep-monthly=1, keep-weekly=4, keep-yearly=1
INFO: pruned 0 backup(s)
INFO: cleanup temporary 'vzdump' snapshot
Removing snap: 100% complete...done.
INFO: Finished Backup of VM 102 (00:06:24)
INFO: Backup finished at 2021-07-30 00:06:26
INFO: filesystem type on dumpdir is 'ceph' -using /var/tmp/vzdumptmp1254910_109 for temporary files
INFO: Starting Backup of VM 109 (lxc)
INFO: Backup started at 2021-07-30 00:06:26
INFO: status = running
INFO: CT Name: xxxxx
INFO: including mount point rootfs ('/') in backup
INFO: excluding bind mount point mp0 ('/mnt/xxxxx') from backup (not a volume)
INFO: excluding bind mount point mp2 ('/mnt/xxxxx') from backup (not a volume)
INFO: found old vzdump snapshot (force removal)
rbd error: error setting snapshot context: (2) No such file or directory
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: create storage snapshot 'vzdump'
Creating snap: 10% complete...
Creating snap: 100% complete...done.
/dev/rbd4
INFO: creating vzdump archive '/mnt/pve/cephfs/dump/vzdump-lxc-109-2021_07_30-00_06_26.tar.zst'
INFO: Total bytes written: 3282534400 (3.1GiB, 2.5MiB/s)
INFO: archive file size: 1.16GB
INFO: prune older backups with retention: keep-daily=7, keep-last=22, keep-monthly=1, keep-weekly=4, keep-yearly=1
INFO: pruned 0 backup(s)
INFO: cleanup temporary 'vzdump' snapshot
Removing snap: 100% complete...done.
INFO: Finished Backup of VM 109 (00:22:55)
INFO: Backup finished at 2021-07-30 00:29:21
INFO: filesystem type on dumpdir is 'ceph' -using /var/tmp/vzdumptmp1254910_113 for temporary files
INFO: Starting Backup of VM 113 (lxc)
INFO: Backup started at 2021-07-30 00:29:21
INFO: status = running
INFO: CT Name: xxxxx
INFO: including mount point rootfs ('/') in backup
INFO: excluding bind mount point mp0 ('/mnt/xxxxx') from backup (not a volume)
INFO: excluding bind mount point mp1 ('/mnt/xxxxx') from backup (not a volume)
INFO: including mount point mp2 ('/db') in backup
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: suspend vm to make snapshot
INFO: create storage snapshot 'vzdump'
syncfs '/' failed - Input/output error
 
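For what it's worth, the "found old vzdump snapshot (force removal)" lines mean a leftover 'vzdump' snapshot from an earlier aborted run was still present on the RBD image. A rough sketch of checking for and cleaning up such leftovers by hand; the pool and image names below are placeholders, the real ones are in the rootfs line of 'pct config <vmid>':

Code:
# Snapshots Proxmox knows about for the container
pct listsnapshot 102

# Snapshots present on the underlying RBD image (pool/image names are placeholders)
rbd snap ls ceph-lxc/vm-102-disk-0

# Remove a stale 'vzdump' snapshot if one is still listed
rbd snap rm ceph-lxc/vm-102-disk-0@vzdump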
Interestingly, for the record, there are rbd errors at the time of the failure:
Code:
Jul 30 00:29:04 quarb kernel: [144026.433830] rbd: rbd2: write at objno 1056 2686976~40960 result -108
Jul 30 00:29:04 quarb kernel: [144026.433841] rbd: rbd2: write result -108
Jul 30 00:29:04 quarb kernel: [144026.436982] rbd: rbd2: write at objno 1280 36864~4096 result -108
Jul 30 00:29:04 quarb kernel: [144026.437902] rbd: rbd2: write result -108
Jul 30 00:29:04 quarb kernel: [144026.439598] EXT4-fs warning (device rbd2): ext4_end_bio:342: I/O error 10 writing to inode 393974 starting block 1310730)
<repeats a few times a minute for about 1 hour, then stops>

Ceph health stays OK throughout this period. I'm running the latest 16.2.5 Pacific packages.
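A note on the -108: the kernel rbd client reports negated errno values, and 108 is ESHUTDOWN ("Cannot send after transport endpoint shutdown"), which generally means the client's session to the cluster went away, for example after being blocklisted. A quick, hedged way to confirm the mapping and check for blocklisted clients (Pacific calls it 'blocklist'; older releases use 'blacklist'):

Code:
# Confirm what errno 108 means on this kernel
python3 -c "import errno, os; print(errno.errorcode[108], '-', os.strerror(108))"

# Check whether any client addresses are currently blocklisted by the cluster
ceph osd blocklist ls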
 
@dietmar There's a mix; currently they are on separate VLANs, and on some hosts they share physical ports (I'm using openvswitch). No other cluster operations are failing. The problem appears to be intermittent, but it may also be narrowed down to only one or two LXCs.
 
We are also seeing this issue.

Code:
proxmox-ve: 7.0-2 (running kernel: 5.11.22-3-pve)
pve-manager: 7.0-11 (running version: 7.0-11/63d82f4e)
pve-kernel-5.11: 7.0-7
pve-kernel-helper: 7.0-7
pve-kernel-5.11.22-4-pve: 5.11.22-8
pve-kernel-5.11.22-3-pve: 5.11.22-7
pve-kernel-5.11.22-1-pve: 5.11.22-2
ceph: 16.2.5-pve1
ceph-fuse: 16.2.5-pve1
corosync: 3.1.2-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.21-pve1
libproxmox-acme-perl: 1.3.0
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.0-4
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-6
libpve-guest-common-perl: 4.0-2
libpve-http-server-perl: 4.0-2
libpve-storage-perl: 7.0-10
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.9-4
lxcfs: 4.0.8-pve2
novnc-pve: 1.2.0-3
proxmox-backup-client: 2.0.9-2
proxmox-backup-file-restore: 2.0.9-2
proxmox-mini-journalreader: 1.2-1
proxmox-widget-toolkit: 3.3-6
pve-cluster: 7.0-3
pve-container: 4.0-9
pve-docs: 7.0-5
pve-edk2-firmware: 3.20200531-1
pve-firewall: 4.2-2
pve-firmware: 3.3-1
pve-ha-manager: 3.3-1
pve-i18n: 2.4-1
pve-qemu-kvm: 6.0.0-3
pve-xtermjs: 4.12.0-1
qemu-server: 7.0-13
smartmontools: 7.2-pve2
spiceterm: 3.2-2
vncterm: 1.7-1
zfsutils-linux: 2.0.5-pve1

Looking into it further, it looks like snapshot deletions are causing slowdowns.

[screenshots attached]

The Ceph OSD drives are NVMe, and they show this error:

[screenshot attached]

We have upgraded to the latest kernel and the latest Ceph Pacific release, and it seems better, but there are still issues with snapshots/backups locking VMs up for extended periods of time. One other thing that was mentioned is that it could have something to do with cgroups v2 and how they handle the thaw process. We are currently testing compacting the RocksDB on all of our NVMe OSDs to see if that helps.
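In case it helps anyone trying the same, a rough sketch of the online compaction we are testing; this is the standard 'ceph tell ... compact' admin command, the OSD ids are just examples, and compaction can add latency while it runs:

Code:
# Compact a single OSD's RocksDB while it stays in service
ceph tell osd.0 compact

# Or walk through every OSD one at a time
for id in $(ceph osd ls); do ceph tell osd."$id" compact; done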

The other thing we have done to see if it helps is turning off swap altogether. The next thing we will test is tuning down the snap trim settings from their defaults:

Code:
ceph tell 'osd.*' injectargs --osd-snap-trim-sleep=3
ceph tell 'osd.*' injectargs --osd-max-trimming-pgs=1
ceph tell 'osd.*' injectargs --osd-pg-max-concurrent-snap-trims=1
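If anyone wants to verify that the injected values actually took effect, something like this should show them on a running OSD (osd.0 is just an example id):

Code:
# Effective values as reported by a running OSD
ceph config show osd.0 | grep -E 'snap_trim|trimming'

# Or query the daemon's admin socket on the host where osd.0 runs
ceph daemon osd.0 config get osd_snap_trim_sleep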
 
We ran the commands to tune down snap trimming:

Code:
ceph tell 'osd.*' injectargs --osd-snap-trim-sleep=3
ceph tell 'osd.*' injectargs --osd-max-trimming-pgs=1
ceph tell 'osd.*' injectargs --osd-pg-max-concurrent-snap-trims=1

That appears to have fixed our issues with slow ops showing up in the logs and with the random freezing when snapshots go through. Now the only issue is that the snaptrims are taking a long time to actually complete. We might have to adjust the values a bit further.
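One caveat worth mentioning: injectargs only changes the running daemons, so these values are lost whenever an OSD restarts. If they turn out to help long term, a sketch of persisting them through the monitor config database instead (same option names, using the 'ceph config' interface available in Pacific):

Code:
# Store the snap trim throttles for all OSDs in the mon config database
ceph config set osd osd_snap_trim_sleep 3
ceph config set osd osd_max_trimming_pgs 1
ceph config set osd osd_pg_max_concurrent_snap_trims 1

# Review what is stored
ceph config dump | grep -i trim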

Does anyone else have ideas on what the optimal values are for these on NVMe storage?
 
