VM hangs/freezes during backup

vsquare

Mar 25, 2022
Hey Guys!

I've been running into an issue with a PBS instance configured to take backups from my PVE cluster: random VMs hang/freeze during the backup, with one core pinned at 100% CPU, until I hit reset on the VM itself.

I saw a few threads where people had similar issues caused by the qemu-guest-agent during fsfreeze/thaw. However, my issue seems to happen both on VMs that have the agent installed and on ones where it's not installed.

Has anyone seen this happen before? Is there anything I can try to induce the failure to aid with debugging, or are there any tips for preventing it?

Details of my configuration below.

Cluster: 5 nodes
CPU: AMD EPYC 7282
Storage: Ceph (proxmox managed, librbd)
Network: openvswitch

Code:
INFO: Starting Backup of VM 135 (qemu)
INFO: Backup started at 2022-03-25 03:45:16
INFO: status = running
INFO: include disk 'scsi0' 'SSD:vm-135-disk-0'
INFO: backup mode: snapshot
INFO: bandwidth limit: 75000 KB/s
INFO: ionice priority: 7
INFO: snapshots found (not included into backup)
INFO: creating Proxmox Backup Server archive 'vm/135/2022-03-24T19:45:16Z'
INFO: skipping guest-agent 'fs-freeze', agent configured but not running?
INFO: enabling encryption
INFO: started backup task '7d80f5a8-6833-4883-a800-671a9b3b9d1b'
INFO: resuming VM again
INFO: scsi0: dirty-bitmap status: OK (1.8 GiB of 150.0 GiB dirty)
INFO: using fast incremental mode (dirty-bitmap), 1.8 GiB dirty of 150.0 GiB total
INFO:  11% (220.0 MiB of 1.8 GiB) in 3s, read: 73.3 MiB/s, write: 73.3 MiB/s
<snip>
INFO: 100% (1.8 GiB of 1.8 GiB) in 3m 13s, read: 1.3 MiB/s, write: 1.3 MiB/s
INFO: backup was done incrementally, reused 148.18 GiB (98%)
INFO: transferred 1.84 GiB in 238 seconds (7.9 MiB/s)
INFO: Finished Backup of VM 135 (00:04:08)
INFO: Backup finished at 2022-03-25 03:49:24
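
For anyone comparing several of these task logs, here is a minimal Python sketch that pulls out the three things that seem relevant here: whether the guest-agent fs-freeze was skipped, the configured bandwidth limit, and the average transfer rate. The function name `summarize_backup_log` and the regexes are my own and are only matched against the log lines shown above. It assumes the "KB/s" in the bandwidth limit line means KiB/s and that the transferred size is reported in GiB, so treat it as a starting point rather than a complete parser.

```python
import re

# Patterns written against the log lines quoted in this post only.
GUEST_AGENT_SKIP = re.compile(r"skipping guest-agent 'fs-freeze'")
BWLIMIT = re.compile(r"bandwidth limit: (\d+) KB/s")
TRANSFERRED = re.compile(r"transferred ([\d.]+) GiB in (\d+) seconds")


def summarize_backup_log(text):
    """Return a dict of facts pulled from a backup task log (hypothetical helper)."""
    summary = {"fsfreeze_skipped": False, "bwlimit_mib_s": None, "avg_mib_s": None}
    for line in text.splitlines():
        if GUEST_AGENT_SKIP.search(line):
            summary["fsfreeze_skipped"] = True
        m = BWLIMIT.search(line)
        if m:
            # Assumption: the logged "KB/s" value is in KiB/s.
            summary["bwlimit_mib_s"] = int(m.group(1)) / 1024
        m = TRANSFERRED.search(line)
        if m:
            gib, seconds = float(m.group(1)), int(m.group(2))
            summary["avg_mib_s"] = gib * 1024 / seconds
    return summary


# Three lines copied from the VM 135 log above.
SAMPLE = """\
INFO: bandwidth limit: 75000 KB/s
INFO: skipping guest-agent 'fs-freeze', agent configured but not running?
INFO: transferred 1.84 GiB in 238 seconds (7.9 MiB/s)
"""

if __name__ == "__main__":
    print(summarize_backup_log(SAMPLE))
```

For the VM 135 log this flags that no fs-freeze was issued during that run, and puts the limit at roughly 73.2 MiB/s, which is close to the 73.3 MiB/s read rate reported in the first progress line.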

Code:
root@vmh1:~# pveversion -v
proxmox-ve: 7.1-1 (running kernel: 5.13.19-2-pve)
pve-manager: 7.1-8 (running version: 7.1-8/5b267f33)
pve-kernel-helper: 7.1-6
pve-kernel-5.13: 7.1-5
pve-kernel-5.11: 7.0-10
pve-kernel-5.13.19-2-pve: 5.13.19-4
pve-kernel-5.11.22-7-pve: 5.11.22-12
pve-kernel-5.11.22-4-pve: 5.11.22-9
ceph: 16.2.7
ceph-fuse: 16.2.7
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.0
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-5
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-14
libpve-guest-common-perl: 4.0-3
libpve-http-server-perl: 4.0-4
libpve-storage-perl: 7.0-15
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.11-1
lxcfs: 4.0.11-pve1
novnc-pve: 1.3.0-1
openvswitch-switch: 2.15.0+ds1-2
proxmox-backup-client: 2.1.2-1
proxmox-backup-file-restore: 2.1.2-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-4
pve-cluster: 7.1-3
pve-container: 4.1-3
pve-docs: 7.1-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.3-4
pve-ha-manager: 3.3-1
pve-i18n: 2.6-2
pve-qemu-kvm: 6.1.0-3
pve-xtermjs: 4.12.0-1
qemu-server: 7.1-4
smartmontools: 7.2-1
spiceterm: 3.2-2
swtpm: 0.7.0~rc1+2
vncterm: 1.7-1
zfsutils-linux: 2.1.1-pve3
 
I want to keep an eye on this. There is a definite issue when there is a big hit on IO.
 
