Hey Guys!
I've been running into an issue: I have a PBS instance configured to take backups from my PVE cluster, but random VMs will hang/freeze with their CPU spinning at 100% on one core until I hit reset on the VM itself.
I saw a few threads where people had similar issues with the qemu-guest-agent during fsfreeze/thaw, but my issue seems to happen both on VMs with the agent installed and on ones where it's not installed.
Has anyone seen this happen before? Is there anything I can try to induce the failure to aid debugging, or any tips for preventing it?
Details of my configuration below.
Cluster: 5 nodes
CPU: AMD EPYC 7282
Storage: Ceph (proxmox managed, librbd)
Network: openvswitch
Code:
INFO: Starting Backup of VM 135 (qemu)
INFO: Backup started at 2022-03-25 03:45:16
INFO: status = running
INFO: include disk 'scsi0' 'SSD:vm-135-disk-0'
INFO: backup mode: snapshot
INFO: bandwidth limit: 75000 KB/s
INFO: ionice priority: 7
INFO: snapshots found (not included into backup)
INFO: creating Proxmox Backup Server archive 'vm/135/2022-03-24T19:45:16Z'
INFO: skipping guest-agent 'fs-freeze', agent configured but not running?
INFO: enabling encryption
INFO: started backup task '7d80f5a8-6833-4883-a800-671a9b3b9d1b'
INFO: resuming VM again
INFO: scsi0: dirty-bitmap status: OK (1.8 GiB of 150.0 GiB dirty)
INFO: using fast incremental mode (dirty-bitmap), 1.8 GiB dirty of 150.0 GiB total
INFO: 11% (220.0 MiB of 1.8 GiB) in 3s, read: 73.3 MiB/s, write: 73.3 MiB/s
<snip>
INFO: 100% (1.8 GiB of 1.8 GiB) in 3m 13s, read: 1.3 MiB/s, write: 1.3 MiB/s
INFO: backup was done incrementally, reused 148.18 GiB (98%)
INFO: transferred 1.84 GiB in 238 seconds (7.9 MiB/s)
INFO: Finished Backup of VM 135 (00:04:08)
INFO: Backup finished at 2022-03-25 03:49:24
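In case it helps with debugging, here is a rough Python sketch for pulling the progress lines out of a backup task log like the one above and flagging samples where the read rate drops off (as it does here, from 73.3 MiB/s down to 1.3 MiB/s). This is not a Proxmox tool: the function names, the 5 MiB/s threshold, and the assumption that longer runs print elapsed time as "1h 2m 3s" are all mine.

```python
import re

# Matches progress lines such as:
#   INFO: 11% (220.0 MiB of 1.8 GiB) in 3s, read: 73.3 MiB/s, write: 73.3 MiB/s
# The optional hours/minutes groups are an assumption about the log format.
PROGRESS = re.compile(
    r"INFO:\s+(\d+)% \(.*?\) in (?:(\d+)h )?(?:(\d+)m )?(\d+)s, "
    r"read: ([\d.]+) (B|KiB|MiB|GiB)/s"
)

# Conversion factors to MiB/s.
UNIT_TO_MIB = {"B": 1 / (1024 * 1024), "KiB": 1 / 1024, "MiB": 1.0, "GiB": 1024.0}


def parse_progress(lines):
    """Return (percent, elapsed_seconds, read_mib_per_s) for each progress line."""
    samples = []
    for line in lines:
        match = PROGRESS.search(line)
        if match is None:
            continue  # not a progress line
        pct, hours, minutes, seconds, rate, unit = match.groups()
        elapsed = int(hours or 0) * 3600 + int(minutes or 0) * 60 + int(seconds)
        samples.append((int(pct), elapsed, float(rate) * UNIT_TO_MIB[unit]))
    return samples


def slow_samples(samples, threshold_mib_s=5.0):
    """Return the samples whose read rate is below the (hypothetical) threshold."""
    return [s for s in samples if s[2] < threshold_mib_s]


if __name__ == "__main__":
    log = [
        "INFO: 11% (220.0 MiB of 1.8 GiB) in 3s, read: 73.3 MiB/s, write: 73.3 MiB/s",
        "INFO: 100% (1.8 GiB of 1.8 GiB) in 3m 13s, read: 1.3 MiB/s, write: 1.3 MiB/s",
    ]
    for pct, elapsed, rate in slow_samples(parse_progress(log)):
        print(f"{pct}% at {elapsed}s: {rate:.1f} MiB/s")
```

Running this over the logs of VMs that hung versus ones that did not might show whether the freezes line up with slow stretches of the backup, but that correlation is only a guess on my part.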
Code:
root@vmh1:~# pveversion -v
proxmox-ve: 7.1-1 (running kernel: 5.13.19-2-pve)
pve-manager: 7.1-8 (running version: 7.1-8/5b267f33)
pve-kernel-helper: 7.1-6
pve-kernel-5.13: 7.1-5
pve-kernel-5.11: 7.0-10
pve-kernel-5.13.19-2-pve: 5.13.19-4
pve-kernel-5.11.22-7-pve: 5.11.22-12
pve-kernel-5.11.22-4-pve: 5.11.22-9
ceph: 16.2.7
ceph-fuse: 16.2.7
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.0
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-5
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-14
libpve-guest-common-perl: 4.0-3
libpve-http-server-perl: 4.0-4
libpve-storage-perl: 7.0-15
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.11-1
lxcfs: 4.0.11-pve1
novnc-pve: 1.3.0-1
openvswitch-switch: 2.15.0+ds1-2
proxmox-backup-client: 2.1.2-1
proxmox-backup-file-restore: 2.1.2-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-4
pve-cluster: 7.1-3
pve-container: 4.1-3
pve-docs: 7.1-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.3-4
pve-ha-manager: 3.3-1
pve-i18n: 2.6-2
pve-qemu-kvm: 6.1.0-3
pve-xtermjs: 4.12.0-1
qemu-server: 7.1-4
smartmontools: 7.2-1
spiceterm: 3.2-2
swtpm: 0.7.0~rc1+2
vncterm: 1.7-1
zfsutils-linux: 2.1.1-pve3