PBS Datacenter Crashes

iprigger

Hi All,

After some successful tests with PBS (it actually ran flawlessly from day one), I have encountered massive problems over the last two weeks, up to the point where my complete datacenter more or less went down...

In short: if I run backups to PBS, almost every node in the Proxmox VE infrastructure is prone to rebooting...

* pvestatd, pvedaemon and pveproxy hang on all nodes.
* Random VMs hang, consuming 100% CPU on one core (worst of all: they are still pingable...)

Bottom line: at the moment it's unusable... unfortunately.

I have PBS running as a VM (not backing this one up via PBS, though). The data storage is on an NFS share... the network connection is 10GbE throughout...

PVE:
proxmox-ve: 6.2-1 (running kernel: 5.4.60-1-pve)
pve-manager: 6.2-11 (running version: 6.2-11/22fb4983)
pve-kernel-5.4: 6.2-6
pve-kernel-helper: 6.2-6
pve-kernel-5.3: 6.1-6
pve-kernel-5.4.60-1-pve: 5.4.60-2
pve-kernel-5.4.55-1-pve: 5.4.55-1
pve-kernel-5.4.44-2-pve: 5.4.44-2
pve-kernel-4.15: 5.4-8
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-4.15.18-20-pve: 4.15.18-46
pve-kernel-4.15.18-12-pve: 4.15.18-36
ceph-fuse: 12.2.13-pve1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve2
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.5
libpve-access-control: 6.1-2
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.2-1
libpve-guest-common-perl: 3.1-3
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.2-6
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-12
pve-cluster: 6.1-8
pve-container: 3.1-13
pve-docs: 6.2-5
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-2
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-1
pve-qemu-kvm: 5.0.0-13
pve-xtermjs: 4.7.0-2
qemu-server: 6.2-14
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.4-pve1


PBS is on release:
ii proxmox-backup 1.0-4 all Proxmox Backup Server metapackage
ii proxmox-backup-client 0.8.17-1 amd64 Proxmox Backup Client tools
ii proxmox-backup-docs 0.8.17-1 all Proxmox Backup Documentation
ii proxmox-backup-server 0.8.17-1 amd64 Proxmox Backup Server daemon with tools and GUI
ii proxmox-mini-journalreader 1.1-1 amd64 Minimal systemd Journal Reader
ii proxmox-widget-toolkit 2.2-12 all ExtJS Helper Classes for Proxmox

Any idea what's the cause here?

Also... most notably affected are Linux systems with qemu-guest-agent installed... systems without the guest agent seem to suffer less (and Windows systems haven't been affected at all so far...)
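
One thing I might try as a test is temporarily disabling the guest agent option on one affected VM, something like this (just a sketch; 762 is only an example VMID, and the change only takes effect after the VM is restarted):
Bash:
# disable the QEMU guest agent option for one affected VM as a test
qm set 762 --agent 0
# re-enable it afterwards
qm set 762 --agent 1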

Still, it's an awkward situation, because even with test runs I have massive problems...

Any idea?

Thanks!

Tobias
 
Do you run backups on multiple hosts in parallel? That could lead to saturation of resources such as the network, the shared VM storage, or the backup target storage if it uses the same shared storage as the other VMs.
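
If bandwidth is the problem, one mitigation could be capping vzdump's bandwidth per node (a sketch; the value is just an example in KiB/s, adjust it to your links and storage):
Bash:
# /etc/vzdump.conf - cap backup bandwidth at ~200 MiB/s (value in KiB/s)
echo 'bwlimit: 204800' >> /etc/vzdump.conf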

Do you have more than one physical corosync link?

If nodes are rebooting, it may be because of fencing due to missed heartbeats and cluster synchronization problems, which in turn may be caused by saturated physical corosync links.
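
You can check the corosync link and quorum state on each node with something like:
Bash:
# state of the configured corosync/knet links on this node
corosync-cfgtool -s
# quorum and cluster membership overview
pvecm status
# look for link-down or retransmit messages around the backup window
journalctl -u corosync --since "1 hour ago"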

I hope you can figure it out.
 
Hi!


Yes, I run backups on all systems at the same time; good point... I will try starting just one host today... nevertheless, it was working like a charm before...

I have two physical corosync links... so that should be OK.

Tobias
 
Hi,


I just tried single-server backups - they seem to work a bit better. Nevertheless, I often get errors like this (this one is a Windows machine):

INFO: 99% (65.2 GiB of 65.8 GiB) in 30m 51s, read: 44.0 MiB/s, write: 27.8 MiB/s
ERROR: VM 762 qmp command 'query-backup' failed - got timeout
INFO: aborting backup job
ERROR: VM 762 qmp command 'backup-cancel' failed - unable to connect to VM 762 qmp socket - timeout after 5979 retries
ERROR: Backup of VM 762 failed - VM 762 qmp command 'query-backup' failed - got timeout
INFO: Failed at 2020-09-25 06:25:13

Tobias
 
Yes, this problem has been here since, let's say, two versions ago. Before that, I didn't have this problem.

Can you share details about the storage used (source and target), and whether this happens on all VMs or just some?
 
Well, the storage is OMV with Proxmox PBS installed, 4x4TB in RAIDZ1, and it happens when I run the scheduled backup, which runs from multiple PVE servers (2-4) simultaneously. The network is 1 Gbps. I believe it all worked up until one or two versions ago.
 
Hm, I guess it is better now with .21, but I still got one more failed backup (against 3 in the two versions before).
 
4x4TB in RAIDZ1

Which drives exactly? Roughly how much speed do you get during backup (low, max, average)?

We found a few hard-to-reproduce cases where a relatively slow backup target could make the VM hang for a few seconds to a few minutes towards the end; we are currently investigating that. Often the hang is shorter than the QMP timeout, so one may not even notice it, and in that case the backup also finishes and the VM continues to run fine afterwards.
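
If you want rough numbers, recent proxmox-backup-client versions include a benchmark subcommand (the repository string below is just an example; replace user, host, and datastore with your own):
Bash:
# measure TLS/chunk/compression throughput against a PBS datastore
proxmox-backup-client benchmark --repository backup@pbs@pbs.example.com:store1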
 
IronWolf 4TB, so 5400 rpm. I don't know about the speeds, but the culprit could be some PBS-set timeout for the datastore.
 
A new pve-qemu-kvm package with version 5.1.0-3 has just become available on the pvetest repository.

It includes a fix for the hanging VMs and QMP timeouts at the end of backup jobs.

If you can, please update and either do a fresh start of the VM (a reboot over the API works too) or live-migrate it to an already updated node, so that it uses the new KVM/QEMU executable. The server should then also be updated again, to actually test the fix.
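
For the restart or migration step, something along these lines should work (VMID and node name are just examples):
Bash:
# fresh start via the API/CLI so the VM picks up the new QEMU binary
qm reboot 762
# ...or live-migrate the VM to an already updated node
qm migrate 762 pve-node2 --online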
 
See: https://pve.proxmox.com/wiki/Package_Repositories#sysadmin_test_repo for details on the pvetest repository.

If you want to only upgrade the pve-qemu package you can do something like:
Bash:
# add pvetest repo
echo 'deb http://download.proxmox.com/debian/pve buster pvetest' > /etc/apt/sources.list.d/pvetest.list
apt update
# upgrade only the QEMU-related packages
apt install pve-qemu-kvm qemu-server
# remove pvetest again to avoid updating other packages
rm /etc/apt/sources.list.d/pvetest.list
apt update
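
To check which QEMU binary a VM is actually running after the restart or migration, the verbose status output can help (762 is again just an example VMID):
Bash:
# should report the new version (e.g. running-qemu: 5.1.0)
qm status 762 --verbose | grep running-qemu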
 
Great news, thanks! I was starting to tear apart my complete storage infrastructure...

One thing I have noticed as well is that sometimes, after successful backups, VMs that used to be powered off are booted up...

Tobias
 
