VMs crashing/freezing on a Cluster during PBS backup operation

VTeronen

Member
Aug 5, 2019
4
0
6
36
Hello,

We have started to have a problem with our cluster environment. For 3 days we have experienced freezing VMs during the PBS backup process.
What happens is that a number VMs suddenly go into a state where they timeout if we try to interract with them using the console & or any functionality (reboot, shutdown etc). The VM also stops answering ping and is basically "frozen".

Syslog on the host keeps displaying the following message for every VM that is in this error state

Bash:
Mar 17 00:43:16 1143-vhost-c3h1 pvestatd[2051]: VM 178 qmp command failed - VM 178 qmp command 'query-proxmox-support' failed - unable to connect to VM 178 qmp socket - timeout after 31 retries

The only way to resolve this is to Stop the VM and start it again. After this the VM boots up just fine.

This problem is clearly triggered by the PBS backup run but it also escalated last night when I canceled the running PBS backup process. After this the VMs kept entering the frozen state again even though there was no backup run active. VMs stopped doing this after I moved them around in our cluster and made sure that no host had over 15 VMs running on it. Last night the problem triggered only on one hosts that had over 25 VMs running on it.

Background info on the cluster:

Bash:
pveversion
pve-manager/6.3-6/2184247e (running kernel: 5.4.103-1-pve)

Package: pve-qemu-kvm
Version: 5.2.0-3

It's 6 Hosts
1143-vhost-c3h1 - 2x CPU E5-2660 v4 + 512GB RAM
1143-vhost-c3h2 - 2x CPU E5-2660 v4 + 512GB RAM
1143-vhost-c3h3 - 2x CPU E5-2660 v4 + 512GB RAM

R630-1 - 2x E5-2620 v4 + 128GB RAM
R630-2 - 2x E5-2620 v4 + 128GB RAM
R630-3 - 2x E5-2620 v4 + 128GB RAM

There is atleast 50% of free resources on every Host

Storage is CEPH on the R630 Hardware + Dell Compellent Iscsi LVM Mount on the 1143-hosts

Proxmox-Backup Server version is currently 1.0-9

The VMs that have been frozen are mixed bunch.
There are few Windows VMs but most of them are Linux based. Some of them have multiple disks and some are single disk ones.
Some have Qemu-Guest-Agent installed but some of them don't. There is really no connecting feature that the Frozen Vms have in common other than being on the same hosts with a lot of VMs.

I did some digging on the Forums & Google but wasn't able to find solution.
However following threads seem to indicate same type of problem:

https://forum.proxmox.com/threads/qmp-command-backup-failed-got-timeout.77749/
https://forum.proxmox.com/threads/error-when-finishing-task-of-backup.76379/
https://forum.proxmox.com/threads/error-vm-103-qmp-command-query-backup-failed-got-timeout.76554/

* What I have tried so far is to make sure that PBS doesn't verify the backup on creation.
* Updated to the latest version
* Used another host (c3h2 -> c3h1) to make sure this isn't a hardware failure

Any ideas what I should start looking at next?
 
Hi,
Thank you for a swift response. I can certainly contribute the backtrace if/when we are seeing a freeze.

Looking at the other threads, it would seem that it's not always tied to the backup process but can happen "randomly" ?
 
the culprit seems to be somewhere in the monitor code (the interface used to "talk" to the VM and issue commands). this interface is heavily used for backups, but also for migration, device hotplug, power management, ... until we know more we can only speculate what triggers it.
 

About

The Proxmox community has been around for many years and offers help and support for Proxmox VE, Proxmox Backup Server, and Proxmox Mail Gateway.
We think our community is one of the best thanks to people like you!

Get your subscription!

The Proxmox team works very hard to make sure you are running the best software and getting stable updates and security enhancements, as well as quick enterprise support. Tens of thousands of happy customers have a Proxmox subscription. Get yours easily in our online shop.

Buy now!