Hello,
We have started to have a problem with our cluster environment. For 3 days we have experienced freezing VMs during the PBS backup process.
What happens is that a number VMs suddenly go into a state where they timeout if we try to interract with them using the console & or any functionality (reboot, shutdown etc). The VM also stops answering ping and is basically "frozen".
Syslog on the host keeps displaying the following message for every VM that is in this error state
The only way to resolve this is to Stop the VM and start it again. After this the VM boots up just fine.
This problem is clearly triggered by the PBS backup run but it also escalated last night when I canceled the running PBS backup process. After this the VMs kept entering the frozen state again even though there was no backup run active. VMs stopped doing this after I moved them around in our cluster and made sure that no host had over 15 VMs running on it. Last night the problem triggered only on one hosts that had over 25 VMs running on it.
Background info on the cluster:
It's 6 Hosts
1143-vhost-c3h1 - 2x CPU E5-2660 v4 + 512GB RAM
1143-vhost-c3h2 - 2x CPU E5-2660 v4 + 512GB RAM
1143-vhost-c3h3 - 2x CPU E5-2660 v4 + 512GB RAM
R630-1 - 2x E5-2620 v4 + 128GB RAM
R630-2 - 2x E5-2620 v4 + 128GB RAM
R630-3 - 2x E5-2620 v4 + 128GB RAM
There is atleast 50% of free resources on every Host
Storage is CEPH on the R630 Hardware + Dell Compellent Iscsi LVM Mount on the 1143-hosts
Proxmox-Backup Server version is currently 1.0-9
The VMs that have been frozen are mixed bunch.
There are few Windows VMs but most of them are Linux based. Some of them have multiple disks and some are single disk ones.
Some have Qemu-Guest-Agent installed but some of them don't. There is really no connecting feature that the Frozen Vms have in common other than being on the same hosts with a lot of VMs.
I did some digging on the Forums & Google but wasn't able to find solution.
However following threads seem to indicate same type of problem:
https://forum.proxmox.com/threads/qmp-command-backup-failed-got-timeout.77749/
https://forum.proxmox.com/threads/error-when-finishing-task-of-backup.76379/
https://forum.proxmox.com/threads/error-vm-103-qmp-command-query-backup-failed-got-timeout.76554/
* What I have tried so far is to make sure that PBS doesn't verify the backup on creation.
* Updated to the latest version
* Used another host (c3h2 -> c3h1) to make sure this isn't a hardware failure
Any ideas what I should start looking at next?
We have started to have a problem with our cluster environment. For 3 days we have experienced freezing VMs during the PBS backup process.
What happens is that a number VMs suddenly go into a state where they timeout if we try to interract with them using the console & or any functionality (reboot, shutdown etc). The VM also stops answering ping and is basically "frozen".
Syslog on the host keeps displaying the following message for every VM that is in this error state
Bash:
Mar 17 00:43:16 1143-vhost-c3h1 pvestatd[2051]: VM 178 qmp command failed - VM 178 qmp command 'query-proxmox-support' failed - unable to connect to VM 178 qmp socket - timeout after 31 retries
The only way to resolve this is to Stop the VM and start it again. After this the VM boots up just fine.
This problem is clearly triggered by the PBS backup run but it also escalated last night when I canceled the running PBS backup process. After this the VMs kept entering the frozen state again even though there was no backup run active. VMs stopped doing this after I moved them around in our cluster and made sure that no host had over 15 VMs running on it. Last night the problem triggered only on one hosts that had over 25 VMs running on it.
Background info on the cluster:
Bash:
pveversion
pve-manager/6.3-6/2184247e (running kernel: 5.4.103-1-pve)
Package: pve-qemu-kvm
Version: 5.2.0-3
It's 6 Hosts
1143-vhost-c3h1 - 2x CPU E5-2660 v4 + 512GB RAM
1143-vhost-c3h2 - 2x CPU E5-2660 v4 + 512GB RAM
1143-vhost-c3h3 - 2x CPU E5-2660 v4 + 512GB RAM
R630-1 - 2x E5-2620 v4 + 128GB RAM
R630-2 - 2x E5-2620 v4 + 128GB RAM
R630-3 - 2x E5-2620 v4 + 128GB RAM
There is atleast 50% of free resources on every Host
Storage is CEPH on the R630 Hardware + Dell Compellent Iscsi LVM Mount on the 1143-hosts
Proxmox-Backup Server version is currently 1.0-9
The VMs that have been frozen are mixed bunch.
There are few Windows VMs but most of them are Linux based. Some of them have multiple disks and some are single disk ones.
Some have Qemu-Guest-Agent installed but some of them don't. There is really no connecting feature that the Frozen Vms have in common other than being on the same hosts with a lot of VMs.
I did some digging on the Forums & Google but wasn't able to find solution.
However following threads seem to indicate same type of problem:
https://forum.proxmox.com/threads/qmp-command-backup-failed-got-timeout.77749/
https://forum.proxmox.com/threads/error-when-finishing-task-of-backup.76379/
https://forum.proxmox.com/threads/error-vm-103-qmp-command-query-backup-failed-got-timeout.76554/
* What I have tried so far is to make sure that PBS doesn't verify the backup on creation.
* Updated to the latest version
* Used another host (c3h2 -> c3h1) to make sure this isn't a hardware failure
Any ideas what I should start looking at next?