PS: Not sure if this should be in the PVE section or the PBS section.
Any ideas or suggestions on what I could pursue next to debug this issue?
Summary
Since v7 I have been getting periodic, and sometimes constant, issues with PVE/PBS backups: backup tasks throw timeout errors for certain VMs.

On earlier 7.x PVE versions the error was:
`start failed: org.freedesktop.systemd1.UnitExists: Unit <VMid>.scope already exists.`

Now, with the latest versions, it is:
`VM <VMid> qmp command 'query-pbs-bitmap-info' failed - got timeout`
For some VMs the problem is constant or almost constant; for others it occurs randomly. The VMs themselves are backed up once a week, so I end up either with a backup only every X weeks or, in the worst case, with no backups for a few weeks.

The problem does not seem to be related to node load or network load, as I can reproduce it on the spot for some VMs with very light load on the systems and practically no load on the storage-related networks.
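For reference, the failing runs can be fished out of the stored task logs with something like the sketch below (this assumes the default PVE task-log location under `/var/log/pve/tasks/`; the grep pattern is just part of the error text quoted above):

```bash
# Search the stored vzdump task logs for the timeout errors quoted above
# (assumes the default PVE task log directory /var/log/pve/tasks).
grep -r --include='*vzdump*' 'failed - got timeout' /var/log/pve/tasks/ | tail -n 20
```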
Environment
Production:
* PVE cluster:
  * Was: 7.1-6ish
  * Is: 7.1-12 (not yet updated to the latest 7.2 release)
* CEPH cluster 0:
  * PVE: 5.4-15
  * CEPH: 12.2.13
* CEPH cluster 1:
  * PVE: 7.1-12 (not yet updated to the latest 7.2 release)
  * CEPH: 16.2.7
* PBS: 2.1-6

Staging:
* PVE cluster:
  * Was: 7.1-6 -> 7.1-12
  * Is: 7.2-3
* Ceph cluster:
  * PVE: 5.4-15
  * CEPH: 12.2.13
* PBS: 2.1-6

Network:
* Data: separate 10G LACP bond under a bridge
* Storage: separate 10G LACP bond
* Backups: separate 10G interface under a bridge
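(For completeness: the version numbers above can be re-checked with the standard tooling, roughly as in the sketch below; each command is run on the respective PVE, Ceph or PBS node.)

```bash
pveversion -v                     # on a PVE node: pve-manager, kernel, qemu-server, etc.
ceph versions                     # on a Ceph node: running daemon versions per component
proxmox-backup-manager versions   # on the PBS host: proxmox-backup-server packages
```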
Some experimentation to look for a solution
So I tried debugging a few VMs that have this issue in both the production and staging environments. These specific ones are clones, so to speak: I restored them from backup onto the staging cluster when I was creating a dedicated staging environment for us to work in.
## Ubuntu VM
* Was: 18.04
  * Backup timeouts with "qmp command 'query-pbs-bitmap-info' failed - got timeout"
* Upgraded to: 22.04
  * Issue seems fixed?
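To double-check the "seems fixed" part without waiting for the weekly job, the VM can be backed up on demand; a minimal sketch (the VMID and the storage ID `pbs-store` are placeholders, not the real names):

```bash
# One-off backup of a single VM straight to the PBS-backed storage
# ("pbs-store" and the VMID are placeholders for the real values).
VMID=116
vzdump "$VMID" --storage pbs-store --mode snapshot
```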
## Debian VM
1. Tried messing with the guest OS:
* Was: `10`
* Upgraded to `11`
  * Didn't help, same error.
* Upgraded to `12`
  * Didn't help, same error.
2. Found a suggestion in this [post](https://forum.proxmox.com/threads/pve7-pbs2-backup-timeout-qmp-command-cont-failed-got-timeout.95212/post-426261) to:
* Change line `134` of `/usr/share/perl5/PVE/QMPClient.pm` (a sed one-liner for this is sketched after the results below)
```pm
# original (line 134):
} else {
    $timeout = 3; # default

# changed to:
} else {
    $timeout = 8; # default
```
* Restart the pve daemons
```bash
for service in pvedaemon.service pveproxy.service pvestatd.service; do
    echo "systemctl restart $service"
    systemctl restart "$service"
done
```
* Results:
  * With the 3s timeout (default):
    * Backup times out with "qmp command 'query-pbs-bitmap-info' failed - got timeout"
  * With an 8s timeout:
    * Backup times out with "qmp command 'query-pbs-bitmap-info' failed - got timeout"
  * With a 30s timeout:
    * The VM does not start up correctly; it sits in some weird half-started state for the whole backup. No console output, no QEMU agent.
    * VM CPU usage sits at around 10% for the whole backup creation time.
    * Backup read performance is also severely degraded, roughly 1/5 to 2/5 of the typical read speed.
    * After the backup is done, the VM starts.
    * Tested the backup and it looks fine (it restores without problems and fs checks show no errors).
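The sed one-liner mentioned above is only a sketch: it assumes line 134 still reads exactly as quoted, so check the file first and keep a backup (the change also gets overwritten again by the next package update).

```bash
# Hypothetical helper to bump the default QMP timeout from 3s to 8s.
# Assumes /usr/share/perl5/PVE/QMPClient.pm still contains the exact line quoted above.
FILE=/usr/share/perl5/PVE/QMPClient.pm
cp -a "$FILE" "$FILE.bak"
sed -i 's/\$timeout = 3; # default/$timeout = 8; # default/' "$FILE"
# Revert with: cp -a "$FILE.bak" "$FILE" (then restart the daemons as above)
```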
Here are some typical messages that I see in the report emails about these VMs:
| VMid | Name | Status | Time | Message |
| ---- | ------------ | ------ | -------- | --------------------------------------------------------------------------------- |
| 107 | (niceVMname) | FAILED | 00:00:10 | VM 107 qmp command 'query-pbs-bitmap-info' failed - got timeout |
| 116 | (niceVMname) | FAILED | 00:00:29 | VM 116 qmp command 'query-pbs-bitmap-info' failed - got timeout |
| 116 | (niceVMname) | FAILED | 00:00:31 | VM 116 qmp command 'query-pbs-bitmap-info' failed - got timeout |
| 116 | (niceVMname) | FAILED | 00:00:33 | VM 116 qmp command 'query-pbs-bitmap-info' failed - got timeout |
| 121 | (niceVMname) | FAILED | 00:00:26 | VM 121 qmp command 'query-pbs-bitmap-info' failed - got timeout |
| 121 | (niceVMname) | FAILED | 00:00:26 | VM 121 qmp command 'query-pbs-bitmap-info' failed - got timeout |
| 121 | (niceVMname) | FAILED | 00:00:28 | VM 121 qmp command 'query-pbs-bitmap-info' failed - got timeout |
| 125 | (niceVMname) | FAILED | 00:00:37 | VM 125 qmp command 'query-pbs-bitmap-info' failed - got timeout |
| 140 | (niceVMname) | FAILED | 00:00:08 | VM 140 qmp command 'human-monitor-command' failed - got timeout |
| 4100 | (niceVMname) | FAILED | 00:01:42 | VM 4100 qmp command 'query-pbs-bitmap-info' failed - got timeout |
| 4132 | (niceVMname) | FAILED | 00:00:11 | VM 4132 qmp command 'query-pbs-bitmap-info' failed - got timeout |
| 4149 | (niceVMname) | FAILED | 00:00:18 | VM 4149 qmp command 'query-pbs-bitmap-info' failed - got timeout |
| 4149 | (niceVMname) | FAILED | 00:00:18 | VM 4149 qmp command 'query-pbs-bitmap-info' failed - got timeout |
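A further check that could help narrow this down (a hedged sketch, not something from the reports above): timing a QMP-backed query against an affected VM outside the backup window. `qm status --verbose` talks to the same QMP socket (`/var/run/qemu-server/<VMid>.qmp`) that vzdump uses, so a reply that is consistently slow or hangs would point at that guest's QEMU process rather than at PBS or the network.

```bash
# Hedged debugging sketch: time a QMP-backed query against an affected VM.
# The VMID is a placeholder (116 is one of the IDs from the table above).
VMID=116
time qm status "$VMID" --verbose
```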
Regards,
Krisjanis