[SOLVED] PVE v7 / PBS v2.1 - backup qmp timeouts

elimus_

P.S. Not sure if this should be in the PVE section or the PBS section.

Any ideas/suggestions on what I could pursue next for debugging this issue?

Summary

With v7 I'm starting to get periodic, or sometimes constant, issues with PVE/PBS backups.
Backup tasks throw errors about backup timeouts on certain VMs.

On earlier v7.x PVE versions it was: start failed: org.freedesktop.systemd1.UnitExists: Unit <VMid>.scope already exists.
Now with the latest ones: VM <VMid> qmp command 'query-pbs-bitmap-info' failed - got timeout

For some VMs the problem is constant or almost constant; for others it occurs randomly. The VMs themselves are backed up
once a week, so I end up in a scenario where I either have a backup every X weeks or, in the worst case, no backups for a few weeks.

The problem does not seem to be related to node load or network load, as I can reproduce it
for some VMs on the spot with very light load on the systems and practically no load on the storage-related
networks.
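
As a side note for anyone else chasing this: the VM's QMP socket can be poked directly to see whether QEMU answers at all outside of a backup run. A rough sketch (VMID 121 is just an example; /var/run/qemu-server/<vmid>.qmp is the default socket path):
Code:
    # the qmp_capabilities handshake is mandatory before any other command
    echo '{"execute":"qmp_capabilities"} {"execute":"query-status"}' \
        | socat - UNIX-CONNECT:/var/run/qemu-server/121.qmp
    # a healthy VM answers within milliseconds; a hang here points at the QEMU
    # main loop being busy rather than at the PVE/PBS tooling itself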

Environment

Markdown (GitHub flavored):
Production:
  * PVE cluster:
    * Was: 7.1-6ish
    * Is: 7.1-12 (not yet updated to the latest v7.2 release)
  * CEPH cluster 0:
    * PVE: 5.4-15
    * CEPH: 12.2.13
  * CEPH cluster 1:
    * PVE: 7.1-12 (not yet updated to the latest v7.2 release)
    * CEPH: 16.2.7
  * PBS: 2.1-6

Staging:
  * PVE cluster:
    * Was: 7.1-6 -> 7.1-12
    * Is: 7.2-3
  * Ceph cluster:
    * PVE: 5.4-15
    * CEPH: 12.2.13
  * PBS: 2.1-6

Network:
  * Data: separate 10G LACPd bonding under a bridge
  * Storage: separate 10G LACPd bonding
  * Backups: separate 10G interface under a bridge
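
For completeness, each bond+bridge pair is a fairly standard ifupdown2 setup, roughly like this (simplified sketch, interface names are placeholders):
Code:
    auto bond0
    iface bond0 inet manual
        bond-slaves enp65s0f0 enp65s0f1
        bond-miimon 100
        bond-mode 802.3ad

    auto vmbr0
    iface vmbr0 inet manual
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0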

Some experimentation to look for a solution

So I tried debugging on a few VMs on which I had issues in both the production and staging environments. These specific ones are clones, so to speak: I had them
restored from backup onto the staging cluster when I was creating a dedicated staging environment for us to work in.

Markdown (GitHub flavored):
## Ubuntu VM

* Was: 18.04
  * Backup timeouts with "qmp command 'query-pbs-bitmap-info' failed - got timeout"
* Upgrade to: 22.04
  * Issue seems fixed?

## Debian VM

1. Tried messing with guest OS.
  * Was `10`
  * Upgrade to `11`
    * Didn't help, same error.
  * Upgrade to `12`
    * Didn't help, same error.
2. Found a suggestion in [post](https://forum.proxmox.com/threads/pve7-pbs2-backup-timeout-qmp-command-cont-failed-got-timeout.95212/post-426261) to try
   * change line `134` of `/usr/share/perl5/PVE/QMPClient.pm`

      ```diff
            } else {
      -           $timeout = 3; # default
      +           $timeout = 8; # default
      ```

    * Restart the pve daemons

      ```bash
      for service in pvedaemon.service pveproxy.service pvestatd.service ;do
           echo "systemctl restart $service"
           systemctl restart $service
      done
      ```

* Results
  * With 3s timeout (default)
    * Backup timeouts with "qmp command 'query-pbs-bitmap-info' failed - got timeout"
  * With 8s timeout
    * Backup timeouts with "qmp command 'query-pbs-bitmap-info' failed - got timeout"
  * With 30s timeout:
    * VM does not start up correctly; it sits in some weird half-started state for the whole backup time. No console output, no QEMU agent.
    * VM CPU usage sits around 10% for the whole backup creation time
    * Backup read performance is also severely degraded, to about 1/5, maybe 2/5, of the typical read speed.
    * After the backup is done, the VM starts
    * Tested the backup and it looks fine (restores without problems and no errors in fs checks)

Here are some typical messages that I see in the report emails about these VMs:

Markdown (GitHub flavored):
| VMid |     Name     | Status |   Time   |                                      Message                                      |
| ---- | ------------ | ------ | -------- | --------------------------------------------------------------------------------- |
| 107  | (niceVMname) | FAILED | 00:00:10 | VM 107 qmp command 'query-pbs-bitmap-info' failed - got timeout                   |
| 116  | (niceVMname) | FAILED | 00:00:29 | VM 116 qmp command 'query-pbs-bitmap-info' failed - got timeout                   |
| 116  | (niceVMname) | FAILED | 00:00:31 | VM 116 qmp command 'query-pbs-bitmap-info' failed - got timeout                   |
| 116  | (niceVMname) | FAILED | 00:00:33 | VM 116 qmp command 'query-pbs-bitmap-info' failed - got timeout                   |
| 121  | (niceVMname) | FAILED | 00:00:26 | VM 121 qmp command 'query-pbs-bitmap-info' failed - got timeout                   |
| 121  | (niceVMname) | FAILED | 00:00:26 | VM 121 qmp command 'query-pbs-bitmap-info' failed - got timeout                   |
| 121  | (niceVMname) | FAILED | 00:00:28 | VM 121 qmp command 'query-pbs-bitmap-info' failed - got timeout                   |
| 125  | (niceVMname) | FAILED | 00:00:37 | VM 125 qmp command 'query-pbs-bitmap-info' failed - got timeout                   |
| 140  | (niceVMname) | FAILED | 00:00:08 | VM 140 qmp command 'human-monitor-command' failed - got timeout                   |
| 4100 | (niceVMname) | FAILED | 00:01:42 | VM 4100 qmp command 'query-pbs-bitmap-info' failed - got timeout                  |
| 4132 | (niceVMname) | FAILED | 00:00:11 | VM 4132 qmp command 'query-pbs-bitmap-info' failed - got timeout                  |
| 4149 | (niceVMname) | FAILED | 00:00:18 | VM 4149 qmp command 'query-pbs-bitmap-info' failed - got timeout                  |
| 4149 | (niceVMname) | FAILED | 00:00:18 | VM 4149 qmp command 'query-pbs-bitmap-info' failed - got timeout                  |

Regards,
Krisjanis
 
Hi,

How many nodes, and what are the (rough) specs of those systems?

How does the basic resource pressure stall information look during such events?
head /proc/pressure/*

Can you post an affected VM config? Also, if not tried already, can you switch the disks of such an affected VM to using IO Threads and the SCSI controller to virtio-scsi-single - this could allow the QEMU main thread to have more time for working on QMP as IO processing won't happen anymore in the same thread.
 
Hello Thomas,

Nodes:

Code:
Staging - 3 nodes of both PVE and PVE CEPH
Production - 12 PVE nodes, CEPH0 5 nodes, CEPH1 3 nodes

Hardware:

Code:
PVE / CEPH nodes: both Staging/Production area
  CPUs: 1-2 AMD EPYC 7xx1/7xx2 series CPUs
  RAM: 128G-256G

  CEPH disk/osd:
    Staging: 18 OSDs, 6 HDDs per node + Optane for bluestore
    Production:
      cluster0: 60 OSDs, 6 sata SSD / 6 sata HDD per node
      cluster1: 24 OSDs, 8 nvme SSD per node


How does the basic resource pressure stall information look during such events?
head /proc/pressure/*

With 3s timeout (default) - no load here, as the timeout happens within seconds:
Code:
    ==> /proc/pressure/cpu <==
    some avg10=0.00 avg60=0.00 avg300=0.00 total=2176927069
    full avg10=0.00 avg60=0.00 avg300=0.00 total=2165692196
    ==> /proc/pressure/io <==
    some avg10=0.00 avg60=0.00 avg300=0.00 total=141624665
    full avg10=0.00 avg60=0.00 avg300=0.00 total=140097875
    ==> /proc/pressure/memory <==
    some avg10=0.00 avg60=0.00 avg300=0.00 total=0
    full avg10=0.00 avg60=0.00 avg300=0.00 total=0

With 30s timeout:
Code:
    ==> /proc/pressure/cpu <==
    some avg10=0.00 avg60=0.01 avg300=0.00 total=2179140327
    full avg10=0.00 avg60=0.01 avg300=0.00 total=2167888693
    ==> /proc/pressure/io <==
    some avg10=0.00 avg60=0.00 avg300=0.00 total=141633349
    full avg10=0.00 avg60=0.00 avg300=0.00 total=140106123
    ==> /proc/pressure/memory <==
    some avg10=0.00 avg60=0.00 avg300=0.00 total=0
    full avg10=0.00 avg60=0.00 avg300=0.00 total=0
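
If needed, I can also try to catch the values at the exact moment a timeout fires by sampling in a loop during the backup window, something like:
Code:
    # sample pressure stall info every 2 seconds while the backup job runs
    while sleep 2; do
        echo "--- $(date -Is)"
        head /proc/pressure/cpu /proc/pressure/io /proc/pressure/memory
    done | tee /tmp/psi-during-backup.log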

Can you post an affected VM config?

From a VM that constantly fails backup on both clusters (before the virtio-scsi-single/iothread test):
Code:
  agent: 1
  balloon: 4000
  boot: cd
  bootdisk: scsi0
  cores: 4
  cpu: host
  memory: 5000
  name: (niceVMname)
  net0: virtio=9A:30:36:D9:D3:29,bridge=vmbr0,tag=3451
  numa: 0
  onboot: 1
  ostype: l26
  scsi0: ceph-hdds:vm-121-disk-0,size=100G
  scsi1: ceph-hdds:vm-121-disk-1,size=2000G
  scsihw: virtio-scsi-pci
  smbios1: uuid=2983fed4-d003-4385-bb99-60715e2e8057
  sockets: 1

Also, if not tried already, can you switch the disks of such an affected VM to using IO Threads
and the SCSI controller to virtio-scsi-single - this could allow the QEMU main thread to have more
time for working on QMP as IO processing won't happen anymore in the same thread.

I hadn't tried to switch to this configuration yet. After the change on that problematic VM, the backup
no longer throws the qmp error. The backup starts, but with the same weird behavior as when increasing the
timeout in /usr/share/perl5/PVE/QMPClient.pm. One thing is different though: the VM CPU load now sits closer to
25% during this backup test run.

Code:
  ==> /proc/pressure/cpu <==
  some avg10=0.00 avg60=0.00 avg300=0.00 total=2189059738
  full avg10=0.00 avg60=0.00 avg300=0.00 total=2177728730
  ==> /proc/pressure/io <==
  some avg10=0.00 avg60=0.00 avg300=0.00 total=141688697
  full avg10=0.00 avg60=0.00 avg300=0.00 total=140157559
  ==> /proc/pressure/memory <==
  some avg10=0.00 avg60=0.00 avg300=0.00 total=0
  full avg10=0.00 avg60=0.00 avg300=0.00 total=0
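
For reference, the switch on that VM boils down to something like this with the qm CLI (a rough sketch; disk names taken from the 121 config above, and the controller/iothread change only takes effect after a full stop/start of the VM):
Code:
    qm set 121 --scsihw virtio-scsi-single
    qm set 121 --scsi0 ceph-hdds:vm-121-disk-0,iothread=1
    qm set 121 --scsi1 ceph-hdds:vm-121-disk-1,iothread=1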

Regards,
Krisjanis
 
It looks like
Code:
pve-qemu-kvm=6.2.0-8
seems to solve the problem.

Best to follow the previously mentioned topic.
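
For anyone landing here later: once that build is available in the configured repository, it can be pulled in explicitly (note that running VMs only pick up the new QEMU binary after a full stop/start or a migration):
Code:
    apt update
    apt install pve-qemu-kvm=6.2.0-8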
 
