Backups getting hung with high I/O wait

gouthamravee

I have PBS set up as a VM with the specs below:
Guest:
CPU - 2 cores
RAM - 6GB
Bios - OVMF
Machine - q35
Controller - VirtIO SCSI Single
OS Drive - 32GB QCOW SSD Discard on
Backup Drive - WD 4TB RE 7200RPM HDD, passed through to the VM via SCSI with iothread=1
PBS version - 2.3-3, Linux 5.15.85-1-pve #1 SMP PVE 5.15.85-1 (2023-02-01T00:00Z)

The host is:
CPU - 8 Core E3-1275 V2
RAM - 32GB
Proxmox has its own dedicated SSD while the VMs use another SSD.

I have the PBS storage mounted on all my hosts.
I've done an iperf3 test from PBS to every host and VM; speeds saturate the 1 Gbps link between all of them, as expected.
I've also done a naive drive test on the backup drive in PBS: the script wrote a thousand or so 1MB files to the drive without issues.
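For context, those tests could look something like this (sketch only, not the exact commands/script):
Bash:
# network check: "iperf3 -s" running on PBS, then from each PVE host/VM:
iperf3 -c pbs.gravee.com -t 30

# naive write test on the datastore disk: ~1000 x 1MB files
mkdir -p /mnt/datastore/backup/write-test
for i in $(seq 1 1000); do
    dd if=/dev/urandom of=/mnt/datastore/backup/write-test/file_$i bs=1M count=1 conv=fsync status=none
done
sync
rm -r /mnt/datastore/backup/write-test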

This setup was running perfectly fine for about a year, but now every backup freezes at some point, and the PBS guest shows 100% CPU usage and 100% I/O wait.
I've seen threads saying a single HDD isn't fast enough for PBS, but I've always used a single HDD like this; at one point it was even a USB external drive. Did something change recently so that PBS needs faster storage? I don't mind swapping over to an SSD, but that's going to suck space-wise.

All the guest VMs' virtual drives are qcow2.

The host with PBS is also running a BlueIris VM with its own three HDDs passed through, and that VM is working fine.

I did notice the host was low on RAM, using 30 out of 32GB, so I turned off some of the VMs and that seemed to solve the issue temporarily. Maybe it can back up small guests, but soon after the problems showed up again.
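For reference, host memory pressure can be watched with something like this on the PVE host while a backup runs (just a sketch):
Bash:
# overall memory and swap usage on the host
free -h

# how many pages KSM is currently sharing between VMs (if KSM is enabled)
cat /sys/kernel/mm/ksm/pages_sharing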

Unfortunately I can't find anything else that would help me pin down the issue. During these issues the VM console shows messages like
INFO: task jbd2/sdb1-8:483 blocked for more than 120 seconds.

sdb1 is the backup drive partition.
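For what it's worth, a few checks that can be run inside the PBS guest while it's hung, to narrow down where the stall is (sketch only; iostat comes from the sysstat package):
Bash:
# per-device utilization and wait times, refreshed every 2 seconds
iostat -x sdb 2

# kernel messages about hung tasks like the jbd2 one above
dmesg | grep -i "blocked for more than"

# processes stuck in uninterruptible sleep (D state)
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /^D/'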

output of
Bash:
df -h

Bash:
Filesystem            Size  Used Avail Use% Mounted on
udev                  2.9G     0  2.9G   0% /dev
tmpfs                 590M  856K  590M   1% /run
/dev/mapper/pbs-root   29G  3.3G   25G  12% /
tmpfs                 2.9G  164K  2.9G   1% /dev/shm
tmpfs                 5.0M     0  5.0M   0% /run/lock
/dev/sda2             511M  336K  511M   1% /boot/efi
/dev/sdb1             3.6T  338G  3.1T  10% /mnt/datastore/backup
tmpfs                 590M     0  590M   0% /run/user/0
 
Please also post:

PVE:
Code:
qm config <vmid>  # from PBS and one large guest
cat /etc/pve/storage.cfg
journalctl -u pvedaemon.service  # something interesting?

PBS:
Code:
cat /etc/proxmox-backup/datastore.cfg
mount | grep -e mapper -e mnt

Since you are using an unrecommended setup (PBS as a VM, a single HDD as the backup drive), it can't perform at its best. I'm also not sure about the 6GB of assigned RAM; AFAIK that could be suboptimal, especially if ZFS were used on the backup drive.

Checking the S.M.A.R.T. values can't hurt either.
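E.g. on the PVE host, something like this (replace /dev/sdX with the passed-through backup disk):
Code:
smartctl -a /dev/sdX        # full SMART report incl. error log
smartctl -t short /dev/sdX  # start a short self-test, re-check with -a afterwards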

When did you first notice the issue? Was it immediately after any changes/updates?
 
Config of the PBS VM
Bash:
agent: 1
bios: ovmf
boot: order=scsi1
cores: 2
efidisk0: extra-ssd:127/vm-127-disk-0.qcow2,efitype=4m,size=528K
ide2: none,media=cdrom
machine: q35
memory: 6144
meta: creation-qemu=7.1.0,ctime=1676859848
name: pbs
net0: virtio=7e:da:3f:84:20:b7,bridge=vmbr0,firewall=1
numa: 0
onboot: 1
ostype: l26
scsi1: extra-ssd:127/vm-127-disk-1.qcow2,discard=on,iothread=1,size=32G,ssd=1
scsi2: /dev/disk/by-id/ata-WDC_WD4000FYYZ-05UL1B0_WD-WCC131790216,backup=0,iothread=1,size=3907018584K
scsihw: virtio-scsi-single
smbios1: uuid=4e152022-64e7-4c0a-a850-c5de8e0de0a8
sockets: 1
startup: order=99
vmgenid: 5be3c598-485b-433d-9974-bf90824bda74

Config from one of the larger VMs; it's on another Proxmox host
Bash:
agent: 1
bios: ovmf
boot: order=scsi0
cores: 20
cpu: host
efidisk0: local-lvm:vm-108-disk-1,size=4M
hostpci0: 0000:02:00,pcie=1
ide2: none,media=cdrom
machine: q35
memory: 12288
name: plex
net0: virtio=88:D7:F6:D5:4D:CD,bridge=vmbr1,queues=20
numa: 0
onboot: 1
ostype: l26
scsi0: local-lvm:vm-108-disk-0,cache=writethrough,discard=on,iothread=1,size=128G,ssd=1
scsi1: extra-ssd:108/vm-108-disk-0.qcow2,backup=0,discard=on,iothread=1,size=120G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=7f7796e1-22a9-49e0-974b-a6d5626594d6
sockets: 1
startup: order=1
usb0: host=2040:826d
vmgenid: fc421fe4-164c-4913-8ad1-ae78a685d18a

storage.cfg from the PVE host running PBS
Bash:
dir: local
        path /var/lib/vz
        content vztmpl,iso
        shared 0

lvmthin: local-lvm
        thinpool data
        vgname pve
        content rootdir,images

dir: extra-ssd
        path /mnt/pve/extra-ssd
        content images,rootdir
        prune-backups keep-all=1
        shared 0

pbs: pbs
        datastore backup
        server pbs.gravee.com
        content backup
        fingerprint 01:27:43:64:db:2b:f8:66:34:db:d7:91:31:83:f3:60:57:3f:4f:95:8f:83:e3:29:48:9b:2f:b7:ab:33:56:ee
        prune-backups keep-all=1
        username backup-user@pbs

journalctl on PVE; I started a backup job on a small VM on Jun 16 at 10:16
Bash:
Jun 14 04:12:12 pmx4 pvedaemon[1875]: worker 1152342 started
Jun 14 04:12:12 pmx4 pvedaemon[1875]: worker 1152343 started
Jun 14 04:12:17 pmx4 pvedaemon[2874918]: worker exit
Jun 14 04:12:17 pmx4 pvedaemon[2874305]: worker exit
Jun 14 04:12:17 pmx4 pvedaemon[2883377]: worker exit
Jun 14 04:12:17 pmx4 pvedaemon[1875]: worker 2874305 finished
Jun 14 04:12:17 pmx4 pvedaemon[1875]: worker 2874918 finished
Jun 14 04:12:17 pmx4 pvedaemon[1875]: worker 2883377 finished
Jun 14 10:32:35 pmx4 pvedaemon[1152341]: VM 127 qmp command failed - VM 127 qmp command 'guest-ping' failed - got timeout
Jun 14 10:32:37 pmx4 pvedaemon[1152342]: VM 127 qmp command failed - VM 127 qmp command 'guest-ping' failed - got timeout
Jun 14 10:35:52 pmx4 pvedaemon[1152342]: VM 127 qmp command failed - VM 127 qmp command 'guest-ping' failed - got timeout
Jun 14 10:35:56 pmx4 pvedaemon[1152341]: pbs: error fetching datastores - 500 Can't connect to pbs.gravee.com:8007 (Connection timed out)
Jun 14 10:36:00 pmx4 pvedaemon[1152342]: <root@pam> starting task UPID:pmx4:0013CE19:05F38251:6489D050:qmstop:127:root@pam:
Jun 14 10:36:00 pmx4 pvedaemon[1297945]: stop VM 127: UPID:pmx4:0013CE19:05F38251:6489D050:qmstop:127:root@pam:
Jun 14 10:36:09 pmx4 pvedaemon[1152341]: VM 127 qmp command failed - VM 127 qmp command 'query-proxmox-support' failed - unable to connect to VM 127 qmp soc>
Jun 14 10:36:18 pmx4 pvedaemon[1152341]: VM 127 qmp command failed - VM 127 qmp command 'query-proxmox-support' failed - unable to connect to VM 127 qmp soc>
Jun 14 10:36:18 pmx4 pvedaemon[1152343]: VM 127 qmp command failed - VM 127 qmp command 'query-proxmox-support' failed - unable to connect to VM 127 qmp soc>
Jun 14 10:36:30 pmx4 pvedaemon[1297945]: VM still running - terminating now with SIGTERM
Jun 14 10:36:40 pmx4 pvedaemon[1297945]: VM still running - terminating now with SIGKILL
Jun 14 10:36:40 pmx4 pvedaemon[1152343]: VM 127 qmp command failed - VM 127 not running
Jun 14 10:36:41 pmx4 pvedaemon[1152342]: <root@pam> end task UPID:pmx4:0013CE19:05F38251:6489D050:qmstop:127:root@pam: OK
Jun 14 10:37:45 pmx4 pvedaemon[1152342]: <root@pam> starting task UPID:pmx4:0013D0C8:05F3AB45:6489D0B9:qmstart:127:root@pam:
Jun 14 10:37:45 pmx4 pvedaemon[1298632]: start VM 127: UPID:pmx4:0013D0C8:05F3AB45:6489D0B9:qmstart:127:root@pam:
Jun 14 10:37:46 pmx4 pvedaemon[1152342]: <root@pam> end task UPID:pmx4:0013D0C8:05F3AB45:6489D0B9:qmstart:127:root@pam: OK
Jun 15 13:55:39 pmx4 pvedaemon[1152343]: VM 127 qmp command failed - VM 127 qmp command 'guest-ping' failed - got timeout
Jun 15 13:55:42 pmx4 pvedaemon[1152342]: VM 127 qmp command failed - VM 127 qmp command 'guest-ping' failed - got timeout
Jun 15 13:56:01 pmx4 pvedaemon[1152343]: VM 127 qmp command failed - VM 127 qmp command 'guest-ping' failed - got timeout
Jun 16 10:16:51 pmx4 pvedaemon[1152343]: <root@pam> starting task UPID:pmx4:0028FE35:06F96D5D:648C6ED3:vzdump:118:root@pam:
Jun 16 10:16:51 pmx4 pvedaemon[2686517]: INFO: starting new backup job: vzdump 118 --mailto gouthamravee@gmail.com --mode snapshot --node pmx4 --prune-backu>
Jun 16 10:16:51 pmx4 pvedaemon[2686517]: INFO: Starting Backup of VM 118 (qemu)

Backup job log; it basically stalled around 11%
Code:
INFO: starting new backup job: vzdump 118 --mailto gouthamravee@gmail.com --mode snapshot --node pmx4 --prune-backups 'keep-last=2' --mailnotification failure --all 0 --storage pbs --notes-template '{{guestname}}'
INFO: Starting Backup of VM 118 (qemu)
INFO: Backup started at 2023-06-16 10:16:51
INFO: status = running
INFO: VM Name: archive
INFO: include disk 'scsi0' 'local-lvm:vm-118-disk-1' 32G
INFO: include disk 'efidisk0' 'local-lvm:vm-118-disk-0' 528K
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/118/2023-06-16T14:16:51Z'
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
INFO: started backup task 'fc61a1e9-40c7-407a-af8d-6db42dfe3635'
INFO: resuming VM again
INFO: efidisk0: dirty-bitmap status: created new
INFO: scsi0: dirty-bitmap status: created new
INFO:   2% (792.0 MiB of 32.0 GiB) in 3s, read: 264.0 MiB/s, write: 76.0 MiB/s
INFO:   6% (2.0 GiB of 32.0 GiB) in 6s, read: 420.0 MiB/s, write: 52.0 MiB/s
INFO:   9% (2.9 GiB of 32.0 GiB) in 9s, read: 316.0 MiB/s, write: 16.0 MiB/s
INFO:  11% (3.5 GiB of 32.0 GiB) in 12s, read: 210.7 MiB/s, write: 17.3 MiB/s
ERROR: interrupted by signal
INFO: aborting backup job
INFO: resuming VM again
ERROR: Backup of VM 118 failed - interrupted by signal
INFO: Failed at 2023-06-16 10:19:32
ERROR: Backup job failed - interrupted by signal
TASK ERROR: interrupted by signal

datastore.cfg from PBS
Code:
datastore: backup
        path /mnt/datastore/backup

mount output from PBS
Bash:
/dev/mapper/pbs-root on / type ext4 (rw,relatime,errors=remount-ro)
/dev/sdb1 on /mnt/datastore/backup type ext4 (rw,relatime)

I did not realize running PBS as a VM was not recommended; I thought that was one of the supported use cases.

SMART values on all the drives used by PBS and its PVE host are good.

I noticed the issue relatively recently; I've been using PBS in a similar config for at least 2 years, and this specific setup is about a year old.
I feel like the issue started around the last PBS update, but unfortunately I can't say for sure because it has been happening for almost 2 months now.
 
I did not realize running PBS as a VM was not recommended; I thought that was one of the supported use cases.

https://pbs.proxmox.com/docs/proxmox-backup.pdf#1c
Caution: Installing the backup server directly on the hypervisor is not recommended. It is safer
to use a separate physical server to store backups. Should the hypervisor server fail, you can
still access the backups.

Although the best solution performance-wise (in the long run) would be to install PBS directly on a host using the provided installer image, you can still try to improve your current setup by bringing it closer to:

https://pbs.proxmox.com/docs/proxmox-backup.pdf#10
2.1.2 Recommended Server System Requirements

E.g. use 4 cores instead of 2 and maybe 7GB+ of RAM (4GB base for PBS plus roughly 3GB for your ~3TB datastore).
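On the PVE host that could look like this (just a sketch; 127 is your PBS VMID, memory is in MiB, and it takes effect after a VM restart):
Code:
qm set 127 --cores 4 --memory 8192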

Furthermore, you can try setting the 'bwlimit' (https://pve.proxmox.com/pve-docs/pve-admin-guide.html#vzdump_configuration) in the WebUI (Datacenter > Options > Bandwidth Limits > Backup Restore) to e.g. 20 MiB/s, because once the cache is exhausted that is roughly the write performance your log shows:
INFO: 11% (3.5 GiB of 32.0 GiB) in 12s, read: 210.7 MiB/s, write: 17.3 MiB/s

Another thing that might help is setting 'performance: max-workers=1' in /etc/vzdump.conf and increasing it gradually if it improves things.
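As a sketch, both settings together in /etc/vzdump.conf would look like this (bwlimit is given in KiB/s, so 20480 = 20 MiB/s):
Code:
# /etc/vzdump.conf
bwlimit: 20480
performance: max-workers=1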
 
