Proxmox 8 - Backup to PBS - High IO Delay - Server gets stuck

AechoOne26
Aug 16, 2023
Good Day,

we have a new Proxmox node hosting KVM guests with large amounts of storage. It is connected to the PBS server via a 2x 10G active-backup bond.

The setup is:
AMD EPYC 7443P 24-core processor
512 GB RAM
RaidZ2 with 8x 20 TB HDD, plus 2x 3.84 TB SSD in a mirror as the special (metadata) device.
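In zpool terms the layout is roughly the following (device names are placeholders, not the actual disks):
Code:
# hypothetical recreation of the pool layout described above
zpool create zfs-hdd \
    raidz2 /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh \
    special mirror /dev/nvme0n1 /dev/nvme1n1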

Some of the VMs have disks of up to 30 TB.

As soon as a backup job starts, the IO delay on the node climbs above 50% and the running VMs become unreachable (for example the webserver).

We have tried setting ionice to 8 and a bandwidth limit in /etc/vzdump.conf, but that did not help. For now the backups run at 50 Mbit/s, which is not a workable solution: at that rate a backup takes about four weeks, and we need daily backups.
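For reference, the vzdump defaults we tried were along these lines (the exact values below are illustrative):
Code:
# /etc/vzdump.conf -- node-wide vzdump defaults
# bwlimit is in KiB/s; ~6400 KiB/s corresponds to roughly 50 Mbit/s
bwlimit: 6400
# I/O priority (0-8); only takes effect with schedulers that honour
# I/O priorities, e.g. BFQ
ionice: 8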

Any solutions here?

Best regards
 
Hi,
please provide the output of
Code:
# PBS host
proxmox-backup-manager version --verbose

# PVE host
qm config <VMID>
cat /etc/pve/storage.cfg
as well as the complete backup task log.

Is the setup you described above the PVE host or the PBS host? Has the VM been shut off since the last completed backup job (or, in other words, is the dirty bitmap still in use)?
 
we have a new Proxmox node hosting KVM guests with large amounts of storage. It is connected to the PBS server via a 2x 10G active-backup bond.
If I understand your post correctly, you described the hardware setup of your compute node. However, all we know about the PBS server is that it has, effectively, a single 10G connection to the compute node.

What you have to keep in mind is that the PBS server's compute power and network connectivity are in the critical path of your production traffic during a backup:
https://git.proxmox.com/?p=pve-qemu.git;a=blob_plain;f=backup.txt

Specifically, changes/writes to existing data can be held up by slow backup infrastructure:

Code:
1.) read old data before it gets overwritten
2.) write that data into the backup archive
3.) write new data (VM write)


 
Hi,
please provide the output of
Code:
# PBS host
proxmox-backup-manager version --verbose

# PVE host
qm config <VMID>
cat /etc/pve/storage.cfg
as well as the complete backup task log.

Is the setup you described above the PVE host or the PBS host? Has the VM been shut off since the last completed backup job (or, in other words, is the dirty bitmap still in use)?
Good Day,

the described setup is the PVE node. For completeness, here is the PBS setup:

2x Intel(R) Xeon(R) CPU E5-2620 v4
256G RAM
ZFS pool with one RaidZ2 vdev of 8x 18 TB HDD and one RaidZ2 vdev of 8x 20 TB HDD // special device: SSD RAID10, 2x 1 TB + 2x 2 TB
2x 10G LACP Bond


---------------------------------------------------------------------------------------
Here is a log from our last attempt - the problem occurs with all VMs, and we had to stop the backup job.

Code:
INFO: starting new backup job: vzdump 110 --remove 0 --mode snapshot --notes-template '{{guestname}}' --node root2476 --storage FRA1.PBS1
INFO: Starting Backup of VM 110 (qemu)
INFO: Backup started at 2023-08-17 17:26:20
INFO: status = running
INFO: VM Name: stream01.mediathek-hessen.de
INFO: include disk 'scsi0' 'zfs-hdd:vm-110-disk-0' 30T
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/110/2023-08-17T15:26:20Z'
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
INFO: started backup task '768544db-d71b-4f0f-997b-c6e7954a6c4c'
INFO: resuming VM again
INFO: scsi0: dirty-bitmap status: existing bitmap was invalid and has been cleared
INFO:   0% (1.4 GiB of 30.0 TiB) in 3s, read: 466.7 MiB/s, write: 250.7 MiB/s
ERROR: interrupted by signal
INFO: aborting backup job
INFO: resuming VM again
ERROR: Backup of VM 110 failed - interrupted by signal
INFO: Failed at 2023-08-17 17:28:23
ERROR: Backup job failed - interrupted by signal
TASK ERROR: interrupted by signal
PBS Host:
Code:
proxmox-backup             2.4-1        running kernel: 5.15.39-2-pve
proxmox-backup-server      2.4.3-1      running version: 2.2.5
pve-kernel-5.15            7.4-4
pve-kernel-helper          7.3-8
pve-kernel-5.13            7.1-9
pve-kernel-5.15.39-2-pve   5.15.39-2
pve-kernel-5.13.19-6-pve   5.13.19-15
pve-kernel-5.13.19-2-pve   5.13.19-4
pve-kernel-5.13.19-1-pve   5.13.19-3
ifupdown2                  3.1.0-1+pmx4
libjs-extjs                7.0.0-1
proxmox-backup-docs        2.4.3-1
proxmox-backup-client      2.4.3-1
proxmox-mini-journalreader 1.2-1
proxmox-widget-toolkit     3.7.3
pve-xtermjs                4.16.0-2
smartmontools              7.2-pve3
zfsutils-linux             2.1.11-pve1


qm config 110

Code:
agent: 1
boot: order=scsi0;ide2;net0
cores: 24
cpu: kvm64,flags=+aes
hotplug: disk,network,usb
ide2: none,media=cdrom
memory: 229376
name: censored
net0: virtio=EE:D4:32:42:32:4E,bridge=vmbr0
numa: 1
onboot: 1
ostype: l26
scsi0: zfs-hdd:vm-110-disk-0,format=raw,size=30T
scsihw: virtio-scsi-pci
smbios1: uuid=e10dabae-068a-40cc-81a4-19307c9904ce
sockets: 1
vmgenid: 3ece1bfc-1556-4a56-ba05-7cab09b841c5

storage.cfg PVE

Code:
dir: local
        path /var/lib/vz
        content images,vztmpl,rootdir,iso
        prune-backups keep-all=1
        shared 0

pbs: FRA1.PBS1
        datastore censored(customerid)
        server 10.0.0.41
        content backup
        fingerprint censored
        nodes root2476,root2475
        prune-backups keep-all=1
        username censored(customerid)

zfspool: zfs-ssd
        pool zfs-ssd
        content images,rootdir
        mountpoint /zfs-ssd
        nodes root2475
        sparse 1

zfspool: zfs-hdd
        pool zfs-hdd
        content images,rootdir
        mountpoint /zfs-hdd
        nodes root2476
        sparse 1
 
So the issue seems to be that your ZFS storage on the PVE host is overwhelmed by the read requests from the backup client. As the VM has been shut down, the dirty bitmap has been invalidated and the whole disk is backed up again.

On the PVE side you unfortunately cannot adjust the bandwidth and ionice values; you will have to set up a bandwidth limit for the PVE host on the PBS side [0]. Half of the read speed seen in the backup job you posted (~200 MiB/s) might be a good starting point.

Once the backup has completed, the dirty bitmap will be taken into account for the next backup job, which will then be much faster. Note, however, that this only holds until you stop the VM.

[0] https://pbs.proxmox.com/docs/network-management.html#traffic-control
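Assuming the traffic-control approach from [0], a rule on the PBS host could look roughly like this (the network and rate below are placeholders to adapt to your environment):
Code:
# on the PBS host: throttle inbound backup traffic from the PVE subnet to ~200 MiB/s
proxmox-backup-manager traffic-control create limit-pve \
    --network 10.0.0.0/24 \
    --rate-in 200MB \
    --comment "limit backup ingest from the PVE hosts"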

Edit: Incorrect Information
 
Hey guys,

we've activated discard on the VMs, and a full backup now works like a charm with limited bandwidth & ionice 8. But incremental backups show the old behaviour: IO delay goes over 50% and the VMs become unreachable...
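For the record, enabling discard on a disk amounts to something like this (VMID and volume taken from the config posted earlier; the guest still needs to issue trims, e.g. via fstrim or the discard mount option):
Code:
qm set 110 --scsi0 zfs-hdd:vm-110-disk-0,discard=on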
 
Hi all,

we had the same experience after upgrading from PVE 7 to 8 today: the backup was cancelled (my fault) and left an invalid bitmap.
Never saw this before; usually we don't use hard drives in RaidZ2.

The system is not that different:
AMD EPYC 7313P
128 GB RAM (50% reserved for the ARC cache; see the snippet below)
2x 8 TB Samsung SAS SSD - mirror
4x 12 TB WD Gold SATA HDD - RaidZ2
10G connection PVE <> PBS (internal)
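Capping the ARC at half the RAM is done with a modprobe option roughly like this (the byte value is simply 64 GiB spelled out; adjust to your RAM):
Code:
# /etc/modprobe.d/zfs.conf -- cap the ZFS ARC at 64 GiB (half of 128 GB RAM)
options zfs zfs_arc_max=68719476736
# apply with: update-initramfs -u and a reboot, or write the value to
# /sys/module/zfs/parameters/zfs_arc_max for a live change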

1 VM with virtual disks summing up to 4 TB on SSD and 6 TB on the HDD RaidZ2

IO delay around 40% while the backup runs without a valid dirty bitmap - the backup time went up from 6 minutes to ~12 hours.
Meanwhile the VM stays snappy and responsive.

I think the problem in your configuration is the high IOPS load on the RaidZ2. High IOPS on the PVE installation partition is bad.

I would put PVE itself on SSDs and separate the VMs' OS disks from their data, keeping the OS disks on SSD. If possible, use SAS SSDs.

That way the system should not lock up when a VM demands high IO during a backup.
 
Hi all,

we had the same experience after upgrading from PVE 7 to 8 today: the backup was cancelled (my fault) and left an invalid bitmap.
Never saw this before; usually we don't use hard drives in RaidZ2.

The system is not that different:
AMD EPYC 7313P
128 GB RAM (50% reserved for the ARC cache)
2x 8 TB Samsung SAS SSD - mirror
4x 12 TB WD Gold SATA HDD - RaidZ2
10G connection PVE <> PBS (internal)

1 VM with virtual disks summing up to 4 TB on SSD and 6 TB on the HDD RaidZ2

IO delay around 40% while the backup runs without a valid dirty bitmap - the backup time went up from 6 minutes to ~12 hours.
Meanwhile the VM stays snappy and responsive.

I think the problem in your configuration is the high IOPS load on the RaidZ2. High IOPS on the PVE installation partition is bad.

I would put PVE itself on SSDs and separate the VMs' OS disks from their data, keeping the OS disks on SSD. If possible, use SAS SSDs.

That way the system should not lock up when a VM demands high IO during a backup.
Thanks for your answer. Unfortunately the system IS on a separate RAID; I just forgot to mention that. PVE itself stays responsive during the backup. The problem is that the VMs are NOT responsive, which is not acceptable: it's critical infrastructure.
 
>the backup was cancelled (my fault) and left an invalid bitmap.

Could you describe this in more detail? What exactly happened, where do you get that information from, and what is being logged?
 
Thanks for your answer. Unfortunately the system IS on a separate RAID; I just forgot to mention that. PVE itself stays responsive during the backup. The problem is that the VMs are NOT responsive, which is not acceptable: it's critical infrastructure.
So the Proxmox web interface is responsive while the backup runs?
Do you see IO delay when no backup is running?
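For example, watching the pool on the PVE host with and without a backup running would show whether the HDD vdev is the bottleneck (pool name taken from the storage.cfg posted above):
Code:
# on the PVE host: per-vdev bandwidth and IOPS, refreshed every 5 seconds
zpool iostat -v zfs-hdd 5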
 
>the backup was cancelled (my fault) and left an invalid bitmap.

Could you describe this in more detail? What exactly happened, where do you get that information from, and what is being logged?
I ran the pve7to8 upgrade while a backup task was scheduled and had already started in the background (I had forgotten about it).

Then I got this:
Code:
could not activate storage 'pbs': pbs: error fetching datastores - Bareword "URI::HAS_RESERVED_SQUARE_BRACKETS" not allowed while "strict subs" in use at /usr/share/perl5/URI/_generic.pm line 13.
Compilation failed in require at /usr/share/perl5/URI.pm line 103.

After the reboot I got "dirty-bitmap status: existing bitmap was invalid and has been cleared"
 
