strange issues with snapshot backups

K-P4ul

Member
Jan 28, 2021
Hi,
I have a strange issue with the Proxmox backup function. I have 4 clusters (2+1) running with DRBD as the storage backend.
At another location (a data center) I run a Proxmox Backup Server (PBS). Both locations are connected via 1 Gbit/s.
Now I want to back up all of my VMs to the PBS over the 1 Gbit/s internet link. I know the initial run will take several hours, maybe days. The backup mode on the PVEs is set to snapshot and I have one job per cluster (it seems to run 2 jobs in parallel).
When the backup runs, after a few hours I get I/O errors on the VMs that are being backed up. It seems the block device stalls for a long time (over 600 seconds). Then the VM kernel remounts the filesystems read-only.
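For a rough sense of how long the initial run takes over such a link, here is a back-of-the-envelope sketch. The 4 TiB data size and the 80% link efficiency are made-up example values, not figures from this setup, and the function name is hypothetical.

```python
def transfer_hours(data_tib: float, link_gbit_s: float = 1.0,
                   efficiency: float = 0.8) -> float:
    """Estimate hours needed to push `data_tib` TiB over the link.

    `efficiency` is an assumed fraction of the nominal link speed that is
    actually usable (protocol overhead, other traffic on the link).
    """
    data_bits = data_tib * 1024**4 * 8           # TiB -> bits
    usable_bits_per_s = link_gbit_s * 1e9 * efficiency
    return data_bits / usable_bits_per_s / 3600  # seconds -> hours


# Hypothetical example: 4 TiB of VM disks over 1 Gbit/s.
print(f"{transfer_hours(4):.1f} h")
```

Compression and deduplication on the PBS side would shorten this, so treat the result as a pessimistic estimate.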

This leads me to my first question:

Is a long runtime a problem when using snapshot mode?

To better understand what's going on, I want to see where the snapshots are created while the backup is running.
When I take a snapshot in the Proxmox UI, I can see that DRBD creates a snapshot, and after successful creation I can see the snapshot in the LINSTOR controller. But while a Proxmox backup is running, I cannot see anything about snapshots in DRBD or LINSTOR.


This brings me to my second question:

Does Proxmox backup in snapshot mode really create snapshots? Or does it skip that step and just copy the mounted DRBD volume? Maybe there is some hidden Proxmox magic going on that I don't know about?
 
A "snapshot backup" creates a dirty bitmap at the QEMU level (not at the storage level) of the blocks that are changed during the backup.
If a block is changed during the backup before it has been written to the backup media, one of two things can happen:
a) the change (write) is held until the block has been successfully pushed to PBS
b) if fleecing storage is enabled, the block is copied to the fleecing storage and the write is acknowledged.

If your backup flow gets bogged down for any reason, (a) can cause IO timeouts and errors. This is why (b) was created.
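To make the difference between (a) and (b) concrete, here is a toy model in Python. It is only an illustration of the behavior described above, not QEMU or Proxmox code; the class and method names are invented for this sketch.

```python
class BackupSim:
    """Toy model of how a guest write is handled while a backup runs."""

    def __init__(self, fleecing: bool):
        self.fleecing = fleecing
        self.saved = set()    # blocks whose old content reached the backup target
        self.fleeced = set()  # blocks whose old content sits in the fleecing image

    def guest_write(self, block: int, target_responsive: bool) -> str:
        # Old content already preserved somewhere: write goes straight through.
        if block in self.saved or block in self.fleeced:
            return "ack"
        # Case (b): copy the old block to fleecing storage, then acknowledge.
        if self.fleecing:
            self.fleeced.add(block)
            return "ack"
        # Case (a): the write must wait until the old block is on the target.
        if target_responsive:
            self.saved.add(block)
            return "ack"
        return "stalled"


slow_link = BackupSim(fleecing=False)
print(slow_link.guest_write(7, target_responsive=False))  # stalled

with_fleecing = BackupSim(fleecing=True)
print(with_fleecing.guest_write(7, target_responsive=False))  # ack
```

In the model, a write that stays "stalled" long enough is what the guest eventually sees as an IO timeout.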

There is no integration/interaction with DRBD or any other storage type.



Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Thank you for the explanation. I think it would be better to host a PBS on site and use the remote sync feature for offsite backup.
 
You can also use a fleecing device, set in the advanced options of the scheduled backup, to define a temporary buffer disk (which could be on local storage).
Yes, they could. However, if the link to this offsite PBS is not great, they could run into various other edge conditions. For example, active changes over a slow link may mean the fleecing storage has to be close to 100% of the size of the primary one. I have not looked at the code: if they run out of fleecing storage, does the backup process switch to non-fleecing mode, fail, or start producing IO errors?
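The "close to 100%" worry can be sketched as a simple upper bound. This assumes each block needs to be copied to fleecing storage at most once per backup, so the requirement can never exceed the disk size. The function name and all numbers are hypothetical examples.

```python
def fleecing_upper_bound_gib(disk_gib: float,
                             first_overwrite_gib_per_hour: float,
                             backup_hours: float) -> float:
    """Rough upper bound on fleecing space needed for one disk.

    `first_overwrite_gib_per_hour` is an assumed rate at which the guest
    overwrites blocks it has not touched yet during this backup run.
    """
    return min(disk_gib, first_overwrite_gib_per_hour * backup_hours)


# Hypothetical 500 GiB disk with 10 GiB/h of first-time overwrites:
print(fleecing_upper_bound_gib(500, 10, 24))  # 240
print(fleecing_upper_bound_gib(500, 10, 60))  # 500
```

The longer the backup runs over a slow link, the closer the bound gets to the full disk size, which is the edge condition described above.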

The "back up locally and synchronize to a remote PBS" approach is the best one. It also checks off the 3-2-1 rule at the same time.


If your fleecing storage is full, sure, you'll get an IO error.

But it also helps to avoid slowdowns during the backup if your PBS is slow. For example, production is on NVMe and PBS is on HDD. You'll see a slowdown while the backup is running whenever a write hits a block that has not been saved yet. With fleecing, if you put it on the same NVMe, it will cache the writes when PBS is not fast enough (or if PBS crashes or reboots during the backup).
 
Yes, we are on the same page about it. The OP seemed to indicate that PBS in their particular case is remote. Fleecing storage is still a benefit, even if PBS is brought local and then synced to a remote site.


 
"If your fleecing storage is full, sure, you'll get an IO error."
Yes, but only for the fleecing image, and the backup should then be cleanly aborted without the guest noticing anything. Of course, if the guest is using the very same storage, the whole storage is full, and there is bad timing, it can also get an IO error ;)