VMs freezing and unreachable when backup server is slow

christian.g · May 17, 2022

We are having a hard time too. The combination of dirty bitmaps and a moderately fast PBS is giving us nightmares. We have more than 40 VMs, a few of which are quite big database servers (>3TB). Sometimes they need updates and a reboot is required, which in turn invalidates the dirty bitmaps, and a full backup is the result. This then delays the backup of all other VMs on the node and makes those database VMs unusable for hours. Hard-resetting such a big database server and the log recovery that follows make things even worse.
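To illustrate why losing the bitmap hurts so much, here is a toy Python model of dirty-block tracking. It is only a sketch: the class and method names are made up for illustration and are not the QEMU or PBS API, and the invalidation is simply modelled as "every block is dirty again".

```python
class DirtyBitmap:
    """Toy model: tracks which blocks changed since the last backup."""

    def __init__(self, num_blocks):
        self.num_blocks = num_blocks
        # No previous backup exists yet, so every block counts as dirty.
        self.dirty = set(range(num_blocks))

    def guest_write(self, index):
        # Any guest write marks the block for the next backup.
        self.dirty.add(index)

    def run_backup(self):
        # Return how many blocks had to be read and sent, then reset.
        sent = len(self.dirty)
        self.dirty.clear()
        return sent

    def invalidate(self):
        # Bitmap lost (assumed here: VM restarted), all blocks dirty again.
        self.dirty = set(range(self.num_blocks))


bitmap = DirtyBitmap(num_blocks=1000)
first = bitmap.run_backup()          # initial full backup
bitmap.guest_write(7)
bitmap.guest_write(42)
incremental = bitmap.run_backup()    # only the two changed blocks
bitmap.invalidate()
after_restart = bitmap.run_backup()  # full backup again
print(first, incremental, after_restart)
```

With a multi-terabyte disk, the difference between the second and third run is the difference between minutes and hours.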

Is there any design progress?
What about incorporating Ceph snapshots, if Ceph is in use, instead of using QEMU dirty bitmaps?
I know you are trying to build a solution that works in every case and hence use QEMU, but these blockers/delays/freezes/hard resets are a big problem.

michel.seicon · Jul 12, 2022

Me too

tuxick · Jul 15, 2022

dcsapak said:
this is how the backup works, it intercepts write calls from the vm and backs up the relevant block (detailed info here: https://git.proxmox.com/?p=pve-qemu...16aeb06259c2bcd196e4949c;hb=refs/heads/master)

we already cache some blocks in memory, but if the backup storage is too slow, symptoms such as this can happen

Any way to increase the cache size?

RolandK · Jul 22, 2022

I wonder why that cache is memory-only and why it doesn't get sent to disk as well, or instead, when the cache is getting full. If the network or the backup server is slow, it's unacceptable that VMs get I/O errors because of it.

phs · Jul 22, 2022

That is a critical issue. It just cannot be that a backup crashes a VM. Is this being worked on? Is there a usable workaround?

Stefano Giunchi · Jul 28, 2022

dcsapak said:
this is how the backup works, it intercepts write calls from the vm and backs up the relevant block (detailed info here: https://git.proxmox.com/?p=pve-qemu...16aeb06259c2bcd196e4949c;hb=refs/heads/master)

we already cache some blocks in memory, but if the backup storage is too slow, symptoms such as this can happen

From the file I read:

* slow backup storage can slow down VM during backup
It is important to note that we only do sequential writes to the backup storage. Furthermore one can compress the backup stream. IMHO, it is better to slow down the VM a bit. All other solutions creates large amounts of temporary data during backup.

In fact, depending on the backup speed, VMs are not slowed down a bit: they are slowed down a lot, frozen, or even crashed.
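A toy model makes the effect visible. The sketch below is not the QEMU implementation: it assumes every guest write hits a block that has not been backed up yet, that old blocks wait in a fixed-size in-memory buffer, and that the backup target drains one block at a fixed rate. All names and numbers are hypothetical, and time is simulated rather than measured.

```python
from collections import deque


def simulate_cbw(num_writes, guest_interval, target_seconds_per_block, buffer_blocks):
    """Return the stall (in simulated seconds) the guest sees on each write."""
    pending = deque()     # finish times of blocks queued for the backup target
    target_free_at = 0.0  # when the target can accept the next block
    now = 0.0
    stalls = []
    for _ in range(num_writes):
        # Forget blocks the target has already finished writing.
        while pending and pending[0] <= now:
            pending.popleft()
        stall = 0.0
        if len(pending) >= buffer_blocks:
            # Buffer full: the guest write waits for the oldest block to leave.
            wait_until = pending.popleft()
            stall = wait_until - now
            now = wait_until
        # Queue the old block contents behind whatever the target is doing.
        start = max(now, target_free_at)
        target_free_at = start + target_seconds_per_block
        pending.append(target_free_at)
        stalls.append(stall)
        now += guest_interval
    return stalls


fast = simulate_cbw(6, guest_interval=1.0, target_seconds_per_block=0.5, buffer_blocks=4)
slow = simulate_cbw(6, guest_interval=1.0, target_seconds_per_block=3.0, buffer_blocks=2)
print(fast)
print(slow)
```

As long as the target keeps up, the guest never waits. Once the buffer is full, every guest write is paced by the backup target, which matches the 15-second write latencies reported below.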

On Windows machines, I receive the ESENT/508 error: svchost (1008) SoftwareUsageMetrics-Svc: A request to write to the file "C:\Windows\system32\LogFiles\Sum\Svc.log" at offset 3043328 (0x00000000002e7000) for 4096 (0x00001000) bytes succeeded, but took an abnormally long time (15 seconds) to be serviced by the OS. This problem is likely due to faulty hardware. Please contact your hardware vendor for further assistance diagnosing the problem.

I'm going to give my +1 to https://bugzilla.proxmox.com/show_bug.cgi?id=3631

RolandK · Jul 28, 2022

I think it would be best to do a showcase using some network bandwidth throttling tool and some I/O throttling tool on the PBS, just to demonstrate how badly things can behave...

The easiest way to do that should be to set up a virtual PBS and limit its virtual NIC and disk in the hardware settings dialog.

RolandK · Jul 31, 2022

There is backup fleecing, which can redirect temporary data to a local file instead of buffering it in RAM. That sounds like a good solution: https://bugzilla.proxmox.com/show_bug.cgi?id=4136
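The idea behind fleecing can be sketched in a few lines of Python. This is a simplified illustration, not the QEMU code: the class, its methods, and the use of a dict as the "local fleecing file" are all assumptions made for the example.

```python
class FleecedBackup:
    """Toy model: old block contents go to a local fleecing image, not to the backup server."""

    def __init__(self, disk):
        self.disk = disk   # live guest disk: list of block contents
        self.fleece = {}   # local fleecing image: block index -> old contents
        self.done = set()  # blocks already sent to the backup target

    def guest_write(self, index, data):
        # Preserve the point-in-time contents locally before overwriting.
        # No round trip to the (possibly slow) backup server is needed.
        if index not in self.done and index not in self.fleece:
            self.fleece[index] = self.disk[index]
        self.disk[index] = data

    def backup_block(self, index):
        # The backup job prefers the preserved copy and frees it afterwards.
        data = self.fleece.pop(index, self.disk[index])
        self.done.add(index)
        return data


job = FleecedBackup(["a0", "b0", "c0"])
job.guest_write(1, "b1")  # guest overwrites block 1 while the backup is running
backed_up = [job.backup_block(i) for i in range(3)]
print(backed_up, job.disk)
```

The backup still contains the state from when it started, while the guest only ever waits for local storage.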

Darkk · Aug 2, 2022

mgiammarco said:
I have a customer with a database on Ceph/NVMe disks. They write to the database 24 hours a day. The backup server has 5400rpm disks. When the backup starts, the database VM slows down. Please note that this is not a network problem, because the Proxmox->PBS connection is 10Gbit.
I repeat: to optimize the backup process, you think it is "acceptable" to bind VM disk speed to the backup unit's disk speed. This is a classic example of "perfect is the enemy of good".
Do you know we have so many problems due to this choice?
Where can I file a complaint or a request to change this behaviour? At least for the Ceph filesystem.
Thanks,
Mario

I know what you're going through trying to deal with VMs that are running on Ceph. I've experienced the same slowdowns, and sometimes freezes, during backups. This was before PBS came around. I ended up trashing the servers in favor of a different solution for now. I will get back to it when the subscription runs out.

I run a two-node Proxmox setup for my home lab. Based on my trial and error, I realized that trying to back up super large VMs that hold 1-2TB of data is pretty much fruitless. It seems easier to create small VMs that just run the apps and can be backed up quickly, while the data actually resides on TrueNAS, which uses its own backup system. I use NFS shares. This is just an example.

You would shut down SQL on the VM and then back up the VM in a powered-off state. This way you have a full working backup image of it. Then use SQL backup to run your daily backups. For recovery, you just restore the database server VM and then restore from the SQL backup.

I can tell you that using PBS to back up small SQL database servers does not pose a problem, but always use the native SQL backup as well, just in case. Backing up very busy SQL servers at the VM level is asking for problems with corruption.

Ultranium · Aug 2, 2022

Yeah, this is a major problem.

I have a busy VM with a ~3TB disk in it. The backup takes almost 6 hours, and during this time all sorts of disk IO-related errors happen inside the VM, and some programs just crash because of it.
I had to disable the backup for this VM just to make it usable.

Can't Proxmox use ZFS snapshots for the live backup? I never have trouble using ZFS snapshots on a running VM, not even a slight slowdown.

KB19 · Aug 3, 2022

Ultranium said:
Can't Proxmox use ZFS snapshots for the live backup?

Feature request rejected: https://bugzilla.proxmox.com/show_bug.cgi?id=3304

christian.g · Aug 3, 2022

So far, a few suggestions have been provided by the community, such as:

- manually increase the memory buffer size
- add a fast and large enough buffering device like a PCIe NVMe or a ZFS mirror of them
- use storage snapshots if available (ZFS/Ceph)
- use backup fleecing

Is Proxmox working on any of them? Any feedback from the Proxmox team?

Thanks

christian.g · Sep 14, 2022

Any update on this? We have this problem on a regular basis and it's becoming a real show stopper.

fiona · Sep 15, 2022

Hi,
not sure if it helps, but there is a QEMU build using fewer workers during backup. Might be worth a try. Be sure to stop/start the VM after installing that version or migrate your VM to a node with that version installed.

christian.g · Sep 15, 2022

Thanks, but I suspect this will slow down backups, and it sounds more like a workaround.

Why not implement an SSD/NVMe write cache in PBS?
Add a zpool of mirrored fast devices with enough capacity, write there, then flush to the spinning disks in the background.
That way one can have fast and safe backups with large capacity.
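What "enough capacity" means can be estimated with simple arithmetic: the buffer has to absorb the difference between the incoming rate and the flush rate for as long as backups are running. The helper below is a back-of-the-envelope sketch, and the rates and the window are hypothetical numbers, not measurements.

```python
def required_buffer_gb(ingest_mb_s, flush_mb_s, window_hours):
    """Capacity in GB (decimal) needed so the buffer never fills up
    while data arrives faster than it can be flushed."""
    backlog_rate = max(0.0, ingest_mb_s - flush_mb_s)  # MB/s piling up
    return backlog_rate * window_hours * 3600 / 1000


# Hypothetical: 1000 MB/s arriving, 200 MB/s flushed to disks, 2 hour window.
print(required_buffer_gb(1000, 200, 2))  # 5760.0
```

If the spinning disks can keep up with the incoming rate, the result is zero and no buffer is needed at all.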

Or use backup fleecing.

fiona · Sep 15, 2022

christian.g said:
Thanks, but I suspect this will slow down backups, and it sounds more like a workaround.

Would still be interesting to see if it improves the situation and how much it hurts performance.

christian.g said:
Why not implement an SSD/NVMe write cache in PBS?
Add a zpool of mirrored fast devices with enough capacity, write there, then flush to the spinning disks in the background.
That way one can have fast and safe backups with large capacity.

You should be able to set up your ZFS storage like this already? It doesn't have to be implemented in PBS itself, or what am I missing?

christian.g said:
Or use backup fleecing.

AFAIU, we are not generally opposed to that approach, but it would take time to evaluate and has different trade-offs than the current approach.

servada · Sep 15, 2022

We've also encountered issues with high iowait and VMs freezing up regularly during bigger backup jobs with PBS and Ceph (https://forum.proxmox.com/threads/h...rading-to-proxmox-7.113790/page-2#post-498071). The Ceph storage itself is mostly idle when this happens and not loaded at all. The backup server, however, can be quite busy at times, especially during pruning. Of course this can be planned better, but no backup solution (IMO) should ever cause workloads to freeze due to load on the backup server side. In such a case it is not a good solution.

I really hope the Proxmox team can implement an alternative method involving snapshots or the mentioned fleecing method to prevent these stalls from happening. Any other solution (e.g. a fast write cache on the PBS side) would still cause workloads to stall if anything goes wrong on the backup server. Backups should always be best-effort (at least in my opinion) compared to the actual workload. Maybe it would be good to add 'profiles' so users could, for example, choose between backup integrity (= slow down the VM, make sure the backup server is fast enough) and workload continuity (= never slow down the workload, even if it compromises backup coherency)?

christian.g · Sep 15, 2022

fiona said:
You should be able to set up your ZFS storage like this already? It doesn't have to be implemented in PBS itself, or what am I missing?

Not sure what you mean. ZFS doesn't have a Write Cache. SLOG is not a Write Cache.

fiona · Sep 16, 2022

christian.g said:
Not sure what you mean. ZFS doesn't have a Write Cache. SLOG is not a Write Cache.

I was thinking about a ZFS special device, but the size of small files written to it is limited to at most 1MiB. So for VM backups, it'd actually only be useful for metadata and likely not help that much.

EDIT: Also, IIRC, blocks would not be moved to the slower device automatically, so for file-based backups, it'd also only be useful for metadata :/

christian.g · Sep 16, 2022

fiona said:
I was thinking about a ZFS special device, but the size of small files written to it is limited to at most 1MiB. So for VM backups, it'd actually only be useful for metadata and likely not help that much.

EDIT: Also, IIRC, blocks would not be moved to the slower device automatically, so for file-based backups, it'd also only be useful for metadata :/

A Special Device is not a write cache but a "normal" vdev. Meaning, the data stays there and doesn't get flushed to the spinning disks.
We need something more like bcache in front of the ZFS pool, but this introduces a complex setup, and I personally think it would make way more sense to implement this caching logic in PBS or PVE itself.

If Proxmox wants to stay storage-agnostic for backups and hence doesn't want to rely on storage snapshots (which I think is the best solution anyway), and backup fleecing needs a lot of time to implement, I would vote for an SSD/NVMe write cache/buffer in either PBS or PVE.

Having it in PBS would mean less expense as the fast devices are needed only there instead of every node and it would simplify the implementation a lot.
This presumes a stable and fast network from the nodes to PBS.
