VMs freezing and unreachable when backup server is slow

I also think it would make sense to have good caching/optimization at the PVE/PBS level instead of leaving it up to the filesystem or throwing hardware (SSDs) at the problem.

I have had good results with ZFS + secondarycache=metadata and adding a cheap SSD for L2ARC.
Unfortunately, ZFS is quite buggy regarding metadata caching (eviction happens too early; see https://github.com/openzfs/zfs/issues/12028 and https://github.com/openzfs/zfs/issues/10508 for example).
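
For reference, a minimal sketch of that setup, assuming a pool named tank and a spare SSD at /dev/sdX (both names are placeholders):

Code:
# Keep only metadata in the L2ARC for this pool (child datasets inherit it)
zfs set secondarycache=metadata tank

# Add a cheap SSD as an L2ARC cache device; a cache vdev can be removed
# again later, and losing it does not endanger the pool
zpool add tank cache /dev/sdX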
 
For testing I added a special device (a mirror of two 256GB NVMe drives) to the pool on the PBS side and set the small-file limit (special_small_blocks) to 32k on the dataset.
It helped a bit: I got an overall backup speed of 100MiB/s.
I will leave it in place for now, until a better solution is available.

Warning: A special device is treated like a normal VDEV! If the special device fails, the pool is toast.
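
For anyone wanting to reproduce this, a rough sketch of the setup described above, assuming a pool named tank, a dataset tank/backup, and two NVMe drives (all names are placeholders):

Code:
# Add a mirrored special vdev for metadata and small blocks.
# Mirror it: as warned above, losing the special vdev loses the pool.
zpool add tank special mirror /dev/nvme0n1 /dev/nvme1n1

# Store blocks up to 32k on the special vdev for this dataset
zfs set special_small_blocks=32K tank/backup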
 
I also encountered substantial downtime during a scheduled backup with Proxmox Backup Server. o_O

Maybe we should write a script to duplicate the live VM to a new VM ID first, then run the backup against the cloned VM only (see the sketch below).
By doing this, the live VM would no longer suffer downtime even if the connection to the Proxmox Backup Server is slow.

Anyway, I hope Proxmox would consider adding this solution out of the box:
- Give us an option to clone the VM before the sync with Proxmox Backup Server, have Proxmox Backup Server sync with the cloned VM only, and remove the cloned VM automatically once the sync is completed.
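
A rough sketch of such a script, assuming VM ID 100, a free temporary ID 9100, and a PBS storage named pbs-datastore (all of these are placeholders; a full clone needs enough free storage, so test this before relying on it):

Code:
#!/bin/sh
# Clone the live VM to a temporary ID as a full copy, so the backup
# never touches the production disks
qm clone 100 9100 --full 1 --name vm100-backup-clone

# Back up only the clone to the PBS storage
vzdump 9100 --storage pbs-datastore --mode stop

# Remove the temporary clone afterwards
qm destroy 9100 --purge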



[Attached screenshots: duringBackup.png, iowait.png]
 
This seems like a show-stopper to me, not an annoyance. A freeze of more than a few seconds is rarely tolerable, and an effective total freeze until a hard reboot is a terrifying thing to face.

Network errors are a fact of life. In addition, if you use a remote third party PBS service (e.g. Tuxis) you have no control over the load, bandwidth availability and so forth on the PBS server itself.

From what I read in the two bug reports, it seems like this is all down to some kind of limitation in qemu.

Even so, can't there be a user-configurable disk buffer (on the node) as well as the memory buffer? That would help smooth the bumps if there are very short-term problems - more so than just buffering in memory.

Another question that comes up repeatedly in the bug reports is "how slow is too slow?" in terms of the write speed that should trigger a backup abort, to prevent the VM freezing (or worse). Again, can't that be user-configurable? If I know that I have a really slow DSL connection, I can set the threshold to a low number. If I have a 1 Gbit Ethernet connection, I can set it to a high write speed. That way, if things get slower than they should during normal operation, the backup would abort, leaving the VM running correctly instead of freezing.

If none of the above is possible, what about a short timeout (user-configurable again)? For our VMs, a freeze of anything more than about 10 seconds would be intolerable. I'd happily see a backup aborted rather than a VM frozen for longer than that.
In addition, consider a customer who buys three servers to build a Proxmox HA cluster. How can you tell him that, after building a redundant solution, a slow backup or a broken network can halt all the VMs in his cluster?
 
In general, it helps to reduce IO pressure for the VM by using IO-Threads on the disks (and virtio-scsi-single as IO controller), as that gives the QEMU main thread more time to process some VM events/work.
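
For illustration, a sketch of that configuration on the CLI, assuming VM ID 100 and a disk volume named local-zfs:vm-100-disk-0 (both placeholders):

Code:
# Use the virtio-scsi-single controller so each disk gets its own I/O thread
qm set 100 --scsihw virtio-scsi-single

# Enable the I/O thread on the disk, keeping heavy disk I/O off the
# QEMU main thread
qm set 100 --scsi0 local-zfs:vm-100-disk-0,iothread=1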
 
Hello,

First of all: PBS is a great product. But I must agree that the current issue of stalling all write operations until they have been backed up by the backup server is a big concern. We have already faced downtime because of it.

We have switched to a PBS with NVMe disks as the backup target. But it is a general issue that the speed of the backup server practically limits the write speed of any VM. What if our 10 Gbit/s Ethernet connection is not sufficient? Not to mention the performance of the PBS with only a single backup running (it is way below 10 Gbit/s ...).

"Everybody" (at least some big competitors ...) are working on snapshots decoupling the write speed within the vm from the speed the backup is taken. I would really appreciate proxmox taking the same approach.

Is there anything planned by the proxmox team?
 
Hi,
There is nothing concrete at the moment, but ideas do exist: https://bugzilla.proxmox.com/show_bug.cgi?id=4136
 
I'm being hit by that problem very hard. IMHO the real solution is caching backup data, in RAM or on local disk; if you run out of the designated space because of too many writes, either retry the backup later or mark it as failed.
We have, for example, Zabbix instances that freeze so badly that they spill out thousands of alarms.
Yes, the backup is slow, a lot too, but installing NVMe SSDs would just be an expensive workaround, not a real solution: a backup is slow whenever it cannot keep up with the schedule period.
 
Running a remote PBS for several clusters, I'm getting the impression that these problems became quite rare after improving disk I/O / ZFS performance on the PBS. Network speed seems less relevant.
 
I am facing the same, but it can be improved by limiting the bandwidth (at least in my case), which makes the backups even faster than without a limit … don't ask me why :/
 
It's probably because the I/O is overwhelming the disk subsystem on the PBS; by limiting the bandwidth, it matches what the disks can handle.
 
I'm assuming you mean vzdump --bwlimit ?
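
In case it helps others, a sketch of both ways to set such a limit, assuming VM ID 100, a storage named pbs-datastore, and a limit of roughly 100 MiB/s (all placeholders; vzdump's bwlimit is given in KiB/s):

Code:
# One-off backup with a bandwidth limit of ~100 MiB/s (102400 KiB/s)
vzdump 100 --storage pbs-datastore --bwlimit 102400

# Or as a node-wide default in /etc/vzdump.conf:
# bwlimit: 102400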
 
