Guests Hang During Backup

dosmage

Active Member
Nov 30, 2016
Hello, I've noticed that during a backup of a guest, the guest intermittently hangs: it drops packets, shows network lag of over 10 seconds, and occasionally even returns host unreachable errors for the duration of the backup job. The source storage is Ceph and the destination storage is ZFS over NFS. I've used snapshot as the backup method. I'm a little confused why the guest would become intermittently unresponsive if vzdump is:
* Using the fstrim
* Creating the snapshot
* Running fsthaw
* Then backing up the snapshot image

The problem starts occurring after the fsthaw, when it's transferring the image. I've performed some testing to try to figure out exactly when this happens. No other VMs are impacted so it isn't saturation of the network I'm using to ping and it doesn't seem like an underlying storage issue.

Thinking it may be some kind of saturation, I limited the bandwidth of vzdump to a quarter of the maximum speed I saw in previous backups, i.e. from 400mbps to 100mbps. All this seemed to accomplish was to lengthen the window of intermittent hangs; it did nothing to mitigate the hanging itself.
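For reference, the bandwidth test above can be sketched as a small Python helper. This is only an illustration under assumptions: it assumes "mbps" means megabits per second, that the limit is passed through vzdump's `--bwlimit` option (which takes KiB/s), and the VM ID, storage name, and disk size used below are hypothetical. It also shows why a quarter of the bandwidth simply stretches the transfer (and the hang window) by a factor of four.

```python
from typing import List


def mbps_to_kib_per_s(mbps: float) -> int:
    """Convert megabits per second to whole KiB per second."""
    bytes_per_s = mbps * 1_000_000 / 8
    return int(bytes_per_s // 1024)


def vzdump_command(vmid: int, storage: str, limit_mbps: float) -> List[str]:
    """Build (but do not run) a vzdump command line with a bandwidth limit.

    Assumption: --bwlimit expects KiB/s, so the Mbps value is converted first.
    """
    return [
        "vzdump", str(vmid),
        "--mode", "snapshot",
        "--storage", storage,
        "--bwlimit", str(mbps_to_kib_per_s(limit_mbps)),
    ]


def transfer_seconds(size_gib: float, mbps: float) -> float:
    """Ideal transfer time for size_gib of data at a sustained rate of mbps."""
    size_bytes = size_gib * 1024 ** 3
    return size_bytes / (mbps * 1_000_000 / 8)


if __name__ == "__main__":
    # Hypothetical VM ID, storage name, and disk size, for illustration only.
    print(" ".join(vzdump_command(100, "nfs-backup", 100)))
    for rate in (400, 100):
        print(rate, "Mbps ->", round(transfer_seconds(100, rate)), "s for 100 GiB")
```

The helper only builds the argument list; running it against a real host would be a separate step (for example via `subprocess.run`).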

I've tried changing the destination storage from ZFS over NFS to CephFS, which improves the situation. From this I believe there is some kind of deadlock occurring during an IO process. Again, if the copy is taken from the snapshot, I don't understand why the VM would occasionally become entirely unresponsive, including its console.

I've tried taking a snapshot without including RAM, which causes a single, hardly noticeable ping increase; however, when I include the RAM the problem becomes apparent. From this test I believe that the backups may be including the RAM, and I cannot find an option to turn this off.

Given these tests, my conclusion, which may or may not be accurate, is that the backup is perhaps performing a RAM dump, and if the destination is slow the VM is frozen in the hypervisor through some form of deadlock. I'm trying to find out whether my assumption about the RAM is accurate. If it is, is there any option I can set to exclude RAM from a backup? I'm just looking to back up the VM and its configuration, not its state, and store that on a different system.

A workaround I'm considering relies on the fact that the CephFS volume doesn't seem to have nearly the same amount of intermittent hanging: there seems to be no packet loss, although latency increases from around .4ms to 4000ms during spikes. I considered a cron job might be in order to copy cephfs/dump to the NFS mount point daily.

Any advice would be most appreciated. Thank you!
 
I'm a little confused why the guest would become intermittently unresponsive if vzdump is:
* Using the fstrim
* Creating the snapshot
* Running fsthaw
* Then backing up the snapshot image

The problem starts occurring after the fsthaw, when it's transferring the image. I've performed some testing to try to figure out exactly when this happens. No other VMs are impacted so it isn't saturation of the network I'm using to ping and it doesn't seem like an underlying storage issue.

I think you mean "fs-freeze" not fstrim, just for clarification :)

The thing is, when backing up a live VM one first tries to get the VM into a consistent state. fs-freeze/thaw help a lot here, as they make the VM aware of the need to flush important data to disk to be consistent. But after that the VM starts to write again, and one does not want to keep the whole previously made "snapshot state" read-only while writing it off to the backup, as that could mean risking double the space usage during the backup if the VM's disks were rewritten completely.
Often that extra space may either not be available or would put a bit of a strain on the system. (Note this is all off the top of my head, so do not pin me down on details please :) ) Thus we normally write the blocks which the guest wants to write again off to the backup with higher priority, so we can stop tracking them and just let the guest write to those blocks directly instead of to another place. That can induce some latency sometimes, but it depends a lot on the guest's write pattern.
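The behaviour described above (copying a block to the backup before the guest is allowed to overwrite it) can be illustrated with a toy model in Python. This is only a sketch under assumptions, not the actual QEMU or Proxmox implementation: the class `ToyBackup`, its methods, and the per-block cost `target_write_cost` are all hypothetical. It shows why a slow backup target adds latency to guest writes that hit blocks not yet backed up.

```python
class ToyBackup:
    """Toy copy-before-write backup over a list of disk blocks (hypothetical model)."""

    def __init__(self, disk, target_write_cost):
        self.disk = disk                      # live disk, one entry per block
        self.backup = {}                      # block index -> content at backup start
        self.target_write_cost = target_write_cost  # simulated seconds per block on the target

    def _copy_block(self, idx):
        """Write one block to the backup target and return the simulated cost."""
        self.backup[idx] = self.disk[idx]
        return self.target_write_cost

    def guest_write(self, idx, data):
        """Apply a guest write and return the extra latency the guest observes.

        If the block is not in the backup yet, it is copied first, so the
        guest has to wait for the (possibly slow) backup target.
        """
        delay = 0.0
        if idx not in self.backup:
            delay = self._copy_block(idx)
        self.disk[idx] = data
        return delay

    def background_step(self):
        """Copy the next not-yet-saved block; return False once all are saved."""
        for idx in range(len(self.disk)):
            if idx not in self.backup:
                self._copy_block(idx)
                return True
        return False


if __name__ == "__main__":
    job = ToyBackup(["a", "b", "c"], target_write_cost=0.5)
    job.background_step()                 # block 0 is already saved
    print(job.guest_write(0, "x"))        # no wait: block 0 was saved earlier
    print(job.guest_write(2, "z"))        # waits for the target: block 2 was not
    while job.background_step():
        pass
    print(job.backup, job.disk)
```

In this model the backup always ends up holding the block contents from the start of the job, while the delay seen by the guest scales with how slow the target is.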

I've tried taking a snapshot without including RAM, which causes a single, hardly noticeable ping increase; however, when I include the RAM the problem becomes apparent. From this test I believe that the backups may be including the RAM, and I cannot find an option to turn this off.

No, backups do not include the whole RAM state. But snapshots with RAM also need to do some memory tracking to get a consistent memory state, so there can be similar underlying mechanics at play, and thus you get those seemingly correlated results.

I considered a cron job might be in order to copy cephfs/dump to the NFS mount point daily.

That dump won't be consistent: writes from the page cache that are not yet synced, and the like, may be missing from the dump. That may throw off some applications which aren't cleanly programmed to ensure their important data is fully synced before continuing.

If you do anything like this, test your restore process well, as it could just be that you trade consistency for speed.
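As one small piece of such a check, a copy job could at least verify that each copied file is bit-identical to its source. The Python sketch below is an assumption-laden illustration: the function names are hypothetical, any paths passed in are up to the user, and a matching checksum says nothing about consistency inside the guest, so an actual restore test is still needed.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(src: Path, dst_dir: Path) -> Path:
    """Copy src into dst_dir and check that the copy matches the source.

    Raises OSError (and removes the bad copy) if the checksums differ.
    """
    dst_dir.mkdir(parents=True, exist_ok=True)
    dst = dst_dir / src.name
    shutil.copy2(src, dst)
    if sha256_of(src) != sha256_of(dst):
        dst.unlink()
        raise OSError(f"checksum mismatch after copying {src} to {dst}")
    return dst
```

A daily job could call `copy_and_verify` for each file in the dump directory and report failures, but it only proves that the copy matches the source file.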
 
