Hello, I've noticed that while a guest is being backed up it becomes intermittently hung: it drops packets, shows network lag of over 10 seconds, and occasionally returns host unreachable errors for the duration of the backup job. The source storage is Ceph and the destination storage is ZFS over NFS. I'm using snapshot as the backup mode. I'm a little confused about why the guest would become intermittently unresponsive if vzdump is:
* Running fsfreeze
* Creating the snapshot
* Running fsthaw
* Then backing up the snapshot image
The problem starts after the fsthaw, while the image is being transferred. I've done some testing to try to figure out exactly when this happens. No other VMs are impacted, so it isn't saturation of the network I'm pinging over, and it doesn't seem to be an underlying storage issue.
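To pin down exactly when the hangs occur, the ping output can be summarized rather than watched by eye. The following is a minimal sketch, assuming the output comes from the Linux iputils `ping` and has been saved to a log file; the function name `summarize_ping` and the 1000 ms threshold are my own choices for illustration.

```python
import re
import sys

# Matches the sequence number and round-trip time in iputils ping replies, e.g.
# "64 bytes from 10.0.0.5: icmp_seq=3 ttl=64 time=0.412 ms"
REPLY = re.compile(r"icmp_seq=(\d+).*?time=([\d.]+) ms")


def summarize_ping(lines, spike_ms=1000.0):
    """Return (spikes, lost).

    spikes: list of (seq, rtt_ms) for replies at or above spike_ms.
    lost:   sequence numbers between the first and last reply with no reply.
    """
    replies = {}
    for line in lines:
        m = REPLY.search(line)
        if m:
            replies[int(m.group(1))] = float(m.group(2))
    if not replies:
        return [], []
    spikes = [(seq, rtt) for seq, rtt in sorted(replies.items()) if rtt >= spike_ms]
    lost = [s for s in range(min(replies), max(replies) + 1) if s not in replies]
    return spikes, lost


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python3 summarize_ping.py ping.log
    with open(sys.argv[1]) as fh:
        spikes, lost = summarize_ping(fh)
    print(f"{len(spikes)} spikes, {len(lost)} lost replies")
    for seq, rtt in spikes:
        print(f"  icmp_seq={seq} time={rtt} ms")
```

Comparing the sequence numbers of the spikes against the timestamps in the vzdump task log should show whether they line up with the transfer phase.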
Thinking it might be some kind of saturation, I limited vzdump's bandwidth to a quarter of the maximum speed I saw in previous backups, e.g. from 400mbps to 100mbps. All this seemed to accomplish was to lengthen the windows of intermittent hanging; it did nothing to mitigate the hanging itself.
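For anyone reproducing the limit: vzdump's `--bwlimit` option is given in KiB/s, so a rate in megabits or megabytes per second has to be converted first. A small sketch of that arithmetic follows; the helper names are my own, and VMID 100 is only a placeholder.

```python
def mbit_to_kib_per_s(mbit_per_s):
    """Convert megabits per second (10^6 bits) to whole KiB/s (1024 bytes)."""
    return int(mbit_per_s * 1_000_000 / 8 / 1024)


def mbyte_to_kib_per_s(mbyte_per_s):
    """Convert megabytes per second (10^6 bytes) to whole KiB/s (1024 bytes)."""
    return int(mbyte_per_s * 1_000_000 / 1024)


if __name__ == "__main__":
    # If "100mbps" means 100 megabits per second:
    print(f"vzdump 100 --mode snapshot --bwlimit {mbit_to_kib_per_s(100)}")
    # If it means 100 megabytes per second instead:
    print(f"vzdump 100 --mode snapshot --bwlimit {mbyte_to_kib_per_s(100)}")
```

The two readings differ by a factor of eight, so it is worth double-checking which unit the observed speed was reported in.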
I've tried changing the destination storage from ZFS over NFS to CephFS, which improves the situation. From this I believe some kind of deadlock is occurring during an IO process. Again, if what is being copied is the snapshot, I don't understand why the VM would occasionally become entirely unresponsive, including its console.
I've tried taking a snapshot without including RAM, which produces a single ping increase that is hardly noticeable; however, when I include the RAM the problem becomes apparent. From this test I believe that the backups may be including the RAM, and I cannot find an option to turn this off.
Given these tests, my conclusion, which may or may not be accurate, is that perhaps the backup is performing a RAM dump, and if the destination is slow the VM is being frozen in the hypervisor through some form of deadlock. I'm trying to find out whether my assumption about the RAM is accurate. If it is, is there any option I can set to exclude RAM from a backup? I'm just looking to back up the VM and its configuration, not its state, and store that on a different system.
A workaround I'm considering: the CephFS volume doesn't seem to have nearly the same amount of intermittent hanging. Packet loss seems to be none; however, latency increases from around 0.4ms to 4000ms during spikes. I've considered a cron job to copy cephfs/dump to the NFS mount point daily.
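The cron idea could look roughly like the sketch below. It is only an illustration: both mount paths are hypothetical, the script name is my own, and skipping `.dat`/`.tmp` names is an assumption about how in-progress backups are named on disk.

```python
import shutil
from pathlib import Path

# Hypothetical mount points; adjust to the real storage paths.
SRC = "/mnt/pve/cephfs/dump"
DST = "/mnt/pve/nfs-backup/dump"


def sync_dumps(src_dir, dst_dir):
    """Copy finished vzdump files that are missing (or differ in size) at the
    destination. Returns the list of file names that were copied."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for f in sorted(src.iterdir()):
        if not f.is_file() or not f.name.startswith("vzdump-"):
            continue
        # Assumption: names ending in .dat or .tmp belong to a running backup.
        if f.suffix in (".dat", ".tmp"):
            continue
        target = dst / f.name
        if target.exists() and target.stat().st_size == f.stat().st_size:
            continue  # already copied
        # Copy to a temporary name first so a partial file never looks complete.
        partial = target.with_name(target.name + ".partial")
        shutil.copy2(f, partial)
        partial.replace(target)
        copied.append(f.name)
    return copied


if __name__ == "__main__":
    for name in sync_dumps(SRC, DST):
        print(f"copied {name}")
```

Saved as, say, `/usr/local/bin/sync_dumps.py`, it could be scheduled with a crontab entry such as `0 6 * * * /usr/bin/python3 /usr/local/bin/sync_dumps.py`, timed to run after the nightly backup window has ended.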
Any advice would be most appreciated. Thank you!