Backup slowing down

avn

Active Member
Mar 2, 2015
Hello,

I have a similar (or same) problem as in this topic: https://forum.proxmox.com/threads/backup-slowing-down-to-a-crawl.44870. I was asked to create a new thread, because the setup is different.

In my case the storage is an HPE MSA 2050 (attached via SAS). There are 4 LUNs holding the VM disks as LVM volumes, shared between multiple nodes. Backups (snapshot mode) run weekly, and at first performance is great, with 3-digit MB/s numbers. A pure read of sparse data can go up to 400 MB/s.

At some point the speed goes down, but not everywhere at once, only on particular LUN(s). The read speed of sparse data drops to 20 MB/s. It happens almost simultaneously on multiple nodes if backups are running on more than one node at once, but only for particular LUNs, and I think they are the most used LUNs (they hold more VMs than the others).

Somehow restarting a controller on the storage corrects the problem, and backups return to full speed. Because of multipath I can do that safely, but it has to be done manually every time, because after a few hours the read speed goes down again.

The problem is not likely in the storage, because only the read speed of vzdump is affected; everything else works fine. I can copy data from the same LUN at high speed when a backup has already slowed down. I'm not sure when this problem started, and I'd appreciate any suggestion for finding the cause.
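For anyone wanting to reproduce the comparison, a minimal read benchmark sketch follows. /dev/zero is a stand-in so the command is safe to copy; substitute the real LV path of a VM disk on a node:

```shell
# Read benchmark sketch. Replace DEV with the real LV path on a node,
# e.g. /dev/<vg>/vm-<id>-disk-<n>; /dev/zero is a harmless placeholder.
DEV="${DEV:-/dev/zero}"
# Read 256 MiB; dd prints the achieved throughput on stderr.
dd if="$DEV" of=/dev/null bs=4M count=64 2>&1 | tail -n 1
```

Running this against the same LV while a backup is slow gives a number to hold against the rate vzdump reports.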
 
In my case the storage is an HPE MSA 2050 (attached via SAS). There are 4 LUNs holding the VM disks as LVM volumes, shared between multiple nodes. Backups (snapshot mode) run weekly, and at first performance is great, with 3-digit MB/s numbers. A pure read of sparse data can go up to 400 MB/s.
How is the resource usage on the MSA2050 when the backup slows down?

Somehow restarting a controller on the storage corrects the problem, and backups return to full speed. Because of multipath I can do that safely, but it has to be done manually every time, because after a few hours the read speed goes down again.
Sounds like a congestion or firmware issue on the MSA.

The problem is not likely in the storage, because only the read speed of vzdump is affected; everything else works fine. I can copy data from the same LUN at high speed when a backup has already slowed down. I'm not sure when this problem started, and I'd appreciate any suggestion for finding the cause.
I assume the VMs/CTs have their data cached in memory on the node and just don't read as much as vzdump does. Do the different backup modes (e.g. stop mode) have an influence on the performance in those situations?
 
How is the resource usage on the MSA2050 when the backup slows down?
The overall transfer rate goes down as well, because the backup generates most of the data transfer. But I can manually start a copy of some huge file or block device to /dev/null inside a VM, and even when the backup has already slowed down, that copy runs at full speed, hundreds of MB/s.

Sounds like a congestion or firmware issue on the MSA.
Maybe, but we updated the MSA to the latest firmware, and it didn't correct the issue.

About congestion, I was thinking the same. It feels like vzdump at some point encounters some problem reading data at high speed, the speed drops because of congestion, but when the congestion is gone, vzdump doesn't return to full speed. Is there a way to turn on some debug output in vzdump?

I assume the VMs/CTs have their data cached in memory on the node and just don't read as much as vzdump does. Do the different backup modes (e.g. stop mode) have an influence on the performance in those situations?
It's difficult to test, because of the significant downtime needed to back up every VM, and many restarted VMs can cause problems in a production environment; even on the weekend we would face many complaints. Another problem is that not all VMs can shut down correctly in time by themselves. A mass stop/start is always difficult; many things may go wrong. Suspend mode can also cause problems: after suspend/resume some services may fail because of the time difference, so it's easier to reboot a VM than to manually correct every issue. So really we can do backups only in snapshot mode.
 
The overall transfer rate goes down as well, because the backup generates most of the data transfer. But I can manually start a copy of some huge file or block device to /dev/null inside a VM, and even when the backup has already slowed down, that copy runs at full speed, hundreds of MB/s.
Are those files cached? And where is the backup written to?

About congestion, I was thinking the same. It feels like vzdump at some point encounters some problem reading data at high speed, the speed drops because of congestion, but when the congestion is gone, vzdump doesn't return to full speed. Is there a way to turn on some debug output in vzdump?
No debug at this end, but the system itself may log messages in syslog/journal. Also, monitoring the performance during backup may shed some light.
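One low-effort way to pin down the moment of the collapse is to pull the read rate out of the task log afterwards. A sketch, assuming status lines that end in `read/write R/W MB/s` (the exact format varies between vzdump versions, and the sample lines below are made up purely to illustrate the parsing):

```shell
# Build a small sample log (invented lines in the assumed format); in
# practice you would point awk at the real vzdump task log instead.
cat > /tmp/vzdump-sample.log <<'EOF'
INFO: status: 1% (1073741824/107374182400), sparse 0% (0), duration 3, read/write 357/357 MB/s
INFO: status: 2% (2147483648/107374182400), sparse 0% (0), duration 6, read/write 358/300 MB/s
INFO: status: 3% (3221225472/107374182400), sparse 0% (0), duration 60, read/write 19/10 MB/s
EOF
# Print the read rate of each status line; a sudden drop marks the
# moment the slowdown started.
awk '/read\/write/ { split($(NF-1), r, "/"); print r[1] " MB/s read" }' /tmp/vzdump-sample.log
```

Correlating the timestamp of the drop with the node's journal may show what else happened at that moment.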

Here is an explanation of how vzdump works:
https://git.proxmox.com/?p=pve-qemu.git;a=blob_plain;f=backup.txt

It's difficult to test, because of the significant downtime needed to back up every VM, and many restarted VMs can cause problems in a production environment; even on the weekend we would face many complaints. Another problem is that not all VMs can shut down correctly in time by themselves. A mass stop/start is always difficult; many things may go wrong. Suspend mode can also cause problems: after suspend/resume some services may fail because of the time difference, so it's easier to reboot a VM than to manually correct every issue. So really we can do backups only in snapshot mode.
This sounds like there is more to it than a slow vzdump. Latency or IO/s issues? Are the nodes and the MSA capable enough, or are resources scarce? You could create a clone of a VM in question and test with it (e.g. modes, options, compression).
 
Alwin, thank you for your suggestions. I've checked old backup logs and found out that the problem started when we installed the MSA 2050 in place of our previous storage, an MSA 2040. We will try to revert to the MSA 2040 and see if the problem remains. It will take time, but I will post the results here.
 
So. I moved the whole cluster to the MSA 2040. It didn't help; the problem is still there.

About caching: I've tested read speed by running "dd if=/dev/vmsX/vm-XXX-disk-X of=/dev/null status=progress" from the physical nodes for many virtual machines. The results are always several times higher than the speed of a slowed backup, so there's no way caching is involved. I also ran the dd test on VMs that were about to be backed up; it didn't influence the backup speed at all.
 
I found another way to restore backup speed: it's enough to restart just one node, even a node without any VMs, even a node connected to the storage through a SAS switch (so the storage doesn't know anything about the restart). The speed increases on every node in the cluster exactly when PVE on the restarted node is booting up. Maybe some initialization task does the trick. I tried restarting the server without an OS and booting Ubuntu from an ISO; that doesn't help, it has to be PVE booting up. Of course, after some time the backup slows down again, so this doesn't solve the problem.
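If a PVE boot really is re-initializing something cluster-wide, one speculative place to look is the per-device block-layer settings on each node; they can be snapshotted while backups are fast and diffed against the slow state. The sysfs paths are standard Linux; whether these particular knobs are the culprit is purely an assumption:

```shell
#!/bin/sh
# Speculative diagnostic: dump block-device settings that might change
# across a reboot. Save the output while backups are fast, again once
# they are slow, and diff the two files.
echo "device queue_depth scheduler read_ahead_kb"
for d in /sys/block/sd*; do
  [ -e "$d" ] || continue           # skip if no sd* devices exist here
  printf '%s %s %s %s\n' "$(basename "$d")" \
    "$(cat "$d/device/queue_depth" 2>/dev/null)" \
    "$(cat "$d/queue/scheduler" 2>/dev/null)" \
    "$(cat "$d/queue/read_ahead_kb" 2>/dev/null)"
done
```

On a multipath setup the same check is worth running against the dm-* devices as well.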
 
