Backup slowing down

avn

Active Member
Mar 2, 2015
Hello,

I have a similar (or same) problem as in this topic: https://forum.proxmox.com/threads/backup-slowing-down-to-a-crawl.44870. I was asked to create a new thread, because the setup is different.

In my case the storage is an HPE MSA 2050 (attached via SAS). There are 4 LUNs holding the VM disks as LVM volumes, shared between multiple nodes. Backups (snapshot mode) run weekly, and at first performance is great, with 3-digit MB/s numbers. A pure read of sparse data can go up to 400 MB/s.

At some point the speed goes down, but not everywhere at once, only on particular LUN(s). The read speed of sparse data drops to 20 MB/s. It happens almost simultaneously on multiple nodes if backups are running on more than one node at once, but only for particular LUNs, and I think they are the most used LUNs (they hold more VMs than the others).

Somehow restarting a controller on the storage corrects the problem, and backups return to full speed. Because of multipath I can do that safely, but it has to be done manually every time, because after a few hours the read speed goes down again.

The problem is not likely in the storage, because only the read speed of vzdump is affected; everything else works fine. I can copy data from the same LUN at high speed when a backup has already slowed down. I'm not sure when this problem started, and I'd appreciate any suggestion for finding the cause.
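For anyone wanting to reproduce the comparison, a minimal read benchmark sketch follows. /dev/zero is a stand-in so the command is safe to copy; substitute the real LV path of a VM disk on a node:

```shell
# Read benchmark sketch. Replace DEV with the real LV path on a node,
# e.g. /dev/<vg>/vm-<id>-disk-<n>; /dev/zero is a harmless placeholder.
DEV="${DEV:-/dev/zero}"
# Read 256 MiB; dd prints the achieved throughput on stderr.
dd if="$DEV" of=/dev/null bs=4M count=64 2>&1 | tail -n 1
```

Running this against the same LV while a backup is slow gives a number to hold against the rate vzdump reports.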
 
In my case the storage is an HPE MSA 2050 (attached via SAS). There are 4 LUNs holding the VM disks as LVM volumes, shared between multiple nodes. Backups (snapshot mode) run weekly, and at first performance is great, with 3-digit MB/s numbers. A pure read of sparse data can go up to 400 MB/s.
How is the resource usage on the MSA2050 when the backup slows down?

Somehow restarting a controller on the storage corrects the problem, and backups return to full speed. Because of multipath I can do that safely, but it has to be done manually every time, because after a few hours the read speed goes down again.
Sounds like a congestion or firmware issue on the MSA.

The problem is not likely in the storage, because only the read speed of vzdump is affected; everything else works fine. I can copy data from the same LUN at high speed when a backup has already slowed down. I'm not sure when this problem started, and I'd appreciate any suggestion for finding the cause.
I assume the VMs/CTs have their data cached in memory on the node and just don't read as much as vzdump does. Do the different backup modes (e.g. stop mode) have an influence on the performance in those situations?
 
How is the resource usage on the MSA2050 when the backup slows down?
The overall transfer rate goes down as well, because the backup generates most of the data transfer. But I can manually start a copy of some huge file or block device to /dev/null inside a VM, and even when the backup has already slowed down, that copy runs at full speed, hundreds of MB/s.

Sounds like a congestion or firmware issue on the MSA.
Maybe, but we updated the MSA to the latest firmware, and it didn't correct the issue.

About congestion, I was thinking the same. It feels like vzdump at some point encounters some problem reading data at high speed, the speed drops because of congestion, but when the congestion is gone, vzdump doesn't return to full speed. Is there a way to turn on some debug output in vzdump?

I assume the VMs/CTs have their data cached in memory on the node and just don't read as much as vzdump does. Do the different backup modes (e.g. stop mode) have an influence on the performance in those situations?
It's difficult to test, because of the significant downtime needed to back up every VM, and many restarted VMs can cause problems in a production environment; even on the weekend we would face many complaints. Another problem is that not all VMs can shut down correctly in time by themselves. A mass stop/start is always difficult; many things may go wrong. Suspend mode can also cause problems: after suspend/resume some services may fail because of the time difference, so it's easier to reboot a VM than to manually correct every issue. So really we can do backups only in snapshot mode.
 
The overall transfer rate goes down as well, because the backup generates most of the data transfer. But I can manually start a copy of some huge file or block device to /dev/null inside a VM, and even when the backup has already slowed down, that copy runs at full speed, hundreds of MB/s.
Are those files cached? And where is the backup written to?

About congestion, I was thinking the same. It feels like vzdump at some point encounters some problem reading data at high speed, the speed drops because of congestion, but when the congestion is gone, vzdump doesn't return to full speed. Is there a way to turn on some debug output in vzdump?
No debug at this end, but the system itself may log messages in syslog/journal. Also, monitoring the performance during backup may shed some light.
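One low-effort way to pin down the moment of the collapse is to pull the read rate out of the task log afterwards. A sketch, assuming status lines that end in `read/write R/W MB/s` (the exact format varies between vzdump versions, and the sample lines below are made up purely to illustrate the parsing):

```shell
# Build a small sample log (invented lines in the assumed format); in
# practice you would point awk at the real vzdump task log instead.
cat > /tmp/vzdump-sample.log <<'EOF'
INFO: status: 1% (1073741824/107374182400), sparse 0% (0), duration 3, read/write 357/357 MB/s
INFO: status: 2% (2147483648/107374182400), sparse 0% (0), duration 6, read/write 358/300 MB/s
INFO: status: 3% (3221225472/107374182400), sparse 0% (0), duration 60, read/write 19/10 MB/s
EOF
# Print the read rate of each status line; a sudden drop marks the
# moment the slowdown started.
awk '/read\/write/ { split($(NF-1), r, "/"); print r[1] " MB/s read" }' /tmp/vzdump-sample.log
```

Correlating the timestamp of the drop with the node's journal may show what else happened at that moment.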

Here is an explanation of how vzdump works:
https://git.proxmox.com/?p=pve-qemu.git;a=blob_plain;f=backup.txt

It's difficult to test, because of the significant downtime needed to back up every VM, and many restarted VMs can cause problems in a production environment; even on the weekend we would face many complaints. Another problem is that not all VMs can shut down correctly in time by themselves. A mass stop/start is always difficult; many things may go wrong. Suspend mode can also cause problems: after suspend/resume some services may fail because of the time difference, so it's easier to reboot a VM than to manually correct every issue. So really we can do backups only in snapshot mode.
This sounds like there is more to it than a slow vzdump. Latency or IO/s issues? Are the nodes and the MSA capable enough, or are resources scarce? You could create a clone of a VM in question and test with it (e.g. modes, options, compression).
 
Alwin, thank you for your suggestions. I've checked old backup logs and found out that the problem started when we installed the MSA 2050 in place of our previous storage, an MSA 2040. We will try to revert to the MSA 2040 and see if the problem remains. It will take time, but I will post the results here.
 
So. I moved the whole cluster to the MSA 2040. It didn't help; the problem is still there.

About caching: I've tested read speed by running "dd if=/dev/vmsX/vm-XXX-disk-X of=/dev/null status=progress" from the physical nodes for many virtual machines. The results are always several times higher than the speed of a slowed backup, so there's no way caching is involved. I also ran the dd test on VMs that were about to be backed up; it didn't influence the backup speed at all.
 
I found another way to restore backup speed: it's enough to restart just one node, even a node without any VMs, even a node connected to the storage through a SAS switch (so the storage doesn't know anything about the restart). The speed increases on every node in the cluster exactly when PVE on the restarted node is booting up. Maybe some initialization task does the trick. I tried restarting the server without an OS and booting Ubuntu from an ISO; that doesn't help, it has to be PVE booting up. Of course, after some time the backup slows down again, so this doesn't solve the problem.
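If a PVE boot really is re-initializing something cluster-wide, one speculative place to look is the per-device block-layer settings on each node; they can be snapshotted while backups are fast and diffed against the slow state. The sysfs paths are standard Linux; whether these particular knobs are the culprit is purely an assumption:

```shell
#!/bin/sh
# Speculative diagnostic: dump block-device settings that might change
# across a reboot. Save the output while backups are fast, again once
# they are slow, and diff the two files.
echo "device queue_depth scheduler read_ahead_kb"
for d in /sys/block/sd*; do
  [ -e "$d" ] || continue           # skip if no sd* devices exist here
  printf '%s %s %s %s\n' "$(basename "$d")" \
    "$(cat "$d/device/queue_depth" 2>/dev/null)" \
    "$(cat "$d/queue/scheduler" 2>/dev/null)" \
    "$(cat "$d/queue/read_ahead_kb" 2>/dev/null)"
done
```

On a multipath setup the same check is worth running against the dm-* devices as well.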
 
