Slow backups

BloodyIron

Renowned Member
Jan 14, 2013
229
13
83
it.lanified.com
So I've been trying to speed up my backups, and currently they're incredibly slow.

Code:
INFO: starting new backup job: vzdump 100 400 401 402 403 404 501 502 --mailnotification failure --quiet 1 --mode snapshot --compress gzip --storage Backups --node starbug2
INFO: Starting Backup of VM 100 (qemu)
INFO: status = stopped
INFO: update VM 100: -lock backup
INFO: backup mode: stop
INFO: ionice priority: 7
INFO: creating archive '/mnt/pve/Backups/dump/vzdump-qemu-100-2015_12_08-01_00_01.vma.gz'
INFO: starting kvm to execute backup task
INFO: started backup task '1f428d65-7c46-444f-9897-e437315b98a1'
INFO: status: 0% (44761088/53687091200), sparse 0% (39653376), duration 3, 14/1 MB/s
INFO: status: 1% (543752192/53687091200), sparse 0% (165613568), duration 68, 7/5 MB/s
INFO: status: 2% (1076232192/53687091200), sparse 0% (180756480), duration 181, 4/4 MB/s
INFO: status: 3% (1615462400/53687091200), sparse 0% (187392000), duration 261, 6/6 MB/s
INFO: status: 4% (2147614720/53687091200), sparse 0% (188665856), duration 336, 7/7 MB/s
INFO: status: 5% (2691170304/53687091200), sparse 0% (361914368), duration 426, 6/4 MB/s
INFO: status: 6% (3226075136/53687091200), sparse 0% (371212288), duration 500, 7/7 MB/s
INFO: status: 7% (3762421760/53687091200), sparse 0% (388591616), duration 572, 7/7 MB/s
INFO: status: 8% (4295229440/53687091200), sparse 0% (411897856), duration 646, 7/6 MB/s
INFO: status: 9% (4834983936/53687091200), sparse 1% (582352896), duration 731, 6/4 MB/s
INFO: status: 10% (5370937344/53687091200), sparse 1% (629256192), duration 801, 7/6 MB/s
INFO: status: 11% (5912920064/53687091200), sparse 1% (689819648), duration 870, 7/6 MB/s
INFO: status: 12% (6447104000/53687091200), sparse 1% (745242624), duration 939, 7/6 MB/s
INFO: status: 13% (6982336512/53687091200), sparse 1% (864219136), duration 1014, 7/5 MB/s
INFO: status: 14% (7516454912/53687091200), sparse 1% (880877568), duration 1073, 9/8 MB/s
INFO: status: 15% (8058306560/53687091200), sparse 1% (968491008), duration 1140, 8/6 MB/s
INFO: status: 16% (8590852096/53687091200), sparse 1% (1042108416), duration 1188, 11/9 MB/s
INFO: status: 17% (9134931968/53687091200), sparse 2% (1157025792), duration 1248, 9/7 MB/s
INFO: status: 18% (9670885376/53687091200), sparse 2% (1191522304), duration 1327, 6/6 MB/s
INFO: status: 19% (10205921280/53687091200), sparse 2% (1228562432), duration 1379, 10/9 MB/s
INFO: status: 20% (10743054336/53687091200), sparse 2% (1280278528), duration 1434, 9/8 MB/s
INFO: status: 21% (11280449536/53687091200), sparse 2% (1513259008), duration 1485, 10/5 MB/s
INFO: status: 22% (11818303488/53687091200), sparse 3% (2051112960), duration 1528, 12/0 MB/s
INFO: status: 23% (12349865984/53687091200), sparse 4% (2582675456), duration 1581, 10/0 MB/s
INFO: status: 24% (12896174080/53687091200), sparse 5% (3123466240), duration 1633, 10/0 MB/s
INFO: status: 25% (13422624768/53687091200), sparse 6% (3262914560), duration 1686, 9/7 MB/s
INFO: status: 26% (13960871936/53687091200), sparse 7% (3785064448), duration 1724, 14/0 MB/s
INFO: status: 27% (14507769856/53687091200), sparse 8% (4331962368), duration 1761, 14/0 MB/s
INFO: status: 28% (15037693952/53687091200), sparse 9% (4857724928), duration 1797, 14/0 MB/s
INFO: status: 29% (15570567168/53687091200), sparse 9% (5069529088), duration 1847, 10/6 MB/s
INFO: status: 30% (16121004032/53687091200), sparse 10% (5617881088), duration 1903, 9/0 MB/s
INFO: status: 31% (16653877248/53687091200), sparse 11% (6150754304), duration 1939, 14/0 MB/s
INFO: status: 32% (17183408128/53687091200), sparse 12% (6677839872), duration 1975, 14/0 MB/s
INFO: status: 33% (17718902784/53687091200), sparse 12% (6800859136), duration 2026, 10/8 MB/s


As you can see, the peak is about 14 MB/s.

Now, I think the fastest I should reasonably expect from my infrastructure is about 65 MB/s, but I'm nowhere near that. My network is fully gigabit, except right now there's no LAGG or anything like that, so I'm expecting about half of the potential gigabit speed, even though it's full-duplex the whole way (which should mean ~120 MB/s, but whatever).

I have two nodes in my cluster, and I've tried a few different things.

1) I turned jumbo frames on, on my switch, but I have not modified MTUs on my nodes at all.

2) On one of my nodes (starbug1) I declared a tmp directory in /etc/vzdump.conf (sketched below); however, it also peaks at about 15 MB/s.
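
For reference, the relevant line in /etc/vzdump.conf on starbug1 looks roughly like this (the path is just the one I happened to pick):

Code:
# /etc/vzdump.conf - stage temporary files on the node's local disk
# instead of the NFS backup target
tmpdir: /tmp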

I'm using GZIP on both for backups, but I saw the same performance when I was using LZO.

Take note, I have not restarted either of these nodes since making these changes (jumbo frames, tmp dir)

Both nodes are backing up to an NFS export on a dedicated NAS that is not overloaded. Not only that, I stagger the backups so that each node starts about 4 hours apart. Backups on one node are taking 8 hours, 5 hours on the other.

What am I doing wrong here?
 
Hi BloodyIron
Some questions:
Are you using a dedicated network infrastructure for backups?
When the backup is running, how is the load on the server? Is it waiting for I/O, or busy using the available CPU time for compression?

In my test lab here I back up two VMs daily over NFS, using vzdump live backup for KVM machines; raw throughput oscillates between 50-60 MB/s for the first one, running on SSD, and 10-20 MB/s for the second one, running on a busy SATA drive.
 
It's a home lab, so I don't have dedicated sub-networks for things like live migration or backups. Each node has one NIC.

I schedule the backups to occur when the VMs are generally not in use, so the backups should be the only really demanding task on the hypervisors or the storage system.

The system hosting the VM images is the same system I'm backing up to, so I understand my speed should be slower than normal, but it is a ZFS pool with plenty of RAM for ARC (11GB used of 16GB IIRC, with a hit rate of ~80-90% on average). As such, the storage system should not be the slow part of this.

Also, when digging through posts and documentation, I read that setting a vzdump temp dir would cause the hypervisor to compress the VM backup in that folder first, then send the big file over the network. However, it does not appear to be doing that. Any tips on that specific part?

Thanks for getting back to me :)


 
If you have plenty of CPU time, you should have a look at the pigz parameter of vzdump; it allows you to compress the VM in parallel.
Having VM storage + backup on the same network share, using the same network link, is not something we recommend.

Here the ZFS ARC might help you read the first 11GB of your VM disk image faster, but after that you need to hit the hard drives, transfer the data over the network to the Proxmox node, and then back again to the NAS server.
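
Assuming there is spare CPU, enabling it would look something like this (it only takes effect together with gzip compression; a value of 1 means use half the available cores, a larger number sets the thread count):

Code:
# install the parallel gzip implementation on the node
apt-get install pigz

# then in /etc/vzdump.conf
pigz: 1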
 
Hi Manu,

Okay so some questions, and responses.

1) I understand my setup is not ideal, but it's the lab I have, not necessarily the lab I want :p I just don't have those resources at this time.

2) Despite the apparent bottlenecks, I still anticipate performance should be substantially faster. My understanding is that if I declare a tmp dir in vzdump.conf, it should read the VM and compress it to the tmp dir on the hypervisor, THEN send it over the network link to the backup storage. I am not seeing this happen, so I think it's ignoring my tmp declaration in vzdump.conf (a one-off test command is sketched below this list). If this kind of process is achievable, I think it reasonable to expect substantially greater performance, since reading the VM and copying the compressed backup would not be happening in parallel (so far as I understand it).

3) I would love to leverage pigz; however, I'm hesitant, considering that my tmp dir declaration is already being ignored. What is the recommended method of implementing pigz in my Proxmox backup process?
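
For what it's worth, when I test this I plan to just run a one-off vzdump by hand, something like the following (VM 100 is one of mine, the path is just what I'd try), and watch whether anything actually lands in the tmp dir:

Code:
vzdump 100 --tmpdir /var/tmp --mode snapshot --compress gzip --storage Backups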

Thanks for your input so far! :)


 
I suspect that backing up to the same system where your images are stored is why it's so slow.

It's demonstrated really well in your log files: when you write little data to the backup file your reads are high, and when you write a lot to the backup file your reads are low:
Faster reads, lower writes:
14/1 MB/s
12/0 MB/s
14/0 MB/s

Slower reads, higher writes:
6/4 MB/s
7/7 MB/s
7/5 MB/s

That could also be a CPU-bound problem, such as gzip using 100% CPU.

Not sure why the tmp dir setting is not working for you, but as a test I think it would be helpful to simply back up to some local storage and see if that is faster.

I think it would also be helpful for you to use iperf and test your network speeds between the Proxmox server and your storage server; a network problem could cause these issues too.
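
Something along these lines would do as a first pass (the VM ID and address are placeholders, adjust to your setup):

Code:
# network test: server side on the storage box, client side on the proxmox node
iperf -s                 # on the FreeNAS box
iperf -c <storage-ip>    # on the proxmox node

# one-off backup of a single VM to local disk for comparison
vzdump 100 --dumpdir /var/lib/vz/dump --mode snapshot --compress lzo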
 
I tested iperf in both directions and got about 950 Mbps.

I can try local storage for the backup, but my hardware is no slouch; the performance should not be this low, even considering it's backing up to the same device.

I still want to figure out why the tmp dir isn't being used :(

 
Okay so one of the backups is happening literally _right now_

The storage system isn't even reaching half the bandwidth it can handle: RX avg 30 Mbps / TX avg 67 Mbps.

The iostat info for the ZFS pool it's working against shows usage, but nowhere near pushing it to the limits. This includes IOPS and throughput.

None of the stats on the proxmox node report any saturation. CPU is barely used, same for network and load.
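
For reference, this is roughly what I'm watching while the backup runs (the pool name is a placeholder for mine):

Code:
# on the proxmox node
vmstat 1                   # CPU, memory, I/O wait
top                        # per-process CPU (kvm, gzip/lzop)

# on the FreeNAS box
zpool iostat -v <pool> 1   # per-vdev throughput and IOPS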

I see no reason why it should be operating as slowly as it is, and yet it is.

What exactly am I doing wrong here?
 
Still trying to figure this out. I'm checking logs on the nodes and the FreeNAS host and can't find any errors. Performance is still piss-poor, and might even be getting worse.

I tried some single backups, turning off compression and trying suspend mode, but I can't get any backup configuration to result in good speed.

Starting to take more drastic measures, but unsure what my options are.
 
Okay so more steps:

1) I turned off ALL VMs, so there is literally no other demand against my storage; backup speed did not improve.

2) Changed the vzdump tmpdir from /tmp to /var/tmp; saw no improvement in backup speeds (I do see a manifest folder in whichever one I declare).

3) I switched my storage AND one of my nodes to MTU 9000, of course ensuring the switch has jumbo frames on (dmesg says it's supported too). NO improvement there (see the quick check sketched below this list).

4) In all of these steps I tried snapshot mode, backing up with the VM off, no compression, and LZO; no changes there.

5) I tried backing up a VM with qcow2 and with raw; no change there.

6) I tried backing up a VM that was Z1 spinning-disk backed, and one that was all-SSD backed; no change there (this one boggles me too).

7) Also fully updated my FreeNAS storage.

8) Double-checked that NFS is set up with the right number of server processes (4) on FreeNAS.

9) Checked the health of the pool and disks; everything is 100% a-OK.

10) Looked through the logs on the Proxmox nodes; no errors being thrown that I can see.
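
For 3), a quick way to confirm jumbo frames actually work end to end is a do-not-fragment ping with a full-size frame (8972 bytes of payload + 28 bytes of IP/ICMP headers = 9000); the address is a placeholder:

Code:
# from the proxmox node to the storage box; if this fails, some hop
# in between (NIC, bridge, switch port) is still at MTU 1500
ping -M do -s 8972 <storage-ip>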

I am pretty much out of ideas. From what I can tell this is a SOFTWARE issue, not my infrastructure. I'm wagging my finger at Proxmox, and I just don't get why I'm having such poor performance compared to other people who so easily get quite good performance. I just want to aim for about 65 MB/s, but I have never seen anything close to that! At no point do I ever see a tangible bottleneck: loads are fine, CPU is fine, network is fine, storage is fine, etc.!

This is becoming a problem for me because backups are taking upwards of 8+ hours!!!

What. The. Fuck. Am. I. Doing. Wrong????!?!?!?!?

Admins? Devs? Anyone? :(
 
As Proxmox VE is known to work much faster, I expect there is an issue with your storage box (FreeNAS). To test this, use another storage distro, e.g. OpenMediaVault.

Before that, test the local storage and the NFS storage with standard benchmark tools to find the bottleneck.
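
For example, something as simple as dd gives a first impression (sizes are arbitrary; /mnt/pve/Backups is the NFS mount from your log, /var/lib/vz is local storage on the node):

Code:
# sequential write to the NFS backup share, synced so the number is honest
dd if=/dev/zero of=/mnt/pve/Backups/ddtest bs=1M count=4096 conv=fdatasync

# same test against local storage for comparison
dd if=/dev/zero of=/var/lib/vz/ddtest bs=1M count=4096 conv=fdatasync

rm /mnt/pve/Backups/ddtest /var/lib/vz/ddtest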
 
1) iostat isn't installed by default, nor obviously available in the default repos, so I'm not sure of the best course on that part.

2) here is the vmstat stuff:

backup output :

Code:
INFO: starting new backup job: vzdump 201 --remove 0 --mode snapshot --compress lzo --storage Backups --node REDACTED1
INFO: Starting Backup of VM 201 (qemu)
INFO: status = running
INFO: update VM 201: -lock backup
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating archive '/mnt/pve/Backups/dump/vzdump-qemu-201-2015_12_21-13_17_18.vma.lzo'
INFO: started backup task '3ffe5d06-534e-4274-8fc6-459f9281c118'
INFO: status: 0% (47906816/34359738368), sparse 0% (32448512), duration 3, 15/5 MB/s
INFO: status: 1% (359399424/34359738368), sparse 0% (37720064), duration 24, 14/14 MB/s
ERROR: interrupted by signal
INFO: aborting backup job
ERROR: Backup of VM 201 failed - interrupted by signal
ERROR: Backup job failed - interrupted by signal
TASK ERROR: interrupted by signal

vmstat output:

Code:
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 0  0      0 30739392   8420 173572    0    0     0     0 2578 2862  1  0 99  0
 0  0      0 30736540   8428 175072    0    0     0   156 3058 3031  1  1 98  0
 0  0      0 30734200   8428 176720    0    0     0     0 3290 3490  1  1 98  0
 0  0      0 30732464   8428 178348    0    0     0     0 2803 3046  1  0 99  0
 0  0      0 30731968   8428 178972    0    0     0     0 2675 2849  1  1 99  0
 0  0      0 30731860   8428 179136    0    0     0     0 2471 2642  1  0 99  0
 0  0      0 30728884   8428 182136    0    0     0     0 2195 2376  1  0 99  0
 0  0      0 30723288   8436 187464    0    0     0    36 2776 3013  1  0 98  0
 0  0      0 30719444   8436 191456    0    0     0     0 2748 3010  1  1 99  0
 0  0      0 30716592   8436 194108    0    0     0     0 2117 2251  1  0 99  0
 0  0      0 30712016   8436 198392    0    0     0     0 2824 3200  2  0 98  0
 0  0      0 30705552   8436 204712    0    0     0     0 3003 3078  1  1 98  0
 0  0      0 30700708   8444 209044    0    0     0    16 3843 4022  2  1 98  0
 0  0      0 30698368   8444 211328    0    0     0   132 2207 2438  1  0 99  0
 0  0      0 30693860   8444 215932    0    0     0     0 2585 2862  1  0 99  0
 0  0      0 30688776   8444 220848    0    0     0     0 2724 2918  1  1 98  0
 0  0      0 30683856   8444 225332    0    0     0     0 2582 2839  1  0 99  0
 0  0      0 30680260   8452 229188    0    0     0    48 2847 3638  1  1 98  0
 0  0      0 30675300   8452 234124    0    0     0     0 2838 3049  1  0 98  0
 0  0      0 30668604   8452 240324    0    0     0     0 2865 2999  1  0 98  0
 0  0      0 30665388   8452 243636    0    0     0     0 2001 2210  1  0 99  0




Please post the output (10-20 lines) of the commands below while doing the slow backup:

Code:
iostat -kxz 1

Code:
vmstat 1
 
Replacing the OS on the storage system isn't an option at this time. It's effectively "in production", and I see no tangible bottlenecks pointing to FreeNAS as the cause. I understand your logic here, but I am not yet convinced that FreeNAS is the problem.

However, I'm going to try setting up local backup on one of the nodes and try that.

Is there an NFS benchmark method you would recommend? I've tried a few, but I'm not certain which one is reliable.


 
If you think FreeNAS is not the issue, test the storage of your virtual disks.

Where do you store your virtual disks?
 
Started identical backup, same parameters as above.

iostat output:

Code:
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.01    0.00    0.25    0.00    0.00   98.74

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     6.00    0.00    5.00     0.00    44.00    17.60     0.00    0.20    0.00    0.20   0.20   0.10
dm-0              0.00     0.00    0.00   11.00     0.00    44.00     8.00     0.01    0.64    0.00    0.64   0.09   0.10


Code:
apt-get install sysstat

for the *iostat* command
 
What test do you recommend?

The disk images are connected via an NFS export, and the storage is a 4x2TB Z1, but I also have some images on a 4x120GB (SSD) ZFS striped-mirror ("RAID 10") config. As I mentioned earlier, backups do not improve when backing up a VM that is stored on the SSD-backed storage. Also, zpool iostat shows that demand is insufficient to saturate the disks, be it the 4x2TB or the 4x120GB SSDs.


 
