High server load during backup creation

felipe · Nov 21, 2013

hi,

did you install the system with the baremetal iso or with debian wheezy?
i run into this problem after installing debian wheezy..
maybe for some other reason you dont have the right scheduler set...

check cat /sys/block/YOURDISKS/queue/scheduler where YOURDIKS = sda etc...
it should say cfq. any other scheduler would cause the problems you have....

to the sysadming here. can i change the wiki?
the description how to install proxmox via debian wheezy is perfect. i just miss this VERY IMPORTANT step at the end:

echo cfq > /sys/block/DISKS/queue/scheduler to all of your disks
find GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub
and add "... elevator=cfq"
run update-grub

otherwise you will run for sure in the same problem as above...
this would actually also happen with LVM snapshots.....

regards
philipp

mir · Nov 21, 2013

If you use SSD or flash cache the recommended elevator setting is noop. This is also true for proxmox.

mmenaz · Nov 21, 2013

But bare metal install default for 3.x is:

Code:

GRUB_CMDLINE_LINUX_DEFAULT="quiet elevator=deadline"

as far as I know, is cfq that can cause slowliness, not solve it.

felipe · Nov 21, 2013

proxmox uses ionice to make the backup. but as far as i know and experienced only with cfq scheduler ionice takes effekt.... (read man ionice)
for me cfq solved the problem discussed above.

felipe · Nov 21, 2013

maybe deadline is sometimes better (specially on bigger systems).... but when a backup takes all io and having no option to ionice it will stall the system. i experienced this with nfs backup storage also as simple doing dd over shh.
if there is an solution to get this work with the deadline scheduler i would like to know it....
i think in slower disk systems (we have simple raid1 on sata disks) the cfq is the better option. i dont have a big storage yet to test deadline. but since we changed to cfq everything works fine.

mir · Nov 21, 2013

I actually have a test case which can enumerate the error seen using the new backup. Yesterday I was doing some heavy duty IOPS testing using fio in a CT deployed on shared NFS storage. Somewhere in the test the node and VM's and CT's on this node was completely unresponsive which lead to that all HA VM's and CT's was forced online migrated to other nodes.

Node specs:
Local storage: SSD scheduler used is noop (should have same influence on ionice as deadline)
NFS share: ZFS dataset (host 16GB RAM and disks in RAID10)

I will try a new test this evening using scheduler cfg instead.

e100 · Nov 21, 2013

dietmar said:
If somebody find a bug, he should try to provide a test case to reproduce it. We can then try to fix it. Maintaining old code forever is not an option.

We understand you can not maintain old code forever, but removing working code and replacing it with unvetted code is a problem too.
Adding new feature and depreciating the old is what most projects do, especially with such fundamental changes.

We have provided test cases.
IO to backup device stalls, VM negatively effected. Backup to NFS server, unplug power to NFS server doing backup. Backup to a USB disk, disconnect it during backup.
IO is limited by the speed of the backup media, this is obvious because it is dictated by the current design.

The last issue we can not test or compare because we do not have the ability to do since LVM Snapshot backup was removed.
Moving the backup data around inside the KVM process likely has a negative impact on the operation of the VM, that is what many people are complaining about and observing.
I believe it is important to identify if this is a problem or not. If it is a problem maybe someone can find a good solution.
We need to perform some benchmarks to evaluate this:

Examples:
1. Perform a memory intensive task in a VM while doing an LVM Snapshot backup and repeat using KVM Live backup. Did the memory intensive task run faster when using a particular backup method?
2. Perform a CPU intensive task in a VM while doing an LVM Snapshot backup and repeat using KVM Live backup. Did the CPU intensive task run faster when using a particular backup method?
3. Perform an IO intensive task in a VM while doing an LVM Snapshot backup and repeat using KVM Live backup. Did the IO intensive task run faster when using a particular backup method?

dietmar · Nov 21, 2013

e100 said:
I believe it is important to identify if this is a problem or not. If it is a problem maybe someone can find a good solution.
We need to perform some benchmarks to evaluate this:

But this is the wrong place. Would you mind to join the pve-devel list and discuss the issue there?

e100 · Nov 21, 2013

dietmar said:
But this is the wrong place. Would you mind to join the pve-devel list and discuss the issue there?

Sure, I will send a message to the list, already subscribed.
Thank you for providing the proper direction so we can resolve this issue.

e100 · Nov 22, 2013

I've posted my results to the pve-devel mailing list, thought I would cross post them here too.
Maybe some of you can perform the same benchmarks and post your results, just be sure to edit the commands to match your system configuration.

I was just using a stripped down debian wheezy install for the VM, virtio, cache=none, 1 core, 512MB RAM.
Virtual disks stored on local LVM

Start a KVM Live Backup ( I just used the GUI )
Inside the VM immediately run:

Code:

dd if=/dev/disk_being_backed_up of=/dev/null bs=1M count=8192

Repeated same test but used LVM snapshot and vmtar:

Code:

lvcreate -L33000M -s -n test-snapshot /dev/vmdisks/vm-108-disk-2
/usr/lib/qemu-server/vmtar  '/etc/pve/qemu-server/108.conf' 'qemu-server.conf' '/dev/vmdisks/test-snapshot' 'vm-disk'|lzop -o /backup1/dump/backup.tar.lzop

Code:

KVM Live Backup    : 120 seconds or more
LVM Snapshot backup: 55 seconds
With no backup     : 45 seconds

Even worse was to read from an area far away from where the backup process is reading.
I started the backup, in the guest I ran:

Code:

dd if=/dev/disk_being_backed_up of=/dev/null bs=1M count=8192 skip=24000

Code:

KVM Live Backup    : 298 seconds
LVM Snapshot Backup:  58 seconds

I think this explains the load issue.

We still need to test write IO, I do not have the time at the moment.

dietmar · Nov 22, 2013

e100 said:
I think this explains the load issue.

This explains basically nothing - your test is flawed.

But cross posting here is totally useless - please stop this.

jinjer · Nov 24, 2013

I read the whole thread, and I would give you a suggestion: If CFQ solves the issue, just switch the scheduler to CFQ before backup, and switch it back to noop or deadline for normal operation.

Can't be worse than having a VM mount / ro because of journal write timeout.

Nico Haase · Jan 2, 2014

dietmar said:
But this is the wrong place. Would you mind to join the pve-devel list and discuss the issue there?

Hi there! I have seen some discussions on the developers list, but as fas as I see there is no official fix yet. Do you have any ETA for it?

Nico Haase · Jan 20, 2014

....still no news about it?

Datenfalke · Jan 26, 2014

Just to add my experience:

I have the same problems with stalled VMs while they are backed up. The backup also crashed the VM LVM-EXT4 Filessystem one time, so I had to boot a rescue CD and make a manual fsck.

This is the case for a standard Proxmox 3.1 installation using NFS on a rather slow NAS.

cesarpk · Jan 27, 2014

dietmar said:
This explains basically nothing - your test is flawed.

Hi Dietmar,
It is a pleasure to greet you again

I always read the threads about of backup problems due to that i also have problems.

Can you explain why e100 explains basically nothing?. I can't understand you isince that e100 explain with real numbers of benchmark.
I will be very pleased to hear you.

Best regards
Cesar

cesarpk · Jan 27, 2014

dietmar said:
I think you should use a fast local disk for such backups, and use hook script to transfer result to slow storage.

Hi Dietmar again

Only a details:

1- If my Server with PVE Host don't have bay free for HDD?...

(Don't is the better solution)
2- And if my Server with PVE Host have a bay free for a extra HDD, will be great that into the PVE GUI /Backup Tag, have the option of add 2 scripts, one for run it before backup and other for after of backup ("Veeam backup" for VMware has it), in this mode will be easy to run the necessary hooks scripts, and obviously will run consecutively.

Best regards
Cesar

dietmar · Jan 27, 2014

cesarpk said:
Can you explain why e100 explains basically nothing?. I can't understand you isince that e100 explain with real numbers of benchmark.
I will be very pleased to hear you.

We already fixed that issue with e100 (fix will be in next release).

cesarpk · Jan 27, 2014

dietmar said:
We already fixed that issue with e100 (fix will be in next release).

Thanks Dietmar, I don't wait for get the next rekease..

Only to know the details of this fix, can you pass me a web link?

Best regards
Cesar

dietmar · Jan 27, 2014

cesarpk said:
Only to know the details of this fix, can you pass me a web link?

see backup related changes in

https://git.proxmox.com/?p=pve-qemu-kvm.git;a=summary

High server load during backup creation

Well-Known Member

Famous Member

Renowned Member

Well-Known Member

Well-Known Member

Famous Member

Renowned Member

Proxmox Staff Member

Renowned Member

Renowned Member

Proxmox Staff Member

Renowned Member

Member

Member

Renowned Member

Well-Known Member

Well-Known Member

Proxmox Staff Member

Well-Known Member

Proxmox Staff Member

We value your privacy