High server load during backup creation

Hi,

Did you install the system from the bare-metal ISO or on top of Debian Wheezy?
I ran into this problem after installing on Debian Wheezy.
Maybe for some other reason you don't have the right scheduler set.

Check cat /sys/block/YOURDISK/queue/scheduler, where YOURDISK = sda etc.
The active scheduler is the one shown in square brackets, and it should be cfq. Any other scheduler would cause the problems you have.
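To check every disk in one go, something like the following could be used. It is only a sketch: the function name and the SYSFS_BLOCK override are made up for illustration (the override just makes it possible to try the function against a copy of the directory tree).

```shell
# Print "<disk>: <active scheduler>" for every block device.
# SYSFS_BLOCK is a hypothetical override; by default /sys/block is read.
show_schedulers() {
    sysfs="${SYSFS_BLOCK:-/sys/block}"
    for f in "$sysfs"/*/queue/scheduler; do
        [ -r "$f" ] || continue
        disk=${f#"$sysfs"/}      # strip the leading directory
        disk=${disk%%/*}         # keep only the device name, e.g. sda
        # The active scheduler is the bracketed one: "noop deadline [cfq]"
        active=$(sed -n 's/.*\[\(.*\)\].*/\1/p' "$f")
        printf '%s: %s\n' "$disk" "${active:-unknown}"
    done
}

# Usage: show_schedulers
```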

To the sysadmins here: can I change the wiki?
The description of how to install Proxmox on Debian Wheezy is perfect. I just miss this VERY IMPORTANT step at the end:


  1. run echo cfq > /sys/block/DISK/queue/scheduler for each of your disks
  2. find GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub
  3. add elevator=cfq to its value ("... elevator=cfq")
  4. run update-grub


Otherwise you will almost certainly run into the same problem as above.
This would also happen with LVM snapshots.
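The four steps above could be scripted roughly like this. Treat it as a sketch under assumptions: the function names, the SYSFS_BLOCK override and the optional file argument are hypothetical, the sd* glob assumes your disks are named sda, sdb and so on, and the sed expressions assume the GRUB line ends with a closing double quote.

```shell
# Step 1: write the scheduler name into every sd* disk's queue/scheduler.
# SYSFS_BLOCK is a hypothetical override for trying this on a copy.
set_scheduler_all() {
    sched="$1"
    sysfs="${SYSFS_BLOCK:-/sys/block}"
    for f in "$sysfs"/sd*/queue/scheduler; do
        [ -w "$f" ] || continue
        echo "$sched" > "$f"
    done
}

# Steps 2-3: make GRUB_CMDLINE_LINUX_DEFAULT end with elevator=<sched>,
# replacing an existing elevator= option if there is one.
set_elevator_in_grub() {
    sched="$1"
    grubfile="${2:-/etc/default/grub}"
    tmp=$(mktemp) || return 1
    sed -e '/^GRUB_CMDLINE_LINUX_DEFAULT=/ s/ *elevator=[a-z-]*//' \
        -e "/^GRUB_CMDLINE_LINUX_DEFAULT=/ s/\"\$/ elevator=$sched\"/" \
        "$grubfile" > "$tmp" && cat "$tmp" > "$grubfile"
    rc=$?
    rm -f "$tmp"
    return $rc
}

# Step 4, as root on the real system:
#   set_scheduler_all cfq && set_elevator_in_grub cfq && update-grub
```

Keeping a copy of /etc/default/grub before running it is a good idea.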

regards
philipp
 
But bare metal install default for 3.x is:
Code:
GRUB_CMDLINE_LINUX_DEFAULT="quiet elevator=deadline"
As far as I know, it is cfq that can cause slowness, not solve it.
 
Proxmox uses ionice for the backup, but as far as I know and have experienced, ionice only takes effect with the cfq scheduler (see man ionice).
For me, cfq solved the problem discussed above.
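A quick way to see whether ionice is likely to do anything on a given disk is to test for cfq before starting the IO-heavy job. This is a minimal sketch; the function name, the SYSFS_BLOCK override and the paths in the usage comment are placeholders, not anything Proxmox ships.

```shell
# Succeed only if the disk's active scheduler (the bracketed one) is cfq.
# SYSFS_BLOCK is a hypothetical override; by default /sys/block is read.
ionice_effective() {
    disk="$1"
    sysfs="${SYSFS_BLOCK:-/sys/block}"
    grep -q '\[cfq\]' "$sysfs/$disk/queue/scheduler" 2>/dev/null
}

# Example (sda, the source volume and the target path are placeholders):
#   ionice_effective sda || echo "ionice priority will likely be ignored on sda" >&2
#   ionice -c 2 -n 7 dd if=/dev/vg/lv of=/mnt/backup/lv.img bs=1M
```

Here -c 2 -n 7 selects the best-effort class at its lowest priority.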
 
Maybe deadline is sometimes better (especially on bigger systems), but when a backup takes all the IO and there is no way to ionice it, the system stalls. I experienced this with NFS backup storage and also when simply doing dd over ssh.
If there is a solution to get this working with the deadline scheduler, I would like to know it.
I think on slower disk systems (we have a simple RAID1 on SATA disks) cfq is the better option. I don't have a big storage system yet to test deadline, but since we changed to cfq everything works fine.
 
I actually have a test case that can reproduce the error seen with the new backup. Yesterday I was doing some heavy-duty IOPS testing using fio in a CT deployed on shared NFS storage. Somewhere in the test the node, and the VMs and CTs on it, became completely unresponsive, which led to all HA VMs and CTs being forcibly online-migrated to other nodes.

Node specs:
Local storage: SSD scheduler used is noop (should have same influence on ionice as deadline)
NFS share: ZFS dataset (host 16GB RAM and disks in RAID10)

I will try a new test this evening using the cfq scheduler instead.
 
If somebody finds a bug, they should try to provide a test case to reproduce it. We can then try to fix it. Maintaining old code forever is not an option.

We understand you cannot maintain old code forever, but removing working code and replacing it with unvetted code is a problem too.
Adding the new feature and deprecating the old one is what most projects do, especially with such fundamental changes.

We have provided test cases.
IO to the backup device stalls and the VM is negatively affected: back up to an NFS server and unplug power to the NFS server during the backup, or back up to a USB disk and disconnect it during the backup.
IO is limited by the speed of the backup media; this is obvious because it is dictated by the current design.

The last issue we cannot test or compare, because we no longer have the ability to do so since LVM snapshot backup was removed.
Moving the backup data around inside the KVM process likely has a negative impact on the operation of the VM; that is what many people are complaining about and observing.
I believe it is important to identify if this is a problem or not. If it is a problem maybe someone can find a good solution.
We need to perform some benchmarks to evaluate this:

Examples:
1. Perform a memory intensive task in a VM while doing an LVM Snapshot backup and repeat using KVM Live backup. Did the memory intensive task run faster when using a particular backup method?
2. Perform a CPU intensive task in a VM while doing an LVM Snapshot backup and repeat using KVM Live backup. Did the CPU intensive task run faster when using a particular backup method?
3. Perform an IO intensive task in a VM while doing an LVM Snapshot backup and repeat using KVM Live backup. Did the IO intensive task run faster when using a particular backup method?
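For anyone who wants to run these comparisons, a small timing helper keeps the numbers comparable between runs. This is only a sketch: time_task is a made-up name, and the workloads in the usage comment are placeholder examples, not the ones anybody in this thread used.

```shell
# time_task LABEL CMD...: run CMD once inside the VM and print
# "LABEL: N seconds" (whole seconds, wall-clock time).
time_task() {
    label="$1"; shift
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    echo "$label: $((end - start)) seconds"
}

# Start the backup on the host first, then inside the VM for example:
#   time_task "cpu, live backup" sh -c 'dd if=/dev/zero bs=1M count=4096 | md5sum'
#   time_task "io, live backup"  dd if=/dev/vda of=/dev/null bs=1M count=8192
# Repeat with an LVM snapshot backup and with no backup running at all.
```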
 

But this is the wrong place. Would you mind joining the pve-devel list and discussing the issue there?
 
I've posted my results to the pve-devel mailing list, thought I would cross post them here too.
Maybe some of you can perform the same benchmarks and post your results, just be sure to edit the commands to match your system configuration.

I was just using a stripped down debian wheezy install for the VM, virtio, cache=none, 1 core, 512MB RAM.
Virtual disks stored on local LVM

Start a KVM Live Backup ( I just used the GUI )
Inside the VM immediately run:
Code:
dd if=/dev/disk_being_backed_up of=/dev/null bs=1M count=8192

Repeated same test but used LVM snapshot and vmtar:

Code:
lvcreate -L33000M -s -n test-snapshot /dev/vmdisks/vm-108-disk-2
/usr/lib/qemu-server/vmtar  '/etc/pve/qemu-server/108.conf' 'qemu-server.conf' '/dev/vmdisks/test-snapshot' 'vm-disk'|lzop -o /backup1/dump/backup.tar.lzop
Code:
KVM Live Backup    : 120 seconds or more
LVM Snapshot backup: 55 seconds
With no backup     : 45 seconds

It was even worse when reading from an area far away from where the backup process is reading.
I started the backup, in the guest I ran:

Code:
dd if=/dev/disk_being_backed_up of=/dev/null bs=1M count=8192 skip=24000

Code:
KVM Live Backup    : 298 seconds
LVM Snapshot Backup:  58 seconds
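To put the timings from both tests into proportion, the ratios can be worked out quickly. The helper name slowdown is just for illustration; the inputs are the seconds reported above (the first live backup figure was "120 seconds or more", so its ratio is a lower bound).

```shell
# slowdown T BASE: print how many times longer T is than BASE.
slowdown() {
    awk -v t="$1" -v base="$2" 'BEGIN { printf "%.2f\n", t / base }'
}

slowdown 120 45   # KVM live backup vs. no backup:      2.67
slowdown 55 45    # LVM snapshot backup vs. no backup:  1.22
slowdown 298 58   # second test, live vs. LVM snapshot: 5.14
```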

I think this explains the load issue.

We still need to test write IO, but I do not have the time at the moment.
 
I read the whole thread, and I would give you a suggestion: If CFQ solves the issue, just switch the scheduler to CFQ before backup, and switch it back to noop or deadline for normal operation.

Can't be worse than having a VM mount / ro because of journal write timeout.
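One way to try this would be a vzdump hook script that flips the scheduler when the backup job starts and ends. What follows is a sketch with assumptions: the phase names (job-start, job-end, job-abort) are what I believe vzdump passes as the first argument to a hook script, so check them against the example hook script shipped with your version; NORMAL_SCHED and SYSFS_BLOCK are hypothetical settings, and the sd* glob assumes disks named sda, sdb and so on.

```shell
# Switch to cfq while the backup job runs and back to the normal
# scheduler afterwards. NORMAL_SCHED (default: deadline) and SYSFS_BLOCK
# are hypothetical settings introduced for this sketch.
backup_scheduler_hook() {
    phase="$1"
    sysfs="${SYSFS_BLOCK:-/sys/block}"
    case "$phase" in
        job-start)          sched=cfq ;;
        job-end|job-abort)  sched="${NORMAL_SCHED:-deadline}" ;;
        *)                  return 0 ;;   # ignore all other phases
    esac
    for f in "$sysfs"/sd*/queue/scheduler; do
        [ -w "$f" ] || continue
        echo "$sched" > "$f"
    done
}

# When saved as a hook script, end the file with:
#   backup_scheduler_hook "$1"
```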
 
Just to add my experience:

I have the same problems with stalled VMs while they are backed up. The backup also crashed the VM's LVM-ext4 filesystem one time, so I had to boot a rescue CD and run a manual fsck.

This is the case for a standard Proxmox 3.1 installation using NFS on a rather slow NAS.
 
This explains basically nothing - your test is flawed.

Hi Dietmar,
It is a pleasure to greet you again.

I always read the threads about backup problems because I also have such problems.

Can you explain why e100's test explains basically nothing? I can't understand this, since e100 backs it up with real benchmark numbers.
I would be very pleased to hear from you.

Best regards
Cesar
 
I think you should use a fast local disk for such backups, and use hook script to transfer result to slow storage.
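A hook along these lines could do the transfer once each backup has finished on the fast local disk. It is a sketch resting on assumptions: that vzdump calls the hook with backup-end as the first argument and exports the archive path in TARFILE (verify both against the example hook script of your version), while SLOW_STORAGE and its default path are made up for illustration.

```shell
# Copy a finished backup archive from the fast local dump directory to
# slower storage. TARFILE is assumed to be set by vzdump; SLOW_STORAGE
# is a hypothetical setting for the destination directory.
transfer_backup_hook() {
    phase="$1"
    dest="${SLOW_STORAGE:-/mnt/slow-nas/dump}"
    [ "$phase" = "backup-end" ] || return 0
    [ -n "${TARFILE:-}" ] && [ -f "$TARFILE" ] || return 0
    cp "$TARFILE" "$dest"/
}

# When saved as a hook script, end the file with:
#   transfer_backup_hook "$1"
```

Plain cp keeps the local copy; whether to delete it afterwards depends on how much local space you have.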

Hi again Dietmar,

Just some details:

1- What if my server with the PVE host doesn't have a free bay for an HDD? :( (That isn't the best solution.)
2- And if my server with the PVE host does have a free bay for an extra HDD, it would be great if the Backup tab in the PVE GUI had the option to add 2 scripts, one to run before the backup and the other after the backup ("Veeam backup" for VMware has this). That way it would be easy to run the necessary hook scripts, and obviously they would run consecutively.

Best regards
Cesar
 

We already fixed that issue with e100 (fix will be in next release).
 

Thanks Dietmar, I can't wait to get the next release. :D

Just so I know the details of this fix, can you send me a web link?

Best regards
Cesar