High server load during backup creation

felipe

Member
Oct 28, 2013
152
1
18
hi,

did you install the system with the bare-metal ISO or on top of Debian Wheezy?
I ran into this problem after installing on Debian Wheezy...
maybe for some other reason you don't have the right scheduler set...

Check cat /sys/block/YOURDISK/queue/scheduler, where YOURDISK = sda etc.
It should say cfq; any other scheduler would cause the problems you have...

To the sysadmins here: can I change the wiki?
The description of how to install Proxmox via Debian Wheezy is perfect; I just miss this VERY IMPORTANT step at the end:


  1. echo cfq > /sys/block/DISK/queue/scheduler for each of your disks
  2. find GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub
  3. add elevator=cfq to that line
  4. run update-grub
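The four steps can also be sketched as a small shell snippet. This is only a sketch: the sd* glob assumes SATA/SCSI-style device names, and SYSFS_ROOT/GRUB_FILE are parameterised here purely so the logic is easy to adapt (on a real host they default to /sys/block and /etc/default/grub).

```shell
#!/bin/sh
# Sketch of steps 1-4 above (assumption: sd* device names).
SYSFS_ROOT="${SYSFS_ROOT:-/sys/block}"
GRUB_FILE="${GRUB_FILE:-/etc/default/grub}"

set_cfq_now() {
    # step 1: switch every sd* disk to cfq at runtime
    for q in "$SYSFS_ROOT"/sd*/queue/scheduler; do
        if [ -w "$q" ]; then
            echo cfq > "$q" 2>/dev/null || true
        fi
    done
}

persist_cfq() {
    # steps 2-3: append elevator=cfq to GRUB_CMDLINE_LINUX_DEFAULT
    sed -i 's/^GRUB_CMDLINE_LINUX_DEFAULT="\([^"]*\)"/GRUB_CMDLINE_LINUX_DEFAULT="\1 elevator=cfq"/' "$GRUB_FILE"
    # step 4: regenerate the boot config (skipped if update-grub is absent)
    command -v update-grub >/dev/null && update-grub
}

set_cfq_now
```

The runtime change takes effect immediately but is lost on reboot, which is why the grub edit is needed as well.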


otherwise you will for sure run into the same problem as above...
this would actually also happen with LVM snapshots...

regards
philipp
 

mmenaz

Member
Jun 25, 2009
736
5
18
Northern east Italy
But the bare-metal install default for 3.x is:
Code:
GRUB_CMDLINE_LINUX_DEFAULT="quiet elevator=deadline"
As far as I know, it is cfq that can cause slowness, not solve it.
 

felipe

Member
Oct 28, 2013
152
1
18
Proxmox uses ionice to make the backup, but as far as I know and have experienced, ionice only takes effect with the cfq scheduler (read man ionice).
For me cfq solved the problem discussed above.
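The man page detail felipe points at can be seen directly: ionice lets you put a process in the idle IO class (class 3), but the kernel only honours that class when the disk is using cfq; under deadline or noop the class is recorded and ignored. A tiny illustration (ionice ships with util-linux):

```shell
# Run a short reader in the "idle" IO class; under cfq its disk IO only
# proceeds when no other process needs the disk. Under deadline/noop the
# class has no scheduling effect.
ionice -c 3 dd if=/dev/zero of=/dev/null bs=1M count=16 2>/dev/null

# Show the IO class of any process (here: the current shell)
ionice -p $$
```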
 

felipe

Member
Oct 28, 2013
152
1
18
Maybe deadline is sometimes better (especially on bigger systems), but when a backup takes all the IO and there is no way to ionice it, it will stall the system. I experienced this with NFS backup storage as well as when simply doing dd over ssh.
If there is a solution to get this working with the deadline scheduler, I would like to know it...
I think on slower disk systems (we have simple RAID1 on SATA disks) cfq is the better option. I don't have a big storage yet to test deadline, but since we changed to cfq everything works fine.
 

mir

Renowned Member
Apr 14, 2012
3,489
97
68
Copenhagen, Denmark
I actually have a test case which can reproduce the error seen with the new backup. Yesterday I was doing some heavy-duty IOPS testing using fio in a CT deployed on shared NFS storage. Somewhere during the test, the node and the VMs and CTs on it became completely unresponsive, which led to all HA VMs and CTs being forcibly live-migrated to other nodes.

Node specs:
Local storage: SSD, scheduler used is noop (should have the same influence on ionice as deadline)
NFS share: ZFS dataset (host has 16GB RAM and disks in RAID10)

I will try a new test this evening using the cfq scheduler instead.
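mir does not post his actual fio invocation; purely as an illustration, a job file in this spirit (every value below is an assumption, including the directory) would generate similarly heavy mixed random IOPS against the NFS-backed storage:

```ini
; hypothetical fio job, not mir's actual one
[heavy-iops]
directory=/srv/ct-nfs   ; assumed mount point of the shared NFS storage
rw=randrw               ; mixed random read/write
bs=4k
size=2g
iodepth=32
ioengine=libaio
runtime=60
time_based
```

Saved as heavy-iops.fio, it would be run inside the CT with `fio heavy-iops.fio`.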
 

e100

Active Member
Nov 6, 2010
1,235
24
38
Columbus, Ohio
ulbuilder.wordpress.com
If somebody finds a bug, he should try to provide a test case to reproduce it. We can then try to fix it. Maintaining old code forever is not an option.
We understand you cannot maintain old code forever, but removing working code and replacing it with unvetted code is a problem too.
Adding a new feature and deprecating the old is what most projects do, especially with such fundamental changes.

We have provided test cases.
IO to the backup device stalls and the VM is negatively affected: back up to an NFS server and unplug the power to the NFS server during the backup, or back up to a USB disk and disconnect it during the backup.
IO is limited by the speed of the backup media; this is obvious because it is dictated by the current design.

The last issue we cannot test or compare, because we have not had the ability to do so since LVM snapshot backup was removed.
Moving the backup data around inside the KVM process likely has a negative impact on the operation of the VM; that is what many people are complaining about and observing.
I believe it is important to identify if this is a problem or not. If it is a problem maybe someone can find a good solution.
We need to perform some benchmarks to evaluate this:

Examples:
1. Perform a memory intensive task in a VM while doing an LVM Snapshot backup and repeat using KVM Live backup. Did the memory intensive task run faster when using a particular backup method?
2. Perform a CPU intensive task in a VM while doing an LVM Snapshot backup and repeat using KVM Live backup. Did the CPU intensive task run faster when using a particular backup method?
3. Perform an IO intensive task in a VM while doing an LVM Snapshot backup and repeat using KVM Live backup. Did the IO intensive task run faster when using a particular backup method?
 

dietmar

Proxmox Staff Member
Staff member
Apr 28, 2005
16,511
322
103
Austria
www.proxmox.com
I believe it is important to identify if this is a problem or not. If it is a problem maybe someone can find a good solution.
We need to perform some benchmarks to evaluate this:
But this is the wrong place. Would you mind to join the pve-devel list and discuss the issue there?
 

e100

Active Member
Nov 6, 2010
1,235
24
38
Columbus, Ohio
ulbuilder.wordpress.com
I've posted my results to the pve-devel mailing list and thought I would cross-post them here too.
Maybe some of you can perform the same benchmarks and post your results; just be sure to edit the commands to match your system configuration.

I was just using a stripped-down Debian Wheezy install for the VM: virtio, cache=none, 1 core, 512MB RAM.
Virtual disks were stored on local LVM.

Start a KVM Live Backup ( I just used the GUI )
Inside the VM immediately run:
Code:
dd if=/dev/disk_being_backed_up of=/dev/null bs=1M count=8192
Repeated the same test, but used an LVM snapshot and vmtar:

Code:
lvcreate -L33000M -s -n test-snapshot /dev/vmdisks/vm-108-disk-2
/usr/lib/qemu-server/vmtar  '/etc/pve/qemu-server/108.conf' 'qemu-server.conf' '/dev/vmdisks/test-snapshot' 'vm-disk'|lzop -o /backup1/dump/backup.tar.lzop
Code:
KVM Live Backup    : 120 seconds or more
LVM Snapshot backup: 55 seconds
With no backup     : 45 seconds
Even worse was reading from an area far away from where the backup process is reading.
I started the backup, and in the guest I ran:

Code:
dd if=/dev/disk_being_backed_up of=/dev/null bs=1M count=8192 skip=24000
Code:
KVM Live Backup    : 298 seconds
LVM Snapshot Backup:  58 seconds
I think this explains the load issue.

We still need to test write IO; I do not have the time at the moment.
 

jinjer

Member
Oct 4, 2010
194
5
18
I read the whole thread, and I would give you a suggestion: if CFQ solves the issue, just switch the scheduler to CFQ before the backup, and switch it back to noop or deadline for normal operation.

It can't be worse than having a VM remount / read-only because of a journal write timeout.
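jinjer's switch-around-the-backup idea could look roughly like this. A sketch only: the VMID and storage name in BACKUP_CMD are placeholders, the sd* glob assumes SATA-style device names, and SYSFS_ROOT is parameterised just so the flow is easy to adapt (on a real host it is /sys/block).

```shell
#!/bin/sh
# Sketch: use cfq only while the backup runs, deadline otherwise.
# BACKUP_CMD (VMID, storage name) and the sd* glob are placeholders.
SYSFS_ROOT="${SYSFS_ROOT:-/sys/block}"
BACKUP_CMD="${BACKUP_CMD:-vzdump 108 --storage backup}"

set_all() {
    for q in "$SYSFS_ROOT"/sd*/queue/scheduler; do
        if [ -w "$q" ]; then
            echo "$1" > "$q" 2>/dev/null || true
        fi
    done
}

set_all cfq                  # ionice-friendly scheduler during the backup
$BACKUP_CMD || true          # run the backup job itself
set_all deadline             # restore the normal scheduler afterwards
```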
 

Datenfalke

New Member
Jan 26, 2014
15
3
3
Just to add my experience:

I have the same problems with stalled VMs while they are being backed up. The backup even corrupted a VM's ext4 filesystem on LVM once, so I had to boot a rescue CD and run a manual fsck.

This is the case for a standard Proxmox 3.1 installation using NFS on a rather slow NAS.
 

cesarpk

Member
Mar 31, 2012
770
2
18
This explains basically nothing - your test is flawed.
Hi Dietmar,
It is a pleasure to greet you again.

I always read the threads about backup problems, since I have problems too.

Can you explain why e100's test explains basically nothing? I can't understand you, since e100 backs his explanation with real benchmark numbers.
I will be very pleased to hear from you.

Best regards
Cesar
 

cesarpk

Member
Mar 31, 2012
770
2
18
I think you should use a fast local disk for such backups, and use a hook script to transfer the result to slow storage.
Hi Dietmar, again.

Just two details:

1. What if my PVE host server has no free HDD bay? :( (then this is not the best solution)
2. And if my PVE host server does have a free bay for an extra HDD, it would be great if the PVE GUI backup tab had the option to add two scripts, one to run before the backup and one after ("Veeam Backup" for VMware has this); that way it would be easy to run the necessary hook scripts, and they would obviously run consecutively.
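For reference, the hook-script approach Dietmar mentions can be sketched like this. It is a hedged sketch, not a tested script: the destination path is assumed, and it relies on vzdump calling the script configured via "script:" in /etc/vzdump.conf with the phase as its first argument, exporting the finished archive path as $TARFILE in the backup-end phase.

```shell
#!/bin/sh
# Hedged sketch of a vzdump hook script (enable it with a line like
# "script: /usr/local/bin/vzdump-hook.sh" in /etc/vzdump.conf).
# DEST is an assumed path; vzdump passes the phase as $1 and, in the
# backup-end phase, exports the finished archive path as $TARFILE.
DEST="${DEST:-/mnt/slow-nas/dump}"

hook() {
    case "$1" in
        backup-end)
            # backup finished on the fast local disk: copy it to slow storage
            if [ -n "$TARFILE" ]; then
                cp "$TARFILE" "$DEST/"
            fi
            ;;
    esac
}

hook "$@"
```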

Best regards
Cesar
 

dietmar

Proxmox Staff Member
Staff member
Apr 28, 2005
16,511
322
103
Austria
www.proxmox.com
Can you explain why e100's test explains basically nothing? I can't understand you, since e100 backs his explanation with real benchmark numbers.
I will be very pleased to hear from you.
We already fixed that issue with e100 (fix will be in next release).
 

cesarpk

Member
Mar 31, 2012
770
2
18
We already fixed that issue with e100 (fix will be in next release).
Thanks Dietmar, I can't wait to get the next release. :D

Just to know the details of this fix, can you send me a web link?

Best regards
Cesar
 
