High server load during backup creation

Discussion in 'Proxmox VE: Installation and configuration' started by Nico Haase, Nov 4, 2013.

  1. felipe

    felipe Member

    Joined:
    Oct 28, 2013
    Messages:
    150
    Likes Received:
    1
    Hi,

    Did you install the system with the bare-metal ISO or on top of Debian Wheezy?
    I ran into this problem after installing on Debian Wheezy.
    Maybe for some other reason you don't have the right scheduler set.

    Check cat /sys/block/YOURDISK/queue/scheduler, where YOURDISK = sda etc.
    It should say cfq; any other scheduler would cause the problems you describe.

    To the sysadmins here: can I change the wiki?
    The description of how to install Proxmox on Debian Wheezy is perfect; it is just missing this VERY IMPORTANT step at the end:


    1. echo cfq > /sys/block/YOURDISK/queue/scheduler for each of your disks
    2. find GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub
    3. add elevator=cfq to that line
    4. run update-grub


    Otherwise you will certainly run into the same problem as above.
    This would actually also happen with LVM snapshots.
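
    For reference, a minimal sketch of those steps as one script (assuming the disks are sda and sdb; adjust the list to your system):

    Code:
    #!/bin/sh
    # switch the running system to cfq immediately (not persistent across reboots)
    for disk in sda sdb; do
        echo cfq > /sys/block/$disk/queue/scheduler
    done
    # verify: the active scheduler is shown in [brackets]
    cat /sys/block/sda/queue/scheduler

    # for persistence, add elevator=cfq to GRUB_CMDLINE_LINUX_DEFAULT in
    # /etc/default/grub by hand, then regenerate the grub config:
    update-grub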

    regards
    philipp
     
  2. mir

    mir Well-Known Member
    Proxmox Subscriber

    Joined:
    Apr 14, 2012
    Messages:
    3,480
    Likes Received:
    96
    If you use an SSD or a flash cache, the recommended elevator setting is noop. This is also true for Proxmox.
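
    A minimal sketch of checking and switching a single device, assuming the SSD shows up as sdb (adjust to your device name):

    Code:
    # 0 here means the kernel sees the device as non-rotational (SSD/flash)
    cat /sys/block/sdb/queue/rotational
    # switch just that device to noop; the active scheduler is shown in [brackets]
    echo noop > /sys/block/sdb/queue/scheduler
    cat /sys/block/sdb/queue/scheduler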
     
  3. mmenaz

    mmenaz Member

    Joined:
    Jun 25, 2009
    Messages:
    735
    Likes Received:
    5
    But the bare-metal install default for 3.x is:
    Code:
    GRUB_CMDLINE_LINUX_DEFAULT="quiet elevator=deadline"
    
    As far as I know, it is cfq that can cause slowness, not solve it.
     
  4. felipe

    felipe Member

    Joined:
    Oct 28, 2013
    Messages:
    150
    Likes Received:
    1
    Proxmox uses ionice for the backup, but as far as I know and have experienced, ionice only takes effect with the cfq scheduler (read man ionice).
    For me, cfq solved the problem discussed above.
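
    To illustrate the point (the dd command here is just a stand-in for any I/O-heavy job):

    Code:
    # ionice priorities are only honoured by the cfq scheduler
    cat /sys/block/sda/queue/scheduler
    # run an I/O-heavy job in the idle class, so it only gets disk time when nothing else wants it
    ionice -c3 dd if=/dev/sda of=/dev/null bs=1M count=4096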
     
  5. felipe

    felipe Member

    Joined:
    Oct 28, 2013
    Messages:
    150
    Likes Received:
    1
    Maybe deadline is sometimes better (especially on bigger systems), but when a backup takes all the I/O and there is no way to ionice it, it will stall the system. I experienced this with NFS backup storage as well as with a simple dd over SSH.
    If there is a solution that makes this work with the deadline scheduler, I would like to know it.
    I think on slower disk systems (we have a simple RAID1 on SATA disks) cfq is the better option. I don't have a big storage system yet to test deadline, but since we changed to cfq everything works fine.
     
  6. mir

    mir Well-Known Member
    Proxmox Subscriber

    Joined:
    Apr 14, 2012
    Messages:
    3,480
    Likes Received:
    96
    I actually have a test case which can reproduce the error seen with the new backup. Yesterday I was doing some heavy-duty IOPS testing using fio in a CT deployed on shared NFS storage. At some point during the test the node, and the VMs and CTs on it, became completely unresponsive, which led to all HA VMs and CTs being force-migrated online to other nodes.

    Node specs:
    Local storage: SSD, scheduler used is noop (should have the same influence on ionice as deadline)
    NFS share: ZFS dataset (host with 16GB RAM and disks in RAID10)

    I will try a new test this evening using the cfq scheduler instead.
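
    The exact fio job is not given above, so purely as an illustration, a heavy random-IOPS run of roughly this shape is enough to load a slow backing store (all parameters are assumptions; tune them for your setup):

    Code:
    fio --name=randrw --directory=/mnt/nfs-test --ioengine=libaio --direct=1 \
        --rw=randrw --bs=4k --iodepth=32 --numjobs=4 --size=2G \
        --runtime=120 --time_based --group_reporting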
     
  7. e100

    e100 Active Member
    Proxmox Subscriber

    Joined:
    Nov 6, 2010
    Messages:
    1,235
    Likes Received:
    24
    We understand you cannot maintain old code forever, but removing working code and replacing it with unvetted code is a problem too.
    Adding the new feature and deprecating the old is what most projects do, especially with such fundamental changes.

    We have provided test cases:
    When IO to the backup device stalls, the VM is negatively affected. Back up to an NFS server and unplug the NFS server's power during the backup; or back up to a USB disk and disconnect it during the backup.
    IO is limited by the speed of the backup media; this is obvious because it is dictated by the current design.

    The last issue we cannot test or compare, because we no longer have the ability to do so since LVM snapshot backup was removed:
    Moving the backup data around inside the KVM process likely has a negative impact on the operation of the VM; that is what many people are complaining about and observing.
    I believe it is important to identify whether this is a problem or not. If it is a problem, maybe someone can find a good solution.
    We need to perform some benchmarks to evaluate this:

    Examples:
    1. Perform a memory intensive task in a VM while doing an LVM Snapshot backup and repeat using KVM Live backup. Did the memory intensive task run faster when using a particular backup method?
    2. Perform a CPU intensive task in a VM while doing an LVM Snapshot backup and repeat using KVM Live backup. Did the CPU intensive task run faster when using a particular backup method?
    3. Perform an IO intensive task in a VM while doing an LVM Snapshot backup and repeat using KVM Live backup. Did the IO intensive task run faster when using a particular backup method?
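
    To make that concrete, here are illustrative in-guest commands for the three cases (the tools and sizes are my assumptions; 'stress' has to be installed separately, and /dev/vda should be replaced with your virtual disk). Time each run once during a KVM Live backup, once during an LVM snapshot backup, and once with no backup:

    Code:
    # 1. memory intensive
    stress --vm 1 --vm-bytes 256M --timeout 60

    # 2. CPU intensive: hash a fixed amount of zeroes and time it
    time dd if=/dev/zero bs=1M count=4096 | sha256sum

    # 3. IO intensive: read a fixed amount from the virtual disk, bypassing the page cache
    time dd if=/dev/vda of=/dev/null bs=1M count=4096 iflag=direct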
     
  8. dietmar

    dietmar Proxmox Staff Member
    Staff Member

    Joined:
    Apr 28, 2005
    Messages:
    16,432
    Likes Received:
    299
    But this is the wrong place. Would you mind joining the pve-devel list and discussing the issue there?
     
  9. e100

    e100 Active Member
    Proxmox Subscriber

    Joined:
    Nov 6, 2010
    Messages:
    1,235
    Likes Received:
    24
    Sure, I will send a message to the list; I am already subscribed.
    Thank you for pointing us in the proper direction so we can resolve this issue.
     
  10. e100

    e100 Active Member
    Proxmox Subscriber

    Joined:
    Nov 6, 2010
    Messages:
    1,235
    Likes Received:
    24
    I've posted my results to the pve-devel mailing list; I thought I would cross-post them here too.
    Maybe some of you can perform the same benchmarks and post your results; just be sure to edit the commands to match your system configuration.

    I was just using a stripped-down Debian Wheezy install for the VM: virtio, cache=none, 1 core, 512MB RAM.
    Virtual disks stored on local LVM.

    Start a KVM Live Backup (I just used the GUI).
    Inside the VM, immediately run:
    Code:
    dd if=/dev/disk_being_backed_up of=/dev/null bs=1M count=8192
    I repeated the same test, but using an LVM snapshot and vmtar:

    Code:
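    # take an LVM snapshot of the VM disk, archive it with vmtar, and compress the stream with lzop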
    lvcreate -L33000M -s -n test-snapshot /dev/vmdisks/vm-108-disk-2
    /usr/lib/qemu-server/vmtar  '/etc/pve/qemu-server/108.conf' 'qemu-server.conf' '/dev/vmdisks/test-snapshot' 'vm-disk'|lzop -o /backup1/dump/backup.tar.lzop
    
    
    Code:
    KVM Live Backup    : 120 seconds or more
    LVM Snapshot backup: 55 seconds
    With no backup     : 45 seconds
    
    Even worse was reading from an area far away from where the backup process was reading.
    I started the backup, then in the guest I ran:

    Code:
    dd if=/dev/disk_being_backed_up of=/dev/null bs=1M count=8192 skip=24000
    Code:
    KVM Live Backup    : 298 seconds
    LVM Snapshot Backup:  58 seconds
    I think this explains the load issue.

    We still need to test write IO; I do not have the time at the moment.
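
    For anyone who wants to run the write test, a sketch of an equivalent in-guest command (file name and size are my assumptions); time it once during a KVM Live backup and once during an LVM snapshot backup:

    Code:
    # write 4GB with direct IO and force it to disk, so the page cache does not hide the result
    time dd if=/dev/zero of=/root/write-test.bin bs=1M count=4096 oflag=direct conv=fsync
    rm /root/write-test.bin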
     
  11. dietmar

    dietmar Proxmox Staff Member
    Staff Member

    Joined:
    Apr 28, 2005
    Messages:
    16,432
    Likes Received:
    299
    This explains basically nothing - your test is flawed.

    But cross-posting here is totally useless - please stop this.
     
  12. jinjer

    jinjer Member

    Joined:
    Oct 4, 2010
    Messages:
    194
    Likes Received:
    5
    I read the whole thread, and I would offer a suggestion: if cfq solves the issue, just switch the scheduler to cfq before the backup and switch it back to noop or deadline for normal operation.

    It can't be worse than having a VM remount / read-only because of a journal write timeout.
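
    A minimal sketch of that idea as a wrapper script (the VM ID, storage name, and the deadline scheduler you switch back to are assumptions; adjust them to your backup job):

    Code:
    #!/bin/sh
    # switch every disk to cfq for the backup window
    for s in /sys/block/sd*/queue/scheduler; do echo cfq > "$s"; done

    # run the backup job (example: VM 108 to a storage called 'backup-nfs')
    vzdump 108 --storage backup-nfs

    # switch back to the scheduler used for normal operation
    for s in /sys/block/sd*/queue/scheduler; do echo deadline > "$s"; done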
     
  13. Nico Haase

    Nico Haase Member

    Joined:
    Feb 27, 2013
    Messages:
    31
    Likes Received:
    1
    Hi there! I have seen some discussions on the developers list, but as far as I can see there is no official fix yet. Do you have any ETA for it?
     
  14. Nico Haase

    Nico Haase Member

    Joined:
    Feb 27, 2013
    Messages:
    31
    Likes Received:
    1
    ....still no news about it?
     
  15. Datenfalke

    Datenfalke New Member

    Joined:
    Jan 26, 2014
    Messages:
    15
    Likes Received:
    3
    Just to add my experience:

    I have the same problems with stalled VMs while they are being backed up. The backup also crashed a VM's LVM/ext4 filesystem one time, so I had to boot a rescue CD and run a manual fsck.

    This is the case for a standard Proxmox 3.1 installation backing up to NFS on a rather slow NAS.
     
  16. cesarpk

    cesarpk Member

    Joined:
    Mar 31, 2012
    Messages:
    770
    Likes Received:
    2
    Hi Dietmar,
    It is a pleasure to greet you again.

    I always read the threads about backup problems, because I have these problems as well.

    Can you explain why e100's test explains basically nothing? I don't understand, since e100 backed his explanation with real benchmark numbers.
    I would be very pleased to hear from you.

    Best regards
    Cesar
     
  17. cesarpk

    cesarpk Member

    Joined:
    Mar 31, 2012
    Messages:
    770
    Likes Received:
    2
    Hi Dietmar, again.

    Just a couple of details:

    1. What if my server running the PVE host doesn't have a free bay for an extra HDD? :( (Then that isn't the better solution.)
    2. And if my server running the PVE host does have a free bay for an extra HDD, it would be great if the Backup tab in the PVE GUI had the option to add two scripts, one to run before the backup and one to run after it ("Veeam Backup" for VMware has this). That way it would be easy to run the necessary hook scripts, and they would obviously run consecutively.
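
    Until something like that lands in the GUI, vzdump can already call a hook script configured via "script:" in /etc/vzdump.conf; a rough skeleton of a before/after hook (the phase names follow the example hook script shipped with Proxmox, so please verify them against your version):

    Code:
    #!/bin/sh
    # vzdump calls this script with the phase as the first argument
    case "$1" in
        job-start)
            # put your pre-backup commands here (e.g. switch the I/O scheduler)
            ;;
        job-end|job-abort)
            # put your post-backup commands here (e.g. switch the scheduler back)
            ;;
    esac
    exit 0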

    Best regards
    Cesar
     
  18. dietmar

    dietmar Proxmox Staff Member
    Staff Member

    Joined:
    Apr 28, 2005
    Messages:
    16,432
    Likes Received:
    299
    We already fixed that issue with e100 (fix will be in next release).
     
  19. cesarpk

    cesarpk Member

    Joined:
    Mar 31, 2012
    Messages:
    770
    Likes Received:
    2
    Thanks Dietmar, I can't wait to get the next release. :D

    Just to know the details of this fix, can you pass me a web link?

    Best regards
    Cesar
     
  20. dietmar

    dietmar Proxmox Staff Member
    Staff Member

    Joined:
    Apr 28, 2005
    Messages:
    16,432
    Likes Received:
    299