Thanks Tom / Dietmar,
I understand you guys haven't been able to reproduce it, but judging by all the related threads it's impacting a lot of users on varied setups. I'll try to detail my particular situation in case it's useful.
I have been running a Proxmox 1.9 cluster on 5 servers, all identical Dell machines; the service tag / hardware inventory can be seen here:
http://www.dell.com/support/troubleshooting/us/en/04/Servicetag/857JSR1 - I was wrong about the age of the machines; they are almost 3 years old now, but still running fine in every respect except for this weird backup issue.
With the release of 3.0 I decided it was time to upgrade the Proxmox versions on those machines, using one of the five as my test deployment platform. I migrated any running VMs off this box to other cluster members and, using an ISO of the 3.0 release, did a clean install. After the install I set up our iSCSI mount, which is where we run all our VMs from (a Dell EqualLogic SAN). Once this was up and running, I moved the VMs back to the box one at a time, doing a vzrestore or qmrestore as needed from the command line until all VMs were back online.
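For reference, the restores were done roughly like this from the command line (the VMIDs, archive names, and storage name below are just placeholders, not my exact values):

    # OpenVZ container from a vzdump archive
    vzrestore /mnt/backups/vzdump-openvz-101.tar.gz 101
    # KVM guest, restored onto the iSCSI-backed storage
    qmrestore /mnt/backups/vzdump-qemu-201.tgz 201 --storage san-lvm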
I created a scheduled backup covering all running VMs (mostly OpenVZ containers and 2-3 KVM VMs). When the scheduled backup runs, it seems to complete 1-3 backups and then just hangs: the load on the server skyrockets to over 1000, VMs stop responding, and they cannot be stopped, nor can the cause of the load be pinpointed. System logs show a lot of non-specific I/O / wait / filesystem messages (several log dumps, mount details, etc. have been posted by me and others in several threads). What's odd is that I can still easily SSH into the server or access the web GUI, but I am unable to do anything with the box until I physically power it down and reboot it.
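For what it's worth, the scheduled job from the GUI ends up as a vzdump line in /etc/pve/vzdump.cron that looks more or less like this (the time, storage name, and options here are illustrative, not copied from my actual config):

    0 1 * * *    root vzdump --all --quiet 1 --mode snapshot --compress lzo --storage backup-store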
At first I assumed it was an NFS issue, so I tried local filesystem backups, with no success. I've tried EXT3 and EXT4, and done the kernel upgrades and scheduler changes you guys have recommended. I have tried backing up only OpenVZ containers (I thought maybe the KVM VMs were the issue), with no luck. I have also attempted a single backup of one OpenVZ container from the command line, only to have it lock the box up solid. All of our VMs are running various versions of CentOS (5 / 6), Ubuntu 12.04, or Ubuntu 12.10.
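In case it helps pin down what I already tried: the scheduler change was the usual runtime switch, and the single-container test was a plain vzdump from the shell (the device name, VMID, and dump directory below are placeholders, not my exact values):

    # switch the I/O scheduler on the disk backing the backup target
    echo deadline > /sys/block/sda/queue/scheduler
    cat /sys/block/sda/queue/scheduler
    # back up a single OpenVZ container to a local directory
    vzdump 101 --mode suspend --compress lzo --dumpdir /var/lib/vz/dump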
Just a couple of months ago I purchased six new Dell R620 servers with 32GB of RAM, local RAID storage, etc. I built an entirely new Proxmox 3.0 cluster with these servers; each server is identical and runs only 4-6 OpenVZ containers (Ubuntu 12.10). I set up recurring backups on these as well, and they seemed to run fine for the first month or so I had them in production. Recently, the same problem I described above started happening on two of these new servers. The system inventory / service tag can be viewed here:
http://www.dell.com/support/troubleshooting/us/en/04/Servicetag/BBM6FX1
There isn't much more detail than that; step by step, it was a simple clean install and migration, or a clean install and new VMs. But I am happy to provide as much specific information as I can if it makes clearing this up any easier. If remote access to an affected system would be helpful, I can provide that as well, keeping in mind that these are production servers, so any testing that might force me to reboot the box after a failure is done at night, when it impacts our users less.
Have a great day and let me know if I can be more helpful.
Cheers,
Joe Jenkins