NFS Backup Failed, Cluster unusable

swordfishman

Background:
Three servers in a cluster with an iSCSI target. Multicast is enabled and verified working. The cluster was fine until recently, when the nightly backup jobs grew from ~250GB to ~500GB. Backups are sent over NFS to another server on an entirely different network. The guests are all still running, but management of the guests and servers seems to be completely broken.

From the look of things, the backups lock the files and then time out during transfer at some point overnight. The next morning I log in to see all three servers listed as offline (red instead of green). I can no longer see the names of the guests, but I can see that they are still there: VMIDs 100 through 140.

pvecm n
Node  Sts   Inc   Joined               Name
   1   M    524   2012-06-07 08:01:07  svr1
   2   M    528   2012-06-07 08:01:08  svr2
   3   M    528   2012-06-07 08:01:08  svr3

pveversion
pve-manager/2.1/f9b0f63a

/etc/init.d/cman stop
/etc/init.d/cman start - no errors!

/etc/init.d/pve-cluster restart - no errors!

On one of the servers I ran ps aux | grep vzdump and killed all of the processes it listed. Now the web panel won't let me log in: it reports login failed for root. Restarting Apache did nothing.

I blanked out the server names, but the guests show up as in the picture, so I cannot tell which is which!

Besides rebooting, what's the best way to get management back?

Capture.PNG
 
I guess your backup target NFS server stalled, and then multiple backup processes (which cannot complete because the NFS server is AWOL) sent the load on the Proxmox servers into orbit. That is why the web interface gets timeouts everywhere and cannot show the information you request.

I had that once too. I restarted the NFS target (a separate NAS just for backups), the load went down, then I stopped every backup in the GUI and did manual backups.

With your vzdump kill you stopped vzdump, but not the master process that invoked vzdump as a child (cron), and that is still hogging resources it isn't using. So your load is most likely still through the roof, hence the unresponsive GUI.

Go reboot. Or kill the (now defunct) cron processes, which will probably leave locks everywhere.
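
To see which processes are actually stuck before deciding what to kill, something like the following may help. It is only a sketch: dstate_procs is a made-up helper name, and it assumes the hung backups show up in state D (uninterruptible sleep), which is common for processes blocked on a stalled NFS mount but is not confirmed for this cluster. The PPID column shows which parent spawned each one.

```shell
#!/bin/sh
# Hypothetical helper: filter `ps -eo pid,ppid,stat,args` output (read on stdin)
# down to the header line plus every process whose state starts with "D"
# (uninterruptible sleep, usually waiting on I/O).
dstate_procs() {
    awk 'NR == 1 || $3 ~ /^D/'
}
```

Run it as ps -eo pid,ppid,stat,args | dstate_procs. Processes in this state usually do not react to an ordinary kill until the I/O they are waiting on completes or fails, which would fit the load staying high after vzdump was killed.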
 
GT1,

It does look like my NFS server had a failure that caused all of this. My only problem with rebooting is that when I log in on the web panel I cannot see any of the currently running guests, and if I shut down the host, all of the guests will be forced to shut down as well. Is there a way to halt the backups when this happens? Or is it a better idea not to lump all the backups together and instead stagger them hour by hour?
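
One possible way to avoid starting backups against a dead target is to probe the mount with a time limit first. The sketch below is an illustration only: nfs_responsive is a made-up function name and /mnt/pve/backup is a placeholder path, not the real storage path from this cluster. It relies on the coreutils timeout command sending SIGKILL to a stat call that hangs; on some kernels or mount options even that may not return, so treat it as a starting point.

```shell
#!/bin/sh
# Hypothetical helper: return 0 if DIR answers a stat call within SECS seconds,
# non-zero otherwise. A stalled NFS share tends to make stat hang rather than
# fail, so the timeout is what turns "hung" into "failed".
nfs_responsive() {
    dir=$1
    secs=${2:-10}
    timeout -s KILL "$secs" stat "$dir" >/dev/null 2>&1
}

# Placeholder mount point (an assumption); substitute the real backup storage path.
BACKUP_DIR=${BACKUP_DIR:-/mnt/pve/backup}

if nfs_responsive "$BACKUP_DIR" 10; then
    echo "backup target OK"
else
    echo "backup target not responding, skipping backup" >&2
fi
```

How to hook such a check in front of the nightly job depends on how the backups are scheduled, so that part is left open here.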
 
