Host stops responding after backup starts

donty

I have an odd one. It could be hardware, but it seems odd that it shows up within 5 minutes or so of a backup starting.

I have an 8-core, 8GB Dell server with 500GB of RAID 5 plus hot spare. I run a backup to an NFS partition that works fine for the other 3 machines in the cluster and has worked for this one in the past. It has 2 W2k3 VMs running on it, about 10GB+ each.

The backup starts at 2:45am and the server stops responding at 2:50am, but not every night and not at regular intervals. The two most recent failures were 10 days apart. It requires a reboot of the host to get it back.

Running PVE 1.5 with 2.6.24-10pve.

No log entry seems to show anything odd around the time of failure, unless I am missing a log file somewhere. Perhaps I need to turn up PVE logging somewhere?

Anyone with any ideas or comments?

Thanks

Donty
 
Hi,
Perhaps the NFS share? Can you try saving the backup to an external disk?
What performance values do you see during the backup (load and I/O wait)?
Which kind of NIC do you use for the NFS connection? (Perhaps a driver problem.)

Udo
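
One way to capture the load and I/O wait figures asked about above is to sample /proc on the host during the backup window. This is only a minimal sketch, assuming a Linux host with the usual /proc/loadavg and /proc/stat layout; the function names are made up for illustration and are not part of PVE.

```python
import time

def read_cpu_times(stat_text):
    """Return the aggregate CPU counters from the text of /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(v) for v in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")

def iowait_percent(before, after):
    """Share of elapsed CPU time spent in iowait between two samples."""
    # Field order: user nice system idle iowait irq softirq steal [guest ...]
    # Only the first eight are summed, since guest time is already
    # counted inside user time.
    deltas = [b - a for a, b in zip(before, after)][:8]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total

def read_load1(loadavg_text):
    """Return the 1-minute load average from the text of /proc/loadavg."""
    return float(loadavg_text.split()[0])

def read_file(path):
    with open(path) as handle:
        return handle.read()

def log_samples(count, interval=5.0):
    """Print a timestamp, the 1-minute load and iowait % for each sample."""
    before = read_cpu_times(read_file("/proc/stat"))
    for _ in range(count):
        time.sleep(interval)
        after = read_cpu_times(read_file("/proc/stat"))
        load1 = read_load1(read_file("/proc/loadavg"))
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        print("%s load1=%.2f iowait=%.1f%%" % (stamp, load1, iowait_percent(before, after)))
        before = after
```

Calling something like log_samples(120, 5.0) from a cron job a minute or two before the backup starts, with the output redirected to a file on local disk rather than the NFS share, should leave a record that survives the reboot.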
 
I have had another instance of the failure today, about a week after the last one. More detailed monitoring showed SNMP reporting a sharp drop in available memory:
Low threshold exceeded for SNMP datasource memAvailReal / memTotalReal * 100.0 on interface x.x.x.x, parms: ds="memAvailReal / memTotalReal * 100.0" value="0.5516650342801175" threshold="5.0" trigger="2" rearm="10.0"
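
To put the alert above into absolute terms, the reported percentage can be turned back into megabytes. This is a rough sketch: the 8192 MB total is an assumption (the alert does not give the actual memTotalReal value), and the function names are only illustrative. The last function shows how the figure would look if buffers and page cache were counted as available, which matters if the agent's memAvailReal tracks free memory only, because a backup reading large disk images can fill the cache quickly.

```python
def avail_percent(mem_avail_real, mem_total_real):
    """The expression from the alert: memAvailReal / memTotalReal * 100.0."""
    return mem_avail_real / mem_total_real * 100.0

def implied_free_mb(percent, total_mb):
    """Convert a reported availability percentage back into megabytes."""
    return percent / 100.0 * total_mb

def avail_percent_with_cache(mem_avail_real, mem_buffer, mem_cached, mem_total_real):
    """Same ratio, but treating buffers and page cache as reclaimable."""
    return (mem_avail_real + mem_buffer + mem_cached) / mem_total_real * 100.0

REPORTED_PERCENT = 0.5516650342801175  # value="..." from the alert
ASSUMED_TOTAL_MB = 8192                # assumption: exactly 8 GB visible to the agent

print("implied free memory: about %.1f MB"
      % implied_free_mb(REPORTED_PERCENT, ASSUMED_TOTAL_MB))
```

Comparing the two percentages at the time of the alert would help tell real memory exhaustion apart from memory that is merely held as cache.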

The system has 8GB of memory and normal resource use (in the server view in the cluster admin) is about 5.5GB, with I/O delays of 0.1-0.15%, 3-5% CPU and currently 16% of disk space used. Backups occur in 'quiet' time and the failure does not happen every time.

Testing with a non-critical VM (a simple small server) while the other VMs are under normal operational load, I see no more than 10% I/O delay and 20% CPU, and memory changed by less than 1GB. NFS is under normal load too.

The servers are both Dell using Broadcom NetXtreme 1Gb NICs, the same as the other cluster servers. All are updated and all are running current 1.5 on 2.6.24-10pve.

The failure always falls in the same time window, so it does seem to happen while a backup to NFS is in progress.

Just wondered what the behaviour should be if such a connection were lost or a similar service failed? A not impossible scenario.

Would you expect the server to completely freeze solid and require a reboot? It seems a very ungraceful way to fail. I can't see anything in the logs that hints at a failure, so I don't have much to go on. The external monitoring just suggests a large memory increase shortly before death.
 
Would you expect the server to completely freeze solid and require a reboot? It seems a very ungraceful way to fail

If the server runs out of memory you are lost. Simply add more RAM, or do not run as many VMs (I have no info about what you run on that server).
 
Hi

Thanks for getting back to me.

There are only 2 VMs on the host with 3.5GB and 2GB assigned. The server has 8GB so leaving 2.5GB available for the underlying host. Is that not sensible?

The VMs are W2k3 and each about 30GB in total, using qcow and an e1000 netcard as hardware.

The host is Dell, 8 core x 2GHz, with 8GB and raid 5+hotspare giving 500GB

I have not put any other cron event on and the backup is about all that is going on at the time, done as a snapshot with compression.

It has been working well for several months up until about 3-4 weeks ago, which is after we applied the recent updates to the kernel and 1.5. I can't say for sure they are directly linked, as it wasn't happening immediately.

What should happen if the NFS service did fail during a backup? Would you expect it to perhaps hang the VM or report a failure? Not kill the host?
 
There are only 2 VMs on the host with 3.5GB and 2GB assigned. The server has 8GB so leaving 2.5GB available for the underlying host. Is that not sensible?

That is OK. You then need to find out who/what uses all the RAM on the host.
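
A simple way to see what is using the RAM is to list the processes with the largest resident set size. The following is a minimal sketch, assuming a Linux /proc filesystem; parse_status and top_rss are hypothetical helper names, not PVE tools.

```python
import os

def parse_status(status_text):
    """Extract (name, rss_kb) from the text of /proc/<pid>/status.

    Kernel threads have no VmRSS line, so they are reported as 0 kB.
    """
    name, rss_kb = "?", 0
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key == "Name":
            name = value.strip()
        elif key == "VmRSS":
            rss_kb = int(value.split()[0])
    return name, rss_kb

def top_rss(proc_root="/proc", limit=10):
    """Return up to `limit` (rss_kb, pid, name) tuples, largest first."""
    rows = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "status")) as handle:
                name, rss_kb = parse_status(handle.read())
        except OSError:
            continue  # the process exited while we were scanning
        rows.append((rss_kb, int(entry), name))
    rows.sort(reverse=True)
    return rows[:limit]

def format_rows(rows):
    """Render the rows as one 'rss_kb pid name' line each."""
    return "\n".join("%10d kB  pid %-6d %s" % row for row in rows)
```

Printing format_rows(top_rss()) every few seconds during the backup window, to a file on local disk, would show whether a VM process, the backup itself, or something else is growing. Memory held by the kernel (for example slab or page cache) is not attributed to any process, so /proc/meminfo is worth recording alongside it.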

What should happen if NFS service did fail during a backup? Would you expect it to perhaps hang the VM or report a failure? Not kill the host?

It should report a failure.