PVE 1.6 Cluster master behaving strange

Thread starter: TiagoRF (Guest)
Afternoon people!

I've been noticing some strange behavior from the cluster master:

- I can't access it through the web interface; it says wrong login or password.

When I run pveca -l on it:

spr:~# pveca -l
CID----IPADDRESS----ROLE-STATE--------UPTIME---LOAD----MEM---DISK
 1 : 10.0.2.101     M    ERROR: 500 read timeout
 2 : 10.0.2.102     N    S     10 days 18:20   0.47    57%   3%


And the load on the master:

14:44:31 up 10 days, 18:21, 1 user, load average: 6.55, 6.04, 5.88

Compared to the node, that's huge:

14:46:14 up 10 days, 18:23, load average: 0.25, 0.30, 0.33

That's the node!

In the web interface, viewed through the node, we can see this:

Hostname  IP Address   Role    State    Uptime          Load   CPU   IODelay   Memory   Disk
spr       10.0.2.101   Master  ERROR: 500 read timeout
sse       10.0.2.102   Node    nosync   10 days 18:24   0.26   0%    0%        57%      3%

Pretty awkward. Is rebooting by any means a good idea?
 
I think I just found the cause:

spr:~# ps x | grep vzdump
5723 ? Ds 0:00 /usr/bin/perl -w /usr/sbin/vzdump --quiet --node 1 --snapshot --compress --storage Backups --mailto xpto@xpto.org 101
18655 ? Ds 0:00 /usr/bin/perl -w /usr/sbin/vzdump --quiet --node 1 --suspend --compress --storage Backups --mailto xpto@xpto.org 103

The processes are hung!
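The `Ds` in the STAT column means those vzdump processes are in uninterruptible sleep (D state): blocked inside the kernel, usually on I/O, immune to SIGKILL, and each one counts toward the load average even while using no CPU, which fits the high load on an otherwise idle master. A quick sketch to list such processes (the `find_dstate` helper name is mine, just for illustration, not a PVE tool):

```shell
# find_dstate: filter `ps` output for processes whose state starts
# with "D" (uninterruptible sleep). These are blocked in the kernel,
# usually on disk or network storage I/O, ignore SIGKILL, and each
# one counts toward the load average.
find_dstate() {
    # skip the header line, keep rows whose STAT column begins with D
    awk 'NR > 1 && $2 ~ /^D/'
}

ps -eo pid,stat,args | find_dstate
```

They usually clear only when the blocked I/O completes or fails, which is why they can't simply be killed.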
 
any logs in /var/log/vzdump regarding these jobs?
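A quick way to check that, assuming the default log location named above (the `show_vzdump_logs` helper is just an illustration, not a PVE tool):

```shell
# vzdump keeps one log file per backup job; list the newest entries
# in the log directory (default /var/log/vzdump) if it exists.
show_vzdump_logs() {
    logdir=${1:-/var/log/vzdump}
    if [ -d "$logdir" ]; then
        ls -lt "$logdir" | head -n 5   # newest logs first
    else
        echo "no $logdir directory"
        return 1
    fi
}

show_vzdump_logs || true
```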
 
15:05:54 up 10 days, 18:42, 1 user, load average: 3.50, 3.90, 4.66

The load is coming back to normal with time, yet the cluster sync seems to be gone at the moment.

DRBD is fine nonetheless.
 
Not a single log, Tom.

I had that a few times a while ago, when we were between the 1.6.5121 and 1.6.5261 releases, while waiting for the 2.6.35 kernel with KSM.
Then twice again after reinstalling from the newer ISO, using 2.6.35.

I couldn't reproduce it on demand; it seemed to happen on its own, at random.
I had to use the power button on the host each time.

Then it just stopped happening.
Since around that time, for unrelated reasons, I started using these two packages from testing, and I still run the 2.6.35 kernel.
They've been fine for over a week now, probably more like two or three.

Code:
Package: pve-qemu-kvm
Pin: release c=pvetest
Pin-Priority: 900

Package: qemu-server
Pin: release c=pvetest
Pin-Priority: 900
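For anyone wanting to try the same pin: that fragment is an apt preferences file (e.g. /etc/apt/preferences). Once it is in place, `apt-cache policy` should show the two packages' candidate versions coming from the pvetest component at priority 900. A guarded sketch (the `check_pins` wrapper name is mine, just for illustration):

```shell
# Print apt's version resolution for the pinned packages, so the
# 900 priority from the preferences file can be verified. Guarded,
# so it degrades gracefully on systems without apt.
check_pins() {
    if command -v apt-cache >/dev/null 2>&1; then
        apt-cache policy "$@"
    else
        echo "apt-cache not available"
    fi
}

check_pins pve-qemu-kvm qemu-server || true
```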