All nodes red - but quorum - cannot find any error

felipe

Well-Known Member
Oct 28, 2013
Hello,

yesterday all nodes went red in the GUI; each node only shows itself as green.

I already tried restarting cman, pvestatd and pvedaemon - everything restarts without errors on every node, but nothing changes.
I even tried rebooting one node...
I can also still write to /etc/pve/...
All NFS shares (images and backups) are working.

I am using pve-manager 3.3-1 to 3.3-5 across the nodes.

All servers show this:

root@node6:~# cat /etc/pve/.members
{
"nodename": "node6",
"version": 2,
"cluster": { "name": "cluster01", "version": 11, "nodes": 11, "quorate": 1 },
"nodelist": {
"node1": { "id": 1, "online": 1},
"node7": { "id": 2, "online": 1},
"node2": { "id": 3, "online": 1},
"node5": { "id": 4, "online": 1},
"node4": { "id": 5, "online": 1},
"node6": { "id": 7, "online": 1},
"node8": { "id": 8, "online": 1},
"node3": { "id": 6, "online": 1},
"ceph1": { "id": 9, "online": 1},
"ceph2": { "id": 10, "online": 1},
"ceph3": { "id": 11, "online": 1}
}
}


root@ceph1:~# pvecm status
Version: 6.2.0
Config Version: 11
Cluster Name: cluster01
Cluster Id: 53601
Cluster Member: Yes
Cluster Generation: 3472
Membership state: Cluster-Member
Nodes: 11
Expected votes: 11
Total votes: 11
Node votes: 1
Quorum: 6
Active subsystems: 5
Flags:
Ports Bound: 0
Node name: ceph1
Node ID: 9
Multicast addresses: 239.192.209.51
Node addresses: 192.168.11.31




root@ceph1:~# clustat
Cluster Status for cluster01 @ Tue Mar 3 23:52:14 2015
Member Status: Quorate


Member Name ID Status
------ ---- ---- ------
node1 1 Online
node7 2 Online
node2 3 Online
node5 4 Online
node4 5 Online
node3 6 Online
node6 7 Online
node8 8 Online
ceph1 9 Online, Local
ceph2 10 Online
ceph3 11 Online
 
Seems to be a problem with pvestatd. Any hints in /var/log/syslog?

Please verify that all storages are online:

# pvesm status

Problems with non-accessible storage can introduce long delays in pvestatd ....
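
For example (just a generic sketch; /var/log/syslog is the default log location on PVE 3.x):

# grep pvestatd /var/log/syslog | tail -n 50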
 
Now I found this in the log of one server (the one I think is causing the problems):

Mar 4 12:11:55 node6 pvestatd[3224]: status update time (8.769 seconds)
Mar 4 12:12:03 node6 pvestatd[3224]: status update time (5.337 seconds)
Mar 4 12:12:13 node6 pvestatd[3224]: status update time (5.921 seconds)
Mar 4 12:12:22 node6 pvestatd[3224]: status update time (5.740 seconds)
Mar 4 12:12:35 node6 pvestatd[3224]: status update time (7.512 seconds)
Mar 4 12:12:44 node6 pvestatd[3224]: status update time (6.793 seconds)
Mar 4 12:12:53 node6 pvestatd[3224]: status update time (6.489 seconds)
Mar 4 12:13:06 node6 pvestatd[3224]: status update time (8.554 seconds)
Mar 4 12:13:14 node6 pvestatd[3224]: status update time (7.650 seconds)


root@node6:/etc/pve# pvesm status
backups nfs 1 11505827840 9002877952 1918464000 82.93%
guest1 lvm 1 1888657408 0 155099136 0.50%
guest2 lvm 1 1953112064 0 736763904 0.50%
images nfs 1 11505827840 9002877952 1918464000 82.93%
local dir 1 47929224 2076332 43395140 5.07%
pool2replica rbd 1 0 0 0 100.00%
rbd rbd 1 0 0 0 100.00%

pool2replica should not be 0! On all other nodes it shows values. rbd I never configured, but it never caused an issue...
The strange thing is that in the GUI of node6 I can view pool2replica and see its contents; it is just quite slow to open. Generally the server has a high I/O wait (around 10-15) at the moment...

Why does pvestatd show no values on all nodes if just one node has some kind of problem?
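
(For reference, the pool usage and the I/O wait can also be checked directly on the node - rados df comes with Ceph, iostat with the sysstat package:)
Code:
rados df        # per-pool usage as Ceph itself reports it
iostat -x 2 5   # per-disk utilisation and await times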
 
I saw this happening before. Check your network cabling.

It is an easy thing to check.

Serge
 
Network cabling?
You mean that something broke (hardware)?

Because ping works fine, and so does multicast ping.
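
(For completeness, real multicast between the nodes can be verified with omping - it has to be installed and started on two or more nodes at roughly the same time; the hostnames below are just examples:)
Code:
omping -c 600 -i 1 -q node6 ceph1 ceph2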
 
root@node6:~# cat /etc/pve/.members
{
"nodename": "node6",
"version": 2,
"cluster": { "name": "cluster01", "version": 11, "nodes": 11, "quorate": 1 },
"nodelist": {
"node1": { "id": 1, "online": 1},
"node7": { "id": 2, "online": 1},
"node2": { "id": 3, "online": 1},
"node5": { "id": 4, "online": 1},
"node4": { "id": 5, "online": 1},
"node6": { "id": 7, "online": 1},
"node8": { "id": 8, "online": 1},
"node3": { "id": 6, "online": 1},
"ceph1": { "id": 9, "online": 1},
"ceph2": { "id": 10, "online": 1},
"ceph3": { "id": 11, "online": 1}
}
}

One difference I can see is that after "online": 1 the "ip": xxxx entry is missing.
At least on my test cluster I can see that entry.
 
Yes, I mean hardware.

Just check the connections (remove and re-connect; make sure the plug "clicks" and you have a link light). Ping and multicast may still work even if a cable is not seated correctly.

Also, check whether the interfaces show any errors.

Just saying. It is an easy thing to check.
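
For example, the error counters can be read straight off the interfaces (eth0 is a placeholder for whatever NIC carries the cluster traffic; the ethtool output differs per driver):
Code:
ip -s link show eth0             # RX/TX errors and dropped counters
ethtool -S eth0 | grep -i err    # NIC/driver specific error statistics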
 
Do you use NIC bonding, especially balance-rr (round robin)? If you do, shut down all interfaces in the bond group except one to retain connectivity. Then try again: restart cman and rgmanager and check with clustat.

If this works, change the bond mode to balance-tlb (5), put all interfaces back in and try again. Check the bond status with cat /proc/net/bonding/bond0
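
Roughly like this, assuming a bond0 with eth0/eth1 as slaves (interface names are placeholders; the mode change itself goes into /etc/network/interfaces as bond-mode balance-tlb before restarting networking):
Code:
ifenslave -d bond0 eth1          # take one slave out, leave a single link up
/etc/init.d/cman restart
/etc/init.d/rgmanager restart
clustat
cat /proc/net/bonding/bond0      # shows the active mode and slave states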
 
Three servers use LACP; the others are all directly connected.

Does this problem mean that all servers have some (network?) problem, or can just one server cause pvestatd to stop showing values on the whole cluster?
 
Three servers use LACP; the others are all directly connected.

Does this problem mean that all servers have some (network?) problem, or can just one server cause pvestatd to stop showing values on the whole cluster?

I'm pretty new to PVE, so I'm not sure. But I had the same problem in a three-node cluster with balance-rr bonding. The information I found about the problem pointed in the network-connectivity direction, and that turned out to be true in my case. For you, I can't tell, but try removing the LACP config. Besides bonding, it could be cabling or even your network switch.
 
Hello

I ran into the exact same issue on a new cluster.

It is a 3-node cluster using local ZFS; the operating system is installed on ZFS.

Felipe: are you using ZFS?

The redness started when I was doing some backups and restores. Some of those went slow due to an issue with the NFS server - I later got an email from the RAID controller on the NFS box complaining about a drive timeout.
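
(A slow or hanging NFS storage can be spotted quickly with something like the following - /mnt/pve/backups is just a placeholder for the storage's mount point:)
Code:
time timeout 10 ls /mnt/pve/backups   # should return almost instantly
nfsstat -c                            # NFS client RPC/retransmission statistics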

I tried doing this on all nodes:
Code:
/etc/init.d/pve-manager restart
/etc/init.d/pve-cluster restart
/etc/init.d/pvedaemon restart
/etc/init.d/pvestatd restart

/etc/init.d/cman restart
That did not fix the cluster. Output:
Code:
Restarting pve cluster filesystem: pve-cluster.
Restarting PVE Daemon: pvedaemon.
Restarting PVE Status Daemon: pvestatd.
Stopping cluster: 
   Stopping dlm_controld... [  OK  ]
   Stopping fenced... [  OK  ]
   Stopping cman... [  OK  ]
   Waiting for corosync to shutdown:[  OK  ]
   Unloading kernel modules... [  OK  ]
   Unmounting configfs... [  OK  ]
Starting cluster: 
   Checking if cluster has been disabled at boot... [  OK  ]
   Checking Network Manager... [  OK  ]
   Global setup... [  OK  ]
   Loading kernel modules... [  OK  ]
   Mounting configfs... [  OK  ]
   Starting cman... [  OK  ]
   Waiting for quorum... Timed-out waiting for cluster
[FAILED]


On each node, pvecm nodes shows only the local host as being in the cluster. For instance:
Code:
dell1  ~ # pvecm nodes
Node  Sts   Inc   Joined               Name
   1   M     16   2015-03-06 16:41:59  dell1
   2   X     20                        dell2
   3   X     20                        srv4


Any suggestions on how to get the cluster back?
 
I could not find the issue. I think maybe it was related to some NFS share... but why it did not go away - no idea.
I solved it by updating all the nodes to the newest version.
Generally NFS shares should always work (especially when they are hard mounted), but after having a problem it is very difficult to find the reason, or sometimes to fix it.
Maybe you need to run

/etc/init.d/pve-manager restart
/etc/init.d/pve-cluster restart
/etc/init.d/pvedaemon restart
/etc/init.d/pvestatd restart

/etc/init.d/cman restart

on all machines. I definitely had no network problem, but you can also check for that...
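
(The update itself was just the usual apt run on each node, roughly like this - assuming the repositories are already set up:)
Code:
apt-get update
apt-get dist-upgrade
pveversion -v     # check the resulting package versions afterwards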
 
