All nodes red - but quorum - cannot find any error

felipe

Well-Known Member
Oct 28, 2013
Hello,

yesterday all nodes went red in the GUI; each node only shows itself as green.

I already tried restarting cman, pvestatd and pvedaemon - everything restarts without errors on every node, but nothing changes.
I even tried rebooting one node...
I can also still write to /etc/pve/...
All NFS shares (images and backups) are working.

I am using pve-manager 3.3-1 to 3.3-5 across the nodes.

All servers show this:

root@node6:~# cat /etc/pve/.members
{
"nodename": "node6",
"version": 2,
"cluster": { "name": "cluster01", "version": 11, "nodes": 11, "quorate": 1 },
"nodelist": {
"node1": { "id": 1, "online": 1},
"node7": { "id": 2, "online": 1},
"node2": { "id": 3, "online": 1},
"node5": { "id": 4, "online": 1},
"node4": { "id": 5, "online": 1},
"node6": { "id": 7, "online": 1},
"node8": { "id": 8, "online": 1},
"node3": { "id": 6, "online": 1},
"ceph1": { "id": 9, "online": 1},
"ceph2": { "id": 10, "online": 1},
"ceph3": { "id": 11, "online": 1}
}
}


root@ceph1:~# pvecm status
Version: 6.2.0
Config Version: 11
Cluster Name: cluster01
Cluster Id: 53601
Cluster Member: Yes
Cluster Generation: 3472
Membership state: Cluster-Member
Nodes: 11
Expected votes: 11
Total votes: 11
Node votes: 1
Quorum: 6
Active subsystems: 5
Flags:
Ports Bound: 0
Node name: ceph1
Node ID: 9
Multicast addresses: 239.192.209.51
Node addresses: 192.168.11.31




root@ceph1:~# clustat
Cluster Status for cluster01 @ Tue Mar 3 23:52:14 2015
Member Status: Quorate


Member Name ID Status
------ ---- ---- ------
node1 1 Online
node7 2 Online
node2 3 Online
node5 4 Online
node4 5 Online
node3 6 Online
node6 7 Online
node8 8 Online
ceph1 9 Online, Local
ceph2 10 Online
ceph3 11 Online
 
Seems to be a problem with pvestatd. Any hints in /var/log/syslog?

Please verify that all storages are online:

# pvesm status

Problems with non-accessible storage can introduce long delays in pvestatd ....
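
For example (just a generic sketch; /var/log/syslog is the default log location on PVE 3.x):

# grep pvestatd /var/log/syslog | tail -n 50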
 
Now I found this in the log of one server (the one I think is causing the problems):

Mar 4 12:11:55 node6 pvestatd[3224]: status update time (8.769 seconds)
Mar 4 12:12:03 node6 pvestatd[3224]: status update time (5.337 seconds)
Mar 4 12:12:13 node6 pvestatd[3224]: status update time (5.921 seconds)
Mar 4 12:12:22 node6 pvestatd[3224]: status update time (5.740 seconds)
Mar 4 12:12:35 node6 pvestatd[3224]: status update time (7.512 seconds)
Mar 4 12:12:44 node6 pvestatd[3224]: status update time (6.793 seconds)
Mar 4 12:12:53 node6 pvestatd[3224]: status update time (6.489 seconds)
Mar 4 12:13:06 node6 pvestatd[3224]: status update time (8.554 seconds)
Mar 4 12:13:14 node6 pvestatd[3224]: status update time (7.650 seconds)


root@node6:/etc/pve# pvesm status
backups nfs 1 11505827840 9002877952 1918464000 82.93%
guest1 lvm 1 1888657408 0 155099136 0.50%
guest2 lvm 1 1953112064 0 736763904 0.50%
images nfs 1 11505827840 9002877952 1918464000 82.93%
local dir 1 47929224 2076332 43395140 5.07%
pool2replica rbd 1 0 0 0 100.00%
rbd rbd 1 0 0 0 100.00%

pool2replica should not be 0! On all other nodes it shows values. rbd I never configured, but it never caused an issue...
The strange thing is that in the GUI of node6 I can view pool2replica and see its contents; it is just quite slow to open. Generally the server has a high I/O wait (around 10-15) at the moment...

Why does pvestatd show no values on all nodes if just one node has some kind of problem?
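
(For reference, the pool usage and the I/O wait can also be checked directly on the node - rados df comes with Ceph, iostat with the sysstat package:)
Code:
rados df        # per-pool usage as Ceph itself reports it
iostat -x 2 5   # per-disk utilisation and await times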
 
I saw this happening before. Check your network cabling.

It is an easy thing to check.

Serge
 
Network cabling?
You mean that something broke (hardware)?

Because ping works fine, and so does multicast ping.
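
(For completeness, real multicast between the nodes can be verified with omping - it has to be installed and started on two or more nodes at roughly the same time; the hostnames below are just examples:)
Code:
omping -c 600 -i 1 -q node6 ceph1 ceph2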
 
root@node6:~# cat /etc/pve/.members
{
"nodename": "node6",
"version": 2,
"cluster": { "name": "cluster01", "version": 11, "nodes": 11, "quorate": 1 },
"nodelist": {
"node1": { "id": 1, "online": 1},
"node7": { "id": 2, "online": 1},
"node2": { "id": 3, "online": 1},
"node5": { "id": 4, "online": 1},
"node4": { "id": 5, "online": 1},
"node6": { "id": 7, "online": 1},
"node8": { "id": 8, "online": 1},
"node3": { "id": 6, "online": 1},
"ceph1": { "id": 9, "online": 1},
"ceph2": { "id": 10, "online": 1},
"ceph3": { "id": 11, "online": 1}
}
}

One difference I can see is that after "online": 1 the "ip": xxxx entry is missing.
At least on my test cluster I can see that entry.
 
Yes, I mean hardware.

Just check the connections (remove and re-connect; make sure the plug "clicks" and you have a link light). Ping and multicast may still work even if a cable is not seated correctly.

Also, check whether the interfaces show any errors.

Just saying. It is an easy thing to check.
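
For example, the error counters can be read straight off the interfaces (eth0 is a placeholder for whatever NIC carries the cluster traffic; the ethtool output differs per driver):
Code:
ip -s link show eth0             # RX/TX errors and dropped counters
ethtool -S eth0 | grep -i err    # NIC/driver specific error statistics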
 
Do you use NIC bonding, especially balance-rr (round robin)? If you do, shut down all interfaces in the bond group except one to retain connectivity. Then try again: restart cman and rgmanager and check with clustat.

If this works, change the bond mode to balance-tlb (5), put all interfaces back in and try again. Check the bond status with cat /proc/net/bonding/bond0
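
Roughly like this, assuming a bond0 with eth0/eth1 as slaves (interface names are placeholders; the mode change itself goes into /etc/network/interfaces as bond-mode balance-tlb before restarting networking):
Code:
ifenslave -d bond0 eth1          # take one slave out, leave a single link up
/etc/init.d/cman restart
/etc/init.d/rgmanager restart
clustat
cat /proc/net/bonding/bond0      # shows the active mode and slave states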
 
Three servers use LACP; the others are all directly connected.

Does this problem mean that all servers have some (network?) problem, or can just one server cause pvestatd to stop showing values on the whole cluster?
 
Three servers use LACP; the others are all directly connected.

Does this problem mean that all servers have some (network?) problem, or can just one server cause pvestatd to stop showing values on the whole cluster?

I'm pretty new to PVE, so I'm not sure. But I had the same problem in a three-node cluster with balance-rr bonding. The information I found about the problem pointed in the network-connectivity direction, and that turned out to be true in my case. For you, I can't tell, but try removing the LACP config. Besides bonding, it could be cabling or even your network switch.
 
Hello

I ran into the exact same issue on a new cluster.

It is a 3-node cluster using local ZFS; the operating system is installed on ZFS.

Felipe: are you using ZFS?

The redness started when I was doing some backups and restores. Some of those went slow due to an issue with the NFS server - I later got an email from the RAID controller on the NFS box complaining about a drive timeout.
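
(A slow or hanging NFS storage can be spotted quickly with something like the following - /mnt/pve/backups is just a placeholder for the storage's mount point:)
Code:
time timeout 10 ls /mnt/pve/backups   # should return almost instantly
nfsstat -c                            # NFS client RPC/retransmission statistics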

I tried doing this on all nodes:
Code:
/etc/init.d/pve-manager restart
/etc/init.d/pve-cluster restart
/etc/init.d/pvedaemon restart
/etc/init.d/pvestatd restart

/etc/init.d/cman restart
That did not fix the cluster. Output:
Code:
Restarting pve cluster filesystem: pve-cluster.
Restarting PVE Daemon: pvedaemon.
Restarting PVE Status Daemon: pvestatd.
Stopping cluster: 
   Stopping dlm_controld... [  OK  ]
   Stopping fenced... [  OK  ]
   Stopping cman... [  OK  ]
   Waiting for corosync to shutdown:[  OK  ]
   Unloading kernel modules... [  OK  ]
   Unmounting configfs... [  OK  ]
Starting cluster: 
   Checking if cluster has been disabled at boot... [  OK  ]
   Checking Network Manager... [  OK  ]
   Global setup... [  OK  ]
   Loading kernel modules... [  OK  ]
   Mounting configfs... [  OK  ]
   Starting cman... [  OK  ]
   Waiting for quorum... Timed-out waiting for cluster
[FAILED]


On each node, pvecm nodes shows only the local host as being in the cluster. For instance:
Code:
dell1  ~ # pvecm nodes
Node  Sts   Inc   Joined               Name
   1   M     16   2015-03-06 16:41:59  dell1
   2   X     20                        dell2
   3   X     20                        srv4


Any suggestions on how to get the cluster back?
 
I could not find the issue. I think maybe it was related to some NFS share... but why it did not go away - no idea.
I solved it by updating all the nodes to the newest version.
Generally NFS shares should always work (especially when they are hard mounted), but after having a problem it is very difficult to find the reason, or sometimes to fix it.
Maybe you need to run

/etc/init.d/pve-manager restart
/etc/init.d/pve-cluster restart
/etc/init.d/pvedaemon restart
/etc/init.d/pvestatd restart

/etc/init.d/cman restart

on all machines. I definitely had no network problem, but you can also check for that...
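
(The update itself was just the usual apt run on each node, roughly like this - assuming the repositories are already set up:)
Code:
apt-get update
apt-get dist-upgrade
pveversion -v     # check the resulting package versions afterwards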
 
