[SOLVED] Nodes x'd out in GUI, still show up in pvecm

alchemycs

I've been having an issue with my cluster for a while now: a random node drops out of the GUI and then either comes back on its own or comes back after I restart pvesr and pvedaemon. Now, however, I have two nodes that won't come back, and it's a bit of a problem.

All the nodes show the same information when I run pvecm status:

Code:
#~ pvecm status
Quorum information
------------------
Date:             Thu Jul 11 17:09:19 2019
Quorum provider:  corosync_votequorum
Nodes:            7
Node ID:          0x00000002
Ring ID:          7/354724
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   7
Highest expected: 7
Total votes:      7
Quorum:           4
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000007          1 10.0.0.17
0x00000001          1 10.0.0.13
0x00000002          1 10.0.0.14 (local)
0x00000009          1 10.0.0.15
0x00000008          1 10.0.0.16
0x00000006          1 10.0.0.18
0x00000005          1 10.0.0.19

But there are two (.14 and .19) which show up as red in the cluster GUI, and both claim not to have a quorum when I try to use their GUIs.

I did add a new switch to the mix and change how the nodes bond (active/backup), but that was months ago, after the dropout behaviour started. FWIW, this cluster has been up for quite a while now.
I did set up multicast on the new switch and while I didn't test before, now omping mostly just returns:
omping: Can't find local address in arguments
even though the local IPs are all in /etc/hosts

Any thoughts/suggestions?
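
For readers hitting the same omping error: as far as I know, omping expects the address of the machine it runs on to be among its arguments, and it complains when none of them matches a local interface. The sketch below is a hypothetical pre-check (the function name `local_in_list` is made up for illustration, and the node list is taken from the pvecm output above); it is not part of omping itself.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the first argument (this node's
# address) also appears among the remaining arguments (the omping targets).
local_in_list() {
    local_ip=$1
    shift
    for ip in "$@"; do
        if [ "$ip" = "$local_ip" ]; then
            return 0
        fi
    done
    return 1
}

# Cluster addresses as listed by pvecm status in this thread.
NODES="10.0.0.13 10.0.0.14 10.0.0.15 10.0.0.16 10.0.0.17 10.0.0.18 10.0.0.19"

# Usage on node .14 (commented out so that pasting this block runs nothing):
#   if local_in_list 10.0.0.14 $NODES; then omping $NODES; fi
```

If the check fails, the likely causes are a missing local address in the list or a hostname that resolves to something other than the cluster interface; both are guesses to verify on the node.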
 
Looks like multicast is broken. Did this issue appear after changing the switch?
What do the other nodes (not .14 and .19) see when you run 'pvecm status'?
 
Can you post the output of 'systemctl status pve-cluster' on both nodes?
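
A minimal sketch of running that check on both affected nodes from one place. It assumes root SSH access between nodes, which the thread does not confirm; `run` and `check_pve_cluster` are illustrative names, and the script only prints the commands unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Sketch only: node addresses come from the thread; root SSH access is an assumption.
DRY_RUN=${DRY_RUN:-1}   # default: print commands instead of running them

# Run a command, or just print it when DRY_RUN is 1.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Query the pve-cluster service state on one node.
check_pve_cluster() {
    run ssh "root@$1" systemctl status pve-cluster --no-pager
}

# The two nodes that show up red in the GUI.
for node in 10.0.0.14 10.0.0.19; do
    check_pve_cluster "$node"
done
```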
 
That was it - the pve-cluster had crashed on those two servers! I can't believe I didn't check that!

But, I'm still curious if you might have any thoughts on what might be causing the pvedaemon (or in this case the pve-cluster) to crash on a somewhat regular basis (maybe once a week or so?)

I did finally look up how to run omping, and when I ran this omping command on all the servers in the cluster, this is what I got back:

Code:
# omping 10.0.0.13 10.0.0.14 10.0.0.15 10.0.0.16 10.0.0.18 10.0.0.19 10.0.0.17
....
10.0.0.13 :   unicast, xmt/rcv/%loss = 234/234/0%, min/avg/max/std-dev = 0.045/0.116/0.177/0.023
10.0.0.13 : multicast, xmt/rcv/%loss = 234/234/0%, min/avg/max/std-dev = 0.050/0.132/0.195/0.024
10.0.0.14 :   unicast, xmt/rcv/%loss = 236/236/0%, min/avg/max/std-dev = 0.030/0.113/0.210/0.028
10.0.0.14 : multicast, xmt/rcv/%loss = 236/236/0%, min/avg/max/std-dev = 0.036/0.133/0.223/0.030
10.0.0.15 :   unicast, xmt/rcv/%loss = 238/238/0%, min/avg/max/std-dev = 0.038/0.102/0.231/0.039
10.0.0.15 : multicast, xmt/rcv/%loss = 238/238/0%, min/avg/max/std-dev = 0.044/0.112/0.237/0.039
10.0.0.18 :   unicast, xmt/rcv/%loss = 236/236/0%, min/avg/max/std-dev = 0.068/0.119/0.605/0.041
10.0.0.18 : multicast, xmt/rcv/%loss = 236/236/0%, min/avg/max/std-dev = 0.071/0.129/0.606/0.041
10.0.0.19 :   unicast, xmt/rcv/%loss = 234/234/0%, min/avg/max/std-dev = 0.047/0.102/0.148/0.017
10.0.0.19 : multicast, xmt/rcv/%loss = 234/234/0%, min/avg/max/std-dev = 0.053/0.111/0.159/0.018
10.0.0.17 :   unicast, xmt/rcv/%loss = 210/210/0%, min/avg/max/std-dev = 0.071/0.193/0.356/0.048
10.0.0.17 : multicast, xmt/rcv/%loss = 210/210/0%, min/avg/max/std-dev = 0.080/0.202/0.361/0.048

Thank you!
 
Check the logs; if it crashes, there should be something in there. You could also start pmxcfs in the foreground in debug mode.
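
Roughly, that could look like the sketch below. The keyword list in `filter_crash_lines` (an illustrative name) is only a guess at what a crash might leave in the journal, not a known pmxcfs signature, and the time window is arbitrary. The commands that touch the system are left as comments, since stopping pve-cluster takes /etc/pve offline on that node.

```shell
#!/bin/sh
# Sketch only: the keyword list is a guess, not a known crash signature.

# Keep only lines that look like a crash or service failure (reads stdin).
filter_crash_lines() {
    grep -Ei 'segfault|core dump|failed|exited|crit'
}

# Usage on an affected node (commented out so that pasting this block changes nothing):
#   journalctl -u pve-cluster -u corosync --since "7 days ago" --no-pager | filter_crash_lines
#
# To watch pmxcfs directly, stop the service and run it in the
# foreground (-f) with debug messages (-d):
#   systemctl stop pve-cluster
#   pmxcfs -f -d
```

Including corosync in the journal query is a design choice: pmxcfs depends on it, so the two logs are easier to read side by side.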
 
