[SOLVED] Nodes x'd out in GUI, still show up in pvecm

Discussion in 'Proxmox VE: Installation and configuration' started by alchemycs, Jul 12, 2019.

  1. alchemycs

    alchemycs Member

    Joined:
    Dec 6, 2011
    Messages:
    34
    Likes Received:
    7
    I've been having an issue with my cluster for a little while now where a random node drops out of the GUI and then either comes back on its own, or comes back once I restart pvesr and pvedaemon. However, now I have two nodes that don't want to come back, and that's a bit of a problem.
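
    For reference, the restart I mean is just the corresponding systemd units on the node that dropped out, something like:

    Code:
    systemctl restart pvesr pvedaemon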

    All the nodes show the same information when I run pvecm status:

    Code:
    #~ pvecm status
    Quorum information
    ------------------
    Date:             Thu Jul 11 17:09:19 2019
    Quorum provider:  corosync_votequorum
    Nodes:            7
    Node ID:          0x00000002
    Ring ID:          7/354724
    Quorate:          Yes
    
    Votequorum information
    ----------------------
    Expected votes:   7
    Highest expected: 7
    Total votes:      7
    Quorum:           4
    Flags:            Quorate
    
    Membership information
    ----------------------
        Nodeid      Votes Name
    0x00000007          1 10.0.0.17
    0x00000001          1 10.0.0.13
    0x00000002          1 10.0.0.14 (local)
    0x00000009          1 10.0.0.15
    0x00000008          1 10.0.0.16
    0x00000006          1 10.0.0.18
    0x00000005          1 10.0.0.19
    
    But there are two (.14 and .19) which show up as red in the cluster GUI, and both claim not to have a quorum when I try to use their GUIs.

    I did add a new switch to the mix and change how the nodes bond (active/backup), but that was months ago, after the dropout behaviour started. FWIW, this cluster has been up for quite a while now.
    I did set up multicast on the new switch, and while I didn't test before the change, now omping mostly just returns:
    omping: Can't find local address in arguments
    even though the local IPs are all in /etc/hosts.

    Any thoughts/suggestions?
     
  2. mira

    mira Proxmox Staff Member
    Staff Member

    Joined:
    Feb 25, 2019
    Messages:
    166
    Likes Received:
    14
    Looks like multicast is broken. Did this issue appear after changing the switch?
    What do the other nodes (not .14 and .19) see when you run 'pvecm status'?
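
    Regarding the omping error: 'Can't find local address in arguments' usually means that none of the addresses on the command line belong to the node you ran it on. Include every node's cluster IP (the local one too) and run the same command on all nodes at the same time, something like:

    Code:
    omping 10.0.0.13 10.0.0.14 10.0.0.15 10.0.0.16 10.0.0.17 10.0.0.18 10.0.0.19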
     
  3. alchemycs

    alchemycs Member

    Joined:
    Dec 6, 2011
    Messages:
    34
    Likes Received:
    7
    This started appearing before the switch change.
    All the nodes show the exact same results for 'pvecm status' (except, of course, for which node is marked as local).
     
  4. mira

    mira Proxmox Staff Member
    Staff Member

    Joined:
    Feb 25, 2019
    Messages:
    166
    Likes Received:
    14
    Can you post the output of 'systemctl status pve-cluster' on both nodes?
     
  5. alchemycs

    alchemycs Member

    Joined:
    Dec 6, 2011
    Messages:
    34
    Likes Received:
    7
    That was it - the pve-cluster service had crashed on those two servers! I can't believe I didn't check that!

    But I'm still curious whether you have any thoughts on what might be causing pvedaemon (or in this case pve-cluster) to crash on a somewhat regular basis (maybe once a week or so)?
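
    For anyone else who finds this: restarting the service should be all it takes to bring the nodes back, something like:

    Code:
    # on each affected node
    systemctl restart pve-cluster
    # and, if the GUI still looks stale afterwards:
    systemctl restart pvedaemon pveproxy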

    I did finally look up how to run omping, and when I ran this omping command on all the servers in the cluster, this is what I got back:

    Code:
    # omping 10.0.0.13 10.0.0.14 10.0.0.15 10.0.0.16 10.0.0.18 10.0.0.19 10.0.0.17
    ....
    10.0.0.13 :   unicast, xmt/rcv/%loss = 234/234/0%, min/avg/max/std-dev = 0.045/0.116/0.177/0.023
    10.0.0.13 : multicast, xmt/rcv/%loss = 234/234/0%, min/avg/max/std-dev = 0.050/0.132/0.195/0.024
    10.0.0.14 :   unicast, xmt/rcv/%loss = 236/236/0%, min/avg/max/std-dev = 0.030/0.113/0.210/0.028
    10.0.0.14 : multicast, xmt/rcv/%loss = 236/236/0%, min/avg/max/std-dev = 0.036/0.133/0.223/0.030
    10.0.0.15 :   unicast, xmt/rcv/%loss = 238/238/0%, min/avg/max/std-dev = 0.038/0.102/0.231/0.039
    10.0.0.15 : multicast, xmt/rcv/%loss = 238/238/0%, min/avg/max/std-dev = 0.044/0.112/0.237/0.039
    10.0.0.18 :   unicast, xmt/rcv/%loss = 236/236/0%, min/avg/max/std-dev = 0.068/0.119/0.605/0.041
    10.0.0.18 : multicast, xmt/rcv/%loss = 236/236/0%, min/avg/max/std-dev = 0.071/0.129/0.606/0.041
    10.0.0.19 :   unicast, xmt/rcv/%loss = 234/234/0%, min/avg/max/std-dev = 0.047/0.102/0.148/0.017
    10.0.0.19 : multicast, xmt/rcv/%loss = 234/234/0%, min/avg/max/std-dev = 0.053/0.111/0.159/0.018
    10.0.0.17 :   unicast, xmt/rcv/%loss = 210/210/0%, min/avg/max/std-dev = 0.071/0.193/0.356/0.048
    10.0.0.17 : multicast, xmt/rcv/%loss = 210/210/0%, min/avg/max/std-dev = 0.080/0.202/0.361/0.048
    
    
    Thank you!
     
  6. mira

    mira Proxmox Staff Member
    Staff Member

    Joined:
    Feb 25, 2019
    Messages:
    166
    Likes Received:
    14
    Check the logs; if it crashes, there should be something in there. You could also start pmxcfs in the foreground with debug mode enabled.
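
    Something along these lines, assuming a standard setup:

    Code:
    # recent entries for the cluster filesystem service
    journalctl -u pve-cluster --since "1 week ago"

    # or run pmxcfs interactively: stop the service first,
    # then start it in the foreground with debug output
    systemctl stop pve-cluster
    pmxcfs -f -d
    # Ctrl+C when done, then:
    systemctl start pve-cluster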
     
  7. alchemycs

    alchemycs Member

    Joined:
    Dec 6, 2011
    Messages:
    34
    Likes Received:
    7
    Will do, thank you for everything! :)
     