[SOLVED] Nodes x'd out in GUI, still show up in pvecm

Discussion in 'Proxmox VE: Installation and configuration' started by alchemycs, Jul 12, 2019.

  1. alchemycs

    alchemycs Member

    Joined:
    Dec 6, 2011
    Messages:
    34
    Likes Received:
    7
    I've been having an issue with my cluster for a little while now where a random node drops out of the GUI and then either comes back on its own, or comes back once I restart pvesr and pvedaemon. However, now I have two nodes that don't want to come back, and that's a bit of a problem.
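
    For reference, the restart I mean is just the corresponding systemd units on the node that dropped out, something like:

    Code:
    systemctl restart pvesr pvedaemon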

    All the nodes show the same information when I run pvecm status:

    Code:
    #~ pvecm status
    Quorum information
    ------------------
    Date:             Thu Jul 11 17:09:19 2019
    Quorum provider:  corosync_votequorum
    Nodes:            7
    Node ID:          0x00000002
    Ring ID:          7/354724
    Quorate:          Yes
    
    Votequorum information
    ----------------------
    Expected votes:   7
    Highest expected: 7
    Total votes:      7
    Quorum:           4
    Flags:            Quorate
    
    Membership information
    ----------------------
        Nodeid      Votes Name
    0x00000007          1 10.0.0.17
    0x00000001          1 10.0.0.13
    0x00000002          1 10.0.0.14 (local)
    0x00000009          1 10.0.0.15
    0x00000008          1 10.0.0.16
    0x00000006          1 10.0.0.18
    0x00000005          1 10.0.0.19
    
    But there are two (.14 and .19) which show up as red in the cluster GUI, and both claim not to have a quorum when I try to use their GUIs.

    I did add a new switch to the mix and change how the nodes bond (active/backup), but that was months ago, after the dropout behaviour started. FWIW, this cluster has been up for quite a while now.
    I did set up multicast on the new switch, and while I didn't test before the change, now omping mostly just returns:
    omping: Can't find local address in arguments
    even though the local IPs are all in /etc/hosts.

    Any thoughts/suggestions?
     
  2. mira

    mira Proxmox Staff Member
    Staff Member

    Joined:
    Feb 25, 2019
    Messages:
    166
    Likes Received:
    14
    Looks like multicast is broken. Did this issue appear after changing the switch?
    What do the other nodes (not .14 and .19) see when you run 'pvecm status'?
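
    Regarding the omping error: 'Can't find local address in arguments' usually means that none of the addresses on the command line belong to the node you ran it on. Include every node's cluster IP (the local one too) and run the same command on all nodes at the same time, something like:

    Code:
    omping 10.0.0.13 10.0.0.14 10.0.0.15 10.0.0.16 10.0.0.17 10.0.0.18 10.0.0.19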
     
  3. alchemycs

    alchemycs Member

    Joined:
    Dec 6, 2011
    Messages:
    34
    Likes Received:
    7
    This started appearing before the switch change.
    All the nodes show the exact same results for 'pvecm status' (except, of course, for which node is marked as local).
     
  4. mira

    mira Proxmox Staff Member
    Staff Member

    Joined:
    Feb 25, 2019
    Messages:
    166
    Likes Received:
    14
    Can you post the output of 'systemctl status pve-cluster' on both nodes?
     
  5. alchemycs

    alchemycs Member

    Joined:
    Dec 6, 2011
    Messages:
    34
    Likes Received:
    7
    That was it - the pve-cluster service had crashed on those two servers! I can't believe I didn't check that!

    But I'm still curious whether you have any thoughts on what might be causing pvedaemon (or in this case pve-cluster) to crash on a somewhat regular basis (maybe once a week or so)?
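
    For anyone else who finds this: restarting the service should be all it takes to bring the nodes back, something like:

    Code:
    # on each affected node
    systemctl restart pve-cluster
    # and, if the GUI still looks stale afterwards:
    systemctl restart pvedaemon pveproxy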

    I did finally look up how to run omping, and when I ran this omping command on all the servers in the cluster, this is what I got back:

    Code:
    # omping 10.0.0.13 10.0.0.14 10.0.0.15 10.0.0.16 10.0.0.18 10.0.0.19 10.0.0.17
    ....
    10.0.0.13 :   unicast, xmt/rcv/%loss = 234/234/0%, min/avg/max/std-dev = 0.045/0.116/0.177/0.023
    10.0.0.13 : multicast, xmt/rcv/%loss = 234/234/0%, min/avg/max/std-dev = 0.050/0.132/0.195/0.024
    10.0.0.14 :   unicast, xmt/rcv/%loss = 236/236/0%, min/avg/max/std-dev = 0.030/0.113/0.210/0.028
    10.0.0.14 : multicast, xmt/rcv/%loss = 236/236/0%, min/avg/max/std-dev = 0.036/0.133/0.223/0.030
    10.0.0.15 :   unicast, xmt/rcv/%loss = 238/238/0%, min/avg/max/std-dev = 0.038/0.102/0.231/0.039
    10.0.0.15 : multicast, xmt/rcv/%loss = 238/238/0%, min/avg/max/std-dev = 0.044/0.112/0.237/0.039
    10.0.0.18 :   unicast, xmt/rcv/%loss = 236/236/0%, min/avg/max/std-dev = 0.068/0.119/0.605/0.041
    10.0.0.18 : multicast, xmt/rcv/%loss = 236/236/0%, min/avg/max/std-dev = 0.071/0.129/0.606/0.041
    10.0.0.19 :   unicast, xmt/rcv/%loss = 234/234/0%, min/avg/max/std-dev = 0.047/0.102/0.148/0.017
    10.0.0.19 : multicast, xmt/rcv/%loss = 234/234/0%, min/avg/max/std-dev = 0.053/0.111/0.159/0.018
    10.0.0.17 :   unicast, xmt/rcv/%loss = 210/210/0%, min/avg/max/std-dev = 0.071/0.193/0.356/0.048
    10.0.0.17 : multicast, xmt/rcv/%loss = 210/210/0%, min/avg/max/std-dev = 0.080/0.202/0.361/0.048
    
    
    Thank you!
     
  6. mira

    mira Proxmox Staff Member
    Staff Member

    Joined:
    Feb 25, 2019
    Messages:
    166
    Likes Received:
    14
    Check the logs; if it crashes, there should be something in there. You could also start pmxcfs in the foreground with debug mode enabled.
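
    Something along these lines, assuming a standard setup:

    Code:
    # recent entries for the cluster filesystem service
    journalctl -u pve-cluster --since "1 week ago"

    # or run pmxcfs interactively: stop the service first,
    # then start it in the foreground with debug output
    systemctl stop pve-cluster
    pmxcfs -f -d
    # Ctrl+C when done, then:
    systemctl start pve-cluster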
     
  7. alchemycs

    alchemycs Member

    Joined:
    Dec 6, 2011
    Messages:
    34
    Likes Received:
    7
    Will do, thank you for everything! :)
     