[SOLVED] Cluster suddenly fails, with no changes made

Razva

Hello,

Our 3-node Proxmox cluster suddenly failed, with all nodes disconnected from each other. All VMs are still running, but the UI shows every node as disconnected.

This happened with no OS upgrades and no network changes (at least none made by me). Nobody touched anything.

The ISP is OVH.

I ran a (short) omping test; here are the results:
Code:
192.168.1.2 :   unicast, xmt/rcv/%loss = 35/35/0%, min/avg/max/std-dev = 0.068/0.151/0.236/0.037
192.168.1.2 : multicast, xmt/rcv/%loss = 35/35/0%, min/avg/max/std-dev = 0.074/0.207/0.319/0.058
192.168.1.3 :   unicast, xmt/rcv/%loss = 35/35/0%, min/avg/max/std-dev = 0.100/0.128/0.228/0.034
192.168.1.3 : multicast, xmt/rcv/%loss = 35/35/0%, min/avg/max/std-dev = 0.108/0.188/0.282/0.046
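
A short omping run like the one above may not reveal multicast problems that only appear after a few minutes (for example, when a switch stops forwarding multicast). As a hedged sketch, a longer run can be filtered for lossy lines with a small helper; `omping_loss` is a hypothetical name, and the ten-minute duration in the usage comment is an assumption, not something taken from this thread:

```shell
# Hypothetical helper: print only the omping summary lines whose
# first %loss value is non-zero. Reads omping output on stdin.
omping_loss() {
  grep -E '%loss = [0-9]+/[0-9]+/[1-9][0-9]*%'
}

# Usage (run at the same time on all three nodes; commented out so the
# snippet only defines the helper):
#   omping -c 600 -i 1 -q 192.168.1.1 192.168.1.2 192.168.1.3 | omping_loss
```

If the helper prints nothing, no line reported packet loss during the run.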

I can fully ping and resolve the hostnames:
Code:
root@pmx1-lim:~# ping 192.168.1.2
PING 192.168.1.2 (192.168.1.2) 56(84) bytes of data.
64 bytes from 192.168.1.2: icmp_seq=1 ttl=64 time=0.068 ms
64 bytes from 192.168.1.2: icmp_seq=2 ttl=64 time=0.067 ms
64 bytes from 192.168.1.2: icmp_seq=3 ttl=64 time=0.071 ms
root@pmx1-lim:~# ping pmx2-lim
PING pmx2-lim (192.168.1.2) 56(84) bytes of data.
64 bytes from pmx2-lim (192.168.1.2): icmp_seq=1 ttl=64 time=0.087 ms
64 bytes from pmx2-lim (192.168.1.2): icmp_seq=2 ttl=64 time=0.080 ms
64 bytes from pmx2-lim (192.168.1.2): icmp_seq=3 ttl=64 time=0.112 ms
64 bytes from pmx2-lim (192.168.1.2): icmp_seq=4 ttl=64 time=0.112 ms

Here's the corosync config:
Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pmx1-lim
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.1.1
  }
  node {
    name: pmx2-lim
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.1.2
  }
  node {
    name: pmx3-lim
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 192.168.1.3
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: proxmox
  config_version: 3
  interface {
    bindnetaddr: 192.168.1.1
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}

Here's /etc/hosts:
Code:
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1       .localdomain
PUBLIC-IP    pmx1-3114549
192.168.1.1 pmx1-lim
192.168.1.2 pmx2-lim
192.168.1.3 pmx3-lim

# The following lines are desirable for IPv6 capable hosts
#(added automatically by netbase upgrade)
::1     ip6-localhost ip6-loopback
feo0::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

I've upgraded one of the nodes to the latest version and rebooted it, but still no luck (worse, it's now "waiting for quorum").

Any hints?
 
Have you tried to restart the corosync service? `systemctl restart corosync`
What's the output of `pvecm status`?
Also have a look in the syslog. Any suspicious errors?
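
The checks suggested above could be run on each node roughly as follows. This is only a sketch: `quorate_state` is a hypothetical helper name, and it assumes the `pvecm status` output contains a `Quorate:` line, as it does on a node where corosync is running.

```shell
# Hypothetical helper: print the value of the "Quorate:" field (Yes/No)
# from `pvecm status` output read on stdin.
quorate_state() {
  awk -F': *' '/^Quorate:/ { sub(/[ \t]+$/, "", $2); print $2 }'
}

# Usage on a node (commented out so the snippet only defines the helper):
#   systemctl restart corosync
#   pvecm status | quorate_state
#   journalctl -b -u corosync --no-pager | tail -n 50
```

The helper prints nothing when `pvecm status` fails before printing any quorum information.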
 
Since yesterday's reply things have deteriorated. I can access the UI on only one node; the others report "refused to connect".

`pvecm status` reports:
Code:
root@pmx1-lim:~# pvecm status
Quorum information
------------------
Date:             Wed May  1 10:58:02 2019
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000001
Ring ID:          1/45680
Quorate:          No

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      1
Quorum:           2 Activity blocked
Flags:           

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.1.1 (local)

Code:
root@pmx2-lim:~# pvecm status
Cannot initialize CMAP service

Code:
root@pmx3-lim:~# pvecm status
Cannot initialize CMAP service

At this point I think that the best bet would be to totally dismantle the cluster and reboot. I'm afraid to reboot the cluster prior to dismantling it, because VMs will not start without a functional quorum.

Any advice?
 
Quorum: 2 Activity blocked
You have lost quorum, therefore you cannot make any changes. As a temporary workaround, you can set the number of expected votes to 1 by running `pvecm expected 1`.
But you will have to figure out why you have lost quorum. Check the output of `journalctl -b -u corosync` for further hints.
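
For context, the `Quorum: 2` value in the status output is consistent with a simple majority of the 3 expected votes, and the node is blocked because it only sees 1 vote. A minimal sketch of that arithmetic (`quorum_needed` is a hypothetical helper, not part of `pvecm`):

```shell
# Hypothetical helper: votes needed for a simple majority,
# i.e. floor(expected / 2) + 1.
quorum_needed() {
  echo $(( $1 / 2 + 1 ))
}

quorum_needed 3   # prints 2; the isolated node has only 1 vote

# Temporary workaround from the reply above (commented out on purpose;
# it lowers the threshold on this node only and does not fix the cause):
#   pvecm expected 1
#   journalctl -b -u corosync
```

Lowering the expected votes is best treated as a stopgap while the corosync log is checked, since the other nodes still cannot see this one.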