[SOLVED] Cluster suddenly fails, with no changes made

Razva

Hello,

Our 3-node Proxmox cluster suddenly failed, with all nodes disconnected from each other. All VMs are still running, but the UI shows every node as disconnected.

This happened with no OS upgrades and no network changes (at least none made by me). Nobody touched anything.

The ISP is OVH.

I ran a (short) omping test; here are the results:
Code:
192.168.1.2 :   unicast, xmt/rcv/%loss = 35/35/0%, min/avg/max/std-dev = 0.068/0.151/0.236/0.037
192.168.1.2 : multicast, xmt/rcv/%loss = 35/35/0%, min/avg/max/std-dev = 0.074/0.207/0.319/0.058
192.168.1.3 :   unicast, xmt/rcv/%loss = 35/35/0%, min/avg/max/std-dev = 0.100/0.128/0.228/0.034
192.168.1.3 : multicast, xmt/rcv/%loss = 35/35/0%, min/avg/max/std-dev = 0.108/0.188/0.282/0.046

I can fully ping and resolve the hostnames:
Code:
root@pmx1-lim:~# ping 192.168.1.2
PING 192.168.1.2 (192.168.1.2) 56(84) bytes of data.
64 bytes from 192.168.1.2: icmp_seq=1 ttl=64 time=0.068 ms
64 bytes from 192.168.1.2: icmp_seq=2 ttl=64 time=0.067 ms
64 bytes from 192.168.1.2: icmp_seq=3 ttl=64 time=0.071 ms
root@pmx1-lim:~# ping pmx2-lim
PING pmx2-lim (192.168.1.2) 56(84) bytes of data.
64 bytes from pmx2-lim (192.168.1.2): icmp_seq=1 ttl=64 time=0.087 ms
64 bytes from pmx2-lim (192.168.1.2): icmp_seq=2 ttl=64 time=0.080 ms
64 bytes from pmx2-lim (192.168.1.2): icmp_seq=3 ttl=64 time=0.112 ms
64 bytes from pmx2-lim (192.168.1.2): icmp_seq=4 ttl=64 time=0.112 ms

Here's the corosync config:
Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pmx1-lim
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.1.1
  }
  node {
    name: pmx2-lim
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.1.2
  }
  node {
    name: pmx3-lim
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 192.168.1.3
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: proxmox
  config_version: 3
  interface {
    bindnetaddr: 192.168.1.1
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}

Here's /etc/hosts:
Code:
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1       .localdomain
PUBLIC-IP    pmx1-3114549
192.168.1.1 pmx1-lim
192.168.1.2 pmx2-lim
192.168.1.3 pmx3-lim

# The following lines are desirable for IPv6 capable hosts
#(added automatically by netbase upgrade)
::1     ip6-localhost ip6-loopback
feo0::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

I've upgraded one of the nodes to the latest version and rebooted it, but still no luck (worse, it's now "waiting for quorum").

Any hints?
 
Have you tried to restart the corosync service? `systemctl restart corosync`
What's the output of `pvecm status`?
Also have a look in the syslog. Any suspicious errors?
 
Since yesterday's reply, things have deteriorated. I can access the UI on only one node; the others return "refused to connect".

`pvecm status` reports:
Code:
root@pmx1-lim:~# pvecm status
Quorum information
------------------
Date:             Wed May  1 10:58:02 2019
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000001
Ring ID:          1/45680
Quorate:          No

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      1
Quorum:           2 Activity blocked
Flags:           

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.1.1 (local)

Code:
root@pmx2-lim:~# pvecm status
Cannot initialize CMAP service

Code:
root@pmx3-lim:~# pvecm status
Cannot initialize CMAP service

At this point I think that the best bet would be to totally dismantle the cluster and reboot. I'm afraid to reboot the cluster prior to dismantling it, because VMs will not start without a functional quorum.

Any advice?
 
Quorum: 2 Activity blocked
You have lost quorum, so you cannot make any changes. As a temporary workaround, you can lower the number of expected votes to 1 by running `pvecm expected 1`.
But you will have to figure out why you have lost quorum. Check the output of `journalctl -b -u corosync` for further hints.
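For reference, here is a rough sketch of how the quorum check could be scripted. It assumes the `pvecm status` output format shown earlier in this thread (a `Quorate:` line with `Yes` or `No`) and the usual majority rule, under which 3 expected votes need 2 to be quorate, matching the `Quorum: 2` line above. The helper names `quorum_needed` and `is_quorate` are made up for illustration and are not part of Proxmox.

```shell
#!/bin/sh
# Sketch only: quorum_needed and is_quorate are hypothetical helpers,
# not Proxmox tools. The output format is assumed from this thread.

# Majority quorum for N expected votes: floor(N/2) + 1.
quorum_needed() {
    echo $(( $1 / 2 + 1 ))
}

# Reads `pvecm status` text on stdin; succeeds only if it says "Quorate: Yes".
is_quorate() {
    grep -Eq '^Quorate:[[:space:]]+Yes'
}

# Possible use on a cluster node (left commented out so sourcing is harmless):
# if pvecm status | is_quorate; then
#     echo "quorate"
# else
#     echo "not quorate; need $(quorum_needed 3) of 3 votes"
#     journalctl -b -u corosync --no-pager | tail -n 50
# fi
```

On a node where corosync is not running at all (the "Cannot initialize CMAP service" case), `pvecm status` prints no `Quorate:` line, so the check simply reports "not quorate" and falls through to the corosync log.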
 
