Cluster dead after reboot

Ne00n

Well-Known Member
Apr 30, 2017
Hey,

I configured a 3-node cluster over tinc, with about 10 ms between the nodes.
After I rebooted the nodes to apply changes, the cluster broke apart into 3 nodes in standalone mode.

I attached screenshots so you can take a look.
Afterwards, I tried to check the status with: pvecm status
On all 3 nodes it returned: Cannot initialize CMAP service

When I tried to peer them again, it returned:

root@firestone:~# pvecm add dimension -force
cluster not ready - no quorum?

It seems like the nodes are half clustered and half standalone.
Can someone help me solve this?
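
In case it matters, I assume the right way to check whether corosync and the cluster filesystem are even running after the reboot would be something like this (standard unit names, output omitted here):
Code:
# is corosync up, and is pve-cluster (pmxcfs) up?
systemctl status corosync pve-cluster

# what did they log since the last boot?
journalctl -b -u corosync -u pve-cluster --no-pager | tail -n 50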

Regards,
Ne00n.
 

Attachments

  • firestone.png (76.2 KB)
  • dimension.png (73.2 KB)
  • stargate.png (77.4 KB)
Hi,
did you reboot all nodes at the same time? I guess yes.

On a three-node cluster you need two nodes to get quorum. To me it doesn't look like you have three nodes in standalone mode - but your cluster isn't healthy, so it makes no sense to add nodes to the cluster that are already in the config.
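
Roughly, you can check quorum like this (exact output depends on the setup, and "pvecm expected 1" is only an emergency override for a single node - use it with care):
Code:
# votequorum view of the cluster
corosync-quorumtool -s

# Proxmox view of the same information
pvecm status

# emergency only: accept a single vote as enough on this node
pvecm expected 1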

Post the output of the following commands:
Code:
cat /etc/pve/corosync.conf

cat /etc/hosts

ps aux | grep corosync

ping dimension
ping stargate
ping firestone

service pve-cluster restart
Udo
 
Hey,

It seems like it was bad timing; I always made sure to reboot just one node and wait until it was back.

Running these commands seems to have fixed the issue:
Code:
service pve-cluster restart
systemctl restart corosync

However, when I reboot a single node, it runs into the same issue again. Here is the debug output you wanted:

Code:
root@Stargate:~# cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: Stargate
    nodeid: 2
    quorum_votes: 1
    ring0_addr: Stargate
  }
  node {
    name: dimension
    nodeid: 1
    quorum_votes: 1
    ring0_addr: dimension
  }
  node {
    name: firestone
    nodeid: 3
    quorum_votes: 1
    ring0_addr: firestone
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: micro
  config_version: 3
  interface {
    bindnetaddr: 10.0.0.1
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}

Code:
root@Stargate:~# cat /etc/hosts
127.0.0.1   localhost

10.0.0.3    stargate
10.0.0.2    firestone
10.0.0.1    dimension

I checked /etc/hosts on all nodes; it seems OK.

Code:
root@Stargate:~# ps aux | grep corosync
root      6562  0.8  3.6 197696 74836 ?        SLsl 21:15   0:22 /usr/sbin/corosync -f
root     10992  0.0  0.0  12788   968 pts/0    S+   21:58   0:00 grep corosync

Code:
root@Stargate:~# ping dimension
PING dimension (10.0.0.1) 56(84) bytes of data.
64 bytes from dimension (10.0.0.1): icmp_seq=1 ttl=64 time=13.1 ms
64 bytes from dimension (10.0.0.1): icmp_seq=2 ttl=64 time=13.2 ms
64 bytes from dimension (10.0.0.1): icmp_seq=3 ttl=64 time=14.1 ms
^C
--- dimension ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2002ms
rtt min/avg/max/mdev = 13.162/13.497/14.112/0.435 ms
root@Stargate:~# ping stargate
PING stargate (10.0.0.3) 56(84) bytes of data.
64 bytes from stargate (10.0.0.3): icmp_seq=1 ttl=64 time=0.056 ms
64 bytes from stargate (10.0.0.3): icmp_seq=2 ttl=64 time=0.033 ms
64 bytes from stargate (10.0.0.3): icmp_seq=3 ttl=64 time=0.037 ms
64 bytes from stargate (10.0.0.3): icmp_seq=4 ttl=64 time=0.030 ms
^C
--- stargate ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3051ms
rtt min/avg/max/mdev = 0.030/0.039/0.056/0.010 ms
root@Stargate:~# ping firestone
PING firestone (10.0.0.2) 56(84) bytes of data.
64 bytes from firestone (10.0.0.2): icmp_seq=1 ttl=64 time=8.07 ms
64 bytes from firestone (10.0.0.2): icmp_seq=2 ttl=64 time=8.72 ms
64 bytes from firestone (10.0.0.2): icmp_seq=4 ttl=64 time=7.90 ms
64 bytes from firestone (10.0.0.2): icmp_seq=5 ttl=64 time=7.85 ms
^C
--- firestone ping statistics ---
5 packets transmitted, 4 received, 20% packet loss, time 4022ms
rtt min/avg/max/mdev = 7.859/8.141/8.726/0.353 ms

Any idea why the error happens again when I reboot a node?
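
If it helps, I guess the logs from the affected node since the last boot would show what exactly fails - roughly:
Code:
# everything corosync and pve-cluster logged since the current boot
journalctl -b -u corosync -u pve-cluster --no-pager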
 

Attachments

  • cluster.png (83.4 KB)
Hi,
perhaps your switch has problems with multicast?
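
You could test that with something like the multicast test from the Proxmox cluster docs - omping has to be installed on all three nodes and the command has to run on all of them at the same time (only a sketch, adjust the hostnames to your setup):
Code:
apt install omping
omping -c 10000 -i 0.001 -F -q dimension stargate firestone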

BTW, do you know why the ping to dimension takes approx. 5 ms longer than the ping to firestone?

Udo

Well, everything worked fine until I restarted too many nodes at the same time.
Why should that be tinc's fault?

Everything worked before. I reboot a node, and it falls back into a not-enough-quorum state until I manually fix it by restarting the services.

Well, these are based in 3 different data centers, on 3 different networks, so the latency differs a bit, but it is around 10 ms.

the latency is too high - corosync tolerates latency up to about 2 ms; above that it will not work reliably

see https://pve.proxmox.com/wiki/Cluster_Manager

(edit: thought it was 4ms, when it was really 2ms)
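
To get a more representative latency number than a handful of pings, something like this between the nodes should give a decent average:
Code:
# 100 pings, quiet output, just the min/avg/max/mdev summary
ping -c 100 -q dimension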

So, you think the latency is related to this issue?
 
