Cluster dead after reboot

Ne00n

Well-Known Member
Apr 30, 2017
Hey,

I configured a 3-node cluster over tinc, with about 10 ms between the nodes.
After I rebooted the nodes to apply changes, the cluster broke apart into 3 nodes in standalone mode.

I attached screenshots so you can take a look.
Afterwards, I tried to check the status with: pvecm status
On all 3 nodes it returned: Cannot initialize CMAP service

When I tried to peer them again, it returned:

root@firestone:~# pvecm add dimension -force
cluster not ready - no quorum?

It seems like the nodes are half clustered and half standalone.
Can someone help me solve this?
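
In case it matters, I assume the right way to check whether corosync and the cluster filesystem are even running after the reboot would be something like this (standard unit names, output omitted here):
Code:
# is corosync up, and is pve-cluster (pmxcfs) up?
systemctl status corosync pve-cluster

# what did they log since the last boot?
journalctl -b -u corosync -u pve-cluster --no-pager | tail -n 50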

Regards,
Ne00n.
 

Attachments

  • firestone.png (76.2 KB)
  • dimension.png (73.2 KB)
  • stargate.png (77.4 KB)
Hi,
did you reboot all nodes at the same time? I guess yes.

On a three-node cluster you need two nodes to get quorum. To me it doesn't look like you have three nodes in standalone mode - but your cluster isn't healthy, so it makes no sense to add nodes to the cluster that are already in the config.
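
Roughly, you can check quorum like this (exact output depends on the setup, and "pvecm expected 1" is only an emergency override for a single node - use it with care):
Code:
# votequorum view of the cluster
corosync-quorumtool -s

# Proxmox view of the same information
pvecm status

# emergency only: accept a single vote as enough on this node
pvecm expected 1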

Post the output of the following commands:
Code:
cat /etc/pve/corosync.conf

cat /etc/hosts

ps aux | grep corosync

ping dimension
ping stargate
ping firestone

service pve-cluster restart
Udo
 
Hey,

It seems like it was bad timing; I always made sure to reboot just one node and wait until it was back.

Running these commands seems to have fixed the issue:
Code:
service pve-cluster restart
systemctl restart corosync

However, when I reboot a single node, it runs into the same issue again. Here is the debug output you wanted:

Code:
root@Stargate:~# cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: Stargate
    nodeid: 2
    quorum_votes: 1
    ring0_addr: Stargate
  }
  node {
    name: dimension
    nodeid: 1
    quorum_votes: 1
    ring0_addr: dimension
  }
  node {
    name: firestone
    nodeid: 3
    quorum_votes: 1
    ring0_addr: firestone
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: micro
  config_version: 3
  interface {
    bindnetaddr: 10.0.0.1
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}

Code:
root@Stargate:~# cat /etc/hosts
127.0.0.1   localhost

10.0.0.3    stargate
10.0.0.2    firestone
10.0.0.1    dimension

I checked /etc/hosts on all nodes; it seems OK.

Code:
root@Stargate:~# ps aux | grep corosync
root      6562  0.8  3.6 197696 74836 ?        SLsl 21:15   0:22 /usr/sbin/corosync -f
root     10992  0.0  0.0  12788   968 pts/0    S+   21:58   0:00 grep corosync

Code:
root@Stargate:~# ping dimension
PING dimension (10.0.0.1) 56(84) bytes of data.
64 bytes from dimension (10.0.0.1): icmp_seq=1 ttl=64 time=13.1 ms
64 bytes from dimension (10.0.0.1): icmp_seq=2 ttl=64 time=13.2 ms
64 bytes from dimension (10.0.0.1): icmp_seq=3 ttl=64 time=14.1 ms
^C
--- dimension ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2002ms
rtt min/avg/max/mdev = 13.162/13.497/14.112/0.435 ms
root@Stargate:~# ping stargate
PING stargate (10.0.0.3) 56(84) bytes of data.
64 bytes from stargate (10.0.0.3): icmp_seq=1 ttl=64 time=0.056 ms
64 bytes from stargate (10.0.0.3): icmp_seq=2 ttl=64 time=0.033 ms
64 bytes from stargate (10.0.0.3): icmp_seq=3 ttl=64 time=0.037 ms
64 bytes from stargate (10.0.0.3): icmp_seq=4 ttl=64 time=0.030 ms
^C
--- stargate ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3051ms
rtt min/avg/max/mdev = 0.030/0.039/0.056/0.010 ms
root@Stargate:~# ping firestone
PING firestone (10.0.0.2) 56(84) bytes of data.
64 bytes from firestone (10.0.0.2): icmp_seq=1 ttl=64 time=8.07 ms
64 bytes from firestone (10.0.0.2): icmp_seq=2 ttl=64 time=8.72 ms
64 bytes from firestone (10.0.0.2): icmp_seq=4 ttl=64 time=7.90 ms
64 bytes from firestone (10.0.0.2): icmp_seq=5 ttl=64 time=7.85 ms
^C
--- firestone ping statistics ---
5 packets transmitted, 4 received, 20% packet loss, time 4022ms
rtt min/avg/max/mdev = 7.859/8.141/8.726/0.353 ms

Any idea why the error happens again when I reboot a node?
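
If it helps, I guess the logs from the affected node since the last boot would show what exactly fails - roughly:
Code:
# everything corosync and pve-cluster logged since the current boot
journalctl -b -u corosync -u pve-cluster --no-pager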
 

Attachments

  • cluster.png (83.4 KB)
Hi,
perhaps your switch has problems with multicast?
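
You could test that with something like the multicast test from the Proxmox cluster docs - omping has to be installed on all three nodes and the command has to run on all of them at the same time (only a sketch, adjust the hostnames to your setup):
Code:
apt install omping
omping -c 10000 -i 0.001 -F -q dimension stargate firestone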

BTW, do you know why the ping to dimension takes approx. 5 ms longer than the ping to firestone?

Udo

Well, everything worked fine until I restarted too many nodes at the same time.
Why should that be tinc's fault?

Everything worked before. I reboot a node, and it falls back into a not-enough-quorum state until I manually fix it by restarting the services.

Well, these are based in 3 different data centers, on 3 different networks, so the latency differs a bit, but it is around 10 ms.

the latency is too high - corosync tolerates latency up to about 2 ms; above that it will not work reliably

see https://pve.proxmox.com/wiki/Cluster_Manager

(edit: thought it was 4ms, when it was really 2ms)
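
To get a more representative latency number than a handful of pings, something like this between the nodes should give a decent average:
Code:
# 100 pings, quiet output, just the min/avg/max/mdev summary
ping -c 100 -q dimension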

So, you think the latency is related to this issue?
 
