New node disappears randomly from cluster

raoulh

Hi,

I have a cluster of 5 servers, all running an up-to-date PVE 6.

After adding a 6th server, I have an issue where this last node (#6) disappears from the cluster.

Here is the output of pvecm status

On one node of the running cluster:
Code:
Cluster information
-------------------
Name:             pve-emcp
Config Version:   10
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Thu Apr 30 14:55:09 2020
Quorum provider:  corosync_votequorum
Nodes:            6
Node ID:          0x00000003
Ring ID:          1.41d
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   6
Highest expected: 6
Total votes:      6
Quorum:           4 
Flags:            Quorate
Unable to get node address for nodeid 6: CS_ERR_NOT_EXIST

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 172.16.10.106
0x00000002          1 172.16.10.104
0x00000003          1 172.16.10.102 (local)
0x00000004          1 172.16.10.105
0x00000005          1 172.16.10.103
0x00000006          1

On the missing node:
Code:
Cluster information
-------------------
Name:             pve-emcp
Config Version:   10
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Thu Apr 30 14:55:30 2020
Quorum provider:  corosync_votequorum
Nodes:            6
Node ID:          0x00000006
Ring ID:          1.41d
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   6
Highest expected: 6
Total votes:      6
Quorum:           4
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 172.16.10.106
0x00000002          1 172.16.10.104
0x00000003          1 172.16.10.102
0x00000004          1 172.16.10.105
0x00000005          1 172.16.10.103
0x00000006          1 172.16.10.101 (local)

As you can see, the "missing node" still thinks everything is OK. But all the other nodes show this error: Unable to get node address for nodeid 6: CS_ERR_NOT_EXIST

If I reboot the node, it becomes correctly visible to the cluster again, but after anywhere from a few hours to a few days it fails again.

What could be the root cause of this?

Thanks for your help.
 
Have you checked the syslog for any errors?
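
For example, the relevant entries can usually be pulled from the journal like this (standard Debian/PVE log locations, adjust as needed):
Code:
# corosync and pmxcfs (pve-cluster) messages since the current boot
journalctl -b -u corosync -u pve-cluster

# or grep the classic syslog file
grep -Ei 'corosync|pmxcfs' /var/log/syslog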
 
Yes, and I did not see anything. There were some messages in the corosync log about the node joining again after I restarted the missing node.

What is strange is that in the web UI the node is simply missing, as if it had never been added to the cluster, not just greyed out as if it were offline... And if I connect to the Proxmox web UI on this node, I only see this particular node in the cluster, no other one. As if the cluster were somehow split...
 
No corosync leave message or anything? And even when it joins after a reboot, the problematic node does not see any of the others? That sounds strange. Are there any VMs running on that node? If not, you could try separating the node without reinstalling: https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_remove_a_cluster_node (5.5.1)

Do you see the other nodes in /etc/pve/nodes on the problematic node?
Could you provide the syslog? If possible, from a fresh boot until it leaves the cluster.
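
For example, something along these lines (the output file name is just a suggestion):
Code:
# on the problematic node: which nodes does the cluster filesystem know about?
ls /etc/pve/nodes

# full journal since the current boot, redirected to a file you can attach here
journalctl -b > /root/node6-boot.log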
 
After a reboot of the problematic node, everything is back to normal. I don't have anything running on this node (no VMs/CTs); it's a freshly installed node.
I have rebooted the node now. I will wait for it to fail again (it can happen within a few hours, or 2-3 days at most), then I'll post the full log.

After the reboot, this is the corosync log I get on the other running nodes:
Code:
Apr 28 19:09:43 pve-host2 corosync[3264]:   [TOTEM ] Retransmit List: 98fadd
Apr 28 19:09:43 pve-host2 corosync[3264]:   [TOTEM ] Retransmit List: 98fadd
Apr 29 19:50:28 pve-host2 corosync[3264]:   [TOTEM ] Retransmit List: ac0e55
Apr 29 19:50:28 pve-host2 corosync[3264]:   [TOTEM ] Retransmit List: ac0e55 ac0e56
Apr 29 19:50:28 pve-host2 corosync[3264]:   [TOTEM ] Retransmit List: ac0e56
Apr 29 20:30:58 pve-host2 corosync[3264]:   [TOTEM ] Retransmit List: ac93ad
Apr 29 23:59:13 pve-host2 corosync[3264]:   [TOTEM ] Retransmit List: af4168
Apr 30 12:03:43 pve-host2 corosync[3264]:   [TOTEM ] Retransmit List: b88cad b88cae
Apr 30 12:03:43 pve-host2 corosync[3264]:   [TOTEM ] Retransmit List: b88cae
Apr 30 14:52:43 pve-host2 corosync[3264]:   [TOTEM ] Retransmit List: bab845
Apr 30 14:52:43 pve-host2 corosync[3264]:   [TOTEM ] Retransmit List: bab846
Apr 30 15:46:33 pve-host2 corosync[3264]:   [TOTEM ] A new membership (1.421) was formed. Members left: 6
Apr 30 15:46:33 pve-host2 corosync[3264]:   [TOTEM ] Failed to receive the leave message. failed: 6
Apr 30 15:46:33 pve-host2 corosync[3264]:   [CPG   ] downlist left_list: 1 received
Apr 30 15:46:33 pve-host2 corosync[3264]:   [CPG   ] downlist left_list: 1 received
Apr 30 15:46:33 pve-host2 corosync[3264]:   [CPG   ] downlist left_list: 1 received
Apr 30 15:46:33 pve-host2 corosync[3264]:   [CPG   ] downlist left_list: 1 received
Apr 30 15:46:33 pve-host2 corosync[3264]:   [CPG   ] downlist left_list: 1 received
Apr 30 15:46:33 pve-host2 corosync[3264]:   [QUORUM] Members[5]: 1 2 3 4 5
Apr 30 15:46:33 pve-host2 corosync[3264]:   [MAIN  ] Completed service synchronization, ready to provide service.
Apr 30 15:51:06 pve-host2 corosync[3264]:   [TOTEM ] A new membership (1.426) was formed. Members joined: 6
Apr 30 15:51:06 pve-host2 corosync[3264]:   [CPG   ] downlist left_list: 0 received
Apr 30 15:51:06 pve-host2 corosync[3264]:   [CPG   ] downlist left_list: 0 received
Apr 30 15:51:06 pve-host2 corosync[3264]:   [CPG   ] downlist left_list: 0 received
Apr 30 15:51:06 pve-host2 corosync[3264]:   [CPG   ] downlist left_list: 0 received
Apr 30 15:51:06 pve-host2 corosync[3264]:   [CPG   ] downlist left_list: 0 received
Apr 30 15:51:06 pve-host2 corosync[3264]:   [CPG   ] downlist left_list: 0 received
Apr 30 15:51:06 pve-host2 corosync[3264]:   [QUORUM] Members[6]: 1 2 3 4 5 6
Apr 30 15:51:06 pve-host2 corosync[3264]:   [MAIN  ] Completed service synchronization, ready to provide service.

I rebooted the failing node at 15:46, so corosync itself is working OK...
 
If you see messages like the following, it could be a network problem:
Code:
Apr 30 14:52:43 pve-host2 corosync[3264]:   [TOTEM ] Retransmit List: bab846
Apr 30 15:46:33 pve-host2 corosync[3264]:   [TOTEM ] A new membership (1.421) was formed. Members left: 6
Apr 30 15:46:33 pve-host2 corosync[3264]:   [TOTEM ] Failed to receive the leave message. failed: 6
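
As a side note, you could also check the knet link status directly on each node with the standard corosync tool (just a generic suggestion, not specific to your logs):
Code:
# shows the local node ID and the state of the links towards every other node
corosync-cfgtool -s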
 
Yes, during a reboot it is common.
 
So, the node is no longer displayed in the PVE web GUI. It happened fast this time.

Attached is the full boot log of the failing node, plus the log of another working node in the cluster since the last reboot (I did the reboot at 15:46).
 

Attachments

  • log_node_fail.log (544.8 KB)
  • log_node2_cluster.log (16.2 KB)

No sorry, nothing in the logs that would indicate a problem with the cluster.
When this happens, can you provide the output of systemctl status pve-cluster and systemctl status corosync from both a node in the cluster and the separated one?
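
Something like this, run on both a regular cluster node and the separated one:
Code:
systemctl status pve-cluster
systemctl status corosync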
 
