New node disappears randomly from cluster

raoulh

Hi,

I have a cluster of 5 servers, all running an up-to-date PVE 6.

After adding a 6th server, I have an issue where this last node (#6) disappears from the cluster.

Here is the output of pvecm status

On one node of the running cluster:
Code:
Cluster information
-------------------
Name:             pve-emcp
Config Version:   10
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Thu Apr 30 14:55:09 2020
Quorum provider:  corosync_votequorum
Nodes:            6
Node ID:          0x00000003
Ring ID:          1.41d
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   6
Highest expected: 6
Total votes:      6
Quorum:           4 
Flags:            Quorate
Unable to get node address for nodeid 6: CS_ERR_NOT_EXIST

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 172.16.10.106
0x00000002          1 172.16.10.104
0x00000003          1 172.16.10.102 (local)
0x00000004          1 172.16.10.105
0x00000005          1 172.16.10.103
0x00000006          1

On the missing node:
Code:
Cluster information
-------------------
Name:             pve-emcp
Config Version:   10
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Thu Apr 30 14:55:30 2020
Quorum provider:  corosync_votequorum
Nodes:            6
Node ID:          0x00000006
Ring ID:          1.41d
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   6
Highest expected: 6
Total votes:      6
Quorum:           4
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 172.16.10.106
0x00000002          1 172.16.10.104
0x00000003          1 172.16.10.102
0x00000004          1 172.16.10.105
0x00000005          1 172.16.10.103
0x00000006          1 172.16.10.101 (local)

As you can see, the "missing node" still thinks everything is OK. But all the other nodes show this error: Unable to get node address for nodeid 6: CS_ERR_NOT_EXIST

If I reboot the node, it becomes correctly visible to the cluster again, but after anywhere from a few hours to a few days it fails again.

What could be the root cause of this?

Thanks for your help.
 
Have you checked the syslog for any errors?
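
For example, the relevant entries can usually be pulled from the journal like this (standard Debian/PVE log locations, adjust as needed):
Code:
# corosync and pmxcfs (pve-cluster) messages since the current boot
journalctl -b -u corosync -u pve-cluster

# or grep the classic syslog file
grep -Ei 'corosync|pmxcfs' /var/log/syslog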
 
Yes, and I did not see anything. There were some messages in the corosync log about the node joining again after I restarted the missing node.

What is strange is that in the web UI the node is simply missing, as if it had never been added to the cluster, not just greyed out as if it were offline... And if I connect to the Proxmox web UI on this node, I only see this particular node in the cluster, no other one. As if the cluster were somehow split...
 
No corosync leave message or anything? And even when it joins after a reboot, the problematic node does not see any of the others? That sounds strange. Are there any VMs running on that node? If not, you could try separating the node without reinstalling: https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_remove_a_cluster_node (5.5.1)

Do you see the other nodes in /etc/pve/nodes on the problematic node?
Could you provide the syslog? If possible, from a fresh boot until it leaves the cluster.
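
For example, something along these lines (the output file name is just a suggestion):
Code:
# on the problematic node: which nodes does the cluster filesystem know about?
ls /etc/pve/nodes

# full journal since the current boot, redirected to a file you can attach here
journalctl -b > /root/node6-boot.log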
 
After a reboot of the problematic node, everything is back to normal. I don't have anything running on this node (no VMs/CTs); it's a freshly installed node.
I have rebooted the node now. I will wait for it to fail again (it can happen within a few hours, or 2-3 days at most), then I'll post the full log.

After the reboot, this is the corosync log I get on the other running nodes:
Code:
Apr 28 19:09:43 pve-host2 corosync[3264]:   [TOTEM ] Retransmit List: 98fadd
Apr 28 19:09:43 pve-host2 corosync[3264]:   [TOTEM ] Retransmit List: 98fadd
Apr 29 19:50:28 pve-host2 corosync[3264]:   [TOTEM ] Retransmit List: ac0e55
Apr 29 19:50:28 pve-host2 corosync[3264]:   [TOTEM ] Retransmit List: ac0e55 ac0e56
Apr 29 19:50:28 pve-host2 corosync[3264]:   [TOTEM ] Retransmit List: ac0e56
Apr 29 20:30:58 pve-host2 corosync[3264]:   [TOTEM ] Retransmit List: ac93ad
Apr 29 23:59:13 pve-host2 corosync[3264]:   [TOTEM ] Retransmit List: af4168
Apr 30 12:03:43 pve-host2 corosync[3264]:   [TOTEM ] Retransmit List: b88cad b88cae
Apr 30 12:03:43 pve-host2 corosync[3264]:   [TOTEM ] Retransmit List: b88cae
Apr 30 14:52:43 pve-host2 corosync[3264]:   [TOTEM ] Retransmit List: bab845
Apr 30 14:52:43 pve-host2 corosync[3264]:   [TOTEM ] Retransmit List: bab846
Apr 30 15:46:33 pve-host2 corosync[3264]:   [TOTEM ] A new membership (1.421) was formed. Members left: 6
Apr 30 15:46:33 pve-host2 corosync[3264]:   [TOTEM ] Failed to receive the leave message. failed: 6
Apr 30 15:46:33 pve-host2 corosync[3264]:   [CPG   ] downlist left_list: 1 received
Apr 30 15:46:33 pve-host2 corosync[3264]:   [CPG   ] downlist left_list: 1 received
Apr 30 15:46:33 pve-host2 corosync[3264]:   [CPG   ] downlist left_list: 1 received
Apr 30 15:46:33 pve-host2 corosync[3264]:   [CPG   ] downlist left_list: 1 received
Apr 30 15:46:33 pve-host2 corosync[3264]:   [CPG   ] downlist left_list: 1 received
Apr 30 15:46:33 pve-host2 corosync[3264]:   [QUORUM] Members[5]: 1 2 3 4 5
Apr 30 15:46:33 pve-host2 corosync[3264]:   [MAIN  ] Completed service synchronization, ready to provide service.
Apr 30 15:51:06 pve-host2 corosync[3264]:   [TOTEM ] A new membership (1.426) was formed. Members joined: 6
Apr 30 15:51:06 pve-host2 corosync[3264]:   [CPG   ] downlist left_list: 0 received
Apr 30 15:51:06 pve-host2 corosync[3264]:   [CPG   ] downlist left_list: 0 received
Apr 30 15:51:06 pve-host2 corosync[3264]:   [CPG   ] downlist left_list: 0 received
Apr 30 15:51:06 pve-host2 corosync[3264]:   [CPG   ] downlist left_list: 0 received
Apr 30 15:51:06 pve-host2 corosync[3264]:   [CPG   ] downlist left_list: 0 received
Apr 30 15:51:06 pve-host2 corosync[3264]:   [CPG   ] downlist left_list: 0 received
Apr 30 15:51:06 pve-host2 corosync[3264]:   [QUORUM] Members[6]: 1 2 3 4 5 6
Apr 30 15:51:06 pve-host2 corosync[3264]:   [MAIN  ] Completed service synchronization, ready to provide service.

I rebooted the failing node at 15:46, so corosync itself is working OK...
 
If you see messages like the following, it could be a network problem:
Code:
Apr 30 14:52:43 pve-host2 corosync[3264]:   [TOTEM ] Retransmit List: bab846
Apr 30 15:46:33 pve-host2 corosync[3264]:   [TOTEM ] A new membership (1.421) was formed. Members left: 6
Apr 30 15:46:33 pve-host2 corosync[3264]:   [TOTEM ] Failed to receive the leave message. failed: 6
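
As a side note, you could also check the knet link status directly on each node with the standard corosync tool (just a generic suggestion, not specific to your logs):
Code:
# shows the local node ID and the state of the links towards every other node
corosync-cfgtool -s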
 
Yes, during a reboot it is common.
 
So, the node is no longer displayed in the PVE web GUI. It happened fast this time.

Attached is the full boot log of the failing node, plus the log of another working node in the cluster since the last reboot (I did the reboot at 15:46).
 

Attachments

  • log_node_fail.log (544.8 KB)
  • log_node2_cluster.log (16.2 KB)

No sorry, nothing in the logs that would indicate a problem with the cluster.
When this happens, can you provide the output of systemctl status pve-cluster and systemctl status corosync from both a node in the cluster and the separated one?
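
Something like this, run on both a regular cluster node and the separated one:
Code:
systemctl status pve-cluster
systemctl status corosync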
 
