Nodes going offline

Paulo Maligaya

Hi,

I have a cluster setup with 2 nodes (proxmox01 and proxmox02). Proxmox01 is my primary node and proxmox02 is my second.

What I noticed is that every time proxmox02 gets a bit loaded (system load reaches around 2.0) because of the "kvm" process, it automatically gets disconnected from the cluster. Luckily, all guest VMs running on this node stay online (and reachable); the node just disconnects itself from the cluster -- when I do "pvecm status" there is only one member of the cluster, and in the PVE management web UI proxmox02 shows up marked in red.
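In case it helps, this is roughly what I check on both nodes when it happens (nothing fancy, just the standard tools):

--------------------------------
# cluster membership and quorum as Proxmox sees it
pvecm status

# the same information straight from corosync
corosync-quorumtool -s

# local ring / totem status
corosync-cfgtool -s
--------------------------------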

I dug through the corosync and pve-cluster logs, and so far these are the only error messages that I can correlate with the issue:

--------------------------------
Oct 05 09:59:11 proxmox02 corosync[65732]: [TOTEM ] A processor failed, forming new configuration.
Oct 05 09:59:12 proxmox02 corosync[65732]: [TOTEM ] A new membership (192.168.0.1:43544) was formed. Members joined: 1 left: 1
Oct 05 09:59:12 proxmox02 corosync[65732]: [TOTEM ] Failed to receive the leave message. failed: 1
Oct 05 09:59:15 proxmox02 corosync[65732]: [TOTEM ] FAILED TO RECEIVE
Oct 05 09:59:16 proxmox02 corosync[65732]: [TOTEM ] A new membership (192.168.0.2:43548) was formed. Members left: 1
Oct 05 09:59:16 proxmox02 corosync[65732]: [TOTEM ] Failed to receive the leave message. failed: 1
Oct 05 09:59:16 proxmox02 corosync[65732]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Oct 05 09:59:16 proxmox02 corosync[65732]: [QUORUM] Members[1]: 2
Oct 05 09:59:16 proxmox02 corosync[65732]: [MAIN ] Completed service synchronization, ready to provide service.

Oct 5 15:21:27 proxmox02 corosync[100488]: [MAIN ] Completed service synchronization, ready to provide service.
Oct 5 15:21:28 proxmox02 pmxcfs[100284]: [main] notice: teardown filesystem
Oct 5 15:21:29 proxmox02 corosync[100488]: [TOTEM ] A new membership (192.168.0.2:51724) was formed. Members
Oct 5 15:21:29 proxmox02 corosync[100488]: [QUORUM] Members[1]: 2
Oct 5 15:21:29 proxmox02 corosync[100488]: [MAIN ] Completed service synchronization, ready to provide service.
Oct 5 15:21:30 proxmox02 pve-ha-lrm[3103]: ipcc_send_rec failed: Transport endpoint is not connected
Oct 5 15:21:30 proxmox02 pve-ha-lrm[3103]: ipcc_send_rec failed: Connection refused
Oct 5 15:21:30 proxmox02 pve-ha-lrm[3103]: ipcc_send_rec failed: Connection refused

--------------------------------
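(For reference, I'm pulling these with something like the following; the time window is just an example:)

--------------------------------
journalctl -u corosync -u pve-cluster --since "1 hour ago"
--------------------------------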

It also comes back up and automatically rejoins the cluster as soon as the server load returns to normal.

Is there an underlying issue that I should look into further? Has anyone had this issue before?

TIA!
 
Hah! I bet you'd say that. :)

No, the cluster address is on a bridge interface. This way I can provide dedicated IPs to the guest VMs. Besides, as per the docs, a dedicated NIC is only needed if you use shared storage, which is not my setup.
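For context, the relevant part of my /etc/network/interfaces looks roughly like this (bridge name, ports and addresses are simplified examples):

--------------------------------
auto vmbr0
iface vmbr0 inet static
        address 192.168.0.2
        netmask 255.255.255.0
        bridge_ports eth0
        bridge_stp off
        bridge_fd 0
--------------------------------

So the node's own IP (the one corosync uses) sits on the bridge, and the guest VMs attach to the same bridge.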
 
@proxtest I'm not really sure what the relation is between having IPv6 on my Proxmox cluster and a node getting disconnected from the cluster when the system load increases. Could you share some details? TIA
 
Proxmox needs IPv6 also. Is it working?

Where do you get that from? Multicast is required for corosync operations. Multicast is not synonymous with IPv6. IPv6 does rely on multicast for NDP, a protocol that replaces ARP with multicast operations at the link layer. IPv4 also supports multicast, depending on your switching gear.
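If you want to verify that multicast between the two nodes actually holds up, including under load, the omping test from the Proxmox cluster docs is a good start. Run it on both nodes at the same time (node names taken from your post):

--------------------------------
# quick burst test
omping -c 10000 -i 0.001 -F -q proxmox01 proxmox02

# longer test, runs for about 10 minutes
omping -c 600 -i 1 -q proxmox01 proxmox02
--------------------------------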
 
Hah! I bet you'd say that. :)

No, the cluster address is on a bridge interface. This way I can provide dedicated IPs to the guest VMs. Besides, as per the docs, a dedicated NIC is only needed if you use shared storage, which is not my setup.

Tcpdump your primary NICs and also check for packet loss.
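Something along these lines, assuming the default corosync ports (5404/5405); replace vmbr0 with whatever interface carries your cluster traffic:

--------------------------------
tcpdump -ni vmbr0 'udp port 5404 or udp port 5405'
--------------------------------

Gaps or retransmits here while the node is loaded would point at the network rather than corosync itself.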
 
Where do you get that from? Multicast is required for corosync operations. Multicast is not synonymous with IPv6. IPv6 does rely on multicast for NDP, a protocol that replaces ARP with multicast operations at the link layer. IPv4 also supports multicast, depending on your switching gear.

We had the same problem for many months: the cluster lost a node or two while Ceph was still working. I couldn't find any reason for it, other than that my IPv6 was not configured.
I searched the net for a long time and somewhere I found that statement. After I configured IPv6 on all my nodes the problem disappeared and never happened again, even though there is more load on the cluster now than before.
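Roughly what I added on each node in /etc/network/interfaces (the addresses are only placeholders, and vmbr0 is just an example name; I put it on whatever interface carries the cluster traffic):

--------------------------------
iface vmbr0 inet6 static
        address fd00:0:0:1::2
        netmask 64
--------------------------------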
 
@dmora hmmm, interesting findings! I can try this out and hopefully get the same result as yours! :)

But you configured IPv6 on the node's interface that is used by the virtual bridge (e.g. virt0), correct? Or did you configure it on an additional network interface (i.e. eth2)?
 
