Nodes going offline

Paulo Maligaya

Hi,

I have a cluster setup with 2 nodes (proxmox01 and proxmox02). Proxmox01 is my primary node and proxmox02 is my second.

What I noticed is that every time proxmox02 gets a bit loaded (system load reaches around 2.0) because of the "kvm" process, it automatically gets disconnected from the cluster. Luckily, all guest VMs running on this node stay online (and reachable); the node just disconnects itself from the cluster -- when I do "pvecm status" there is only one member of the cluster, and in the PVE management web UI proxmox02 shows up marked in red.
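In case it helps, this is roughly what I check on both nodes when it happens (nothing fancy, just the standard tools):

--------------------------------
# cluster membership and quorum as Proxmox sees it
pvecm status

# the same information straight from corosync
corosync-quorumtool -s

# local ring / totem status
corosync-cfgtool -s
--------------------------------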

I dug through the corosync and pve-cluster logs, and so far these are the only error messages that I can correlate with the issue:

--------------------------------
Oct 05 09:59:11 proxmox02 corosync[65732]: [TOTEM ] A processor failed, forming new configuration.
Oct 05 09:59:12 proxmox02 corosync[65732]: [TOTEM ] A new membership (192.168.0.1:43544) was formed. Members joined: 1 left: 1
Oct 05 09:59:12 proxmox02 corosync[65732]: [TOTEM ] Failed to receive the leave message. failed: 1
Oct 05 09:59:15 proxmox02 corosync[65732]: [TOTEM ] FAILED TO RECEIVE
Oct 05 09:59:16 proxmox02 corosync[65732]: [TOTEM ] A new membership (192.168.0.2:43548) was formed. Members left: 1
Oct 05 09:59:16 proxmox02 corosync[65732]: [TOTEM ] Failed to receive the leave message. failed: 1
Oct 05 09:59:16 proxmox02 corosync[65732]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Oct 05 09:59:16 proxmox02 corosync[65732]: [QUORUM] Members[1]: 2
Oct 05 09:59:16 proxmox02 corosync[65732]: [MAIN ] Completed service synchronization, ready to provide service.

Oct 5 15:21:27 proxmox02 corosync[100488]: [MAIN ] Completed service synchronization, ready to provide service.
Oct 5 15:21:28 proxmox02 pmxcfs[100284]: [main] notice: teardown filesystem
Oct 5 15:21:29 proxmox02 corosync[100488]: [TOTEM ] A new membership (192.168.0.2:51724) was formed. Members
Oct 5 15:21:29 proxmox02 corosync[100488]: [QUORUM] Members[1]: 2
Oct 5 15:21:29 proxmox02 corosync[100488]: [MAIN ] Completed service synchronization, ready to provide service.
Oct 5 15:21:30 proxmox02 pve-ha-lrm[3103]: ipcc_send_rec failed: Transport endpoint is not connected
Oct 5 15:21:30 proxmox02 pve-ha-lrm[3103]: ipcc_send_rec failed: Connection refused
Oct 5 15:21:30 proxmox02 pve-ha-lrm[3103]: ipcc_send_rec failed: Connection refused

--------------------------------
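(For reference, I'm pulling these with something like the following; the time window is just an example:)

--------------------------------
journalctl -u corosync -u pve-cluster --since "1 hour ago"
--------------------------------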

It also comes back up and automatically rejoins the cluster as soon as the server load returns to normal.

Is there an underlying issue that I should look into further? Has anyone had this issue before?

TIA!
 
Hah! I bet you'd say that. :)

No, the cluster address is on a bridge interface. This way I can provide dedicated IPs to the guest VMs. Besides, as per the docs, a dedicated NIC is only needed if you use shared storage, which is not my setup.
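For context, the relevant part of my /etc/network/interfaces looks roughly like this (bridge name, ports and addresses are simplified examples):

--------------------------------
auto vmbr0
iface vmbr0 inet static
        address 192.168.0.2
        netmask 255.255.255.0
        bridge_ports eth0
        bridge_stp off
        bridge_fd 0
--------------------------------

So the node's own IP (the one corosync uses) sits on the bridge, and the guest VMs attach to the same bridge.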
 
@proxtest I'm not really sure what the relation is between having IPv6 on my Proxmox cluster and a node getting disconnected from the cluster when the system load increases. Could you share some details? TIA
 
Proxmox needs IPv6 also. Is it working?

Where do you get that from? Multicast is required for corosync operations. Multicast is not synonymous with IPv6. IPv6 does rely on multicast for NDP, a protocol that replaces ARP with multicast operations at the link layer. IPv4 also supports multicast, depending on your switching gear.
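If you want to verify that multicast between the two nodes actually holds up, including under load, the omping test from the Proxmox cluster docs is a good start. Run it on both nodes at the same time (node names taken from your post):

--------------------------------
# quick burst test
omping -c 10000 -i 0.001 -F -q proxmox01 proxmox02

# longer test, runs for about 10 minutes
omping -c 600 -i 1 -q proxmox01 proxmox02
--------------------------------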
 
Hah! I bet you'd say that. :)

No, the cluster address is on a bridge interface. This way I can provide dedicated IPs to the guest VMs. Besides, as per the docs, a dedicated NIC is only needed if you use shared storage, which is not my setup.

Tcpdump your primary NICs and also check for packet loss.
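Something along these lines, assuming the default corosync ports (5404/5405); replace vmbr0 with whatever interface carries your cluster traffic:

--------------------------------
tcpdump -ni vmbr0 'udp port 5404 or udp port 5405'
--------------------------------

Gaps or retransmits here while the node is loaded would point at the network rather than corosync itself.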
 
Where do you get that from? Multicast is required for corosync operations. Multicast is not synonymous with IPv6. IPv6 does rely on multicast for NDP, a protocol that replaces ARP with multicast operations at the link layer. IPv4 also supports multicast, depending on your switching gear.

We had the same problem for many months: the cluster lost a node or two while Ceph was still working. I couldn't find any reason for it, other than that my IPv6 was not configured.
I searched the net for a long time and somewhere I found that statement. After I configured IPv6 on all my nodes the problem disappeared and never happened again, even though there is more load on the cluster now than before.
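Roughly what I added on each node in /etc/network/interfaces (the addresses are only placeholders, and vmbr0 is just an example name; I put it on whatever interface carries the cluster traffic):

--------------------------------
iface vmbr0 inet6 static
        address fd00:0:0:1::2
        netmask 64
--------------------------------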
 
@dmora hmmm, interesting findings! I can try this out and hopefully get the same result as yours! :)

But you configured IPv6 on the node's interface that is used by the virtual bridge (e.g. virt0), correct? Or did you configure it on an additional network interface (i.e. eth2)?
 
