Proxmox VE 6 - Cluster nodes suddenly went offline

Jul 23, 2019
I am still new to Proxmox, hope you can help me…

A couple of days ago I created a cluster with two nodes, both running PVE 6.0-4.

The first node has a VM and a CT just for testing purposes, the second node is "empty".
The cluster was created on node1, and node2 joined successfully.

Until today nobody worked on the nodes, their configuration, or the network side; nobody even logged in to Proxmox.

Today node1 shows node2 as offline, and vice versa: node2 shows node1 as offline.

Here is the relevant part of the syslog on node2:

Code:
Jul 24 11:16:00 proxmox2 systemd[1]: pvesr.service: Succeeded.
Jul 24 11:16:00 proxmox2 systemd[1]: Started Proxmox VE replication runner.
Jul 24 11:17:00 proxmox2 systemd[1]: Starting Proxmox VE replication runner...
Jul 24 11:17:00 proxmox2 systemd[1]: pvesr.service: Succeeded.
Jul 24 11:17:00 proxmox2 systemd[1]: Started Proxmox VE replication runner.
Jul 24 11:17:01 proxmox2 CRON[188989]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Jul 24 11:17:50 proxmox2 corosync[56583]:   [KNET  ] link: host: 1 link: 0 is down
Jul 24 11:17:50 proxmox2 corosync[56583]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jul 24 11:17:50 proxmox2 corosync[56583]:   [KNET  ] host: host: 1 has no active links
Jul 24 11:17:51 proxmox2 corosync[56583]:   [TOTEM ] Token has not been received in 216 ms
Jul 24 11:17:51 proxmox2 corosync[56583]:   [KNET  ] rx: host: 1 link: 0 is up
Jul 24 11:17:51 proxmox2 corosync[56583]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jul 24 11:17:51 proxmox2 corosync[56583]:   [KNET  ] host: host: 1 has no active links
Jul 24 11:17:52 proxmox2 corosync[56583]:   [TOTEM ] A new membership (2:164) was formed. Members left: 1
Jul 24 11:17:52 proxmox2 corosync[56583]:   [TOTEM ] Failed to receive the leave message. failed: 1
Jul 24 11:17:53 proxmox2 corosync[56583]:   [TOTEM ] A new membership (2:168) was formed. Members
Jul 24 11:17:53 proxmox2 corosync[56583]:   [CPG   ] downlist left_list: 1 received
Jul 24 11:17:53 proxmox2 corosync[56583]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
Jul 24 11:17:53 proxmox2 pmxcfs[56591]: [dcdb] notice: members: 2/56591
Jul 24 11:17:53 proxmox2 corosync[56583]:   [QUORUM] Members[1]: 2
Jul 24 11:17:53 proxmox2 corosync[56583]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jul 24 11:17:53 proxmox2 pmxcfs[56591]: [status] notice: node lost quorum
Jul 24 11:17:53 proxmox2 pmxcfs[56591]: [status] notice: members: 2/56591
Jul 24 11:17:54 proxmox2 corosync[56583]:   [TOTEM ] A new membership (2:172) was formed. Members
Jul 24 11:17:54 proxmox2 corosync[56583]:   [CPG   ] downlist left_list: 0 received
Jul 24 11:17:54 proxmox2 corosync[56583]:   [QUORUM] Members[1]: 2
Jul 24 11:17:54 proxmox2 corosync[56583]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jul 24 11:17:56 proxmox2 corosync[56583]:   [TOTEM ] A new membership (2:176) was formed. Members
Jul 24 11:17:56 proxmox2 corosync[56583]:   [CPG   ] downlist left_list: 0 received
Jul 24 11:17:56 proxmox2 corosync[56583]:   [QUORUM] Members[1]: 2
Jul 24 11:17:56 proxmox2 corosync[56583]:   [MAIN  ] Completed service synchronization, ready to provide service.

I can see that something happened at 11:17:50 and the cluster was left with a single member… but what exactly happened? How can I solve the issue and make the nodes see each other again?
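
In case it is useful, these are the commands I understand can show the cluster state (just a sketch based on the standard PVE 6 tooling; I have not pasted their output here):

Code:
# Quorum and membership as Proxmox sees it
pvecm status

# Corosync's own view of its links
corosync-cfgtool -s

# Recent corosync log entries
journalctl -u corosync --since "-1h"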

Thanks in advance!

PS: the servers are both on OVH (dedicated, on the same farm)
 
PS: the servers are both on OVH (dedicated, on the same farm)
What does "dedicated, on the same farm" mean, especially from a network point of view?
 
I mean that both are dedicated servers provided by OVH, located in the same datacenter.
But they don't run on a private switch? Please also post your network config.
 
Hello Alwin,

here is the network config from /etc/network/interfaces on both nodes.

Node1
Code:
auto lo
iface lo inet loopback
iface eno1 inet manual

auto vmbr0
iface vmbr0 inet static
    address  x.y.z.4
    netmask  24
    gateway  x.y.z.254
    bridge-ports eno1
    bridge-stp off
    bridge-fd 0

Node2
Code:
auto lo
iface lo inet loopback

auto vmbr0
iface vmbr0 inet static
    address a.b.c.65/24
    gateway a.b.c.254
    bridge_ports eno3
    bridge_stp off
    bridge_fd 0
 
You run all the traffic through one interface; this alone can interfere with corosync traffic and may result in the issue above. The public switch connecting the two nodes is another interference factor, since every other customer can hog its resources. And as a third possibility, maintenance done by OVH will also interfere with the operation of the cluster.
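
As a rough sketch of the alternative (interface names and addresses are placeholders; this assumes a second NIC or an OVH vRack is available), a dedicated corosync link on each node could look like this in /etc/network/interfaces:

Code:
# Node1 - private link used only for corosync
auto eno2
iface eno2 inet static
    address 10.10.10.1/24
# Node2 would use e.g. 10.10.10.2/24 on its second NIC

The ring0_addr entries in /etc/corosync/corosync.conf would then need to point at these addresses (increase config_version and restart corosync on both nodes afterwards). Corosync is very latency sensitive; the link quality between the nodes can be checked with omping, run on both nodes at the same time:

Code:
omping -c 10000 -i 0.001 -F -q node1 node2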
 
