Proxmox VE 6 - Cluster nodes suddenly went offline

Jul 23, 2019
I am still new to Proxmox, hope you can help me…

A couple of days ago I created a cluster with two nodes, both running PVE 6.0-4.

The first node has a VM and a CT just for testing purposes, the second node is "empty".
The cluster was created on node1, and node2 joined successfully.

Until today nobody worked on the nodes, their configuration, or the network side; nobody even logged in to Proxmox.

Today node1 shows node2 as offline, and vice versa: node2 shows node1 as offline.

Here is the relevant part of the syslog on node2:

Code:
Jul 24 11:16:00 proxmox2 systemd[1]: pvesr.service: Succeeded.
Jul 24 11:16:00 proxmox2 systemd[1]: Started Proxmox VE replication runner.
Jul 24 11:17:00 proxmox2 systemd[1]: Starting Proxmox VE replication runner...
Jul 24 11:17:00 proxmox2 systemd[1]: pvesr.service: Succeeded.
Jul 24 11:17:00 proxmox2 systemd[1]: Started Proxmox VE replication runner.
Jul 24 11:17:01 proxmox2 CRON[188989]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Jul 24 11:17:50 proxmox2 corosync[56583]:   [KNET  ] link: host: 1 link: 0 is down
Jul 24 11:17:50 proxmox2 corosync[56583]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jul 24 11:17:50 proxmox2 corosync[56583]:   [KNET  ] host: host: 1 has no active links
Jul 24 11:17:51 proxmox2 corosync[56583]:   [TOTEM ] Token has not been received in 216 ms
Jul 24 11:17:51 proxmox2 corosync[56583]:   [KNET  ] rx: host: 1 link: 0 is up
Jul 24 11:17:51 proxmox2 corosync[56583]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jul 24 11:17:51 proxmox2 corosync[56583]:   [KNET  ] host: host: 1 has no active links
Jul 24 11:17:52 proxmox2 corosync[56583]:   [TOTEM ] A new membership (2:164) was formed. Members left: 1
Jul 24 11:17:52 proxmox2 corosync[56583]:   [TOTEM ] Failed to receive the leave message. failed: 1
Jul 24 11:17:53 proxmox2 corosync[56583]:   [TOTEM ] A new membership (2:168) was formed. Members
Jul 24 11:17:53 proxmox2 corosync[56583]:   [CPG   ] downlist left_list: 1 received
Jul 24 11:17:53 proxmox2 corosync[56583]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
Jul 24 11:17:53 proxmox2 pmxcfs[56591]: [dcdb] notice: members: 2/56591
Jul 24 11:17:53 proxmox2 corosync[56583]:   [QUORUM] Members[1]: 2
Jul 24 11:17:53 proxmox2 corosync[56583]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jul 24 11:17:53 proxmox2 pmxcfs[56591]: [status] notice: node lost quorum
Jul 24 11:17:53 proxmox2 pmxcfs[56591]: [status] notice: members: 2/56591
Jul 24 11:17:54 proxmox2 corosync[56583]:   [TOTEM ] A new membership (2:172) was formed. Members
Jul 24 11:17:54 proxmox2 corosync[56583]:   [CPG   ] downlist left_list: 0 received
Jul 24 11:17:54 proxmox2 corosync[56583]:   [QUORUM] Members[1]: 2
Jul 24 11:17:54 proxmox2 corosync[56583]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jul 24 11:17:56 proxmox2 corosync[56583]:   [TOTEM ] A new membership (2:176) was formed. Members
Jul 24 11:17:56 proxmox2 corosync[56583]:   [CPG   ] downlist left_list: 0 received
Jul 24 11:17:56 proxmox2 corosync[56583]:   [QUORUM] Members[1]: 2
Jul 24 11:17:56 proxmox2 corosync[56583]:   [MAIN  ] Completed service synchronization, ready to provide service.

I can see that something happened at 11:17:50 and the cluster was left with a single member… but what exactly happened? How can I solve the issue and make the nodes see each other again?
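
In case it is useful, these are the commands I understand can show the cluster state (just a sketch based on the standard PVE 6 tooling; I have not pasted their output here):

Code:
# Quorum and membership as Proxmox sees it
pvecm status

# Corosync's own view of its links
corosync-cfgtool -s

# Recent corosync log entries
journalctl -u corosync --since "-1h"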

Thanks in advance!

PS: the servers are both on OVH (dedicated, on the same farm)
 
PS: the servers are both on OVH (dedicated, on the same farm)
What does "dedicated, on the same farm" mean, especially from a network point of view?
 
I mean that both are dedicated servers provided by OVH, located in the same datacenter.
But they don't run on a private switch? Please also post your network config.
 
Hello Alwin,

here is the network config from /etc/network/interfaces on both nodes.

Node1
Code:
auto lo
iface lo inet loopback
iface eno1 inet manual

auto vmbr0
iface vmbr0 inet static
    address  x.y.z.4
    netmask  24
    gateway  x.y.z.254
    bridge-ports eno1
    bridge-stp off
    bridge-fd 0

Node2
Code:
auto lo
iface lo inet loopback

auto vmbr0
iface vmbr0 inet static
    address a.b.c.65/24
    gateway a.b.c.254
    bridge_ports eno3
    bridge_stp off
    bridge_fd 0
 
You run all the traffic through one interface; this alone can interfere with corosync traffic and may result in the issue above. The public switch connecting the two nodes is another interference factor, since every other customer can hog its resources. And as a third possibility, maintenance done by OVH will also interfere with the operation of the cluster.
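
As a rough sketch of the alternative (interface names and addresses are placeholders; this assumes a second NIC or an OVH vRack is available), a dedicated corosync link on each node could look like this in /etc/network/interfaces:

Code:
# Node1 - private link used only for corosync
auto eno2
iface eno2 inet static
    address 10.10.10.1/24
# Node2 would use e.g. 10.10.10.2/24 on its second NIC

The ring0_addr entries in /etc/corosync/corosync.conf would then need to point at these addresses (increase config_version and restart corosync on both nodes afterwards). Corosync is very latency sensitive; the link quality between the nodes can be checked with omping, run on both nodes at the same time:

Code:
omping -c 10000 -i 0.001 -F -q node1 node2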
 
