3 node cluster rebooted at the same time with corosync issues.

Aug 7, 2020
Hello, we have a client with a 3-node Ceph cluster whose servers all rebooted at the same time. Other Proxmox clusters in the rack were unaffected, so we don't think it's a switch issue. We're running the latest Proxmox 6.2 with corosync v3 and libknet 1.16. We believe the nodes were all fenced at the same time due to some networking blip. Any thoughts?


Code:
Aug  5 20:46:44 proxmox1 corosync[1972]:   [KNET  ] rx: host: 3 link: 0 is up
Aug  5 20:46:44 proxmox1 corosync[1972]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Aug  5 20:46:44 proxmox1 corosync[1972]:   [TOTEM ] A new membership (1.d66) was formed. Members
Aug  5 20:46:45 proxmox1 pmxcfs[1856]: [status] notice: cpg_send_message retry 80
Aug  5 20:46:46 proxmox1 pmxcfs[1856]: [status] notice: cpg_send_message retry 90
Aug  5 20:46:46 proxmox1 corosync[1972]:   [TOTEM ] A new membership (1.d6a) was formed. Members
Aug  5 20:46:47 proxmox1 pmxcfs[1856]: [status] notice: cpg_send_message retry 100
Aug  5 20:46:47 proxmox1 pmxcfs[1856]: [status] notice: cpg_send_message retried 100 times
Aug  5 20:46:47 proxmox1 pmxcfs[1856]: [status] crit: cpg_send_message failed: 6
Aug  5 20:46:48 proxmox1 pmxcfs[1856]: [status] notice: cpg_send_message retry 10
Aug  5 20:46:48 proxmox1 corosync[1972]:   [TOTEM ] A new membership (1.d6e) was formed. Members
Aug  5 20:46:49 proxmox1 pmxcfs[1856]: [status] notice: cpg_send_message retry 20
Aug  5 20:46:50 proxmox1 pmxcfs[1856]: [status] notice: cpg_send_message retry 30
Aug  5 20:46:50 proxmox1 corosync[1972]:   [TOTEM ] A new membership (1.d72) was formed. Members
Aug  5 20:46:51 proxmox1 pmxcfs[1856]: [status] notice: cpg_send_message retry 40
Aug  5 20:46:52 proxmox1 pmxcfs[1856]: [status] notice: cpg_send_message retry 50
Aug  5 20:46:52 proxmox1 corosync[1972]:   [TOTEM ] A new membership (1.d76) was formed. Members
Aug  5 20:46:53 proxmox1 pmxcfs[1856]: [status] notice: cpg_send_message retry 60
Aug  5 20:46:54 proxmox1 pmxcfs[1856]: [status] notice: cpg_send_message retry 70
Aug  5 20:46:54 proxmox1 corosync[1972]:   [TOTEM ] A new membership (1.d7a) was formed. Members
Aug  5 20:46:55 proxmox1 pmxcfs[1856]: [status] notice: cpg_send_message retry 80
Aug  5 20:46:56 proxmox1 pmxcfs[1856]: [status] notice: cpg_send_message retry 90
Aug  5 20:46:56 proxmox1 corosync[1972]:   [TOTEM ] A new membership (1.d7e) was formed. Members
Aug  5 20:46:57 proxmox1 pmxcfs[1856]: [status] notice: cpg_send_message retry 100
Aug  5 20:46:57 proxmox1 pmxcfs[1856]: [status] notice: cpg_send_message retried 100 times
Aug  5 20:46:57 proxmox1 pmxcfs[1856]: [status] crit: cpg_send_message failed: 6
Aug  5 20:46:57 proxmox1 pve-firewall[2078]: firewall update time (15.972 seconds)
Aug  5 20:46:58 proxmox1 pmxcfs[1856]: [status] notice: cpg_send_message retry 10
 
Hi,

as the log shows, there are no members left in this cluster.
Network connectivity can be lost for multiple reasons:
- NIC driver bug
- Network overload
- Switch problems

Can you provide a bit more information about your network config?
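For example, the output of these would already tell us a lot (just the usual places to look, adjust as needed):

Code:
cat /etc/network/interfaces
cat /etc/pve/corosync.conf
pveversion -v | grep -E 'pve-manager|corosync|libknet'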
 
Can you provide a bit more information about your network config?
We have a qfx5100-48t-6q and use QinQ to create separate VLAN environments for the 8 or so other Proxmox clusters in the rack. I'm including the interfaces file of one of the nodes for your reference.

Code:
cat /etc/network/interfaces
# network interface settings; autogenerated
# Please do NOT modify this file directly, unless you know what
# you're doing.
#
# If you want to manage parts of the network configuration manually,
# please utilize the 'source' or 'source-directory' directives to do
# so.
# PVE will preserve these directives, but will NOT read its network
# configuration from sourced files, so do not attempt to move any of
# the PVE managed interfaces into external files!

auto lo
iface lo inet loopback

iface enp33s0f0 inet manual

iface eno1 inet manual

iface eno2 inet manual

iface enp33s0f1 inet manual
    mtu 9000

auto enp33s0f1.1000
iface enp33s0f1.1000 inet static
    address 10.0.0.1/24
    mtu 9000

auto enp33s0f1.1001
iface enp33s0f1.1001 inet static
    address 10.0.10.1/24
    mtu 9000

auto vmbr0
iface vmbr0 inet static
    address ***public***
    gateway ***gateway***
    bridge-ports enp33s0f0
    bridge-stp off
    bridge-fd 0

auto vmbr1
iface vmbr1 inet static
    address 192.168.100.1/24
    bridge-ports enp33s0f1.1002
    bridge-stp off
    bridge-fd 0


I have been running a ping test and outputting to a log.
Node1 -> Node2
Node2 -> Node3
Node3 -> Node1
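
The test is roughly of this form, with timestamps so that any gap or spike shows up in the log afterwards (the 10.0.0.x address is assumed from the interfaces file above, and the exact options/paths are only an example):

Code:
# from node1: timestamped ping to node2's cluster address, written to a log
# -D prints a unix timestamp per reply, -i 0.2 sends five probes per second
ping -D -i 0.2 10.0.0.2 | tee -a /root/ping_node1_to_node2.log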

That same corosync error happened at 2:14 last night:

Code:
cat /var/log/syslog | grep corosync
Aug 13 02:14:12 proxmox1 corosync[1968]:   [TOTEM ] Token has not been received in 61 ms
Aug 13 02:14:12 proxmox1 corosync[1968]:   [TOTEM ] A processor failed, forming new configuration.
Aug 13 02:14:13 proxmox1 corosync[1968]:   [TOTEM ] A new membership (1.eec) was formed. Members
Aug 13 02:14:13 proxmox1 corosync[1968]:   [QUORUM] Members[3]: 1 2 3
Aug 13 02:14:13 proxmox1 corosync[1968]:   [MAIN  ] Completed service synchronization, ready to provide service.
Aug 13 02:14:42 proxmox1 corosync[1968]:   [KNET  ] link: host: 3 link: 0 is down
Aug 13 02:14:42 proxmox1 corosync[1968]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Aug 13 02:14:42 proxmox1 corosync[1968]:   [KNET  ] host: host: 3 has no active links
Aug 13 02:14:43 proxmox1 corosync[1968]:   [TOTEM ] Token has not been received in 1237 ms
Aug 13 02:14:43 proxmox1 corosync[1968]:   [KNET  ] rx: host: 3 link: 0 is up
Aug 13 02:14:43 proxmox1 corosync[1968]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)

However, there were no breaks in the ping test at all on any of the nodes. I've also attached dmesg as requested. Thanks guys.

*Edit: I'd like to add that we think the reboots come from the nodes fencing themselves after HA tries to move the VMs. We've disabled HA for now and the nodes haven't gone down since, but the underlying problem still seems to be there.
 


Can you please verify the following on all three servers:

Code:
pvecm status | grep -i Ring

I also faced this issue where the Ring ID was different on all the servers and they had all gone out of sync; I was only able to find out the reason later.
Just check at your end.
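
Something like the following, run from any one of the nodes, makes the comparison quick (the addresses are only placeholders for your three corosync IPs):

Code:
for host in 10.0.0.1 10.0.0.2 10.0.0.3; do
    echo "== $host =="
    ssh root@$host "pvecm status | grep -i 'Ring ID'"
done

All nodes should report the same Ring ID.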
 
Edit: I'd like to add that we think the reboots come from the nodes fencing themselves after HA tries to move the VMs. We've disabled HA for now and the nodes haven't gone down since, but the underlying problem still seems to be there.
This is not the source of the problem, only the result.
The problem is that you are losing quorum in the cluster, and yes, with HA enabled this ends in fenced nodes.
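
If you want to verify that HA is really idle now, the standard tools are enough (nothing specific to your setup assumed here):

Code:
ha-manager status    # quorum, current master and the LRM state of each node
ha-manager config    # the HA-managed resources; should be empty with HA disabled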

I have been running a ping test and outputting to a log.
This test should run for 24 hours, and you should also look for latency spikes.
The cluster can stop working if the latency on the network increases.
That is also why you should run it for a full 24 hours: there may be jobs on your network that kill the latency.
Generally, we recommend a dedicated network for corosync, and a VLAN is not dedicated.
A VLAN cannot guarantee the required maximum latency.
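
Corosync 3 with kronosnet also keeps its own per-link latency counters, which give a quick view of spikes without an external ping log. The exact key names can vary between versions, but something along these lines shows them:

Code:
corosync-cmapctl -m stats | grep -i latency
# keys typically look like stats.knet.nodeX.linkY.latency_ave / _min / _max (in microseconds)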
 
