cluster of 3 rebooted

weerok

Active Member
Jun 2, 2016
Hello,

Two days ago, the Proxmox cluster we are testing rebooted completely :(

Cluster components: 2 physical servers and 1 VM as quorum member
- 2x ProLiant DL585 G6 with iLO
- 1x VMware VM on another physical server as quorum VM
All 5 OS instances (3 Proxmox and 2 iLO) are NTP synchronized !! (double checked)
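Since clock drift between cluster nodes can masquerade as corosync trouble, the sync check is worth scripting per node. A minimal sketch, assuming a systemd host where `timedatectl` is available:

```shell
#!/bin/sh
# Quick NTP sanity check for one node: warn if the clock is not synced.
# Assumes a systemd host where `timedatectl` is available.
if timedatectl status | grep -qi 'synchronized: yes'; then
    echo "clock in sync"
else
    echo "clock NOT in sync -- fix NTP before trusting the cluster logs" >&2
fi
```

The same grep pattern matches both older ("NTP synchronized: yes") and newer ("System clock synchronized: yes") timedatectl output.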

Issue
- All 3 cluster members rebooted, without any reasonable cause
- It doesn't seem to be an HP hardware problem (MB/RAM/CPU), because one cluster member is a VM on VMware on a separate physical server.

iLO's status
Fans: Ok; Fully Redundant
Temperatures: Ok
VRMs: Ok
Power Supplies: Ok; Fully Redundant

iLO logs
Severity | Class | Last Update | Initial Update | Count | Description
Informational | iLO 2 | 11/26/2016 19:36 | 11/26/2016 19:36 | 1 | Server power restored.
Caution | iLO 2 | 11/26/2016 19:35 | 11/26/2016 19:35 | 1 | Server reset.

Version
proxmox-ve: 4.3-71 (running kernel: 4.4.21-1-pve)
pve-manager: 4.3-10 (running version: 4.3-10/7230e60f)
pve-kernel-4.4.6-1-pve: 4.4.6-48


Events in time order
- HP server1: Nov 26 19:33:08 corosync[2910]: [TOTEM ] A processor failed, forming new configuration.
Nov 26 19:33:10 corosync[2910]: [TOTEM ] Failed to receive the leave message. failed: 3
.....
Nov 26 19:33:50 corosync[2910]: [QUORUM] Members[3]: 1 2 3
Nov 26 19:33:50 corosync[2910]: [MAIN ] Completed service synchronization, ready to provide service.
Nov 26 19:33:50 corosync[2910]: [TOTEM ] Retransmit List: 21 22
Nov 26 19:34:24 corosync[2910]: [TOTEM ] Retransmit List: 1a9 1aa 1ab 1ac 1ad 1ae 1af 1b0
- HP server1: Nov 26 19:37:46 First HP server rebooted

- HP server2: Nov 26 19:33:10 corosync[22342]: [TOTEM ] A new membership (server1:532) was formed. Members left: 3
.....
- HP server2 last message: Nov 26 19:33:50 corosync[22342]: [MAIN ] Completed service synchronization, ready to provide service.
- HP server2 rebooted: Nov 26 19:37:50 systemd-modules-load[431]: Module 'fuse' is builtin

- VM server 3 : Nov 26 19:33:15 corosync[1347]: [TOTEM ] JOIN or LEAVE message was thrown away during flush operation.
- VM server 3 : Nov 26 19:33:16 corosync[1347]: [TOTEM ] A new membership (server1:536) was formed. Members joined: 1 2 left: 1 2
.....
- VM server 3 : Nov 26 19:39:10 corosync[1347]: [MAIN ] Corosync main process was not scheduled for 1397.3107 ms (threshold is 1320.0000 ms). Consider token timeout increase.
- VM server 3 rebooted: Nov 26 19:41:08 systemd-modules-load[286]: Module 'fuse' is builtin
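The last corosync message on the VM node hints at a likely trigger: the quorum VM was stalled for ~1.4 s, past the 1320 ms warning threshold, which (if I read the defaults right) is 80% of the effective 1650 ms token timeout of a 3-node cluster (1000 ms default token plus 650 ms token_coefficient per node beyond two). If that diagnosis holds, one mitigation is raising the token timeout. A minimal sketch for the totem section of /etc/pve/corosync.conf (the 5000 ms value is an illustrative assumption, and config_version must be bumped whenever this file is edited):

```
totem {
  # ... existing settings (cluster_name, version, interface, ...) unchanged ...
  # Raise the token timeout so short scheduling stalls on the quorum VM
  # no longer trigger a membership change (default token is 1000 ms).
  token: 5000
}
```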


Question
- What happened and what should we do?

I saw another thread with the same symptoms:
https://forum.proxmox.com/threads/softdog-reboots-while-having-quorum.30383/

[Image of network traffic attached: 2016-11-28_104135.jpg]
We check for cluster issues by searching the logs on our central rsyslog server for an 'A processor failed' line. That line is the one item that past cluster issues started with.
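That check boils down to a grep over the aggregated log; a minimal sketch, where the log path is an assumption about a typical rsyslog setup:

```shell
#!/bin/sh
# Alert if any node reported a corosync membership failure.
# LOG is an assumed rsyslog aggregation target, not a Proxmox default.
LOG=/var/log/syslog
if grep -q 'A processor failed, forming new configuration' "$LOG"; then
    echo "WARNING: corosync reported a failed processor -- check cluster health" >&2
fi
```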

We've had the issue once in the last 3-4 months. Before that it happened more often.

The last time, I was running too many backups and KVM migrations at the same time.

In the past we could trigger it reliably by having all nodes run backups at the same time, plus some other backups using rsync and zfs send/receive.
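Load like that can at least be throttled: vzdump reads its defaults from /etc/vzdump.conf, and a bandwidth cap there keeps simultaneous backups from starving corosync. A minimal sketch (the value is an illustrative assumption; staggering backup schedules across nodes helps as well):

```
# /etc/vzdump.conf -- defaults applied to all vzdump runs on this node.
# Cap backup I/O at ~100 MiB/s (value in KiB/s) so simultaneous backups
# cannot saturate the links and disks that corosync depends on.
bwlimit: 102400
```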
 
