cluster of 3 rebooted

weerok · Nov 28, 2016

Hello,

Two days ago, the Proxmox cluster we are testing rebooted completely.

Cluster components: 2 physical servers and 1 VM as quorum member
- 2x ProLiant DL585 G6 with iLO
- 1x VMware VM on another physical server as quorum VM
All 5 systems (3 Proxmox nodes and 2 iLOs) are NTP synchronized !! (double checked)

Issue
- All 3 cluster members rebooted without any apparent cause
- It doesn't seem to be an HP hardware problem (MB/RAM/CPU), because one cluster member is a VM on VMware on a separate physical server.

iLO's status
Fans: Ok; Fully Redundant
Temperatures: Ok
VRMs: Ok
Power Supplies: Ok; Fully Redundant

iLO logs
Severity Class Last Update Initial Update Count Description
Informational iLO 2 11/26/2016 19:36 11/26/2016 19:36 1 Server power restored.
Caution iLO 2 11/26/2016 19:35 11/26/2016 19:35 1 Server reset.

Version
proxmox-ve: 4.3-71 (running kernel: 4.4.21-1-pve)
pve-manager: 4.3-10 (running version: 4.3-10/7230e60f)
pve-kernel-4.4.6-1-pve: 4.4.6-48

Events in time order
- HP server1: Nov 26 19:33:08 corosync[2910]: [TOTEM ] A processor failed, forming new configuration.
Nov 26 19:33:10 corosync[2910]: [TOTEM ] Failed to receive the leave message. failed: 3
.....
Nov 26 19:33:50 corosync[2910]: [QUORUM] Members[3]: 1 2 3
Nov 26 19:33:50 corosync[2910]: [MAIN ] Completed service synchronization, ready to provide service.
Nov 26 19:33:50 corosync[2910]: [TOTEM ] Retransmit List: 21 22
Nov 26 19:34:24 corosync[2910]: [TOTEM ] Retransmit List: 1a9 1aa 1ab 1ac 1ad 1ae 1af 1b0
- HP server1: Nov 26 19:37:46 First HP server rebooted

- HP server2: Nov 26 19:33:10 corosync[22342]: [TOTEM ] A new membership (server1:532) was formed. Members left: 3
.....
- HP server2 last message: Nov 26 19:33:50 corosync[22342]: [MAIN ] Completed service synchronization, ready to provide service.
- HP server2 rebooted: Nov 26 19:37:50 systemd-modules-load[431]: Module 'fuse' is builtin

- VM server 3 : Nov 26 19:33:15 corosync[1347]: [TOTEM ] JOIN or LEAVE message was thrown away during flush operation.
- VM server 3 : Nov 26 19:33:16 corosync[1347]: [TOTEM ] A new membership (server1:536) was formed. Members joined: 1 2 left: 1 2
.....
- VM server 3 : Nov 26 19:39:10 corosync[1347]: [MAIN ] Corosync main process was not scheduled for 1397.3107 ms (threshold is 1320.0000 ms). Consider token timeout increase.
- VM server 3 rebooted: Nov 26 19:41:08 systemd-modules-load[286]: Module 'fuse' is builtin
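
The scheduling warning on VM server 3 states how long the corosync main process was stalled and what the threshold was. Below is a minimal Python sketch for pulling those two numbers out of such lines when going through larger logs. The function name `scheduling_overrun` is hypothetical, and the message format is assumed to match the line quoted above.

```python
import re

# Assumed message format, taken from the corosync line quoted above.
SCHED = re.compile(
    r"not scheduled for (?P<delay>[\d.]+) ms "
    r"\(threshold is (?P<threshold>[\d.]+) ms\)"
)

def scheduling_overrun(line):
    """Return (delay_ms, threshold_ms, overrun_ms), or None if the line does not match."""
    match = SCHED.search(line)
    if match is None:
        return None
    delay = float(match.group("delay"))
    threshold = float(match.group("threshold"))
    return delay, threshold, delay - threshold
```

Applied to the 19:39:10 line, this gives a delay of 1397.3107 ms against a threshold of 1320 ms, so the process overran the threshold by roughly 77 ms.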

Question
- What happened and what should we do?

I saw another question with the same symptoms:
https://forum.proxmox.com/threads/softdog-reboots-while-having-quorum.30383/

Image of Network traffic attached

RobFantini · Nov 28, 2016

We check for cluster issues by searching the logs on our central rsyslog server for an 'A processor failed' line. That line is the one thing our various past cluster issues all started with.
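
As a rough illustration of such a check, here is a minimal Python sketch. It is only a sketch: the function name `find_processor_failures` is hypothetical, and it assumes plain-text syslog lines in the same format as the corosync lines quoted earlier in this thread.

```python
import re

# Matches the corosync TOTEM message in the format quoted in this thread.
PATTERN = re.compile(r"corosync\[\d+\]:\s*\[TOTEM\s*\]\s*A processor failed")

def find_processor_failures(lines):
    """Return the log lines that contain the 'A processor failed' message."""
    return [line.rstrip("\n") for line in lines if PATTERN.search(line)]
```

It accepts any iterable of lines, so an open log file can be passed in directly, for example `find_processor_failures(open(path))` with `path` pointing at whatever file the central syslog server writes to.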

We've had the issue once in the last 3-4 months. Before that it happened more often.

The last time, I was running too many backups and KVM moves at the same time.

In the past we could trigger it by having all nodes run backups at the same time, plus some other backups using rsync and zfs send/receive.
