8 nodes cluster disaster

Ch@rlus

Renowned Member
Feb 14, 2013
Hey guys,

We're facing an issue with one of our Proxmox clusters.

We have an 8-node setup that was completely fenced for "no reason" (I assume there is a good reason for this, but not an obvious one).

This cluster has a 2-ring corosync network, and the two rings are independent, fully monitored networks => it *shouldn't* be an issue on this side.

There's nothing particular about this setup, except that one of the nodes (Compute004) had been off for a few days due to hardware maintenance. So we were running with 7 nodes "UP & running".
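
For context, here is a rough sketch of the quorum arithmetic for this situation. It assumes one vote per node and a simple majority rule; neither has been checked against our actual configuration, and the function names are just for illustration.

```python
def quorum_threshold(expected_votes: int) -> int:
    """Votes needed for a strict majority of expected_votes."""
    return expected_votes // 2 + 1


def is_quorate(online_votes: int, expected_votes: int) -> bool:
    """True if the online votes reach the majority threshold."""
    return online_votes >= quorum_threshold(expected_votes)


EXPECTED_VOTES = 8  # 8 nodes, assuming 1 vote each
ONLINE_VOTES = 7    # Compute004 was powered off

threshold = quorum_threshold(EXPECTED_VOTES)
print("votes needed for quorum:", threshold)
print("quorate with 7 nodes up:", is_quorate(ONLINE_VOTES, EXPECTED_VOTES))
print("further nodes we could lose:", ONLINE_VOTES - threshold)
```

Under those assumptions, having one node down should not by itself have cost us quorum.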

Here's my understanding of the events:

Around 14:09:24, all the nodes started to log "[TOTEM ] Retransmit" messages. These retransmits lasted nearly 1 min, and then the whole cluster crashed at 14:10:21. I assume that the quorum "failed" and tried to rebuild for 1 min, then the watchdog kicked in on all the nodes, which would explain why the whole cluster rebooted?
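
To line up the timeline across nodes, something like the sketch below could count the retransmit messages per node and per minute. The syslog line layout in the regex and the sample lines (hostnames `nodeA`/`nodeB`, date, PIDs) are assumptions for illustration, not copied from our logs, so the pattern may need adjusting.

```python
import re
from collections import Counter

# Assumed syslog shape (not taken from the real logs):
#   "<Mon> <day> <HH:MM:SS> <host> corosync[<pid>]: [TOTEM ] Retransmit ..."
LINE_RE = re.compile(
    r"^\w{3}\s+\d+\s+(\d{2}:\d{2}):\d{2}\s+(\S+)\s+"
    r"corosync\[\d+\]:\s+\[TOTEM\s*\]\s+Retransmit"
)


def retransmits_per_minute(lines):
    """Count TOTEM retransmit lines keyed by (host, 'HH:MM')."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            minute, host = match.group(1), match.group(2)
            counts[(host, minute)] += 1
    return counts


# Hypothetical sample lines, only to show the expected input shape.
SAMPLE = [
    "Jan  1 14:09:24 nodeA corosync[1234]: [TOTEM ] Retransmit List: 1a2 1a3",
    "Jan  1 14:09:25 nodeA corosync[1234]: [TOTEM ] Retransmit List: 1a2 1a3",
    "Jan  1 14:09:25 nodeB corosync[987]: [TOTEM ] Retransmit List: 1a2",
    "Jan  1 14:10:01 nodeB corosync[987]: [QUORUM] some other message",
]

for (host, minute), count in sorted(retransmits_per_minute(SAMPLE).items()):
    print(host, minute, count)
```

Running it over the syslog of each of the 7 nodes should show whether the retransmits really started at the same moment everywhere.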

I have no logs of any ring failure regarding corosync. But the corosync.conf uses the default "passive" rrp_mode. Could the network problem have been enough to make the quorum fail, and prevent it from rebuilding itself, without triggering the failover on the second ring?

If so, would it be safe to use the "active" value for rrp_mode, to ensure that both rings are used? I know the main part of the solution to this crash is to find & fix the network problem that caused the quorum loss, but we'd also like to ensure that if this problem happens again, our second ring "takes the wheel" properly.

PS: I can include some logs if that would be useful. I just don't know which logs, from which of the 7 servers, would help.
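
To double-check what each node is actually configured with, a small sketch like this could report the rrp_mode and the ring numbers found in a corosync.conf. The embedded sample config is a stripped-down, hypothetical stand-in (not our real file), and the parsing is a plain regex scan rather than a full corosync.conf parser.

```python
import re


def totem_ring_summary(conf_text):
    """Return the rrp_mode value and the ring numbers found in the config text."""
    # Ignore comment lines so a commented-out setting is not reported.
    lines = [ln for ln in conf_text.splitlines() if not ln.lstrip().startswith("#")]
    body = "\n".join(lines)
    mode = re.search(r"^\s*rrp_mode\s*:\s*(\S+)", body, re.MULTILINE)
    rings = re.findall(r"^\s*ringnumber\s*:\s*(\d+)", body, re.MULTILINE)
    return {
        "rrp_mode": mode.group(1) if mode else None,
        "rings": sorted({int(r) for r in rings}),
    }


# Hypothetical, minimal config used only to exercise the function.
SAMPLE_CONF = """
totem {
  version: 2
  # rrp_mode: active
  rrp_mode: passive
  interface {
    ringnumber: 0
  }
  interface {
    ringnumber: 1
  }
}
"""

print(totem_ring_summary(SAMPLE_CONF))
```

This only reports what is configured; it says nothing about whether "active" would have behaved better during the incident.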

Thanks !
 

Maybe the 2nd ring is not properly configured. Did you test it, e.g. by temporarily unplugging the 1st ring's cable?
 
Yep, I tried it, and it works fine (or at least it seems to).

Both rings also show as "working" when I run "corosync-cfgtool -s":

Code:
Printing ring status.
Local node ID 5
RING ID 0
    id    = 10.3.16.12
    status    = ring 0 active with no faults
RING ID 1
    id    = 10.3.17.12
    status    = ring 1 active with no faults
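
To run the same check on all nodes without reading the output by eye, a sketch along these lines could parse the text above. It assumes the output always follows the layout shown here and treats any status that does not contain "no faults" as suspect, which is a guess on my part rather than documented behaviour.

```python
import re


def parse_ring_status(output):
    """Parse 'corosync-cfgtool -s' style text into {ring_number: {'id': ..., 'status': ...}}."""
    rings = {}
    current = None
    for raw in output.splitlines():
        line = raw.strip()
        header = re.match(r"RING ID (\d+)$", line)
        if header:
            current = int(header.group(1))
            rings[current] = {}
            continue
        field = re.match(r"(id|status)\s*=\s*(.+)$", line)
        if field and current is not None:
            rings[current][field.group(1)] = field.group(2)
    return rings


def suspect_rings(rings):
    """Ring numbers whose status does not say 'no faults' (assumed heuristic)."""
    return [n for n, info in rings.items() if "no faults" not in info.get("status", "")]


# The output pasted above, reused as test input.
SAMPLE_OUTPUT = """Printing ring status.
Local node ID 5
RING ID 0
    id    = 10.3.16.12
    status    = ring 0 active with no faults
RING ID 1
    id    = 10.3.17.12
    status    = ring 1 active with no faults
"""

rings = parse_ring_status(SAMPLE_OUTPUT)
print(rings)
print("suspect rings:", suspect_rings(rings))
```

On a live node the text could come from `subprocess.run(["corosync-cfgtool", "-s"], capture_output=True, text=True).stdout` instead of the pasted sample.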
 
