The whole cluster restarted at same time

JesperAP · Sep 25, 2024

Yesterday between 00:00 and 04:00 all of our servers restarted at exactly the same time (the uptimes of all the servers are within 1 or 2 seconds of each other).

We've asked Equinix if there was any power disturbance but they couldn't find anything in their system.

Our NetApp, NAS and firewall didn't restart, so a power outage can be ruled out.

If I check the uptime of all of the nodes, they all restarted at 03:38.

I've found some log entries that mention a restart at 00:00:
https://pastebin.com/tkfYWPaY

Also, around 03:00 our firewall (FortiGate) had some HA flapping; during that time there were also a lot of log entries on all of the Proxmox nodes.
https://pastebin.com/jEta8m1n <-- this is a log file from one of the nodes

Can someone identify the problem based on the log file, or should I do some more digging? If so, where should I look? I am out of ideas.

Thanks for your help.

SourishCreature · Sep 25, 2024

In short, it looks like the FortiGate flapping caused it: you lost network connectivity on the host, then lost quorum, and the watchdog timer expired and rebooted the host:

Sep 24 03:35:49 pve04 watchdog-mux[1417]: client watchdog expired - disable watchdog updates
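
If you want to check whether every node logged the same event, here is a minimal Python sketch that pulls the timestamp and hostname out of lines like the one above. The regular expression is an assumption based only on that single quoted line, and the function name `watchdog_expiries` is made up for illustration, so adjust both to whatever your logs actually contain.

```python
import re

# Pattern derived from the one log line quoted in this thread (an assumption;
# other syslog formats may need a different timestamp expression).
WATCHDOG_EXPIRED = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) (?P<host>\S+) "
    r"watchdog-mux\[\d+\]: client watchdog expired"
)


def watchdog_expiries(lines):
    """Return (timestamp, host) for every 'client watchdog expired' line."""
    hits = []
    for line in lines:
        match = WATCHDOG_EXPIRED.match(line)
        if match:
            hits.append((match.group("stamp"), match.group("host")))
    return hits


sample = [
    "Sep 24 03:35:49 pve04 watchdog-mux[1417]: client watchdog expired - disable watchdog updates",
]
print(watchdog_expiries(sample))
```

Feeding it the saved log of each node should show whether the watchdog expired on all of them within the same few seconds.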

JesperAP · Sep 25, 2024

SourishCreature said:
In short, it looks like the FortiGate flapping caused it: you lost network connectivity on the host, then lost quorum, and the watchdog timer expired and rebooted the host:

Sep 24 03:35:49 pve04 watchdog-mux[1417]: client watchdog expired - disable watchdog updates

Can I disable that reboot when the node has no network connection? It doesn't sound like the right thing to do if you lose the network connection...

Because I just found out the FortiGate did an update around that time...
Since the firewall runs in HA, the nodes shouldn't have lost their network connection, should they?

Falk R. · Sep 25, 2024

JesperAP said:
Can I disable that reboot when the node has no network connection? It doesn't sound like the right thing to do if you lose the network connection...

Only if you don't want to have HA

JesperAP said:
Because I just found out the FortiGate did an update around that time...
Since the firewall runs in HA, the nodes shouldn't have lost their network connection, should they?

To me, this sounds like a faulty network design.
Corosync should always be able to connect via multiple networks. Multiple rings.
It is best to build a dedicated Layer2 network for Corosync Ring0.
If there is no dedicated link, then physically separate Layer2 networks should be used for Corosync, e.g. VM network and storage network.
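
As a rough way to check this on an existing cluster, the following Python sketch lists nodes that declare fewer than two `ringN_addr` entries in a corosync configuration. The node names and addresses in `SAMPLE_CONF` are hypothetical, the helper names are made up for this example, and the parser only understands the simple `node { key: value }` layout shown here. On Proxmox VE the real file would normally be /etc/pve/corosync.conf.

```python
import re

# Hypothetical two-node excerpt; names and addresses are placeholders.
SAMPLE_CONF = """\
nodelist {
  node {
    name: pve01
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.1
    ring1_addr: 10.20.20.1
  }
  node {
    name: pve02
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.10.10.2
  }
}
"""


def ring_addresses(conf_text):
    """Map each node name to the list of ringN_addr values it declares."""
    nodes = {}
    current = None  # key/value pairs of the node block being read, if any
    for raw in conf_text.splitlines():
        line = raw.strip()
        if line == "node {":
            current = {}
        elif line == "}" and current is not None:
            name = current.get("name", "?")
            nodes[name] = [
                value
                for key, value in sorted(current.items())
                if re.fullmatch(r"ring\d+_addr", key)
            ]
            current = None
        elif current is not None and ":" in line:
            key, value = line.split(":", 1)
            current[key.strip()] = value.strip()
    return nodes


def nodes_with_single_link(conf_text):
    """Return the names of nodes that have fewer than two Corosync links."""
    return [
        name
        for name, addrs in ring_addresses(conf_text).items()
        if len(addrs) < 2
    ]


print(nodes_with_single_link(SAMPLE_CONF))
```

Any node this reports depends on a single network for Corosync, so one switch or firewall event on that network is enough to cost it quorum.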

esi_y · Sep 25, 2024

JesperAP said:
Can I disable that reboot when the node has no network connection? It doesn't sound like the right thing to do if you lose the network connection...

Do you mean you want it to freeze instead, or to keep running? In the latter case, you do NOT want to use any HA, because you risk split-brain situations: your orphaned nodes cannot know whether they just lost network connectivity with each other, with the outside world, or both, or whether the other node(s) died. A node that does not know the state of the cluster is dangerous to leave running, because it risks, for instance, duplicate VMs being launched.
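
To make the quorum part concrete, here is a small Python sketch of the usual majority rule (more than half of all votes), assuming one vote per node and no special quorum options. The five-node cluster size is a hypothetical example, not something taken from this thread.

```python
def quorum_threshold(total_votes):
    """Smallest number of votes that is a strict majority."""
    return total_votes // 2 + 1


def is_quorate(reachable_votes, total_votes):
    """True if a partition holding reachable_votes still has a majority."""
    return reachable_votes >= quorum_threshold(total_votes)


TOTAL = 5  # hypothetical cluster size, one vote per node

# Every node cut off from all the others only sees its own vote.
print(is_quorate(1, TOTAL))
# A partition of three nodes would still hold a majority.
print(is_quorate(3, TOTAL))
```

If the network drops for all nodes at once, every node ends up in the first situation at the same moment, which would fit all of them being fenced within a second or two of each other.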

You may read more on the watchdogs (including how to disable them - if you do NOT use HA) here:
https://forum.proxmox.com/threads/high-availability-watchdog-reboots.154580/

JesperAP said:
Because I just found out the FortiGate did an update around that time...
Since the firewall runs in HA, the nodes shouldn't have lost their network connection, should they?

What does your /etc/pve/corosync.conf look like?
