The whole cluster restarted at the same time

JesperAP

New Member
Jun 18, 2024
Yesterday between 00:00 and 04:00 all of our servers restarted at exactly the same time (the uptimes of all servers are within 1 or 2 seconds of each other).

We've asked Equinix if there was any power disturbance but they couldn't find anything in their system.

Our NetApp, NAS and firewall didn't restart, so a power outage can be ruled out.

If I check the uptime of all of the nodes, they all restarted at 03:38.

I've found some log entries that mention a restart at 00:00:
https://pastebin.com/tkfYWPaY

Also, around 03:00 our firewall (FortiGate) had some HA flapping; during this time there were also a lot of log entries on all of the Proxmox nodes.
https://pastebin.com/jEta8m1n <-- this is a log file from one of the nodes


Can someone identify the problem based on the log file, or should I do some more digging? If so, where do I look? I am out of ideas.

Thanks for your help.
 
In short, it looks like the FortiGate flapping caused it: you lost network connectivity on the host, then lost quorum, and the watchdog timer expired and rebooted the host:

  Sep 24 03:35:49 pve04 watchdog-mux[1417]: client watchdog expired - disable watchdog updates
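
To check whether the other nodes show the same event at the same moment, the watchdog expiry can be pulled out of each node's saved log with a few lines of Python. This is only a sketch: the regular expression is modeled on the single log line quoted above, and the function name `find_watchdog_expiries` is made up for illustration.

```python
import re

# Pattern modeled on the one log line quoted in this thread; logs with a
# different timestamp format (for example ISO timestamps) need another regex.
WATCHDOG_LINE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) watchdog-mux\[\d+\]: client watchdog expired"
)


def find_watchdog_expiries(lines):
    """Return a (timestamp, host) tuple for every watchdog-mux expiry line."""
    hits = []
    for line in lines:
        match = WATCHDOG_LINE.match(line.strip())
        if match:
            hits.append((match.group("stamp"), match.group("host")))
    return hits


if __name__ == "__main__":
    sample = [
        "Sep 24 03:35:49 pve04 watchdog-mux[1417]: "
        "client watchdog expired - disable watchdog updates",
    ]
    for stamp, host in find_watchdog_expiries(sample):
        print(f"{host}: watchdog expired at {stamp}")
```

Feeding it the lines of a log file saved from each node (for instance the text behind the pastebin link) would show whether every node hit the expiry within the same few seconds.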
 
Can I disable that reboot when a node doesn't have a network connection? Rebooting doesn't sound like the right thing to do if you lose the network connection...

I ask because I just found out the FortiGate did an update around that time...
It shouldn't have lost the network connection, because of its HA, should it?
 
Can I disable that reboot when a node doesn't have a network connection? Rebooting doesn't sound like the right thing to do if you lose the network connection...
Only if you don't want to have HA.
I ask because I just found out the FortiGate did an update around that time...
It shouldn't have lost the network connection, because of its HA, should it?
To me, this sounds like a faulty network design.
Corosync should always be able to connect via multiple networks, i.e. multiple rings.
It is best to build a dedicated Layer 2 network for Corosync ring0.
If there is no dedicated link, then physically separate Layer 2 networks should be used for Corosync, e.g. the VM network and the storage network.
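
A quick way to see whether a cluster follows this advice is to count how many `ringN_addr` entries each node declares in its corosync configuration. The sketch below assumes the usual `nodelist { node { ... } }` layout; the node names and addresses in `SAMPLE` are hypothetical, and `links_per_node` is an illustrative helper, not part of any Proxmox or Corosync tool.

```python
import re

# Hypothetical configuration excerpt, NOT taken from the cluster in this thread.
SAMPLE = """
nodelist {
  node {
    name: pve01
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.1
    ring1_addr: 10.20.20.1
  }
  node {
    name: pve02
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.10.10.2
  }
}
"""


def links_per_node(corosync_conf_text):
    """Map each node name to the number of ringN_addr entries it declares."""
    result = {}
    # Each node is a flat "node { ... }" block inside the nodelist section.
    for block in re.findall(r"\bnode\s*\{([^{}]*)\}", corosync_conf_text):
        name = re.search(r"^\s*name:\s*(\S+)", block, re.MULTILINE)
        rings = re.findall(r"^\s*ring\d+_addr:\s*\S+", block, re.MULTILINE)
        if name:
            result[name.group(1)] = len(rings)
    return result


if __name__ == "__main__":
    for node, count in links_per_node(SAMPLE).items():
        note = "ok" if count >= 2 else "single link, no redundancy"
        print(f"{node}: {count} link(s) ({note})")
```

A node that reports only one link depends entirely on that one network path for cluster communication, which is the situation described above.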
 
Can I disable that reboot when a node doesn't have a network connection? Rebooting doesn't sound like the right thing to do if you lose the network connection...

Do you mean you want it to freeze instead, or to keep running? In the latter case, you do NOT want to use any HA, because you risk split-brain situations: your orphaned nodes cannot know whether they lost network connectivity with each other, with the outside world, or both, or whether the other node(s) died. A node that does not know the state of the cluster is dangerous to leave running, since it risks launching duplicate VMs, for instance.
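
The majority rule behind this can be sketched in a few lines. The example assumes one vote per node and a made-up cluster size of four (the thread does not state the real node count); `has_quorum` is an illustrative helper, not a Corosync API.

```python
def has_quorum(votes_visible, total_votes):
    """A partition is quorate only if it sees a strict majority of all votes."""
    return votes_visible > total_votes // 2


if __name__ == "__main__":
    total = 4  # hypothetical cluster size, one vote per node

    # Normal operation: every node sees all four votes.
    print("all connected:", has_quorum(4, total))

    # Even split: neither half has a majority, so neither may act alone.
    print("2 + 2 split:", has_quorum(2, total))

    # Network down everywhere: each node sees only its own vote.
    print("isolated node:", has_quorum(1, total))
```

If every node loses its cluster network at the same moment, each one sees only itself, none has quorum, and with HA active each node's watchdog ends up expiring independently. That would explain all nodes rebooting within a second or two of each other.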

You may read more on the watchdogs (including how to disable them - if you do NOT use HA) here:
https://forum.proxmox.com/threads/high-availability-watchdog-reboots.154580/

I ask because I just found out the FortiGate did an update around that time...
It shouldn't have lost the network connection, because of its HA, should it?

What does your /etc/pve/corosync.conf look like?
 
