pve watchdog ha-fencing reboots node without HA enabled.

tuxis

One of our clusters used to have HA enabled. It no longer does, yet it seems one of our PVE nodes was rebooted by watchdog-mux.
Looking at the logs, the machine rebooted at 15:31:10. Just 4 seconds earlier, this message was logged:
Jan 06 15:31:06 watchdog-mux[2482]: client watchdog expired - disable watchdog updates
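
For anyone who wants to check their own journal for the same pattern, here is a minimal Python sketch that finds the expiry message and measures the gap to the reboot time. The regular expression is derived only from the single log line above, and the function names and the `year` parameter are hypothetical (the journal line carries no year), so treat it as a starting point rather than a definitive parser.

```python
import re
from datetime import datetime

# Pattern based on the one expiry line quoted in this thread; other
# watchdog-mux versions may word the message differently (assumption).
EXPIRY_RE = re.compile(
    r"^(?P<stamp>\w{3} \d{2} \d{2}:\d{2}:\d{2}) "
    r"(?:\S+ )?watchdog-mux\[\d+\]: client watchdog expired"
)


def find_expiries(lines, year):
    """Return a datetime for every 'client watchdog expired' line.

    `year` must be supplied by the caller because the short journal
    timestamp format does not include one.
    """
    hits = []
    for line in lines:
        match = EXPIRY_RE.match(line)
        if match:
            stamp = f"{year} {match['stamp']}"
            hits.append(datetime.strptime(stamp, "%Y %b %d %H:%M:%S"))
    return hits


def seconds_until(expiry, reboot_hms):
    """Seconds between an expiry and a same-day reboot time (HH:MM:SS)."""
    reboot_time = datetime.strptime(reboot_hms, "%H:%M:%S").time()
    reboot = datetime.combine(expiry.date(), reboot_time)
    return (reboot - expiry).total_seconds()
```

Feeding it the output of `journalctl` for the previous boot (saved to a file and read line by line) should list every point at which watchdog-mux stopped updating the watchdog.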

About a minute later, once the reboot had completed, journalctl shows us this:
Jan 06 15:32:27 node10 kernel: NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
Jan 06 15:32:28 node10 systemd[1]: Started Proxmox VE watchdog multiplexer.
Jan 06 15:32:28 node10 watchdog-mux[2442]: Watchdog driver 'Software Watchdog', version 0
Jan 06 15:32:30 node10 corosync[3311]: [MAIN ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf vqsim nozzle snmp pie relro bindnow

We did notice network traffic was quite high, at roughly 9 Gbit/s, during this period. Hardware resource usage was below 60%.
Any idea what could've caused this sudden reboot?
 
Have you really removed all HA resources and configurations or perhaps just set the VMs to the state “ignored”?
 
All configuration and resources have been removed from HA.
Both resources.cfg and groups.cfg are empty as well.
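
A quick way to double-check the "empty" claim is a small Python sketch like the one below. It assumes the two files live under `/etc/pve/ha/`, and the helper name `active_entries` is made up for illustration. Note that empty files only show that nothing is configured; on their own they do not prove that the watchdog is no longer armed.

```python
from pathlib import Path

# Assumed location of the HA config files on a PVE node.
HA_DIR = Path("/etc/pve/ha")
HA_FILES = ("resources.cfg", "groups.cfg")


def active_entries(cfg_text):
    """Return the lines that are neither blank nor '#' comments."""
    return [
        line
        for line in cfg_text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    ]


if __name__ == "__main__":
    for name in HA_FILES:
        path = HA_DIR / name
        if not path.exists():
            print(f"{path}: missing")
            continue
        entries = active_entries(path.read_text())
        print(f"{path}: {len(entries)} active line(s)")
```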
 

Possibly because the node in question was the HA master at some point in the past and has not been rebooted since:
https://bugzilla.proxmox.com/show_bug.cgi?id=5243
 
