Getting rid of watchdog emergency node reboot

Drive arrays are all configured the same across our two clusters.

Dell BOSS cards (M.2) for booting Proxmox. This is a hardware (well, on-card) RAID1 of the two M.2 drives.

The disks are then connected to HBA330s and passed directly through to Proxmox. We then use Ceph for storage; the disks are all added to the same storage pool. Do you need our Ceph configuration?
 
Before we dig into that (still might not be the issue), let's just check the frequency and incidence.

Can you pull something like:

Code:
journalctl -S "1 year ago" | grep -e " mfi " -e "-- Boot"

From all nodes? Well, let's be reasonable: the 3 bad ones and one of the 10 good ones. Since they all have the same hardware config, a discrepancy would be something to investigate; if there is none, well ... let's see.

(EDIT: Added spacing around mfi.)
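
To compare nodes once each one's output has been saved to a file, a rough sketch along these lines could tally the boot markers and " mfi " lines per capture. The function name count_events and the per-node file names are made up for illustration; it only assumes each file holds the output of the journalctl/grep command above.

```shell
#!/bin/sh
# Hypothetical helper (not part of any tool): count boot markers and " mfi "
# lines in a file containing the saved output of the journalctl | grep pipeline.
count_events() {
    file=$1
    # grep -c prints the number of matching lines (0 if there are none)
    boots=$(grep -c -e "-- Boot" "$file")
    mfi=$(grep -c -e " mfi " "$file")
    echo "boots=$boots mfi=$mfi"
}
```

For example, after saving a node's output to a file such as pve04.txt (an assumed name), count_events pve04.txt prints a single line like boots=N mfi=M. If the bad nodes show " mfi " lines clustered before each boot and the good node shows none, that is the discrepancy worth chasing.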
 
mfi.txt attached from the 3 bad nodes and a good node.

It might also be worth noting that uptime (since the upgrade to current PVE 8) is 5 days with no reboots (yet) (touch wood).
 

pve01, pve02, pve03 were decommissioned shortly afterwards and removed from the cluster.

The cluster now consists of: pve04, pve05, pve06, pve07, pve08, storage01, storage02

Stable nodes are: pve04, pve05, storage01, storage02.

I got a little lost here: within the attached mfi.txt, which one is the good one?

I think it's the last one, -b (as opposed to the -br ones)? The good node had no reboots for a year?
 
Yeah, you're spot on, mate.

BR1 = Equinix BR1 data centre
B1 = NextDC B1 data centre

The last one was a good node, with no reboots.
 

Even if this got miraculously resolved now after the update, maybe just keep in the back of your mind that it might have been "mfi" related ... at least you have something to start with...