Getting rid of watchdog emergency node reboot

Drive arrays are all configured the same across our two clusters.

Dell M.2 BOSS cards for booting Proxmox. This is a hardware (well, on-card) RAID1 of the two M.2 drives.

The disks are then connected to HBA330s and passed directly through to Proxmox. We then use Ceph for storage; the disks are all added into the same storage pool. Do you need our Ceph configuration?
 
Before we dig into that (still might not be the issue), let's just check the frequency and incidence.

Can you pull something like:

Code:
journalctl -S "1 year ago" | grep -e " mfi " -e "-- Boot"

From all nodes? The 3 bad ones and, well, let's be reasonable, one of the 10 good ones? Since they all have the same hardware config, any discrepancy is something to investigate; if there is none, well ... let's see.

(EDIT: Added spacing around mfi.)
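
To make the comparison between nodes easier, the output could be boiled down to two counts per node. This is only a rough sketch: the function name summarize_journal, the NODES variable and root SSH access to each node are my assumptions, not something from your setup, and the patterns are simply the same two as in the command above.

```shell
#!/bin/sh
# Sketch: count journal lines containing " mfi " and the "-- Boot"
# separators that journalctl prints between boots. Reads text on stdin.
summarize_journal() {
    awk '
        / mfi /   { mfi++ }
        /-- Boot/ { boots++ }
        END { printf "mfi=%d boots=%d\n", mfi, boots }
    '
}

# NODES is a hypothetical, space-separated host list, for example
# NODES="pve04 pve06". If it is unset, the loop does nothing.
for node in ${NODES:-}; do
    printf '%s: ' "$node"
    # Assumes root SSH access to each node (an assumption).
    ssh "root@$node" 'journalctl -S "1 year ago"' | summarize_journal
done
```

If the bad nodes show many more mfi lines or boot markers than the good one, that is the discrepancy worth chasing.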
 
mfi.txt attached from the 3 bad nodes and a good node.

It might also be worth noting that uptime (since upgrade to pve8 current) is 5 days with no reboots (yet) (touch wood).
 

pve01, pve02 and pve03 were decommissioned shortly afterwards and removed from the cluster.

The cluster now consists of: pve04, pve05, pve06, pve07, pve08, storage01, storage02

Stable nodes are: pve04, pve05, storage01, storage02.

I got a little lost here: within the attached mfi.txt, which one is the good one?

I think the last one -b (as opposed to -br ones)? The good node had no reboots for a year?
 
yeah you're spot on mate.

BR1 = Equinix BR1 data centre
B1 = NextDC B1 data centre

the last one was a good node, and no reboots.
 
Even if this got miraculously resolved now after the update, maybe just keep in the back of your mind that it might have been "mfi" related ... at least you have something to start with...
 
