Getting rid of watchdog emergency node reboot

joshfinlay · Aug 21, 2024

Drive arrays are all configured the same across our two clusters.

Dell M.2 BOSS cards for booting Proxmox. This is a hardware (well, on-card) RAID1 of the two M.2 drives.

The disks are then connected to HBA330s and passed directly through to Proxmox. We then use Ceph for storage; the disks are all added to the same storage pool. Do you need our Ceph configuration?

esi_y · Aug 21, 2024

joshfinlay said:
Drive arrays are all configured the same across our two clusters.

Dell M.2 BOSS cards for booting Proxmox. This is a hardware (well, on-card) RAID1 of the two M.2 drives.

The disks are then connected to HBA330s and passed directly through to Proxmox. We then use Ceph for storage; the disks are all added to the same storage pool. Do you need our Ceph configuration?

Before we dig into that (still might not be the issue), let's just check the frequency and incidence.

Can you pull something like:

Code:

journalctl -S "1 year ago" | grep -e " mfi " -e "-- Boot"

From all nodes? The 3 bad ones and, well, let's be reasonable, one of the 10 good ones? Since they all share the same hardware config, a discrepancy would be something to investigate; if there is none, well ... let's see.

(EDIT: Added spacing around mfi.)
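
To compare frequency and incidence across nodes at a glance, the same two patterns can be counted instead of read line by line. This is only a rough sketch: summarize_journal is a made-up helper name, and it assumes the journal text arrives on stdin in the default journalctl output format, where each boot is marked by a "-- Boot" line.

```shell
# Hypothetical helper (not part of journalctl): reads journal text on stdin
# and prints how many boot markers and how many " mfi " lines it contains.
summarize_journal() {
    awk '
        /-- Boot/ { boots++ }   # boot separator lines printed by journalctl
        / mfi /   { mfi++ }     # lines mentioning mfi, with spaces around it
        END { printf "boots=%d mfi=%d\n", boots, mfi }
    '
}
```

On each node it could be fed with journalctl -S "1 year ago" | summarize_journal. A bad node showing clearly more boots or mfi lines than the good one would be the discrepancy worth looking into.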

joshfinlay · Aug 22, 2024

mfi.txt attached from the 3 bad nodes and a good node.

It might also be worth noting that uptime (since upgrade to pve8 current) is 5 days with no reboots (yet) (touch wood).

esi_y · Aug 22, 2024

joshfinlay said:
pve01, pve02, pve03 were then shortly decommissioned and removed from the cluster.

The cluster now consists of: pve04, pve05, pve06, pve07, pve08, storage01, storage02

Stable nodes are: pve04, pve05, storage01, storage02.

joshfinlay said:
mfi.txt attached from the 3 bad nodes and a good node.

I got a little lost here, within the attached mfi.txt, which one is the good one?

I think the last one -b (as opposed to -br ones)? The good node had no reboots for a year?

joshfinlay · Aug 23, 2024

esi_y said:
I got a little lost here, within the attached mfi.txt, which one is the good one?

I think the last one -b (as opposed to -br ones)? The good node had no reboots for a year?

yeah you're spot on mate.

BR1 = Equinix BR1 data centre
B1 = NextDC B1 data centre

the last one was a good node, and no reboots.

esi_y · Aug 23, 2024

joshfinlay said:
yeah you're spot on mate.

BR1 = Equinix BR1 data centre
B1 = NextDC B1 data centre

the last one was a good node, and no reboots.

Even if this got miraculously resolved now after the update, maybe just keep in the back of your mind that it might have been "mfi" related ... at least you have something to start with...
