Cluster frozen after one node / VM behaved badly

Glowsome · Aug 14, 2023

This just happened:

During a backup, one node in my cluster (Bookworm / Proxmox 8.0.4, latest release) started behaving weirdly, which effectively blocked my whole cluster:

Code:

Message from syslogd@node01 at Aug 14 01:11:59 ...
 kernel:[1233000.842855] watchdog: BUG: soft lockup - CPU#33 stuck for 1863s! [task UPIDnode0:3490731]

I have not seen this in the past, so this is a new thing for me.

In essence, from a GUI point of view, it broke the complete cluster: whichever node I pointed the GUI at, it told me that all nodes were broken. However, background/running tasks seemed to continue (at least in part).

Only after I forcefully rebooted that node did everything turn green again in the GUI.

Some research led me to https://forum.proxmox.com/threads/vm-cpu-issues-watchdog-bug-soft-lockup-cpu-7-stuck-for-22s.107379/

Comparing my setup with the solution provided there, the only point I had not applied was: async-io: threads
So am I (or are we) dealing with a re-introduced issue, or is this the new 'default' way things are meant to be?
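To check which VM disks are not using that setting, a minimal Python sketch like the one below could scan the text of a VM config (as found in /etc/pve/qemu-server/<vmid>.conf). It assumes the usual layout where each disk is a line such as `scsi0: <volume>,<option>,<option>` and that the GUI's "Async IO: threads" shows up as the per-disk option `aio=threads`. The function name and the example config (VMID 100, storage `local-lvm`) are hypothetical, for illustration only.

```python
import re

# Disk keys in a VM config, e.g. scsi0, virtio1, sata2, ide3.
# "scsihw" does not match because a number must follow the bus name.
DISK_KEY = re.compile(r"^(scsi|virtio|sata|ide)\d+$")


def disks_without_aio_threads(config_text: str) -> list[str]:
    """Return the disk keys whose options do not include aio=threads."""
    missing = []
    for raw in config_text.splitlines():
        line = raw.strip()
        # Snapshot sections start with "[name]"; stop there so only the
        # current VM state is checked.
        if line.startswith("["):
            break
        # Skip blank lines, description comments and anything not "key: value".
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        if not DISK_KEY.match(key):
            continue
        options = value.split(",")[1:]  # first field is the volume itself
        if "media=cdrom" in options:
            continue  # CD-ROM drives are not relevant for async IO here
        if "aio=threads" not in options:
            missing.append(key)
    return missing


# Hypothetical example config for illustration only.
example = """\
boot: order=scsi0
scsihw: virtio-scsi-pci
scsi0: local-lvm:vm-100-disk-0,aio=threads,size=32G
scsi1: local-lvm:vm-100-disk-1,size=64G
ide2: none,media=cdrom
"""

print(disks_without_aio_threads(example))  # ['scsi1']
```

This only reads the config text; it does not change anything on the node, so it is safe to run against copies of the config files before deciding whether to switch disks to threads.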

- Glowsome
