[SOLVED] node losing cluster connection at random intervals

May 22, 2025
Hello,

I am at a loss on how to troubleshoot this.

Running a 3 node PVE-Cluster with Ceph.
For a few days now, node02 has suddenly started rejecting "control-plane" traffic.
ceph-mon appears to be stopped, and at the same time the node accepts no traffic directed at itself except ping, which it still answers. SSH connections are not possible (reset by peer). In the GUI, node02 and all its VMs are shown with a '?'. When I try to change settings on node02 via the GUI, I get timeout errors.

Checking cluster connectivity shows that the node is not even connected:
Code:
pvesh get /cluster/status
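
Since the dropouts happen at random intervals, it may help to record the cluster state periodically from one of the healthy nodes, so the time of the dropout can later be matched against logs on node02. A minimal sketch follows; the function name `snapshot` and the log path are made up for illustration, and the 60 second interval is an arbitrary choice.

```shell
# Append one timestamped snapshot of a command's output to a log file.
# Usage: snapshot LOGFILE COMMAND [ARGS...]
# (helper name is hypothetical, not part of PVE)
snapshot() {
    logfile=$1
    shift
    {
        # Header line with the current time and the command that was run
        printf '=== %s: %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*"
        # Run the command, capturing stdout and stderr
        "$@" 2>&1
    } >> "$logfile"
}

# Example loop, to be run on a healthy node (log path is an assumption):
# while true; do
#     snapshot /root/cluster-watch.log pvecm status
#     snapshot /root/cluster-watch.log pvesh get /cluster/status
#     sleep 60
# done
```

The log is kept on a healthy node on purpose, because the affected node may no longer be able to write to its own disk when the problem occurs.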

So far only a hard reboot fixes this, and only temporarily.
While connected, the node just seems to be bored, with CPU at a meager 1% and RAM at about 15 - 20%.

EDIT:
And of course I forgot to mention that all VMs located on node02 are working fine :confused:
 
Update:
dmesg and attaching a monitor to the affected node indicate a failing drive.
The USB SSD that PVE was installed on gets reset, and the file system is then mounted in read-only mode.
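
For anyone hitting similar symptoms, here is a rough sketch of how one might check for this from the local console. The helper names `filter_disk_errors` and `list_ro_mounts` are hypothetical, and the grep patterns are only assumptions about typical kernel messages; the exact wording depends on kernel version and file system.

```shell
# Keep only kernel log lines that hint at a resetting or failing disk.
# Reads log text on stdin; the patterns are assumptions, adjust as needed.
filter_disk_errors() {
    grep -iE 'usb.*reset|i/o error|read-only'
}

# Print the mount points whose mount options include "ro".
# Reads a file in /proc/mounts format (defaults to /proc/mounts itself).
list_ro_mounts() {
    awk '{
        n = split($4, opts, ",")          # field 4 holds the mount options
        for (i = 1; i <= n; i++)
            if (opts[i] == "ro") { print $2; break }
    }' "${1:-/proc/mounts}"
}

# Example usage on the affected node (needs console access, since SSH is down):
# dmesg -T | filter_disk_errors
# list_ro_mounts
```

If the root file system shows up in the second list on a node that otherwise only answers ping, a dying boot device is a likely suspect.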