VM freeze after Ceph Node failed: suggestions?

thebadmachine

Hello!

We have a 4-node cluster with Proxmox and Ceph installed on each node.

The Ceph cluster runs on a separate 10G network and on 15K RPM HDDs.

We recently had one of the nodes fail, and all of the VMs running on that storage froze or kernel-panicked. After the Ceph storage rebuilt and recovered, everything was fine, but we were wondering whether some configuration issue caused the VMs to freeze instead of continuing to run. We have a 3/2 (size/min_size) pool for all of our storage.

If we have misinterpreted how Ceph works, please point us in the right direction or suggest any reasonable alternative. We want our VMs to keep running unless all of our nodes fail. (We have also set up HA between the nodes.)
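
For reference, here is roughly how the replication settings can be verified per pool (the pool name vm_pool below is just an example, not our actual pool name):

Code:
ceph osd pool get vm_pool size
ceph osd pool get vm_pool min_size

With size=3 and min_size=2, our understanding is that IO should keep flowing as long as at least two replicas of each placement group remain available.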
 
This should not have happened. Do you have any data, like screenshots of the Ceph panel in the GUI or the output of ceph -s, from while the cluster was in the state that caused the VMs to freeze?

How was the IO Delay? (Node -> Summary)

Without that, we can only guess why it happened.
Can you post the following information? Maybe we can spot a misconfiguration.

Code:
pveceph pool ls
ceph -s
ceph osd df tree
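
Should the freeze happen again, it would also help to capture the health state while it is ongoing, for example:

Code:
# snapshot of current warnings (e.g. slow ops, down OSDs)
ceph health detail
# follow the cluster log live while the problem is happening
ceph -w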
 
[Attached: three screenshots with the requested outputs.]


Our whole cluster started working again after we shut down the failing node (node04).
 
Okay, now this is interesting. How exactly was node4 failing? Apparently it was not just a simple power loss due to some HW error or the like.

You currently also seem to have OSD 16 and the MON on node3 reporting slow ops. You could try restarting those services to see if that makes the warning go away.
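
For example, something like this on node3 (assuming the MON ID matches the hostname, which is the default on Proxmox):

Code:
# restart the OSD reporting slow ops
systemctl restart ceph-osd@16.service
# restart the monitor on node3
systemctl restart ceph-mon@node3.service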
 
Some of the OSDs on node4 kept going out, and the monitor on that node no longer seemed to connect to the cluster. We ran a quick ping test from that node to the other nodes over the Ceph network, and node4 appeared to reach the others fine, so we assumed there was no network issue.

We removed the various Ceph services from node4 and kept its OSDs out while the node was disconnected from the network. When we put the node back on the network afterwards, it froze every VM on the Ceph storage again, so we had to shut it down once more.
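
In rough terms, what we did on node4 looked like the following (a sketch from memory; the OSD IDs here are examples, not our actual ones):

Code:
# stop all Ceph daemons on node4
systemctl stop ceph.target
# mark node4's OSDs out so data rebalances (example IDs; run from a healthy node)
ceph osd out 12
ceph osd out 13
# remove node4's monitor from the cluster (run from a healthy node)
pveceph mon destroy node4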

We assume the node has some kind of hardware failure related to the hard drives; however, in the server's IPMI, the RAID controller reports that every drive is fine. So we don't know whether we should get replacement HW or whether we can just reinstall and redeploy Proxmox and Ceph on it.

The issues with the monitor on node3 and OSD 16 have been resolved after letting the Ceph cluster rebuild.
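
For anyone hitting the same thing, the remaining warnings can be re-checked afterwards with:

Code:
ceph -s
ceph health detail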