VM freeze after Ceph Node failed: suggestions?

thebadmachine

Hello!

We have a 4-node cluster with Proxmox and Ceph installed on each node.

The Ceph cluster runs on a separate 10G network and on 15K RPM HDDs.

We recently had one of the nodes fail, and all of the VMs running on that storage froze or kernel-panicked. After the Ceph storage rebuilt and recovered, everything was fine, but we were wondering whether some configuration issue caused the VMs to freeze instead of continuing to run. We have a 3/2 (size/min_size) pool for all of our storage.

If we have misinterpreted how Ceph works, please point us in the right direction or suggest any reasonable alternative. We want our VMs to keep running unless all of our nodes fail. (We have also set up HA between the nodes.)
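
For reference, here is roughly how the replication settings can be verified per pool (the pool name vm_pool below is just an example, not our actual pool name):

Code:
ceph osd pool get vm_pool size
ceph osd pool get vm_pool min_size

With size=3 and min_size=2, our understanding is that IO should keep flowing as long as at least two replicas of each placement group remain available.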
 
This should not have happened. Do you have any data, like screenshots of the Ceph panel in the GUI or the output of ceph -s, from while the cluster was in the state that caused the VMs to freeze?

How was the IO Delay? (Node -> Summary)

Without that, we can only guess why it happened.
Can you post the following information? Maybe we can spot a misconfiguration.

Code:
pveceph pool ls
ceph -s
ceph osd df tree
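
Should the freeze happen again, it would also help to capture the health state while it is ongoing, for example:

Code:
# snapshot of current warnings (e.g. slow ops, down OSDs)
ceph health detail
# follow the cluster log live while the problem is happening
ceph -w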
 
[Attached: three screenshots with the requested outputs.]


Our whole cluster started working again after we shut down the failing node (node04).
 
Okay, now this is interesting. How exactly was node4 failing? Apparently it was not just a simple power loss due to some HW error or the like.

You currently also seem to have OSD 16 and the MON on node3 reporting slow ops. You could try restarting those services to see if that makes the warning go away.
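
For example, something like this on node3 (assuming the MON ID matches the hostname, which is the default on Proxmox):

Code:
# restart the OSD reporting slow ops
systemctl restart ceph-osd@16.service
# restart the monitor on node3
systemctl restart ceph-mon@node3.service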
 
Some of the OSDs on node4 kept going out, and the monitor on that node no longer seemed to connect to the cluster. We ran a quick ping test from that node to the other nodes over the Ceph network, and node4 appeared to reach the others fine, so we assumed there was no network issue.

We removed the various Ceph services from node4 and kept its OSDs out while the node was disconnected from the network. When we put the node back on the network afterwards, it froze every VM on the Ceph storage again, so we had to shut it down once more.
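
In rough terms, what we did on node4 looked like the following (a sketch from memory; the OSD IDs here are examples, not our actual ones):

Code:
# stop all Ceph daemons on node4
systemctl stop ceph.target
# mark node4's OSDs out so data rebalances (example IDs; run from a healthy node)
ceph osd out 12
ceph osd out 13
# remove node4's monitor from the cluster (run from a healthy node)
pveceph mon destroy node4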

We assume the node has some kind of hardware failure related to the hard drives; however, in the server's IPMI, the RAID controller reports that every drive is fine. So we don't know whether we should get replacement HW or whether we can just reinstall and redeploy Proxmox and Ceph on it.

The issues with the monitor on node3 and OSD 16 have been resolved after letting the Ceph cluster rebuild.
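
For anyone hitting the same thing, the remaining warnings can be re-checked afterwards with:

Code:
ceph -s
ceph health detail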