Proxmox Ceph Cluster problems after Node crash

rene.k

New Member
Oct 14, 2025
Hello all,

I have 2 Proxmox Nodes with:
  • CPU: 2x 18 Core Intel Xeon Gold 6154
  • RAM: 16x 32GB ECC DDR4 SDRAM (512 GB)
  • Storage: 10x 960GB Samsung PM93A NVMe

And a third Proxmox node, running as a VM, acting as a quorum device.

I created a Proxmox cluster with all three nodes and then installed Ceph on all three nodes.
I only want to replicate across two nodes, so in the configuration I set:
Code:
    osd_pool_default_min_size = 1
    osd_pool_default_size = 2
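
As a side note, the settings that actually apply to an existing pool can be checked on any node; this is just a sketch, and "ceph-vm" below is a placeholder for the real pool name (the config get commands assume a recent Ceph release with the config database):
Code:
    # per-pool values (replace "ceph-vm" with the actual pool name)
    ceph osd pool get ceph-vm size
    ceph osd pool get ceph-vm min_size
    # defaults used for newly created pools
    ceph config get osd osd_pool_default_size
    ceph config get osd osd_pool_default_min_size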

On all three nodes I created a Ceph Monitor, a Ceph Manager, and a Ceph Metadata Server.

Then I created OSDs from the NVMe drives on the two storage nodes and created a pool on them.


I can install VMs, import VMs, and migrate VMs between the two nodes. I can shut down and restart a node and everything works fine.


Then I started testing worst-case scenarios, and that is where I ran into a problem.
When I simulate a node crash with echo c > /proc/sysrq-trigger, there is a short cluster timeout and then I get some laggy PGs. I can't access any VMs, but the Ceph cluster still reports HEALTH_OK. When there are VMs on the crashed node, they are moved to the healthy node, so the HA rules are working.
When the crashed node is back online, everything works fine again.
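
For the next test, it would help to capture what Ceph actually reports while the node is down; a minimal set of diagnostic commands, run on the surviving node, would be something like:
Code:
    ceph status
    ceph health detail
    # list PGs that are stuck / not active+clean
    ceph pg dump_stuck
    # show which OSDs are marked down/out
    ceph osd tree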

I would expect the cluster to report a degraded state while everything keeps working, the same as when I shut down a node cleanly.


Does anyone have an idea why the cluster stops working after a node crash?


Thanks in advance!
René
 
Or, as a real-world example, here is what Proxmox developer @dcsapak wrote in an earlier discussion on these parameters:

https://docs.ceph.com/docs/master/rados/operations/pools/
says that

min_size:
Sets the minimum number of replicas required for I/O.

so no, this is actually the minimum number of replicas at which the pool can still write (so a 3/2 pool can drop to 2 replicas and still accept writes)

2/1 is generally a bad idea because it is very easy to lose data, e.g. bit rot on one disk while the other fails, flapping OSDs, etc.
The more OSDs you have, the more likely data loss becomes with this setting.

A more prominent example of how it can fail is this story from 2016:
https://blog.noc.grnet.gr/2016/10/18/surviving-a-ceph-cluster-outage-the-hard-way/
Even though they did not lose any (or much?) data, it was a lot of work to get the cluster working again.
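
For reference, size and min_size can be changed per pool at any time; this is only a sketch with "ceph-vm" as a placeholder pool name, and note that size 3 cannot actually be satisfied with only two OSD-carrying nodes under the default host failure domain:
Code:
    # switch a pool to the 3/2 layout discussed above
    ceph osd pool set ceph-vm size 3
    ceph osd pool set ceph-vm min_size 2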



Here is another old discussion: https://forum.proxmox.com/threads/cannot-create-ceph-pool-with-min_size-1-or-min_size-2.77676/