Proxmox Ceph Cluster problems after Node crash

rene.k

New Member
Oct 14, 2025
Hello all,

I have 2 Proxmox Nodes with:
  • CPU: 2x 18 Core Intel Xeon Gold 6154
  • RAM: 16x 32GB ECC DDR4 SDRAM (512 GB)
  • Storage: 10x 960GB Samsung PM93A NVMe

And a third Proxmox node, running as a VM, acting as a quorum device.

I created a Proxmox cluster with all three nodes and then installed Ceph on all three nodes.
I only want to replicate across two nodes, so in the configuration I set:
Code:
    osd_pool_default_min_size = 1
    osd_pool_default_size = 2
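
As a side note, the settings that actually apply to an existing pool can be checked on any node; this is just a sketch, and "ceph-vm" below is a placeholder for the real pool name (the config get commands assume a recent Ceph release with the config database):
Code:
    # per-pool values (replace "ceph-vm" with the actual pool name)
    ceph osd pool get ceph-vm size
    ceph osd pool get ceph-vm min_size
    # defaults used for newly created pools
    ceph config get osd osd_pool_default_size
    ceph config get osd osd_pool_default_min_size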

On all three nodes I created a Ceph Monitor, a Ceph Manager, and a Ceph Metadata Server.

Then I created OSDs from the NVMe drives on the two storage nodes and created a pool on them.


I can install VMs, import VMs, and migrate VMs between the two nodes. I can shut down and restart a node and everything works fine.


Then I started testing worst-case scenarios, and that is where I ran into a problem.
When I simulate a node crash with echo c > /proc/sysrq-trigger, there is a short cluster timeout and then I get some laggy PGs. I can't access any VMs, but the Ceph cluster still reports HEALTH_OK. When there are VMs on the crashed node, they are moved to the healthy node, so the HA rules are working.
When the crashed node is back online, everything works fine again.
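
For the next test, it would help to capture what Ceph actually reports while the node is down; a minimal set of diagnostic commands, run on the surviving node, would be something like:
Code:
    ceph status
    ceph health detail
    # list PGs that are stuck / not active+clean
    ceph pg dump_stuck
    # show which OSDs are marked down/out
    ceph osd tree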

I would expect the cluster to report a degraded state while everything keeps working, the same as when I shut down a node cleanly.


Does anyone have an idea why the cluster stops working after a node crash?


Thanks in advance!
René
 
Or, as a real-world example, here is what Proxmox developer @dcsapak wrote in an earlier discussion on these parameters:

https://docs.ceph.com/docs/master/rados/operations/pools/
says that

min_size:
Sets the minimum number of replicas required for I/O.

so no, this is actually the minimum number of replicas at which the pool can still write (so a 3/2 pool can drop to 2 replicas and still accept writes)

2/1 is generally a bad idea because it is very easy to lose data, e.g. bit rot on one disk while the other fails, flapping OSDs, etc.
The more OSDs you have, the more likely data loss becomes with this setting.

A more prominent example of how it can fail is this story from 2016:
https://blog.noc.grnet.gr/2016/10/18/surviving-a-ceph-cluster-outage-the-hard-way/
Even though they did not lose any (or much?) data, it was a lot of work to get the cluster working again.
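
For reference, size and min_size can be changed per pool at any time; this is only a sketch with "ceph-vm" as a placeholder pool name, and note that size 3 cannot actually be satisfied with only two OSD-carrying nodes under the default host failure domain:
Code:
    # switch a pool to the 3/2 layout discussed above
    ceph osd pool set ceph-vm size 3
    ceph osd pool set ceph-vm min_size 2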



Here is another old discussion: https://forum.proxmox.com/threads/cannot-create-ceph-pool-with-min_size-1-or-min_size-2.77676/