ceph never actually achieves desired redundancy

lewinernst

Member
Jul 31, 2021
I have set up a three-node Ceph cluster with 8 OSDs. My goal is to be able to tolerate one node failure without losing data access. To test this I put a couple of ISOs and VM backups (80 GB total) on the CephFS, waited until it reached steady state, and then shut down one node (gracefully). All my nodes are running the latest Ceph and PVE versions and I have kept the default CRUSH map. Unfortunately, this consistently results in "reduced data availability":

[Attachment: 1684269849836.png – Ceph health warning showing reduced data availability]

When I restart the node, my pool goes back to healthy. What can I do to make sure the cluster will actually remain available when losing a node? Can anyone point me to a way to diagnose why the data doesn't get distributed correctly (IMO there should be 3 copies, one on each node)?
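So far I have mostly been staring at the output of the commands below to figure out which PGs are affected; this is just my own checklist, and the PG id is a placeholder rather than real output from my cluster:

Code:
# overall health and the exact PGs behind the warning
ceph health detail
# PGs that are stuck undersized or inactive after the node shutdown
ceph pg dump_stuck undersized
ceph pg dump_stuck inactive
# where the replicas of a specific PG live (id taken from health detail, 2.1a is a placeholder)
ceph pg 2.1a query
# OSD hierarchy and per-host usage
ceph osd tree
ceph osd df tree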

The only irregular thing I can think of is that my OSDs are slightly imbalanced because I added two spare hard drives. However, when I set them to out one by one and wait for steady state, the problem remains. Furthermore, all data together is smaller than the smallest OSD.
[Attachment: 1684270114085.png – OSD overview]

For reference - my pools:
[Attachment: 1684270307632.png – pool list]

Update: and here is the state my pool is in with all OSDs up/in:
[Attachment: 1684296889512.png – pool status with all OSDs up/in]
 
Consider what you're doing.

You have three lopsided nodes; any PG that is placed on either of the two larger nodes in excess of the capacity of priddy is not HA by default in a 3/2 pool (and not only that, PG placement logic has other variables). What's worse, your performance will be utterly shit, since you have PGs landing on HDD OSDs, so it doesn't matter how fast your SSD OSDs are.

What is the use case for this? Either remove the HDDs from your pool, or add more HDDs and make an HDD-specific pool for data that is OK with very slow performance. In case it's not clear: HAVE THE SAME CAPACITY OF OSDs per class per node.
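To actually see how lopsided the nodes are per device class, and which CRUSH rule your pool is using, something along these lines should show it at a glance (pool name is a placeholder):

Code:
# per-host and per-OSD capacity/usage, including device class and weight
ceph osd df tree
# which CRUSH rule the pool uses ...
ceph osd pool get <poolname> crush_rule
# ... and what root / failure domain / device class that rule selects
ceph osd crush rule dump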
 
I think your Ceph cluster is working as intended. If you have the default replica = 3, then when only two nodes are running your cluster is in an unhealthy (degraded) state until the third node comes back. To the best of my knowledge, it still allows you to write data to the cluster until only one server is up; then it turns read-only to safeguard the data.
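If you want to double-check that behaviour, the pool's size/min_size values are what decide when writes stop; a quick way to look them up (pool name is a placeholder):

Code:
# number of replicas the pool keeps (default 3)
ceph osd pool get <poolname> size
# minimum replicas that must be available for I/O to continue (default 2)
ceph osd pool get <poolname> min_size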
 
> Consider what you're doing.
>
> You have three lopsided nodes; any PG that is placed on either of the two larger nodes in excess of the capacity of priddy is not HA by default in a 3/2 pool (and not only that, PG placement logic has other variables). What's worse, your performance will be utterly shit, since you have PGs landing on HDD OSDs, so it doesn't matter how fast your SSD OSDs are.
>
> What is the use case for this? Either remove the HDDs from your pool, or add more HDDs and make an HDD-specific pool for data that is OK with very slow performance. In case it's not clear: HAVE THE SAME CAPACITY OF OSDs per class per node.
Please consider two points of clarification from my OP:
a) The total of all data is smaller than the smallest OSD -> shouldn't it therefore still get distributed to all three nodes?
b) For troubleshooting this problem I have already set the HDDs to up/out -> doesn't that create the situation you recommend?

I added the HDDs to test exactly that performance drop, since I am deciding whether I should create two pools and manually curate the data into performance tiers, or let Ceph do that.
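In case it matters, this is roughly what I had in mind for the two-pool variant, i.e. pinning each pool to a single device class via CRUSH rules (rule and pool names are only examples, I have not applied any of this yet):

Code:
# replicated rules restricted to one device class each, with host as the failure domain
ceph osd crush rule create-replicated ssd-rule default host ssd
ceph osd crush rule create-replicated hdd-rule default host hdd
# point the pools at the class-specific rules (pool names are examples)
ceph osd pool set fast-pool crush_rule ssd-rule
ceph osd pool set slow-pool crush_rule hdd-rule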
 
> I think your Ceph cluster is working as intended. If you have the default replica = 3, then when only two nodes are running your cluster is in an unhealthy (degraded) state until the third node comes back. To the best of my knowledge, it still allows you to write data to the cluster until only one server is up; then it turns read-only to safeguard the data.
Mostly yes, but isn't that untrue for the 1 PG that the health warning lists as unavailable?
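For what it's worth, this is how I have been trying to look at that one PG (the id is a placeholder taken from ceph health detail):

Code:
# lists the exact PG(s) behind the warning and their acting sets
ceph health detail
# if the acting set of that PG has fewer than min_size (2) replicas,
# I/O to that PG blocks even though the rest of the pool stays writable
ceph pg 2.1a query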
 
