3-node Ceph with size/min 3/2 - random unhealthy/blocked I/O for a short time if 1 node is down

Apr 17, 2023
I have the following setup (the pool settings can be double-checked with the commands sketched after this list):
  • 3 identical nodes
  • 2x OSDs with 1 TB per OSD on each node
  • 10 GBit, Ceph only (both networks)
    • mesh
    • OVS bridge
    • not much load
  • MON, MGR and MDS on each node
  • size/min 3/2
  • used space: 1 TB
  • all servers use the same chrony config
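
For completeness, the pool and quorum state can be checked from any node roughly like this (just a sketch; "vm-pool" is only an example name, the actual pools were created through the Proxmox GUI):

    # list pools and show the replication settings (pool name is an example)
    ceph osd pool ls
    ceph osd pool get vm-pool size
    ceph osd pool get vm-pool min_size

    # monitor quorum and overall cluster health
    ceph quorum_status --format json-pretty
    ceph -s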
The following unexpected behavior occurs, which I can't explain at the moment - maybe I have completely misunderstood how it works:

SOMETIMES, if 1 node is not available, everything continues to work for about 30-60 seconds. After that, the status switches to 0/1 healthy (for volumes, for example) and to "recovering" for about 1-2 minutes, and no I/O access is possible. VMs do not continue to run, they are frozen, and shared storage via CephFS is also not available. After roughly 1-2 minutes, even if the node is still not available, everything works again.
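
Next time it happens I will try to capture the state while the I/O is blocked, roughly like this (just a sketch of the commands I would run):

    # overall state and which warnings are reported while the node is down
    ceph -s
    ceph health detail

    # is the monitor quorum stable with only 2 of 3 MONs?
    ceph quorum_status --format json-pretty

    # which OSDs are marked down and whether any PGs are stuck inactive/peering
    ceph osd tree
    ceph pg dump_stuck inactive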

This behavior does not always occur, but it drives me crazy and is, from my point of view, not the expected behavior.

Can someone help me out?
best & thanks
 