3-node Ceph with size/min 3/2 - random unhealthy/blocked I/O for a short time if 1 node is down

Apr 17, 2023
I have the following setup (the pool settings can be double-checked with the commands sketched after this list):
  • 3 identical nodes
  • 2x OSDs with 1 TB per OSD on each node
  • 10 GBit, Ceph only (both networks)
    • mesh
    • OVS bridge
    • not much load
  • MON, MGR and MDS on each node
  • size/min 3/2
  • used space: 1 TB
  • all servers use the same chrony config
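
For completeness, the pool and quorum state can be checked from any node roughly like this (just a sketch; "vm-pool" is only an example name, the actual pools were created through the Proxmox GUI):

    # list pools and show the replication settings (pool name is an example)
    ceph osd pool ls
    ceph osd pool get vm-pool size
    ceph osd pool get vm-pool min_size

    # monitor quorum and overall cluster health
    ceph quorum_status --format json-pretty
    ceph -s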
The following unexpected behavior occurs, which I can't explain at the moment - maybe I have completely misunderstood how it works:

SOMETIMES, if 1 node is not available, everything continues to work for about 30-60 seconds. After that, the status switches to 0/1 healthy (for volumes, for example) and to "recovering" for about 1-2 minutes, and no I/O access is possible. VMs do not continue to run, they are frozen, and shared storage via CephFS is also not available. After roughly 1-2 minutes, even if the node is still not available, everything works again.
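
Next time it happens I will try to capture the state while the I/O is blocked, roughly like this (just a sketch of the commands I would run):

    # overall state and which warnings are reported while the node is down
    ceph -s
    ceph health detail

    # is the monitor quorum stable with only 2 of 3 MONs?
    ceph quorum_status --format json-pretty

    # which OSDs are marked down and whether any PGs are stuck inactive/peering
    ceph osd tree
    ceph pg dump_stuck inactive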

This behavior does not always occur, but it drives me crazy and is, from my point of view, not the expected behavior.

Can someone help me out?
best & thanks
 