Hello everyone,
I’m seeking advice on a recurring issue where our Ceph cluster freezes during node reboots.
Environment:
- Proxmox: 9.1.1 Enterprise
- Ceph: 19.2.3 (Squid)
- Hardware: 8 Nodes, 56 OSDs total (7.68TB NVMe), 5 monitors
- Network: 40 Gbps, MTU 9000 (Jumbo Frames) for Ceph traffic. Dedicated bonds for Cluster/Ceph/Management.
- Primary Pools: size 4 / min_size 2 (e.g., Ceph_01)
- Autoscale: Set to warn mode.
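For reference, the pool replication settings can be confirmed like this (pool name Ceph_01 as above; the other pools are configured the same way):

  ceph osd pool get Ceph_01 size                 # expect: size: 4
  ceph osd pool get Ceph_01 min_size             # expect: min_size: 2
  ceph osd pool get Ceph_01 pg_autoscale_mode    # expect: pg_autoscale_mode: warn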
The Symptom: When any node reboots, we experience a cluster-wide IO freeze lasting several minutes.
- Ceph reports "Slow OPS" and "Blocked Requests."
- Manually restarting the "slow" OSD usually clears the hang (see the commands after this list), but by then the impact on production VMs has already been done.
- I have also noticed that some OSDs crash during the recovery phase.
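This is roughly the workflow I use to track down and bounce the offending OSD during a hang (osd.12 is just a placeholder id):

  # see which OSDs are flagged for slow ops / blocked requests
  ceph health detail | grep -i 'slow\|blocked'

  # inspect what that OSD is stuck on (run on the node hosting it, via the admin socket)
  ceph daemon osd.12 dump_ops_in_flight

  # restart only that OSD (on its host node)
  systemctl restart ceph-osd@12

  # check whether the recovery-phase crashes left crash reports behind
  ceph crash ls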
Observations & Questions:
- With a size 4 / min_size 2 setup, why would a single node (7 OSDs) going offline cause a total hang?
- Are there Squid-specific tunables for OSD peering or heartbeat timeouts that are recommended for high-performance NVMe/40 Gbps environments, so that a single slow OSD cannot block its primary PGs? (Illustrative example below.)
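To clarify the kind of tunables I mean, something along these lines; the option names are the stock Ceph heartbeat/complaint knobs, and the values shown are placeholders, not what we are currently running:

  # current values
  ceph config get osd osd_heartbeat_grace
  ceph config get osd osd_heartbeat_interval
  ceph config get osd osd_op_complaint_time

  # example of the kind of tightening I am considering (placeholder values)
  ceph config set osd osd_heartbeat_grace 10
  ceph config set osd osd_heartbeat_interval 3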
Best regards,