Hello everyone,
I’m currently facing a rather strange issue on my Proxmox cluster, which uses Ceph for storage.
My infrastructure consists of 8 nodes, each equipped with 7 NVMe drives of 7.68 TB.
Each node therefore hosts 7 OSDs (one per drive), for a total of 56 OSDs across the cluster.
Each node is connected to a 40 Gbps core network, and I’ve configured several dedicated bonds and bridges for the following purposes:
- Proxmox cluster communication
- Ceph communication
- Node management
- Live migration
For virtual machine networking, I use an SDN zone in VLAN mode with dedicated VMNets.
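For reference, the Ceph bond looks roughly like this in /etc/network/interfaces (interface names, addresses, and MTU below are placeholders, not my exact values):

```
auto bond1
iface bond1 inet static
    address 10.10.10.11/24
    bond-slaves enp65s0f0 enp65s0f1
    bond-mode 802.3ad
    bond-xmit-hash-policy layer3+4
    mtu 9000
# Dedicated Ceph public/cluster network
```

The other bonds (cluster communication, management, migration) follow the same pattern on separate subnets.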
Issue
Whenever a node reboots — either for maintenance or due to a crash — the Ceph cluster sometimes completely freezes for several minutes.
After some investigation, it appears this happens when one OSD becomes slow: Ceph health reports “slow ops”, and the entire cluster seems to hang.
It’s quite surprising that a single slow OSD (out of 56) can have such a severe impact on the whole production environment.
Once the affected OSD is restarted, performance gradually returns to normal, but the production impact remains significant.
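In case it helps with diagnosis, these are the kinds of commands I have been using to pin down the slow OSD (osd.12 is a placeholder ID, not a specific OSD of mine):

```
# Overall health, including which OSDs currently report slow ops
ceph health detail

# Operations currently in flight on a suspect OSD (works from any node)
ceph tell osd.12 dump_ops_in_flight

# Recent slow operations recorded by that OSD daemon
# (must be run on the node hosting the OSD)
ceph daemon osd.12 dump_historic_slow_ops
```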
For context, I recently changed the mClock profile from “balanced” to “high_client_ops” in an attempt to reduce latency.
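Concretely, the profile change was just the following (applied to all OSDs; it can also be set per OSD with osd.N as the target):

```
# Switch the mClock scheduler profile for all OSDs
ceph config set osd osd_mclock_profile high_client_ops

# Verify the active value
ceph config get osd osd_mclock_profile
```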
Question
Has anyone experienced a similar issue — specifically, VMs freezing when a Ceph node reboots?
If so, what solutions or best practices did you implement to prevent this from happening again?
Thank you in advance for your help — this issue is a real challenge in my production environment.
Have a great day,
Léo