Ceph freezes when a node reboots on a Proxmox cluster

Oct 30, 2025
Hello everyone,


I’m currently facing a rather strange issue on my Proxmox cluster, which uses Ceph for storage.


My infrastructure consists of 8 nodes, each equipped with 7 NVMe drives of 7.68 TB.
Each node therefore hosts 7 OSDs (one per drive), for a total of 56 OSDs across the cluster.


Each node is connected to a 40 Gbps core network, and I’ve configured several dedicated bonds and bridges for the following purposes:


  • Proxmox cluster communication
  • Ceph communication
  • Node management
  • Live migration

For virtual machine networking, I use an SDN zone in VLAN mode with dedicated VMNets.




Issue


Whenever a node reboots — either for maintenance or due to a crash — the Ceph cluster sometimes completely freezes for several minutes.


After some investigation, it appears this happens when one OSD becomes slow: Ceph reports "slow ops", and the entire cluster seems to hang.
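For anyone wanting to reproduce the diagnosis, commands along these lines show which OSD is reporting slow ops (osd.42 below is a placeholder ID; the ceph daemon calls must run on the node hosting that OSD):

    # list the OSDs currently flagged with slow ops
    ceph health detail | grep -i slow

    # inspect the operations stuck on the suspect OSD (run on its host)
    ceph daemon osd.42 dump_ops_in_flight
    ceph daemon osd.42 dump_historic_ops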


It’s quite surprising that a single slow OSD (out of 56) can have such a severe impact on the whole production environment.
Once the affected OSD is restarted, performance gradually returns to normal, but the production impact remains significant.
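For completeness, restarting a single OSD from the shell of the node that hosts it can be done via systemd, e.g. (42 being a placeholder OSD ID):

    systemctl restart ceph-osd@42.service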


For context, I recently changed the mClock profile from “balanced” to “high_client_ops” in an attempt to reduce latency.
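For reference, a cluster-wide switch of the mClock profile looks like this (assuming mClock is the active scheduler, i.e. a recent Ceph release where it is the default):

    ceph config set osd osd_mclock_profile high_client_ops
    # verify what the OSDs picked up
    ceph config get osd osd_mclock_profile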




Question


Has anyone experienced a similar issue — specifically, VMs freezing when a Ceph node reboots?
If so, what solutions or best practices did you implement to prevent this from happening again?


Thank you in advance for your help — this issue is a real challenge in my production environment.


Have a great day,
Léo
 
During maintenance windows I set the noout, norebalance and norecover flags before shutting down an OSD or server. That stops Ceph from moving data around to the other OSDs.
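Roughly like this, and don't forget to unset the flags once the node is back and healthy:

    ceph osd set noout
    ceph osd set norebalance
    ceph osd set norecover
    # ... reboot / maintain the node ...
    ceph osd unset norecover
    ceph osd unset norebalance
    ceph osd unset noout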

Some Ceph talks have mentioned that a single dying HDD can impact the whole cluster even when SMART shows no sign of the coming failure, so you need to keep an eye on per-disk activity.

As for SSD/NVMe, sometimes the device simply struggles to respond fast enough. Could that be the reason? When I run a single (not multi-parallel) write test on an SSD, I see very low numbers.
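As an example, a single-job, queue-depth-1 sync write test with fio would look something like this (the target path is a placeholder; point it at a test file or a spare device, since writing to a raw disk is destructive):

    fio --name=single-sync-write --filename=/tmp/fio-testfile --size=4G \
        --rw=write --bs=4k --iodepth=1 --numjobs=1 \
        --direct=1 --sync=1 --runtime=60 --time_based
    # watch the latency and IOPS figures: a datacenter NVMe should sustain
    # thousands of sync 4k writes per second, consumer drives often cannot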
 
I’ve already performed write tests, and the results are quite good on my side.
As for the maintenance part: indeed, setting noout, norebalance, and norecover is one way to handle planned reboots, but it doesn't address unplanned outages in production. I can't afford to have my virtual machines crash when I lose one node out of eight.