Ceph freeze when a node reboots on Proxmox cluster

Oct 30, 2025
Hello everyone,


I’m currently facing a rather strange issue on my Proxmox cluster, which uses Ceph for storage.


My infrastructure consists of 8 nodes, each equipped with 7 NVMe drives of 7.68 TB.
Each node therefore hosts 7 OSDs (one per drive), for a total of 56 OSDs across the cluster.


Each node is connected to a 40 Gbps core network, and I’ve configured several dedicated bonds and bridges for the following purposes:


  • Proxmox cluster communication
  • Ceph communication
  • Node management
  • Live migration

For virtual machine networking, I use an SDN zone in VLAN mode with dedicated VNets.





When a node reboots, whether for maintenance or after a crash, the Ceph cluster sometimes freezes completely for several minutes.


After some investigation, it appears this happens when one OSD becomes slow: Ceph reports “slow OPS”, and the entire cluster seems to hang.


It’s quite surprising that a single slow OSD (out of 56) can have such a severe impact on the whole production environment.
Once the affected OSD is restarted, performance gradually returns to normal, but the production impact remains significant.
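
To make the "one slow OSD" observation easier to reproduce, here is a minimal sketch of how the slow OSD could be spotted from the JSON output of `ceph osd perf -f json`. The JSON layout (an `osd_perf_infos` list, possibly nested under `osdstats`, with `perf_stats.commit_latency_ms` per OSD) is my assumption of what the command returns and may differ between Ceph releases. The function name `slow_osds`, the 50 ms threshold, and the sample data are all hypothetical.

```python
import json


def slow_osds(perf_json: str, threshold_ms: float = 50.0) -> list[tuple[int, float]]:
    """Return (osd_id, commit_latency_ms) pairs above threshold_ms, slowest first."""
    data = json.loads(perf_json)
    # Assumption: newer releases nest the list under "osdstats",
    # older ones keep "osd_perf_infos" at the top level.
    infos = data.get("osdstats", data).get("osd_perf_infos", [])
    flagged = [
        (info["id"], info["perf_stats"]["commit_latency_ms"])
        for info in infos
        if info["perf_stats"]["commit_latency_ms"] > threshold_ms
    ]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


# Hypothetical sample data, not output from my cluster.
SAMPLE = json.dumps({"osdstats": {"osd_perf_infos": [
    {"id": 0, "perf_stats": {"commit_latency_ms": 1, "apply_latency_ms": 1}},
    {"id": 17, "perf_stats": {"commit_latency_ms": 840, "apply_latency_ms": 840}},
    {"id": 3, "perf_stats": {"commit_latency_ms": 120, "apply_latency_ms": 120}},
]}})

if __name__ == "__main__":
    for osd_id, latency in slow_osds(SAMPLE):
        print(f"osd.{osd_id}: commit latency {latency} ms")
```

On a real node, the string passed to `slow_osds` would come from running `ceph osd perf -f json` (for example via `subprocess.run`) instead of `SAMPLE`.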


For context, I recently changed the mClock profile from “balanced” to “high_client_ops” in an attempt to reduce latency.
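
In case it helps to rule the profile change in or out, this is roughly how the profile could be checked and switched back. It is a sketch, assuming the profile is controlled by the `osd_mclock_profile` option and set through `ceph config set osd ...`. The list of profile names reflects my understanding of the built-in profiles, and the helper names are hypothetical.

```python
import shlex

# Assumption: these are the built-in mClock profile names.
MCLOCK_PROFILES = ("balanced", "high_client_ops", "high_recovery_ops", "custom")


def mclock_show_cmd() -> list[str]:
    """Command that prints the profile currently configured for all OSDs."""
    return ["ceph", "config", "get", "osd", "osd_mclock_profile"]


def mclock_profile_cmd(profile: str) -> list[str]:
    """Command that sets the mClock profile for all OSDs."""
    if profile not in MCLOCK_PROFILES:
        raise ValueError(f"unknown mClock profile: {profile!r}")
    return ["ceph", "config", "set", "osd", "osd_mclock_profile", profile]


if __name__ == "__main__":
    # Only prints the commands; nothing is executed against the cluster.
    print(shlex.join(mclock_show_cmd()))
    print(shlex.join(mclock_profile_cmd("balanced")))
```

The script only prints the commands so they can be reviewed before being run by hand on a node.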





Has anyone experienced a similar issue — specifically, VMs freezing when a Ceph node reboots?
If so, what solutions or best practices did you implement to prevent this from happening again?


Thank you in advance for your help — this issue is a real challenge in my production environment.


Have a great day,
Léo
 
If you share your Ceph configuration (monitors, managers, and pools), someone may be able to help.
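
One possible way to collect that information in a single paste is sketched below. It assumes the `ceph` CLI is available on the node and that these standard status commands cover what is needed. The helper names and the report layout are hypothetical.

```python
import subprocess
from typing import Callable

# Read-only commands covering monitors, managers, pools, OSD layout and config.
DIAGNOSTIC_COMMANDS = [
    ["ceph", "status"],
    ["ceph", "health", "detail"],
    ["ceph", "mon", "dump"],
    ["ceph", "mgr", "stat"],
    ["ceph", "osd", "pool", "ls", "detail"],
    ["ceph", "osd", "tree"],
    ["ceph", "config", "dump"],
]


def run_command(cmd: list[str]) -> str:
    """Run one command and return its stdout, or a short failure marker."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    except OSError as exc:  # e.g. the ceph binary is not installed on this host
        return f"[failed: {exc}]"
    if result.returncode != 0:
        return f"[failed: {result.stderr.strip()}]"
    return result.stdout


def build_report(runner: Callable[[list[str]], str] = run_command) -> str:
    """Concatenate the output of every diagnostic command under its own header."""
    sections = []
    for cmd in DIAGNOSTIC_COMMANDS:
        sections.append(f"### {' '.join(cmd)}\n{runner(cmd).rstrip()}\n")
    return "\n".join(sections)


if __name__ == "__main__":
    print(build_report())
```

All of the commands only read cluster state, so the report should be safe to generate on a production node. Review the output for anything sensitive before posting it.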