Ceph freezes when a node reboots on a Proxmox cluster

LeoDAVID · Oct 30, 2025

Hello everyone,

I’m currently facing a rather strange issue on my Proxmox cluster, which uses Ceph for storage.

My infrastructure consists of 8 nodes, each equipped with 7 NVMe drives of 7.68 TB.
Each node therefore hosts 7 OSDs (one per drive), for a total of 56 OSDs across the cluster.

Each node is connected to a 40 Gbps core network, and I’ve configured several dedicated bonds and bridges for the following purposes:

Proxmox cluster communication
Ceph communication
Node management
Live migration

For virtual machine networking, I use an SDN zone in VLAN mode with dedicated VMNets.

When a node reboots, whether for maintenance or after a crash, the Ceph cluster sometimes freezes completely for several minutes.

After some investigation, it appears this happens when a single OSD becomes slow: Ceph reports “slow ops”, and the entire cluster seems to hang.

It’s quite surprising that a single slow OSD (out of 56) can have such a severe impact on the whole production environment.
Once the affected OSD is restarted, performance gradually returns to normal, but the production impact remains significant.
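
A rough way to see why one slow OSD out of 56 can stall most VMs: an RBD image is split into many objects spread over many placement groups, so a busy VM will sooner or later send a request to a PG that includes the slow OSD. The sketch below is only a simplified model. It assumes a replicated pool with size 3 and uniform placement, neither of which is stated in this thread, and the function name is purely illustrative.

```python
def chance_of_hitting_osd(num_osds: int, replicas: int, requests: int) -> float:
    """Probability that at least one of `requests` independent I/Os lands on
    a PG whose replica set includes one specific OSD.

    Simplified model (assumption): every PG picks `replicas` distinct OSDs
    uniformly at random, so a given OSD belongs to a PG with probability
    replicas / num_osds.
    """
    if not 0 < replicas <= num_osds:
        raise ValueError("replicas must be between 1 and num_osds")
    p_single = replicas / num_osds
    return 1.0 - (1.0 - p_single) ** requests


# 56 OSDs comes from the post; replica size 3 is an assumption.
for n in (1, 10, 100):
    print(n, round(chance_of_hitting_osd(56, 3, n), 3))
```

Under this model a single request rarely touches the slow OSD, but a VM that issues a hundred requests almost certainly does, which would be consistent with many guests stalling at once.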

For context, I recently changed the mClock profile from “balanced” to “high_client_ops” in an attempt to reduce latency.
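
For reference, the profile can be inspected and switched back with `ceph config get` and `ceph config set` on the `osd_mclock_profile` option. The following is a minimal sketch that only builds those command lines and runs them when the `ceph` CLI is present; the helper names are hypothetical, and whether reverting to “balanced” helps with the freezes is untested here.

```python
import shutil
import subprocess
from typing import Optional

# Built-in mClock profile names (a "custom" profile also exists but is not handled here).
MCLOCK_PROFILES = ("balanced", "high_client_ops", "high_recovery_ops")


def get_profile_cmd() -> list[str]:
    """Command line that prints the profile currently configured for OSDs."""
    return ["ceph", "config", "get", "osd", "osd_mclock_profile"]


def set_profile_cmd(profile: str) -> list[str]:
    """Command line that sets the profile for all OSDs."""
    if profile not in MCLOCK_PROFILES:
        raise ValueError(f"unknown mClock profile: {profile!r}")
    return ["ceph", "config", "set", "osd", "osd_mclock_profile", profile]


def run(cmd: list[str]) -> Optional[str]:
    """Run the command if the ceph CLI is installed; otherwise return None."""
    if shutil.which("ceph") is None:
        return None
    done = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return done.stdout.strip()
```

On a cluster node, `run(get_profile_cmd())` would show the active setting and `run(set_profile_cmd("balanced"))` would revert the change described above.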

Has anyone experienced a similar issue — specifically, VMs freezing when a Ceph node reboots?
If so, what solutions or best practices did you implement to prevent this from happening again?

Thank you in advance for your help — this issue is a real challenge in my production environment.

Have a great day,
Léo

micneu · Oct 30, 2025

You're in the German-language section here; please write in German.

Falk R. · Oct 31, 2025

If you share your Ceph configuration (monitors, managers, and pools), someone may be able to help.
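
One possible way to gather that information in a single paste is sketched below. The command list is a suggestion rather than something requested in this thread, and the helper names are hypothetical; the script only reads cluster state and changes nothing.

```python
import subprocess
from typing import Callable

# Read-only commands covering monitors, managers, pools, OSD layout and config.
DIAGNOSTIC_COMMANDS = [
    ["ceph", "status"],
    ["ceph", "mon", "dump"],
    ["ceph", "mgr", "stat"],
    ["ceph", "osd", "pool", "ls", "detail"],
    ["ceph", "osd", "tree"],
    ["ceph", "config", "dump"],
]


def _run(cmd: list[str]) -> str:
    """Run one command and return its stdout, or a short failure note."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True)
    except FileNotFoundError:
        return "[ceph CLI not found]"
    if result.returncode != 0:
        return f"[failed: {result.stderr.strip()}]"
    return result.stdout


def collect_report(runner: Callable[[list[str]], str] = _run) -> str:
    """Concatenate each command line and its output into one text report."""
    sections = []
    for cmd in DIAGNOSTIC_COMMANDS:
        sections.append(f"$ {' '.join(cmd)}\n{runner(cmd).rstrip()}")
    return "\n\n".join(sections)
```

Calling `print(collect_report())` on one of the nodes produces text that can be pasted into a reply. Review it for anything sensitive before posting.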
