Ceph Squid (19.2.3) Cluster Hangs on Node Reboot - 56 NVMe OSDs - PVE 9.1.1

kidoly

Hello everyone,

I’m seeking advice on a recurring issue where our Ceph cluster freezes during node reboots.

Environment:
  • Proxmox: 9.1.1 Enterprise
  • Ceph: 19.2.3 (Squid)
  • Hardware: 8 Nodes, 56 OSDs total (7.68TB NVMe), 5 monitors
  • Network: 40 Gbps, MTU 9000 (Jumbo Frames) for Ceph traffic. Dedicated bonds for Cluster/Ceph/Management.
Pool Configuration:
  • Primary Pools: size 4 / min_size 2 (e.g., Ceph_01)
  • Autoscale: Set to warn mode.

The Symptom: When any node reboots, we experience a cluster-wide IO freeze lasting several minutes.
  • Ceph reports "Slow OPS" and "Blocked Requests" (how I usually inspect them is sketched below).
  • Restarting the "slow" OSD manually usually resolves the hang, but by then the damage to production VMs is already done.
  • I have also noticed that some OSDs crash during the recovery phase.
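When the slow ops show up, this is roughly what I run; osd.12 is only a placeholder ID, not one of our specific OSDs:

Bash:
# Which OSDs are currently flagged with slow/blocked requests
ceph health detail

# On the node that hosts a suspect OSD: dump its in-flight and blocked ops
# (osd.12 is a placeholder ID)
ceph daemon osd.12 dump_ops_in_flight
ceph daemon osd.12 dump_blocked_ops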

Observations & Questions:
  1. With a size 4 / min_size 2 setup, why would a single node (7 OSDs) going offline cause a total hang?
  2. Are there specific Squid-version tunables for OSD peering or heartbeat timeouts that are recommended for high-performance NVMe/40 Gbps environments, so that a single slow OSD cannot block its primary PGs? (Examples of the kind of settings I mean are sketched below.)
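For context, these are the kinds of settings I have been reading about. The commands below only show the syntax and current values; the `set` line uses the documented default (20 s) purely as an example, not a tuning recommendation:

Bash:
# Current values (defaults unless overridden)
ceph config get osd osd_heartbeat_grace
ceph config get osd osd_heartbeat_interval
ceph config get osd osd_op_complaint_time
ceph config get mon mon_osd_down_out_interval

# Syntax for changing such a tunable cluster-wide
# (20 is the documented default for osd_heartbeat_grace, shown only as an example)
ceph config set osd osd_heartbeat_grace 20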

Best regards,
 
Please post the output of
  • ceph status
  • ceph mon dump
  • ceph config dump
  • ceph osd df tree
  • ceph osd crush rule dump
  • ceph osd pool ls detail
  • from each node: ip addr show
Here are the outputs requested for the 8-node Ceph Squid cluster.
I'm quite limited on the number of characters, so I put the rest in the attached files.
I anonymized the data.
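If anyone wants to reproduce the collection, a loop along these lines dumps each command into its own file (paths and file names are arbitrary):

Bash:
# Dump each requested command into its own file for attaching
for cmd in "ceph status" "ceph mon dump" "ceph config dump" \
           "ceph osd df tree" "ceph osd crush rule dump" "ceph osd pool ls detail"; do
    $cmd > "/tmp/$(echo "$cmd" | tr ' ' '_').txt"
done
ip addr show > "/tmp/ip_addr_show_$(hostname -s).txt"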

Bash:
root@PVE01:~# ceph status
  cluster:
    id:     cluster_id
    health: HEALTH_OK
 
  services:
    mon: 5 daemons, quorum PVE01,PVE03,PVE05,PVE07,wit01 (age 7d)
    mgr: PVE06(active, since 2w), standbys: PVE01
    osd: 56 osds: 56 up (since 7d), 56 in (since 2w)
 
  data:
    pools:   6 pools, 3969 pgs
    objects: 9.91M objects, 36 TiB
    usage:   145 TiB used, 247 TiB / 391 TiB avail
    pgs:     3969 active+clean
 
  io:
    client:   33 MiB/s rd, 58 MiB/s wr, 1.34k op/s rd, 1.80k op/s wr
 
root@PVE01:~# ceph mon dump
epoch 26
fsid cluster_id
last_changed 2025-12-15T15:34:17.287101+0100
created 2025-05-06T16:52:06.505622+0200
min_mon_release 19 (squid)
election_strategy: 1
0: [v2:10.170.252.21:3300/0,v1:10.170.252.21:6789/0] mon.PVE01
1: [v2:10.170.252.23:3300/0,v1:10.170.252.23:6789/0] mon.PVE03
2: [v2:10.170.252.25:3300/0,v1:10.170.252.25:6789/0] mon.PVE05
3: [v2:10.170.252.27:3300/0,v1:10.170.252.27:6789/0] mon.PVE07
4: [v2:10.160.252.20:3300/0,v1:10.160.252.20:6789/0] mon.wit01
dumped monmap epoch 26

root@PVE01:~# ceph osd crush rule dump
[
    {
        "rule_id": 0,
        "rule_name": "replicated_rule",
        "type": 1,
        "steps": [
            {
                "op": "take",
                "item": -1,
                "item_name": "default"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    },
    {
        "rule_id": 10,
        "rule_name": "replicated-4x-datacenter",
        "type": 1,
        "steps": [
            {
                "op": "take",
                "item": -1,
                "item_name": "default"
            },
            {
                "op": "choose_firstn",
                "num": 0,
                "type": "datacenter"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 2,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    },
    {
        "rule_id": 11,
        "rule_name": "replicated-2-per-dc",
        "type": 1,
        "steps": [
            {
                "op": "take",
                "item": -1,
                "item_name": "default"
            },
            {
                "op": "choose_firstn",
                "num": 2,
                "type": "datacenter"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 2,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    },
    {
        "rule_id": 12,
        "rule_name": "site1-primary",
        "type": 1,
        "steps": [
            {
                "op": "take",
                "item": -19,
                "item_name": "site1"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 2,
                "type": "host"
            },
            {
                "op": "emit"
            },
            {
                "op": "take",
                "item": -20,
                "item_name": "site2"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 2,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    },
    {
        "rule_id": 14,
        "rule_name": "only-site1",
        "type": 1,
        "steps": [
            {
                "op": "take",
                "item": -19,
                "item_name": "site1"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 2,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    }
]
root@PVE01:~# ceph osd pool ls detail
pool 1 '.mgr' replicated size 2 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 14 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr read_balance_score 50.00
pool 26 'DS_Ceph_RBD_01' replicated size 4 min_size 2 crush_rule 11 object_hash rjenkins pg_num 512 pgp_num 512 autoscale_mode warn last_change 67969 lfor 0/6342/6869 flags hashpspool,selfmanaged_snaps stripe_width 0 pg_num_min 512 application rbd read_balance_score 1.75
pool 27 'DS_Ceph_RBD_02' replicated size 4 min_size 2 crush_rule 10 object_hash rjenkins pg_num 2048 pgp_num 2048 autoscale_mode warn last_change 21640 lfor 0/15716/21633 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd read_balance_score 1.48
pool 28 'rbd-bdd' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 autoscale_mode warn last_change 15734 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd read_balance_score 1.97
pool 29 'DS_RBD_site1_Primary' replicated size 4 min_size 2 crush_rule 12 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode warn last_change 30574 lfor 0/0/24790 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd read_balance_score 3.50
pool 30 'DS_RBD_site1_ONLY' replicated size 2 min_size 1 crush_rule 14 object_hash rjenkins pg_num 1024 pgp_num 1024 autoscale_mode warn last_change 64020 lfor 0/0/29431 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd read_balance_score 1.12
 

Attachments

With a size 4 / min_size 2 setup, why would a single node (7 OSDs) going offline cause a total hang?
Unknown, especially since you posted output from the cluster in a healthy state. size 4 is generally a bad idea (an even replica count; the last copy offers no utility), but it should not cause issues like this, especially with only one node out.

My suggestion: remove 2 monitors (you don't need them and they generate unnecessary traffic), and check all your nodes' NICs for tx/rx errors and retries. Assuming all is normal, watch the log files of all 3 surviving monitors and shut down a node to trigger the failure. If you need help decoding what's happening, post the resulting monitor logs here.
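Roughly what I mean, as a sketch (the NIC name is a placeholder for each bond member, and PVE01 stands in for whichever MON you are watching):

Bash:
# Per-NIC error/drop/retransmit counters -- repeat for every bond member on every node
NIC=eth0   # placeholder: substitute your actual interface name
ethtool -S "$NIC" | grep -iE 'err|drop|retr'

# Follow a monitor's log while you shut the node down (do this on each surviving MON)
journalctl -u ceph-mon@PVE01.service -f
# or the classic log file:
tail -f /var/log/ceph/ceph-mon.PVE01.log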
 
Hello,
Appreciate your help ;]
It's a production environment, so I can't test it today. I'll try to negotiate a test window for next weekend.
 
Provide the content of the /etc/network/interfaces file.
Thanks for your help, here it is:

Code:
auto lo
iface lo inet loopback

#Bond LACP
auto bond0
iface bond0 inet manual
        bond-slaves enp152s0 enp152s0d1
        bond-miimon 100
        bond-mode 802.3ad
        bond-xmit-hash-policy layer3+4
        bond-lacp-rate fast
        mtu 9000

#VLAN3000 - Management
auto bond0.3000
iface bond0.3000 inet manual

auto vmbr0v3000
iface vmbr0v3000 inet static
        address 10.150.252.21/24
        gateway 10.150.252.1
        bridge-ports bond0.3000
        bridge-stp off
        bridge-fd 0
        dns-nameservers 1.1.1.1

#VLAN3001 - Public-Ceph
auto bond0.3001
iface bond0.3001 inet manual

auto vmbr0v3001
iface vmbr0v3001 inet static
        address 10.170.252.21/24
        bridge-ports bond0.3001
        bridge-stp off
        bridge-fd 0
        mtu 9000

#VLAN3002 - Communication-Cluster
auto bond0.3002
iface bond0.3002 inet manual

auto vmbr0v3002
iface vmbr0v3002 inet static
        address 10.171.252.21/24
        bridge-ports bond0.3002
        bridge-stp off
        bridge-fd 0

#VLAN3003 - Live-Migration
auto bond0.3003
iface bond0.3003 inet manual

auto vmbr0v3003
iface vmbr0v3003 inet static
        address 10.172.252.21/24
        bridge-ports bond0.3003
        bridge-stp off
        bridge-fd 0
        mtu 9000

#VLAN3011 - Ceph-Cluster
auto bond0.3011
iface bond0.3011 inet manual

auto vmbr0v3011
iface vmbr0v3011 inet static
        address 10.11.252.21/24
        bridge-ports bond0.3011
        bridge-stp off
        bridge-fd 0
        mtu 9000

#Bridge zone SDN type VLAN "z01"
auto vmbr0
iface vmbr0 inet static
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0

source /etc/network/interfaces.d/*
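A quick way to sanity-check the jumbo frames and the LACP bond from this node (10.170.252.23, PVE03's Ceph-public address, is just an example target):

Bash:
# Verify that MTU 9000 really passes end-to-end on the Ceph public VLAN
# (8972 = 9000 minus 28 bytes of IP/ICMP headers; -M do forbids fragmentation)
ping -M do -s 8972 -c 3 10.170.252.23

# LACP / bond member state
cat /proc/net/bonding/bond0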
 
remove 2 monitors (you dont need them and they generate unnecessary traffic
The current recommendation from the Ceph project is to run 5 MONs.
With only three MONs you are in a high-risk situation after losing just one MON: lose another and your cluster stops.
With five MONs you can lose two and the cluster will still work.
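You can check the current quorum at any time with:

Bash:
# One-line summary: monmap epoch, MON names, and who is currently in quorum
ceph mon stat

# Full detail, including any MONs known to the monmap but outside quorum
ceph quorum_status -f json-pretty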
 
Is a specific pool affected by shutting down one host?

Your CRUSH rules are a wild mix.
All pools are impacted when a single host is shut down.

Below are logs from an OSD that crashed during the recovery process after a PVE reboot. In total, four OSDs crashed: two on the node that was rebooting and two on a different node.
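If the attachments are not enough, the same crash reports can be queried from the cluster's crash module (the crash ID below is a placeholder):

Bash:
# List all crashes recorded by the mgr crash module
ceph crash ls

# Full metadata and backtrace for a single crash (ID is a placeholder)
ceph crash info <crash_id>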
 

Attachments

The current recommendation from the Ceph project is to run 5 MONs.
Cite your sources, please. Five monitors are "suggested" for clusters with a high number of OSD nodes.

With only three MONs you run into a high risk situation after losing just one MON. Losing another and your cluster stops.
With a typical CRUSH rule of 3:2, this only makes sense IF you have dedicated monitor nodes (i.e., no OSDs on them) AND environmental issues that routinely take your nodes down. Otherwise the risk is minuscule. If your monitors sit on OSD servers, losing 2 nodes will take your cluster down anyway (at least partially). Each monitor generates and consumes traffic, and the synchronous nature of monitor traffic means I/O latency climbs with every additional monitor; to stay performant, you want no more than the absolute minimum number that still meets your availability criteria, which, as pointed out above, is aimed at single-node faults anyway.

Oh, and the cluster will continue to function with a single monitor; they're not placement groups.
 
No: as soon as there is no quorum anymore between the MONs, i.e. the majority of MONs no longer see each other, the cluster will stop working.
That isn't actually so; you can think of the monitor quorum rule as 3:1. Fun fact: a cluster with 2 monitors is more prone to PG errors (monitor disagreement) than one with a single monitor. Feel free to try it yourself: shut down all but one of your monitors and see what happens. This has happened to me on numerous occasions (not by choice...)

Noted on the cephadm toolset. I understand their POV, but I disagree with the "lightweight" reasoning; while the daemon itself doesn't consume many resources, the monitors do generate quite a bit of traffic without any notable benefit.