Ceph Squid (19.2.3) Cluster Hangs on Node Reboot - 56 NVMe OSDs - PVE 9.1.1

kidoly

New Member
Jun 15, 2025
Hello everyone,

I’m seeking advice on a recurring issue where our Ceph cluster freezes during node reboots.

Environment:
  • Proxmox: 9.1.1 Enterprise
  • Ceph: 19.2.3 (Squid)
  • Hardware: 8 Nodes, 56 OSDs total (7.68TB NVMe), 5 monitors
  • Network: 40 Gbps, MTU 9000 (Jumbo Frames) for Ceph traffic. Dedicated bonds for Cluster/Ceph/Management.
Pool Configuration:
  • Primary Pools: size 4 / min_size 2 (e.g., Ceph_01)
  • Autoscale: Set to warn mode.

The Symptom: When any node reboots, we experience a cluster-wide IO freeze lasting several minutes.
  • Ceph reports "Slow OPS" and "Blocked Requests."
  • Restarting the "slow" OSD manually usually resolves the hang, but by then the impact on production VMs is already done (see the sketch after this list for how I identify the culprit OSD).
  • I have also noticed that some OSDs crash during the recovery phase.
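For reference, this is roughly how I pin down and restart the offending OSD during an incident. osd.12 is a placeholder ID; in practice I take whichever OSD ceph health detail names.

Bash:
# Health detail names the OSDs with slow/blocked ops
ceph health detail

# On the node hosting that OSD, look at what it is stuck on
# (osd.12 is a placeholder ID)
ceph daemon osd.12 dump_ops_in_flight
ceph daemon osd.12 dump_blocked_ops

# Restart only that OSD via its systemd unit
systemctl restart ceph-osd@12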

Observations & Questions:
  1. With a size 4 / min_size 2 setup, why would a single node (7 OSDs) going offline cause a total hang?
  2. Are there specific Squid-version tunables for OSD peering or heartbeat timeouts that are recommended for high-performance NVMe/40Gbps environments to prevent a single slow OSD from blocking the primary PGs?
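For context, these are the knobs I have been looking at so far; the option names are the standard OSD heartbeat/complaint settings, and I have not changed anything yet. The value in the commented-out line is purely illustrative, not something I have tested.

Bash:
# Current values (cluster defaults unless overridden)
ceph config get osd osd_heartbeat_grace
ceph config get osd osd_heartbeat_interval
ceph config get osd osd_op_complaint_time

# Example of tightening the grace period so a failed OSD is marked down faster
# (value is illustrative only -- not applied)
# ceph config set osd osd_heartbeat_grace 10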

Best regards,
 
Please post the output of
  • ceph status
  • ceph mon dump
  • ceph config dump
  • ceph osd df tree
  • ceph osd crush rule dump
  • ceph osd pool ls detail
  • from each node: ip addr show
Here are the outputs requested for the 8-node Ceph Squid cluster.
I'm quite limited on the number of characters, so I put the rest in the attached files.
I anonymized the data.

Bash:
root@PVE01:~# ceph status
  cluster:
    id:     cluster_id
    health: HEALTH_OK
 
  services:
    mon: 5 daemons, quorum PVE01,PVE03,PVE05,PVE07,wit01 (age 7d)
    mgr: PVE06(active, since 2w), standbys: PVE01
    osd: 56 osds: 56 up (since 7d), 56 in (since 2w)
 
  data:
    pools:   6 pools, 3969 pgs
    objects: 9.91M objects, 36 TiB
    usage:   145 TiB used, 247 TiB / 391 TiB avail
    pgs:     3969 active+clean
 
  io:
    client:   33 MiB/s rd, 58 MiB/s wr, 1.34k op/s rd, 1.80k op/s wr
 
root@PVE01:~# ceph mon dump
epoch 26
fsid cluster_id
last_changed 2025-12-15T15:34:17.287101+0100
created 2025-05-06T16:52:06.505622+0200
min_mon_release 19 (squid)
election_strategy: 1
0: [v2:10.170.252.21:3300/0,v1:10.170.252.21:6789/0] mon.PVE01
1: [v2:10.170.252.23:3300/0,v1:10.170.252.23:6789/0] mon.PVE03
2: [v2:10.170.252.25:3300/0,v1:10.170.252.25:6789/0] mon.PVE05
3: [v2:10.170.252.27:3300/0,v1:10.170.252.27:6789/0] mon.PVE07
4: [v2:10.160.252.20:3300/0,v1:10.160.252.20:6789/0] mon.wit01
dumped monmap epoch 26

root@PVE01:~# ceph osd crush rule dump
[
    {
        "rule_id": 0,
        "rule_name": "replicated_rule",
        "type": 1,
        "steps": [
            {
                "op": "take",
                "item": -1,
                "item_name": "default"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    },
    {
        "rule_id": 10,
        "rule_name": "replicated-4x-datacenter",
        "type": 1,
        "steps": [
            {
                "op": "take",
                "item": -1,
                "item_name": "default"
            },
            {
                "op": "choose_firstn",
                "num": 0,
                "type": "datacenter"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 2,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    },
    {
        "rule_id": 11,
        "rule_name": "replicated-2-per-dc",
        "type": 1,
        "steps": [
            {
                "op": "take",
                "item": -1,
                "item_name": "default"
            },
            {
                "op": "choose_firstn",
                "num": 2,
                "type": "datacenter"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 2,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    },
    {
        "rule_id": 12,
        "rule_name": "site1-primary",
        "type": 1,
        "steps": [
            {
                "op": "take",
                "item": -19,
                "item_name": "site1"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 2,
                "type": "host"
            },
            {
                "op": "emit"
            },
            {
                "op": "take",
                "item": -20,
                "item_name": "site2"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 2,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    },
    {
        "rule_id": 14,
        "rule_name": "only-site1",
        "type": 1,
        "steps": [
            {
                "op": "take",
                "item": -19,
                "item_name": "site1"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 2,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    }
]
root@PVE01:~# ceph osd pool ls detail
pool 1 '.mgr' replicated size 2 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 14 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr read_balance_score 50.00
pool 26 'DS_Ceph_RBD_01' replicated size 4 min_size 2 crush_rule 11 object_hash rjenkins pg_num 512 pgp_num 512 autoscale_mode warn last_change 67969 lfor 0/6342/6869 flags hashpspool,selfmanaged_snaps stripe_width 0 pg_num_min 512 application rbd read_balance_score 1.75
pool 27 'DS_Ceph_RBD_02' replicated size 4 min_size 2 crush_rule 10 object_hash rjenkins pg_num 2048 pgp_num 2048 autoscale_mode warn last_change 21640 lfor 0/15716/21633 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd read_balance_score 1.48
pool 28 'rbd-bdd' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 autoscale_mode warn last_change 15734 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd read_balance_score 1.97
pool 29 'DS_RBD_site1_Primary' replicated size 4 min_size 2 crush_rule 12 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode warn last_change 30574 lfor 0/0/24790 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd read_balance_score 3.50
pool 30 'DS_RBD_site1_ONLY' replicated size 2 min_size 1 crush_rule 14 object_hash rjenkins pg_num 1024 pgp_num 1024 autoscale_mode warn last_change 64020 lfor 0/0/29431 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd read_balance_score 1.12
 

With a size 4 / min_size 2 setup, why would a single node (7 OSDs) going offline cause a total hang?
Unknown, especially since you posted output from the cluster in a healthy state. size 4 is generally a bad idea (even number; the last copy offers no utility), but it should not cause you any issues like this, especially with only one node out.
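One way to verify that the next time a node goes down: check whether any PG actually drops below min_size or goes inactive while the node is out. These are standard Ceph CLI calls, nothing specific to your setup; DS_Ceph_RBD_02 is just one of your pools, used as an example.

Bash:
# While the node is down/rebooting:
ceph health detail               # lists slow ops, blocked requests, degraded PGs
ceph pg dump_stuck undersized    # PGs running with fewer copies than 'size'
ceph pg dump_stuck inactive      # PGs that cannot serve IO at all

# Sanity-check the pool thresholds
ceph osd pool get DS_Ceph_RBD_02 size
ceph osd pool get DS_Ceph_RBD_02 min_size

If nothing goes inactive and no PG drops below min_size during the outage, the hang is more likely OSDs that are up-but-unresponsive than a genuine availability problem, which points back at the network or the individual OSDs.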

My suggestion: remove 2 monitors (you don't need them and they generate unnecessary traffic), then check all your nodes' NICs for tx/rx errors and retries. Assuming all is normal, watch the log files of all 3 surviving monitors and shut down a node to trigger the failure. If you need help decoding what's happening, post the resulting monitor logs here.
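A rough sketch of those steps, assuming you retire the monitors on PVE03 and PVE07 and that ens1f0 is the Ceph-facing interface (both are placeholders; pick whatever matches your layout):

Bash:
# Retire two monitors (PVE's wrapper; PVE03/PVE07 are just example choices)
pveceph mon destroy PVE03
pveceph mon destroy PVE07

# On every node, check the Ceph-facing NICs for errors/drops/retries
# (ens1f0 is a placeholder interface name)
ip -s link show ens1f0
ethtool -S ens1f0 | grep -iE 'err|drop|retrans'

# On each surviving monitor node, follow the mon log while you reboot a test node
journalctl -u ceph-mon@$(hostname) -f
# or, if file logging is enabled:
tail -f /var/log/ceph/ceph-mon.$(hostname).log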