Ceph Squid (19.2.3) Cluster Hangs on Node Reboot - 56 NVMe OSDs - PVE 9.1.1

kidoly

Hello everyone,

I’m seeking advice on a recurring issue where our Ceph cluster freezes during node reboots.

Environment:
  • Proxmox: 9.1.1 Enterprise
  • Ceph: 19.2.3 (Squid)
  • Hardware: 8 Nodes, 56 OSDs total (7.68TB NVMe), 5 monitors
  • Network: 40 Gbps, MTU 9000 (Jumbo Frames) for Ceph traffic. Dedicated bonds for Cluster/Ceph/Management.
Pool Configuration:
  • Primary Pools: size 4 / min_size 2 (e.g., Ceph_01)
  • Autoscale: Set to warn mode.

The Symptom: When any node reboots, we experience a cluster-wide IO freeze lasting several minutes.
  • Ceph reports "Slow OPS" and "Blocked Requests" (how I usually inspect them is sketched below).
  • Restarting the "slow" OSD manually usually resolves the hang, but by then the damage to production VMs is already done.
  • I have also noticed that some OSDs crash during the recovery phase.
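When the slow ops show up, this is roughly what I run; osd.12 is only a placeholder ID, not one of our specific OSDs:

Bash:
# Which OSDs are currently flagged with slow/blocked requests
ceph health detail

# On the node that hosts a suspect OSD: dump its in-flight and blocked ops
# (osd.12 is a placeholder ID)
ceph daemon osd.12 dump_ops_in_flight
ceph daemon osd.12 dump_blocked_ops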

Observations & Questions:
  1. With a size 4 / min_size 2 setup, why would a single node (7 OSDs) going offline cause a total hang?
  2. Are there specific Squid-version tunables for OSD peering or heartbeat timeouts that are recommended for high-performance NVMe/40 Gbps environments, so that a single slow OSD cannot block its primary PGs? (Examples of the kind of settings I mean are sketched below.)
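For context, these are the kinds of settings I have been reading about. The commands below only show the syntax and current values; the `set` line uses the documented default (20 s) purely as an example, not a tuning recommendation:

Bash:
# Current values (defaults unless overridden)
ceph config get osd osd_heartbeat_grace
ceph config get osd osd_heartbeat_interval
ceph config get osd osd_op_complaint_time
ceph config get mon mon_osd_down_out_interval

# Syntax for changing such a tunable cluster-wide
# (20 is the documented default for osd_heartbeat_grace, shown only as an example)
ceph config set osd osd_heartbeat_grace 20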

Best regards,
 
Please post the output of
  • ceph status
  • ceph mon dump
  • ceph config dump
  • ceph osd df tree
  • ceph osd crush rule dump
  • ceph osd pool ls detail
  • from each node: ip addr show
Here are the outputs requested for the 8-node Ceph Squid cluster.
I'm quite limited on the number of characters, so I put the rest in the attached files.
I anonymized the data.
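If anyone wants to reproduce the collection, a loop along these lines dumps each command into its own file (paths and file names are arbitrary):

Bash:
# Dump each requested command into its own file for attaching
for cmd in "ceph status" "ceph mon dump" "ceph config dump" \
           "ceph osd df tree" "ceph osd crush rule dump" "ceph osd pool ls detail"; do
    $cmd > "/tmp/$(echo "$cmd" | tr ' ' '_').txt"
done
ip addr show > "/tmp/ip_addr_show_$(hostname -s).txt"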

Bash:
root@PVE01:~# ceph status
  cluster:
    id:     cluster_id
    health: HEALTH_OK
 
  services:
    mon: 5 daemons, quorum PVE01,PVE03,PVE05,PVE07,wit01 (age 7d)
    mgr: PVE06(active, since 2w), standbys: PVE01
    osd: 56 osds: 56 up (since 7d), 56 in (since 2w)
 
  data:
    pools:   6 pools, 3969 pgs
    objects: 9.91M objects, 36 TiB
    usage:   145 TiB used, 247 TiB / 391 TiB avail
    pgs:     3969 active+clean
 
  io:
    client:   33 MiB/s rd, 58 MiB/s wr, 1.34k op/s rd, 1.80k op/s wr
 
root@PVE01:~# ceph mon dump
epoch 26
fsid cluster_id
last_changed 2025-12-15T15:34:17.287101+0100
created 2025-05-06T16:52:06.505622+0200
min_mon_release 19 (squid)
election_strategy: 1
0: [v2:10.170.252.21:3300/0,v1:10.170.252.21:6789/0] mon.PVE01
1: [v2:10.170.252.23:3300/0,v1:10.170.252.23:6789/0] mon.PVE03
2: [v2:10.170.252.25:3300/0,v1:10.170.252.25:6789/0] mon.PVE05
3: [v2:10.170.252.27:3300/0,v1:10.170.252.27:6789/0] mon.PVE07
4: [v2:10.160.252.20:3300/0,v1:10.160.252.20:6789/0] mon.wit01
dumped monmap epoch 26

root@PVE01:~# ceph osd crush rule dump
[
    {
        "rule_id": 0,
        "rule_name": "replicated_rule",
        "type": 1,
        "steps": [
            {
                "op": "take",
                "item": -1,
                "item_name": "default"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    },
    {
        "rule_id": 10,
        "rule_name": "replicated-4x-datacenter",
        "type": 1,
        "steps": [
            {
                "op": "take",
                "item": -1,
                "item_name": "default"
            },
            {
                "op": "choose_firstn",
                "num": 0,
                "type": "datacenter"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 2,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    },
    {
        "rule_id": 11,
        "rule_name": "replicated-2-per-dc",
        "type": 1,
        "steps": [
            {
                "op": "take",
                "item": -1,
                "item_name": "default"
            },
            {
                "op": "choose_firstn",
                "num": 2,
                "type": "datacenter"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 2,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    },
    {
        "rule_id": 12,
        "rule_name": "site1-primary",
        "type": 1,
        "steps": [
            {
                "op": "take",
                "item": -19,
                "item_name": "site1"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 2,
                "type": "host"
            },
            {
                "op": "emit"
            },
            {
                "op": "take",
                "item": -20,
                "item_name": "site2"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 2,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    },
    {
        "rule_id": 14,
        "rule_name": "only-site1",
        "type": 1,
        "steps": [
            {
                "op": "take",
                "item": -19,
                "item_name": "site1"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 2,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    }
]
root@PVE01:~# ceph osd pool ls detail
pool 1 '.mgr' replicated size 2 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 14 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr read_balance_score 50.00
pool 26 'DS_Ceph_RBD_01' replicated size 4 min_size 2 crush_rule 11 object_hash rjenkins pg_num 512 pgp_num 512 autoscale_mode warn last_change 67969 lfor 0/6342/6869 flags hashpspool,selfmanaged_snaps stripe_width 0 pg_num_min 512 application rbd read_balance_score 1.75
pool 27 'DS_Ceph_RBD_02' replicated size 4 min_size 2 crush_rule 10 object_hash rjenkins pg_num 2048 pgp_num 2048 autoscale_mode warn last_change 21640 lfor 0/15716/21633 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd read_balance_score 1.48
pool 28 'rbd-bdd' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 autoscale_mode warn last_change 15734 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd read_balance_score 1.97
pool 29 'DS_RBD_site1_Primary' replicated size 4 min_size 2 crush_rule 12 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode warn last_change 30574 lfor 0/0/24790 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd read_balance_score 3.50
pool 30 'DS_RBD_site1_ONLY' replicated size 2 min_size 1 crush_rule 14 object_hash rjenkins pg_num 1024 pgp_num 1024 autoscale_mode warn last_change 64020 lfor 0/0/29431 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd read_balance_score 1.12
 

Attachments

With a size 4 / min_size 2 setup, why would a single node (7 OSDs) going offline cause a total hang?
Unknown, especially since you posted output from the cluster in a healthy state. size 4 is generally a bad idea (an even replica count; the last copy offers no utility), but it should not cause issues like this, especially with only one node out.

My suggestion: remove 2 monitors (you don't need them and they generate unnecessary traffic), and check all your nodes' NICs for tx/rx errors and retries. Assuming all is normal, watch the log files of all 3 surviving monitors and shut down a node to trigger the failure. If you need help decoding what's happening, post the resulting monitor logs here.
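Roughly what I mean, as a sketch (the NIC name is a placeholder for each bond member, and PVE01 stands in for whichever MON you are watching):

Bash:
# Per-NIC error/drop/retransmit counters -- repeat for every bond member on every node
NIC=eth0   # placeholder: substitute your actual interface name
ethtool -S "$NIC" | grep -iE 'err|drop|retr'

# Follow a monitor's log while you shut the node down (do this on each surviving MON)
journalctl -u ceph-mon@PVE01.service -f
# or the classic log file:
tail -f /var/log/ceph/ceph-mon.PVE01.log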
 
Hello,
Appreciate your help ;]
It's a production environment, so I can't test it today. I'll try to negotiate a test window for next weekend.
 
Provide the content of the /etc/network/interfaces file.
Thanks for your help, here it is:

Code:
auto lo
iface lo inet loopback

#Bond LACP
auto bond0
iface bond0 inet manual
        bond-slaves enp152s0 enp152s0d1
        bond-miimon 100
        bond-mode 802.3ad
        bond-xmit-hash-policy layer3+4
        bond-lacp-rate fast
        mtu 9000

#VLAN3000 - Management
auto bond0.3000
iface bond0.3000 inet manual

auto vmbr0v3000
iface vmbr0v3000 inet static
        address 10.150.252.21/24
        gateway 10.150.252.1
        bridge-ports bond0.3000
        bridge-stp off
        bridge-fd 0
        dns-nameservers 1.1.1.1

#VLAN3001 - Public-Ceph
auto bond0.3001
iface bond0.3001 inet manual

auto vmbr0v3001
iface vmbr0v3001 inet static
        address 10.170.252.21/24
        bridge-ports bond0.3001
        bridge-stp off
        bridge-fd 0
        mtu 9000

#VLAN3002 - Communication-Cluster
auto bond0.3002
iface bond0.3002 inet manual

auto vmbr0v3002
iface vmbr0v3002 inet static
        address 10.171.252.21/24
        bridge-ports bond0.3002
        bridge-stp off
        bridge-fd 0

#VLAN3003 - Live-Migration
auto bond0.3003
iface bond0.3003 inet manual

auto vmbr0v3003
iface vmbr0v3003 inet static
        address 10.172.252.21/24
        bridge-ports bond0.3003
        bridge-stp off
        bridge-fd 0
        mtu 9000

#VLAN3011 - Ceph-Cluster
auto bond0.3011
iface bond0.3011 inet manual

auto vmbr0v3011
iface vmbr0v3011 inet static
        address 10.11.252.21/24
        bridge-ports bond0.3011
        bridge-stp off
        bridge-fd 0
        mtu 9000

#Bridge zone SDN type VLAN "z01"
auto vmbr0
iface vmbr0 inet static
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0

source /etc/network/interfaces.d/*
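A quick way to sanity-check the jumbo frames and the LACP bond from this node (10.170.252.23, PVE03's Ceph-public address, is just an example target):

Bash:
# Verify that MTU 9000 really passes end-to-end on the Ceph public VLAN
# (8972 = 9000 minus 28 bytes of IP/ICMP headers; -M do forbids fragmentation)
ping -M do -s 8972 -c 3 10.170.252.23

# LACP / bond member state
cat /proc/net/bonding/bond0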
 
remove 2 monitors (you dont need them and they generate unnecessary traffic
The current recommendation from the Ceph project is to run 5 MONs.
With only three MONs you are in a high-risk situation after losing just one MON: lose another and your cluster stops.
With five MONs you can lose two and the cluster will still work.
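You can check the current quorum at any time with:

Bash:
# One-line summary: monmap epoch, MON names, and who is currently in quorum
ceph mon stat

# Full detail, including any MONs known to the monmap but outside quorum
ceph quorum_status -f json-pretty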
 
Is a specific pool affected by shutting down one host?

Your CRUSH rules are a wild mix.
All pools are impacted when a single host is shut down.

Below are logs from an OSD that crashed during the recovery process after a PVE reboot. In total, four OSDs crashed: two on the node that was rebooting and two on a different node.
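If the attachments are not enough, the same crash reports can be queried from the cluster's crash module (the crash ID below is a placeholder):

Bash:
# List all crashes recorded by the mgr crash module
ceph crash ls

# Full metadata and backtrace for a single crash (ID is a placeholder)
ceph crash info <crash_id>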
 

Attachments

The current recommendation from the Ceph project is to run 5 MONs.
Cite your sources, please. Five monitors are "suggested" for clusters with a high number of OSD nodes.

With only three MONs you run into a high risk situation after losing just one MON. Losing another and your cluster stops.
With a typical CRUSH rule of 3:2, this only makes sense IF you have dedicated monitor nodes (i.e., no OSDs on them) AND environmental issues that routinely take your nodes down. Otherwise the risk is minuscule. If your monitors sit on OSD servers, losing 2 nodes will take your cluster down anyway (at least partially). Each monitor generates and consumes traffic, and the synchronous nature of monitor traffic means I/O latency climbs with every additional monitor; to stay performant, you want no more than the absolute minimum number that still meets your availability criteria, which, as pointed out above, is aimed at single-node faults anyway.

Oh, and the cluster will continue to function with a single monitor; they're not placement groups.
 
No: as soon as there is no quorum anymore between the MONs, i.e. the majority of MONs no longer see each other, the cluster will stop working.
That isn't actually so; you can think of the monitor quorum rule as 3:1. Fun fact: a cluster with 2 monitors is more prone to PG errors (monitor disagreement) than one with a single monitor. Feel free to try it yourself: shut down all but one of your monitors and see what happens. This has happened to me on numerous occasions (not by choice...)

Noted on the cephadm toolset. I understand their POV, but I disagree with the "lightweight" reasoning; while the daemon itself doesn't consume many resources, the monitors do generate quite a bit of traffic without any notable benefit.