VMs freeze after node failure/reboot in 5-node Ceph cluster

May 8, 2025
We are running a 5-node Proxmox cluster with Ceph. Whenever a node goes down or is rebooted, VMs on one of the remaining nodes freeze. Hoping someone has seen this before or can point me in the right direction.

Environment

  • Proxmox VE 9.1.6 cluster with 5 nodes
  • Ceph 19.2.3 (Squid)
  • 10 OSDs total (2x NVMe per node, 3.6 TB each)
  • Pool configuration: size=3, min_size=2
  • CRUSH rule: chooseleaf_firstn by host
  • Network: 10G; Ceph public and cluster network on the same subnet (10.0.1.0/24), Corosync on a separate interface (10.0.0.0/24)

Problem

When a node fails or is rebooted, VMs on one of the remaining nodes freeze completely. The affected VMs are stored on the Ceph HA pool and show the following repeated kernel messages:

Code:
INFO: task jbd2/sda1-8:228 blocked for more than 120 seconds
INFO: task systemd-journal:272 blocked for more than 120 seconds

The blocked time increases continuously (120s → 241s → 362s → 483s...). The VMs become entirely unresponsive and even QEMU itself stops responding:

Code:
VM 129 qmp command 'quit' failed - got timeout

Observations

Ceph status during the outage:

  • 2 OSDs down (1 host down)
  • 75 PGs in state active+undersized+degraded
  • Approximately 19% of objects degraded

HA manager loses locks:

Code:
lost lock 'ha_agent_pve-5_lock' - cfs lock update failed - Device or resource busy
status change active => lost_agent_lock

This occurs repeatedly, with the agent toggling between active and lost_agent_lock multiple times.

Primarily affected: VMs running on the node that holds the active Ceph Manager. It appears that the MGR failover combined with the peering storm completely blocks local VM I/O.

Steps taken so far

Recovery throttling was applied but did not resolve the issue:

Code:
ceph config set osd osd_max_backfills 1
ceph config set osd osd_recovery_max_active 3
ceph config set osd osd_recovery_sleep 0.1
ceph config set osd osd_recovery_op_priority 3

This suggests that the problem is not caused by recovery/backfill, but rather by the peering phase itself blocking client I/O.

Configuration

ceph.conf:

Code:
[global]
cluster_network = 10.0.1.1/24
public_network = 10.0.1.1/24
osd_pool_default_min_size = 2
osd_pool_default_size = 3

Public and cluster network point to the same subnet — there is no separate cluster network.

OSD tree:

Code:
ID   CLASS  WEIGHT    TYPE NAME                 STATUS
 -1         36.38687  root default
 -3          7.27737      host pve-1
  0   nvme   3.63869          osd.0                 up
  1   nvme   3.63869          osd.1                 up
 -7          7.27737      host pve-2
  2   nvme   3.63869          osd.2                 up
  3   nvme   3.63869          osd.3                 up
[... 2 OSDs per node across all 5 nodes]

Sample VM configuration (affected):

Code:
scsi0: ceph-ha:vm-203-disk-0,iothread=1,size=32G
scsihw: virtio-scsi-single



Any help is appreciated!
 
Hi @torbho, before recommending solutions, I'd need more information:
  1. Reproducibility: Does this happen every time a node goes down/reboots, or was this a one-time event? If reproducible, this strongly suggests a structural issue (network, config). If one-time, it could be a transient condition or bug.
  2. Duration: How long does the freeze last? Does it eventually resolve on its own, or does it require manual intervention (e.g., restarting OSDs, fencing the down node)?
  3. Timing correlation: How quickly after the node goes down do VMs start freezing? Immediately (seconds), or after a delay (30s+)? This helps distinguish between peering-blocked I/O (immediate) vs network saturation (gradual).
  4. Which VMs freeze: Do ALL VMs on healthy nodes freeze, or only VMs whose RBD images have PGs on the lost OSDs? If only affected PGs' VMs freeze, that's peering-blocked I/O. If all VMs freeze, that's more likely network saturation.
  5. noout flag: Was `noout` set before planned reboots? If not, Ceph marks OSDs as `out` after `mon_osd_down_out_interval` (default 600s), triggering full recovery/backfill on top of peering.
  6. Previous state: Was the cluster in `HEALTH_OK` before the node went down? Any pre-existing degraded PGs, slow requests, or warnings?
  7. MGR observation: You mentioned VMs on the active-MGR node are primarily affected. Do VMs on other surviving nodes also freeze, or is it truly only the MGR node? This helps us determine if the MGR is relevant or if it's just correlated because that node also lost 2 OSDs.
  8. Pool details: How many Ceph pools do you have, and what are their PG counts? The 75 degraded PGs — is that across all pools or just `ceph-ha`? What's the total PG count? (`ceph osd pool ls detail`)
  9. Does peering eventually complete? Does the freeze eventually resolve on its own after peering finishes, or do you have to intervene? If peering never completes, the PGs may be stuck (e.g., in `down+peering` because they need an OSD that's on the downed node), which is a different problem from slow peering.
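One way to answer question 4 empirically, when the time comes, is to map a VM disk's RBD objects to PGs. A minimal sketch follows; the prefix value here is fabricated, the real one comes from `rbd info <pool>/<image>` (field `block_name_prefix`), and each resulting object name can then be fed to `ceph osd map <pool> <object>` to find its PG and OSDs:

```shell
# RBD stores an image as objects named <block_name_prefix>.<16-hex-digit index>.
prefix="rbd_data.ab12cd34ef56"   # example value, not from a real cluster
for i in 0 1 2; do
  printf '%s.%016x\n' "$prefix" "$i"
done
# → rbd_data.ab12cd34ef56.0000000000000000
# → rbd_data.ab12cd34ef56.0000000000000001
# → rbd_data.ab12cd34ef56.0000000000000002
```

If the PGs that `ceph osd map` returns for a frozen VM's objects include the rejoining OSDs, that supports the peering-blocked-I/O theory; if frozen VMs map to unrelated PGs, network disruption becomes more likely.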

Also, we need to run commands to collect more info from the cluster:
  1. Catch client ops stuck in "waiting for peered" state
    This is the smoking gun. During the freeze, run on each surviving node:
    Bash:
     # Show all ops currently blocked, filtered for peering-related delays
     ceph daemon osd.<ID> dump_ops_in_flight | \
    jq '[.ops[] | select(.state == "waiting for peered") |
    {description: .description, duration: .duration, state: .state}] |
    sort_by(-.duration)'
    If you see many ops with state "waiting for peered" and durations matching the freeze time (~120s+), this confirms peering is the blocker. The `duration` field tells you exactly how long each op has been waiting.
    Also check for ops stuck earlier in the pipeline:
    Bash:
     # Ops that haven't even reached their PG yet (OSD message queue congestion)
     ceph daemon osd.<ID> dump_ops_in_flight | \
    jq '[.ops[] | select(.state == "queued for pg" or .state == "reached pg") |
    {description: .description, duration: .duration, state: .state}] |
    sort_by(-.duration) | .[0:10]'
     # After the event resolves, check historic slow ops for the full picture
     ceph daemon osd.<ID> dump_historic_slow_ops | \
    jq '[.ops[] | {description: .description, duration: .duration,
    initiated_at: .initiated_at, state: .state}] | sort_by(-.duration) | .[0:20]'
  2. Recovery state latency counters (the key metrics)
    Ceph tracks time spent in each PG recovery state as OSD-level perf counters (collection name `recoverystate_perf`, defined in `src/osd/osd_perf_counters.h:146-178`). These are `time_avg` counters — each reports `{avgcount, sum, avgtime}`.
    Bash:
    # Dump peering substate latencies for a specific OSD
    ceph daemon osd.<ID> perf dump recoverystate_perf | \
      jq '{peering: .recoverystate_perf.peering_latency,
           getinfo: .recoverystate_perf.getinfo_latency,
           getlog: .recoverystate_perf.getlog_latency,
           getmissing: .recoverystate_perf.getmissing_latency,
           waitupthru: .recoverystate_perf.waitupthru_latency,
           waitactingchange: .recoverystate_perf.waitactingchange_latency,
           active: .recoverystate_perf.active_latency}'
    See the table below for the interpretation.
  3. Client op latency and degraded-delay counters
    Bash:
    # Shows client ops that were delayed due to PG state
    ceph daemon osd.<ID> perf dump osd | \
      jq '{op_latency: .osd.op_latency,
           op_w_latency: .osd.op_w_latency,
           op_r_latency: .osd.op_r_latency,
           op_delayed_unreadable: .osd.op_delayed_unreadable,
           op_delayed_degraded: .osd.op_delayed_degraded}'
    op_delayed_unreadable and op_delayed_degraded count client ops delayed because PGs were in degraded/unreadable state. High values during the event confirm client I/O is being held up by PG state.
  4. PG log sizes (can be checked now, before the event)
    Bash:
    # Large PG logs slow down the GetLog peering phase
    ceph pg dump --format json-pretty | \
      jq '[.pg_map.pg_stats[] | {pgid: .pgid, log_size: .log_size,
        ondisk_log_size: .ondisk_log_size}] | sort_by(-.log_size) | .[0:10]'
    
    # Also check the configured PG log limits
    ceph config get osd osd_min_pg_log_entries
    ceph config get osd osd_max_pg_log_entries
    ceph config get osd osd_pg_log_dups_tracked

    If PGs have >10k log entries, the GetLog phase transfers more data. Default osd_max_pg_log_entries is 10000 (or 100000 for SSDs in some versions). Check if this has been increased.
The key unknown is why peering takes >120 seconds on NVMe with 10G networking. On a healthy cluster, peering 75 PGs should take seconds. The `recoverystate_perf` counters will tell us exactly which peering substate is slow.

Counter | What it measures | If high, points to
------- | ---------------- | ------------------
peering_latency | Total time in peering (the key metric) | Peering is the bottleneck (expected <100ms on NVMe)
getinfo_latency | Collecting PG info from peers | Slow OSD responses or message queue congestion
getlog_latency | Exchanging PG logs | Large PG logs or slow peer I/O
getmissing_latency | Determining missing objects | Many objects to reconcile
waitupthru_latency | Waiting for mon to acknowledge up_thru | Slow monitor or mon network
waitactingchange_latency | Waiting for acting set change | CRUSH instability or slow mon

Each counter reports `{"avgcount": N, "sum": T, "avgtime": A}`:
  • avgcount = number of times PGs entered this state
  • sum = total wall-clock seconds across all PGs
  • avgtime = average duration per entry (sum / avgcount)

If peering_latency.avgtime >> 100ms, peering is abnormally slow. The substate breakdown tells you where the time is spent. For example, getlog_latency.avgtime = 5s with getinfo_latency.avgtime = 50ms means GetLog is the bottleneck → check PG log sizes. Please note, these counters are cumulative since OSD start. To get delta values for just the failure event, capture them before and after the event and subtract.
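To illustrate that subtraction, here is a self-contained sketch with made-up numbers. The filenames, the single counter shown, and all values are placeholders; the real captures come from `ceph daemon osd.<ID> perf dump recoverystate_perf` run before and after the event:

```shell
# Fabricated two-point example: a "before" and "after" capture of one counter.
cat > before.json <<'EOF'
{"recoverystate_perf": {"peering_latency": {"avgcount": 100, "sum": 2.0, "avgtime": 0.02}}}
EOF
cat > after.json <<'EOF'
{"recoverystate_perf": {"peering_latency": {"avgcount": 175, "sum": 152.0, "avgtime": 0.868}}}
EOF

# Subtract per state: delta avgcount = entries during the event,
# delta sum = seconds spent in that state during the event.
python3 - <<'EOF'
import json
before = json.load(open("before.json"))["recoverystate_perf"]
after  = json.load(open("after.json"))["recoverystate_perf"]
for state in before:
    dc = after[state]["avgcount"] - before[state]["avgcount"]
    ds = after[state]["sum"] - before[state]["sum"]
    avg = ds / dc if dc else 0.0
    print(f"{state}: count={dc} total={ds:.1f}s avg={avg:.3f}s")
EOF
# → peering_latency: count=75 total=150.0s avg=2.000s
```

In this made-up example, 75 peering episodes averaged 2 s each during the event, exactly the kind of number that would explain multi-second client I/O stalls.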
 
It appears that the MGR failover combined with the peering storm completely blocks local VM I/O.

MGR's actual role:
  • MgrClient in the OSD (src/osd/OSD.cc) only does: send_pgstats(), update_daemon_health(), set_perf_metric_query_cb() — all async, none on the I/O path
  • PeeringState.cc has zero references to MGR
  • PrimaryLogPG::do_request() never touches MGR
  • Client I/O flow: client → OSD → PrimaryLogPG → BlueStore — MGR is never consulted
So, your observation "VMs on the active-MGR node are primarily affected" is almost certainly correlation. One minor interaction to check: if the newly-active MGR on a surviving node causes a CPU spike during `load_all_metadata()` and Python module startup, it could marginally slow OSDs on that same node. But this would be a small transient effect, not a 120-second freeze.
 
@tchaikov
Thanks for the incredibly detailed response — this gives us a clear path forward. Let me answer your questions:


Reproducibility: This happens almost every single time a node goes down and comes back up.


Duration: The freeze does not resolve on its own. We have to manually intervene — typically by restarting the affected VMs, which sometimes requires a force-stop since QMP is unresponsive.


Timing: This is an important detail — the VMs don't freeze when the node goes down. The cluster runs fine in the degraded state with 75 PGs active+undersized+degraded, and VMs continue to operate normally. The freeze occurs when the node comes back up and its 2 OSDs rejoin, triggering all 75 PGs to re-peer simultaneously. So the problem is not the initial peering after the failure, but the re-peering on OSD reentry.


Which VMs freeze: All VMs freeze on one of the healthy nodes. We haven't yet mapped out whether it's strictly correlated to PGs on the returning OSDs, but we'll check during the next test. It does appear to be the node holding the active Manager.


noout flag: No, noout was not set before planned reboots. We'll start doing this going forward for maintenance windows.


Previous state: Yes, the cluster was in HEALTH_OK before each event. No pre-existing degraded PGs or slow requests.


MGR observation: Based on your earlier analysis of the source code, we agree the MGR correlation was likely coincidental. We'll pay closer attention to which exact VMs freeze during the next test and map them to affected PGs.


Pool details: We'll provide the output of ceph osd pool ls detail and total PG counts before the next test.


Peering completion: Peering does seem to complete eventually based on the PG states, but the affected VMs remain frozen even after — they never recover without manual intervention.


We're planning a controlled test during a maintenance window. We'll capture recoverystate_perf counters before and after, run dump_ops_in_flight during the freeze, and check PG log sizes beforehand. Will report back with the full data. Your breakdown of the peering substates and what to look for is exactly what we needed.
 
Your reply confirms this is a two-phase problem:

  1. Phase 1: Why does re-peering take long enough to cause 120s+ blocked I/O?
    On NVMe with 10G networking, peering 75 PGs should complete in seconds, not minutes. Something is making it abnormally slow. The recoverystate_perf counters from my previous post will pinpoint which peering substate is the bottleneck.
  2. Phase 2: Why don't VMs recover after peering completes?
    This is the more puzzling part. Once peering finishes and PGs become active+clean, the waiting_for_peered queue drains and all blocked ops are reprocessed. VMs should resume. The fact that they don't -- and that QMP becomes unresponsive — tells us something else breaks permanently during the prolonged block.
A couple of observations that may help narrow this down:
  1. The jbd2/sda1-8 blocked message: where are you seeing this? If it's on the VM console, then `jbd2/sda1-8` is the guest kernel's ext4 journaling thread for the VM's virtual disk (`/dev/sda1` inside the guest, backed by ceph-ha:vm-203-disk-0). That would be the expected symptom: Ceph I/O is blocked during peering → guest's disk I/O hangs → guest kernel reports blocked tasks. If it's in the host's `dmesg`, that's a different (and more concerning) situation — it would mean the host's own filesystem is stuck too.
  2. The guest kernel's SCSI timeout is likely why VMs don't recover. The default SCSI command timeout for virtio-scsi is 30 seconds. If the peering freeze lasts >30s (and yours lasts >120s), the guest kernel will see SCSI command timeouts. When ext4's journal commit times out, it aborts the journal and remounts the filesystem read-only. At that point, the guest is broken — even after Ceph recovers and RBD I/O resumes, the guest's filesystem won't come back without manual intervention (which looks like "the VM is permanently frozen").
    After force-restarting a frozen VM, check the guest dmesg for `EXT4-fs error`, `Remounting filesystem read-only`, or SCSI timeout messages. If you see those, this is the cause.
  3. The escalating block times (120 → 241 → 362 → 483) are the kernel's `hung_task_timeout_secs` (default 120s) re-reporting the same stuck task. This is one continuous block, not separate events.
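On point 2, one common mitigation (an assumption worth validating for your guests, not something established in this thread) is to raise the guest's SCSI command timeout so that a storage stall shorter than the new value no longer escalates into SCSI aborts and a read-only journal. The device name, rule filename, and the 180 s value below are all examples:

```shell
# Inside the guest: inspect the current SCSI command timeout (seconds);
# the virtio-scsi default is typically 30.
cat /sys/block/sda/device/timeout

# Raise it for the running system (example value):
echo 180 > /sys/block/sda/device/timeout

# Persist across reboots with a udev rule (example rule file):
cat > /etc/udev/rules.d/99-scsi-timeout.rules <<'EOF'
ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="sd[a-z]", ATTR{device/timeout}="180"
EOF
```

This only buys headroom inside the guest; it does not address the underlying peering delay.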
A note on noout: Setting noout before planned reboots is good practice, but it won't prevent the re-peering storm. noout prevents OSDs from being marked out (which avoids triggering recovery/backfill), but when the returning OSDs come back up, CRUSH still recalculates the up set, the interval changes, and all affected PGs still re-peer. What noout helps with is avoiding the *additional* data movement (recovery/backfill) that happens if the OSD was out long enough to be fully removed from the acting set.
 
FWIW, based on info I had found here, we set:
Code:
nodeep-scrub
noout
norebalance
norecover
noscrub

After rebooting a node, we clear the norecover flag so the cluster can recover (which takes about 5 seconds), then set norecover again and repeat with the next node.

And "ha-manager crm-command node-maintenance enable nodename" and "...disable nodename" to start/end maintenance mode and auto migrate all VMs.
 
Here's an update on our investigation and the changes we've made so far.

Network topology:
What we hadn't mentioned earlier is that our 5-node cluster uses a ring topology — the servers are directly connected to each other with no switch in between. Each node has a dual-port Intel X540-AT2 10G NIC, with both ports used as bridge ports on multiple bridges. STP is used.


Node 1 ── Node 2 ── Node 3 ── Node 4 ── Node 5
   └─────────────────────────────────────┘


Current bridge setup:
Now we've separated all traffic into individual VLANs on dedicated bridges:

  • br0 (VLAN 10): Corosync only
  • br1 (VLAN 20): Ceph storage traffic
  • br2 (VLAN 40): VXLAN (VM internet traffic, previously shared with Corosync on br0)
  • vmbr2 (VLAN 30): Inter-VM traffic

All four bridges have bridge-stp on, with two ports each.

STP forward delay:
We've reduced bridge-fd from 5 to 2 on all bridges across all nodes. This reduces the STP reconvergence time from ~10 seconds (worst case ~30s if the root bridge fails) to ~4 seconds (worst case ~24s).
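For reference, the arithmetic behind those figures: classic 802.1D STP spends forward_delay seconds in each of the listening and learning states, and indirect failure detection adds up to max_age. A small sketch using the default max_age of 20s:

```shell
# Classic STP convergence estimate: 2 x forward_delay once a topology
# change is detected; add max_age (20s default) for indirect detection.
max_age=20
for fd in 5 2; do
  typical=$((2 * fd))
  worst=$((max_age + 2 * fd))
  echo "bridge-fd=${fd}: typical=${typical}s worst=${worst}s"
done
# → bridge-fd=5: typical=10s worst=30s
# → bridge-fd=2: typical=4s worst=24s
```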

MSTPD/RSTP: We wanted to install mstpd for RSTP support, which would bring reconvergence down to 1-2 seconds. However, the mstpd package has been removed from Debian Trixie (Proxmox 9 is based on Trixie). Has anyone successfully installed mstpd on Proxmox 9 / Debian Trixie? Did you compile from source or use the Bookworm package?

Next steps:
We're planning a controlled test — shutting down a node and bringing it back up while monitoring Ceph peering states, STP topology changes, and HA manager logs. Will report back with the results.

The key question remains: Given our ring topology with Linux bridge STP (no RSTP available), is a 4-second reconvergence window short enough to avoid triggering the peering storm, or is RSTP essential for this to work reliably?
 
Thanks for the detailed network diagram. Pulling together everything you've shared, I think the picture is fairly clear now -- though some targeted measurements would still help confirm the timing.

The core reason VMs freeze

During Ceph peering, affected PGs are not available for client I/O. Any QEMU I/O request to a disk on a peering PG simply blocks until peering completes. Normally peering takes milliseconds to a few seconds and guests don't notice. In your case, several factors stack up to stretch peering into tens of seconds, long enough to trigger the `jbd2` 120s blocked alarm and eventually cause the guest filesystem to go read-only.

Why peering is so slow in your environment

In a normal node reentry -- no network disruption, healthy topology -- the same 75 PGs would peer and return to `active+clean` in a few seconds without guests noticing. The key question is what makes this case different.

  1. Classic STP reconvergence disrupts traffic between healthy nodes (ring-specific)
    In a star topology (nodes connected through a central switch), a node returning only affects its own uplink. Other nodes keep talking to each other uninterrupted, so peering between healthy OSDs proceeds normally while the returning OSDs bootstrap.
    In your ring topology, STP blocks one port somewhere in the ring to prevent a loop. When the returning node's two ports come up, STP reconverges the entire ring — the blocked port may move, and during the transition, traffic between *healthy* nodes can be briefly interrupted too. Your four independent bridge instances (br0, br1, br2, vmbr2) each reconverge separately and asynchronously, so the effective disruption window on br1 (Ceph VLAN) depends on when that specific bridge resolves.
    If br1 is disrupted for longer than `osd_heartbeat_grace` (default 20s), the Monitor may mark multiple OSDs down — not just the returning node's, but any whose heartbeats are lost during the disruption — and then back up, triggering a full down→up cycle on top of the reentry peering. If shorter than 20s, peering messages between healthy OSDs are still dropped and must time out and retry, stretching what would normally be a sub-second exchange into tens of seconds per PG.
  2. Retries across 75 PGs amplify the load on booting OSDs
    A single node hosts two OSDs that are primary or acting-set members for 75 PGs. In the normal case, all 75 start peering simultaneously and each completes in milliseconds — the parallelism is fine. What changes here is that STP-induced message loss forces retries. Each of the 75 PGs retries its GetInfo and GetLog requests, and the target of most of those retries is the returning node's OSDs — which are still running BlueStore mount and journal recovery. The combination of bootstrap I/O and a retry storm across all 75 PGs simultaneously is what pushes peering time from seconds into the range that triggers guest SCSI timeouts.
  3. HA lock loss (separate, concurrent symptom)
    The `HA agent lost lock` events you saw are a separate consequence of br0 (Corosync VLAN) reconverging — not a feedback loop through Ceph. Proxmox HA locks through `pmxcfs`/Corosync, not RADOS. The two disruptions happen concurrently because the same node return triggers STP reconvergence on all bridges at once, but they're independent failure modes.
Recommended fixes, in priority order
  1. Is 4-second convergence sufficient? Tune heartbeat grace to find out
    With `bridge-fd 2`, the typical forwarding delay is ~4s after a topology change is detected. However, the worst case in classic STP includes the max-age detection timeout (default 20s) + 2×fd = ~24s. With the default `osd_heartbeat_grace=20s`, your worst-case 24s disruption still causes OSDs to be marked down.
    Setting heartbeat grace above your worst-case convergence time prevents the OSD down→up cycle entirely:
    Bash:
    ceph config set global osd_heartbeat_grace 30   # default 20 — covers your worst-case ~24s
    This is not a runtime-dynamic option — both OSDs and monitors need to restart to pick it up. Set it in `global` (not just `osd`) so that monitors read it too, since both daemons use this value. A rolling OSD restart is sufficient; monitors can be restarted one at a time without quorum loss.
    With this in place, a 4s typical disruption will cause TCP connection resets and peering message loss, but OSDs won't be declared down. Whether the resulting reconnect overhead and retry storm still pushes peering past the SCSI timeout threshold is the remaining question — that's exactly what the diagnostics below will tell you. If the freeze disappears or shortens significantly, 4s convergence is sufficient. If VMs still freeze, RSTP is necessary.
  2. If 4-second convergence is not sufficient: move to RSTP
    Since `mstpd` is no longer available in Trixie, the practical option is Open vSwitch, which supports RSTP natively and is available as `openvswitch-switch`. RSTP converges in ~1s rather than 4s and eliminates the topology-change detection delay entirely. The trade-off is that OVS adds configuration complexity compared to Linux bridges.
  3. Reduce peering log comparison work (minor, but easy)
    The GetLog phase of peering compares PG logs between OSDs. Reducing the cap reduces the amount of data exchanged per PG:
    Bash:
    ceph config set osd osd_max_pg_log_entries 1000 # default 10000
    This controls how many log entries OSDs retain per PG going forward — it won't retroactively trim already-accumulated logs, so it takes effect gradually as logs are trimmed over time. It won't help with the very next reboot but reduces the steady-state GetLog cost.
  4. Separate public and cluster networks (best practice, not a fix for this issue)
    Your public and cluster networks are on the same subnet. With 10G NICs, peering messages alone won't saturate the link — this is unlikely to be contributing to the freeze. That said, separating them is worth doing eventually so that post-peering recovery and backfill traffic doesn't compete with guest I/O over the long term. It's not urgent here.
Confirming whether fix #2 is needed

Apply fix #1 (`osd_heartbeat_grace=30`) unconditionally — it's low-risk and prevents the worst case regardless. Then run a controlled reboot to see if the freeze is gone, shortened, or unchanged. These captures help interpret the result:

Quick post-mortem (no preparation needed):
Bash:
# After the freeze resolves — did bridge ports leave forwarding state?
dmesg -T | grep -Ei 'topology|stp|blocking|forwarding|listening|learning|br[0-9]'

# Were OSDs actually marked down/up?
grep -E 'osd\.[0-9]+ (down|up)' /var/log/ceph/ceph.log | tail -40

If OSDs were marked down and then back up despite the new 30s grace period, the STP disruption is exceeding your worst-case estimate — fix #2 (RSTP) becomes necessary. If OSDs stayed up and the freeze is gone or much shorter, fix #1 was sufficient and 4s convergence is tolerable.

Real-time capture (run before triggering the reboot):
Bash:
# Watch OSD up/down events live
ceph -w >> /tmp/ceph-events.log &

# Measure actual packet loss on the Ceph network
tcpdump -i br1 -n 'portrange 6800-7300' -w /tmp/br1-ceph.pcap &

# STP events
dmesg -w | grep -Ei 'topology|stp|blocking|forwarding|listening|learning|br[0-9]' \
  >> /tmp/stp-events.log &

# Bridge port state changes
bridge monitor >> /tmp/bridge-monitor.log &

After the freeze, stop with `kill %1 %2 %3 %4` and review the logs:

pcap — actual disruption duration on br1:
Bash:
tcpdump -r /tmp/br1-ceph.pcap -n -tt | awk \
  'NR==1{prev=$1; next} {gap=$1-prev; if(gap>1) print "GAP: "gap"s at "$1; prev=$1}'
Each line reports a gap in Ceph traffic between healthy nodes. The duration tells you the actual br1 disruption window.
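To sanity-check that gap detector before the real capture, it can be fed synthetic timestamps (the numbers below are made up; `tcpdump -tt` emits epoch-seconds timestamps in the first field just like this):

```shell
# A 5-second hole between 100.2 and 105.2 should be flagged as a GAP.
printf '100.0 pkt\n100.1 pkt\n100.2 pkt\n105.2 pkt\n105.3 pkt\n' | awk \
  'NR==1{prev=$1; next} {gap=$1-prev; if(gap>1) print "GAP: "gap"s at "$1; prev=$1}'
# → GAP: 5s at 105.2
```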

`/tmp/ceph-events.log` — were OSDs declared down?
Bash:
grep -E 'osd\.[0-9]+ (down|up)' /tmp/ceph-events.log
Look for `osd.N down` followed by `osd.N up` events. Note their timestamps relative to when you triggered the reboot. If you see down/up pairs, the disruption exceeded 30s (or the heartbeat grace wasn't picked up yet) and the full down→up cycle occurred.

`/tmp/bridge-monitor.log` — port-level detail:
Bash:
grep -A2 'br1' /tmp/bridge-monitor.log
`bridge monitor` outputs netlink events per port. Look for state change events on br1's member interfaces (the physical NICs bound to br1) around the time of the reboot. This gives finer-grained timing than dmesg alone.

The gap duration in the pcap is the key number:
  • If it exceeds 30s, OSDs were declared down despite fix #1.
  • If it's under 30s but VMs still freeze, TCP reconnect overhead during the disruption is pushing peering past the SCSI timeout — fix #2 (RSTP via OVS) is needed.
 
Just edited the previous reply: `ceph config set global osd_heartbeat_grace 30` and the accompanying comment. After auditing Squid's source code, osd_heartbeat_grace has no `flags: [runtime]` entry — unlike options that DO have that flag, this one requires a restart to take effect. The definition also notes it must be set in both [mon] and [osd] (or [global]) sections since both monitor and OSD daemons read it.
 
@tchaikov
Thank you! However, I just checked whether the value was actually picked up without a restart:

Code:
# Config database:
$ ceph config get osd osd_heartbeat_grace
30

# Running daemons (checked osd.0 through osd.9):
$ ceph daemon osd.<ID> config get osd_heartbeat_grace
{"osd_heartbeat_grace": "30"}

Both show 30, and I have not restarted any OSDs since setting the value with ceph config set osd osd_heartbeat_grace 30. So it appears the running daemons picked up the new value dynamically without a restart — at least on Ceph Squid 19.2.3.

I'll proceed with the controlled test as soon as I can schedule a maintenance window. The cluster is in production use, which makes the current situation particularly painful — every unplanned node reboot or failure directly impacts live services. We'll report back with the full diagnostic results once we've had the chance to run the test.
 
Both show 30, and I have not restarted any OSDs since setting the value with ceph config set osd osd_heartbeat_grace 30. So it appears the running daemons picked up the new value dynamically without a restart — at least on Ceph Squid 19.2.3.

Thanks for checking this. It turns out the runtime flag is not the sole indicator of whether an option can be updated at runtime; rather, it serves to override or make explicit the default policy used by the config system. In this case, since osd_heartbeat_grace is an int option and not a mgr option, it is allowed to be updated at runtime even without the explicit runtime flag. I’ll take a look at how we can improve the documentation to make this behavior clearer.

See also https://github.com/ceph/ceph/blob/2...3d4a74c273697b/src/common/options.h#L400-L413