VMs freeze after node failure/reboot in 5-node Ceph cluster

May 8, 2025
We are running a 5-node Proxmox cluster with Ceph and are experiencing VM freezes on one of the remaining nodes whenever another node goes down or is rebooted — hoping someone has seen this before or can point me in the right direction.

Environment

  • Proxmox VE 9.1.6 cluster with 5 nodes
  • Ceph 19.2.3 (Squid)
  • 10 OSDs total (2x NVMe, 3.6 TB each, per node)
  • Pool configuration: `size=3`, `min_size=2`
  • CRUSH rule: `chooseleaf_firstn` by `host`
  • Network: 10G; Ceph public and cluster network on the same subnet (`10.0.1.0/24`), Corosync on a separate interface (`10.0.0.0/24`)

Problem

When a node fails or is rebooted, VMs on one of the remaining nodes freeze completely. The affected VMs are stored on the Ceph HA pool and show the following repeated kernel messages:

Code:
INFO: task jbd2/sda1-8:228 blocked for more than 120 seconds
INFO: task systemd-journal:272 blocked for more than 120 seconds

The blocked time increases continuously (120s → 241s → 362s → 483s...). The VMs become entirely unresponsive and even QEMU itself stops responding:

Code:
VM 129 qmp command 'quit' failed - got timeout

Observations

Ceph status during the outage:

  • 2 OSDs down (1 host down)
  • 75 PGs in state `active+undersized+degraded`
  • Approximately 19% of objects degraded

HA manager loses locks:

Code:
lost lock 'ha_agent_pve-5_lock' - cfs lock update failed - Device or resource busy
status change active => lost_agent_lock

This occurs repeatedly, with the agent toggling between `active` and `lost_agent_lock` multiple times.

Primarily affected: VMs running on the node that holds the active Ceph Manager. It appears that the MGR failover combined with the peering storm completely blocks local VM I/O.

Steps taken so far

Recovery throttling was applied but did not resolve the issue:

Code:
ceph config set osd osd_max_backfills 1
ceph config set osd osd_recovery_max_active 3
ceph config set osd osd_recovery_sleep 0.1
ceph config set osd osd_recovery_op_priority 3

This suggests that the problem is not caused by recovery/backfill, but rather by the peering phase itself blocking client I/O.

Configuration

ceph.conf:

Code:
[global]
cluster_network = 10.0.1.1/24
public_network = 10.0.1.1/24
osd_pool_default_min_size = 2
osd_pool_default_size = 3

Public and cluster network point to the same subnet — there is no separate cluster network.

OSD tree:

Code:
ID   CLASS  WEIGHT    TYPE NAME                 STATUS
 -1         36.38687  root default
 -3          7.27737      host pve-1
  0   nvme   3.63869          osd.0                 up
  1   nvme   3.63869          osd.1                 up
 -7          7.27737      host pve-2
  2   nvme   3.63869          osd.2                 up
  3   nvme   3.63869          osd.3                 up
[... 2 OSDs per node across all 5 nodes]

Sample VM configuration (affected):

Code:
scsi0: ceph-ha:vm-203-disk-0,iothread=1,size=32G
scsihw: virtio-scsi-single



Any help is appreciated!
 
Hi @torbho, before recommending solutions, I'd need more information:
  1. Reproducibility: Does this happen every time a node goes down/reboots, or was this a one-time event? If reproducible, this strongly suggests a structural issue (network, config). If one-time, it could be a transient condition or bug.
  2. Duration: How long does the freeze last? Does it eventually resolve on its own, or does it require manual intervention (e.g., restarting OSDs, fencing the down node)?
  3. Timing correlation: How quickly after the node goes down do VMs start freezing? Immediately (seconds), or after a delay (30s+)? This helps distinguish between peering-blocked I/O (immediate) vs network saturation (gradual).
  4. Which VMs freeze: Do ALL VMs on healthy nodes freeze, or only VMs whose RBD images have PGs on the lost OSDs? If only affected PGs' VMs freeze, that's peering-blocked I/O. If all VMs freeze, that's more likely network saturation.
  5. `noout` flag: Was `noout` set before planned reboots? If not, Ceph marks OSDs as `out` after `mon_osd_down_out_interval` (default 600s), triggering full recovery/backfill on top of peering.
  6. Previous state: Was the cluster in `HEALTH_OK` before the node went down? Any pre-existing degraded PGs, slow requests, or warnings?
  7. MGR observation: You mentioned VMs on the active-MGR node are primarily affected. Do VMs on other surviving nodes also freeze, or is it truly only the MGR node? This helps us determine if the MGR is relevant or if it's just correlated because that node also lost 2 OSDs.
  8. Pool details: How many Ceph pools do you have, and what are their PG counts? The 75 degraded PGs — is that across all pools or just `ceph-ha`? What's the total PG count? (`ceph osd pool ls detail`)
  9. Does peering eventually complete? Does the freeze eventually resolve on its own after peering finishes, or do you have to intervene? If peering never completes, the PGs may be stuck (e.g., in `down+peering` because they need an OSD that's on the downed node), which is a different problem from slow peering.
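For question 4, mapping a frozen VM's disk to its PGs can be done by sampling the image's RADOS objects. A sketch (the helper name `map_rbd_pgs` is hypothetical; pool `ceph-ha` and image `vm-203-disk-0` are taken from your post):

```shell
# map_rbd_pgs: sample a few RADOS objects behind an RBD image and show which
# PG/OSDs each maps to, so frozen VMs can be correlated with the lost OSDs.
map_rbd_pgs() {   # usage: map_rbd_pgs <pool> <image> [max_objects]
  local pool="$1" image="$2" n="${3:-8}" prefix obj
  # Every object of an RBD image shares this prefix (e.g. rbd_data.<id>)
  prefix=$(rbd info "$pool/$image" --format json | jq -r .block_name_prefix)
  rados -p "$pool" ls | grep "^${prefix}" | head -n "$n" | while read -r obj; do
    ceph osd map "$pool" "$obj"   # prints pool, PG id, up/acting OSD sets
  done
}
```

Example: `map_rbd_pgs ceph-ha vm-203-disk-0 5` prints the PG and acting set for the first five objects of that disk.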

Also, we need to run commands to collect more info from the cluster:
  1. Catch client ops stuck in "waiting for peered" state
    This is the smoking gun. During the freeze, run on each surviving node:
    Bash:
    # Show all ops currently blocked, filtered for peering-related delays
    ceph daemon osd.<ID> dump_ops_in_flight | \
    jq '[.ops[] | select(.state == "waiting for peered") |
    {description: .description, duration: .duration, state: .state}] |
    sort_by(-.duration)'
    If you see many ops with state "waiting for peered" and durations matching the freeze time (~120s+), this confirms peering is the blocker. The `duration` field tells you exactly how long each op has been waiting.
    Also check for ops stuck earlier in the pipeline:
    Bash:
    # Ops that haven't even reached their PG yet (OSD message queue congestion)
    ceph daemon osd.<ID> dump_ops_in_flight | \
    jq '[.ops[] | select(.state == "queued for pg" or .state == "reached pg") |
    {description: .description, duration: .duration, state: .state}] |
    sort_by(-.duration) | .[0:10]'
    # After the event resolves, check historic slow ops for the full picture
    ceph daemon osd.<ID> dump_historic_slow_ops | \
    jq '[.ops[] | {description: .description, duration: .duration,
    initiated_at: .initiated_at, state: .state}] | sort_by(-.duration) | .[0:20]'
  2. Recovery state latency counters (the key metrics)
    Ceph tracks time spent in each PG recovery state as OSD-level perf counters (collection name `recoverystate_perf`, defined in `src/osd/osd_perf_counters.h:146-178`). These are `time_avg` counters — each reports `{avgcount, sum, avgtime}`.
    Bash:
    # Dump peering substate latencies for a specific OSD
    ceph daemon osd.<ID> perf dump recoverystate_perf | \
      jq '{peering: .recoverystate_perf.peering_latency,
           getinfo: .recoverystate_perf.getinfo_latency,
           getlog: .recoverystate_perf.getlog_latency,
           getmissing: .recoverystate_perf.getmissing_latency,
           waitupthru: .recoverystate_perf.waitupthru_latency,
           waitactingchange: .recoverystate_perf.waitactingchange_latency,
           active: .recoverystate_perf.active_latency}'
    See the table below for the interpretation.
  3. Client op latency and degraded-delay counters
    Bash:
    # Shows client ops that were delayed due to PG state
    ceph daemon osd.<ID> perf dump osd | \
      jq '{op_latency: .osd.op_latency,
           op_w_latency: .osd.op_w_latency,
           op_r_latency: .osd.op_r_latency,
           op_delayed_unreadable: .osd.op_delayed_unreadable,
           op_delayed_degraded: .osd.op_delayed_degraded}'
    `op_delayed_unreadable` and `op_delayed_degraded` count client ops delayed because PGs were in a degraded/unreadable state. High values during the event confirm client I/O is being held up by PG state.
  4. PG log sizes (can be checked now, before the event)
    Bash:
    # Large PG logs slow down the GetLog peering phase
    ceph pg dump --format json-pretty | \
      jq '[.pg_map.pg_stats[] | {pgid: .pgid, log_size: .log_size,
        ondisk_log_size: .ondisk_log_size}] | sort_by(-.log_size) | .[0:10]'
    
    # Also check the configured PG log limits
    ceph config get osd osd_min_pg_log_entries
    ceph config get osd osd_max_pg_log_entries
    ceph config get osd osd_pg_log_dups_tracked

    If PGs have >10k log entries, the GetLog phase transfers more data. The default `osd_max_pg_log_entries` is 10000 (or 100000 for SSDs in some versions). Check whether this has been increased.
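If the logs do turn out to be oversized, one option is to cap them. Illustrative value only, not a recommendation from this thread; defaults vary by release, so check current settings first and test on a non-critical cluster:

```shell
# Illustrative only: lower the PG log cap so the GetLog peering phase ships
# less data. PGs trim their logs gradually as new writes arrive.
ceph config set osd osd_max_pg_log_entries 2000
```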
The key unknown is why peering takes >120 seconds on NVMe with 10G networking. On a healthy cluster, peering 75 PGs should take seconds. The `recoverystate_perf` counters will tell us exactly which peering substate is slow.

Counter                    | What it measures                           | If high, points to
peering_latency            | Total time in peering (the key metric)     | Peering is the bottleneck (expected <100 ms on NVMe)
getinfo_latency            | Collecting PG info from peers              | Slow OSD responses or message queue congestion
getlog_latency             | Exchanging PG logs                         | Large PG logs or slow peer I/O
getmissing_latency         | Determining missing objects                | Many objects to reconcile
waitupthru_latency         | Waiting for the mon to acknowledge up_thru | Slow monitor or mon network
waitactingchange_latency   | Waiting for an acting-set change           | CRUSH instability or slow mon
Each counter reports `{"avgcount": N, "sum": T, "avgtime": A}`:
  • avgcount = number of times PGs entered this state
  • sum = total wall-clock seconds across all PGs
  • avgtime = average duration per entry (`sum / avgcount`)

If `peering_latency.avgtime` >> 100 ms, peering is abnormally slow. The substate breakdown tells you where the time is spent. For example, `getlog_latency.avgtime = 5s` with `getinfo_latency.avgtime = 50ms` means GetLog is the bottleneck → check PG log sizes. Note that these counters are cumulative since OSD start. To get delta values for just the failure event, capture them before and after the event and subtract:
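A small helper for the subtraction (a sketch; `rs_delta` is a hypothetical name, and it assumes `jq` as used elsewhere in this thread):

```shell
# rs_delta: diff two `perf dump recoverystate_perf` snapshots to isolate the
# event. Capture them like this (osd.3 is just an example):
#   ceph daemon osd.3 perf dump recoverystate_perf > before.json
#   ... reboot the node, wait for peering to settle ...
#   ceph daemon osd.3 perf dump recoverystate_perf > after.json
#   rs_delta before.json after.json
rs_delta() {
  jq -n --slurpfile a "$1" --slurpfile b "$2" '
    ($a[0].recoverystate_perf) as $x | ($b[0].recoverystate_perf) as $y |
    ["peering_latency", "getinfo_latency", "getlog_latency",
     "getmissing_latency", "waitupthru_latency", "waitactingchange_latency"]
    | map(select($x[.] and $y[.]))            # keep counters present in both
    | map({state: .,
           entries: ($y[.].avgcount - $x[.].avgcount),   # state entries during event
           seconds: ($y[.].sum - $x[.].sum)})            # wall-clock spent during event
    | sort_by(-.seconds)'
}
```

For each state, `seconds / entries` gives the average time a PG spent there during the event alone, which is the number to compare against the <100 ms expectation above.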
 
It appears that the MGR failover combined with the peering storm completely blocks local VM I/O.

MGR's actual role:
  • MgrClient in the OSD (src/osd/OSD.cc) only does: send_pgstats(), update_daemon_health(), set_perf_metric_query_cb() — all async, none on the I/O path
  • PeeringState.cc has zero references to MGR
  • PrimaryLogPG::do_request() never touches MGR
  • Client I/O flow: client → OSD → PrimaryLogPG → BlueStore — MGR is never consulted
So, your observation "VMs on the active-MGR node are primarily affected" is almost certainly correlation. One minor interaction to check: if the newly-active MGR on a surviving node causes a CPU spike during `load_all_metadata()` and Python module startup, it could marginally slow OSDs on that same node. But this would be a small transient effect, not a 120-second freeze.
 
@tchaikov
Thanks for the incredibly detailed response — this gives us a clear path forward. Let me answer your questions:


Reproducibility: This happens almost every single time a node goes down and comes back up.


Duration: The freeze does not resolve on its own. We have to manually intervene — typically by restarting the affected VMs, which sometimes requires a force-stop since QMP is unresponsive.


Timing: This is an important detail — the VMs don't freeze when the node goes down. The cluster runs fine in degraded state with 75 PGs `active+undersized+degraded`, and VMs continue to operate normally. The freeze occurs when the node comes back up and its 2 OSDs rejoin, triggering all 75 PGs to re-peer simultaneously. So the problem is not the initial peering after the failure, but the re-peering on OSD reentry.


Which VMs freeze: All VMs freeze on one of the healthy nodes. We haven't yet mapped out whether it's strictly correlated to PGs on the returning OSDs, but we'll check during the next test. So far it does appear to be the node holding the active Manager.


noout flag: No, `noout` was not set before planned reboots. We'll start doing this going forward for maintenance windows.


Previous state: Yes, the cluster was in `HEALTH_OK` before each event. No pre-existing degraded PGs or slow requests.


MGR observation: Based on your earlier analysis of the source code, we agree the MGR correlation was likely coincidental. We'll pay closer attention to which exact VMs freeze during the next test and map them to affected PGs.


Pool details: We'll provide the output of `ceph osd pool ls detail` and total PG counts before the next test.


Peering completion: Peering does seem to complete eventually based on the PG states, but the affected VMs remain frozen even after — they never recover without manual intervention.


We're planning a controlled test during a maintenance window. We'll capture `recoverystate_perf` counters before and after, run `dump_ops_in_flight` during the freeze, and check PG log sizes beforehand. Will report back with the full data. Your breakdown of the peering substates and what to look for is exactly what we needed.
 
Your reply confirms this is a two-phase problem:

  1. Phase 1: Why does re-peering take long enough to cause 120s+ blocked I/O?
    On NVMe with 10G networking, peering 75 PGs should complete in seconds, not minutes. Something is making it abnormally slow. The recoverystate_perf counters from my previous post will pinpoint which peering substate is the bottleneck.
  2. Phase 2: Why don't VMs recover after peering completes?
    This is the more puzzling part. Once peering finishes and PGs become `active+clean`, the `waiting_for_peered` queue drains and all blocked ops are reprocessed. VMs should resume. The fact that they don't (and that QMP becomes unresponsive) tells us something else breaks permanently during the prolonged block.
A couple of observations that may help narrow this down:
  1. The jbd2/sda1-8 blocked message: where are you seeing this? If it's on the VM console, then `jbd2/sda1-8` is the guest kernel's ext4 journaling thread for the VM's virtual disk (`/dev/sda1` inside the guest, backed by ceph-ha:vm-203-disk-0). That would be the expected symptom: Ceph I/O is blocked during peering → guest's disk I/O hangs → guest kernel reports blocked tasks. If it's in the host's `dmesg`, that's a different (and more concerning) situation — it would mean the host's own filesystem is stuck too.
  2. The guest kernel's SCSI timeout is likely why VMs don't recover. The default SCSI command timeout for virtio-scsi is 30 seconds. If the peering freeze lasts >30s (and yours lasts >120s), the guest kernel will see SCSI command timeouts. When ext4's journal commit times out, it aborts the journal and remounts the filesystem read-only. At that point, the guest is broken — even after Ceph recovers and RBD I/O resumes, the guest's filesystem won't come back without manual intervention (which looks like "the VM is permanently frozen").
    After force-restarting a frozen VM, check the guest dmesg for `EXT4-fs error`, `Remounting filesystem read-only`, or SCSI timeout messages. If you see those, this is the cause.
  3. The escalating block times (120 → 241 → 362 → 483) are the kernel's `hung_task_timeout_secs` (default 120s) re-reporting the same stuck task. This is one continuous block, not separate events.
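If the read-only remount theory in point 2 holds, raising the guest's SCSI command timeout gives short stalls room to resolve before the guest aborts its journal. A sketch for a Linux guest; the 180-second value and the rule filename are illustrative, not from this thread:

```
# /etc/udev/rules.d/99-disk-timeout.rules  (inside the guest)
# Raise the SCSI command timeout from the 30 s default so a transient Ceph
# stall does not time out guest I/O. Check the current value first with:
#   cat /sys/block/sda/device/timeout
ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="sd*[!0-9]", ATTR{device/timeout}="180"
```

This only buys headroom; it does not fix the slow re-peering itself, and a long enough stall will still hit the higher timeout.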
A note on noout: Setting noout before planned reboots is good practice, but it won't prevent the re-peering storm. noout prevents OSDs from being marked out (which avoids triggering recovery/backfill), but when the returning OSDs come back up, CRUSH still recalculates the up set, the interval changes, and all affected PGs still re-peer. What noout helps with is avoiding the *additional* data movement (recovery/backfill) that happens if the OSD was out long enough to be fully removed from the acting set.
 
maintenance windows
FWIW, based on info I had found here, we set:
Code:
nodeep-scrub
noout
norebalance
norecover
noscrub

After rebooting a node, we uncheck the `norecover` flag so the cluster can recover, which takes about 5 seconds, then set `norecover` again and repeat with the next node.

We also use `ha-manager crm-command node-maintenance enable nodename` and `... disable nodename` to start/end maintenance mode and automatically migrate all VMs.
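The workflow above could be scripted roughly like this. A hedged sketch: `maintenance_begin`/`maintenance_end` are hypothetical helper names, and the `CEPH`/`HAM` indirection exists so you can dry-run with `CEPH=echo HAM=echo` before touching a real cluster:

```shell
# Per-node maintenance wrapper around the flags described above.
# usage: maintenance_begin <node>; reboot the node; maintenance_end <node>
maintenance_begin() {
  local ceph="${CEPH:-ceph}" ham="${HAM:-ha-manager}" flag
  for flag in noout norebalance norecover noscrub nodeep-scrub; do
    "$ceph" osd set "$flag"
  done
  # Evacuate VMs from the node before rebooting it
  "$ham" crm-command node-maintenance enable "$1"
}

maintenance_end() {
  local ceph="${CEPH:-ceph}" ham="${HAM:-ha-manager}"
  "$ceph" osd unset norecover   # allow the brief (~5 s) recovery burst
  sleep 10                      # margin over the observed recovery time
  "$ceph" osd set norecover
  "$ham" crm-command node-maintenance disable "$1"
}
```

Remember to unset all five flags once the last node is done; `maintenance_end` deliberately leaves them in place so the loop can continue with the next node.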