We are running a 5-node Proxmox cluster with Ceph and experiencing complete VM freezes on one of the surviving nodes whenever another node goes down or is rebooted — hoping someone has seen this before or can point me in the right direction.
Environment
- Proxmox VE 9.1.6 cluster with 5 nodes
- Ceph 19.2.3 (Squid)
- 10 OSDs total (2x NVMe, 3.6 TB each, per node)
- Pool configuration: size=3, min_size=2
- CRUSH rule: chooseleaf_firstn by host
- Network: 10G, Ceph public and cluster network on the same subnet (10.0.1.0/24); Corosync on a separate interface (10.0.0.0/24)
Problem
When a node fails or is rebooted, VMs on one of the remaining nodes freeze completely. The affected VMs are stored on the Ceph HA pool and show the following repeated kernel messages:
Code:
INFO: task jbd2/sda1-8:228 blocked for more than 120 seconds
INFO: task systemd-journal:272 blocked for more than 120 seconds
The blocked time increases continuously (120s → 241s → 362s → 483s...). The VMs become entirely unresponsive and even QEMU itself stops responding:
Code:
VM 129 qmp command 'quit' failed - got timeout
Observations
Ceph status during the outage:
- 2 OSDs down (1 host down)
- 75 PGs in state active+undersized+degraded
- Approximately 19% of objects degraded
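Next time I can reproduce this, I'll try to capture more detail on where client I/O is actually stuck. These are the standard Ceph diagnostics I know of (osd.2 is just an example OSD; run the daemon command on the node hosting it):

```shell
# Overall health plus any slow/blocked ops the cluster reports
ceph health detail

# PGs stuck inactive (e.g. peering) - these are the ones that block client I/O
ceph pg dump_stuck inactive

# Per-OSD view of in-flight ops on a suspect OSD, via the admin socket
ceph daemon osd.2 dump_ops_in_flight
```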
HA manager loses locks:
Code:
lost lock 'ha_agent_pve-5_lock' - cfs lock update failed - Device or resource busy
status change active => lost_agent_lock
This occurs repeatedly, with the agent toggling between active and lost_agent_lock multiple times.
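To correlate the lock loss with quorum/pmxcfs state, this is what I've been collecting during the failover window (standard Proxmox/systemd commands):

```shell
# Cluster membership and quorum state during the outage
pvecm status

# HA manager logs around the lock loss
journalctl -u pve-ha-lrm -u pve-ha-crm --since "15 minutes ago"

# pmxcfs and corosync service health
systemctl status pve-cluster corosync
```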
Primarily affected: VMs running on the node that holds the active Ceph Manager. It appears that the MGR failover combined with the peering storm completely blocks local VM I/O.
Steps taken so far
Recovery throttling was applied but did not resolve the issue:
Code:
ceph config set osd osd_max_backfills 1
ceph config set osd osd_recovery_max_active 3
ceph config set osd osd_recovery_sleep 0.1
ceph config set osd osd_recovery_op_priority 3
This suggests that the problem is not caused by recovery/backfill, but rather by the peering phase itself blocking client I/O.
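One thing I'm unsure about: Ceph Squid uses the mClock op scheduler by default, and as far as I understand it, mClock ignores osd_max_backfills / osd_recovery_max_active unless an override flag is set — so the throttles above may never have taken effect. If that's right, these would be the relevant knobs (treat this as my reading of the docs, not verified on this cluster):

```shell
# Check which scheduler is active (expect mclock_scheduler on Squid)
ceph config get osd osd_op_queue

# Prefer client I/O over recovery while testing
ceph config set osd osd_mclock_profile high_client_ops

# Only with this set do the classic recovery limits apply under mClock
ceph config set osd osd_mclock_override_recovery_settings true
```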
Configuration
ceph.conf:
Code:
[global]
cluster_network = 10.0.1.1/24
public_network = 10.0.1.1/24
osd_pool_default_min_size = 2
osd_pool_default_size = 3
Public and cluster network point to the same subnet — there is no separate cluster network.
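If moving replication traffic off the public network is worth trying, I assume the change would look roughly like this — 10.0.2.0/24 is a hypothetical second subnet that does not exist in my setup yet, and OSDs would need a restart to pick it up:

```ini
[global]
public_network  = 10.0.1.0/24
# hypothetical dedicated replication subnet (not present in my setup)
cluster_network = 10.0.2.0/24
```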
OSD tree:
Code:
ID CLASS WEIGHT TYPE NAME STATUS
-1 36.38687 root default
-3 7.27737 host pve-1
0 nvme 3.63869 osd.0 up
1 nvme 3.63869 osd.1 up
-7 7.27737 host pve-2
2 nvme 3.63869 osd.2 up
3 nvme 3.63869 osd.3 up
[... 2 OSDs per node across all 5 nodes]
Sample VM configuration (affected):
Code:
scsi0: ceph-ha:vm-203-disk-0,iothread=1,size=32G
scsihw: virtio-scsi-single
Any help is appreciated!