Ceph 3-node cluster, VM I/O freeze after node reboot/update

degan · Jun 9, 2026

Hi everyone,

I am investigating an issue with a 3-node Proxmox/Ceph cluster and would like to ask if anyone has seen a similar failure mode before.

Environment

3 Proxmox nodes
PVE: 9.2.2
Ceph version: 19.2.3
Ceph network: 10.0.50.0/24
public_network and cluster_network are currently on the same network
Ceph interface: bond1
bond1 is 2 × 10G LACP / 802.3ad
MTU 9000 on the Proxmox side
NICs: Broadcom BCM57412 NetXtreme-E 10GbE, driver: bnxt_en
Switch: Huawei S6730-H48X6C stack
Pools are size=3, min_size=2

Current Linux bonding status on all three nodes looks clean from a LACP point of view:

Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer2 (0)
MII Status: up
LACP active: on
LACP rate: slow
Number of ports: 2
Both slaves: 10000 Mbps/full
Aggregator ID: same on both slaves
Actor/Partner Churn State: none
Link Failure Count: 0

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class nvme
device 1 osd.1 class nvme
device 2 osd.2 class nvme
device 3 osd.3 class nvme
device 4 osd.4 class nvme
device 5 osd.5 class nvme
device 6 osd.6 class nvme
device 7 osd.7 class nvme
device 8 osd.8 class nvme
device 9 osd.9 class nvme
device 10 osd.10 class nvme
device 11 osd.11 class nvme

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 zone
type 10 region
type 11 root

# buckets
host server1 {
    id -3 # do not change unnecessarily
    id -4 class nvme # do not change unnecessarily
    # weight 27.94519
    alg straw2
    hash 0 # rjenkins1
    item osd.0 weight 6.98630
    item osd.1 weight 6.98630
    item osd.2 weight 6.98630
    item osd.3 weight 6.98630
}
host server2 {
    id -5 # do not change unnecessarily
    id -6 class nvme # do not change unnecessarily
    # weight 27.94519
    alg straw2
    hash 0 # rjenkins1
    item osd.4 weight 6.98630
    item osd.5 weight 6.98630
    item osd.6 weight 6.98630
    item osd.7 weight 6.98630
}
host server3 {
    id -7 # do not change unnecessarily
    id -8 class nvme # do not change unnecessarily
    # weight 27.94519
    alg straw2
    hash 0 # rjenkins1
    item osd.8 weight 6.98630
    item osd.9 weight 6.98630
    item osd.10 weight 6.98630
    item osd.11 weight 6.98630
}
root default {
    id -1 # do not change unnecessarily
    id -2 class nvme # do not change unnecessarily
    # weight 83.83557
    alg straw2
    hash 0 # rjenkins1
    item server1 weight 27.94519
    item server2 weight 27.94519
    item server3 weight 27.94519
}

# rules
rule replicated_rule {
    id 0
    type replicated
    step take default
    step chooseleaf firstn 0 type host
    step emit
}

# end crush map

What happened

During maintenance we upgraded/rebooted the nodes one after another.

The order was roughly:

server1 was upgraded/rebooted first.
server2 was upgraded/rebooted next.
The outage happened shortly after server2 came back.
server3 had not yet been upgraded/rebooted at the time the outage started.

After server2 rejoined, Ceph/RBD I/O became stuck. VMs froze. Logs showed Ceph OSD heartbeat problems between server1 and server2.

Examples from the syslog:

heartbeat_check: no reply from 10.0.50.11
heartbeat_check: no reply from 10.0.50.10
slow ops

Eventually all hosts were rebooted and the cluster recovered.

Interesting observations

1. One LACP member per server has carried almost no traffic for at least a year

We checked interface graphs for the physical Ceph NICs.

For each node, one of the two LACP member interfaces has carried essentially no useful traffic for at least a year. The graph shows an average of only a few hundred bit/s with occasional small kbit/s spikes, which looks like LACP/LLDP/control traffic only.

So although the Ceph network is physically 2 × 10G per node, it appears to have been effectively using only one 10G member per node.

The Linux bonding policy is currently: Transmit Hash Policy: layer2 (0)
This may explain the poor distribution, because with only three Ceph nodes there are very few MAC pairs. However, I am surprised that the pattern is so consistent over such a long time.
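To illustrate why so few MAC pairs can pin traffic to one member, here is a minimal Python sketch of the simplified layer2 formula given in the Linux kernel bonding documentation (source MAC XOR destination MAC XOR packet type, modulo slave count). The MAC addresses are made up for illustration, and current kernels may hash more of the header than this simplified form, so the member indices are illustrative only.

```python
# Simplified layer2 transmit hash, following the formula in the Linux
# bonding documentation. All MAC addresses below are hypothetical.

def layer2_slave(src_mac: str, dst_mac: str, ethertype: int = 0x0800,
                 slave_count: int = 2) -> int:
    """Return the bond member index chosen for a (src, dst) MAC pair."""
    src_last = int(src_mac.split(":")[-1], 16)
    dst_last = int(dst_mac.split(":")[-1], 16)
    return (src_last ^ dst_last ^ ethertype) % slave_count


# Hypothetical bond MAC addresses for the three nodes.
NODES = {
    "server1": "aa:bb:cc:00:00:10",
    "server2": "aa:bb:cc:00:00:12",
    "server3": "aa:bb:cc:00:00:14",
}


def members_used_by(node: str) -> set:
    """Set of bond members a node would transmit on towards its peers."""
    return {
        layer2_slave(NODES[node], mac)
        for peer, mac in NODES.items()
        if peer != node
    }


if __name__ == "__main__":
    for name in NODES:
        print(name, "transmits on member(s)", sorted(members_used_by(name)))
```

Because the inputs are fixed MAC addresses, the result never changes from one packet to the next: with these example addresses every node always transmits on the same member. This only models the transmit side of the host; the receive side is decided by the switch's own hash.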

2. The active Ceph ports seem to be on the same stack member

LLDP shows that the Ceph NICs are connected like this:

server1:
enp194s0f0np0 -> XGigabitEthernet1/0/48 -> PortAggregID 21
enp194s0f1np1 -> XGigabitEthernet0/0/1 -> PortAggregID 21

server2:
enp194s0f0np0 -> XGigabitEthernet1/0/46 -> PortAggregID 22
enp194s0f1np1 -> XGigabitEthernet0/0/3 -> PortAggregID 22

server3:
enp194s0f0np0 -> XGigabitEthernet1/0/1 -> PortAggregID 23
enp194s0f1np1 -> XGigabitEthernet0/0/48 -> PortAggregID 23

So each server has one Ceph link on stack member 1 and one Ceph link on stack member 0.
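The mapping above can be checked mechanically. This is a small Python sketch that assumes, as the summary above does, that the first number in the Huawei port name is the stack member ID; the table is copied from the LLDP output in this post.

```python
import re

# LLDP neighbours as listed in the post: server -> NIC -> switch port.
LLDP = {
    "server1": {"enp194s0f0np0": "XGigabitEthernet1/0/48",
                "enp194s0f1np1": "XGigabitEthernet0/0/1"},
    "server2": {"enp194s0f0np0": "XGigabitEthernet1/0/46",
                "enp194s0f1np1": "XGigabitEthernet0/0/3"},
    "server3": {"enp194s0f0np0": "XGigabitEthernet1/0/1",
                "enp194s0f1np1": "XGigabitEthernet0/0/48"},
}


def stack_member(port: str) -> int:
    """Extract the leading number of the member/subcard/port triple.

    Assumption: that leading number is the stack member ID.
    """
    match = re.search(r"(\d+)/(\d+)/(\d+)$", port)
    if match is None:
        raise ValueError(f"unexpected port name: {port}")
    return int(match.group(1))


def members_per_nic(table: dict) -> dict:
    """Map every server NIC to the stack member it is cabled to."""
    return {
        server: {nic: stack_member(port) for nic, port in nics.items()}
        for server, nics in table.items()
    }


if __name__ == "__main__":
    for server, nics in members_per_nic(LLDP).items():
        print(server, nics)
```

With this table, enp194s0f0np0 lands on stack member 1 and enp194s0f1np1 on stack member 0 for all three servers, so whichever NIC port carries the traffic, all nodes meet on the same stack member.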

Based on our monitoring, it looks like the useful Ceph traffic has historically been on the same side/member, while the other physical link is mostly idle.

3. Switch output drops were reported, but the traffic graph does not show a clear spike at the outage time

The data center reported Huawei switch logs with output queue drops / congestion messages on one of the Ceph member ports.

However, when looking at traffic graphs around the outage time, we do not see a clear bandwidth spike. In fact, the interface traffic seems to drop to zero shortly after the problem starts.

This makes me unsure whether simple port congestion is really the root cause. It feels more like a temporary forwarding/blackhole/LACP/stack/NIC issue than just “the 10G link was overloaded”.

4. MTU

On the Proxmox side, bond1 is configured with MTU 9000. LLDP from the Huawei switch shows: MFS: 9216 on the relevant Ceph ports.
Jumbo ping tests with DF have worked without any problems.
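For anyone repeating the jumbo test, a short Python sketch of the payload arithmetic, assuming IPv4 without options (20-byte IP header, 8-byte ICMP header) and the Linux iputils `ping` with `-M do` to set DF. The target address is just an example from the Ceph subnet.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo header


def max_icmp_payload(mtu: int) -> int:
    """Largest ICMP echo payload that fits into one unfragmented packet."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


def ping_command(target: str, mtu: int = 9000) -> str:
    """Build a DF-bit ping command line for the given MTU."""
    return f"ping -M do -c 3 -s {max_icmp_payload(mtu)} {target}"


if __name__ == "__main__":
    print(ping_command("10.0.50.11"))
```

A payload one byte larger than this value should fail with DF set; if it still succeeds, the test is probably not exercising the path it is meant to.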

5. Broadcom bnxt_en messages

On boot, the Broadcom NICs show messages like:

hwrm_tunnel_dst_port_alloc failed. rc:-95
UDP tunnel port sync failed port 4789 type vxlan: -95

These appear on the Broadcom interfaces. Ceph itself is not using VXLAN, so I am not sure whether this is relevant or just an unrelated offload/firmware warning.

Flow control is disabled.

Questions

Has anyone seen a similar issue?

Any suggestions for specific counters or tests would be appreciated.

j.theisen · Jun 12, 2026

Hi @degan

thanks for posting in the forum!

This does seem odd.
Could you please provide the complete journal during that timeframe, so we can assess if there were additional events that might have caused the outage.
journalctl --since "2026-06-09 08:00" --until "2026-06-10 12:00"
Please adjust the timestamps accordingly.

degan said:
The Linux bonding policy is currently: Transmit Hash Policy: layer2 (0)
This may explain the poor distribution, because with only three Ceph nodes there are very few MAC pairs. However, I am surprised that the pattern is so consistent over such a long time.

If I interpret the manual of your switch correctly, the default configuration for eth-trunks is src-dst-ip, so the switch already uses layer 3 for distribution. Since the hash policy does not have to be the same on both ends, I don't see a reason not to use layer2+3 or, even better, layer3+4.
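As a rough illustration of what layer3+4 changes, here is a Python sketch that loosely follows the description in the Linux bonding documentation (ports combined with the IP addresses, folded, then reduced modulo the slave count). It is not bit-exact with any kernel version, and the source port range is an arbitrary example.

```python
import ipaddress


def layer34_slave(src_ip: str, dst_ip: str, src_port: int, dst_port: int,
                  slave_count: int = 2) -> int:
    """Simplified layer3+4 style hash; illustrative, not the kernel code."""
    h = (src_port << 16) | dst_port
    h ^= int(ipaddress.ip_address(src_ip)) ^ int(ipaddress.ip_address(dst_ip))
    h ^= h >> 16
    h ^= h >> 8
    return h % slave_count


def members_used(src_ip: str, dst_ip: str, dst_port: int, src_ports) -> set:
    """Bond members hit by a set of flows that differ only in source port."""
    return {layer34_slave(src_ip, dst_ip, p, dst_port) for p in src_ports}


if __name__ == "__main__":
    used = members_used("10.0.50.10", "10.0.50.11", 6822, range(40000, 40016))
    print("members used between one host pair:", sorted(used))
```

Since Ceph opens many TCP connections between the same two hosts, including the ports in the hash lets a single host pair use both members, which a MAC-only hash cannot do.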

bond1 is 2 × 10G LACP / 802.3ad

There were issues with clusters using LACP bonds as their corosync interfaces due to the way LACP handles NIC failures by default. This can be fixed by setting the bond-lacp-rate parameter, see [1] for details.
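To put numbers on the LACP rate, a small Python sketch using the standard 802.3ad timers (slow: LACPDU every 30 s, partner timeout 90 s; fast: every 1 s, timeout 3 s) compared with Ceph's default `osd_heartbeat_grace` of 20 s. It only matters if a member stops forwarding while its link stays up; whether that happened in this outage is not established.

```python
# rate -> (LACPDU interval in seconds, partner timeout in seconds)
LACP_TIMERS = {
    "slow": (30, 90),
    "fast": (1, 3),
}

OSD_HEARTBEAT_GRACE = 20  # seconds, Ceph default


def lacp_timeout(rate: str) -> int:
    """Seconds until a silent member is removed from the aggregator."""
    return LACP_TIMERS[rate][1]


def outlasts_ceph_grace(rate: str, grace: int = OSD_HEARTBEAT_GRACE) -> bool:
    """True if a silent member could blackhole traffic longer than the grace."""
    return lacp_timeout(rate) > grace


if __name__ == "__main__":
    for rate in LACP_TIMERS:
        print(rate, lacp_timeout(rate), outlasts_ceph_grace(rate))
```

With the slow rate shown in the bonding status, a member that silently stops forwarding could stay in the bundle for up to 90 s, well beyond the 20 s after which OSDs start reporting their peers as unreachable.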

You didn't mention your corosync configuration. Does it use the same network as Ceph or a separate one?
If you can, please provide the corosync config for a better overview.
cat /etc/corosync/corosync.conf

Yours sincerely
Jonas

[1] https://pve.proxmox.com/pve-docs/chapter-pvecm.html#:~:text=IEEE,details.

SteveITS · Jun 12, 2026

degan said:
Ceph/RBD I/O became stuck. VMs froze

Are you setting Ceph flags before reboot? And/or waiting for full recovery before updating/rebooting the second node?

Our notes for updating PVE include:

in Ceph set checkboxes for:
(any node > Ceph > OSD > Manage Global Flags)
nodeep-scrub
noout
norebalance
norecover
noscrub

Do NOT pick nodown, or I/O will pause for all VMs.

enable maintenance mode for the PVE node: https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_maintenance_mode (this migrates running VMs)

install updates via web GUI
reboot if needed via web GUI

uncheck "norecover" flag so Ceph can recover (takes several seconds)

repeat for all nodes (including check norecover/uncheck norecover)
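For those who prefer the CLI over the GUI checkboxes, the same sequence can be written as `ceph osd set` / `ceph osd unset` commands. The Python sketch below only builds the command list and does not execute anything; the function names are made up for illustration, and the node names are the ones used in this thread.

```python
MAINTENANCE_FLAGS = ["nodeep-scrub", "noout", "norebalance", "norecover", "noscrub"]


def set_commands(flags=MAINTENANCE_FLAGS) -> list:
    """Commands that set the given global OSD flags."""
    if "nodown" in flags:
        raise ValueError("nodown pauses I/O for all VMs; do not set it")
    return [f"ceph osd set {flag}" for flag in flags]


def unset_commands(flags=MAINTENANCE_FLAGS) -> list:
    """Commands that clear the given global OSD flags."""
    return [f"ceph osd unset {flag}" for flag in flags]


def rolling_update_plan(nodes) -> list:
    """Ordered steps for updating the nodes one after another."""
    plan = set_commands()
    for index, node in enumerate(nodes):
        if index > 0:
            plan.append("ceph osd set norecover")
        plan.append(f"# {node}: enable maintenance mode, install updates, reboot if needed")
        plan.append("ceph osd unset norecover")
        plan.append("# wait until 'ceph -s' shows recovery has finished")
    # norecover is already cleared after the last node.
    plan.extend(unset_commands([f for f in MAINTENANCE_FLAGS if f != "norecover"]))
    return plan


if __name__ == "__main__":
    for step in rolling_update_plan(["server1", "server2", "server3"]):
        print(step)
```

The lines starting with `#` are manual steps, not commands. The important part is the wait after each node: the next node should only be touched once recovery has finished.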

degan · Jun 13, 2026

Thank you for your reply.

j.theisen said:
Could you please provide the complete journal during that timeframe, so we can assess if there were additional events that might have caused the outage.
journalctl --since "2026-06-09 08:00" --until "2026-06-10 12:00"
Please adjust the timestamps accordingly.

You can find the logs in the attachment. The outage occurred around 7:57 p.m.

j.theisen said:
If I interpret the manual of your switch correctly, the default configuration for eth-trunks is src-dst-ip, so the switch already uses layer 3 for distribution. Since the hash policy does not have to be the same on both ends, I don't see a reason not to use layer2+3 or, even better, layer3+4.

Yes, you're right. We're planning to switch to layer3+4 soon. However, I don't think that has anything to do with our problem.

j.theisen said:
There were issues with clusters using LACP bonds as their corosync interfaces due to the way LACP handles NIC failures by default. This can be fixed by setting the bond-lacp-rate parameter, see [1] for details.

We do not use LACP for Corosync; instead, we use two separate interfaces.

j.theisen said:
You didn't mention your corosync configuration. Does it use the same network as Ceph or a separate one?
If you can, please provide the corosync config for a better overview.
cat /etc/corosync/corosync.conf

We are using a separate Corosync network. Here is the configuration:

logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: srv-ntt10
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.0.60.10
    ring1_addr: 10.0.70.10
  }
  node {
    name: srv-ntt11
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.0.60.11
    ring1_addr: 10.0.70.11
  }
  node {
    name: srv-ntt12
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.0.60.12
    ring1_addr: 10.0.70.12
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: PVE-NTT
  config_version: 3
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}

SteveITS said:
Are you setting Ceph flags before reboot? And/or waiting for full recovery before updating/rebooting the second node?

Yes, we only set the `noout` flag and put the host into maintenance mode.
Ceph was healthy at the time of the restart.

jdancer · Jun 14, 2026

If you know you will never expand this 3-node cluster, I suggest using a full-mesh broadcast network. This eliminates the switch, and every node will drop packets not addressed to it. Proxmox has a wiki article on it at pve.proxmox.com/wiki/Full_Mesh_Network_for_Ceph_Server#Broadcast_Setup

While not considered best practice, I do put Ceph private, public and Corosync traffic on this full-mesh broadcast network. To make sure the network traffic never gets routed, I use an IPv4 link-local address range, 169.254.0.0/16 subnetted to 169.254.1.0/24. Make sure to switch the migration network to this network in Datacenter options. I would set it to insecure for faster migration, since it's an isolated network.

Never had any issues. Again, not considered best networking practice but works for me.

SteveITS · Jun 14, 2026

One can set norebalance, norecover as above to avoid that during a reboot. With the default 3/2 replicas it will work fine with 2 nodes.

The implication here is that some PGs did not have 2 replicas available, hence the I/O pause.
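A minimal Python sketch of that reasoning, assuming the posted CRUSH rule (one replica per host, size=3, min_size=2). It ignores peering and everything else Ceph does internally and only counts reachable replicas.

```python
HOSTS = ("server1", "server2", "server3")


def pg_accepts_io(replica_hosts, reachable_hosts, min_size: int = 2) -> bool:
    """A PG serves I/O only while at least min_size replicas are available."""
    available = set(replica_hosts) & set(reachable_hosts)
    return len(available) >= min_size


if __name__ == "__main__":
    scenarios = [
        ("all nodes up", HOSTS),
        ("one node rebooting", ("server2", "server3")),
        ("two nodes unavailable", ("server3",)),
    ]
    for label, reachable in scenarios:
        print(label, "->", pg_accepts_io(HOSTS, reachable))
```

With every PG placed on all three hosts, losing one host is fine, but as soon as a second host's OSDs are marked down or unreachable, every PG drops below min_size and client I/O blocks.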

j.theisen · Jun 15, 2026

Thanks for the logs.
So to me it seems like there was a network communication issue where the nodes 1 and 2 couldn't communicate with each other over the Ceph interfaces. Corosync seems fine according to the log.
My reason to believe this: Log of node 1:

Code:

Jun 06 19:56:59 srv-ntt10 ceph-osd[3133]: 2026-06-06T19:56:59.395+0200 72c6866e26c0 -1 osd.0 16498 heartbeat_check: no reply from 10.0.50.11:6822 osd.4 ever on either front or back, first ping sent 2026-06-06T19:56:38.873181+0200 (oldest deadline 2026-06-06T19:56:58.873181+0200)
Jun 06 19:56:59 srv-ntt10 ceph-osd[3133]: 2026-06-06T19:56:59.395+0200 72c6866e26c0 -1 osd.0 16498 heartbeat_check: no reply from 10.0.50.11:6830 osd.5 ever on either front or back, first ping sent 2026-06-06T19:56:38.873181+0200 (oldest deadline 2026-06-06T19:56:58.873181+0200)
Jun 06 19:56:59 srv-ntt10 ceph-osd[3133]: 2026-06-06T19:56:59.395+0200 72c6866e26c0 -1 osd.0 16498 heartbeat_check: no reply from 10.0.50.11:6806 osd.6 ever on either front or back, first ping sent 2026-06-06T19:56:38.873181+0200 (oldest deadline 2026-06-06T19:56:58.873181+0200)
Jun 06 19:56:59 srv-ntt10 ceph-osd[3133]: 2026-06-06T19:56:59.395+0200 72c6866e26c0 -1 osd.0 16498 heartbeat_check: no reply from 10.0.50.11:6814 osd.7 ever on either front or back, first ping sent 2026-06-06T19:56:38.873181+0200 (oldest deadline 2026-06-06T19:56:58.873181+0200)

Log of node 2:

Code:

Jun 06 19:56:58 srv-ntt11 ceph-osd[3145]: 2026-06-06T19:56:58.197+0200 720f9d1856c0 -1 osd.6 16498 heartbeat_check: no reply from 10.0.50.10:6829 osd.0 ever on either front or back, first ping sent 2026-06-06T19:56:37.939585+0200 (oldest deadline 2026-06-06T19:56:57.939585+0200)
Jun 06 19:56:58 srv-ntt11 ceph-osd[3145]: 2026-06-06T19:56:58.197+0200 720f9d1856c0 -1 osd.6 16498 heartbeat_check: no reply from 10.0.50.10:6824 osd.1 ever on either front or back, first ping sent 2026-06-06T19:56:37.939585+0200 (oldest deadline 2026-06-06T19:56:57.939585+0200)
Jun 06 19:56:58 srv-ntt11 ceph-osd[3145]: 2026-06-06T19:56:58.197+0200 720f9d1856c0 -1 osd.6 16498 heartbeat_check: no reply from 10.0.50.10:6814 osd.2 ever on either front or back, first ping sent 2026-06-06T19:56:37.939585+0200 (oldest deadline 2026-06-06T19:56:57.939585+0200)
Jun 06 19:56:58 srv-ntt11 ceph-osd[3145]: 2026-06-06T19:56:58.197+0200 720f9d1856c0 -1 osd.6 16498 heartbeat_check: no reply from 10.0.50.10:6806 osd.3 ever on either front or back, first ping sent 2026-06-06T19:56:37.939585+0200 (oldest deadline 2026-06-06T19:56:57.939585+0200)

Notice the nearly identical timestamps and inverted OSD numbers. So node 1's OSD 0 couldn't communicate with all of node 2's OSDs and node 2's OSD 6 couldn't communicate with all of node 1's OSDs.
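To pull this pattern out of a longer journal, a small Python sketch that parses `heartbeat_check` lines of the format shown above. The sample line is copied from node 1's log; the field names in the returned dict are made up for illustration.

```python
import re
from datetime import datetime

SAMPLE = (
    "Jun 06 19:56:59 srv-ntt10 ceph-osd[3133]: 2026-06-06T19:56:59.395+0200 "
    "72c6866e26c0 -1 osd.0 16498 heartbeat_check: no reply from "
    "10.0.50.11:6822 osd.4 ever on either front or back, first ping sent "
    "2026-06-06T19:56:38.873181+0200 "
    "(oldest deadline 2026-06-06T19:56:58.873181+0200)"
)

PATTERN = re.compile(
    r"osd\.(?P<reporter>\d+) \d+ heartbeat_check: no reply from "
    r"(?P<addr>[\d.]+):(?P<port>\d+) osd\.(?P<target>\d+) .*?"
    r"first ping sent (?P<first>\S+) \(oldest deadline (?P<deadline>[^)\s]+)\)"
)
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S.%f%z"


def parse_heartbeat(line: str):
    """Return reporter/target details for one heartbeat_check line, or None."""
    match = PATTERN.search(line)
    if match is None:
        return None
    first = datetime.strptime(match["first"], TIMESTAMP_FORMAT)
    deadline = datetime.strptime(match["deadline"], TIMESTAMP_FORMAT)
    return {
        "reporter_osd": int(match["reporter"]),
        "target_osd": int(match["target"]),
        "target_addr": match["addr"],
        "target_port": int(match["port"]),
        "grace_seconds": (deadline - first).total_seconds(),
    }


if __name__ == "__main__":
    print(parse_heartbeat(SAMPLE))
```

For the sample line, the gap between the first ping and the deadline is 20 s, which matches Ceph's default heartbeat grace. Grouping the parsed lines by reporter and target host makes the one-sided pattern (node 1 and node 2 only complaining about each other) easy to see.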

Since node 3 doesn't have any logs of this kind during that timeframe, I assume the services on both node 1 and node 2 were actually fine. The logs also don't indicate any other problems with the OSD services.

Sadly, I currently don't have any idea why this happened.

Yours sincerely
Jonas

degan · Jun 15, 2026

j.theisen said:
Thanks for the logs.
So to me it seems like there was a network communication issue where the nodes 1 and 2 couldn't communicate with each other over the Ceph interfaces. Corosync seems fine according to the log.

Thanks, that was my conclusion as well. As a next step, I will examine the network more closely.

jdancer said:
If you know you will never expand this 3-node cluster, I suggest using a full-mesh broadcast network. This eliminates the switch, and every node will drop packets not addressed to it. Proxmox has a wiki article on it at pve.proxmox.com/wiki/Full_Mesh_Network_for_Ceph_Server#Broadcast_Setup

There are plans to expand the cluster by one or two hosts in the near future. Consequently, this is unfortunately not an option.
