# Ceph Debugging Documentation — ga-node-08c / ga-node-01c
**Date:** 2026-02-25
**Cluster:** Proxmox VE 8 (Debian Bookworm)
**Issue:** OSDs unable to rejoin cluster after node reboot
---
## 1. Environment
### Cluster Topology
| Node | Public IP (vmbr0) | Cluster IP (vmbr1078) | Role |
|---|---|---|---|
| ga-node-08c | 10.77.204.117 | 10.78.204.117 | OSD (problematic) |
| ga-node-10c | 10.77.204.122 | 10.78.204.122 | OSD + MGR |
| ga-node-11c | 10.77.204.123 | 10.78.204.123 | OSD + MON |
| ga-node-13cr | 10.77.204.150 | 10.78.204.150 | OSD + MON |
| ga-node-14c | 10.77.204.116 | 10.78.204.116 | OSD + MON (leader) |
| ga-node-01c | — | — | OSD (problematic, OSDs deleted) |
### Initial OSDs on ga-node-08c
- osd.11, osd.13, osd.16 (all failing after reboot)
- osd.12 → confirmed on ga-node-10c (not on 08c)
### Network Configuration
- `vmbr0`: public network, standard MTU
- `vmbr1078`: cluster network, **MTU 9000** (jumbo frames)
- NICs: Mellanox ConnectX
### `/etc/ceph/ceph.conf` (identical on all nodes)
```ini
[global]
cluster_network = 10.78.204.122/24
public_network = 10.77.204.122/24
ms_bind_ipv4 = true
ms_bind_ipv6 = false
osd_pool_default_min_size = 2
osd_pool_default_size = 3
```
> **Note:** `public_network` and `cluster_network` use a host IP with /24 — Ceph correctly interprets the subnet.
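One way to confirm how those directives resolved on a running node is to check which addresses the OSD daemons actually bound to (a sketch; the `awk` field positions assume standard iproute2 `ss -tlnp` output):

```shell
# List listening sockets of ceph-osd processes and keep only addresses on
# the public (10.77.204.0/24) or cluster (10.78.204.0/24) subnet
ss -tlnp | awk '/ceph-osd/ && ($4 ~ /^10\.7[78]\./) {print $4}' | sort -u
```

Each OSD should show sockets on both subnets; an OSD bound only to 10.77.x would explain heartbeat failures on the cluster network.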
---
## 2. Initial Symptoms
After rebooting ga-node-08c (OSDs were pre-marked `out`):
- osd.11 starts, briefly registers with the monitor, then immediately goes down
- OSD logs show repeated entries of the form: `heartbeat_check: no reply from 10.77.204.122:6816 osd.12 ever on either front or back, first ping sent 2026-02-25T16:36:56 (oldest deadline 2026-02-25T16:37:16)`
- Repeating cycle: start → heartbeat failure → stop → restart
- Same behavior observed on ga-node-01c (OSDs eventually deleted)
---
## 3. Tests Performed
### 3.1 Network Connectivity (nc)
**Result: OK in both directions**
```bash
# From another node toward ga-node-08c
nc -zv 10.77.204.117 6800 # public OSD
nc -zv 10.77.204.117 6802 # public heartbeat
nc -zv 10.78.204.117 6800 # cluster OSD
nc -zv 10.78.204.117 6802 # cluster heartbeat
# Ports tested: 6800, 6801, 6802, 6803, 6804, 6805
```
### 3.2 Firewall
**Result: No blocking**
```bash
iptables -L -n # empty
nft list ruleset # empty
# Proxmox firewall disabled
```
### 3.3 NTP / Clock Synchronization
**Result: Synchronized**
```bash
chronyc tracking
timedatectl
```
All nodes synchronized, negligible clock drift.
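The per-node clock checks can be run in one pass from any node (a sketch; assumes the short hostnames are resolvable over SSH and chrony runs on every node):

```shell
# Print the system-time offset reported by chrony on each node
for h in ga-node-10c ga-node-11c ga-node-13cr ga-node-14c ga-node-08c; do
    printf '%s: ' "$h"
    ssh "$h" chronyc tracking | awk -F' : ' '/^System time/ {print $2}'
done
```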
### 3.4 OSD Keyring
**Result: Match confirmed**
```bash
# On ga-node-08c
cat /var/lib/ceph/osd/ceph-11/keyring
# Key: AQDCRHBpozx9IxAAEK327zjGBVKv5kSyP0zwlw==
# Compared with cluster
ceph auth get osd.11
# Identical
```
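The visual comparison can be made mechanical (a sketch; `ceph auth get-key` prints only the key, and the `awk` pattern assumes the standard one-key keyring layout):

```shell
# Extract the key from the on-disk keyring and compare it with the key the
# monitors hold for osd.11
disk_key=$(awk -F' = ' '/key = / {print $2}' /var/lib/ceph/osd/ceph-11/keyring)
mon_key=$(ceph auth get-key osd.11)
[ "$disk_key" = "$mon_key" ] && echo "keys match" || echo "KEY MISMATCH"
```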
### 3.5 Blocklist
**Result: No entries**
```bash
ceph osd blocklist ls
```
### 3.6 Kernel Version
**Result: Inconclusive (not the root cause)**
- Tested with `6.8.12-18-pve` (installed version) → failure
- Tested with `6.8.12-16-pve` → same failure
- Conclusion: kernel version is **not** the cause
### 3.7 NIC Offloading Disabled
**Result: Did not resolve the issue**
```bash
ethtool -K <interface> tx off rx off gso off gro off tso off
```
### 3.8 Jumbo Frames (MTU 9000)
**Result: Working** (tested from ga-node-11c)
```bash
ping -M do -s 8972 10.78.204.150
# 8980 bytes from 10.78.204.150: icmp_seq=1 ttl=64 time=0.063 ms
```
> **TODO:** Test specifically **from ga-node-08c**
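The open TODO can be closed with a short sweep run from ga-node-08c itself (a sketch; peer cluster IPs are taken from the topology table, and 8972 = 9000 minus 20 bytes of IP header and 8 bytes of ICMP header):

```shell
# Verify that full-size jumbo frames pass unfragmented from this node to
# every peer's cluster IP
for ip in 10.78.204.122 10.78.204.123 10.78.204.150 10.78.204.116; do
    if ping -M do -c 3 -W 2 -s 8972 "$ip" >/dev/null 2>&1; then
        echo "$ip: jumbo frames OK"
    else
        echo "$ip: FAILED (path MTU below 9000, or host unreachable)"
    fi
done
```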
### 3.9 Heartbeat Debug (debug_ms=5)
**Result: Revealing**
Applied via admin socket (runtime):
```bash
ceph daemon /var/run/ceph/ceph-osd.11.asok config set debug_ms 5
```
Then persistently:
```bash
ceph config set osd.11 debug_ms 5
```
Output is written to `/var/log/ceph/ceph-osd.11.log`.
**Key finding:** Heartbeat **works in both directions** at the network level:
- osd.11 sends pings (`-->`) to all peers
- All peers reply with `ping_reply` (`<==`)
- All connections in `s=READY` state
- BUT `up_from 0` in osd.11's pings → deadlock (see section 5)
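To reproduce the finding from the raw log, filtering the `debug_ms=5` output down to heartbeat traffic is usually enough (a sketch; the exact message text varies between Ceph releases, so treat the patterns as a starting point):

```shell
# Show recent heartbeat messages; osd_ping lines carry the sender's up_from
# epoch, which is where the stuck value 0 appears
grep -E 'osd_ping|ping_reply|heartbeat_check' /var/log/ceph/ceph-osd.11.log | tail -n 20
```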
---
## 4. Configurations Applied
### 4.1 mClock Profile (beginning of session)
```bash
ceph config set osd osd_mclock_profile high_client_ops
```
### 4.2 osd_heartbeat_grace (temporary — removed)
```bash
# Applied only to osd.11 for debugging
ceph config set osd.11 osd_heartbeat_grace 300
# Then removed
ceph config rm osd.11 osd_heartbeat_grace
```
> **Unintended side effect:** Other OSDs also waited 300s before reporting osd.11 as dead, causing the OSD boot cycle to repeat every ~300 seconds instead of 20 seconds.
### 4.3 debug_ms (remove when done)
```bash
ceph config set osd.11 debug_ms 5
# Remove when debugging is complete:
ceph config rm osd.11 debug_ms
```
### 4.4 Recommended Permanent Fix
```bash
ceph config set osd osd_heartbeat_grace 60
```
> Increases the grace period before an OSD is reported as dead (20s → 60s). Gives ga-node-08c OSDs enough time to initialize heartbeat connections on startup.
---
## 5. Root Cause Analysis
### The Identified Deadlock
```
osd.11 starts
↓
Monitor briefly marks it UP
↓
osd.11 sends pings with up_from=0
↓
Peer OSDs ignore these pings (osd.11 is marked DOWN in the OSD map)
↓
After ~300s (grace), peer OSDs report osd.11 as failed to the monitor
↓
Monitor marks osd.11 DOWN
↓
osd.11 receives "wrongly marked me down" → kills itself
↓
Restart → cycle repeats
```
### Evidence in Monitor Logs (`ceph.log` on ga-node-14c)
```
16:41:36 - osd.11 boot
16:46:38 - osd.7, osd.12, osd.14, osd.6, osd.5 report osd.11 failed
"after 300.061138 >= grace 69.944756"
16:46:39 - osd.11 marked itself dead as of e4082
"Monitor daemon marked osd.11 down, but it is still running"
16:56:40 - osd.11 boot (2nd attempt)
→ IMMEDIATELY reports all other OSDs as failed
17:01:42 - same cycle, killed again
"after 300.249378 >= grace 119.815915"
```
### Why `up_from` Stays at 0
When an OSD boots, it sets `up_from` to the epoch at which the monitor officially marks it `up`. If the monitor keeps marking it `down` faster than the OSD can stabilize, `up_from` never gets set. Peer OSDs receiving pings with `up_from=0` treat them as invalid and do not update their heartbeat timers — which causes them to report the OSD as failed, completing the deadlock.
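The stuck value can also be read straight out of the OSD map, reusing this document's `python3` filtering pattern (`-f json` is required, since the default dump is plain text; `up_from` and `up_thru` are standard fields of the JSON osd dump):

```shell
# Print the epoch fields for osd.11; up_from stuck at 0 while the up/down
# state flips confirms the deadlock described above
ceph osd dump -f json | python3 -c '
import sys, json
for o in json.load(sys.stdin)["osds"]:
    if o["osd"] == 11:
        print("up_from:", o["up_from"], "up_thru:", o["up_thru"])
'
```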
### Impact on VMs
Every attempt to add osd.11 causes:
- 27 PGs entering `remapped+peering` state (I/O blocked)
- Slow ops blocked for 60-70+ seconds
- **VMs paused** for the entire duration of peering
- Once osd.11 dies and the cycle repeats, the pauses repeat
---
## 6. Actions Taken
### nodown Attempt
```bash
ceph osd set nodown # prevents monitor from acting on failure reports
ceph config rm osd.11 osd_heartbeat_grace
systemctl restart ceph-osd@11
# OSD came up and the cluster started rebalancing,
# but the attempt was aborted after only ~30 s; allow at least 2-3 minutes
ceph osd unset nodown
```
### ga-node-08c Cleanup
```bash
# OSDs 11, 13, 16 purged from cluster (already absent from OSD tree)
ceph osd purge 11 --yes-i-really-mean-it
ceph osd purge 13 --yes-i-really-mean-it
ceph osd purge 16 --yes-i-really-mean-it
# Host removed from crush map
ceph osd crush rm ga-node-08c
# Ceph services stopped on ga-node-08c (only ceph-crash.service remained)
systemctl stop ceph-crash.service
systemctl disable ceph-crash.service
```
> **Warning:** `pveceph purge` was attempted but aborted — this command destroys the **entire** cluster, not just the local node. Use manual cleanup instead.
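After the manual cleanup, it is worth confirming that no trace of the host or its OSDs remains (a sketch; both commands are standard Ceph CLI, and `ceph osd purge` should already have removed the auth entries):

```shell
# The host should appear in neither the CRUSH tree nor the auth database
ceph osd crush tree | grep -q ga-node-08c && echo "still in CRUSH map" || echo "CRUSH clean"
ceph auth ls 2>/dev/null | grep -q 'osd\.1[136]$' && echo "stale auth entries remain" || echo "auth clean"
```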
---
## 7. Current Cluster State
```
cluster health: HEALTH_OK
osd: 14 osds: 14 up, 12 in
pools: 2 pools, 129 pgs → active+clean
data: 4.1 TiB, 13 TiB used, 8.3 TiB avail
```
### Current OSD Tree
| Node | OSDs | Status |
|---|---|---|
| ga-node-10c | 0, 4, 7, 12 | up, in |
| ga-node-11c | 14, 15 | up, **reweight=0** (slow disks, intentional) |
| ga-node-13cr | 2, 5, 8, 9 | up, in |
| ga-node-14c | 1, 3, 6, 10 | up, in |
| ga-node-08c | — | **removed** |
| ga-node-01c | — | empty crush entry |
---
## 8. Unresolved Issues
1. **Root cause unknown:** Why does ga-node-08c (and ga-node-01c) take more than 20s to establish heartbeat after reboot? Untested lead: jumbo frame ping **from** ga-node-08c specifically.
2. **ga-node-01c** has an empty crush entry to clean up:
```bash
ceph osd crush rm ga-node-01c
```
3. **debug_ms=5** may still be set in the config store for osd.11 (now deleted — verify):
```bash
ceph config rm osd.11 debug_ms
```
---
## 9. Recommended Procedure to Reintegrate ga-node-08c
When ready to recreate OSDs on ga-node-08c:
```bash
# 1. Apply global grace (if not already done)
ceph config set osd osd_heartbeat_grace 60
# 2. Prevent monitor from marking OSDs down during startup
ceph osd set nodown
# 3. Create new OSD via Proxmox UI or ceph-volume
# 4. Monitor — wait at least 2-3 MINUTES before taking any action
watch -n 2 'ceph osd tree | grep ga-node-08c'
# Terminal 2:
watch -n 2 'ceph status'
# Terminal 3 (live monitor log):
ssh 10.77.204.116 "tail -f /var/log/ceph/ceph.log | grep osd"
# 5. Once OSD is stable (up_from != 0, status up)
ceph osd unset nodown
# 6. Mark the OSD "in" only after the cluster is HEALTH_OK
ceph osd in <id>
```
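Step 5's "up_from != 0" condition can be watched automatically instead of eyeballed (a sketch; osd id 11 stands in for whatever id the newly created OSD receives):

```shell
# Poll the OSD map until the new OSD's up_from is non-zero, then it is
# safe to unset nodown
while :; do
    up_from=$(ceph osd dump -f json | python3 -c '
import sys, json
print(next((o["up_from"] for o in json.load(sys.stdin)["osds"] if o["osd"] == 11), 0))
')
    [ "$up_from" -ne 0 ] && break
    echo "waiting: up_from still $up_from"
    sleep 10
done
echo "osd.11 registered at epoch $up_from; safe to run: ceph osd unset nodown"
```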
---
## 10. Reference Commands
```bash
# General cluster status
ceph status
ceph osd tree
ceph health detail
# Live monitor log
ssh 10.77.204.116 "tail -f /var/log/ceph/ceph.log"
# Persistent config management
ceph config dump
ceph config get osd.X <parameter>
ceph config set osd.X <parameter> <value>
ceph config rm osd.X <parameter>
# Admin socket (runtime only, not persistent)
ceph daemon /var/run/ceph/ceph-osd.11.asok config set <param> <value>
# Cluster flags
ceph osd set nodown
ceph osd unset nodown
# Heartbeat debug
ceph config set osd.11 debug_ms 5
tail -f /var/log/ceph/ceph-osd.11.log | grep -E "ping|heartbeat"
# OSD map inspection (-f json is required; the default dump is plain text)
ceph osd dump -f json | python3 -c "
import sys, json
d = json.load(sys.stdin)
for o in d['osds']:
    if o['osd'] == 11:
        print(o)
"
```
---
## 11. Key Lessons Learned
| Finding | Detail |
|---|---|
| Heartbeat network works | Both directions confirmed via debug_ms=5 logs |
| Root cause is a deadlock | `up_from=0` → peers ignore pings → failure reports → monitor marks down → repeat |
| `osd_heartbeat_grace` on a single OSD | Affects how long peer OSDs wait before reporting that specific OSD as failed |
| `nodown` flag | Breaks the deadlock by preventing the monitor from acting on failure reports |
| 30s is not enough | After adding an OSD, wait 2-3 minutes before concluding it failed |
| VM pauses during OSD ops | Caused by PGs entering `peering` state — normal but needs OSD stability to resolve |