OSDs keep crashing down/in, down/out

sarman

New Member
Jul 30, 2023
I am running a simple 3-node Proxmox cluster with Ceph storage. Each host has a 2-port 10G NIC, and all nodes are daisy-chained to each other in a full mesh:
VMHOST1 (nic1) → VMHOST2 (nic1)
VMHOST1 (nic2) → VMHOST3 (nic1)
VMHOST2 (nic2) → VMHOST3 (nic2)
We have started observing that Ceph remains in an unhealthy state:
[screenshot of the Ceph health warning attached]
Looking in syslog (same results in the Ceph OSD logs), we see the following:
2025-07-15T17:39:33.227893+00:00 vmhost4 ceph-osd[10001]: 2025-07-15T17:39:33.227+0000 791831c576c0 -1 osd.11 4153 heartbeat_check: no reply from 172.16.25.132:6808 osd.13 ever on either front or back, first ping sent 2025-07-15T17:37:39.174150+0000 (oldest deadline 2025-07-15T17:37:59.174150+0000)
2025-07-15T17:39:33.227950+00:00 vmhost4 ceph-osd[10001]: 2025-07-15T17:39:33.227+0000 791831c576c0 -1 osd.11 4153 heartbeat_check: no reply from 172.16.25.132:6812 osd.14 ever on either front or back, first ping sent 2025-07-15T17:38:16.476776+0000 (oldest deadline 2025-07-15T17:38:36.476776+0000)
2025-07-15T17:39:33.227965+00:00 vmhost4 ceph-osd[10001]: 2025-07-15T17:39:33.227+0000 791831c576c0 -1 osd.11 4153 heartbeat_check: no reply from 172.16.25.132:6816 osd.20 ever on either front or back, first ping sent 2025-07-15T17:37:39.174150+0000 (oldest deadline 2025-07-15T17:37:59.174150+0000)
2025-07-15T17:39:33.227979+00:00 vmhost4 ceph-osd[10001]: 2025-07-15T17:39:33.227+0000 791831c576c0 -1 osd.11 4153 heartbeat_check: no reply from 172.16.25.133:6804 osd.21 ever on either front or back, first ping sent 2025-07-15T17:38:55.579357+0000 (oldest deadline 2025-07-15T17:39:15.579357+0000)
2025-07-15T17:39:33.227993+00:00 vmhost4 ceph-osd[10001]: 2025-07-15T17:39:33.227+0000 791831c576c0 -1 osd.11 4153 heartbeat_check: no reply from 172.16.25.133:6812 osd.23 ever on either front or back, first ping sent 2025-07-15T17:38:38.678072+0000 (oldest deadline 2025-07-15T17:38:58.678072+0000)
2025-07-15T17:39:33.228006+00:00 vmhost4 ceph-osd[10001]: 2025-07-15T17:39:33.227+0000 791831c576c0 -1 osd.11 4153 heartbeat_check: no reply from 172.16.25.133:6816 osd.24 ever on either front or back, first ping sent 2025-07-15T17:34:31.861564+0000 (oldest deadline 2025-07-15T17:34:51.861564+0000)
2025-07-15T17:39:33.994440+00:00 vmhost4 pvedaemon[1273]: <root@pam> starting task UPID:vmhost4:002097EB:0331C185:68769255:srvstop:osd.11:root@pam:
2025-07-15T17:39:34.002726+00:00 vmhost4 ceph-osd[10001]: 2025-07-15T17:39:34.002+0000 7918396ca6c0 -1 received signal: Terminated from /sbin/init (PID: 1) UID: 0
2025-07-15T17:39:34.002799+00:00 vmhost4 ceph-osd[10001]: 2025-07-15T17:39:34.002+0000 7918396ca6c0 -1 osd.11 4153 *** Got signal Terminated ***
2025-07-15T17:39:34.002842+00:00 vmhost4 ceph-osd[10001]: 2025-07-15T17:39:34.002+0000 7918396ca6c0 -1 osd.11 4153 *** Immediate shutdown (osd_fast_shutdown=true) ***
2025-07-15T17:39:34.003076+00:00 vmhost4 systemd[1]: Stopping ceph-osd@11.service - Ceph object storage daemon osd.11...
2025-07-15T17:39:34.325377+00:00 vmhost4 systemd[1]: ceph-osd@11.service: Deactivated successfully.
2025-07-15T17:39:34.325522+00:00 vmhost4 systemd[1]: Stopped ceph-osd@11.service - Ceph object storage daemon osd.11.
2025-07-15T17:39:34.325725+00:00 vmhost4 systemd[1]: ceph-osd@11.service: Consumed 13min 43.206s CPU time.
2025-07-15T17:39:34.331661+00:00 vmhost4 pvedaemon[1273]: <root@pam> end task UPID:vmhost4:002097EB:0331C185:68769255:srvstop:osd.11:root@pam: OK

I have tried restarting the monitors, the managers, even the metadata servers. We even rebuilt the servers, but the issue persists. We see our OSDs going down and/or out on all VM hosts at various times, but cannot figure out what is causing the OSDs to behave this way. If I didn't know better, I would say this was networking, but with the hosts daisy-chained like they are, it's all direct connections. Any advice would be greatly appreciated!
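One extra data point we can gather ourselves: since the Ceph bridge runs at MTU 9000, a do-not-fragment ping toward the addresses the OSDs heartbeat on (taken from the log above) should rule out a path-MTU problem on the direct links. A minimal check, run from the host logging the heartbeat failures:

# 8972 = 9000-byte MTU minus 20 bytes IP header and 8 bytes ICMP header
ping -M do -s 8972 -c 3 172.16.25.132
ping -M do -s 8972 -c 3 172.16.25.133

If the 8972-byte pings fail while ordinary pings work, the MTU 9000 setting is not clean end to end.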

pveversion
pve-manager/8.3.5/dac3aa88bac3f300 (running kernel: 6.8.12-9-pve)

ceph -v
ceph version 18.2.4 (2064df84afc61c7e63928121bfdd74c59453c893) reef (stable)

ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME         STATUS  REWEIGHT  PRI-AFF
-1         3.49277  root default
-3         0.87320      host vmhost4
10    ssd  0.43660          osd.10       DNE         0
11    ssd  0.43660          osd.11       DNE         0
-5         1.30978      host vmhost5
12    ssd  0.43660          osd.12        up   1.00000  1.00000
13    ssd  0.43660          osd.13        up   1.00000  1.00000
14    ssd  0.21829          osd.14        up   1.00000  1.00000
20    ssd  0.21829          osd.20        up   1.00000  1.00000
-7         1.30978      host vmhost6
21    ssd  0.43660          osd.21        up   1.00000  1.00000
22    ssd  0.43660          osd.22        up   1.00000  1.00000
23    ssd  0.21829          osd.23        up   1.00000  1.00000
24    ssd  0.21829          osd.24        up   1.00000  1.00000

ceph -s
  cluster:
    id:     4f0de565-25af-4368-bbee-abf7afff1564
    health: HEALTH_WARN
            2 osds exist in the crush map but not in the osdmap
            Reduced data availability: 3 pgs inactive
            34 slow ops, oldest one blocked for 434 sec, daemons [osd.12,osd.13,osd.22,mon.vmhost4] have slow ops.

  services:
    mon: 3 daemons, quorum vmhost4,vmhost5,vmhost6 (age 62m)
    mgr: vmhost6(active, since 53m), standbys: vmhost4, vmhost5
    osd: 8 osds: 8 up (since 5m), 8 in (since 62m)

  data:
    pools:   2 pools, 3 pgs
    objects: 0 objects, 0 B
    usage:   858 MiB used, 2.6 TiB / 2.6 TiB avail
    pgs:     100.000% pgs unknown
             3 unknown
 
Someone asked for the network config, but the post is gone; here it is in case that person comes back:
auto lo
iface lo inet loopback

auto eno1
iface eno1 inet manual

auto eno2
iface eno2 inet manual

auto ens1f0np0
iface ens1f0np0 inet manual
mtu 9000

auto ens1f1np1
iface ens1f1np1 inet manual
mtu 9000

iface ens2f0 inet manual

iface ens2f1 inet manual

auto bond0
iface bond0 inet manual
bond-slaves eno1 eno2
bond-miimon 100
bond-mode 802.3ad
bond-xmit-hash-policy layer2+3

auto bond0.584
iface bond0.584 inet manual

auto cephbr0
iface cephbr0 inet static
address 172.16.25.163/27
bridge-ports ens1f0np0 ens1f1np1
bridge-stp on
mtu 9000
nobridge-waitport 0

auto vmbr0
iface vmbr0 inet manual
bridge-ports bond0
bridge-stp off
bridge-fd 0
bridge-vlan-aware yes
bridge-vids 580 581 582 583 588 589 590 592 594 595

auto vmbr1
iface vmbr1 inet static
address 172.16.25.133/27
gateway 172.16.25.129
bridge-ports bond0.584
bridge-stp off
bridge-fd 0
bridge-vlan-aware yes
bridge-vids 2-4094

post-up ip route add default via 172.16.25.129 dev bond0.584
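One thing worth double-checking in the meantime: the heartbeat addresses in the log (172.16.25.132/.133) are not in the cephbr0 subnet (172.16.25.160/27), so it is worth confirming which public/cluster networks Ceph was actually set up with. On a stock Proxmox Ceph install that would be roughly:

# ceph.conf is kept in the cluster filesystem on Proxmox
grep -E 'public_network|cluster_network' /etc/pve/ceph.conf
# or ask the cluster directly (empty output means the value only lives in ceph.conf)
ceph config get osd public_network
ceph config get osd cluster_network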
 
Hi,

My assumption is that bond0 is for the uplink?

Then it seems you created a bridge for Ceph, but with no bond underneath?
So your bridge might not know which interface to use to send back received traffic.
(You can take a detailed look with tcpdump.)
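For example, something like this, assuming the heartbeat peer address from the log above and the mesh NIC names from your config:

# watch on which mesh port traffic to/from a peer actually shows up
tcpdump -ni ens1f0np0 host 172.16.25.132
tcpdump -ni ens1f1np1 host 172.16.25.132
# and check which port the bridge has learned each peer MAC on
bridge fdb show br cephbr0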

You can follow this official guide for possible setups:
https://pve.proxmox.com/wiki/Full_Mesh_Network_for_Ceph_Server

From easy to sophisticated:
1. The broadcast setup <- can be done via the GUI.
2. The routed setup (simple) <- needs manual configuration, so consider setting it up in "/etc/network/interfaces.d/" (see the sketch below).
3. All other setups, with RSTP etc.
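For reference, a rough sketch of what the routed setup (simple) from that wiki page could look like on one node. The interface names are taken from your config; the 172.16.25.161/.162/.163 addresses and the port-to-peer mapping are only placeholders to adapt to your cabling:

# e.g. /etc/network/interfaces.d/ceph-mesh on "node A" (.161); node B = .162, node C = .163
auto ens1f0np0
iface ens1f0np0 inet static
        address 172.16.25.161/27
        mtu 9000
        # this port is cabled directly to node B
        up ip route add 172.16.25.162/32 dev ens1f0np0
        down ip route del 172.16.25.162/32

auto ens1f1np1
iface ens1f1np1 inet static
        address 172.16.25.161/27
        mtu 9000
        # this port is cabled directly to node C
        up ip route add 172.16.25.163/32 dev ens1f1np1
        down ip route del 172.16.25.163/32

Each node keeps a single Ceph address plus one /32 route per directly attached peer, so replies always leave on the correct port without any bridge or STP in the path; the cephbr0 bridge goes away entirely.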

Good luck,
BR
 
Someone asked for the network config, but the post is gone; here it is in case that person comes back:
It was me. I removed my post because there seems to be something wrong with your setup.

Like bl1mp said, you need to reconfigure your setup for Ceph. I can recommend the routed simple method: easy to configure, simple to understand, rock stable. I do not understand why every node got duplicated monitors, managers, and metadata servers.
 
Ceph is using 2 dedicated NICs not used by anything other than Ceph:
auto cephbr0
iface cephbr0 inet static
address 172.16.25.163/27
bridge-ports ens1f0np0 ens1f1np1
bridge-stp on
mtu 9000
nobridge-waitport 0

This is the same setup we have in all other racks (we have multiple Proxmox environments), but this is the only lab that has this issue. We are sending someone to the datacenter to verify the cables are good, as that is the only thing I think could be the issue.
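In the meantime, the link state and error counters on the mesh ports can be checked remotely; a quick look before the datacenter trip (interface names as in the config above):

# negotiated speed and link status per mesh port
ethtool ens1f0np0 | grep -Ei 'speed|link detected'
ethtool ens1f1np1 | grep -Ei 'speed|link detected'
# RX/TX error and drop counters
ip -s link show ens1f0np0
ip -s link show ens1f1np1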