I am running a simple 3-node Proxmox cluster with Ceph storage. Each host has a two-port 10G NIC, and all nodes are daisy-chained to each other:
VMHOST1 (nic1) → VMHOST2 (nic1)
VMHOST1 (nic2) → VMHOST3 (nic1)
VMHOST2 (nic2) → VMHOST3 (nic2)
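To illustrate how the direct links are set up, here is a rough sketch of the routed point-to-point config on one node, along the lines of the routed setup in the Proxmox "Full Mesh Network for Ceph Server" wiki article. The interface names and addresses below are placeholders, not our literal /etc/network/interfaces; the real Ceph addressing is the 172.16.25.0/24 range you will see in the logs further down:

# /etc/network/interfaces excerpt for one node -- a rough sketch only,
# interface names and addresses are placeholders
auto ens1f0
iface ens1f0 inet static
        address 172.16.25.131/24
# mtu 9000 only belongs here if jumbo frames are enabled on both ends of the link
        mtu 9000
# direct link to the first peer
        up ip route add 172.16.25.132/32 dev ens1f0
        down ip route del 172.16.25.132/32

auto ens1f1
iface ens1f1 inet static
        address 172.16.25.131/24
        mtu 9000
# direct link to the second peer
        up ip route add 172.16.25.133/32 dev ens1f1
        down ip route del 172.16.25.133/32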
We have started observing that Ceph remains in an unhealthy state (see the ceph -s output at the end of this post).

Looking in syslog (same results in the ceph/osd logs), we see the following:
2025-07-15T17:39:33.227893+00:00 vmhost4 ceph-osd[10001]: 2025-07-15T17:39:33.227+0000 791831c576c0 -1 osd.11 4153 heartbeat_check: no reply from 172.16.25.132:6808 osd.13 ever on either front or back, first ping sent 2025-07-15T17:37:39.174150+0000 (oldest deadline 2025-07-15T17:37:59.174150+0000)
2025-07-15T17:39:33.227950+00:00 vmhost4 ceph-osd[10001]: 2025-07-15T17:39:33.227+0000 791831c576c0 -1 osd.11 4153 heartbeat_check: no reply from 172.16.25.132:6812 osd.14 ever on either front or back, first ping sent 2025-07-15T17:38:16.476776+0000 (oldest deadline 2025-07-15T17:38:36.476776+0000)
2025-07-15T17:39:33.227965+00:00 vmhost4 ceph-osd[10001]: 2025-07-15T17:39:33.227+0000 791831c576c0 -1 osd.11 4153 heartbeat_check: no reply from 172.16.25.132:6816 osd.20 ever on either front or back, first ping sent 2025-07-15T17:37:39.174150+0000 (oldest deadline 2025-07-15T17:37:59.174150+0000)
2025-07-15T17:39:33.227979+00:00 vmhost4 ceph-osd[10001]: 2025-07-15T17:39:33.227+0000 791831c576c0 -1 osd.11 4153 heartbeat_check: no reply from 172.16.25.133:6804 osd.21 ever on either front or back, first ping sent 2025-07-15T17:38:55.579357+0000 (oldest deadline 2025-07-15T17:39:15.579357+0000)
2025-07-15T17:39:33.227993+00:00 vmhost4 ceph-osd[10001]: 2025-07-15T17:39:33.227+0000 791831c576c0 -1 osd.11 4153 heartbeat_check: no reply from 172.16.25.133:6812 osd.23 ever on either front or back, first ping sent 2025-07-15T17:38:38.678072+0000 (oldest deadline 2025-07-15T17:38:58.678072+0000)
2025-07-15T17:39:33.228006+00:00 vmhost4 ceph-osd[10001]: 2025-07-15T17:39:33.227+0000 791831c576c0 -1 osd.11 4153 heartbeat_check: no reply from 172.16.25.133:6816 osd.24 ever on either front or back, first ping sent 2025-07-15T17:34:31.861564+0000 (oldest deadline 2025-07-15T17:34:51.861564+0000)
2025-07-15T17:39:33.994440+00:00 vmhost4 pvedaemon[1273]: <root@pam> starting task UPID:vmhost4:002097EB:0331C185:68769255:srvstop:osd.11:root@pam:
2025-07-15T17:39:34.002726+00:00 vmhost4 ceph-osd[10001]: 2025-07-15T17:39:34.002+0000 7918396ca6c0 -1 received signal: Terminated from /sbin/init (PID: 1) UID: 0
2025-07-15T17:39:34.002799+00:00 vmhost4 ceph-osd[10001]: 2025-07-15T17:39:34.002+0000 7918396ca6c0 -1 osd.11 4153 *** Got signal Terminated ***
2025-07-15T17:39:34.002842+00:00 vmhost4 ceph-osd[10001]: 2025-07-15T17:39:34.002+0000 7918396ca6c0 -1 osd.11 4153 *** Immediate shutdown (osd_fast_shutdown=true) ***
2025-07-15T17:39:34.003076+00:00 vmhost4 systemd[1]: Stopping ceph-osd@11.service - Ceph object storage daemon osd.11...
2025-07-15T17:39:34.325377+00:00 vmhost4 systemd[1]: ceph-osd@11.service: Deactivated successfully.
2025-07-15T17:39:34.325522+00:00 vmhost4 systemd[1]: Stopped ceph-osd@11.service - Ceph object storage daemon osd.11.
2025-07-15T17:39:34.325725+00:00 vmhost4 systemd[1]: ceph-osd@11.service: Consumed 13min 43.206s CPU time.
2025-07-15T17:39:34.331661+00:00 vmhost4 pvedaemon[1273]: <root@pam> end task UPID:vmhost4:002097EB:0331C185:68769255:srvstop:osd.11:root@pam: OK
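The unreachable addresses in those heartbeat_check lines (172.16.25.132 and 172.16.25.133) are the OSD heartbeat endpoints on the other two nodes. If it helps narrow things down, these are the sorts of checks I can run from the node logging the errors while they are happening; this is just a sketch, the interface name is a placeholder and the 8972-byte ping only makes sense if the links run MTU 9000:

ping -c 5 172.16.25.132                       # basic reachability to the peer OSD node
ping -c 5 -M do -s 8972 172.16.25.132         # path MTU check for jumbo frames (use -s 1472 for MTU 1500)
ip -s link show ens1f0                        # RX/TX error and drop counters on the 10G port
ethtool -S ens1f0 | grep -iE 'err|drop|crc'   # NIC-level error counters
iperf3 -c 172.16.25.132 -t 30                 # sustained throughput (run iperf3 -s on the peer first)
grep network /etc/pve/ceph.conf               # confirm which public/cluster networks Ceph is using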
I have tried restarting the monitors, the managers, and even the metadata servers. We even rebuilt the servers, but the issue persists. We see our OSDs going down and/or out on all VMHOSTs at various times, but cannot figure out what is causing the OSDs to behave this way. If I didn't know better, I would say this was networking, but with the hosts daisy-chained like they are, it's all direct connections. Any advice would be greatly appreciated!
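For completeness, the restarts mentioned above were done with the stock systemd units, roughly like this (the daemon IDs matching the hostnames is the Proxmox default):

systemctl restart ceph-mon@vmhost4.service
systemctl restart ceph-mgr@vmhost4.service
systemctl restart ceph-mds@vmhost4.service    # only on the nodes that run an MDS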
pveversion
pve-manager/8.3.5/dac3aa88bac3f300 (running kernel: 6.8.12-9-pve)
ceph -v
ceph version 18.2.4 (2064df84afc61c7e63928121bfdd74c59453c893) reef (stable)
ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME         STATUS  REWEIGHT  PRI-AFF
-1         3.49277  root default
-3         0.87320      host vmhost4
10    ssd  0.43660          osd.10       DNE         0
11    ssd  0.43660          osd.11       DNE         0
-5         1.30978      host vmhost5
12    ssd  0.43660          osd.12        up   1.00000  1.00000
13    ssd  0.43660          osd.13        up   1.00000  1.00000
14    ssd  0.21829          osd.14        up   1.00000  1.00000
20    ssd  0.21829          osd.20        up   1.00000  1.00000
-7         1.30978      host vmhost6
21    ssd  0.43660          osd.21        up   1.00000  1.00000
22    ssd  0.43660          osd.22        up   1.00000  1.00000
23    ssd  0.21829          osd.23        up   1.00000  1.00000
24    ssd  0.21829          osd.24        up   1.00000  1.00000
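Side note on the two DNE entries: osd.10 and osd.11 on vmhost4 are still in the CRUSH map but no longer in the osdmap, which is what the "2 osds exist in the crush map but not in the osdmap" warning in the ceph -s output below refers to. Presumably removing the stale CRUSH/auth entries would clear that particular warning (a sketch of the standard cleanup, which we have not run yet), although it should not have anything to do with the heartbeat failures:

ceph osd crush remove osd.10
ceph osd crush remove osd.11
ceph auth del osd.10    # only if auth keys for these IDs still exist
ceph auth del osd.11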
ceph -s
  cluster:
    id:     4f0de565-25af-4368-bbee-abf7afff1564
    health: HEALTH_WARN
            2 osds exist in the crush map but not in the osdmap
            Reduced data availability: 3 pgs inactive
            34 slow ops, oldest one blocked for 434 sec, daemons [osd.12,osd.13,osd.22,mon.vmhost4] have slow ops.

  services:
    mon: 3 daemons, quorum vmhost4,vmhost5,vmhost6 (age 62m)
    mgr: vmhost6(active, since 53m), standbys: vmhost4, vmhost5
    osd: 8 osds: 8 up (since 5m), 8 in (since 62m)

  data:
    pools:   2 pools, 3 pgs
    objects: 0 objects, 0 B
    usage:   858 MiB used, 2.6 TiB / 2.6 TiB avail
    pgs:     100.000% pgs unknown
             3 unknown