I am running a simple 3-node Proxmox cluster with Ceph storage. Each host has a dual-port 10G NIC, and the nodes are daisy-chained directly to each other (a rough sketch of the interface setup follows the link list):
VMHOST1 (nic1) → VMHOST2 (nic1)
VMHOST1 (nic2) → VMHOST3 (nic1)
VMHOST2 (nic2) → VMHOST3 (nic2)
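For reference, the links are plain routed point-to-point connections, roughly like this on one node. The interface names and the local .131 address are placeholders I am using for this post (the .132/.133 peers are the addresses that appear in the logs further down), so treat it as a sketch rather than a verbatim copy of our /etc/network/interfaces:

# nic1 - direct cable to the first peer
auto enp1s0f0
iface enp1s0f0 inet static
        address 172.16.25.131/24
        up ip route add 172.16.25.132/32 dev enp1s0f0
        down ip route del 172.16.25.132/32

# nic2 - direct cable to the second peer
auto enp1s0f1
iface enp1s0f1 inet static
        address 172.16.25.131/24
        up ip route add 172.16.25.133/32 dev enp1s0f1
        down ip route del 172.16.25.133/32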
We have started observing that Ceph remains stuck in an unhealthy state (HEALTH_WARN; the full ceph -s output is at the end of this post).

Looking in syslog (same results in the Ceph/OSD logs), we see the following:
2025-07-15T17:39:33.227893+00:00 vmhost4 ceph-osd[10001]: 2025-07-15T17:39:33.227+0000 791831c576c0 -1 osd.11 4153 heartbeat_check: no reply from 172.16.25.132:6808 osd.13 ever on either front or back, first ping sent 2025-07-15T17:37:39.174150+0000 (oldest deadline 2025-07-15T17:37:59.174150+0000)
2025-07-15T17:39:33.227950+00:00 vmhost4 ceph-osd[10001]: 2025-07-15T17:39:33.227+0000 791831c576c0 -1 osd.11 4153 heartbeat_check: no reply from 172.16.25.132:6812 osd.14 ever on either front or back, first ping sent 2025-07-15T17:38:16.476776+0000 (oldest deadline 2025-07-15T17:38:36.476776+0000)
2025-07-15T17:39:33.227965+00:00 vmhost4 ceph-osd[10001]: 2025-07-15T17:39:33.227+0000 791831c576c0 -1 osd.11 4153 heartbeat_check: no reply from 172.16.25.132:6816 osd.20 ever on either front or back, first ping sent 2025-07-15T17:37:39.174150+0000 (oldest deadline 2025-07-15T17:37:59.174150+0000)
2025-07-15T17:39:33.227979+00:00 vmhost4 ceph-osd[10001]: 2025-07-15T17:39:33.227+0000 791831c576c0 -1 osd.11 4153 heartbeat_check: no reply from 172.16.25.133:6804 osd.21 ever on either front or back, first ping sent 2025-07-15T17:38:55.579357+0000 (oldest deadline 2025-07-15T17:39:15.579357+0000)
2025-07-15T17:39:33.227993+00:00 vmhost4 ceph-osd[10001]: 2025-07-15T17:39:33.227+0000 791831c576c0 -1 osd.11 4153 heartbeat_check: no reply from 172.16.25.133:6812 osd.23 ever on either front or back, first ping sent 2025-07-15T17:38:38.678072+0000 (oldest deadline 2025-07-15T17:38:58.678072+0000)
2025-07-15T17:39:33.228006+00:00 vmhost4 ceph-osd[10001]: 2025-07-15T17:39:33.227+0000 791831c576c0 -1 osd.11 4153 heartbeat_check: no reply from 172.16.25.133:6816 osd.24 ever on either front or back, first ping sent 2025-07-15T17:34:31.861564+0000 (oldest deadline 2025-07-15T17:34:51.861564+0000)
2025-07-15T17:39:33.994440+00:00 vmhost4 pvedaemon[1273]: <root@pam> starting task UPID:vmhost4:002097EB:0331C185:68769255:srvstop:osd.11:root@pam:
2025-07-15T17:39:34.002726+00:00 vmhost4 ceph-osd[10001]: 2025-07-15T17:39:34.002+0000 7918396ca6c0 -1 received signal: Terminated from /sbin/init (PID: 1) UID: 0
2025-07-15T17:39:34.002799+00:00 vmhost4 ceph-osd[10001]: 2025-07-15T17:39:34.002+0000 7918396ca6c0 -1 osd.11 4153 *** Got signal Terminated ***
2025-07-15T17:39:34.002842+00:00 vmhost4 ceph-osd[10001]: 2025-07-15T17:39:34.002+0000 7918396ca6c0 -1 osd.11 4153 *** Immediate shutdown (osd_fast_shutdown=true) ***
2025-07-15T17:39:34.003076+00:00 vmhost4 systemd[1]: Stopping ceph-osd@11.service - Ceph object storage daemon osd.11...
2025-07-15T17:39:34.325377+00:00 vmhost4 systemd[1]: ceph-osd@11.service: Deactivated successfully.
2025-07-15T17:39:34.325522+00:00 vmhost4 systemd[1]: Stopped ceph-osd@11.service - Ceph object storage daemon osd.11.
2025-07-15T17:39:34.325725+00:00 vmhost4 systemd[1]: ceph-osd@11.service: Consumed 13min 43.206s CPU time.
2025-07-15T17:39:34.331661+00:00 vmhost4 pvedaemon[1273]: <root@pam> end task UPID:vmhost4:002097EB:0331C185:68769255:srvstop:osd.11:root@pam: OK

I have tried restarting the monitors, the mgrs, and even the metadata servers. We even rebuilt the servers, but the issue persists. We see our OSDs going down and/or out on all VMHOSTs at various times, but we cannot figure out what is causing the OSDs to behave this way. If I didn't know better, I would say this was networking, but with the hosts daisy-chained like they are, it's all direct connections. Any advice would be greatly appreciated!
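Since the heartbeat failures all point at the peer addresses 172.16.25.132 and 172.16.25.133, these are the checks I am planning to capture the next time OSDs start dropping. The interface name is a placeholder for whichever port carries that link, and the jumbo-frame ping only matters if the links run MTU 9000; suggestions for anything more useful to grab are welcome:

# Basic reachability of the peers' Ceph addresses, including a full-size frame
ping -c 5 172.16.25.132
ping -c 5 -M do -s 8972 172.16.25.132   # only relevant with jumbo frames enabled

# Errors/drops on the point-to-point port (interface name is a placeholder)
ip -s link show enp1s0f0
ethtool -S enp1s0f0 | grep -iE 'err|drop|crc'

# What Ceph itself reports while the heartbeats are failing
ceph health detail
ceph osd find 13                        # maps an OSD id back to its host/address
ceph daemon osd.<id> dump_ops_in_flight # run on the host that owns the complaining OSD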
pveversion
pve-manager/8.3.5/dac3aa88bac3f300 (running kernel: 6.8.12-9-pve)

ceph -v
ceph version 18.2.4 (2064df84afc61c7e63928121bfdd74c59453c893) reef (stable)

ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME         STATUS  REWEIGHT  PRI-AFF
-1         3.49277  root default
-3         0.87320      host vmhost4
10    ssd  0.43660          osd.10       DNE         0
11    ssd  0.43660          osd.11       DNE         0
-5         1.30978      host vmhost5
12    ssd  0.43660          osd.12        up   1.00000  1.00000
13    ssd  0.43660          osd.13        up   1.00000  1.00000
14    ssd  0.21829          osd.14        up   1.00000  1.00000
20    ssd  0.21829          osd.20        up   1.00000  1.00000
-7         1.30978      host vmhost6
21    ssd  0.43660          osd.21        up   1.00000  1.00000
22    ssd  0.43660          osd.22        up   1.00000  1.00000
23    ssd  0.21829          osd.23        up   1.00000  1.00000
24    ssd  0.21829          osd.24        up   1.00000  1.00000

ceph -s
  cluster:
    id:     4f0de565-25af-4368-bbee-abf7afff1564
    health: HEALTH_WARN
            2 osds exist in the crush map but not in the osdmap
            Reduced data availability: 3 pgs inactive
            34 slow ops, oldest one blocked for 434 sec, daemons [osd.12,osd.13,osd.22,mon.vmhost4] have slow ops.

  services:
    mon: 3 daemons, quorum vmhost4,vmhost5,vmhost6 (age 62m)
    mgr: vmhost6(active, since 53m), standbys: vmhost4, vmhost5
    osd: 8 osds: 8 up (since 5m), 8 in (since 62m)

  data:
    pools:   2 pools, 3 pgs
    objects: 0 objects, 0 B
    usage:   858 MiB used, 2.6 TiB / 2.6 TiB avail
    pgs:     100.000% pgs unknown
             3 unknown
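One thing I am unsure about: the two DNE entries (osd.10 and osd.11 on vmhost4) look like leftovers from when we rebuilt that host, which would explain the "2 osds exist in the crush map but not in the osdmap" warning. Unless they could somehow be related to the heartbeat problem, my plan was to clear them out roughly like this; please tell me if that is a bad idea:

# Remove the two phantom CRUSH entries left over from the rebuild
ceph osd crush remove osd.10
ceph osd crush remove osd.11
# Drop any stale auth keys that may still exist for them
ceph auth del osd.10
ceph auth del osd.11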

