Long heartbeat ping times seen on back interface

Hi!

We just upgraded our Ceph nodes to PVE 7.2 with kernel 5.15.39-4 and Ceph 16.2.9 and are experiencing this exact issue with OSD_SLOW_PING_TIME_FRONT/BACK.

The previous setup, PVE 7.1 with kernel 5.13.19-6 and Ceph 16.2.7, ran very stably for months.

Hardware is Supermicro X11/X12 with Mellanox ConnectX-5 100G NICs, Intel PCIe HHHL NVMe drives, and Seagate Exos HDDs.

Any advice on which 5.15 kernels are stable to run? Is 5.15.53-1 a good option? Or should we reboot into 5.13.19-6 for now?
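
If we do fall back, the plan would be to pin the known-good kernel explicitly rather than rely on boot order. A rough sketch, assuming proxmox-boot-tool manages the boot entries (the pin/unpin subcommands only exist in newer pve-kernel-helper versions; older setups would need to edit the GRUB default instead):

Code:
 # List kernels known to the bootloader and see which one is currently selected
 proxmox-boot-tool kernel list

 # Pin the known-good kernel so every reboot uses it (version string is an example)
 proxmox-boot-tool kernel pin 5.13.19-6-pve

 # Remove the pin again once a fixed newer kernel is confirmed
 proxmox-boot-tool kernel unpin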
 
Yeah, in our particular case, the only 5.15 kernel that wasn't AS problematic was 5.15.35-2-pve.

We are currently running 5.13.19-6-pve, as that has proven to be the fastest so far. We haven't tested the 5.15.53-1-pve kernel yet. For our customers we MUST have stability, so I'm not sure we will even try the 5.15.53-1-pve kernel .. we'll see.
--- edit ---
our "long heartbeat ping times" first went away on 5.15.35-2-pve

just for extra clarity
 
It seems the problem is back - we've noticed it with kernel 5.15.108-1-pve.

Code:
 cluster [WRN] Health check failed: Slow OSD heartbeats on back (longest 1064.667ms) (OSD_SLOW_PING_TIME_BACK)
 cluster [WRN] Health check failed: Slow OSD heartbeats on front (longest 1244.808ms) (OSD_SLOW_PING_TIME_FRONT)
 cluster [WRN] Health check update: Slow OSD heartbeats on back (longest 2670.215ms) (OSD_SLOW_PING_TIME_BACK)
 cluster [WRN] Health check update: Slow OSD heartbeats on front (longest 1744.936ms) (OSD_SLOW_PING_TIME_FRONT)
 cluster [WRN] Health check update: Slow OSD heartbeats on back (longest 2889.051ms) (OSD_SLOW_PING_TIME_BACK)
 cluster [WRN] Health check update: Slow OSD heartbeats on front (longest 2889.027ms) (OSD_SLOW_PING_TIME_FRONT)
 cluster [WRN] Health check update: Slow OSD heartbeats on back (longest 1872.833ms) (OSD_SLOW_PING_TIME_BACK)
 cluster [WRN] Health check update: Slow OSD heartbeats on front (longest 2481.821ms) (OSD_SLOW_PING_TIME_FRONT)

It started when a VM or two were migrated from one node to another.
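
To see which OSD pairs are affected and how bad the recent heartbeats actually are, we also dump the per-OSD heartbeat history via the admin socket. A rough example (the OSD id is just a placeholder; dump_osd_network has to be run on the node hosting that OSD, and the trailing value is the reporting threshold in ms):

Code:
 # Overall picture: which health checks are firing and for which OSD pairs
 ceph health detail

 # Per-OSD heartbeat ping history; threshold 0 lists all recent pings, not only slow ones
 # (run on the node that hosts the OSD; osd.0 is just an example id)
 ceph daemon osd.0 dump_osd_network 0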
 
We started to see the same with a 7-node cluster after upgrading to the newest version: kernel 6.2.16-4-pve, Ceph 17.2.6. We never saw this before, and there were no other changes in the cluster. 60 SSDs, connected via 2x 10G fiber to the switches.

Code:
 [WRN] OSD_SLOW_PING_TIME_BACK: Slow OSD heartbeats on back (longest 4854.443ms)
     Slow OSD heartbeats on back from osd.50 [] to osd.47 [] 4854.443 msec
 [WRN] OSD_SLOW_PING_TIME_FRONT: Slow OSD heartbeats on front (longest 4579.349ms)
     Slow OSD heartbeats on front from osd.50 [] to osd.47 [] 4579.349 msec

Pings through both Ceph networks are under 0.1 ms between nodes.
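
For reference, that was measured with plain pings on both the public (front) and cluster (back) networks; the addresses below are placeholders for a peer node:

Code:
 # Front / public Ceph network to a peer node (placeholder address)
 ping -c 1000 -i 0.2 -q 10.0.10.12

 # Back / cluster Ceph network to the same peer node (placeholder address)
 ping -c 1000 -i 0.2 -q 10.0.20.12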

Has anybody sorted this out and found the root cause?
 
I don't believe we ever got to the root cause. Remediation came via upgrading to a kernel that didn't have the issue.
 