Long heartbeat ping times on back interface seen

I have a Proxmox cluster with 5 nodes.
The system uses Ceph.
It has been working without problems for 3 years.
Nothing was changed, but Ceph now shows this warning:

Code:
  cluster:
    id:     fb926dd9-17b9-42fb-88d6-27f4944fd554
    health: HEALTH_WARN
            Long heartbeat ping times on back interface seen, longest is 6566.709 msec
            Long heartbeat ping times on front interface seen, longest is 8522.771 msec

  services:
    mon: 3 daemons, quorum vCTDB-host-2,vCTDB-host-4,vCDTB-host-5
    mgr: vCTDB-host-2(active), standbys: vCTDB-host-5, vCTDB-host-4
    osd: 19 osds: 18 up, 18 in

  data:
    pools:   1 pools, 512 pgs
    objects: 823.49k objects, 2.59TiB
    usage:   6.67TiB used, 14.6TiB / 21.3TiB avail
    pgs:     510 active+clean
             2   active+clean+scrubbing+deep

  io:
    client:   57.1KiB/s rd, 415KiB/s wr, 38op/s rd, 32op/s wr

Ceph is very slow, but I could not find any network issue.

Could you please help me solve this?
 
How are the response times when using a simple ping (from node to node) on the command line?

Maybe stopping deep scrubbing improves the situation (see https://ceph.io/geen-categorie/temporarily-disable-ceph-scrubbing-to-resolve-high-io-load/).
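For reference, a minimal sketch of both checks (the IP addresses are placeholders for your own Ceph public/cluster networks, and the scrub flags should be removed again once you are done testing):

Code:
# raw latency on the Ceph public (front) and cluster (back) networks
ping -c 10 <front-ip-of-other-node>
ping -c 10 <back-ip-of-other-node>

# temporarily pause (deep) scrubbing cluster-wide
ceph osd set noscrub
ceph osd set nodeep-scrub

# re-enable once the latency question is answered
ceph osd unset noscrub
ceph osd unset nodeep-scrub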
 
I have the same issue on a testing cluster without any load on it. Any chance to troubleshoot this? Which interfaces are meant (and on which Ceph node)?
I changed nothing on the network, it just appeared out of nowhere. Is there any way to reset this? ceph crash rm / ceph crash archive-all do not work; it seems this error does not come from ceph crash at all.

Edit: ceph health detail shows more, it seems to be more a problem with some specific OSDs:

Code:
root@pve-03:~# ceph health detail
HEALTH_WARN Long heartbeat ping times on back interface seen, longest is 24948.135 msec; Long heartbeat ping times on front interface seen, longest is 25203.634 msec; 10 slow ops, oldest one blocked for 4409 sec, mon.pve-01 has slow ops
OSD_SLOW_PING_TIME_BACK Long heartbeat ping times on back interface seen, longest is 24948.135 msec
    Slow heartbeat ping on back interface from osd.16 to osd.6 24948.135 msec
    Slow heartbeat ping on back interface from osd.14 to osd.6 3018.578 msec
    Slow heartbeat ping on back interface from osd.14 to osd.7 3018.487 msec
    Slow heartbeat ping on back interface from osd.15 to osd.2 1517.088 msec possibly improving
OSD_SLOW_PING_TIME_FRONT Long heartbeat ping times on front interface seen, longest is 25203.634 msec
    Slow heartbeat ping on front interface from osd.16 to osd.6 25203.634 msec
    Slow heartbeat ping on front interface from osd.14 to osd.7 3132.751 msec
    Slow heartbeat ping on front interface from osd.14 to osd.6 3132.705 msec
    Slow heartbeat ping on front interface from osd.15 to osd.2 1651.816 msec possibly improving

Edit 2:

My errors disappeared without any change to the network. I only restarted osd.2 with systemctl restart ceph-osd@2.service and changed nothing else. The errors are all gone now.
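For anyone else hitting this, a hedged sketch of that workflow (osd.2 is just the example from above; use whatever OSD ids ceph health detail reports on your cluster):

Code:
# list the OSD pairs currently reporting slow heartbeats
ceph health detail | grep 'Slow heartbeat ping'

# on the node hosting the affected OSD, restart just that OSD daemon
systemctl restart ceph-osd@2.service

# check whether the warning clears
ceph -s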
 
Same here, 3 node cluster, up to date. Just saw the kernel updates today.

Suspected network issues, but nothing to be found.

The issue went away after restarting the related OSD services...
 
Just upgraded all nodes to 6.3-3.

I have to point out I am tracking Ceph 15.2.6...

Shortly after booting and migrating some VMs across nodes, the same thing happened, this time on different OSDs. Restarting the services of both OSDs fixed it.

Code:
proxmox-ve: 6.3-1 (running kernel: 5.4.78-2-pve)
pve-manager: 6.3-3 (running version: 6.3-3/eee5f901)
pve-kernel-5.4: 6.3-3
pve-kernel-helper: 6.3-3
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.78-1-pve: 5.4.78-1
pve-kernel-5.4.73-1-pve: 5.4.73-1
pve-kernel-5.4.65-1-pve: 5.4.65-1
pve-kernel-5.4.60-1-pve: 5.4.60-2
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph: 15.2.6-pve1
ceph-fuse: 15.2.6-pve1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.5
libproxmox-backup-qemu0: 1.0.2-1
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.3-2
libpve-guest-common-perl: 3.1-3
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.3-3
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.0.5-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-3
pve-cluster: 6.2-1
pve-container: 3.3-1
pve-docs: 6.3-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.1.0-7
pve-xtermjs: 4.7.0-3
qemu-server: 6.3-2
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.5-pve1
 
This problem has arisen with our hyperconverged Ceph too .. when? Immediately after upgrading to Proxmox 7.2 and rebooting into the new 5.15 kernel. It has come up on all Ceph nodes (Pacific 16.2.6). They are Dell R740s running Mellanox 25Gb fiber cards and 100% NVMe drives. Everything was running fine on Proxmox 7.1 with kernel 5.13.19.

I would assume this is a bug, either in Proxmox code or a change in the kernel .. any suggestions? Should this be reported as a bug, or handled some other way?
 
Bump .. also, I got the Ceph version wrong above: it is 16.2.7, not 16.2.6, and has been since the Proxmox team released it.
Also, all servers in question are on the Proxmox Enterprise repo.
 
I don't have this problem anymore (I'm also on 7.2), but as you can see from the history I had it in Dec 2020. I don't think it's a bug, more likely a short interruption on the network or similar; it usually appears when you have network issues. You can fix it by restarting the OSDs that are affected by this issue (after you have fixed the network, or after it is up again).

Maybe it happens if you reboot a node without the usually needed global flags? Could be something .. but I'm not totally sure. Maybe Proxmox can tell more about it.
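For context, a sketch of the global flags usually meant here, set before rebooting a Ceph node and removed afterwards (adjust to your own maintenance procedure):

Code:
# keep Ceph from marking OSDs out / rebalancing during the short outage
ceph osd set noout
ceph osd set norebalance

# ... reboot the node ...

# once the node and its OSDs are back up and in
ceph osd unset noout
ceph osd unset norebalance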
 
No, because there have been no network issues. I have also restarted various OSDs, and that has done nothing. The issue didn't appear on any previous kernel or version of Proxmox, and now it comes up with kernel 5.15.

This is absolutely related to the new kernel or the new Proxmox binaries brought in as of 7.2 in some way ...

Unless you are using Dell R720XD machines fully outfitted with U.2 NVMe drives and Mellanox 25Gb network cards, I'm not sure you could make a direct comparison. Thanks for the attempt at explaining this, though.

I may have to reboot each node to the previous 5.13.19-6 kernel and leave it that way until Proxmox does something about this on the new kernel version... however, I'm waiting for someone at the top to say something about this that might shed some light on things.

Thanks again
 
FYI .. I reverted all Ceph nodes to the 5.13.19-6 kernel and the OSD heartbeat issue on front and back disappeared. Clearly something changed with the new 5.15 kernel. It'd be nice for anyone else having this issue to pipe up and say so, to make it clear we aren't the only ones.
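For anyone wanting to stay on the older kernel the same way, a hedged sketch (this assumes the node boots via proxmox-boot-tool and that your pve-kernel-helper version already ships the kernel pin subcommand):

Code:
# show kernels known to the boot tool
proxmox-boot-tool kernel list

# keep booting the older kernel by default
proxmox-boot-tool kernel pin 5.13.19-6-pve

# later, return to the newest installed kernel
proxmox-boot-tool kernel unpin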
 
I'm in the same boat. Five node cluster, with everything up to date, and I'm getting slow OSD heartbeats on back and on front.
Code:
proxmox-ve: 7.2-1 (running kernel: 5.15.35-1-pve)
pve-manager: 7.2-4 (running version: 7.2-4/ca9d43cc)
pve-kernel-5.15: 7.2-3
pve-kernel-helper: 7.2-3
pve-kernel-5.13: 7.1-9
pve-kernel-5.15.35-1-pve: 5.15.35-3
pve-kernel-5.13.19-6-pve: 5.13.19-15
pve-kernel-5.13.19-2-pve: 5.13.19-4
pve-kernel-4.15: 5.4-16
pve-kernel-4.15.18-27-pve: 4.15.18-55
pve-kernel-4.15.18-9-pve: 4.15.18-30
ceph: 15.2.16-pve1
ceph-fuse: 15.2.16-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: 0.8.36+pve1
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.1-8
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.2-1
libpve-guest-common-perl: 4.1-2
libpve-http-server-perl: 4.1-2
libpve-storage-perl: 7.2-4
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.12-1
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.2.1-1
proxmox-backup-file-restore: 2.2.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.1
pve-cluster: 7.2-1
pve-container: 4.2-1
pve-docs: 7.2-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.4-2
pve-ha-manager: 3.3-4
pve-i18n: 2.7-2
pve-qemu-kvm: 6.2.0-7
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-3
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.4-pve1
 
Not happy to hear you're having the same issue, but glad to know someone else is seeing it. It's also interesting to note that you are running a slightly newer kernel version than we were when we had the issue, which tells me I made the right choice not to update to the latest kernel yet .. we are still running 5.13.19-6-pve and there are no issues with slow OSD heartbeats on back and front. Our problems happened specifically on the 5.15.30-2-pve kernel.

Thanks
 
Something to add to this OSD heartbeat problem with the new kernel ... on nodes that only execute VMs, with the new kernel, VMs will just shut down for no reason .. bluescreen sometimes .. backed those nodes down to kernel 5.13.19-6 and that weirdness stopped too
 
To add to this .. since the Proxmox 7.2 update we've been seeing this weird issue .. after rebooting, servers come up and give an error when you try to view them through the WebGUI .. can't remember the error exactly, maybe error 595? Anyway, you have to ssh to the server and run
systemctl restart pve-cluster
After that it's fine and you can migrate VMs on and off of the node.
It's as if pve-cluster is being started before networking is fully operational during bootup.

Sorry if this belongs in a different thread .. it seems related to me because it's something that has happened since Proxmox 7.2 and since

If you feel I should repost this somewhere else, I'm all ears ..
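A hedged way to check that theory on an affected node, using only standard systemd tooling (nothing Proxmox-specific is assumed beyond the pve-cluster and networking units):

Code:
# the workaround after such a boot
systemctl restart pve-cluster

# what pve-cluster logged during this boot
journalctl -b -u pve-cluster

# compare timestamps with when the network actually came up
journalctl -b -u networking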
 
@midsize_erp - it would seem @Richard isn't answering .. perhaps because they already know about the regressions and are actively working on the issue? I'm really hoping so, because I don't like being stuck on the 5.13.19-6-pve kernel when there are already 2 newer kernel versions available in the Enterprise repo.

The problem absolutely appears to be related to a kernel regression .. I've had issues on both of the two latest 5.15 kernels
 
So I've had a couple of days of stability (the two nodes that were misbehaving have not been restarting daily). Perhaps 5.15.35-2-pve has solved something?


Code:
proxmox-ve: 7.2-1 (running kernel: 5.15.35-2-pve)
pve-manager: 7.2-4 (running version: 7.2-4/ca9d43cc)
pve-kernel-5.15: 7.2-4
pve-kernel-helper: 7.2-4
pve-kernel-5.13: 7.1-9
pve-kernel-5.4: 6.4-4
pve-kernel-5.15.35-2-pve: 5.15.35-5
pve-kernel-5.15.35-1-pve: 5.15.35-3
pve-kernel-5.13.19-6-pve: 5.13.19-15
pve-kernel-5.13.19-2-pve: 5.13.19-4
pve-kernel-5.4.124-1-pve: 5.4.124-1
pve-kernel-5.4.106-1-pve: 5.4.106-1
ceph: 15.2.16-pve1
ceph-fuse: 15.2.16-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: 0.8.36+pve1
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve1
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-2
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.2-2
libpve-guest-common-perl: 4.1-2
libpve-http-server-perl: 4.1-2
libpve-storage-perl: 7.2-4
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.12-1
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.2.3-1
proxmox-backup-file-restore: 2.2.3-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.1
pve-cluster: 7.2-1
pve-container: 4.2-1
pve-docs: 7.2-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.4-2
pve-ha-manager: 3.3-4
pve-i18n: 2.7-2
pve-qemu-kvm: 6.2.0-10
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-3
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.4-pve1
 
I really hope you are right .. hopefully your issues stay gone, as a sign that the latest kernel did indeed take care of at least some of the issues .. I am still running my Ceph nodes on 5.13.19-6-pve because I simply can't have those problems .. I'm running my VM execution nodes on 5.15.35-1-pve though, and we have VMs just shutting down out of nowhere constantly .. I'm hoping that both the Ceph issue and these odd VM behavioral issues have disappeared
 
Ok, so, I've updated all the nodes in our cluster to the latest 5.15.35-2-pve .. things seem to be better so far .. It's only been a little over 3 hours, but so far no anomalies: no Ceph slow ping errors, and very good VM performance and responsiveness. Time will tell if this is truly as stable as we'd hope it to be, i.e. as stable as 5.13.19-6-pve was.
 
