Ceph 14.2.5 - get_health_metrics reporting 1 slow ops

psionic
Did upgrades today that included Ceph 14.2.5. I had to restart all OSDs, Monitors, and Managers.

After restarting all Monitors and Managers, I was still getting these errors every 5 seconds:

Dec 17 21:59:05 pve11 ceph-mon[3925461]: 2019-12-17 21:59:05.214 7f29ff2c5700 -1 mon.pve11@0(leader) e5 get_health_metrics reporting 1 slow ops, oldest is osd_failure(failed timeout osd.13 [v2:10.10.3.14:6802/3287700,v1:10.10.3.14:6805/3287700] for 25sec e3725 v3725)
Dec 17 21:59:10 pve11 ceph-mon[3925461]: 2019-12-17 21:59:10.214 7f29ff2c5700 -1 mon.pve11@0(leader) e5 get_health_metrics reporting 1 slow ops, oldest is osd_failure(failed timeout osd.13 [v2:10.10.3.14:6802/3287700,v1:10.10.3.14:6805/3287700] for 25sec e3725 v3725)
Dec 17 21:59:15 pve11 ceph-mon[3925461]: 2019-12-17 21:59:15.214 7f29ff2c5700 -1 mon.pve11@0(leader) e5 get_health_metrics reporting 1 slow ops, oldest is osd_failure(failed timeout osd.13 [v2:10.10.3.14:6802/3287700,v1:10.10.3.14:6805/3287700] for 25sec e3725 v3725)
Dec 17 21:59:20 pve11 ceph-mon[3925461]: 2019-12-17 21:59:20.214 7f29ff2c5700 -1 mon.pve11@0(leader) e5 get_health_metrics reporting 1 slow ops, oldest is osd_failure(failed timeout osd.13 [v2:10.10.3.14:6802/3287700,v1:10.10.3.14:6805/3287700] for 25sec e3725 v3725)
Dec 17 21:59:25 pve11 ceph-mon[3925461]: 2019-12-17 21:59:25.218 7f29ff2c5700 -1 mon.pve11@0(leader) e5 get_health_metrics reporting 1 slow ops, oldest is osd_failure(failed timeout osd.13 [v2:10.10.3.14:6802/3287700,v1:10.10.3.14:6805/3287700] for 25sec e3725 v3725)
Dec 17 21:59:30 pve11 ceph-mon[3925461]: 2019-12-17 21:59:30.218 7f29ff2c5700 -1 mon.pve11@0(leader) e5 get_health_metrics reporting 1 slow ops, oldest is osd_failure(failed timeout osd.13 [v2:10.10.3.14:6802/3287700,v1:10.10.3.14:6805/3287700] for 25sec e3725 v3725)
Dec 17 21:59:35 pve11 ceph-mon[3925461]: 2019-12-17 21:59:35.218 7f29ff2c5700 -1 mon.pve11@0(leader) e5 get_health_metrics reporting 1 slow ops, oldest is osd_failure(failed timeout osd.13 [v2:10.10.3.14:6802/3287700,v1:10.10.3.14:6805/3287700] for 25sec e3725 v3725)
Dec 17 21:59:40 pve11 ceph-mon[3925461]: 2019-12-17 21:59:40.218 7f29ff2c5700 -1 mon.pve11@0(leader) e5 get_health_metrics reporting 1 slow ops, oldest is osd_failure(failed timeout osd.13 [v2:10.10.3.14:6802/3287700,v1:10.10.3.14:6805/3287700] for 25sec e3725 v3725)
Dec 17 21:59:45 pve11 ceph-mon[3925461]: 2019-12-17 21:59:45.218 7f29ff2c5700 -1 mon.pve11@0(leader) e5 get_health_metrics reporting 1 slow ops, oldest is osd_failure(failed timeout osd.13 [v2:10.10.3.14:6802/3287700,v1:10.10.3.14:6805/3287700] for 25sec e3725 v3725)
Dec 17 21:59:50 pve11 ceph-mon[3925461]: 2019-12-17 21:59:50.221 7f29ff2c5700 -1 mon.pve11@0(leader) e5 get_health_metrics reporting 1 slow ops, oldest is osd_failure(failed timeout osd.13 [v2:10.10.3.14:6802/3287700,v1:10.10.3.14:6805/3287700] for 25sec e3725 v3725)
Dec 17 21:59:55 pve11 ceph-mon[3925461]: 2019-12-17 21:59:55.221 7f29ff2c5700 -1 mon.pve11@0(leader) e5 get_health_metrics reporting 1 slow ops, oldest is osd_failure(failed timeout osd.13 [v2:10.10.3.14:6802/3287700,v1:10.10.3.14:6805/3287700] for 25sec e3725 v3725)
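For anyone hitting the same messages, the op the monitor is complaining about can be inspected through its admin socket before reaching for a restart. A minimal sketch, assuming the mon id matches the hostname pve11 as in the log above:

Code:
# show the SLOW_OPS detail the health system has recorded
ceph health detail
# dump the ops currently in flight on this monitor (run on pve11 itself)
ceph daemon mon.pve11 ops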

Restarting all Monitors and Managers a second time has so far stopped the error messages...
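For reference, that second round of restarts can be done per node with the standard systemd units. A sketch, assuming the daemon ids match the hostnames, as is usual on Proxmox:

Code:
# on each node, restart the local monitor and manager, then re-check
systemctl restart ceph-mon@pve11.service
systemctl restart ceph-mgr@pve11.service
ceph -s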
 
Do you have the crash plugin active? We noticed that under certain circumstances some logs were produced until a second restart of the managers was done. But we saw no harm whatsoever, it was "just" the manager, and in very plain default setups it did not happen at all, so we did not think about that too much.
 

ceph mgr module enable crash
module 'crash' is already enabled (always-on)

ceph crash stat
0 crashes recorded
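If anything had been recorded, the individual reports could be listed and inspected with the same module. A sketch; the <crash-id> placeholder stands for an id printed by the first command:

Code:
ceph crash ls
ceph crash info <crash-id>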

Dec 17 21:59:55 pve11 systemd[1]: Stopped Ceph cluster monitor daemon.
Dec 17 21:59:55 pve11 systemd[1]: Started Ceph cluster monitor daemon.
Dec 17 22:00:21 pve11 ceph-mon[1477954]: 2019-12-17 22:00:21.217 7f1b23449700 -1 mon.pve11@0(electing) e5 failed to get devid for : fallback method has serial ''but no model
Dec 17 22:00:21 pve11 ceph-mon[1477954]: 2019-12-17 22:00:21.233 7f1b23449700 -1 mon.pve11@0(electing) e5 failed to get devid for : fallback method has serial ''but no model
Dec 17 22:00:35 pve11 ceph-mon[1477954]: 2019-12-17 22:00:35.321 7f1b23449700 -1 mon.pve11@0(electing) e5 failed to get devid for : fallback method has serial ''but no model
Dec 18 00:00:00 pve11 ceph-mon[1477954]: 2019-12-18 00:00:00.088 7f1b29c56700 -1 received signal: Hangup from killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw (PID: 1826421) UID: 0
Dec 18 00:00:00 pve11 ceph-mon[1477954]: 2019-12-18 00:00:00.144 7f1b29c56700 -1 received signal: Hangup from (PID: 1826422) UID: 0
Dec 18 01:21:27 pve11 ceph-mon[1477954]: 2019-12-18 01:21:26.998 7f1b23449700 -1 mon.pve11@0(electing) e5 failed to get devid for : fallback method has serial ''but no model
Dec 18 01:21:38 pve11 ceph-mon[1477954]: 2019-12-18 01:21:38.073 7f1b23449700 -1 mon.pve11@0(electing) e5 failed to get devid for : fallback method has serial ''but no model
Dec 18 01:21:38 pve11 ceph-mon[1477954]: 2019-12-18 01:21:38.081 7f1b23449700 -1 mon.pve11@0(electing) e5 failed to get devid for : fallback method has serial ''but no model
Dec 18 01:21:46 pve11 ceph-mon[1477954]: 2019-12-18 01:21:46.233 7f1b23449700 -1 mon.pve11@0(electing) e5 failed to get devid for : fallback method has serial ''but no model

Bunch of repeated 'but no model' messages...
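The 'failed to get devid' lines appear to come from the device-health code trying to identify the disk behind the mon store, so they should be cosmetic. What the cluster has managed to identify can be checked with the device commands. A sketch, hostnames as above:

Code:
# devices the cluster knows about, overall and per host
ceph device ls
ceph device ls-by-host pve11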

Dec 17 22:03:32 pve11 systemd[1]: Stopped Ceph cluster manager daemon.
Dec 17 22:03:32 pve11 systemd[1]: Started Ceph cluster manager daemon.
Dec 18 00:00:00 pve11 ceph-mgr[1489626]: 2019-12-18 00:00:00.088 7f626d2a4700 -1 received signal: Hangup from killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw (PID: 1826421) UID: 0
Dec 18 00:00:00 pve11 ceph-mgr[1489626]: 2019-12-18 00:00:00.144 7f626d2a4700 -1 received signal: Hangup from (PID: 1826422) UID: 0

ceph -s
  cluster:
    id:     aff8cf10-e628-4f89-bf4b-01b451d11775
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum pve11,pve12,pve13 (age 43m)
    mgr: pve11(active, since 11h), standbys: pve12, pve13
    osd: 16 osds: 16 up (since 5h), 16 in (since 2w)

  data:
    pools:   1 pools, 256 pgs
    objects: 510.81k objects, 1.9 TiB
    usage:   5.3 TiB used, 24 TiB / 29 TiB avail
    pgs:     256 active+clean

  io:
    client:   396 KiB/s wr, 0 op/s rd, 75 op/s wr

ceph osd df tree
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS TYPE NAME
-1 29.10864 - 29 TiB 5.3 TiB 5.3 TiB 454 KiB 19 GiB 24 TiB 18.38 1.00 - root default
-3 7.27716 - 7.3 TiB 1.3 TiB 1.3 TiB 76 KiB 4.5 GiB 5.9 TiB 18.27 0.99 - host pve11
0 ssd 1.81929 1.00000 1.8 TiB 362 GiB 361 GiB 16 KiB 1.2 GiB 1.5 TiB 19.45 1.06 51 up osd.0
1 ssd 1.81929 1.00000 1.8 TiB 314 GiB 313 GiB 28 KiB 1024 MiB 1.5 TiB 16.85 0.92 44 up osd.1
2 ssd 1.81929 1.00000 1.8 TiB 386 GiB 385 GiB 12 KiB 1.2 GiB 1.4 TiB 20.71 1.13 54 up osd.2
3 ssd 1.81929 1.00000 1.8 TiB 299 GiB 298 GiB 20 KiB 1024 MiB 1.5 TiB 16.05 0.87 42 up osd.3
-5 7.27716 - 7.3 TiB 1.3 TiB 1.3 TiB 147 KiB 4.6 GiB 6.0 TiB 17.63 0.96 - host pve12
4 ssd 1.81929 1.00000 1.8 TiB 308 GiB 306 GiB 36 KiB 1.3 GiB 1.5 TiB 16.52 0.90 43 up osd.4
5 ssd 1.81929 1.00000 1.8 TiB 399 GiB 397 GiB 31 KiB 1.3 GiB 1.4 TiB 21.40 1.16 56 up osd.5
6 ssd 1.81929 1.00000 1.8 TiB 328 GiB 327 GiB 44 KiB 1.0 GiB 1.5 TiB 17.61 0.96 46 up osd.6
7 ssd 1.81929 1.00000 1.8 TiB 279 GiB 278 GiB 36 KiB 1024 MiB 1.5 TiB 14.97 0.81 39 up osd.7
-7 7.27716 - 7.3 TiB 1.4 TiB 1.4 TiB 88 KiB 4.9 GiB 5.9 TiB 19.31 1.05 - host pve13
8 ssd 1.81929 1.00000 1.8 TiB 384 GiB 383 GiB 36 KiB 1.1 GiB 1.4 TiB 20.62 1.12 54 up osd.8
9 ssd 1.81929 1.00000 1.8 TiB 307 GiB 306 GiB 24 KiB 1.0 GiB 1.5 TiB 16.47 0.90 43 up osd.9
10 ssd 1.81929 1.00000 1.8 TiB 354 GiB 353 GiB 16 KiB 1.1 GiB 1.5 TiB 19.00 1.03 50 up osd.10
11 ssd 1.81929 1.00000 1.8 TiB 394 GiB 393 GiB 12 KiB 1.7 GiB 1.4 TiB 21.16 1.15 55 up osd.11
-9 7.27716 - 7.3 TiB 1.3 TiB 1.3 TiB 143 KiB 4.6 GiB 5.9 TiB 18.30 1.00 - host pve14
12 ssd 1.81929 1.00000 1.8 TiB 279 GiB 278 GiB 23 KiB 1.2 GiB 1.5 TiB 15.00 0.82 39 up osd.12
13 ssd 1.81929 1.00000 1.8 TiB 451 GiB 450 GiB 40 KiB 1.2 GiB 1.4 TiB 24.20 1.32 63 up osd.13
14 ssd 1.81929 1.00000 1.8 TiB 335 GiB 334 GiB 52 KiB 1.0 GiB 1.5 TiB 17.98 0.98 47 up osd.14
15 ssd 1.81929 1.00000 1.8 TiB 299 GiB 298 GiB 28 KiB 1.1 GiB 1.5 TiB 16.04 0.87 42 up osd.15
TOTAL 29 TiB 5.3 TiB 5.3 TiB 458 KiB 19 GiB 24 TiB 18.38
MIN/MAX VAR: 0.81/1.32 STDDEV: 2.58
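Since the original osd_failure message singled out osd.13 even though it looks healthy in the tree above, its latency can be cross-checked as well. A sketch; note that the bench command generates real write load:

Code:
# per-OSD commit/apply latency as reported by the mgr
ceph osd perf
# optional: a short synthetic write benchmark against the suspect OSD
ceph tell osd.13 bench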
 
Just some more data for this: we had the same issue on one of our nodes today. We're running 14.2.5-pve1. "ceph -s" showed

Code:
13 slow ops, oldest one blocked for 74234 sec, mon.hv4 has slow ops

On node hv4 we were seeing

Code:
Dec 22 13:17:58 hv4 ceph-mon[2871]: 2019-12-22 13:17:58.475 7f552ad45700 -1 mon.hv4@0(leader) e22 get_health_metrics reporting 13 slow ops, oldest is osd_failure(failed timeout osd.6

According to "ceph osd df tree", osd.6 appeared OK:

Code:
ID CLASS WEIGHT   REWEIGHT SIZE    RAW USE DATA    OMAP    META     AVAIL   %USE VAR  PGS STATUS TYPE NAME       
 6   ssd  1.81929  1.00000 1.8 TiB  53 GiB  52 GiB   4 KiB 1024 MiB 1.8 TiB 2.85 0.97 133     up         osd.6

'ceph crash stat' only shows a couple of old crashes (over a month ago), nothing since the upgrade. I restarted all ceph processes on hv4, which then logged the error below:

Code:
Dec 22 13:21:26 ed-hv4 ceph-mon[1714029]: 2019-12-22 13:21:26.890 7f794f22f700 -1 mon.hv4@0(electing) e22 failed to get devid for : fallback method has serial ''but no model

Since then all has appeared OK.

Code:
ceph -s
  cluster:
    id:     f9441e6e-4288-47df-b174-8a432726870c
    health: HEALTH_OK
 
  services:
    mon: 4 daemons, quorum hv4,hv3,hv2,hv1 (age 16m)
    mgr: hv1(active, since 3d), standbys: hv2, hv3, hv4
    osd: 24 osds: 24 up (since 15m), 24 in (since 5w)
 
  data:
    pools:   1 pools, 1024 pgs
    objects: 131.34k objects, 511 GiB
    usage:   1.2 TiB used, 40 TiB / 41 TiB avail
    pgs:     1024 active+clean
 
  io:
    client:   0 B/s rd, 71 KiB/s wr, 0 op/s rd, 5 op/s wr
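For completeness, restarting all Ceph processes on a single node, as described above, can be done in one go through the umbrella systemd target rather than per daemon. A sketch:

Code:
# restart every Ceph daemon on this node (mon, mgr, osd, ...)
systemctl restart ceph.target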
 
Some further info in case it's useful. Looking into this, the slow-ops log messages started after a brief network outage yesterday (a spanning-tree topology change from a failover test). The messages below were logged in addition to the slow-ops messages, once per minute after the network change, stopping at the top of the hour (i.e. from 16:43 through to 17:00). After that the slow ops warning continued every 5 seconds until the processes were restarted.

Code:
2019-12-21 16:58:03.658443 mon.hv4 (mon.0) 44762 : cluster [WRN] Health check update: Long heartbeat ping times on back interface seen, longest is 1204.049 msec (OSD_SLOW_PING_TIME_BACK)
2019-12-21 16:58:03.658496 mon.hv4 (mon.0) 44763 : cluster [WRN] Health check update: Long heartbeat ping times on front interface seen, longest is 1222.866 msec (OSD_SLOW_PING_TIME_FRONT)
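The OSD_SLOW_PING_TIME warnings were added around this release; whether the heartbeat times have settled again can be checked per OSD via the dump_osd_network admin socket command, assuming it is available on this build. A sketch, run on the node hosting osd.6:

Code:
# entries above the default 1000 ms threshold
ceph daemon osd.6 dump_osd_network
# pass a threshold in milliseconds to see everything
ceph daemon osd.6 dump_osd_network 0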
 
I have had this issue (1 slow ops) since a network crash 10 days ago. Restarting managers and monitors helps for a while, then the slow ops start again.
We are using ceph: 14.2.9-pve1.

All the storage tests OK per smartctl.

Attached is a daily log report from our central rsyslog server.

The issue started when power went off and on about 30 times due to a bad battery on one of two large UPSes. The Ceph storage switches have two power supplies each, and mistakenly I had both cords plugged into the same UPS.

In addition, one of the 7 nodes had both power cords plugged into the same power source.

So I wonder if the slow ops warning is caused by hardware damage.

Before going down the path of trying to look for bad hardware, I figured I'd post this question.

Are others seeing slow ops warnings?
Code:
grep get_health_metrics /var/log/syslog
Another string to check for:
Code:
dmesg -T | grep /proc/sys/kernel/hung_task_timeout_secs

Most of our slow ops occurred during backups. However, 3 to 4 times a week they have occurred during seemingly non-busy times.
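Before chasing hardware, it may help to check how tightly the slow ops line up with the backup window by bucketing the syslog hits per day. A minimal sketch against the same syslog file:

Code:
# count slow-op reports per day; use $1" "$2" "substr($3,1,2) instead to bucket per hour
grep get_health_metrics /var/log/syslog | awk '{print $1" "$2}' | sort | uniq -c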
 

