[SOLVED] Ceph Slow Ops Reported

markp

Member
Sep 28, 2020
10
0
6
53
Houston, Texas
I have an brand new, completely clean, 6.2 installation that was completed on Friday. No VMs yet, 4 identical machines (hhcs000[1-4]). And on two of the three monitors I am getting a lot of these in the logs on hhcs0002 and hhcs0003, while hhcs0001 seems fine: mon.hhsc0002@1(peon) e5 get_health_metrics reporting 4 slow ops, oldest is mon_command({"prefix":"pg dump","dumpcontents":["osds"],"format":"json"} v 0). If I look at the health it shows that the monitors are crashing.

Code:
# ceph health detail
HEALTH_WARN 5 daemons have recently crashed
RECENT_CRASH 5 daemons have recently crashed
    mon.hhcs0003 crashed on host hhcs0003 at 2020-09-25 23:58:24.642998Z
    mon.hhsc0002 crashed on host hhsc0002 at 2020-09-25 23:58:39.260149Z
    mon.hhsc0002 crashed on host hhsc0002 at 2020-09-26 00:17:37.328757Z
    mon.hhcs0003 crashed on host hhcs0003 at 2020-09-26 00:18:18.605781Z
    mon.hhcs0003 crashed on host hhcs0003 at 2020-09-26 00:38:00.457194Z

I originally had four monitors but after reading these forums I destroyed on Friday during troubleshooting (hhcs0004 still show up in the gui and likely needs to be addressed). The OSS tree only displays sporadically in the web GUI and doesn't appear to be related to which machine I logged in to or am using to display it at the time. The OSD tree looks good as far as I know and the command returns immediately on all 4 systems. The DBs are stored on the OSD disks.
Code:
# ceph osd tree 
ID CLASS WEIGHT  TYPE NAME         STATUS REWEIGHT PRI-AFF 
-1       1.74640 root default                              
-3       0.43660     host hhcs0001                         
 0   ssd 0.43660         osd.0         up  1.00000 1.00000 
-7       0.43660     host hhcs0003                         
 2   ssd 0.43660         osd.2         up  1.00000 1.00000 
-9       0.43660     host hhcs0004                         
 3   ssd 0.43660         osd.3         up  1.00000 1.00000 
-5       0.43660     host hhsc0002                         
 1   ssd 0.43660         osd.1         up  1.00000 1.00000

Digging deeper in to the logs shows: https://pastebin.com/sUnQuZX

The hardware and configuration are identical for each machine:
  • CPU: Intel(R) Xeon(R) Silver 4210R CPU @ 2.40GHz
  • Memory: 256 GB
  • HDD:
    • 2x INTEL SSD DC 3510 120GB using RAID 1 ZFS via the installer for OS
    • 1x Intel® SSD D3-S4510 480GB SSD used for OSD
  • 2x Intel X710 10GB network cards
    • bond0 802.3ad for frontend network, used for vmbr0
    • bond1 802.3ad for backend network, CEPH/Corosync (will likely split corosync out to shared port with BMC)
  • Baseboard: Supermicro X11DPT-PS
  • /etc/pve/ceph.conf (I did notice, just now, that the public network and the cluster network are the same in ceph.conf and that's not accurate but I don't think it would cause what I'm seeing. I will, however wait to change that until I get this sorted or it's identified as a possible cause.)
The install was all done by me, following: https://pve.proxmox.com/wiki/Deploy_Hyper-Converged_Ceph_Cluster and should be bog standard except for the bonds which I validated are correct using https://pve.proxmox.com/wiki/Network_Configuration. The only thing I can think I did differently was using the CLI tools for hhcs0001 and using the web gui for hhcs000[2-4]. I can't imagine that making a difference but I figured anything is worth mentioning. I don't know where to look next to get the ceph-mon bits stabilized.
 
Digging deeper in to the logs shows: https://pastebin.com/sUnQuZX
The paste doesn't exist anymore.

RECENT_CRASH 5 daemons have recently crashed
Can you please post one of the crash logs?

The only thing I can think I did differently was using the CLI tools for hhcs0001 and using the web gui for hhcs000[2-4]. I can't imagine that making a difference but I figured anything is worth mentioning. I don't know where to look next to get the ceph-mon bits stabilized.
The pveceph and GUI use the same API. There should be not much different.
 
New error, or one I hadn't see before:

Code:
Sep 28 14:58:30 hhsc0002 ceph-crash[1910]: WARNING:__main__:post /var/lib/ceph/crash/2020-09-28_19:41:51.826711Z_f86ab357-b41f-4cb4-b1ae-04eab176e905 as client.crash.hhsc0002 failed: [errno 2] error connecting to the cluster
Sep 28 14:58:30 hhsc0002 ceph-crash[1910]: WARNING:__main__:post /var/lib/ceph/crash/2020-09-28_19:41:51.826711Z_f86ab357-b41f-4cb4-b1ae-04eab176e905 as client.crash failed: [errno 2] error connecting to the cluster

Crash info: https://pastebin.com/994gZXSF
 
In the ceph.conf I can see, that you use the '169.254.0.0/16', this is a reserved range for link-local addresses.
https://en.wikipedia.org/wiki/Reserved_IP_addresses said:
Used for link-local addresses[6] between two hosts on a single link when no IP address is otherwise specified, such as would have normally been retrieved from a DHCP server.
Since there seems to be no data on the cluster, destroy all Ceph daemons and start fresh with one of the private ip ranges.

If you search 'ceph_assert((sharded_in_flight_list.back())->ops_in_flight_sharded.empty())', it seems that it has to do with OSD shutdown.
https://tracker.ceph.com/issues/42347
https://tracker.ceph.com/issues/36546

Is there anything more in the logs and in general in the syslog?

And BTW, separate corosync onto its own physical NIC port. Shared media introduces interference, corosync needs low and stable latency. Best use two corosync links for redundancy.
  • bond1 802.3ad for backend network, CEPH/Corosync (will likely split corosync out to shared port with BMC)
 
Sorry I didn't respond sooner. The errors are no longer happening, it appeared to be the result of my creating, then destroying CephFS. Even 'df' was hanging. In the end it was my own mistake. Once corrected everything is up, running, and stable. We are testing all of the aspects of HA and things are going really well.

I have moved corosync to its own network.
 
The errors are no longer happening, it appeared to be the result of my creating, then destroying CephFS.
That operation does not crash the MONs. Otherwise one could never remove CephFS once created. As said above the ip range '169.254.0.0/16' is reserved and should not be used.
 
CephFS was still mounted by the OS, once it was unmounted the errors stopped. I can't explain that other than it being the only thing that changed in the environment.

Yes I am aware the 169.254 address range is reserved for link-local in the RFC. However none of these machines have any link-local devices. Unfortunately I don't get to choose the IPs I use in my environment. I did get a new range assigned and will be changing it today. However nothing prevents the range from working from a technical perspective.
 
Last edited:
CephFS was still mounted by the OS, once it was unmounted the errors stopped. I can't explain that other than it being the only thing that changed in the environment.
The slow ops or the crashes?

However nothing prevents the range from working from a technical perspective.
Should be. But the use of that range might trigger some condition in Ceph or the kernel, since it is mostly likely not a tested scenario.
 
The slow ops or the crashes?


Should be. But the use of that range might trigger some condition in Ceph or the kernel, since it is mostly likely not a tested scenario.
The slow ops and crashes went completely away when I unmounted the no longer existing CephFS mount. I don't know enough about Ceph to comment on why it happened that way. I didn't unmount the filesystem prior to destroying CephFS so I'm blaming it on there being a nut loose behind the keyboard.

And I will be changing the subnet used. I also need to separate the Ceph public and private networks and repeat my testing.
 
After accidentially misconfiguring the network switch I have got infinitely messages in syslog
Code:
Apr 18 14:17:01 asr1 ceph-mon[2262]: 2021-04-18T14:17:01.917+0300 7f1ac94f1700 -1 mon.asr1@0(leader) e3 get_health_metrics reporting 11 slow ops, oldest is osd_failure(failed timeout osd.6 [v2:192.168.121.3:6848/168239,v1:192.168.121.3:6849/168239] for 99sec e4783 v4783)
Apr 18 14:17:06 asr1 ceph-mon[2262]: 2021-04-18T14:17:06.937+0300 7f1ac94f1700 -1 mon.asr1@0(leader) e3 get_health_metrics reporting 11 slow ops, oldest is osd_failure(failed timeout osd.6 [v2:192.168.121.3:6848/168239,v1:192.168.121.3:6849/168239] for 99sec e4783 v4783)
Apr 18 14:17:11 asr1 ceph-mon[2262]: 2021-04-18T14:17:11.957+0300 7f1ac94f1700 -1 mon.asr1@0(leader) e3 get_health_metrics reporting 11 slow ops, oldest is osd_failure(failed timeout osd.6 [v2:192.168.121.3:6848/168239,v1:192.168.121.3:6849/168239] for 99sec e4783 v4783)
Apr 18 14:17:16 asr1 ceph-mon[2262]: 2021-04-18T14:17:16.985+0300 7f1ac94f1700 -1 mon.asr1@0(leader) e3 get_health_metrics reporting 11 slow ops, oldest is osd_failure(failed timeout osd.6 [v2:192.168.121.3:6848/168239,v1:192.168.121.3:6849/168239] for 99sec e4783 v4783)
Apr 18 14:17:22 asr1 ceph-mon[2262]: 2021-04-18T14:17:22.009+0300 7f1ac94f1700 -1 mon.asr1@0(leader) e3 get_health_metrics reporting 11 slow ops, oldest is osd_failure(failed timeout osd.6 [v2:192.168.121.3:6848/168239,v1:192.168.121.3:6849/168239] for 99sec e4783 v4783)
Apr 18 14:17:27 asr1 ceph-mon[2262]: 2021-04-18T14:17:27.037+0300 7f1ac94f1700 -1 mon.asr1@0(leader) e3 get_health_metrics reporting 11 slow ops, oldest is osd_failure(failed timeout osd.6 [v2:192.168.121.3:6848/168239,v1:192.168.121.3:6849/168239] for 99sec e4783 v4783)
Solved it by restarting ceph-mon.
Code:
systemctl restart ceph-mon@asr1.service