[SOLVED] Ceph Slow Ops Reported

markp

Member
Sep 28, 2020
Houston, Texas
I have a brand new, completely clean 6.2 installation that was completed on Friday. No VMs yet, just 4 identical machines (hhcs000[1-4]). On two of the three monitors, hhcs0002 and hhcs0003, I am getting a lot of these in the logs, while hhcs0001 seems fine:

mon.hhsc0002@1(peon) e5 get_health_metrics reporting 4 slow ops, oldest is mon_command({"prefix":"pg dump","dumpcontents":["osds"],"format":"json"} v 0)

If I look at the health it shows that the monitors are crashing.

Code:
# ceph health detail
HEALTH_WARN 5 daemons have recently crashed
RECENT_CRASH 5 daemons have recently crashed
    mon.hhcs0003 crashed on host hhcs0003 at 2020-09-25 23:58:24.642998Z
    mon.hhsc0002 crashed on host hhsc0002 at 2020-09-25 23:58:39.260149Z
    mon.hhsc0002 crashed on host hhsc0002 at 2020-09-26 00:17:37.328757Z
    mon.hhcs0003 crashed on host hhcs0003 at 2020-09-26 00:18:18.605781Z
    mon.hhcs0003 crashed on host hhcs0003 at 2020-09-26 00:38:00.457194Z
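For anyone chasing the same get_health_metrics warning: the slow ops themselves can usually be inspected via the monitor's admin socket on the affected node. A minimal sketch, assuming Nautilus-era Ceph (PVE 6.2) and using the mon name from the post above; the 'ops' command may not be available on older releases:

Code:
# run on the node hosting the affected monitor
ceph daemon mon.hhcs0002 mon_status   # state and quorum membership of this mon
ceph daemon mon.hhcs0002 ops          # in-flight (slow) ops tracked by this mon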

I originally had four monitors, but after reading these forums I destroyed one on Friday during troubleshooting (hhcs0004 still shows up in the GUI and likely needs to be addressed; see the cleanup sketch after the tree output below). The OSD tree only displays sporadically in the web GUI, and that doesn't appear to be related to which machine I logged in to or am using to display it at the time. The OSD tree looks good as far as I know and the command returns immediately on all 4 systems. The DBs are stored on the OSD disks.
Code:
# ceph osd tree 
ID CLASS WEIGHT  TYPE NAME         STATUS REWEIGHT PRI-AFF 
-1       1.74640 root default                              
-3       0.43660     host hhcs0001                         
 0   ssd 0.43660         osd.0         up  1.00000 1.00000 
-7       0.43660     host hhcs0003                         
 2   ssd 0.43660         osd.2         up  1.00000 1.00000 
-9       0.43660     host hhcs0004                         
 3   ssd 0.43660         osd.3         up  1.00000 1.00000 
-5       0.43660     host hhsc0002                         
 1   ssd 0.43660         osd.1         up  1.00000 1.00000
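On the leftover hhcs0004 monitor entry mentioned above: a hedged sketch of cleaning it up, assuming only a stale monmap/GUI entry remains and nothing needs to be preserved. These are standard pveceph/ceph calls; check ceph.conf afterwards for a leftover [mon.hhcs0004] section.

Code:
# preferred, run on the node itself: stops the daemon and removes it from the monmap and ceph.conf
pveceph mon destroy hhcs0004
# fallback if only a stale monmap entry is left behind
ceph mon remove hhcs0004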

Digging deeper into the logs shows: https://pastebin.com/sUnQuZX

The hardware and configuration are identical for each machine:
  • CPU: Intel(R) Xeon(R) Silver 4210R CPU @ 2.40GHz
  • Memory: 256 GB
  • Storage:
    • 2x Intel SSD DC 3510 120GB in a ZFS RAID 1 mirror (via the installer) for the OS
    • 1x Intel® SSD D3-S4510 480GB used for the OSD
  • 2x Intel X710 10GB network cards
    • bond0 802.3ad for frontend network, used for vmbr0
    • bond1 802.3ad for backend network, CEPH/Corosync (will likely split corosync out to shared port with BMC)
  • Baseboard: Supermicro X11DPT-PS
  • /etc/pve/ceph.conf (I did notice, just now, that the public network and the cluster network are the same in ceph.conf, and that's not accurate, but I don't think it would cause what I'm seeing. I will, however, wait to change that until I get this sorted or it's identified as a possible cause. A hedged example of a split configuration follows this list.)
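Since the ceph.conf pastebin is gone, here is a rough sketch of what a split public/cluster network would look like in /etc/pve/ceph.conf. The subnets are invented placeholders, not the poster's real ranges:

Code:
[global]
     # example subnets only; use routed, non link-local ranges from your own plan
     public_network  = 10.10.10.0/24    # frontend: MONs and clients
     cluster_network = 10.10.20.0/24    # backend: OSD replication/heartbeat traffic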
The install was all done by me, following https://pve.proxmox.com/wiki/Deploy_Hyper-Converged_Ceph_Cluster, and should be bog standard except for the bonds, which I validated are correct using https://pve.proxmox.com/wiki/Network_Configuration. The only thing I can think of that I did differently was using the CLI tools for hhcs0001 and the web GUI for hhcs000[2-4]. I can't imagine that making a difference, but I figured anything is worth mentioning. I don't know where to look next to get the ceph-mon bits stabilized.
 
Digging deeper into the logs shows: https://pastebin.com/sUnQuZX
The paste doesn't exist anymore.

RECENT_CRASH 5 daemons have recently crashed
Can you please post one of the crash logs?
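For reference, the recorded crashes can usually be pulled with the crash module or read from disk on the node that crashed; a short sketch with a placeholder crash ID:

Code:
ceph crash ls                          # list crash IDs known to the cluster
ceph crash info <crash-id>             # backtrace and metadata for one crash
# or read the raw dump on the affected node
less /var/lib/ceph/crash/<crash-id>/meta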

The only thing I can think of that I did differently was using the CLI tools for hhcs0001 and the web GUI for hhcs000[2-4]. I can't imagine that making a difference, but I figured anything is worth mentioning. I don't know where to look next to get the ceph-mon bits stabilized.
The pveceph CLI and the GUI use the same API. There should not be much difference.
 
A new error, or one I hadn't seen before:

Code:
Sep 28 14:58:30 hhsc0002 ceph-crash[1910]: WARNING:__main__:post /var/lib/ceph/crash/2020-09-28_19:41:51.826711Z_f86ab357-b41f-4cb4-b1ae-04eab176e905 as client.crash.hhsc0002 failed: [errno 2] error connecting to the cluster
Sep 28 14:58:30 hhsc0002 ceph-crash[1910]: WARNING:__main__:post /var/lib/ceph/crash/2020-09-28_19:41:51.826711Z_f86ab357-b41f-4cb4-b1ae-04eab176e905 as client.crash failed: [errno 2] error connecting to the cluster

Crash info: https://pastebin.com/994gZXSF
 
In the ceph.conf I can see that you use 169.254.0.0/16; this is a reserved range for link-local addresses.
https://en.wikipedia.org/wiki/Reserved_IP_addresses said:
Used for link-local addresses[6] between two hosts on a single link when no IP address is otherwise specified, such as would have normally been retrieved from a DHCP server.
Since there seems to be no data on the cluster, destroy all Ceph daemons and start fresh with one of the private IP ranges.
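A rough outline of that "start fresh" sequence with Proxmox VE 6.x tooling, assuming there really is no data worth keeping; device, ID and subnet values are placeholders:

Code:
# on every node: stop and remove the OSDs and MONs first
ceph osd out <osd-id>
systemctl stop ceph-osd@<osd-id>.service
pveceph osd destroy <osd-id> --cleanup
pveceph mon destroy <node-name>
# wipe the local Ceph configuration on each node
pveceph purge
# re-initialise with a private, routed subnet and recreate the daemons
pveceph init --network 10.10.10.0/24
pveceph mon create
pveceph osd create /dev/sdX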

If you search 'ceph_assert((sharded_in_flight_list.back())->ops_in_flight_sharded.empty())', it seems that it has to do with OSD shutdown.
https://tracker.ceph.com/issues/42347
https://tracker.ceph.com/issues/36546

Is there anything more in the logs and in general in the syslog?
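For "anything more in the logs", the usual places on a PVE/Ceph node are the per-daemon journal units and Ceph's own log directory; a quick sketch with placeholder daemon names:

Code:
journalctl -b -u ceph-mon@<hostname>        # monitor service log since boot
journalctl -b -u ceph-osd@<osd-id>          # OSD service log since boot
less /var/log/ceph/ceph-mon.<hostname>.log  # Ceph's mon log file
less /var/log/ceph/ceph.log                 # cluster log (present on MON nodes)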

  • bond1 802.3ad for backend network, CEPH/Corosync (will likely split corosync out to shared port with BMC)
And BTW, separate corosync onto its own physical NIC port. Shared media introduces interference, and corosync needs low and stable latency. Best use two corosync links for redundancy.
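On the two-links suggestion: with corosync 3 (kronosnet), a second link is just an additional address per node in /etc/pve/corosync.conf. A hedged sketch of the relevant pieces only; the addresses are invented, and config_version must be bumped whenever the file is edited:

Code:
nodelist {
  node {
    name: hhcs0001
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.30.1   # dedicated corosync NIC (example address)
    ring1_addr: 10.10.40.1   # second, independent link (example address)
  }
  # ...one block per node...
}
totem {
  # ...
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
}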
 
Sorry I didn't respond sooner. The errors are no longer happening; it appeared to be the result of my creating, then destroying, CephFS. Even 'df' was hanging. In the end it was my own mistake. Once corrected, everything is up, running, and stable. We are testing all of the aspects of HA and things are going really well.

I have moved corosync to its own network.
 
The errors are no longer happening; it appeared to be the result of my creating, then destroying, CephFS.
That operation does not crash the MONs; otherwise one could never remove CephFS once created. As said above, the IP range 169.254.0.0/16 is reserved and should not be used.
 
CephFS was still mounted by the OS; once it was unmounted the errors stopped. I can't explain that other than it being the only thing that changed in the environment.

Yes, I am aware the 169.254.0.0/16 range is reserved for link-local in the RFC. However, none of these machines have any link-local devices. Unfortunately I don't get to choose the IPs I use in my environment. I did get a new range assigned and will be changing it today. However, nothing prevents the range from working from a technical perspective.
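For anyone who ends up in the same spot (CephFS destroyed while the kernel mount was still active), a quick way to find and drop the stale mount; the mount point below is the Proxmox default and may differ:

Code:
findmnt -t ceph                     # list kernel CephFS mounts
umount /mnt/pve/<storage-id>        # normal unmount
umount -l /mnt/pve/<storage-id>     # lazy unmount if processes still hang on it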
 
CephFS was still mounted by the OS; once it was unmounted the errors stopped. I can't explain that other than it being the only thing that changed in the environment.
The slow ops or the crashes?

However, nothing prevents the range from working from a technical perspective.
Should be. But the use of that range might trigger some condition in Ceph or the kernel, since it is most likely not a tested scenario.
 
The slow ops or the crashes?


Should be. But the use of that range might trigger some condition in Ceph or the kernel, since it is most likely not a tested scenario.
The slow ops and crashes went completely away when I unmounted the no-longer-existing CephFS mount. I don't know enough about Ceph to comment on why it happened that way. I didn't unmount the filesystem prior to destroying CephFS, so I'm blaming it on there being a nut loose behind the keyboard.

And I will be changing the subnet used. I also need to separate the Ceph public and private networks and repeat my testing.
 
After accidentally misconfiguring the network switch, I got an endless stream of these messages in syslog:
Code:
Apr 18 14:17:01 asr1 ceph-mon[2262]: 2021-04-18T14:17:01.917+0300 7f1ac94f1700 -1 mon.asr1@0(leader) e3 get_health_metrics reporting 11 slow ops, oldest is osd_failure(failed timeout osd.6 [v2:192.168.121.3:6848/168239,v1:192.168.121.3:6849/168239] for 99sec e4783 v4783)
Apr 18 14:17:06 asr1 ceph-mon[2262]: 2021-04-18T14:17:06.937+0300 7f1ac94f1700 -1 mon.asr1@0(leader) e3 get_health_metrics reporting 11 slow ops, oldest is osd_failure(failed timeout osd.6 [v2:192.168.121.3:6848/168239,v1:192.168.121.3:6849/168239] for 99sec e4783 v4783)
Apr 18 14:17:11 asr1 ceph-mon[2262]: 2021-04-18T14:17:11.957+0300 7f1ac94f1700 -1 mon.asr1@0(leader) e3 get_health_metrics reporting 11 slow ops, oldest is osd_failure(failed timeout osd.6 [v2:192.168.121.3:6848/168239,v1:192.168.121.3:6849/168239] for 99sec e4783 v4783)
Apr 18 14:17:16 asr1 ceph-mon[2262]: 2021-04-18T14:17:16.985+0300 7f1ac94f1700 -1 mon.asr1@0(leader) e3 get_health_metrics reporting 11 slow ops, oldest is osd_failure(failed timeout osd.6 [v2:192.168.121.3:6848/168239,v1:192.168.121.3:6849/168239] for 99sec e4783 v4783)
Apr 18 14:17:22 asr1 ceph-mon[2262]: 2021-04-18T14:17:22.009+0300 7f1ac94f1700 -1 mon.asr1@0(leader) e3 get_health_metrics reporting 11 slow ops, oldest is osd_failure(failed timeout osd.6 [v2:192.168.121.3:6848/168239,v1:192.168.121.3:6849/168239] for 99sec e4783 v4783)
Apr 18 14:17:27 asr1 ceph-mon[2262]: 2021-04-18T14:17:27.037+0300 7f1ac94f1700 -1 mon.asr1@0(leader) e3 get_health_metrics reporting 11 slow ops, oldest is osd_failure(failed timeout osd.6 [v2:192.168.121.3:6848/168239,v1:192.168.121.3:6849/168239] for 99sec e4783 v4783)
Solved it by restarting ceph-mon.
Code:
systemctl restart ceph-mon@asr1.service
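After such a restart it is worth confirming that the monitor rejoined quorum and the slow-op warning cleared; a couple of standard checks:

Code:
ceph -s                                   # overall health and quorum names
ceph quorum_status --format json-pretty   # which MONs are currently in quorum
ceph health detail                        # any remaining SLOW_OPS warnings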
 
