Attention: (Potential) bug in Ceph identified

cmonty14

Hi,
I'm affected by a severe issue with Ceph.
Whenever an incident leaves Ceph in an unhealthy state and it starts recovering, the MGR stops doing its job although the service is still running.
In my case the relevant MGR log is flooded with these error messages:
2019-11-06 17:46:22.363 7f81ffdcc700 0 auth: could not find secret_id=3865
2019-11-06 17:46:22.363 7f81ffdcc700 0 cephx: verify_authorizer could not get service secret for service mgr secret_id=3865


The secret_id is changing though.
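To check how often the secret_id changes, something like the following could be run over the MGR log. This is only a rough sketch: the function name `secret_ids` and the regular expression are my own, based on nothing more than the two log lines quoted above.

```python
import re

# Matches the secret_id field as it appears in the cephx/auth log lines above.
SECRET_ID_RE = re.compile(r"secret_id=(\d+)")


def secret_ids(log_lines):
    """Return the distinct secret_id values, in order of first appearance."""
    seen = []
    for line in log_lines:
        for match in SECRET_ID_RE.finditer(line):
            value = int(match.group(1))
            if value not in seen:
                seen.append(value)
    return seen


# The two messages from this post, with the wrapped line joined.
sample = [
    "2019-11-06 17:46:22.363 7f81ffdcc700 0 auth: could not find secret_id=3865",
    "2019-11-06 17:46:22.363 7f81ffdcc700 0 cephx: verify_authorizer could "
    "not get service secret for service mgr secret_id=3865",
]
print(secret_ids(sample))
```

Passing an open file handle of the MGR log instead of `sample` works the same way, since the function only iterates over lines.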

My report on the Ceph users mailing list went unanswered for a long time, but Sage Weil from Red Hat has now responded:
My current working theory is that the mgr is getting hung up when it tries
to scrape the device metrics from the mon. The 'tell' mechanism used to
send mon-targetted commands is pretty kludgey/broken in nautilus and
earlier. It's been rewritten for octopus, but isn't worth backporting--it
never really caused problems until the devicemanager started using it
heavily.

In any case, this PR just disables scraping of mon devices for nautilus:

https://github.com/ceph/ceph/pull/31446

There is a build queued at


https://shaman.ceph.com/repos/ceph/wip-no-scrape-mons-nautilus/d592e56e$

which should get packages in 1-2 hours.

Perhaps you can install that package on the mgr host and try again to
reproduce it again?

I noticed a few other oddities in the logs while looking through them,
like

https://tracker.ceph.com/issues/42666

which will hopefully have a fix ready for 14.2.5. I'm not sure about that
auth error message, though!

sage


Could you please look into this issue and provide a fix?

THX
 
Here is some more information on the root cause, provided by Ceph developer Sage Weil:
The ceph-mgr package is sufficient.

Note that the only change on top of 14.2.4 is that the mgr devicehealth
module will scrape OSDs only, not mons.

You can probably/hopefully induce the (previously) bad behavior by
triggering a scrape manually with 'ceph device scrape-health-metrics'?

sage
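For anyone who wants to try the manual trigger from a script, here is a minimal sketch. The only Ceph command used is the one quoted in the mail above, without any extra flags. The function name `trigger_scrape`, the injectable `run` parameter and the 300 second timeout are my own assumptions, not anything from the Ceph developers.

```python
import subprocess

# The command quoted in the mail above; no additional flags are assumed.
SCRAPE_CMD = ["ceph", "device", "scrape-health-metrics"]


def trigger_scrape(run=subprocess.run, timeout=300):
    """Run the scrape once and return (exit_code, message).

    `run` is injectable so the function can be tried without a cluster.
    `timeout` (seconds) is an arbitrary choice, not a Ceph default.
    """
    try:
        result = run(SCRAPE_CMD, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        # The command did not return in time; whether that indicates the
        # mgr hang discussed above is a guess, so it is only reported.
        return None, "timed out"
    except FileNotFoundError:
        # The ceph CLI is not installed on this host.
        return None, "ceph command not found"
    return result.returncode, result.stderr
```

Calling `trigger_scrape()` on a host with the ceph CLI and an admin keyring, then watching the MGR log for the auth errors, should show whether the scrape still triggers the problem.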


Regards
 
