Hi,
I'm affected by a severe issue with Ceph.
Whenever an incident leaves Ceph unhealthy and it starts recovering, the MGR stops doing its job although the service is still running.
In my case the relevant MGR log is flooded with these error messages:
2019-11-06 17:46:22.363 7f81ffdcc700 0 auth: could not find secret_id=3865
2019-11-06 17:46:22.363 7f81ffdcc700 0 cephx: verify_authorizer could not get service secret for service mgr secret_id=3865
The secret_id value changes over time, though.
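To see how the secret_id values vary, something like the following sketch could tally them from the MGR log. It assumes the message format shown above; the function name count_secret_ids is only illustrative and not part of any Ceph tooling.

```python
import re
from collections import Counter

# Matches the "auth: could not find secret_id=N" messages quoted above.
# Assumption: the log keeps exactly this wording.
SECRET_ID_RE = re.compile(r"auth: could not find secret_id=(\d+)")


def count_secret_ids(lines):
    """Return a Counter mapping each reported secret_id to its number of occurrences."""
    counts = Counter()
    for line in lines:
        match = SECRET_ID_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts
```

Passing an open log file to count_secret_ids (the log path depends on the deployment) gives one entry per distinct secret_id, which makes it easy to check whether the IDs increase steadily or jump around.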
My report on the Ceph user list went unanswered for a long time, but Sage Weil from Red Hat has now responded with this:
My current working theory is that the mgr is getting hung up when it tries
to scrape the device metrics from the mon. The 'tell' mechanism used to
send mon-targetted commands is pretty kludgey/broken in nautilus and
earlier. It's been rewritten for octopus, but isn't worth backporting--it
never really caused problems until the devicemanager started using it
heavily.
In any case, this PR just disables scraping of mon devices for nautilus:
https://github.com/ceph/ceph/pull/31446
There is a build queued at
https://shaman.ceph.com/repos/ceph/wip-no-scrape-mons-nautilus/d592e56e$
which should get packages in 1-2 hours.
Perhaps you can install that package on the mgr host and try again to
reproduce it again?
I noticed a few other oddities in the logs while looking through them,
like
https://tracker.ceph.com/issues/42666
which will hopefully have a fix ready for 14.2.5. I'm not sure about that
auth error message, though!
sage
Could you please look into this issue and provide a fix?
Thanks