Error message floods MGR log: connect got BADAUTHORIZER

cmonty14

Hi,
my cluster is not healthy: there are many slow requests and unknown PGs.
Then I noticed an error message that was flooding the MGR log:
2019-11-06 11:37:39.977 7f90028d7700 0 --1- 10.97.206.96:0/3948014004 >> v1:10.97.206.93:6918/101424 conn(0x56480ee7f600 0x56480eece000 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=1).handle_connect_reply_2 connect got BADAUTHORIZER
2019-11-06 11:37:39.981 7f90028d7700 0 --1- 10.97.206.96:0/3948014004 >> v1:10.97.206.93:6918/101424 conn(0x56480ec36400 0x56480d654000 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=1).handle_connect_reply_2 connect got BADAUTHORIZER
2019-11-06 11:37:39.981 7f90028d7700 0 --1- 10.97.206.96:0/3948014004 >> v1:10.97.206.93:6918/101424 conn(0x56480ee7f600 0x56480eece000 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=1).handle_connect_reply_2 connect got BADAUTHORIZER
2019-11-06 11:37:39.985 7f90028d7700 0 --1- 10.97.206.96:0/3948014004 >> v1:10.97.206.93:6918/101424 conn(0x56480ec36400 0x56480d654000 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=1).handle_connect_reply_2 connect got BADAUTHORIZER
2019-11-06 11:37:39.985 7f90028d7700 0 --1- 10.97.206.96:0/3948014004 >> v1:10.97.206.93:6918/101424 conn(0x56480ee7f600 0x56480eece000 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=1).handle_connect_reply_2 connect got BADAUTHORIZER


I immediately stopped the MGR on the affected node.
The next standby MGR then took over.
However, the same error messages flooded the log of the new active node, too.

Then I decided to stop all other MGR services and all MON services.

When I restarted the MON services sequentially on 4 nodes, there was no problem.
However, as soon as I start just one MGR service, the log is flooded again with the same error.

I could now stop all OSD services as well; however, this would only make the cluster's health problems worse.

What is causing these BADAUTHORIZER error messages?
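
From what I have found so far, BADAUTHORIZER generally means that cephx authentication failed, typically because of clock skew between nodes or a keyring that no longer matches the key registered in the cluster's auth database. These are the checks I intend to run (<id> stands for the MGR instance name; the keyring path assumes a default layout):

# Skew as seen by the monitors (cephx tickets are time-limited):
ceph time-sync-status

# Compare the MGR's registered key with its on-disk keyring:
ceph auth get mgr.<id>
cat /var/lib/ceph/mgr/ceph-<id>/keyring

# All daemons should report the same release:
ceph versions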

THX
 
Hi Thomas,

the nodes are time-synced.
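
In case someone wants to verify the same on their nodes, something like this should confirm it (assuming chrony is the active NTP client; a stock install may use systemd-timesyncd instead, in which case the first command is enough):

timedatectl status       # "System clock synchronized: yes" means NTP sync is active
chronyc tracking         # offset from the configured time source
ceph time-sync-status    # skew as seen by the Ceph monitors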

Currently I assume that this error is related to a potential bug in the MGR.
To resolve it I have installed updated Ceph packages, incl. ceph-mgr, provided by one of the developers.
Since then my cluster has recovered from its unhealthy state and is back to normal operation.
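
For anyone who needs the same fix: on a Proxmox/Debian node the updated builds can be pulled in roughly like this (the exact repository depends on where the fixed packages are published, so treat this as a sketch):

apt update
apt install ceph ceph-mgr

# Restart all MGR instances on the node and verify the running versions:
systemctl restart ceph-mgr.target
ceph versions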

Regards
 
