[SOLVED] ceph-mgr failing to start on one node after nautilus migration

resoli

Renowned Member
Mar 9, 2010
One of the managers refuses to start after migrating a seven-node cluster to PVE 6 and Nautilus. Here is the debug trace:

Code:
# /usr/bin/ceph-mgr -d --cluster ceph --id pvenode2 --setuser ceph --setgroup ceph --debug_ms 1 2>&1 | tee ceph-mgr.start.log

2019-11-28 12:17:39.136 7f40b23a1dc0  1  Processor -- start

2019-11-28 12:17:39.136 7f40b23a1dc0  1 --  start start

2019-11-28 12:17:39.136 7f40b23a1dc0  1 --2-  >> v2:10.1.1.211:3300/0 conn(0x55f23d39e000 0x55f23d2ea580 unknown :-1 s=NONE pgs=0 cs=0 l=0 rx=0 tx=0).connect

2019-11-28 12:17:39.136 7f40b23a1dc0  1 --2-  >> v2:10.1.1.216:3300/0 conn(0x55f23d39e480 0x55f23d2eab00 unknown :-1 s=NONE pgs=0 cs=0 l=0 rx=0 tx=0).connect

2019-11-28 12:17:39.136 7f40b23a1dc0  1 --2-  >> v2:10.1.1.212:3300/0 conn(0x55f23d39e900 0x55f23d2eb080 unknown :-1 s=NONE pgs=0 cs=0 l=0 rx=0 tx=0).connect

2019-11-28 12:17:39.136 7f40b23a1dc0  1 --  --> v2:10.1.1.211:3300/0 -- mon_getmap magic: 0 v1 -- 0x55f23c729180 con 0x55f23d39e000

2019-11-28 12:17:39.136 7f40b23a1dc0  1 --  --> v2:10.1.1.212:3300/0 -- mon_getmap magic: 0 v1 -- 0x55f23c729340 con 0x55f23d39e900

2019-11-28 12:17:39.136 7f40b23a1dc0  1 --  --> v2:10.1.1.216:3300/0 -- mon_getmap magic: 0 v1 -- 0x55f23c729500 con 0x55f23d39e480

2019-11-28 12:17:39.136 7f40b2157700  1 --2-  >> v2:10.1.1.212:3300/0 conn(0x55f23d39e900 0x55f23d2eb080 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=0 rx=0 tx=0)._handle_peer_banner_payload supported=0 required=0

2019-11-28 12:17:39.136 7f40b1956700  1 --2-  >> v2:10.1.1.211:3300/0 conn(0x55f23d39e000 0x55f23d2ea580 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=0 rx=0 tx=0)._handle_peer_banner_payload supported=0 required=0

2019-11-28 12:17:39.136 7f40b2157700  1 --2-  >> v2:10.1.1.212:3300/0 conn(0x55f23d39e900 0x55f23d2eb080 unknown :-1 s=HELLO_CONNECTING pgs=0 cs=0 l=0 rx=0 tx=0).handle_hello peer v2:10.1.1.212:3300/0 says I am v2:10.1.1.212:35226/0 (socket says 10.1.1.212:35226)

2019-11-28 12:17:39.136 7f40b2157700  1 -- 10.1.1.212:0/1734694411 learned_addr learned my addr 10.1.1.212:0/1734694411 (peer_addr_for_me v2:10.1.1.212:0/0)

2019-11-28 12:17:39.136 7f40b1956700  1 --2-  >> v2:10.1.1.211:3300/0 conn(0x55f23d39e000 0x55f23d2ea580 unknown :-1 s=HELLO_CONNECTING pgs=0 cs=0 l=0 rx=0 tx=0).handle_hello peer v2:10.1.1.211:3300/0 says I am v2:10.1.1.212:42474/0 (socket says 10.1.1.212:42474)

2019-11-28 12:17:39.136 7f40b1155700  1 --2- 10.1.1.212:0/1734694411 >> v2:10.1.1.216:3300/0 conn(0x55f23d39e480 0x55f23d2eab00 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=0 rx=0 tx=0)._handle_peer_banner_payload supported=0 required=0

2019-11-28 12:17:39.136 7f40b2157700  1 --2- 10.1.1.212:0/1734694411 >> v2:10.1.1.212:3300/0 conn(0x55f23d39e900 0x55f23d2eb080 unknown :-1 s=AUTH_CONNECTING pgs=0 cs=0 l=0 rx=0 tx=0).handle_auth_bad_method method=2 result (1) Operation not permitted, allowed methods=[2], allowed modes=[2,1]

2019-11-28 12:17:39.136 7f40b2157700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2]

2019-11-28 12:17:39.136 7f40b2157700  1 -- 10.1.1.212:0/1734694411 >> v2:10.1.1.212:3300/0 conn(0x55f23d39e900 msgr2=0x55f23d2eb080 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=0).mark_down

2019-11-28 12:17:39.136 7f40b2157700  1 --2- 10.1.1.212:0/1734694411 >> v2:10.1.1.212:3300/0 conn(0x55f23d39e900 0x55f23d2eb080 unknown :-1 s=AUTH_CONNECTING pgs=0 cs=0 l=0 rx=0 tx=0).stop

2019-11-28 12:17:39.136 7f40b1956700  1 --2- 10.1.1.212:0/1734694411 >> v2:10.1.1.211:3300/0 conn(0x55f23d39e000 0x55f23d2ea580 unknown :-1 s=AUTH_CONNECTING pgs=0 cs=0 l=0 rx=0 tx=0).handle_auth_bad_method method=2 result (1) Operation not permitted, allowed methods=[2], allowed modes=[2,1]

2019-11-28 12:17:39.136 7f40b1956700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2]

2019-11-28 12:17:39.136 7f40b1956700  1 -- 10.1.1.212:0/1734694411 >> v2:10.1.1.211:3300/0 conn(0x55f23d39e000 msgr2=0x55f23d2ea580 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=0).mark_down

2019-11-28 12:17:39.136 7f40b1956700  1 --2- 10.1.1.212:0/1734694411 >> v2:10.1.1.211:3300/0 conn(0x55f23d39e000 0x55f23d2ea580 unknown :-1 s=AUTH_CONNECTING pgs=0 cs=0 l=0 rx=0 tx=0).stop

2019-11-28 12:17:39.140 7f40b1155700  1 --2- 10.1.1.212:0/1734694411 >> v2:10.1.1.216:3300/0 conn(0x55f23d39e480 0x55f23d2eab00 unknown :-1 s=AUTH_CONNECTING pgs=0 cs=0 l=0 rx=0 tx=0).handle_auth_bad_method method=2 result (1) Operation not permitted, allowed methods=[2], allowed modes=[2,1]

2019-11-28 12:17:39.140 7f40b1155700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2]

2019-11-28 12:17:39.140 7f40b1155700  1 -- 10.1.1.212:0/1734694411 >> v2:10.1.1.216:3300/0 conn(0x55f23d39e480 msgr2=0x55f23d2eab00 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=0).mark_down

2019-11-28 12:17:39.140 7f40b1155700  1 --2- 10.1.1.212:0/1734694411 >> v2:10.1.1.216:3300/0 conn(0x55f23d39e480 0x55f23d2eab00 unknown :-1 s=AUTH_CONNECTING pgs=0 cs=0 l=0 rx=0 tx=0).stop

2019-11-28 12:17:39.140 7f40b23a1dc0  1 -- 10.1.1.212:0/1734694411 shutdown_connections

2019-11-28 12:17:39.140 7f40b23a1dc0  1 --2- 10.1.1.212:0/1734694411 >> v2:10.1.1.212:3300/0 conn(0x55f23d39e900 0x55f23d2eb080 unknown :-1 s=CLOSED pgs=0 cs=0 l=0 rx=0 tx=0).stop

2019-11-28 12:17:39.140 7f40b23a1dc0  1 --2- 10.1.1.212:0/1734694411 >> v2:10.1.1.211:3300/0 conn(0x55f23d39e000 0x55f23d2ea580 unknown :-1 s=CLOSED pgs=0 cs=0 l=0 rx=0 tx=0).stop

2019-11-28 12:17:39.140 7f40b23a1dc0  1 --2- 10.1.1.212:0/1734694411 >> v2:10.1.1.216:3300/0 conn(0x55f23d39e480 0x55f23d2eab00 unknown :-1 s=CLOSED pgs=0 cs=0 l=0 rx=0 tx=0).stop

2019-11-28 12:17:39.140 7f40b23a1dc0  1 -- 10.1.1.212:0/1734694411 shutdown_connections

2019-11-28 12:17:39.140 7f40b23a1dc0  1 -- 10.1.1.212:0/1734694411 wait complete.

2019-11-28 12:17:39.140 7f40b23a1dc0  1 -- 10.1.1.212:0/1734694411 >> 10.1.1.212:0/1734694411 conn(0x55f23c655a80 msgr2=0x55f23d398000 unknown :-1 s=STATE_NONE l=0).mark_down

failed to fetch mon config (--no-mon-config to skip)

any hint?

Thanks,
rob
 
2019-11-28 12:17:39.140 7f40b1155700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2]
It seems like the authentication isn't handled correctly by either the MON or MGR.

What does the ceph.conf look like? And did you upgrade that MGR or is it a new installation? Did you try to re-create the MGR?
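
If you want to try re-creating it, this is roughly how it could look (a sketch only; the mgr id pvenode2 is taken from your log, the caps are the standard mgr profile, and the paths are the defaults):

Code:
# stop the failing daemon and drop its old key (adjust the id to your node)
systemctl stop ceph-mgr@pvenode2
ceph auth del mgr.pvenode2

# re-create the key with the usual mgr caps and write it where the daemon expects it
ceph auth get-or-create mgr.pvenode2 mon 'allow profile mgr' osd 'allow *' mds 'allow *' -o /var/lib/ceph/mgr/ceph-pvenode2/keyring
chown ceph:ceph /var/lib/ceph/mgr/ceph-pvenode2/keyring
systemctl start ceph-mgr@pvenode2

On PVE 6 you could alternatively let the tooling handle it with pveceph mgr destroy / pveceph mgr create.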
 
It seems like the authentication isn't handled correctly by either the MON or MGR.

Yes, I noticed that /etc/ceph/ceph.client.admin.keyring was not aligned with the same file on the other nodes and replaced it manually, but this did not change anything.

What does the ceph.conf look like?

Here it is (I redacted the node names):

Code:
[global]
     auth client required = cephx
     auth cluster required = cephx
     auth service required = cephx
     bluestore block db size = 1073741824
     bluestore block wal size = 5368709120
     cluster network = 10.1.1.0/24
     fsid = f328e698-bf41-4fa8-a480-8d52005a9d27
     mon allow pool delete = true
     osd journal size = 5120
     osd pool default min size = 2
     osd pool default size = 3
     public network = 10.1.1.0/24
     mon host = 10.1.1.210,10.1.1.211,10.1.1.212,10.1.1.213,10.1.1.214,10.1.1.215,10.1.1.216

[mon.penode6]
     host = penode6
     mon addr = 10.1.1.216:6789

[mon.penode5]
     host = penode5
     mon addr = 10.1.1.215:6789

[mon.penode3]
     host = penode3
     mon addr = 10.1.1.213:6789

[mon.penode4]
     host = penode4
     mon addr = 10.1.1.214:6789

[mon.penode0]
     host = penode0
     mon addr = 10.1.1.210:6789

[mon.penode2]
     host = penode2
     mon addr = 10.1.1.212:6789

[mon.penode1]
     host = penode1
     mon addr = 10.1.1.211:6789

[client]
     keyring = /etc/pve/priv/$cluster.$name.keyring

And did you upgrade that MGR or is it a new installation? Did you try to re-create the MGR?

I came from the latest PVE 5.4 and Luminous; I did not try to re-create it.

Thanks,
rob
 
Yes, I noticed that /etc/ceph/ceph.client.admin.keyring was not aligned with the same file on the other nodes and replaced it manually, but this did not change anything.
The file is saved in /etc/pve/priv/ and will be used by the CLI commands only.
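
The MGR daemon itself reads its keyring from /var/lib/ceph/mgr/. A quick way to check whether that key still matches what the MONs have stored could look like this (a sketch, using the mgr id pvenode2 from your log):

Code:
# the key as stored in the cluster
ceph auth get mgr.pvenode2

# the key the daemon actually reads on the node
cat /var/lib/ceph/mgr/ceph-pvenode2/keyring

If the two keys differ, that would explain the "Operation not permitted" during authentication.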

mon host = 10.1.1.210,10.1.1.211,10.1.1.212,10.1.1.213,10.1.1.214,10.1.1.215,10.1.1.216
As an aside, you only need 3 MONs; more just generate resource overhead and don't benefit smaller clusters.
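
If you want to trim down to three later, remove one MON at a time and let the cluster settle in between, roughly like this (a sketch; node names are taken from your ceph.conf, check the current pveceph syntax before running it):

Code:
# on the node whose MON should be retired (PVE 6 tooling)
pveceph mon destroy penode3

# the plain Ceph equivalent would be: ceph mon remove penode3
# (then stop and clean up the ceph-mon@<id> service by hand)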

I came form pve 5.4 latest and Luminous, did not try to recreate.
Did you go through our upgrade guide? Did you enable the new messenger protocol (check with ceph mon dump)?
https://pve.proxmox.com/wiki/Ceph_L...msgrv2_protocol_and_update_Ceph_configuration
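
For reference, the relevant check from that section of the guide (output abridged and illustrative, with the addresses from your ceph.conf):

Code:
# every MON should list both a v2 (port 3300) and a v1 (port 6789) address
ceph mon dump
#   0: [v2:10.1.1.210:3300/0,v1:10.1.1.210:6789/0] mon.penode0
#   ...

# if that step was skipped during the upgrade:
ceph mon enable-msgr2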
 
The file is saved in /etc/pve/priv/ and will be used by the CLI commands only.


As an aside, you only need 3 MONs; more just generate resource overhead and don't benefit smaller clusters.


Did you go through our upgrade guide? Did you enable the new messenger protocol (check with ceph mon dump)?
https://pve.proxmox.com/wiki/Ceph_L...msgrv2_protocol_and_update_Ceph_configuration

CLI, ok.

3 MONs, ok. Under normal circumstances we will not have more than one node down. I guess it could be an issue if we had two nodes (with MONs on board) down at the same time.

msgr2, yes. I carefully followed the (indeed very accurate) guide.

bye,
rob
 
3 MONs, ok. Under normal circumstances we will not have more than one node down. I guess it could be an issue if we had two nodes (with MONs on board) down at the same time.
There is always a tradeoff to be made. With 3 MONs you still have quorum with one of them down (2 of 3), and the likelihood that two or even three die (unrepairably) at the same time is relatively small.
 
