[SOLVED] PVE 5 and Ceph luminous: unable to create monitor

Hi,

I have the same problem. I have six nodes and started with the first one. On the first three nodes I was able to create a mon; on the fourth I was not. The fifth and the sixth went fine.
At first I got messages about electing on node 4, then messages that this mon had no quorum. I removed all references to the fourth mon, completely reinstalled the node, and added it to the Proxmox cluster again. Then I tried pveceph createmon again, but all I get is a timeout. No other ceph command works either: each one hangs or aborts with a timeout.
The node is identical to all the others: same version, same hardware. Adding it to the Proxmox cluster also works without any problems. No firewall.

Code:
ceph 12.2.0-pve1
ceph-base 12.2.0-pve1
ceph-common 12.2.0-pve1
ceph-mgr 12.2.0-pve1
ceph-mon 12.2.0-pve1
ceph-osd 12.2.0-pve1
libcephfs1 10.2.5-7.2
libcephfs2 12.2.0-pve1
nagios-plugins-ceph 1.5.1-1
python-cephfs 12.2.0-pve1

I have no idea why only this one node fails while all the others work. We are using Puppet and cluster SSH to make sure everything is identical.

Any suggestions?
 
Is the failing monitor referenced in the monmap? Did you delete its keyring (check "ceph auth ls")? What do the monitor logs on the failing node and on the first node say?
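
For anyone following along, the three checks asked about here could be run roughly like this. It is only a sketch: the monitor ID passed to check_mon is a placeholder (node4 is a hypothetical name), the log path assumes the default cluster name "ceph", and --connect-timeout is used so the ceph CLI gives up instead of hanging.

```shell
#!/bin/sh
# Assumptions: default cluster name "ceph"; the monitor ID is supplied
# by the caller ("node4" below is a hypothetical example).

# Default log file location for a monitor with the given ID.
mon_log_path() {
    echo "/var/log/ceph/ceph-mon.$1.log"
}

# Run the three checks in order. Each ceph call gives up after
# 10 seconds instead of hanging if no monitor can be reached.
check_mon() {
    mon_id=$1
    ceph --connect-timeout 10 mon dump      # is the monitor listed in the monmap?
    ceph --connect-timeout 10 auth ls       # are the expected keys still present?
    tail -n 50 "$(mon_log_path "$mon_id")"  # most recent monitor log lines
}

# Usage (on the failing node and on a working one):
#   check_mon node4
```

Running it on both the failing node and a working node makes it easier to compare the output side by side.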
 
Hi Fabian,

Testing ceph auth list was one of the first things I did. On the five working nodes it worked instantly after installing Ceph (pveceph install) on the node. On the failing node there is simply nothing: every Ceph-related command ends in a timeout or hangs. The Ceph keyring is in place (ceph.conf too), as on all the others, thanks to Corosync :) If I enable debug for mon and just type "ceph" on the command line, I can see the communication between the nodes, but the failing node just hangs. What is even stranger: after a complete reinstall of the whole node, including wiping the OS, I have the same problems. I also thought that MTU 9000 might be the problem, but when I created the cluster all nodes had the same settings.
I will take a deeper look later.
There is also no logging: /var/log/ceph is empty. I can find only one journal entry: ... ceph .. pam timeout (I do not have the exact message here).

cu denny
 
hi,

The MTU looks fine, but I enabled debug. For example, ceph auth list:

Code:
# ceph auth list
2017-10-06 16:42:15.525808 7fea7eaac700  1  Processor -- start
2017-10-06 16:42:15.525869 7fea7eaac700  1 -- - start start
2017-10-06 16:42:15.526561 7fea7eaac700  1 -- - --> 10.3.0.2:6789/0 -- auth(proto 0 30 bytes epoch 0) v1 -- 0x7fea78174370 con 0
2017-10-06 16:42:15.526606 7fea7eaac700  1 -- - --> 10.3.0.6:6789/0 -- auth(proto 0 30 bytes epoch 0) v1 -- 0x7fea781747d0 con 0
2017-10-06 16:42:15.526944 7fea77fff700  1 -- 10.3.0.4:0/1283719132 learned_addr learned my addr 10.3.0.4:0/1283719132
2017-10-06 16:42:15.527330 7fea767fc700  1 -- 10.3.0.4:0/1283719132 <== mon.4 10.3.0.6:6789/0 1 ==== mon_map magic: 0 v1 ==== 802+0+0 (4146653720 0 0) 0x7fea68001880 con 0x7fea78176640
2017-10-06 16:42:15.527403 7fea767fc700  1 -- 10.3.0.4:0/1283719132 <== mon.1 10.3.0.2:6789/0 1 ==== mon_map magic: 0 v1 ==== 802+0+0 (4146653720 0 0) 0x7fea6c001620 con 0x7fea78179cd0
2017-10-06 16:42:15.527450 7fea767fc700  1 -- 10.3.0.4:0/1283719132 <== mon.4 10.3.0.6:6789/0 2 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 33+0+0 (2102421076 0 0) 0x7fea68001e40 con 0x7fea78176640
2017-10-06 16:42:15.527521 7fea767fc700  1 -- 10.3.0.4:0/1283719132 --> 10.3.0.6:6789/0 -- auth(proto 2 32 bytes epoch 0) v1 -- 0x7fea640023c0 con 0
2017-10-06 16:42:15.527538 7fea767fc700  1 -- 10.3.0.4:0/1283719132 <== mon.1 10.3.0.2:6789/0 2 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 33+0+0 (1970791971 0 0) 0x7fea6c001be0 con 0x7fea78179cd0
2017-10-06 16:42:15.527570 7fea767fc700  1 -- 10.3.0.4:0/1283719132 --> 10.3.0.2:6789/0 -- auth(proto 2 32 bytes epoch 0) v1 -- 0x7fea64003a00 con 0
2017-10-06 16:42:15.527784 7fea767fc700  1 -- 10.3.0.4:0/1283719132 <== mon.4 10.3.0.6:6789/0 3 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 206+0+0 (231347969 0 0) 0x7fea680012e0 con 0x7fea78176640
2017-10-06 16:42:15.527837 7fea767fc700  1 -- 10.3.0.4:0/1283719132 --> 10.3.0.6:6789/0 -- auth(proto 2 165 bytes epoch 0) v1 -- 0x7fea64002a40 con 0
2017-10-06 16:42:15.527870 7fea767fc700  1 -- 10.3.0.4:0/1283719132 <== mon.1 10.3.0.2:6789/0 3 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 206+0+0 (3860874913 0 0) 0x7fea6c001080 con 0x7fea78179cd0
2017-10-06 16:42:15.527910 7fea767fc700  1 -- 10.3.0.4:0/1283719132 --> 10.3.0.2:6789/0 -- auth(proto 2 165 bytes epoch 0) v1 -- 0x7fea640030e0 con 0
2017-10-06 16:42:15.528083 7fea767fc700  1 -- 10.3.0.4:0/1283719132 <== mon.4 10.3.0.6:6789/0 4 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 580+0+0 (342926252 0 0) 0x7fea68002250 con 0x7fea78176640
2017-10-06 16:42:15.528142 7fea767fc700  1 -- 10.3.0.4:0/1283719132 >> 10.3.0.2:6789/0 conn(0x7fea78179cd0 :-1 s=STATE_OPEN pgs=629449 cs=1 l=1).mark_down
2017-10-06 16:42:15.528203 7fea767fc700  1 -- 10.3.0.4:0/1283719132 --> 10.3.0.6:6789/0 -- mon_subscribe({monmap=0+}) v2 -- 0x7fea78180490 con 0
2017-10-06 16:42:15.528239 7fea7eaac700  1 -- 10.3.0.4:0/1283719132 --> 10.3.0.6:6789/0 -- mon_subscribe({mgrmap=0+}) v2 -- 0x7fea781747d0 con 0
2017-10-06 16:42:15.528307 7fea7eaac700  1 -- 10.3.0.4:0/1283719132 --> 10.3.0.6:6789/0 -- mon_subscribe({osdmap=0}) v2 -- 0x7fea781747d0 con 0
2017-10-06 16:42:15.528326 7fea767fc700  1 -- 10.3.0.4:0/1283719132 <== mon.4 10.3.0.6:6789/0 5 ==== mon_map magic: 0 v1 ==== 802+0+0 (4146653720 0 0) 0x7fea68001810 con 0x7fea78176640
2017-10-06 16:42:15.528411 7fea767fc700  1 -- 10.3.0.4:0/1283719132 <== mon.4 10.3.0.6:6789/0 6 ==== mgrmap(e 43) v1 ==== 591+0+0 (179944394 0 0) 0x7fea680029d0 con 0x7fea78176640
2017-10-06 16:42:15.531840 7fea7eaac700  1 -- 10.3.0.4:0/1283719132 --> 10.3.0.6:6789/0 -- mon_command({"prefix": "get_command_descriptions"} v 0) v1 -- 0x7fea780d61d0 con 0

It hangs. Sometimes I get a few packets, but that's all. Trying "ceph --help" hangs as well.
 
hi,

It is the MTU 9000! After I set the MTU back to the default of 1500, the ceph commands work as expected. But I don't understand why. The MTU is set on all switch ports and on all other hosts. Maybe I have to reboot the switch.
What may be related: the new mon doesn't get quorum (Quorum = no).

I tried to recreate the mon:

Code:
ceph-create-keys:ceph-mon is not in quorum: u'probing'
...
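
A rough way to test whether jumbo frames really pass end to end between two nodes is to send pings that are not allowed to fragment. In the debug output earlier in the thread, the small auth messages (30 to 580 bytes) got through and the hang started only once a larger reply was expected, which would be consistent with large frames being dropped somewhere on the path. The sketch below assumes IPv4 and the Linux iputils ping; the function names are made up for illustration.

```shell
#!/bin/sh
# ICMP payload size that exactly fills one frame of the given MTU:
# MTU minus 20 bytes IPv4 header minus 8 bytes ICMP header.
icmp_payload_for_mtu() {
    echo $(( $1 - 28 ))
}

# Ping each host with the "don't fragment" flag set (-M do), using a
# payload sized for the given MTU. If any hop on the path has a smaller
# MTU, the ping fails instead of being fragmented silently.
check_path_mtu() {
    mtu=$1
    shift
    size=$(icmp_payload_for_mtu "$mtu")
    for host in "$@"; do
        if ping -M do -c 3 -W 2 -s "$size" "$host" >/dev/null 2>&1; then
            echo "$host: MTU $mtu OK"
        else
            echo "$host: MTU $mtu FAILED"
        fi
    done
}

# Usage, run from the failing node against the monitor addresses:
#   check_path_mtu 9000 10.3.0.2 10.3.0.6
```

If the 9000 check fails while `check_path_mtu 1500 ...` succeeds, some interface or switch port on the path is probably not passing jumbo frames, even if it is configured to.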
 
Hi all,

I know this thread is nearly 2 years old now, but wanted to give kudos for it helping me fix a similar problem just now.

We've just upgraded a monitor node from 5.3 to 5.4, which has upped Ceph to 12.2.12. The monitor failed upon reboot, and removing and re-adding the monitor, both via the Proxmox GUI and via Ceph shell commands, always gave the same failure.

From the Proxmox GUI, we got this on the Create task:
TASK ERROR: command 'ceph-create-keys -i XXhostXX' failed: exit code 1

..and in the Ceph mon logs, we got this:
e39 ms_verify_authorizer bad authorizer from mon 10.XXX.XXX.93:6789/0

On the 2nd day of searching, I came across this thread, and initially skipped the MTU answer because we have other hosts with Jumbo Frames and they work OK. But after setting the MTU on the bond back to 1500, creating the Monitor works perfectly, no key errors, and the monitors achieve quorum again.

We've upgraded the node in order to migrate it to an InfiniBand network, which has an MTU of 1500 anyway, so we won't be setting it back to 9000.
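
To see at a glance which MTU each interface on a node is actually using, something like the following may help. It is a sketch that reads the standard Linux sysfs entries; list_mtus is a made-up helper name, and bond0 in the comment is a hypothetical interface name.

```shell
#!/bin/sh
# Print "<interface> <mtu>" for every interface found under the given
# sysfs directory (default: /sys/class/net, the standard Linux location).
list_mtus() {
    net_dir=${1:-/sys/class/net}
    for iface in "$net_dir"/*; do
        # Skip entries that do not expose an mtu file.
        [ -r "$iface/mtu" ] || continue
        echo "$(basename "$iface") $(cat "$iface/mtu")"
    done
}

# Usage:
#   list_mtus
# To change the MTU at runtime (bond0 is a hypothetical name):
#   ip link set dev bond0 mtu 1500
```

Comparing this output across nodes shows quickly whether a bond, a bridge, or one of the underlying NICs differs from the rest.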

Hopefully this post appears in <search engine of your choice> so others with the same problem find it quicker than I did! Thanks to Denny!

Cheers,
Stuart.

ceph-create-keys failed, ceph, add monitor, mtu
 
