[SOLVED] Ceph Cluster - Cannot start monitors

Neb

Well-Known Member
Apr 27, 2017
Hi,

I'm having some issues with my Ceph cluster installation.

I have 3 physical servers with Ceph installed on each node. Each node has 3 SAS disks and several 10 Gbps NICs. One disk is 300 GB and holds the Proxmox installation; the other two are 1 TB each and available for my OSDs. But when I create OSDs on these disks, they always end up "down" and "out", like this:

Code:
10:23:28 ~ # ceph osd tree                                                          root@px-node-1
# id    weight    type name    up/down    reweight
-1    0    root default
0    0    osd.0    down    0 
1    0    osd.1    down    0 
2    0    osd.2    down    0

ceph-mon logs: https://pastebin.com/x1hy68wz

And the ceph-osd logs: https://pastebin.com/j3kVJBPg

In these logs we can see:

Code:
=== from ceph-osd.1.log ===
2017-05-15 14:53:29.620510 7f864d60d7c0 -1 filestore(/var/lib/ceph/tmp/mnt.YblZYG) could not find 23c2fcde/osd_superblock/0//-1 in index: (2) No such file or directory

=== from ceph-osd.2.log ===
2017-05-16 10:24:09.774615 7f0709eef7c0 -1 filestore(/var/lib/ceph/tmp/mnt.S1H2Id) could not find 23c2fcde/osd_superblock/0//-1 in index: (2) No such file or directory

What does this mean?

The result of 'ceph -s': https://pastebin.com/9KqSDHz3

In the GUI, I see 3 OSDs in the "down" and "out" state. In the OSD section, I can't see my nodes and their OSDs.
I don't understand what is causing this.

This is the result of 'pveversion -v':

Code:
10:27:18 ~ # pveversion -v                                                     
proxmox-ve: 4.4-76 (running kernel: 4.4.35-1-pve)
pve-manager: 4.4-1 (running version: 4.4-1/eb2d6f1e)
pve-kernel-4.4.35-1-pve: 4.4.35-76
lvm2: 2.02.116-pve3
corosync-pve: 2.4.0-1
libqb0: 1.0-1
pve-cluster: 4.0-48
qemu-server: 4.0-101
pve-firmware: 1.1-10
libpve-common-perl: 4.0-83
libpve-access-control: 4.0-19
libpve-storage-perl: 4.0-70
pve-libspice-server1: 0.12.8-1
vncterm: 1.2-1
pve-docs: 4.4-1
pve-qemu-kvm: 2.7.0-9
pve-container: 1.0-88
pve-firewall: 2.0-33
pve-ha-manager: 1.0-38
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u3
lxc-pve: 2.0.6-2
lxcfs: 2.0.5-pve1
criu: 1.6.0-1
novnc-pve: 0.5-8
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.8-pve13~bpo80
ceph: 0.80.7-2+deb8u2

Any idea how to resolve these issues?

Thank you, and if you need more details, just ask.
 
Yes, I followed the wiki when I created my Ceph cluster and OSDs.
 

no, you didn't - your ceph packages are neither hammer nor jewel, which is a prerequisite for using pveceph (note how step 6 of the linked wiki article is "Installation of Ceph packages"). also, your PVE packages are not up to date, so you are running a version with a buggy kernel and without Ceph Jewel support. I suggest upgrading to the current 4.4 version, re-reading the wiki article and starting your ceph setup from scratch following the instructions.
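roughly, the upgrade and Ceph install would look like this (a sketch only - the exact pveceph version flag and repository setup are assumptions here, the wiki article has the authoritative steps):

Code:
# bring the PVE packages up to date on every node
apt-get update
apt-get dist-upgrade

# then install the Ceph Jewel packages on every node (flag value assumed)
pveceph install --version jewel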
 
Indeed, sorry. I have upgraded. I'll try again with a clean Ceph cluster.
 
OK, it's done. But now I can't create the 3 monitors.

When I initialize Ceph with 'pveceph init --network <private-network-ip>/<CIDR>' and create the first monitor with 'pveceph createmon', I get this:

Code:
15:47:42 ~ # pveceph init --network 10.51.1.0/24                                             root@px-node-1
------------------------------------------------------------
15:47:47 ~ # pveceph createmon                                                               root@px-node-1
creating /etc/pve/priv/ceph.client.admin.keyring
monmaptool: monmap file /tmp/monmap
monmaptool: generated fsid 2ced4462-f0b6-4ba2-819b-3bdbd4a8bfa0
epoch 0
fsid 2ced4462-f0b6-4ba2-819b-3bdbd4a8bfa0
last_changed 2017-05-16 15:47:51.447556
created 2017-05-16 15:47:51.447556
0: 10.51.1.11:6789/0 mon.0
monmaptool: writing epoch 0 to /tmp/monmap (1 monitors)
ceph-mon: set fsid to aa279e1d-b6bf-4920-9fe7-0b63680d19b7
ceph-mon: created monfs at /var/lib/ceph/mon/ceph-0 for mon.0
Job for ceph-mon@0.service failed. See 'systemctl status ceph-mon@0.service' and 'journalctl -xn' for details.
command '/bin/systemctl start ceph-mon@0' failed: exit code 1

Monitor logs:

Code:
15:52:30 ~ # tail /var/log/ceph/ceph-mon.0.log                                               root@px-node-1
2017-05-16 15:27:10.250163 7f4bfd72d600  1 leveldb: Delete type=3 #1

2017-05-16 15:40:34.120787 7fead4135600  1 leveldb: Delete type=3 #1

2017-05-16 15:41:19.515119 7f69c66f8600  1 leveldb: Delete type=3 #1

2017-05-16 15:42:03.993457 7fd2f0d8f600  1 leveldb: Delete type=3 #1

2017-05-16 15:47:51.538399 7f8a5c9d6600  1 leveldb: Delete type=3 #1

On the other nodes I get "got timeout". However, all nodes can ping each other.
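So far I have only checked ICMP; something like this (assuming nc/netcat is installed) would also check whether the monitor's TCP port is reachable from another node:

Code:
# diagnostic sketch: test TCP reachability of mon.0 from another node
nc -zv 10.51.1.11 6789

For reference, here are my ceph.conf and the cluster status: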

Code:
15:31:33 ~ # cat /etc/pve/ceph.conf                                                          root@px-node-1
[global]
     auth client required = cephx
     auth cluster required = cephx
     auth service required = cephx
     cluster network = 10.51.1.0/24
     filestore xattr use omap = true
     fsid = 4c7625c3-8672-4150-bc5c-b07c5c2177ac
     keyring = /etc/pve/priv/$cluster.$name.keyring
     osd journal size = 5120
     osd pool default min size = 1
     public network = 10.51.1.0/24

[osd]
     keyring = /var/lib/ceph/osd/ceph-$id/keyring

[mon.0]
     host = px-node-1
     mon addr = 10.51.1.11:6789

Code:
15:30:01 /var/lib # pvecm status                                                             root@px-node-2
Quorum information
------------------
Date:             Tue May 16 15:30:53 2017
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000002
Ring ID:          1/268
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.51.0.11
0x00000002          1 10.51.0.12 (local)
0x00000003          1 10.51.0.13
------------------------------------------------------------
 
I also ran 'pveceph purge' several times, but the issue persists. I can't start the first monitor after initializing Ceph on the private network. When I try to create monitors on the other nodes, the result is "got timeout". I don't know how to resolve this. Do I need to uninstall and reinstall the Ceph packages?

Thx
 
please post the complete log; your status output only contains lines stating that it won't start because it has failed too often already ;) the following should do the trick:
Code:
journalctl -b -u "ceph-mon@*.service"
 
Hello fabian,

Thanks for answering me.

Here is the result: https://pastebin.com/P2ZJWdGB

I also updated the pastebin link in my post #8 with the 'journalctl -xn' output, which had been deleted.

Code:
mai 17 09:44:28 px-node-1 ceph-mon[2361]: IO error: /var/lib/ceph/mon/ceph-0/store.db/LOCK: Permission denied
mai 17 09:44:28 px-node-1 ceph-mon[2361]: 2017-05-17 09:44:28.828079 7f8674a41600 -1 error opening mon data directory at '/var/lib/ceph/mon/ceph-0': (22) Invalid argument

I changed the owner of /var/lib/ceph to 'ceph' recursively. But why do I need to do this manually?
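For reference, the ownership fix I applied was a recursive chown (a sketch, assuming the 'ceph' user and group created by the Jewel packages):

Code:
# give the ceph user ownership of the monitor/OSD data directories
chown -R ceph:ceph /var/lib/ceph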

EDIT: I repeated these commands:

Code:
pveceph purge
pveceph init --network 10.51.1.0/24
pveceph createmon

The result of 'pveceph createmon' now:

Code:
creating /etc/pve/priv/ceph.client.admin.keyring
monmaptool: monmap file /tmp/monmap
monmaptool: generated fsid f348a071-0d1f-4bce-923d-855b6ea4fc8c
epoch 0
fsid f348a071-0d1f-4bce-923d-855b6ea4fc8c
last_changed 2017-05-17 09:57:49.259509
created 2017-05-17 09:57:49.259509
0: 10.51.1.11:6789/0 mon.0
monmaptool: writing epoch 0 to /tmp/monmap (1 monitors)
ceph-mon: set fsid to 4047e89d-1ce2-42e6-83b0-7235bc0d9084
ceph-mon: created monfs at /var/lib/ceph/mon/ceph-0 for mon.0
Job for ceph-mon@0.service failed. See 'systemctl status ceph-mon@0.service' and 'journalctl -xn' for details.
command '/bin/systemctl start ceph-mon@0' failed: exit code 1

And the last lines of 'journalctl -b -u "ceph-mon@0.service"':

Code:
mai 17 09:44:28 px-node-1 systemd[1]: ceph-mon@0.service: main process exited, code=exited, status=1/FAILURE
mai 17 09:44:28 px-node-1 systemd[1]: Unit ceph-mon@0.service entered failed state.
mai 17 09:44:39 px-node-1 systemd[1]: ceph-mon@0.service holdoff time over, scheduling restart.
mai 17 09:44:39 px-node-1 systemd[1]: Stopping Ceph cluster monitor daemon...
mai 17 09:44:39 px-node-1 systemd[1]: Starting Ceph cluster monitor daemon...
mai 17 09:44:39 px-node-1 systemd[1]: ceph-mon@0.service start request repeated too quickly, refusing to start.
mai 17 09:44:39 px-node-1 systemd[1]: Failed to start Ceph cluster monitor daemon.
mai 17 09:44:39 px-node-1 systemd[1]: Unit ceph-mon@0.service entered failed state.
mai 17 09:57:49 px-node-1 systemd[1]: Starting Ceph cluster monitor daemon...
mai 17 09:57:49 px-node-1 systemd[1]: ceph-mon@0.service start request repeated too quickly, refusing to start.
mai 17 09:57:49 px-node-1 systemd[1]: Failed to start Ceph cluster monitor daemon.

This line:

Code:
mai 17 09:57:49 px-node-1 systemd[1]: ceph-mon@0.service start request repeated too quickly, refusing to start.

What does it mean?
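(From what I understand, this only means that systemd hit its start-rate limit after the repeated failures and is refusing further start attempts; it is not a new error by itself. If that is the case, the failed state has to be cleared before another start is accepted, for example:)

Code:
# clear systemd's start-rate-limit / failed state for the monitor unit,
# then try starting it again (this does not fix the underlying error)
systemctl reset-failed ceph-mon@0.service
systemctl start ceph-mon@0.service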


Moreover, this is the result of 'apt-cache depends ceph-mon':

Code:
ceph-mon
  Depends: ceph-base
  Depends: python-flask
  Depends: init-system-helpers
  Depends: libboost-iostreams1.55.0
  Depends: libboost-random1.55.0
  Depends: libboost-system1.55.0
  Depends: libboost-thread1.55.0
  Depends: libc6
  Depends: libgcc1
  Depends: libgoogle-perftools4
  Depends: libleveldb1
  Depends: libnspr4
  Depends: libnss3
  Depends: libsnappy1
  Depends: libstdc++6
  Depends: zlib1g
  Recommends: ceph-common
  Breaks: ceph
  Replaces: ceph

Do the last two lines mean that the 'ceph' package is broken and that I need to replace it?

When I run 'ceph -s', I get:

Code:
10:19:09 # ceph -s                                                                                                                    root@px-node-1
2017-05-17 10:19:18.428817 7f59bbaa7700  0 -- :/4164518763 >> 10.51.1.11:6789/0 pipe(0x7f59c005ecf0 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f59c005c9e0).fault
2017-05-17 10:19:21.429002 7f59bb9a6700  0 -- :/4164518763 >> 10.51.1.11:6789/0 pipe(0x7f59b0000c80 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7f59b0001f90).fault
2017-05-17 10:19:24.429187 7f59bbaa7700  0 -- :/4164518763 >> 10.51.1.11:6789/0 pipe(0x7f59b0005160 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f59b0006420).fault
^CTraceback (most recent call last):
  File "/usr/bin/ceph", line 948, in <module>
    retval = main()
  File "/usr/bin/ceph", line 852, in main
    prefix='get_command_descriptions')
  File "/usr/lib/python2.7/dist-packages/ceph_argparse.py", line 1300, in json_command
    raise RuntimeError('"{0}": exception {1}'.format(argdict, e))
RuntimeError: "None": exception "['{"prefix": "get_command_descriptions"}']": exception You cannot perform that operation on a Rados object in state configuring.

So I am going to reinstall all the packages.
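A rough sketch of what I mean by reinstalling (the exact package list is an assumption - adjust it to whatever is actually installed on the nodes):

Code:
# force-reinstall the split Ceph packages on each node
apt-get install --reinstall ceph-base ceph-common ceph-mon ceph-osd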
 
Same issue. I reinstalled all the packages, but the errors still occur.

Moreover, the 'ceph' package still shows up under "Breaks" in 'apt-cache depends ceph-mon':

Code:
11:17:24 ~ # apt-cache depends ceph-mon                                                                                  root@px-node-1
ceph-mon
  Depends: ceph-base
  Depends: python-flask
  Depends: init-system-helpers
  Depends: libboost-iostreams1.55.0
  Depends: libboost-random1.55.0
  Depends: libboost-system1.55.0
  Depends: libboost-thread1.55.0
  Depends: libc6
  Depends: libgcc1
  Depends: libgoogle-perftools4
  Depends: libleveldb1
  Depends: libnspr4
  Depends: libnss3
  Depends: libsnappy1
  Depends: libstdc++6
  Depends: zlib1g
  Recommends: ceph-common
  Breaks: ceph
  Replaces: ceph

EDIT: Monitor creation apparently works fine now. I have reinstalled ALL the required packages and libraries.
 
OK, great! My monitors work well and all my OSDs are UP!

Thanks.