[SOLVED] Ceph Cluster - Can not start monitors

Neb

Hi,

I have some issues with my Ceph cluster installation.

I have 3 physical servers with Ceph installed on each node. Each node has 3 SAS disks and several 10Gbps NICs. One 300GB disk holds the Proxmox packages; the other 1TB disks are available for my OSDs. But when I create OSDs on these disks, they are always "down" and "out", like this:

Code:
10:23:28 ~ # ceph osd tree                                                          root@px-node-1
# id    weight    type name    up/down    reweight
-1    0    root default
0    0    osd.0    down    0 
1    0    osd.1    down    0 
2    0    osd.2    down    0
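
For reference, the wiki's way of creating the OSDs on the 1TB disks is roughly the following; the device name here is only a placeholder, not necessarily my exact disk:

Code:
# run on the node that owns the disk, once per data disk (device name is an example)
pveceph createosd /dev/sdb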

Logs of ceph-mon: https://pastebin.com/x1hy68wz

And the ceph-osd logs: https://pastebin.com/j3kVJBPg

In these logs we can see that:

Code:
=== from ceph-osd.1.log ===
2017-05-15 14:53:29.620510 7f864d60d7c0 -1 filestore(/var/lib/ceph/tmp/mnt.YblZYG) could not find 23c2fcde/osd_superblock/0//-1 in index: (2) No such file or directory

=== from ceph-osd.2.log ===
2017-05-16 10:24:09.774615 7f0709eef7c0 -1 filestore(/var/lib/ceph/tmp/mnt.S1H2Id) could not find 23c2fcde/osd_superblock/0//-1 in index: (2) No such file or directory

What does this error mean?

The result of 'ceph -s': https://pastebin.com/9KqSDHz3

In my GUI, I see 3 OSDs in the "down" and "out" state. In the OSDs section, I can't see my nodes and their created OSDs.
I don't understand what is causing this.

This is the result of 'pveversion -v':

Code:
10:27:18 ~ # pveversion -v                                                     
proxmox-ve: 4.4-76 (running kernel: 4.4.35-1-pve)
pve-manager: 4.4-1 (running version: 4.4-1/eb2d6f1e)
pve-kernel-4.4.35-1-pve: 4.4.35-76
lvm2: 2.02.116-pve3
corosync-pve: 2.4.0-1
libqb0: 1.0-1
pve-cluster: 4.0-48
qemu-server: 4.0-101
pve-firmware: 1.1-10
libpve-common-perl: 4.0-83
libpve-access-control: 4.0-19
libpve-storage-perl: 4.0-70
pve-libspice-server1: 0.12.8-1
vncterm: 1.2-1
pve-docs: 4.4-1
pve-qemu-kvm: 2.7.0-9
pve-container: 1.0-88
pve-firewall: 2.0-33
pve-ha-manager: 1.0-38
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u3
lxc-pve: 2.0.6-2
lxcfs: 2.0.5-pve1
criu: 1.6.0-1
novnc-pve: 0.5-8
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.8-pve13~bpo80
ceph: 0.80.7-2+deb8u2

Please, any idea how to resolve these issues?

Thank you, and if you want more details, just ask.
 
Yes, I followed the wiki when I created my Ceph cluster and OSDs
 
Yes, I followed the wiki when I created my Ceph cluster and OSDs

No, you didn't - your Ceph packages are neither Hammer nor Jewel, which is a prerequisite for using pveceph (note how step 6 of the linked wiki article is "Installation of Ceph packages"). Also, your PVE packages are not up to date, so you are running a version with a buggy kernel and without Ceph Jewel support. I suggest upgrading to the current 4.4 version, re-reading the wiki article and starting your Ceph setup from scratch following the instructions.
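
Roughly, the steps are something like this on every node (this is only a sketch; check the wiki for the exact syntax):

Code:
apt-get update && apt-get dist-upgrade   # bring PVE 4.4 fully up to date
pveceph install -version jewel           # pull in the Ceph Jewel packages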
 
Indeed, sorry. I upgraded. I will try again with a clean Ceph cluster.
 
OK, it's done. But now I can't create the 3 monitors.

When I initialize Ceph with 'pveceph init --network <private-network-ip>/<CIDR>' and create the first monitor with 'pveceph createmon', I get this:

Code:
15:47:42 ~ # pveceph init --network 10.51.1.0/24                                             root@px-node-1
------------------------------------------------------------
15:47:47 ~ # pveceph createmon                                                               root@px-node-1
creating /etc/pve/priv/ceph.client.admin.keyring
monmaptool: monmap file /tmp/monmap
monmaptool: generated fsid 2ced4462-f0b6-4ba2-819b-3bdbd4a8bfa0
epoch 0
fsid 2ced4462-f0b6-4ba2-819b-3bdbd4a8bfa0
last_changed 2017-05-16 15:47:51.447556
created 2017-05-16 15:47:51.447556
0: 10.51.1.11:6789/0 mon.0
monmaptool: writing epoch 0 to /tmp/monmap (1 monitors)
ceph-mon: set fsid to aa279e1d-b6bf-4920-9fe7-0b63680d19b7
ceph-mon: created monfs at /var/lib/ceph/mon/ceph-0 for mon.0
Job for ceph-mon@0.service failed. See 'systemctl status ceph-mon@0.service' and 'journalctl -xn' for details.
command '/bin/systemctl start ceph-mon@0' failed: exit code 1

Monitor logs:

Code:
15:52:30 ~ # tail /var/log/ceph/ceph-mon.0.log                                               root@px-node-1
2017-05-16 15:27:10.250163 7f4bfd72d600  1 leveldb: Delete type=3 #1

2017-05-16 15:40:34.120787 7fead4135600  1 leveldb: Delete type=3 #1

2017-05-16 15:41:19.515119 7f69c66f8600  1 leveldb: Delete type=3 #1

2017-05-16 15:42:03.993457 7fd2f0d8f600  1 leveldb: Delete type=3 #1

2017-05-16 15:47:51.538399 7f8a5c9d6600  1 leveldb: Delete type=3 #1

On the other nodes I get "got timeout". However, all nodes can ping each other.

Code:
15:31:33 ~ # cat /etc/pve/ceph.conf                                                          root@px-node-1
[global]
     auth client required = cephx
     auth cluster required = cephx
     auth service required = cephx
     cluster network = 10.51.1.0/24
     filestore xattr use omap = true
     fsid = 4c7625c3-8672-4150-bc5c-b07c5c2177ac
     keyring = /etc/pve/priv/$cluster.$name.keyring
     osd journal size = 5120
     osd pool default min size = 1
     public network = 10.51.1.0/24

[osd]
     keyring = /var/lib/ceph/osd/ceph-$id/keyring

[mon.0]
     host = px-node-1
     mon addr = 10.51.1.11:6789

Code:
15:30:01 /var/lib # pvecm status                                                             root@px-node-2
Quorum information
------------------
Date:             Tue May 16 15:30:53 2017
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000002
Ring ID:          1/268
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.51.0.11
0x00000002          1 10.51.0.12 (local)
0x00000003          1 10.51.0.13
------------------------------------------------------------
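
The nodes clearly see each other in corosync, so as an extra check I can test from px-node-2/px-node-3 whether the monitor port from ceph.conf above is reachable at all on the 10.51.1.0/24 network (a simple sketch with netcat):

Code:
# run from px-node-2 or px-node-3; address and port taken from the [mon.0] section above
nc -zv 10.51.1.11 6789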
 
I also ran 'pveceph purge' several times, but the issue persists. I can't start the first monitor after initializing Ceph on a private network, and when I try to create monitors on the other nodes the result is "got timeout". I don't know how to resolve this. Do I need to uninstall and reinstall the Ceph packages?

Thx
 
Please post the complete log; your status output only contains lines stating that it won't start because it has failed too often already ;) The following should do the trick:
Code:
journalctl -b -u "ceph-mon@*.service"
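
Also, once the underlying error is fixed, the unit may still be stuck in that "failed too often" state; something like this (standard systemd commands) clears it so you can attempt a fresh start:

Code:
systemctl reset-failed ceph-mon@0.service
systemctl start ceph-mon@0.service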
 
Hello fabian,

Thanks for your answer.

Here is the result: https://pastebin.com/P2ZJWdGB

I also updated the pastebin link in my post #8 with the 'journalctl -xn' output, which had been deleted.

Code:
mai 17 09:44:28 px-node-1 ceph-mon[2361]: IO error: /var/lib/ceph/mon/ceph-0/store.db/LOCK: Permission denied
mai 17 09:44:28 px-node-1 ceph-mon[2361]: 2017-05-17 09:44:28.828079 7f8674a41600 -1 error opening mon data directory at '/var/lib/ceph/mon/ceph-0': (22) Invalid argument

I changed the owner of /var/lib/ceph to 'ceph' recursively. But why do I need to do this manually?
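
For reference, this is roughly the one-liner I used for the ownership change:

Code:
chown -R ceph:ceph /var/lib/ceph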

EDIT: I repeated these commands:

Code:
pveceph purge
pveceph init --network 10.51.1.0/24
pveceph createmon

The result of 'pveceph createmon' is now:

Code:
creating /etc/pve/priv/ceph.client.admin.keyring
monmaptool: monmap file /tmp/monmap
monmaptool: generated fsid f348a071-0d1f-4bce-923d-855b6ea4fc8c
epoch 0
fsid f348a071-0d1f-4bce-923d-855b6ea4fc8c
last_changed 2017-05-17 09:57:49.259509
created 2017-05-17 09:57:49.259509
0: 10.51.1.11:6789/0 mon.0
monmaptool: writing epoch 0 to /tmp/monmap (1 monitors)
ceph-mon: set fsid to 4047e89d-1ce2-42e6-83b0-7235bc0d9084
ceph-mon: created monfs at /var/lib/ceph/mon/ceph-0 for mon.0
Job for ceph-mon@0.service failed. See 'systemctl status ceph-mon@0.service' and 'journalctl -xn' for details.
command '/bin/systemctl start ceph-mon@0' failed: exit code 1

And the last lines of 'journalctl -b -u "ceph-mon@0.service"':

Code:
mai 17 09:44:28 px-node-1 systemd[1]: ceph-mon@0.service: main process exited, code=exited, status=1/FAILURE
mai 17 09:44:28 px-node-1 systemd[1]: Unit ceph-mon@0.service entered failed state.
mai 17 09:44:39 px-node-1 systemd[1]: ceph-mon@0.service holdoff time over, scheduling restart.
mai 17 09:44:39 px-node-1 systemd[1]: Stopping Ceph cluster monitor daemon...
mai 17 09:44:39 px-node-1 systemd[1]: Starting Ceph cluster monitor daemon...
mai 17 09:44:39 px-node-1 systemd[1]: ceph-mon@0.service start request repeated too quickly, refusing to start.
mai 17 09:44:39 px-node-1 systemd[1]: Failed to start Ceph cluster monitor daemon.
mai 17 09:44:39 px-node-1 systemd[1]: Unit ceph-mon@0.service entered failed state.
mai 17 09:57:49 px-node-1 systemd[1]: Starting Ceph cluster monitor daemon...
mai 17 09:57:49 px-node-1 systemd[1]: ceph-mon@0.service start request repeated too quickly, refusing to start.
mai 17 09:57:49 px-node-1 systemd[1]: Failed to start Ceph cluster monitor daemon.

This line:

Code:
mai 17 09:57:49 px-node-1 systemd[1]: ceph-mon@0.service start request repeated too quickly, refusing to start.

What does it mean?


Moreover, this is the result of 'apt-cache depends ceph-mon' (the output is in French):

Code:
ceph-mon
  Dépend: ceph-base
  Dépend: python-flask
  Dépend: init-system-helpers
  Dépend: libboost-iostreams1.55.0
  Dépend: libboost-random1.55.0
  Dépend: libboost-system1.55.0
  Dépend: libboost-thread1.55.0
  Dépend: libc6
  Dépend: libgcc1
  Dépend: libgoogle-perftools4
  Dépend: libleveldb1
  Dépend: libnspr4
  Dépend: libnss3
  Dépend: libsnappy1
  Dépend: libstdc++6
  Dépend: zlib1g
  Recommande: ceph-common
  Casse: ceph
  Remplace: ceph

The last 2 lines mean that the 'ceph' package is broken and I need to replace it.

When I run 'ceph -s', I get:

Code:
10:19:09 # ceph -s                                                                                                                    root@px-node-1
2017-05-17 10:19:18.428817 7f59bbaa7700  0 -- :/4164518763 >> 10.51.1.11:6789/0 pipe(0x7f59c005ecf0 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f59c005c9e0).fault
2017-05-17 10:19:21.429002 7f59bb9a6700  0 -- :/4164518763 >> 10.51.1.11:6789/0 pipe(0x7f59b0000c80 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7f59b0001f90).fault
2017-05-17 10:19:24.429187 7f59bbaa7700  0 -- :/4164518763 >> 10.51.1.11:6789/0 pipe(0x7f59b0005160 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f59b0006420).fault
^CTraceback (most recent call last):
  File "/usr/bin/ceph", line 948, in <module>
    retval = main()
  File "/usr/bin/ceph", line 852, in main
    prefix='get_command_descriptions')
  File "/usr/lib/python2.7/dist-packages/ceph_argparse.py", line 1300, in json_command
    raise RuntimeError('"{0}": exception {1}'.format(argdict, e))
RuntimeError: "None": exception "['{"prefix": "get_command_descriptions"}']": exception You cannot perform that operation on a Rados object in state configuring.

So I am going to reinstall all the packages.
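
What I have in mind is roughly this (assuming apt's --reinstall option and the usual Jewel package names):

Code:
apt-get install --reinstall ceph-base ceph-mon ceph-osd ceph-common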
 
Same issue: I reinstalled all packages, but the errors still occur.

Moreover, the 'ceph' package still shows "Breaks", as seen with 'apt-cache depends ceph-mon':

Code:
11:17:24 ~ # apt-cache depends ceph-mon                                                                                  root@px-node-1
ceph-mon
  Depends: ceph-base
  Depends: python-flask
  Depends: init-system-helpers
  Depends: libboost-iostreams1.55.0
  Depends: libboost-random1.55.0
  Depends: libboost-system1.55.0
  Depends: libboost-thread1.55.0
  Depends: libc6
  Depends: libgcc1
  Depends: libgoogle-perftools4
  Depends: libleveldb1
  Depends: libnspr4
  Depends: libnss3
  Depends: libsnappy1
  Depends: libstdc++6
  Depends: zlib1g
  Recommends: ceph-common
  Breaks: ceph
  Replaces: ceph

EDIT: The monitor creation apparently works fine now. I have reinstalled ALL the packages and libraries that are needed.
 
OK, great! My monitors work well and all my OSDs are UP!

Thanks.
 
