No OSDs configurable on two nodes of a Ceph Cluster

hape

Renowned Member
Jun 10, 2013
Hello all,

I've installed a new three-host cluster to build a Ceph-based HA cluster. Two of the hosts were installed from the 5.0 ISO and one from the 5.1 ISO; the two hosts installed with 5.0 have since been upgraded to 5.1 via the subscription repository.

For your information: I have installed Ceph in the "Luminous" version on all hosts.

I have now initialized a Ceph cluster across all three members. All monitors see each other and the health status is OK. But when I configure OSDs on each node to build the cluster, I only see the OSD on the host that was installed directly with 5.1. The OSDs on the other two nodes are configured, and I can see that the partitions were created properly, but the OSDs are not visible.

What's wrong with my setup? Should I reinstall the two hosts that were originally installed with 5.0?
 
Please post a 'pveversion -v', a 'ceph -s', a 'ceph osd tree' and any log message that you find relevant (e.g. syslog/kernel.log).
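If it helps, something along these lines, run on each node, should collect most of that in one go (the log paths are the usual Debian/PVE defaults and may differ on your setup):

Code:
# Versions and cluster state
pveversion -v
ceph -s
ceph osd tree
# Recent Ceph-related messages from syslog and the cluster log
grep -i ceph /var/log/syslog | tail -n 100
tail -n 100 /var/log/ceph/ceph.log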
 
**************************************************
** "pveversion -v" of host 1 (osd is coming up) **
**************************************************

proxmox-ve: 5.1-28 (running kernel: 4.13.8-2-pve)
pve-manager: 5.1-36 (running version: 5.1-36/131401db)
pve-kernel-4.13.4-1-pve: 4.13.4-26
pve-kernel-4.13.8-2-pve: 4.13.8-28
libpve-http-server-perl: 2.0-6
lvm2: 2.02.168-pve6
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-15
qemu-server: 5.0-17
pve-firmware: 2.0-3
libpve-common-perl: 5.0-20
libpve-guest-common-perl: 2.0-13
libpve-access-control: 5.0-7
libpve-storage-perl: 5.0-16
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-2
pve-docs: 5.1-12
pve-qemu-kvm: 2.9.1-2
pve-container: 2.0-17
pve-firewall: 3.0-3
pve-ha-manager: 2.0-3
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.0-2
lxcfs: 2.0.7-pve4
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.7.3-pve1~bpo9
ceph: 12.2.1-pve3

**********************************************************
** "pveversion -v" of hosts 2 & 3 (osd isn't coming up) **
**********************************************************

proxmox-ve: 5.1-28 (running kernel: 4.13.8-2-pve)
pve-manager: 5.1-36 (running version: 5.1-36/131401db)
pve-kernel-4.13.8-2-pve: 4.13.8-28
pve-kernel-4.10.15-1-pve: 4.10.15-15
libpve-http-server-perl: 2.0-6
lvm2: 2.02.168-pve6
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-15
qemu-server: 5.0-17
pve-firmware: 2.0-3
libpve-common-perl: 5.0-20
libpve-guest-common-perl: 2.0-13
libpve-access-control: 5.0-7
libpve-storage-perl: 5.0-16
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-2
pve-docs: 5.1-12
pve-qemu-kvm: 2.9.1-2
pve-container: 2.0-17
pve-firewall: 3.0-3
pve-ha-manager: 2.0-3
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.0-2
lxcfs: 2.0.7-pve4
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.7.3-pve1~bpo9
ceph: 12.2.1-pve3

*********************************
** "ceph -s" of all three hosts **
*********************************

  cluster:
    id:     05120acf-8b92-4160-917f-5ab48c5ecc10
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum virtfarm-pgp-1,virtfarm-pgp-2,virtfarm-pgp-3
    mgr: virtfarm-pgp-3(active), standbys: virtfarm-pgp-2
    osd: 1 osds: 1 up, 1 in

  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 bytes
    usage:   1062 MB used, 1861 GB / 1862 GB avail
    pgs:

****************************************
** "ceph osd tree" of all three hosts **
****************************************

ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 1.81929 root default
-3 1.81929 host virtfarm-pgp-1
0 hdd 1.81929 osd.0 up 1.00000 1.00000
 
Please post output from commands or code in CODE tags (found under the little plus in the editor).
Anything in the logs (ceph.log, syslog, kernel.log)?
 
Attachments

Code:
Dec  5 06:53:18 virtfarm-pgp-1 smartd[809]: Device: /dev/sdd [SAT], FAILED SMART self-check. BACK UP DATA NOW!
Dec  5 06:53:18 virtfarm-pgp-1 smartd[809]: Device: /dev/sdd [SAT], Failed SMART usage Attribute: 5 Reallocated_Sector_Ct.
You have a disk that will soon die!
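If in doubt, the full SMART report gives more detail; smartmontools is already installed on PVE, and /dev/sdd is just the device named in the smartd message above:

Code:
# Full SMART report for the suspect disk
smartctl -a /dev/sdd
# Just the overall verdict and the sector counters
smartctl -H /dev/sdd
smartctl -A /dev/sdd | grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector'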

Code:
Dec  5 07:21:08 virtfarm-pgp-3 smartd[1150]: Device: /dev/sda [SAT], 1 Currently unreadable (pending) sectors
Well, see above. While remapping may be possible, how much do you trust such a drive?

Code:
Dec  5 11:28:08 virtfarm-pgp-3 ceph-mon[1695]: 2017-12-05 11:28:08.138721 7f24b37d4700 -1 mon.virtfarm-pgp-3@2(peon).paxos(paxos updating c 78564..79194) lease_expire from mon.0 10.84.0.10:6789/0 is 0.131507 seconds in the past; mons are probably laggy (or possibly clocks are too skewed)

Dec  5 11:39:10 virtfarm-pgp-2 ceph-mon[1535]: 2017-12-05 11:39:10.660282 7fb4332a1700 -1 mon.virtfarm-pgp-2@1(peon).paxos(paxos updating c 78815..79517) lease_expire from mon.0 10.84.0.10:6789/0 is 2.537121 seconds in the past; mons are probably laggy (or possibly clocks are too skewed)
Looks like the time is not synced on all hosts, or you have high network latency.
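A quick check, assuming the nodes use the default systemd-timesyncd (adjust accordingly if you run ntpd or chrony):

Code:
# Local clock and NTP sync state on each node
timedatectl status
systemctl status systemd-timesyncd
# The monitors' own view of clock skew between them
ceph time-sync-status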

Code:
2017-12-05 11:27:30.554398 mon.virtfarm-pgp-2 mon.1 10.84.0.11:6789/0 327 : cluster [WRN] Health check failed: 1/3 mons down, quorum virtfarm-pgp-2,virtfarm-pgp-3 (MON_DOWN)
Do you see flapping monitors?
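One way to see it is to watch the quorum membership for a while and to look at the log of the monitor that drops out; virtfarm-pgp-1 below is the one missing from the MON_DOWN message above:

Code:
# Quorum membership should stay stable over time
watch -n 5 'ceph quorum_status --format json-pretty | grep -A 5 quorum_names'
# On the monitor that drops out, look for repeated elections and lease problems
grep -E 'calling new monitor election|lease' /var/log/ceph/ceph-mon.virtfarm-pgp-1.log | tail -n 50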

Code:
Dec  5 11:24:34 virtfarm-pgp-2 sh[9483]: command_with_stdin: 2017-12-05 11:24:34.249430 7f7c07003700  0 librados: client.bootstrap-osd authentication error (1) Operation not permitted
Dec  5 11:24:34 virtfarm-pgp-2 sh[9483]: [errno 1] error connecting to the cluster
Your OSDs may simply not be starting because they cannot connect to the monitors.
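A sketch of what I would compare on the affected nodes, assuming the standard keyring location (/var/lib/ceph/bootstrap-osd/ceph.keyring):

Code:
# Key the cluster expects for the bootstrap-osd client
ceph auth get client.bootstrap-osd
# Key the node actually uses when creating OSDs
cat /var/lib/ceph/bootstrap-osd/ceph.keyring
# If they differ, re-export the cluster's key to the node (keep a copy of the old file)
ceph auth get client.bootstrap-osd -o /var/lib/ceph/bootstrap-osd/ceph.keyring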

I was just skimming through the log files; I suspect there is more to find. Go through the following links, as they provide vital information for a cluster setup.
http://docs.ceph.com/docs/luminous/start/hardware-recommendations/
https://forum.proxmox.com/threads/ceph-raw-usage-grows-by-itself.38395/#post-189842
 
Hello again,

I have now reinstalled the two nodes that had been upgraded from 5.0, this time from the 5.1 ISO, and redone the whole cluster and Ceph setup. Now it all runs well.

It seems that Ceph won't run across a mixed setup of nodes upgraded to 5.1 and nodes installed directly with 5.1. The monitors were visible, but I was not able to create OSDs in such a setup; or rather, the OSDs I created could not be seen.

Now everything is running fine. Of course I still have to look into the HDD failures on one of the nodes' disks, but that is another problem ;-)

Regards

Hans-Peter
 
It seems that Ceph won't run across a mixed setup of nodes upgraded to 5.1 and nodes installed directly with 5.1. The monitors were visible, but I was not able to create OSDs in such a setup; or rather, the OSDs I created could not be seen.
Hm... that's usually not a problem; we have upgrade documentation for PVE 4.x -> 5.1 with steps for Ceph (Jewel -> Luminous). But glad it works out for you now. :)