I already have a solution for my problem, but I want to share it. I'd also like some feedback on it, or perhaps there is a simpler solution? Maybe it's all my own fault, as this is the first time I am using Ceph with Proxmox.
Server setup (all servers run 4.3-1/e7cdc165):
- 2 x storage servers which don't run any VMs; each runs a Ceph monitor and several OSDs
- 2 x servers which run the virtual machines; these only run a Ceph monitor
- all servers are in a cluster; storage connections are 10 GbE
- there are two rings for Corosync, each on different NICs and switches
- VMs do get started, but they don't actually work: Proxmox thinks the Ceph storage is available and starts the VMs, so they appear started in the UI
- Proxmox displays the Ceph storage as available, but trying to show its contents fails/times out
- so the Ceph storage is in fact not available
Looking at "journalctl -b | grep ceph", it seems the OSDs can't be started:
Code:
Sep 30 15:07:27 pvestorage1 ceph[1515]: === osd.0 ===
Sep 30 15:07:33 pvestorage1 ceph[1515]: 2016-09-30 15:07:33.430172 7fc4c4489700 0 -- 192.168.10.10:0/1157019147 >> 192.168.10.200:6789/0 pipe(0x7fc4b0000c00 sd=6 :0 s=1 pgs=0 cs=0 l=1 c=0x7fc4b0004ef0).fault
Sep 30 15:07:39 pvestorage1 ceph[1515]: 2016-09-30 15:07:39.429507 7fc4c468b700 0 -- 192.168.10.10:0/1157019147 >> 192.168.10.200:6789/0 pipe(0x7fc4b0000c00 sd=6 :0 s=1 pgs=0 cs=0 l=1 c=0x7fc4b0006470).fault
Sep 30 15:07:45 pvestorage1 ceph[1515]: 2016-09-30 15:07:45.429287 7fc4c458a700 0 -- 192.168.10.10:0/1157019147 >> 192.168.10.200:6789/0 pipe(0x7fc4b0000c00 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7fc4b0004ea0).fault
Sep 30 15:07:51 pvestorage1 ceph[1515]: 2016-09-30 15:07:51.429223 7fc4c4489700 0 -- 192.168.10.10:0/1157019147 >> 192.168.10.200:6789/0 pipe(0x7fc4b0000c00 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7fc4b0004ea0).fault
Sep 30 15:07:58 pvestorage1 ceph[1515]: failed: 'timeout 30 /usr/bin/ceph -c /etc/ceph/ceph.conf --name=osd.0 --keyring=/var/lib/ceph/osd/ceph-0/keyring osd crush create-or-move -- 0 0.87 host=pvestorage1 root=default'
Sep 30 15:07:58 pvestorage1 ceph[1515]: ceph-disk: Error: ceph osd start failed: Command '['/usr/sbin/service', 'ceph', '--cluster', 'ceph', 'start', 'osd.0']' returned non-zero exit status 1
Sep 30 15:07:58 pvestorage1 ceph[1515]: ceph-disk: Error: One or more partitions failed to activate
"journalctl -b | grep mount" shows:
Code:
Sep 30 15:06:57 pvestorage1 kernel: XFS (sdc1): Ending clean mount
Sep 30 15:07:27 pvestorage1 kernel: XFS (sdb1): Ending clean mount
I don't know why Proxmox still thinks the OSDs are running. After seeing this I tried to start the OSDs manually with "pveceph start". This works: all OSDs are started and I can access the RBD images.
To automate this so I don't have to start Ceph and the VMs manually, I wrote the following two scripts. The first starts the OSD daemons; the second starts all VMs that are flagged to start at boot. The first script runs on every node that hosts Ceph OSD daemons, and the second on every node that hosts VMs.
I call both scripts from "/etc/rc.local".
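For reference, the rc.local entries could look like the sketch below. The script paths are placeholders for wherever the two scripts are saved; running them in the background keeps rc.local from blocking the rest of the boot while the scripts wait for the monitors and storage:

```shell
#!/bin/sh -e
# /etc/rc.local
# Run both helper scripts in the background so boot is not blocked
# while they poll for the Ceph monitors / storage to come up.
# The paths below are placeholders for wherever the scripts are stored.
/root/start-ceph-osds.sh &
/root/start-vms.sh &
exit 0
```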
First, the script to start the Ceph OSD daemons:
Code:
#!/bin/sh
# Wait until all monitor IPs accept connections on the monitor port,
# then start the local OSDs.
MON_PORT="6789"
MON_NAME1="192.168.10.10"
MON_NAME2="192.168.10.20"
MON_NAME3="192.168.10.100"
MON_NAME4="192.168.10.200"
LOOP=true
while $LOOP
do
    # nc -z only probes whether the port accepts a connection.
    if nc -z "$MON_NAME1" "$MON_PORT" 2>/dev/null &&
       nc -z "$MON_NAME2" "$MON_PORT" 2>/dev/null &&
       nc -z "$MON_NAME3" "$MON_PORT" 2>/dev/null &&
       nc -z "$MON_NAME4" "$MON_PORT" 2>/dev/null
    then
        pveceph start
        LOOP=false
    else
        sleep 10
    fi
done
Second, the script to start the VMs once the Ceph storage is available:
Code:
#!/bin/sh
# Start all onboot-flagged VMs once every listed Ceph storage responds.
#STORAGE='vmimages fastvmimages'
STORAGE='vmimages'
GRACETIME=20
LOOP=true
while [ "$LOOP" = true ]
do
    LOOP=false
    for item in $STORAGE
    do
        # Listing the storage contents times out while Ceph is down.
        # Note: the intended redirection is ">/dev/null 2>&1";
        # "2>1" would write stderr to a file literally named "1".
        if ! timeout 5 pvesh get "nodes/localhost/storage/$item/content" >/dev/null 2>&1
        then
            LOOP=true
            break
        fi
    done
    if [ "$LOOP" = false ]
    then
        sleep "$GRACETIME"
        pvesh create nodes/localhost/startall >/dev/null 2>&1
    else
        sleep 10
    fi
done
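The startall call only picks up VMs whose start-at-boot flag is set. Assuming the standard Proxmox `qm` CLI, a VM can be flagged like this (VM ID 100 is just an example):

```shell
# Mark VM 100 to be started at boot, so "startall" will pick it up.
# The VM ID 100 is an example; use your own VM's ID.
qm set 100 --onboot 1

# Verify the flag:
qm config 100 | grep onboot
```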
With those scripts, everything works fine after I reboot the cluster or a single node.
Best regards
Jonas Stunkat