Ceph problem when the master node is down

Hello to all.
I am playing with Proxmox in a demo environment with Ceph storage.

1. First node is the master / 4 OSD disks
2. Second node added to the master / 4 OSD disks
3. Third node added to the master / 4 OSD disks

When I shut down node 2 or node 3, the system stays accessible.
The problems start when the master node 1 is down: the Ceph storage and the OSD disks are no longer accessible.

Am I doing something wrong?

Does anyone know how to solve this problem?
Thanks to all
 
How did you add the IP addresses and port numbers when you connected Ceph RBD to Proxmox through the GUI?
It should look like this in storage.cfg:
rbd: <storage_name>
        monhost 192.168.1.1:6789;192.168.1.2:6789;192.168.1.3:6789
        pool <pool_name>
        ...................
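For reference, a full RBD entry in /etc/pve/storage.cfg typically looks something like this (the storage name, pool and username below are only example values, not taken from your setup):
Code:
rbd: ceph-rbd
        monhost 192.168.1.1:6789;192.168.1.2:6789;192.168.1.3:6789
        pool rbd
        content images
        username admin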
 

xmmmmmm
well Mr. Wasim,

I put these IPs via the GUI like this:
192.168.1.201 192.168.1.202 192.168.1.203

Is it necessary to also add the :6789 and the ; ?

To be honest, I didn't check the storage.cfg.

My question is: why does it work like a charm when I shut down node 2 or 3, but when I shut down the master node 1, everything stops working?
 
There should be a semicolon ( ; ) between each IP. It is possible that without the ; it only picks up the first IP as the Ceph provider node while ignoring the other IPs. Make the modification in /etc/pve/storage.cfg and try shutting down the first node. I believe it will work:
Code:
monhost 192.168.1.201:6789;192.168.1.202:6789;192.168.1.203:6789
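After saving the file, it may also be worth checking from node 2 or node 3 that each monitor answers on its own, for example (addresses as in your setup):
Code:
ceph -m 192.168.1.202:6789 -s
ceph -m 192.168.1.203:6789 -s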
 

Mr. Wasim,
thanks a lot for this information.
I am setting up a new demo environment right now to test this and I will be back.
Thanks a lot again.
Regards
 

Mr. Wasim,
same problem: when master node 1 is down, everything freezes.
My storage.cfg:
monhost 192.168.1.201:6789;192.168.1.202:6789;192.168.1.203:6789

Node 2 and node 3 can ping each other.
 
Run ceph -s or ceph health detail from node 2 or 3 and see what it shows.

From Node 2


root@demo2:~# ceph health
HEALTH_WARN 256 pgs degraded; 256 pgs stale; 256 pgs stuck unclean; recovery 3/6 objects degraded (50.000%); 4/12 in osds are down; 1 mons down, quorum 1,2 1,2


root@demo2:~# ceph -s
2015-01-07 22:53:57.799829 7f8219c71700 0 -- :/1015685 >> 192.168.1.201:6789/0 pipe(0x1ddb180 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x1ddb410).fault
cluster 6bbb954a-8c42-4d70-898d-6e6f8c69c429
health HEALTH_WARN 256 pgs degraded; 256 pgs stale; 256 pgs stuck unclean; recovery 3/6 objects degraded (50.000%); 4/12 in osds are down; 1 mons down, quorum 1,2 1,2
monmap e3: 3 mons at {0=192.168.1.201:6789/0,1=192.168.1.202:6789/0,2=192.168.1.203:6789/0}, election epoch 18, quorum 1,2 1,2
osdmap e64: 12 osds: 8 up, 12 in
pgmap v185: 256 pgs, 4 pools, 16 bytes data, 3 objects
405 MB used, 36326 MB / 36731 MB avail
3/6 objects degraded (50.000%)
256 stale+active+degraded

/////////////////////////////////////////////////////////////

From node3

root@demo3:~# ceph health
HEALTH_WARN 256 pgs degraded; 256 pgs stale; 256 pgs stuck stale; 256 pgs stuck unclean; recovery 3/6 objects degraded (50.000%); 1 mons down, quorum 1,2 1,2


root@demo3:~# ceph -s
2015-01-07 22:57:00.285642 7f69f6ba0700 0 -- :/1012409 >> 192.168.1.201:6789/0 pipe(0x1f08180 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x1f08410).fault
cluster 6bbb954a-8c42-4d70-898d-6e6f8c69c429
health HEALTH_WARN 256 pgs degraded; 256 pgs stale; 256 pgs stuck stale; 256 pgs stuck unclean; recovery 3/6 objects degraded (50.000%); 1 mons down, quorum 1,2 1,2
monmap e3: 3 mons at {0=192.168.1.201:6789/0,1=192.168.1.202:6789/0,2=192.168.1.203:6789/0}, election epoch 18, quorum 1,2 1,2
osdmap e66: 12 osds: 8 up, 8 in
pgmap v188: 256 pgs, 4 pools, 16 bytes data, 3 objects
269 MB used, 24218 MB / 24487 MB avail
3/6 objects degraded (50.000%)
256 stale+active+degraded

regards
 
How many Ceph Monitors do you have? You need a minimum of 2 MONs to create a quorum for your 3 nodes.

Actually you need at least 3 monitors to make a quorum. Two is not safe and not recommended.

From the Ceph documentation:
For high availability, you should run a production Ceph cluster with AT LEAST three monitors. Ceph uses the Paxos algorithm, which requires a consensus among the majority of monitors in a quorum. With Paxos, the monitors cannot determine a majority for establishing a quorum with only two monitors. A majority of monitors must be counted as such: 1:1, 2:3, 3:4, 3:5, 4:6, etc.
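In other words: with 3 monitors the cluster can lose one and the remaining 2 out of 3 still form a majority, so the quorum survives; with only 2 monitors, losing one leaves 1 out of 2, which is not a majority, and the monitors block.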
 
The same problem with 4 nodes right now: shutting down node 2, node 3 or node 4 is no problem.
When I shut down the master node 1, it returns communication failure (0).

root@demo2:~# ceph health
HEALTH_WARN 256 pgs degraded; 256 pgs stale; 256 pgs stuck stale; 256 pgs stuck unclean; recovery 3/6 objects degraded (50.000%); 1 mons down, quorum 1,2,3 1,2,3

root@demo2:~# ceph -s
cluster 6bbb954a-8c42-4d70-898d-6e6f8c69c429
health HEALTH_WARN 256 pgs degraded; 256 pgs stale; 256 pgs stuck stale; 256 pgs stuck unclean; recovery 3/6 objects degraded (50.000%); 1 mons down, quorum 1,2,3 1,2,3
monmap e4: 4 mons at {0=192.168.1.201:6789/0,1=192.168.1.202:6789/0,2=192.168.1.203:6789/0,3=192.168.1.204:6789/0}, election epoch 28, quorum 1,2,3 1,2,3
osdmap e113: 16 osds: 12 up, 12 in
pgmap v358: 256 pgs, 4 pools, 16 bytes data, 3 objects
414 MB used, 36317 MB / 36731 MB avail
3/6 objects degraded (50.000%)
256 stale+active+degraded
 
Hi,
3 monitors are enough, no need to have 4.


What is your pool configuration?

#ceph osd pool get yourpool size
#ceph osd pool get yourpool min_size
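If it helps, the replication settings of all pools can also be listed in one go with something like:
Code:
ceph osd dump | grep 'replicated size'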

Yes, 3 monitors are enough for a small to medium-sized cluster. And, as spirit has pointed out, please show us your pool size configuration; your CRUSH map would also be helpful.
 

Hello to all, thanks for the help. I ran these commands while the master node is down.

root@demo2:~# ceph osd pool get mystorage size
2015-01-08 18:19:50.871026 7fbd2e364700 0 -- :/1028891 >> 192.168.1.201:6789/0 pipe(0x128b180 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x128b410).fault
size: 2


root@demo2:~# ceph osd pool get mystorage min_size
2015-01-08 18:20:25.494974 7f9374a75700 0 -- :/1029056 >> 192.168.1.201:6789/0 pipe(0x1a65180 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x1a65410).fault
min_size: 1
 

Take a screenshot of the Pool info and the CRUSH map from the Proxmox GUI of node 2 or 3.
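If the GUI is hard to reach while the cluster is degraded, the CRUSH map can also be dumped and decompiled on the command line, roughly like this:
Code:
ceph osd getcrushmap -o /tmp/crushmap.bin
crushtool -d /tmp/crushmap.bin -o /tmp/crushmap.txt
cat /tmp/crushmap.txt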
 
It would be dangerous to have 4 monitors! You need an odd number of monitors, with a minimum of 3. If you have an even number, then the system can't always be sure whether there is a network partition or not.
 
I have already seen this kind of message when the first monitor is down.

It keeps working, but it displays this warning.

Just to be sure, are your VMs crashing?

Or is it only a problem in the Proxmox GUI?

(Maybe these messages impact the Proxmox API requests.)
 

This is the CRUSH map:

Code:
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8
device 9 osd.9
device 10 osd.10
device 11 osd.11

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host demo1 {
        id -2           # do not change unnecessarily
        # weight 0.000
        alg straw
        hash 0  # rjenkins1
        item osd.0 weight 0.000
        item osd.1 weight 0.000
        item osd.2 weight 0.000
        item osd.3 weight 0.000
}
host demo2 {
        id -3           # do not change unnecessarily
        # weight 0.000
        alg straw
        hash 0  # rjenkins1
        item osd.4 weight 0.000
        item osd.5 weight 0.000
        item osd.6 weight 0.000
        item osd.7 weight 0.000
}
host demo3 {
        id -4           # do not change unnecessarily
        # weight 0.000
        alg straw
        hash 0  # rjenkins1
        item osd.8 weight 0.000
        item osd.9 weight 0.000
        item osd.10 weight 0.000
        item osd.11 weight 0.000
}
root default {
        id -1           # do not change unnecessarily
        # weight 0.000
        alg straw
        hash 0  # rjenkins1
        item demo1 weight 0.000
        item demo2 weight 0.000
        item demo3 weight 0.000
}

# rules
rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
}
# end crush map

(screenshot attached: Untitled.png)

I repeat again: when I shut down node 2 or node 3, the Ceph storage is okay; the problem starts when the master node 1 is down.