Ceph problem when master node is out

Hello to all.
I am playing with Proxmox in a demo environment with Ceph storage.

1. First node is the master / 4 OSD disks
2. Second node added to the master / 4 OSD disks
3. Third node added to the master / 4 OSD disks

When I shut down node 2 or node 3, the system stays accessible.
The problems start when the master node 1 is down: the Ceph storage and the OSD disks are no longer accessible.

Am I doing something wrong?

Maybe someone knows how to solve this problem?
Thanks to all
 
How did you add the IP addresses and port numbers when you connected the Ceph RBD storage to Proxmox through the GUI?
It should look like this in storage.cfg:
rbd: <storage_name>
monhost 192.168.1.1:6789;192.168.1.2:6789;192.168.1.3:6789
pool <pool_name>
...................
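For reference, a complete rbd entry in /etc/pve/storage.cfg usually looks roughly like the sketch below; the storage name, pool name and authentication details are only placeholders and depend on your setup:
Code:
rbd: my-ceph-storage
        monhost 192.168.1.1:6789;192.168.1.2:6789;192.168.1.3:6789
        pool rbd
        content images
        username admin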
 

Hmm,
well Mr. Wasim,

I entered these IPs via the GUI like this:
192.168.1.201 192.168.1.202 192.168.1.203

Is it necessary to also add the :6789 and the ; ?

To be honest, I didn't check storage.cfg.

My question is: why does it work like a charm when I shut down node 2 or 3, but everything stops working when I shut down the master node 1?
 
There should be a semicolon ( ; ) between each IP. It is possible that without the ; it only picks up the first IP as the Ceph provider node while ignoring the other IPs. Make the modification in /etc/pve/storage.cfg and try to shut down the first node. I believe it will work:
Code:
monhost 192.168.1.201:6789;192.168.1.202:6789;192.168.1.203:6789
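After editing the file, a quick way to confirm that each node can still reach the storage (assuming a standard Proxmox install; replace <storage_name> with the name of your rbd storage) is:
Code:
pvesm status
pvesm list <storage_name>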
 

Mr. Wasim,
thanks a lot for this information.
I am setting up a new demo environment right now to test this, and I will be back.
Thanks a lot again.
Regards
 

Mr. Wasim,
the same problem: when master node 1 is down, everything freezes.
My storage.cfg:
monhost 192.168.1.201:6789;192.168.1.202:6789;192.168.1.203:6789

Node 2 and node 3 can ping each other.
 
Run ceph -s or ceph health detail from node 2 or 3 and see what it shows.
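If the command hangs because the client keeps trying the monitor on node 1, you can also point it at one of the surviving monitors explicitly with the -m option, for example:
Code:
ceph -m 192.168.1.202:6789 -s
ceph -m 192.168.1.202:6789 health detail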

From Node 2


root@demo2:~# ceph health
HEALTH_WARN 256 pgs degraded; 256 pgs stale; 256 pgs stuck unclean; recovery 3/6 objects degraded (50.000%); 4/12 in osds are down; 1 mons down, quorum 1,2 1,2


root@demo2:~# ceph -s
2015-01-07 22:53:57.799829 7f8219c71700 0 -- :/1015685 >> 192.168.1.201:6789/0 pipe(0x1ddb180 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x1ddb410).fault
cluster 6bbb954a-8c42-4d70-898d-6e6f8c69c429
health HEALTH_WARN 256 pgs degraded; 256 pgs stale; 256 pgs stuck unclean; recovery 3/6 objects degraded (50.000%); 4/12 in osds are down; 1 mons down, quorum 1,2 1,2
monmap e3: 3 mons at {0=192.168.1.201:6789/0,1=192.168.1.202:6789/0,2=192.168.1.203:6789/0}, election epoch 18, quorum 1,2 1,2
osdmap e64: 12 osds: 8 up, 12 in
pgmap v185: 256 pgs, 4 pools, 16 bytes data, 3 objects
405 MB used, 36326 MB / 36731 MB avail
3/6 objects degraded (50.000%)
256 stale+active+degraded

/////////////////////////////////////////////////////////////

From node3

root@demo3:~# ceph health
HEALTH_WARN 256 pgs degraded; 256 pgs stale; 256 pgs stuck stale; 256 pgs stuck unclean; recovery 3/6 objects degraded (50.000%); 1 mons down, quorum 1,2 1,2


root@demo3:~# ceph -s
2015-01-07 22:57:00.285642 7f69f6ba0700 0 -- :/1012409 >> 192.168.1.201:6789/0 pipe(0x1f08180 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x1f08410).fault
cluster 6bbb954a-8c42-4d70-898d-6e6f8c69c429
health HEALTH_WARN 256 pgs degraded; 256 pgs stale; 256 pgs stuck stale; 256 pgs stuck unclean; recovery 3/6 objects degraded (50.000%); 1 mons down, quorum 1,2 1,2
monmap e3: 3 mons at {0=192.168.1.201:6789/0,1=192.168.1.202:6789/0,2=192.168.1.203:6789/0}, election epoch 18, quorum 1,2 1,2
osdmap e66: 12 osds: 8 up, 8 in
pgmap v188: 256 pgs, 4 pools, 16 bytes data, 3 objects
269 MB used, 24218 MB / 24487 MB avail
3/6 objects degraded (50.000%)
256 stale+active+degraded

regards
 
How many Ceph monitors do you have? You need a minimum of 2 MONs to form a quorum for your 3 nodes.

Actually you need at least 3 monitors to make a quorum. Two is not safe and not recommended.

From the Ceph documentation:
For high availability, you should run a production Ceph cluster with AT LEAST three monitors. Ceph uses the Paxos algorithm, which requires a consensus among the majority of monitors in a quorum. With Paxos, the monitors cannot determine a majority for establishing a quorum with only two monitors. A majority of monitors must be counted as such: 1:1, 2:3, 3:4, 3:5, 4:6, etc.
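To see how many monitors are configured and which ones currently form the quorum, you can run something like this from any node that can still reach a monitor:
Code:
ceph mon stat
ceph quorum_status --format json-pretty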
 
The same problem with 4 nodes right now: shutting down node 2, node 3 or node 4 is no problem.
When I shut down the master node 1, it returns communication failure (0).

root@demo2:~# ceph health
HEALTH_WARN 256 pgs degraded; 256 pgs stale; 256 pgs stuck stale; 256 pgs stuck unclean; recovery 3/6 objects degraded (50.000%); 1 mons down, quorum 1,2,3 1,2,3

root@demo2:~# ceph -s
cluster 6bbb954a-8c42-4d70-898d-6e6f8c69c429
health HEALTH_WARN 256 pgs degraded; 256 pgs stale; 256 pgs stuck stale; 256 pgs stuck unclean; recovery 3/6 objects degraded (50.000%); 1 mons down, quorum 1,2,3 1,2,3
monmap e4: 4 mons at {0=192.168.1.201:6789/0,1=192.168.1.202:6789/0,2=192.168.1.203:6789/0,3=192.168.1.204:6789/0}, election epoch 28, quorum 1,2,3 1,2,3
osdmap e113: 16 osds: 12 up, 12 in
pgmap v358: 256 pgs, 4 pools, 16 bytes data, 3 objects
414 MB used, 36317 MB / 36731 MB avail
3/6 objects degraded (50.000%)
256 stale+active+degraded
 
Hi,
3 monitors are enough, no need to have 4.


What is your pool configuration?

#ceph osd pool get yourpool size
#ceph osd pool get yourpool min_size

Yes, 3 monitors are enough for a small to medium-sized cluster. And as spirit has pointed out, please show us your pool size configuration; your CRUSH map would also be helpful.
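If the GUI is awkward, the CRUSH map can also be dumped and decompiled from the command line on any node, roughly like this (the output paths are just examples):
Code:
ceph osd getcrushmap -o /tmp/crushmap.bin
crushtool -d /tmp/crushmap.bin -o /tmp/crushmap.txt
cat /tmp/crushmap.txt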
 

Hello to all, thanks for the help. I ran these commands while the master node was down.

root@demo2:~# ceph osd pool get mystorage size
2015-01-08 18:19:50.871026 7fbd2e364700 0 -- :/1028891 >> 192.168.1.201:6789/0 pipe(0x128b180 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x128b410).fault
size: 2


root@demo2:~# ceph osd pool get mystorage min_size
2015-01-08 18:20:25.494974 7f9374a75700 0 -- :/1029056 >> 192.168.1.201:6789/0 pipe(0x1a65180 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x1a65410).fault
min_size: 1
 

Take a screenshot of the Pool info and the CRUSH map from the Proxmox GUI of node 2 or 3.
 
Hi,
3 monitors are enough, no need to have 4.

It would be dangerous to have 4! You need an odd number of monitors, with a minimum of 3. If you have an even number then the system can't always be sure if there is a network partition or not.
 
I have already seen this kind of message when the first monitor is down.

It's working, but it displays this warning.

Just to be sure, are your VMs crashing?

Or is it only a problem in the Proxmox GUI?

(Maybe these messages impact Proxmox API requests.)
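A simple way to check whether the storage itself is still usable from node 2 or 3, independently of the GUI, is to list the pool contents directly; assuming mystorage is the pool name from your earlier output:
Code:
rbd ls -p mystorage
rados -p mystorage ls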
 
Take a screenshot of the Pool info and the CRUSH map from the Proxmox GUI of node 2 or 3.

This is the CRUSH map:

//////////////////////////////////////////////////////////////////
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8
device 9 osd.9
device 10 osd.10
device 11 osd.11

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host demo1 {
        id -2           # do not change unnecessarily
        # weight 0.000
        alg straw
        hash 0          # rjenkins1
        item osd.0 weight 0.000
        item osd.1 weight 0.000
        item osd.2 weight 0.000
        item osd.3 weight 0.000
}
host demo2 {
        id -3           # do not change unnecessarily
        # weight 0.000
        alg straw
        hash 0          # rjenkins1
        item osd.4 weight 0.000
        item osd.5 weight 0.000
        item osd.6 weight 0.000
        item osd.7 weight 0.000
}
host demo3 {
        id -4           # do not change unnecessarily
        # weight 0.000
        alg straw
        hash 0          # rjenkins1
        item osd.8 weight 0.000
        item osd.9 weight 0.000
        item osd.10 weight 0.000
        item osd.11 weight 0.000
}
root default {
        id -1           # do not change unnecessarily
        # weight 0.000
        alg straw
        hash 0          # rjenkins1
        item demo1 weight 0.000
        item demo2 weight 0.000
        item demo3 weight 0.000
}

# rules
rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
}

# end crush map
////////////////////////////////////////////////////////////////////////////////////

[attached screenshot: Untitled.png]

I repeat again: when I shut down node 2 or node 3, the Ceph storage is okay; the problem starts when the master node 1 is down.
 
