Few Ceph questions

Not the first time I've heard of people not liking DRBD; can't blame them.

The speed of the connection between your datacenters is completely irrelevant for this issue though. You could either have an equal number of monitors in each datacenter, or have one more mon in one location. Either way, if the connection between the two locations fails, you either completely lose quorum or one of the two sides becomes inoperable. Ideally you REALLY REALLY want a third location to run one monitor. This "location" can be a VM hosted by some VPS provider or whatever, just so you have an external vote if you lose the direct connection between your two datacenters.
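As a rough sketch (the monitor name, hostname and IP below are made up), that external monitor just needs its own section in ceph.conf on every node and has to be reachable from both sites:

[mon.tiebreaker]
        host = mon-external          # small VM at a VPS provider or similar
        mon addr = 198.51.100.10:6789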
But with DRBD, won't he face that "waste of storage space" issue? Since DRBD replicates storage across an equal number of nodes?
 
I also want to avoid using 6 replicas; that seems like a huge waste of space. I am hoping I can stick with 4 replicas and get the CRUSH maps set up properly to do this.
You can certainly get away with just 4 replicas by manipulating the CRUSH map. You can control, down to the level of which HDD/OSD in which node in which rack at which site, where each replica is stored.
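For example, after adding two datacenter buckets to the CRUSH hierarchy and moving the hosts under them, a rule along these lines (the bucket and rule names are just placeholders, not from this thread) would put 2 of the 4 replicas in each site:

rule replicated_two_sites {
        ruleset 1
        type replicated
        min_size 2
        max_size 4
        step take default
        step choose firstn 2 type datacenter   # pick both datacenters
        step chooseleaf firstn 2 type host     # 2 replicas on distinct hosts in each
        step emit
}

With the pool size set to 4, that gives two copies per site on separate hosts.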

To agree with e100 in post #13, it is indeed not absolutely necessary to use Ceph with Proxmox. A Ceph cluster can sit on its own hardware using a distro such as Ubuntu, so you never have to worry about going over the 32 or 64 node limit. But it certainly makes life easier to be able to view both the Proxmox and Ceph clusters from one GUI and not rely on the Ceph CLI all the time. Besides, if you are going up to 64 nodes, that's a pretty big storage system with a giant amount of storage space. As a best practice you should chunk such a large Ceph cluster into smaller pieces anyway.
 
Interesting, I may consider those GUIs, as I really feel setting up Ceph on CentOS is far simpler because I don't have to worry about the Proxmox cluster itself. I still want to do some testing with Proxmox, but honestly the setup on CentOS is so easy I might just stick with that. I keep getting sidetracked at work with other projects, which pulls me away from this, but today I am hoping to make some solid progress.
 
Just tried Ceph-dash. Very nice. Great addition to my NOC. I can just leave it running on one dedicated monitor for real-time Ceph cluster monitoring. :)

I think I still prefer the Proxmox way, mainly because I can do the actual management, such as managing OSDs, MONs, pools, etc., whereas ceph-dash is only for monitoring. Still looks great. I find myself staring at the colorful Ceph dash :)
 
Just tried Ceph-dash. Very nice. Great addition to my NOC. I can just leave it running on one dedicated monitor for real-time Ceph cluster monitoring. :)

I think I still prefer the Proxmox way, mainly because I can do the actual management, such as managing OSDs, MONs, pools, etc., whereas ceph-dash is only for monitoring. Still looks great. I find myself staring at the colorful Ceph dash :)

Ahh, bummer that it's only a monitoring tool, although I'm starting to get a decent understanding of the CLI, and I've always been a CLI guy, so it doesn't really scare me too much.
 
Not sure about any licensing conflict, but it may be possible to include something like this to improve the PVE Ceph interface within the PVE web management.

Although the priority for this would not be high... unless there is an easy way to make it an "add-on" that would survive GUI upgrades.

Serge
 
Interesting, I may consider those GUIs, as I really feel setting up Ceph on CentOS is far simpler because I don't have to worry about the Proxmox cluster itself. I still want to do some testing with Proxmox, but honestly the setup on CentOS is so easy I might just stick with that. I keep getting sidetracked at work with other projects, which pulls me away from this, but today I am hoping to make some solid progress.
Hi,
ceph-dash is "only" for an overview - not for installation/setup of ceph.

Udo
 
I agree, this is not going to be an automatic transition.
Following is just one way this could work:
Taking the 8-node scenario, we will have one additional node in each location, powered off and not part of the cluster. As soon as a site outage is detected, fire up the additional node and join it to the cluster to restore quorum. After the other site is back up, simply remove it from the cluster and power it down. This way quorum will be restored and staff can keep working while the fallen site is being repaired.

Or,

Use an additional node at each site, but put it on a separate LAN/WAN and power supply, purely for the purpose of quorum. The node can be very underpowered and thus cheap. That way, even if the main cluster at one site goes down, quorum will remain intact.

I have been digging into this concept for a few days now and can't for the life of me get quorum to come back if a site is completely down. I am currently running a 6-node cluster; of these 6 nodes, 4 are monitors, 2 at each site. I then have 2 additional nodes, one at each site, powered off and not part of the Ceph cluster. When one site goes down, I lose quorum. I power up my additional node and can't for the life of me figure out how to add it to the cluster as a monitor while quorum is down. I simply get a timeout; maybe there is a way to force the addition, but I can't seem to figure it out. I get the timeout both from the Proxmox GUI and from the CLI when using pveceph. I also tried adding the monitor manually to the ceph.conf file, but can't seem to get quorum to come up. Any suggestions on how to force this addition to the cluster?

Well, this explains why I can't simply edit ceph.conf to add a monitor.

Strict consistency also applies to updates to the monmap. As with any other updates on the Ceph Monitor, changes to the monmap always run through a distributed consensus algorithm called Paxos. The Ceph Monitors must agree on each update to the monmap, such as adding or removing a Ceph Monitor, to ensure that each monitor in the quorum has the same version of the monmap. Updates to the monmap are incremental so that Ceph Monitors have the latest agreed upon version, and a set of previous versions. Maintaining a history enables a Ceph Monitor that has an older version of the monmap to catch up with the current state of the Ceph Storage Cluster.

If Ceph Monitors discovered each other through the Ceph configuration file instead of through the monmap, it would introduce additional risks because the Ceph configuration files aren’t updated and distributed automatically. Ceph Monitors might inadvertently use an older Ceph configuration file, fail to recognize a Ceph Monitor, fall out of a quorum, or develop a situation where Paxos isn’t able to determine the current state of the system accurately.
 
AFAIK, for the situation you described, your only way to add monitors is by injecting a new monmap into the surviving monitors.

This means that while the cluster is fully up and running, you grab a monmap (do all of this on one of the nodes that will still be up in your tests):

ceph mon getmap -o /tmp/monmap

Following steps after your other location has died:
Add a new monitor (so that you have 3/5 mons up):

monmaptool /tmp/monmap --add <name> <ip>:<port>

inject it into the mon:

service ceph stop mon.X #replace X
ceph-mon -i X --inject-monmap /tmp/monmap #replace X
 
AFAIK, for the situation you described, your only way to add monitors is by injecting a new monmap into the surviving monitors.

This means that while the cluster is fully up and running, you grab a monmap (do all of this on one of the nodes that will still be up in your tests):

ceph mon getmap -o /tmp/monmap

Following steps after your other location has died:
Add a new monitor (so that you have 3/5 mons up):

monmaptool /tmp/monmap --add <name> <ip>:<port>

inject it into the mon:

service ceph stop mon.X #replace X
ceph-mon -i X --inject-monmap /tmp/monmap #replace X

I pulled the current monmap before causing a failure.

ceph mon getmap -o monmap

I then caused a failure which killed quorum

At that point I manually added a new monitor entry to the monmap:

monmaptool monmap --add 4 172.16.0.157:6789

Printed the mon map and it looks good.

root@cephnode1:~# monmaptool --print monmap
monmaptool: monmap file monmap
epoch 16
fsid b6529a36-64f7-4058-8524-588ffce66c2b
last_changed 2014-09-24 09:44:33.119150
created 2014-09-23 14:59:53.476135
0: 172.16.0.151:6789/0 mon.0
1: 172.16.0.152:6789/0 mon.1
2: 172.16.0.154:6789/0 mon.2
3: 172.16.0.155:6789/0 mon.3
4: 172.16.0.157:6789/0 mon.4

I then stopped ceph mon.0

service ceph stop mon.0

Then I injected the new monmap

ceph-mon -i 0 --inject-monmap monmap

I then started Ceph on the node into which I was injecting the new map. Quorum did not come back, so I then attempted to start the Ceph monitor on the new monitor node which I had added. It complained about missing entries in /etc/pve/ceph.conf, so I manually added an entry. At that point I tried to start the service again, but now it's just reporting a number of errors trying to start the mon.

root@cephmon1:~# service ceph start
=== mon.4 ===
Starting Ceph mon.4 on cephmon1...
IO error: /var/lib/ceph/mon/ceph-4/store.db/LOCK: No such file or directory
IO error: /var/lib/ceph/mon/ceph-4/store.db/LOCK: No such file or directory
2014-09-24 09:56:27.357380 7f1545e9e780 -1 failed to create new leveldb store
failed: 'ulimit -n 32768; /usr/bin/ceph-mon -i 4 --pid-file /var/run/ceph/mon.4.pid -c /etc/ceph/ceph.conf --cluster ceph '
Starting ceph-create-keys on cephmon1...

So I then manually created those directories, which got me a bit further, but it is still erroring out.

root@cephmon1:~# service ceph start mon.4
=== mon.4 ===
Starting Ceph mon.4 on cephmon1...
2014-09-24 10:13:15.804474 7fa5c3be3780 -1 unable to read magic from mon data.. did you run mkcephfs?
failed: 'ulimit -n 32768; /usr/bin/ceph-mon -i 4 --pid-file /var/run/ceph/mon.4.pid -c /etc/ceph/ceph.conf --cluster ceph '
Starting ceph-create-keys on cephmon1..


Do I need to stop the Ceph monitors on all existing monitor nodes and inject the new monmap into each one?
 
Got a bit further. On the new monitor node I had to create the filesystem.

ceph-mon -i 4 --mkfs --monmap ~/monmap --keyring ~/keyring

I had to move the keyring from one of my monitors which was still up. Once this was done, quorum formed.

What is throwing me off now, though, is that Ceph is reporting all of my OSDs as up, even though half of them are down.

root@cephnode1:~# ceph status
cluster b6529a36-64f7-4058-8524-588ffce66c2b
health HEALTH_WARN 2 mons down, quorum 0,1,4 0,1,4
monmap e19: 5 mons at {0=172.16.0.151:6789/0,1=172.16.0.152:6789/0,2=172.16.0.154:6789/0,3=172.16.0.155:6789/0,4=172.16.0.157:6789/0}, election epoch 122, quorum 0,1,4 0,1,4
osdmap e124: 6 osds: 6 up, 6 in
pgmap v263: 492 pgs, 4 pools, 0 bytes data, 0 objects
232 MB used, 2668 GB / 2668 GB avail
492 active+clean

I wonder if it's because I didn't specify the fsid when I created the filesystem on the new monitor node.
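Possibly it's just that the monitors haven't yet marked the dead OSDs down via the usual heartbeat/report timeouts, rather than anything to do with the fsid; something like the following shows what the monitors currently believe:

ceph osd tree              # up/down state of each OSD as the monitors see it
ceph osd dump | grep ^osd  # same state plus addresses and weights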
 
Making some good progress. So here is my plan for quorum. I will run a total of 6 monitors, but only 5 will be running at any one time: 3 at my main site and 2 at the off-site (with a third one at the off-site kept powered off). This ensures that my main site stays up and quorate if the off-site or the link between them goes down.

I will then ensure I keep a good copy of the monmap within /etc/pve so all nodes have a copy. Do any of you see any issues with this? It would be nice if we could get Proxmox to do this itself, but that is probably a lot to ask.
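Something as simple as a cron entry could keep that copy fresh (the file name and schedule here are just an example, and it assumes /etc/pve is writable, i.e. PVE quorum is up):

# /etc/cron.d/ceph-monmap-backup - refresh the monmap copy hourly
0 * * * * root ceph mon getmap -o /etc/pve/monmap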

If I were to lose my main site, I would do the following.

- Make copy of monmap for safe keeping
cp /etc/pve/monmap /etc/pve/monmap.bak

- Remove 1 monitor from down site in monmap
monmaptool --rm 4 monmap

- Inject new monitor into monmap
monmaptool monmap --add 5 172.16.0.158:6789

- Print monmap to ensure it was added
monmaptool --print monmap

- Move the monmap to the other existing monitors and the new monitor

- Stop ceph monitor servers
service ceph stop mon.X (X is the monitor number)

- Inject new monmap into the existing monitors (do not do this on the new monitor you wish to add)
ceph-mon -i X --inject-monmap monmap

- Add new monitor entry to /etc/pve/ceph.conf and remove the one which you removed from the monmap

- Move keyring from existing monitor to new monitor
scp /var/lib/ceph/mon/ceph-3/keyring root@172.16.0.158:~

- Create monitor filesystem on new monitor
ceph-mon -i 5 --mkfs --fsid b6529a36-64f7-4058-8524-588ffce66c2b --monmap ~/monmap --keyring ~/keyring

- Start ceph monitors on existing/new monitors

- Quorum should establish and we are on our way.
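At that point, something like the following should confirm quorum is back (ceph status was already used above; mon stat and quorum_status are standard ceph CLI calls):

ceph mon stat        # lists all mons and which ones are in quorum
ceph quorum_status   # shows the current quorum members and leader
ceph status          # overall cluster health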
 
Be aware that the /etc/pve filesystem will switch to read-only, since cman will lose quorum as well, meaning that you can't move the container/VM configs into different node folders. I suppose you could lower the "expected votes", but I'm not sure how well /etc/pve deals with split-brain situations (because your off-site will have a newer version of the filesystem once the main one comes back up).
 
Be aware that the /etc/pve filesystem will switch to read-only, since cman will lose quorum as well, meaning that you can't move the container/VM configs into different node folders. I suppose you could lower the "expected votes", but I'm not sure how well /etc/pve deals with split-brain situations (because your off-site will have a newer version of the filesystem once the main one comes back up).

Yeah, I have been setting "pvecm expected 1" to bring quorum back. So far in my testing I haven't had any issues.

The next beast to tackle is the CRUSH maps.
 
Since you still seem to be testing this, would you mind testing whether /etc/pve syncs up fine after 3/5 nodes are down and you do some writes to /etc/pve on the 2 remaining nodes?
 
Much appreciated. I'm actually currently looking at implementing such a metro cluster for 2 separate customers of mine who have asked for this to be done.

The required CRUSH map alterations should be as simple as introducing datacenter buckets to separate the hosts. It is also possible to define CRUSH rules such that the primary OSD for all PGs is located at your main site.
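A sketch of such a rule, with made-up bucket names dc-main and dc-backup and assuming the hosts have already been moved under those datacenter buckets; with firstn the first OSD chosen becomes the primary, so taking dc-main first keeps the primary there:

rule primary_on_main_site {
        ruleset 2
        type replicated
        min_size 2
        max_size 4
        step take dc-main                      # primary + one replica from the main site
        step chooseleaf firstn 2 type host
        step emit
        step take dc-backup                    # remaining two replicas from the second site
        step chooseleaf firstn 2 type host
        step emit
}

Keeping the primary at the main site matters because client reads are served by the primary OSD of each PG.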
 
