Proxmox VE Ceph Server released (beta)

impire · May 30, 2014

symmcom said:
Were you talking about image #1 or image #2 when you said 'need to make sure used space can be no more than......'?

Thank you. I was referring to Image #2, which shows the available space. The size shown is the total size of the entire pool. Since the pool in your example image has 3 replicas, shouldn't it show 17.53TB as the available space (52.59 / 3)?

symmcom said:
This is my opinion based on experience; I think the following formula is a good way to decide on the number of replicas:
# of Nodes - 1 = Replicas

So a 3-node Ceph cluster will have 2 replicas, a 5-node Ceph cluster will have 4 replicas, and so on. Having 3 replicas on 3 nodes will allow 2 simultaneous node failures while the cluster keeps functioning fine. With 2 replicas, on the other hand, 2 node failures may cause a huge issue. But what are the chances that 2 machines will fail at the same time? Don't be confused when I say that with 3 replicas everything will work just fine even with 2 node failures: the cluster will still have to rebalance, which makes everything slower while it tries to recover. But I don't believe you will face any data loss due to stuck, unclean or stale PGs. Hope this makes sense.

Thank you very much, but I am confused; this doesn't make sense to me.

So let's say I have 3 nodes of 10TB each, 30TB in total. 2 replicas will give me 15TB of storage.

Let's add a 4th node with another 10TB; now the total is 40TB, and 3 replicas will give me 13.33TB of storage.

Let's add a 5th node with another 10TB; now the total is 50TB, and 4 replicas will give me 12.5TB of storage.

If I follow the rule of # of Nodes - 1 = Replicas, I don't gain any storage space by adding nodes; I actually lose storage space.

Another related question: is there an advantage to having more than 3 replicas? Isn't 3 already solid enough to avoid a single point of failure?

Thank you very much for your help. I really appreciate it.

wahmed · May 31, 2014

impire said:
Thank you. I was referring to Image #2, which shows the available space. The size shown is the total size of the entire pool. Since the pool in your example image has 3 replicas, shouldn't it show 17.53TB as the available space (52.59 / 3)?

The available space shown is always the total raw storage space of the whole cluster, regardless of how many replicas you have.

impire said:
Thank you very much, but I am confused; this doesn't make sense to me.

So let's say I have 3 nodes of 10TB each, 30TB in total. 2 replicas will give me 15TB of storage.

Let's add a 4th node with another 10TB; now the total is 40TB, and 3 replicas will give me 13.33TB of storage.

Let's add a 5th node with another 10TB; now the total is 50TB, and 4 replicas will give me 12.5TB of storage.

If I follow the rule of # of Nodes - 1 = Replicas, I don't gain any storage space by adding nodes; I actually lose storage space.

For a small Ceph cluster of 3 or 4 nodes, the (# of Nodes - 1) formula might be OK, but as the number of nodes grows, you do not necessarily need to follow the same formula. In a 10-node Ceph cluster, for example, 4 replicas should be plenty. That's 4 copies of all data across 10 nodes. I apologize if I confused you more with the formula.
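
The arithmetic behind the numbers quoted above can be sketched in a few lines of Python. This is only an illustration: it assumes equally sized nodes, ignores Ceph's full-ratio headroom and any filesystem overhead, and the function name `usable_tb` is made up for this example.

```python
def usable_tb(nodes: int, tb_per_node: float, replicas: int) -> float:
    """Raw capacity divided by the replica count (no overhead, no full ratios)."""
    if replicas < 1 or replicas > nodes:
        raise ValueError("replicas must be between 1 and the number of nodes")
    return nodes * tb_per_node / replicas


# "# of Nodes - 1" rule: usable space shrinks as nodes are added.
for nodes in (3, 4, 5):
    print(nodes, "nodes,", nodes - 1, "replicas:",
          round(usable_tb(nodes, 10, nodes - 1), 2), "TB")

# Fixed replica count of 3: usable space grows with every node.
for nodes in (3, 4, 5, 10):
    print(nodes, "nodes, 3 replicas:",
          round(usable_tb(nodes, 10, 3), 2), "TB")
```

With the replica count held at 3, each added 10TB node contributes roughly 3.33TB of usable space, which is why the formula stops being useful once the cluster grows.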

impire said:
Another related question: is there an advantage to having more than 3 replicas? Isn't 3 already solid enough to avoid a single point of failure?

The comment above should already answer that. I don't think there is a sweet spot for the number of replicas. It is going to depend on the size of the cluster. A 20-node Ceph cluster, for example, can easily afford 4 or 5 replicas.

impire · May 31, 2014

symmcom said:
The available space shown is always the total raw storage space of the whole cluster, regardless of how many replicas you have.

Thank you very much for your help.

Back to my original comment about how Proxmox/Ceph shows available storage space. Since it shows the TOTAL available space and not the ACTUAL usable space, do I just have to be careful not to store more than I should?

For example, say the total available space shown is 30TB. Since I have 3 replicas, the ACTUAL available space is only 10TB. Since Proxmox/Ceph doesn't indicate this actual size anywhere, I just have to keep watching the USED space and make sure it doesn't get too close to 10TB.

That leads to another question: what would happen if I went to, say, 12TB? Would that be storing more than I should? I guess there would be data loss if any node failed in a 3-node environment?

Thanks in advance for your time.

mo_ · May 31, 2014

You can't get to 10TB. Ceph has thresholds for that. At 85% (default value iirc, on any given OSD) Ceph will put out a warning, and at 95% it will flat out refuse to write to the OSD to prevent data loss.

Both thresholds would look like this:

Code:

$> ceph health
HEALTH_ERR 1 nearfull osds, 1 full osds
osd.2 is near full at 85%
osd.3 is full at 97%

And yes, you might be putting your data at risk if you have less free capacity than 1 (failing) OSD has storage space. This is why extending a Ceph cluster is so easy and straightforward: you need to scale out early so that you never reach alarming disk utilization values.
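
As a rough sketch of the two checks described above, the following Python uses the 85% and 95% figures from this post as constants. The function names and the example OSD sizes are made up for illustration, and the capacity check is a simplification that only compares total free space with the size of the largest OSD.

```python
NEARFULL_RATIO = 0.85  # warning threshold quoted in this post (default, "iirc")
FULL_RATIO = 0.95      # threshold at which writes are refused, per this post


def osd_state(used_fraction: float) -> str:
    """Classify one OSD by its utilization (0.0 to 1.0)."""
    if used_fraction >= FULL_RATIO:
        return "full"
    if used_fraction >= NEARFULL_RATIO:
        return "nearfull"
    return "ok"


def can_absorb_osd_failure(osd_sizes_tb, used_tb: float) -> bool:
    """Rough check: is the free space at least as large as the biggest OSD?"""
    free_tb = sum(osd_sizes_tb) - used_tb
    return free_tb >= max(osd_sizes_tb)


# Hypothetical cluster: 12 OSDs of 2.5TB each (30TB raw).
osds = [2.5] * 12
print(osd_state(0.85), osd_state(0.97))
print(can_absorb_osd_failure(osds, used_tb=20.0))
print(can_absorb_osd_failure(osds, used_tb=28.0))
```

The utilization values 0.85 and 0.97 correspond to osd.2 and osd.3 in the `ceph health` output above.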

wahmed · May 31, 2014

mo_ said:
You can't get to 10TB. Ceph has thresholds for that. At 85% (default value iirc, on any given OSD) Ceph will put out a warning, and at 95% it will flat out refuse to write to the OSD to prevent data loss.

Both thresholds would look like this:

Code:

$> ceph health
HEALTH_ERR 1 nearfull osds, 1 full osds
osd.2 is near full at 85%
osd.3 is full at 97%

And yes, you might be putting your data at risk if you have less free capacity than 1 (failing) OSD has storage space. This is why extending a Ceph cluster is so easy and straightforward: you need to scale out early so that you never reach alarming disk utilization values.

Very true!

impire · May 31, 2014

Thank you very much. That makes a whole lot of sense.

impire · May 31, 2014

Hello,

Can anyone confirm for sure that the next version of Ceph no longer needs dedicated journal disks?

I am setting up from scratch. If it is true that Ceph will not need a dedicated journal disk, then I may as well set it up right now to put the journal on the OSDs.

I just don't know how much performance I would lose right now if I don't use a dedicated journal disk. I would rather use that slot for another hard drive.

Any help would be greatly appreciated. Thank you.

wahmed · May 31, 2014

impire said:
Hello,

Can anyone confirm for sure that the next version of Ceph no longer needs dedicated journal disks?

I am setting up from scratch. If it is true that Ceph will not need a dedicated journal disk, then I may as well set it up right now to put the journal on the OSDs.

I just don't know how much performance I would lose right now if I don't use a dedicated journal disk. I would rather use that slot for another hard drive.

Any help would be greatly appreciated. Thank you.

Ceph never needed a dedicated journal disk. With a dedicated journal disk, especially an SSD, performance does go up, but it is not mandatory. But... with 8 OSDs or more per node, collocating the journal on the same disk as the OSD increases performance.

If you use the latest Ceph version, Firefly, then the journal doesn't matter. But I personally haven't converted my whole cluster to Firefly yet. I am waiting to see how it works out at others' expense.

It is just too new to really know its stability.


mo_ · May 31, 2014

I was trying to get any kind of information whether firefly can be considered stable yet (tried via ceph IRC) or whether inktank has started supporting it yet, but nobody seems to be able/willing to answer that, so I am also in the dark whether firefly is safe to use yet.

felipe · Jun 2, 2014

>>If you use latest ceph version firefly then it doesn't matter about journal.

Yeah, but this feature will not be ready for production for some time :-(

impire · Jun 4, 2014

Sorry in advance for the newbie question.

I set up 3 CEPH Nodes.

I would like the VMs on each node to communicate with each other via the host's main network interface while using a different subnet.

What is the proper way to do this? Thanks in advance for your help.

impire · Jun 8, 2014

Hello,

Could someone spare a moment to help with these questions?

1) I thought the number of PGs and replicas could be changed at any time (per previous threads). But I don't see a way to do that after the pool and storage have been created. As far as I can see, the only way is to delete the pool and create a new one with a larger number of PGs or replicas. This can be FATAL as I already have VMs created using that storage and pool.

2) What would happen if I set the number of PGs higher than I actually need right now? For example, I have 3 nodes with 4 OSDs in each. With a pool of 3 replicas, I need to configure 512 PGs (12 x 100 / 3 = 400, next power of 2 = 512). Soon, I plan to add more OSDs to each node as well as additional servers.

Can I just set the PGs number to 1024? Since the PGs number cannot be changed without deleting and recreation of the pool, what's the harm in having a larger number of PGs to forecast the growth?

3) I read somewhere that we don't really need ProxMox HA since CEPH itself is the HA with no single point of failure. This is the part I still don't understand.

- I configured 3 CEPH nodes. Created the Pool. Created the storage in the data center section. Then created VMs on one of the CEPH node 1.
- CEPH node 1 went down. I cannot access the VMs while it is down. The data inside those VMs is pretty much inaccessible.
- I don't see any High Availability taking place here? CEPH node 1 died and took along those VMs with data in it. What can CEPH nodes 2 and 3 do?

Sorry for the newbie question, as I am still trying to understand the whole benefit of CEPH. From a newbie perspective, how can CEPH's distributed storage and high availability features save an admin's behind?

Thank you in advance for your time.

udo · Jun 8, 2014

impire said:
Hello,

Could someone spare a moment to help with these questions?

1) I thought the number of PGs and replicas could be changed at any time (per previous threads). But I don't see a way to do that after the pool and storage have been created. As far as I can see, the only way is to delete the pool and create a new one with a larger number of PGs or replicas. This can be FATAL as I already have VMs created using that storage and pool.

Hi,
Of course you can change the PG number "on the fly". But you can only increase it, never go lower than before, and during the "rebuild" the guest I/O on the Ceph cluster is very limited (you normally get a high load on the OSD nodes during the "rebuild").
To change the PGs, run:

Code:

ceph osd pool set poolname pg_num 512
ceph osd pool set poolname pgp_num 512

2) What would happen if I set the number of PGs higher than I actually need right now? For example, I have 3 nodes with 4 OSDs in each. With a pool of 3 replicas, I need to configure 512 PGs (12 x 100 / 3 = 400, next power of 2 = 512). Soon, I plan to add more OSDs to each node as well as additional servers.

I would extend the PGs only when it becomes necessary.
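
The rule of thumb used in the question (OSDs x 100 / replicas, rounded up to the next power of 2) can be sketched like this. The function name `suggested_pg_count` and the 24-OSD example are made up for illustration; the factor of 100 PGs per OSD is the one used in the question, not a value checked against any particular Ceph release.

```python
def suggested_pg_count(osds: int, replicas: int, pgs_per_osd: int = 100) -> int:
    """Round (osds * pgs_per_osd / replicas) up to the next power of 2."""
    if osds < 1 or replicas < 1:
        raise ValueError("osds and replicas must be positive")
    target = osds * pgs_per_osd / replicas
    power = 1
    while power < target:
        power *= 2
    return power


# 3 nodes x 4 OSDs with 3 replicas, as in the question: 400 -> 512
print(suggested_pg_count(12, 3))
# Hypothetical growth to 24 OSDs with 3 replicas: 800 -> 1024
print(suggested_pg_count(24, 3))
```

Recomputing the value after OSDs are added shows when an increase is actually due.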

Udo

impire · Jun 9, 2014

udo said:
Hi,

To change the PGs, run:

Code:

ceph osd pool set poolname pg_num 512
ceph osd pool set poolname pgp_num 512

Udo

Thank you very much. That is very helpful. So to increase the PGs, it can only be done via the command line and not the GUI?

Also, are these two commands pretty much the same?

ceph osd pool set poolname pg_num 512
ceph osd pool set poolname pgp_num 512

wahmed · Jun 9, 2014

The command format is the same, but you must run both of them to extend the PGs.

PG = Placement Group
PGP = Placement Group for Placement
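
A small sketch of that pairing: the helper below only builds the two command strings from udo's post, in order, and refuses a decrease, matching the "only increase" behaviour described in this thread. The function name and the current PG count of 256 are hypothetical; nothing here talks to a cluster.

```python
from typing import List


def pg_increase_commands(pool: str, current_pgs: int, new_pgs: int) -> List[str]:
    """Return the pg_num and pgp_num commands needed to grow a pool's PGs."""
    if new_pgs <= current_pgs:
        raise ValueError("the PG count can only be increased")
    return [
        f"ceph osd pool set {pool} pg_num {new_pgs}",
        f"ceph osd pool set {pool} pgp_num {new_pgs}",
    ]


# Hypothetical pool currently at 256 PGs, growing to 512.
for command in pg_increase_commands("poolname", 256, 512):
    print(command)
```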

impire · Jun 9, 2014

symmcom said:
The command format is the same, but you must run both of them to extend the PGs.

PG = Placement Group
PGP = Placement Group for Placement

Thank you very much! This is very helpful.

impire · Jun 9, 2014

Hello,

Apologies in advance for all the newbie questions.

1) When I try to back up a VM via Nodes -> Backup -> Backup now, the storage option is empty. It doesn't allow me to select the pool/storage I am using for the VMs. Taking snapshots of the VMs works fine. Any advice?

2) How do I gracefully remove a CEPH node?

3) I want to add a server to CEPH as a node, but it has only 1 hard drive for boot and no OSDs. Can I just add it and use its memory and CPU resources? In other words, I want it to be part of CEPH, but not necessarily for storage.

4) When I shut down a CEPH node for maintenance, do I need to do anything special when bringing it back up? Will CEPH automatically recognize it and do its thing?

5) I've read that the reason for creating multiple pools is that different users have different workload requirements. But since the pools access the same nodes and the same OSDs in the CEPH cluster, why does it matter?

For example, pools A and B are created. If the users of pool A are using resources heavily, wouldn't that also affect performance for the users of pool B?

sdutremble · Jun 9, 2014

sdutremble said:
I thought it was also necessary to have at least one MDS? Could you modify your steps to have this added if I am correct? Thanks, Serge

Any idea on how to manually create a MDS on Proxmox under /etc/pve?

impire · Jun 10, 2014

I am confused. There are two documents Proxmox has put out for installing CEPH:

1) http://pve.proxmox.com/wiki/Ceph_Server

OR

2) http://pve.proxmox.com/wiki/Storage:_Ceph

and then there is this one:

3) http://ceph.com/docs/master/install/

Which is the best one to use?

tom · Jun 10, 2014

If you want to install Ceph Server, it's http://pve.proxmox.com/wiki/Ceph_Server
