Ceph - Multiple OSDs and Pools

ProxCH

Hello all,

I browsed the forum but I'm not sure I found my answer. Well, I'm quite new to Ceph and have a question:

I have 2 servers with 2 free SSD drives in each server. What I would like to achieve is the following:

On server 1 :
Free drive 1 is replicated to free drive 2 of server 2

On server 2 :
Free drive 1 is replicated to free drive 2 of server 1

All four free drives are 500 GB SSDs.

Is that feasible?

Thanks!


BTW, here is my current CRUSH map:

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class ssd
device 1 osd.1 class ssd
device 2 osd.2 class ssd
device 3 osd.3 class ssd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host host1 {
    id -3   # do not change unnecessarily
    id -4 class ssd   # do not change unnecessarily
    # weight 0.909
    alg straw2
    hash 0  # rjenkins1
    item osd.0 weight 0.455
    item osd.2 weight 0.455
}
host host2 {
    id -5   # do not change unnecessarily
    id -6 class ssd   # do not change unnecessarily
    # weight 0.909
    alg straw2
    hash 0  # rjenkins1
    item osd.1 weight 0.455
    item osd.3 weight 0.455
}
root default {
    id -1   # do not change unnecessarily
    id -2 class ssd   # do not change unnecessarily
    # weight 1.819
    alg straw2
    hash 0  # rjenkins1
    item host1 weight 0.909
    item host2 weight 0.909
}

# rules
rule replicated_rule {
    id 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}

# end crush map
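(For reference, the usual round-trip for pulling, editing and re-injecting this map with the standard tooling looks like this; file names are just examples:)

ceph osd getcrushmap -o crush.bin     # grab the compiled map from the cluster
crushtool -d crush.bin -o crush.txt   # decompile to the text form shown above
# edit crush.txt, then:
crushtool -c crush.txt -o crush.new   # recompile
ceph osd setcrushmap -i crush.new     # inject the edited map back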
 
On server 1 :
Free drive 1 is replicated to free drive 2 of server 2

On server 2 :
Free drive 1 is replicated to free drive 2 of server 1
This is not how Ceph works.

step chooseleaf firstn 0 type host
With this, you tell Ceph to store one replica per host, not per OSD. So if you have a pool with replica 2 (which is not recommended), then you have the same data on both nodes. If one host fails, you can access the data on the other host.

But you need more nodes and more Ceph MONs; you always need a quorum, and that's not possible with only two nodes.
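For illustration, a size-2 pool on top of that rule would look roughly like this (pool name and PG count are just examples; min_size 1 is what lets a 2-node setup keep running with one host down, but it is risky):

ceph osd pool create vm-pool 128 128 replicated replicated_rule   # example name and PG count
ceph osd pool set vm-pool size 2       # two copies, one per host (chooseleaf ... type host)
ceph osd pool set vm-pool min_size 1   # allows I/O with only one copy left - use with care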
 
Hello,

Thanks for your answer.

To quickly summarize: this is a small lab for my home. I have 2 servers where I installed Proxmox, and one Raspberry Pi used with corosync for the quorum. My goal is to replicate both data drives between the 2 servers. If Ceph doesn't allow me to do that, is there another way to achieve it?

I confirm that I want the local data of each server replicated to the other, to be able to boot the VMs in case one host crashes.

Many thanks for your help !
Cheers
 
My goal is to replicate both data drives between the 2 servers. If Ceph doesn't allow me to do that, is there another way to achieve it?

I confirm that I want the local data of each server replicated to the other, to be able to boot the VMs in case one host crashes.
I didn't say Ceph can't do it; of course Ceph can do it. Ceph is an object store which spreads data across many hosts and disks and thereby achieves high availability and reliability. But you need more than two hosts, and you need a quorum for your Ceph MONs (corosync is not the only thing that needs one).

As I already wrote, if you add a pool with replica 2, then you have all the data available on both nodes, but you need at least 3 Ceph MONs - otherwise your Ceph cluster will go read-only.
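A quick way to check how many MONs you have and whether they still form a quorum (plain Ceph CLI, nothing Proxmox-specific):

ceph mon stat                             # number of MONs and who is in quorum
ceph quorum_status --format json-pretty   # detailed quorum view
ceph -s                                   # overall cluster health summary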
 
I didn't say Ceph can't do it; of course Ceph can do it. Ceph is an object store which spreads data across many hosts and disks and thereby achieves high availability and reliability. But you need more than two hosts, and you need a quorum for your Ceph MONs (corosync is not the only thing that needs one).

As I already wrote, if you add a pool with replica 2, then you have all the data available on both nodes, but you need at least 3 Ceph MONs - otherwise your Ceph cluster will go read-only.

Hi Sb-jw,

Thanks for those clarifications. Not everybody, me included, can have three physical servers right from the start. I think we can already start with 2 of them, and I know several storage solutions (NAS/SAN) that work perfectly with only 2 nodes and are redundant.

I don't absolutely want Ceph; any solution with Proxmox that allows me to replicate my virtual machines between the 2 nodes and start everything on just one in case the other node fails or goes into maintenance would do. This implies storage replication and VM live migration.

Do you have any advice for me? All I have as a quorum device today is a simple Raspberry Pi with corosync on it.

Thanks again
 
Just a word of advice: the swap and the RRD writes will kill the SD card on that RPi in the near future...

If you absolutely must run Proxmox/Ceph on only 2 hosts, then consider having a VM or two as a witness/3rd node.
The Ceph conf looks OK!

Good luck
 
Just a word of advice: the swap and the RRD writes will kill the SD card on that RPi in the near future...

If you absolutely must run Proxmox/Ceph on only 2 hosts, then consider having a VM or two as a witness/3rd node.
The Ceph conf looks OK!

Good luck

Hello,

Thanks for your answer. I understand that the RPi is not the best solution, so I am open to buying a third node, but a smaller one compared to the other 2. This third node will only be used for voting, so here are my questions:

- What node do you recommend? Barebone, Shuttle, ...?
- Should I also put the same storage in this third node (2x500GB SSD) in order to also add Ceph OSDs?
- Will it be possible to avoid VMs going to this node during a live migration in case one of the two others crashes?

Thanks again for the help / advice, all.
 
Answering my own question about Ceph: I will not replicate storage on the third node, as the first two nodes are connected to each other with 10Gb and I won't have that on the third node... So I need a fully HA live migration solution with 3 Proxmox nodes but only 2 Ceph nodes... I hope that's possible?
 
Answers inline
Hello,

Thanks for your answer. I understand that the RPi is not the best solution, so I am open to buying a third node, but a smaller one compared to the other 2. This third node will only be used for voting, so here are my questions:

- What node do you recommend? Barebone, Shuttle, ...?
I host my 20 TB with various drives in replica x3 on HP Z400s (24 GB RAM and an SSD for the OS) but need more RAM, even though it's working fine as it is. These are old business machines that go for almost NOTHING on eBay. If you want to go the Intel NUC way, remember there is no way to upgrade it with 10Gb NICs/GPUs/or even more HDDs (unless you use the IcyBox USB 3 UASP caddies?)

- Should I also put the same storage in this third node (2x500GB SSD) in order to also add Ceph OSDs?
Preferably yes. This gives you 2 good replicas that can also auto-heal in case of bit rot or something else.
This will also make Ceph about 33% faster! :p (see the sketch at the end of this post)

- Will it be possible to avoid VMs going to this node during a live migration in case one of the two others crashes?
You can configure an HA cluster in any way you like, even with only one node!
Thanks again for the help / advice, all.


PS: HA transfer speeds were best when I switched to SSDs for the Proxmox OS; the 10Gb network upgrade did not even come close! DS.
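If you do put disks in the third box, the rough sequence is the same as on the other two nodes (a sketch; the device name and the example pool name vm-pool are placeholders, and the exact pveceph subcommand depends on your PVE version):

pveceph install                    # get the Ceph packages onto the new node
pveceph osd create /dev/sdb        # newer PVE; older releases use 'pveceph createosd /dev/sdb'
ceph osd tree                      # check the new host shows up under root default
ceph osd pool set vm-pool size 3   # only makes sense once all three hosts carry OSDs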
 
Answering my own question about Ceph: I will not replicate storage on the third node, as the first two nodes are connected to each other with 10Gb and I won't have that on the third node... So I need a fully HA live migration solution with 3 Proxmox nodes but only 2 Ceph nodes... I hope that's possible?
It is possible, yes; not recommended at all due to the above... but possible. Just remember to stick an SSD in that third node though, else the swappiness monster will eat that spinning rust :D
 
It is possible, yes; not recommended at all due to the above... but possible. Just remember to stick an SSD in that third node though, else the swappiness monster will eat that spinning rust :D

Many thanks.

So, to summarize, I will have 3 servers:
2 with plenty of resources to host VMs and disks (2x500GB SSD for data in each machine)
1 Shuttle-like server with very few resources compared to the other 2, and 1 SSD drive for the OS only

With that, even if I clearly understand it's not recommended (but it's for HOME only), I can have a 3-node cluster with HA and live migration in case one node crashes, correct?

Finally, about storage: I leave the Ceph configuration as it is and it will be replicated between the 2 VM host nodes.

Does that sound correct to you?

Cheers
 
It is correct, it will work; no guarantees when it comes to the Ceph replica x2 though! You have been warned!

Don't forget to put small OS SSDs in the 2 main hosts as well!

Finally, about storage: I leave the Ceph configuration as it is and it will be replicated between the 2 VM host nodes.
Yes, this gives you about 50% of usable disk space.
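Back-of-the-envelope: 2 hosts × 2 × 500 GB = 2 TB raw, and with size=2 every object is written twice, so roughly 1 TB usable - less in practice, since Ceph starts warning once OSDs pass the near-full ratio (85% by default).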
 
It is correct, it will work; no guarantees when it comes to the Ceph replica x2 though! You have been warned!

Don't forget to put small OS SSDs in the 2 main hosts as well!


Yes, this gives you about 50% of usable disk space.

Perfect, thanks!

I did not mention it, but of course I have 2x240 GB SSDs for the OS only.

Last thing: I have 6 NICs on my first 2 hosts. Is the fact that the third node will have only one an issue for the cluster HA / live migration, even if no VM should run on it?

Thank you
 
Keep the same IP range, don't go wild with separate cluster and public networks, and you should be fine.
 
Keep the same IP range, don't go wild with separate cluster and public networks, and you should be fine.

Yep, I will put this third "node" in the same cluster network as the other 2. I will buy a NUC J3455 with 8 GB of RAM and a 32 GB local SSD for the Proxmox install.

I have to do some research to learn how to manage live migration rules to prevent machines from trying to go to this NUC.

Many thanks again for all your clarifications!
 
It's simple: don't put the third node in the HA group! :)
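On the CLI that boils down to something like this (group name, node names and VM ID are made-up examples; the same can be done in the GUI under Datacenter > HA > Groups):

ha-manager groupadd ceph-only -nodes "pve1,pve2" -restricted 1   # HA will only place resources on these nodes
ha-manager add vm:100 -group ceph-only                           # pin an HA-managed VM to that group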

Hello AlexLup & others,

I have set up my cluster and it works like a charm for HA. Except for one thing: Ceph is lost when 1 of the two nodes goes down :)

My guess is that on the NUC I also have to configure Ceph as a third monitor node... If that's indeed the case, can I achieve that without access to the storage network, as it is a 10Gb network linked by a direct cable between the first two hosts of the cluster? If not, I guess I have to find a way to route this network...

Thanks!
 
I hate to be that guy, but I told you so :D

You never mentioned the NUC, so I am guessing this is the third node, yes? If yes, then a monitor will be needed on this one, which basically means that when one of the nodes (votes) goes down, the other 2 monitors can still vote for a new monitor to be the "leader".

The MON itself sends its traffic over the PUBLIC network, so you don't need access to the SANet (storage network) on the NUC node.

PS: Don't forget to do the corosync.conf totem magic to get faster HA failover times! Check the Proxmox wiki! DS.
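If you end up adding a MON on the NUC, it would look roughly like this (the subnets are placeholders; only public_network matters for MON and client traffic, while cluster_network is only used between the OSDs for replication):

pveceph install      # Ceph packages on the NUC
pveceph mon create   # newer PVE; older releases use 'pveceph createmon'

# /etc/pve/ceph.conf (excerpt, example subnets):
#   public_network  = 192.168.1.0/24   # MONs and clients - the NUC only needs to reach this one
#   cluster_network = 10.10.10.0/24    # OSD replication over the 10G direct link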
 
I hate to be that guy, but I told you so :D

You never mentioned the NUC, so I am guessing this is the third node, yes? If yes, then a monitor will be needed on this one, which basically means that when one of the nodes (votes) goes down, the other 2 monitors can still vote for a new monitor to be the "leader".

The MON itself sends its traffic over the PUBLIC network, so you don't need access to the SANet (storage network) on the NUC node.

PS: Don't forget to do the corosync.conf totem magic to get faster HA failover times! Check the Proxmox wiki! DS.

Hi AlexLup,

Of course, I mentioned it 3 messages above :p No worries... In fact my cluster does have 3 nodes now, but the thing is that when one of the two nodes running Ceph goes down, the other node loses access to the storage pool and is only able to see it and restart the VMs after the other host is back up.

So my question is: is there something to do on the Ceph side (like adding another monitor, even if the third node is not on the same network where Ceph has been initialized)?

Thanks for the totem magic tip, I will check that now.

Cheers
 
