Proxmox Ceph 10G setup

DerDanilo

Renowned Member
Jan 21, 2017
Is the following setup possible with Ceph Jewel?

4 Nodes in Proxmox Cluster/Ceph Cluster:

2 Storage nodes, running some testing VMs as well:
--> 2 Nodes (128GB RAM, Octa Core) with 13 OSDs, each 6TB, MONs on the same disks

2 dedicated VM nodes
--> 2 Nodes (256GB RAM, Octa Core) with no OSDs, but MONs running on local SSD (GPT) storage

This leaves us with 4 MONs, 2 of them on SSD, and 26 OSDs split across 2 storage nodes.

- All nodes have additional 10Gbit network cards dedicated to Ceph and cluster communication (via VLANs).
- Public communication runs via 1Gbit network cards.

If I understand Ceph data redundancy (replica count) right, it would work to set it to 2 instead of the default 3: if one of the storage nodes goes down, Ceph should still be running, even if the load is higher, right?

How many OSDs can be missing/damaged on each node before the cluster crashes if replica is set to 2?
How many OSDs can be missing/damaged on each node before the cluster crashes if replica is set to 3?

Thanks for your help!
 
Hi,

1. 13 big OSDs on one node is IMHO too much.
2. You need as many OSD nodes as you can get - every node (if configured right and powerful enough) will speed up your IO.
3. With only two nodes you will run into trouble (replica).
4. Use an odd number of MONs - three is enough for a normal-sized cluster (like 12 OSD nodes).
5. MONs don't need much power - but they need some (fast) storage to write their logs.
6. DC-grade SSDs for journaling speed up Ceph (writes).

Network: "10GBit for ceph and cluster communication" - what mean cluster communication? pve-cluster or ceph cluster?
Normaly you shouldn't mix pve-cluster with storage-communication.
For good ceph-performance you should use different networks for ceph-cluster and ceph-private.
ceph-cluster: this network must see all clients - pve qemu, and all Mons
ceph-private: this network connect all OSD-nodes - this network is used for replication and recovery (failed or added OSDs).
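Roughly, in ceph.conf these two correspond to what Ceph calls the public and the cluster network (the subnets below are only placeholders - use your own VLANs):

    # /etc/ceph/ceph.conf - sketch only
    [global]
        public network  = 10.10.10.0/24    # "ceph-cluster": MONs + clients (PVE/QEMU)
        cluster network = 10.10.20.0/24    # "ceph-private": OSD replication and recovery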

About replica and "cluster crash": Ceph is "self-healing", meaning if not enough replicas are online, IO will stop - once you fix the issue IO is allowed again, but your Ceph clients may have run into a timeout by then and need to be restarted. So the server will not crash, but your VMs won't work - which looks like the same thing but isn't.

Ceph spreads the data over the OSDs based on the CRUSH algorithm. There are many placement levels you can configure (like datacenter, room, rack, node).
Normally the replicas are placed on different nodes.
And all data is split into 4MB chunks. This means that with a replica count of two you have data loss if one drive fails on each of two nodes (if the Ceph cluster isn't able to rebuild the missing replicas from the first failed OSD before the second one dies).
This is the reason why replica 3 is strongly recommended!
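For reference, the replica count is a per-pool setting - a minimal sketch, assuming a pool named rbd:

    ceph osd pool set rbd size 3       # keep 3 copies of every object
    ceph osd pool set rbd min_size 2   # block IO when fewer than 2 copies are online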

Udo
 
Hey Udo!

Thanks again for your reply and taking the time to answer in detail.

It might be easier to give you an idea of the available hardware if I tell you which servers will be used.
We build our clusters at Hetzner datacenters with new machines.

- All OSD nodes will have one 10G NIC for the Ceph private network and a second 10G NIC for the Ceph cluster network incl. the Proxmox cluster network (via VLANs).
- All VM nodes will have one 10G NIC for the Ceph cluster network incl. the Proxmox cluster network (via VLANs).
- We have a 10G switch for private networks without public uplink.

For the VM nodes we want to take PX91-SSD or PX121-SSD.
https://www.hetzner.de/de/hosting/produktmatrix/rootserver-produktmatrix-px

For the CEPH nodes we want to take the SX291 or SX131.
https://www.hetzner.de/de/hosting/produktmatrix/rootserver-produktmatrix-sx

This is a matter of calculation, since it's more expensive to rent several smaller servers than fewer servers with more hardware resources.

Right now I'm considering redoing the calculations with the available DELL machines.
https://www.hetzner.de/de/hosting/produktmatrix/dell

Thanks!
 
>>How many OSDs can be missing/damaged on each node before the cluster crashes if replica is set to 2?

you can lose 1 node with all the OSDs on it, or at most 1 OSD if the failures are spread across different nodes

>>How many OSDs can be missing/damaged on each node before the cluster crashes if replica is set to 3?

you can lose 2 nodes with all their OSDs, or at most 2 OSDs if the failures are spread across different nodes
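This assumes the default CRUSH placement with the host as failure domain, so each replica lands on a different node - roughly what the default rule looks like in a decompiled crushmap (Jewel-era default, shown here only for illustration):

    rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host   # failure domain = host: one copy per node
        step emit
    }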
 
@spirit:
"you can loose 2 nodes will all osds, or 2 osds max on differents nodes"
Meaning: 2 entire nodes, running only one node and 2 OSDs on each node at the same time?

#######################

Thanks for your suggestions! We will go with 3+ nodes.

- We don't need HA in Proxmox, just want to have one central storage system with different volumes (at least 2).
- Backups will be stored on another system.

I'm not entirely new to Ceph, but I have never had to build it on my own, especially with these somewhat fixed systems at the DC; I just want to make sure it runs nicely and stably once it's ready.

Here are the alternative setups - please tell me what you think.
--> Looking at the budget there is no real difference, though I think that 1# might be the better choice as it adds 2 more nodes and DC SSD/NVMe MONs.

1#

HDD MONs on the OSDs (data nodes) and SSD MONs (VM nodes) mixed.
Also: what happens if one MON crashes? What happens to the data?

3x Data-Nodes:
OSDs + MONs directly on the OSD disks (10x 6TB SATA HDDs, smaller not possible).
-- Hexa-Core, 64GB ECC RAM
-- 2x 10 Gbit cards (NIC 1: ceph-client + Proxmox cluster, NIC 2: ceph-private)
-- 1x 1 Gbit (public uplink)

2x VM-Nodes:
MONs (2x DC SSDs 240/480GB or 512GB NVMe cards)
-- Hexa-Core, 128/256GB ECC RAM
-- 1x 10 Gbit card (ceph-client + Proxmox cluster)
-- 1x 1 Gbit (public uplink)

~ €1220 / month


2#

HDD MONs only, on OSD disks. (Data-Nodes)

4x Data+VM-Nodes:

OSDs + MONs directly on the OSD disks (10x 6TB SATA HDDs, smaller not possible).
-- Hexa-Core, 64GB ECC RAM
-- 2x 10 Gbit cards (NIC 1: ceph-client + Proxmox cluster, NIC 2: ceph-private)
-- 1x 1 Gbit (public uplink)

~ €1230 / month



Thanks guys!

Danilo
 
1/

I would suggest you always keep your VM and OSD nodes separate; this is also recommended as best practice by Ceph themselves, and it allows you to use KRBD without any lockups when OSDs and VMs would otherwise be running on the same kernel.

2/

If a MON dies, the remaining MONs will continue to operate as long as they can form a QUORUM (enough of them are left).
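Roughly, with N MONs a majority of floor(N/2)+1 must be up (2 of 3, 3 of 5), which is why an odd number is recommended. You can check it, for example, with:

    ceph mon stat                            # one-line summary, e.g. "3 mons ... quorum 0,1,2"
    ceph quorum_status --format json-pretty  # quorum members and the current leader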

 
3 nodes are being considered for Ceph, but this whole setup is a bit expensive when using them only for Ceph.
Non-critical VMs for testing will run on the Ceph nodes themselves.

Is there any known problem when Ceph and VMs run on the same nodes?
I mean, the nodes have plenty of hardware.
 
As long as you don't use KRBD and just use the built-in KVM RBD driver (librbd), that's fine. And watch out for RAM OOM during OSD rebuilds.
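For example, an RBD storage entry in /etc/pve/storage.cfg then stays on the userspace librbd path (storage name, pool, and monitor addresses below are only placeholders):

    rbd: ceph-vm
        monhost 10.10.10.1 10.10.10.2 10.10.10.3
        pool rbd
        content images
        username admin
        krbd 0          # use the QEMU/librbd driver, not the kernel client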
 
Hi,
I would suggest you always keep your VM and OSD nodes separate [...] also allows you to use KRBD without any lockups when OSDs and VMs run on the same kernel.
Very interesting! Do you have any documentation about this?
KRBD needs much less CPU power - isn't that relevant?

Greetings

Markus
 

It is documented in many places online, including: http://ceph-users.ceph.narkive.com/px2L2fHc/is-it-still-unsafe-to-map-a-rbd-device-on-an-osd-server

I have yet to see a clear statement that a particular kernel version has the issue 100% fixed, and it is not something I wish to test myself.

KRBD vs QEMU/librbd each have their own advantages and disadvantages:

https://forum.proxmox.com/threads/krbd-on-made-my-vm-fly-like-a-rocket-why.25608/
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-October/005503.html
 
To be honest, we are not really sure that we want to use PVE's Ceph, as we will use it mainly as block storage to mount directly in VMs (Hetzner vServers) as a private backup cloud or for apps like Nextcloud/Seafile/Syncwerk.
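As a sketch of what that would look like from a client VM (pool and image names are made up, and the image is created with only the layering feature so the VM's kernel client can map it):

    rbd create backup/cloud01 --size 1048576 --image-feature layering   # 1 TiB image
    rbd map backup/cloud01          # inside the client VM, e.g. /dev/rbd0
    mkfs.xfs /dev/rbd0
    mount /dev/rbd0 /mnt/backup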
 
