Proxmox Ceph 10G setup

DerDanilo

Renowned Member
Jan 21, 2017
Is the following setup possible with Ceph Jewel?

4 Nodes in Proxmox Cluster/Ceph Cluster:

2 Storage nodes, running some testing VMs as well:
--> 2 Nodes (128GB RAM, Octa Core) with 13 OSDs, each 6TB, MONs on the same disks

2 dedicated VM nodes
--> 2 Nodes (256GB RAM, Octa Core) with no OSDs, but MONs running on local SSD (GPT) storage

This leaves us with 4 MONs, 2 of them on SSD, and 26 OSDs split across 2 storage nodes.

- All nodes have additional 10Gbit network cards dedicated to Ceph and cluster communication (via VLANs).
- Public communication runs via 1Gbit network cards.

If I understand Ceph data redundancy (replica count) right, it would work to set it to 2 instead of the default 3: if one of the storage nodes goes down, Ceph should still be running, even if the load is higher, right?

How many OSDs can be missing/damaged on each node before the cluster crashes if replica is set to 2?
How many OSDs can be missing/damaged on each node before the cluster crashes if replica is set to 3?

Thanks for your help!
 
Hi,

1. 13 big OSDs on one node is IMHO too much.
2. You need as many OSD nodes as you can get - every node (if configured right and powerful enough) will speed up your IO.
3. With only two nodes you will run into trouble (replica).
4. Use an odd number of MONs - three is enough for a normal-sized cluster (like 12 OSD nodes).
5. MONs don't need much power - but they need some (fast) storage to write their logs.
6. DC-grade SSDs for journaling speed up Ceph (writes).

Network: "10GBit for ceph and cluster communication" - what mean cluster communication? pve-cluster or ceph cluster?
Normaly you shouldn't mix pve-cluster with storage-communication.
For good ceph-performance you should use different networks for ceph-cluster and ceph-private.
ceph-cluster: this network must see all clients - pve qemu, and all Mons
ceph-private: this network connect all OSD-nodes - this network is used for replication and recovery (failed or added OSDs).
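Roughly, in ceph.conf these two correspond to what Ceph calls the public and the cluster network (the subnets below are only placeholders - use your own VLANs):

    # /etc/ceph/ceph.conf - sketch only
    [global]
        public network  = 10.10.10.0/24    # "ceph-cluster": MONs + clients (PVE/QEMU)
        cluster network = 10.10.20.0/24    # "ceph-private": OSD replication and recovery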

About replica and "cluster crash": Ceph is "self-healing", meaning if not enough replicas are online, IO will stop - once you fix the issue IO is allowed again, but your Ceph clients may have run into a timeout by then and need to be restarted. So the server will not crash, but your VMs won't work - which looks like the same thing but isn't.

Ceph spreads the data over the OSDs based on the CRUSH algorithm. There are many placement levels you can configure (like datacenter, room, rack, node).
Normally the replicas are placed on different nodes.
And all data is split into 4MB chunks. This means that with a replica count of two you have data loss if one drive fails on each of two nodes (if the Ceph cluster isn't able to rebuild the missing replicas from the first failed OSD before the second one dies).
This is the reason why replica 3 is strongly recommended!
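For reference, the replica count is a per-pool setting - a minimal sketch, assuming a pool named rbd:

    ceph osd pool set rbd size 3       # keep 3 copies of every object
    ceph osd pool set rbd min_size 2   # block IO when fewer than 2 copies are online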

Udo
 
Hey Udo!

Thanks again for your reply and taking the time to answer in detail.

It might be easier to give you an idea of the available hardware if I tell you which servers will be used.
We build our clusters at Hetzner datacenters with new machines.

- All OSD nodes will have one 10G NIC for the Ceph private network and a second 10G NIC for the Ceph cluster network incl. the Proxmox cluster network (via VLANs).
- All VM nodes will have one 10G NIC for the Ceph cluster network incl. the Proxmox cluster network (via VLANs).
- We have a 10G switch for private networks without public uplink.

For the VM nodes we want to take PX91-SSD or PX121-SSD.
https://www.hetzner.de/de/hosting/produktmatrix/rootserver-produktmatrix-px

For the CEPH nodes we want to take the SX291 or SX131.
https://www.hetzner.de/de/hosting/produktmatrix/rootserver-produktmatrix-sx

This is a matter of calculation, since it's more expensive to rent several smaller servers than fewer servers with more hardware resources.

Right now I'm considering redoing the calculations with the available DELL machines.
https://www.hetzner.de/de/hosting/produktmatrix/dell

Thanks!
 
>>How many OSDs can be missing/damaged on each node before the cluster crashes if replica is set to 2?

you can lose 1 node with all the OSDs on it, or at most 1 OSD if the failures are spread across different nodes

>>How many OSDs can be missing/damaged on each node before the cluster crashes if replica is set to 3?

you can lose 2 nodes with all their OSDs, or at most 2 OSDs if the failures are spread across different nodes
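This assumes the default CRUSH placement with the host as failure domain, so each replica lands on a different node - roughly what the default rule looks like in a decompiled crushmap (Jewel-era default, shown here only for illustration):

    rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host   # failure domain = host: one copy per node
        step emit
    }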
 
@spirit:
"you can loose 2 nodes will all osds, or 2 osds max on differents nodes"
Meaning: 2 entire nodes, running only one node and 2 OSDs on each node at the same time?

#######################

Thanks for your suggestions! We will go with 3+ nodes.

- We don't need HA in Proxmox, just want to have one central storage system with different volumes (at least 2).
- Backups will be stored on another system.

I'm not entirely new to Ceph, but I have never had to build it on my own, especially with these somewhat fixed systems at the DC; I just want to make sure it runs nicely and stably once it's ready.

Here are the alternative setups - please tell me what you think.
--> Looking at the budget there is no real difference, though I think that 1# might be the better choice as it adds 2 more nodes and DC SSD/NVMe MONs.

1#

HDD MONs on the OSDs (data nodes) and SSD MONs (VM nodes) mixed.
Also: what happens if one MON crashes? What happens to the data?

3x Data-Nodes:
OSDs + MONs directly on the OSD disks (10x 6TB SATA HDDs, smaller not possible).
-- Hexa-Core, 64GB ECC RAM
-- 2x 10 Gbit cards (NIC 1: ceph-client + Proxmox cluster, NIC 2: ceph-private)
-- 1x 1 Gbit (public uplink)

2x VM-Nodes:
MONs (2x DC SSDs 240/480GB or 512GB NVMe cards)
-- Hexa-Core, 128/256GB ECC RAM
-- 1x 10 Gbit card (ceph-client + Proxmox cluster)
-- 1x 1 Gbit (public uplink)

~ €1220 / month


2#

HDD MONs only, on OSD disks. (Data-Nodes)

4x Data+VM-Nodes:

OSDs + MONs directly on the OSD disks (10x 6TB SATA HDDs, smaller not possible).
-- Hexa-Core, 64GB ECC RAM
-- 2x 10 Gbit cards (NIC 1: ceph-client + Proxmox cluster, NIC 2: ceph-private)
-- 1x 1 Gbit (public uplink)

~ €1230 / month



Thanks guys!

Danilo
 
1/

I would suggest you always keep your VM and OSD nodes separate; this is also recommended as best practice by Ceph themselves, and it allows you to use KRBD without any lockups when OSDs and VMs would otherwise be running on the same kernel.

2/

If a MON dies, the remaining MONs will continue to operate as long as they can form a QUORUM (enough of them are left).
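Roughly, with N MONs a majority of floor(N/2)+1 must be up (2 of 3, 3 of 5), which is why an odd number is recommended. You can check it, for example, with:

    ceph mon stat                            # one-line summary, e.g. "3 mons ... quorum 0,1,2"
    ceph quorum_status --format json-pretty  # quorum members and the current leader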

 
3 nodes are being considered for Ceph, but this whole setup is a bit expensive when using them only for Ceph.
Non-critical VMs for testing will run on the Ceph nodes themselves.

Is there any known problem when Ceph and VMs run on the same nodes?
I mean, the nodes have plenty of hardware.
 
As long as you don't use KRBD and just use the built-in KVM RBD driver (librbd), that's fine. And watch out for RAM OOM during OSD rebuilds.
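For example, an RBD storage entry in /etc/pve/storage.cfg then stays on the userspace librbd path (storage name, pool, and monitor addresses below are only placeholders):

    rbd: ceph-vm
        monhost 10.10.10.1 10.10.10.2 10.10.10.3
        pool rbd
        content images
        username admin
        krbd 0          # use the QEMU/librbd driver, not the kernel client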
 
Hi,
I would suggest you always keep your VM and OSD nodes separate [...] also allows you to use KRBD without any lockups when OSDs and VMs run on the same kernel.
Very interesting! Do you have any documentation about this?
KRBD needs much less CPU power - isn't that relevant?

Greetings

Markus
 

It is documented in many places online, including: http://ceph-users.ceph.narkive.com/px2L2fHc/is-it-still-unsafe-to-map-a-rbd-device-on-an-osd-server

I have yet to see a clear statement that a particular kernel version has the issue 100% fixed, and it is not something I wish to test myself.

KRBD vs QEMU/librbd each have their own advantages and disadvantages:

https://forum.proxmox.com/threads/krbd-on-made-my-vm-fly-like-a-rocket-why.25608/
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-October/005503.html
 
To be honest, we are not really sure that we want to use PVE's Ceph, as we will use it mainly as block storage to mount directly in VMs (Hetzner vServers) as a private backup cloud or for apps like Nextcloud/Seafile/Syncwerk.
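As a sketch of what that would look like from a client VM (pool and image names are made up, and the image is created with only the layering feature so the VM's kernel client can map it):

    rbd create backup/cloud01 --size 1048576 --image-feature layering   # 1 TiB image
    rbd map backup/cloud01          # inside the client VM, e.g. /dev/rbd0
    mkfs.xfs /dev/rbd0
    mount /dev/rbd0 /mnt/backup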
 
