New 3-node cluster suggestion

Stefano Giunchi

I'm about to build a new, small, general-purpose cluster.
The selected hardware is this:
SuperMicro TwinPro (2029TP-HC0R), with 3 nodes, each with:
1x CPU Xeon Scalable (P4X-SKL3106-SR3GL)
64GB RAM DDR4-2666 (MEM-DR432L-CL01-ER26)
4-port 10Gb (AOC-MTG-I4TM-O SIOM) for Ceph traffic (mesh)
4-port 1Gb (AOC-UG-I4, x4)
6x Intel D3-S4510 480GB (SSDSC2KB480G8) SATA/6Gbps
1x SFT-OOB-LIC (IPMI advanced management features license)

The cluster will be used for a few virtual machines and some data; the main goal is high availability, not extreme performance.
Do you see anything bad in this configuration?

Thanks
 
Without knowing the parameters of your "few virtual machines and data" it is difficult to give configuration recommendations. But in my opinion, you should increase the RAM to 96 or 128GB - it will not go to waste.
 
It will be 6 VMs to start, 4 Windows and 2 Linux, with no more than 40GB used by the VMs. Your recommendation to increase the RAM is valid, but that's easily upgradeable. My main concern is the SATA SSDs - has anyone used these for Ceph?
 
It will be 6 VMs to start, 4 Windows and 2 Linux, with no more than 40GB used by the VMs.
Look at this topic - https://forum.proxmox.com/threads/memory-usage-on-empty-node-with-ceph.50760/

You have 6 disks, so most likely you will run 5-6 OSDs on each node - roughly 20-24GB of RAM will be used by Ceph alone.
I can add that I have a Proxmox 5.3 test cluster with 3 virtual machines on each node (2 Linux and 1 Windows, 2GB allocated each, about 1.3GB actually used) and 6 Ceph OSDs - the total memory load on each node is about 32GB.
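
For reference, the roughly 4GB-per-OSD figure comes from BlueStore's osd_memory_target, which defaults to 4GiB on recent Ceph releases. A minimal sketch of how you could check it and lower it on a RAM-constrained node (the osd.0 ID and the 3GiB value are just placeholders):

Code:
# ask a running OSD for its current memory target (run on that OSD's node)
ceph daemon osd.0 config get osd_memory_target

# to lower it, add this under [osd] in /etc/pve/ceph.conf and restart the OSDs:
#   osd_memory_target = 3221225472   # ~3GiB per OSD
systemctl restart ceph-osd@0

Keep in mind it is a best-effort target rather than a hard limit, so the suggestion above to go to 96 or 128GB still stands.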
 
Hi Stefano,

I am now buying the same configuration (Supermicro Twin 2029TP-HC0R) with the intention of running a Proxmox+Ceph cluster. Have you already installed it on the new hardware, and does everything work fine? Did you ask Supermicro to set the LSI 3008 to IT Mode?

Thanks
 
Hi,
I still haven't bought it; I just got the OK from the customer and I think we'll have it in a couple of weeks.

I haven't read anything about needing the LSI flashed to IT mode/passthrough, as it's not a real RAID card. Do you have a link about that?
Please let's both keep this thread updated; I think it will be useful.

What's your hardware? Mine will be this:

SuperMicro TwinPro (2029TP-HC0R), with 3 nodes, each with:
1x CPU Xeon Scalable (P4X-SKL3106-SR3GL)
64GB RAM DDR4-2666 (MEM-DR432L-CL01-ER26)
4-port 10Gb (AOC-MTG-I4TM-O SIOM) for Ceph traffic (mesh)
4-port 1Gb (AOC-UG-I4, x4)
2x Intel D3-S4510 240GB (SSDSC2KB240G8) SATA/6Gbps (RAID1, system)
4x Intel D3-S4510 960GB (SSDSC2KB960G8) SATA/6Gbps (JBOD, Ceph)
1x SFT-OOB-LIC (IPMI advanced management features license)

Stefano
 
Stefano, I bought mine yesterday :)

2x SuperMicro TwinPro (2029TP-HC0R), with 4 nodes, each with:
2x CPU Xeon 4114
192GB RAM
4-port 10Gb SFP+
2x 128GB SSD SATA DOM for the OS
6x Intel D3-S4510 960GB for Ceph

Why do you also have 4x 1Gb ports? My intention is to use 2x 10Gbit for Ceph and 2x 10Gbit for the Internet. I will also buy 2 low-latency switches for the network.

I asked Supermicro to ship the LSI 3008 in IT Mode, since it can also be configured for (software) RAID:

https://www.supermicro.com/en/products/storage/cards

I don't know if it is really necessary.
 
Just FYI, 3 nodes is considered OK for a lab but not for production for a Ceph cluster. The reason is simple: taking one node out will render your cluster read-only. At MINIMUM you should have 4 for production.
 
I did not select the SATA DOM for the OS, because I read this: https://www.supermicro.com/datasheet/datasheet_SuperDOM.pdf (see the "Use Cases not recommended" section).

I don't use a switch for the 10Gb Ceph and corosync traffic; I run it as a mesh, with every connection made of two bonded cables for higher reliability. That is:
NODE1-P1/2 ==> NODE2-P3/4
NODE2-P1/2 ==> NODE3-P3/4
NODE3-P1/2 ==> NODE1-P3/4

The 4x 1Gb ports are for VM traffic and backups, connected to two stacked switches (two cables to each).
That's the theory; I hope that trunking across stacked switches works as expected.
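
In case it helps anyone copying this layout, here is a minimal sketch of the broadcast-bond variant of a 3-node full mesh on one node, in the spirit of the Proxmox "Full Mesh Network for Ceph" wiki approach. The interface names and addresses are invented, and it shows only one cable towards each neighbour - the doubled-cable layout above would need a different bonding arrangement - so treat it purely as a starting point:

Code:
# /etc/network/interfaces fragment on NODE1 (NODE2/NODE3 would use .2/.3)
auto bond1
iface bond1 inet static
        address 10.15.15.1/24
        bond-slaves enp179s0f0 enp179s0f1   # one 10Gb port cabled to NODE2, one to NODE3
        bond-mode broadcast
        bond-miimon 100
# point the Ceph cluster/public network in ceph.conf at 10.15.15.0/24

In broadcast mode every frame goes out of both slaves and the node that isn't the destination simply drops it, which is why no switch is needed.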
 
alexskysilk,
that's not what I've read and experienced. My lab consisted of two Ceph nodes, with one copy on each; the pool was 2/1. When I shut down one node, the VMs were migrated (if in HA) and everything kept working. Recovery was then a pain in the ass (old hardware, only two 1Gb NICs and three 7200rpm SATA disks per node).

With three nodes and one data copy on each, it can keep working (both Ceph and corosync) with one node down.
If I set the pool to 3/1, the cluster can keep working even with TWO nodes down. That's not advisable, though; it could be a last resort to revive the system manually if disaster strikes.

With four nodes, I would need to keep three up for quorum.
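
For readers following along, the 2/1 and 3/1 above are the pool's size/min_size - the number of replicas, and how many of them must be available for the pool to keep serving I/O. A quick sketch of how to inspect and change them (the pool name "rbd" is a placeholder; 3/2 is the usual production default):

Code:
# current replica count and write threshold for a pool
ceph osd pool get rbd size
ceph osd pool get rbd min_size

# the common production setting: 3 copies, keep serving I/O while 2 are up
ceph osd pool set rbd size 3
ceph osd pool set rbd min_size 2

Dropping min_size to 1 is what keeps a pool writable with a single surviving copy, at the cost of the data-loss exposure being discussed here.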
 
3 nodes is considered OK for a lab but not for production for a Ceph cluster. The reason is simple: taking one node out will render your cluster read-only

No, why?
 
By my understanding, Ceph requires quorum to be authoritative down to the PG layer. Consequently, the minimum number of members required for a valid, authoritative PG is 3. With only three nodes it is not possible to have 3 active chunks in the PGs when only 2 nodes are active, and the default replication rule of 3/2 will disable write access.
 
In a default three-node Ceph cluster, you have all data on all three nodes, so if one host is down you still have 2 replicas, which is enough.

2 hosts represent 66.66% of the cluster (more than 50%), so quorum is kept.
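
If you want to verify this while pulling a node, both quorums can be watched from any surviving node with standard commands (nothing here is specific to this hardware):

Code:
# corosync / Proxmox cluster quorum
pvecm status

# Ceph health, monitor quorum and any degraded PGs
ceph -s
ceph quorum_status --format json-pretty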
 
Fair enough, although there is the outside chance of a delta between the two chunks in a PG, and absent a third for quorum the PG will be taken offline. Fine for a lab, but an unnecessary risk for production, since the relative cost of another node is low enough to make the deployment irresponsible otherwise.
 
Thank you for the heads-up on this problem. I didn't find any documentation on it; if you have a link it would be appreciated.
Calling a three-node deployment "irresponsible" still seems a bit strong to me, though: the normal situation is with all three nodes up, and thus the PG quorum is respected.

It's like saying that deploying a server with a three-disk RAID-5 is irresponsible: if it is, I live in a world of irresponsible people...
 
Calling a three-node deployment "irresponsible" still seems a bit strong to me, though: the normal situation is with all three nodes up, and thus the PG quorum is respected.

The issue isn't the number of nodes; the issue is the exposure when degraded. The RAID5 analogy is apt here because it suffers from the exact same issue: yes, the pool continues to function when a disk drops, BUT YOU NO LONGER HAVE PARITY, and will not until the pool has been restored to healthy. What that means is that any media, bus, or host error cannot be trapped, and your data is suspect, as it can be corrupt or broken; in the case of a simple RAID5 you wouldn't even know.

The "irresponsible" adjective is a function of your role in the design of the system. Since your customer is paying you to put this together, and you knowingly design a system with a hole in it you are being irresponsible by definition. Your customer can decide they'd rather not invest on curing the defect and live with the risk, but that is his decision to make.

It's like saying that deploying a server with a three-disk RAID-5 is irresponsible: if it is, I live in a world of irresponsible people...

To quote my mom, just because Billy jumps off a cliff doesn't mean you should too. RAID5 is absolutely irresponsible for anyone who cares about their data, since the cost of removing the risk is ONE DRIVE (RAID6).
 
