New 3-node cluster suggestion

Stefano Giunchi

I'm about to build a new, small and general purpose cluster.
The selected hardware is this:
SuperMicro TwinPro (2029TP-HC0R), with 3 nodes, each with:
1 Xeon Scalable CPU (P4X-SKL3106-SR3GL)
64GB RAM DDR4-2666 (MEM-DR432L-CL01-ER26)
4-port 10Gb SIOM (AOC-MTG-I4TM-O) for Ceph traffic (mesh)
4-port 1Gb (AOC-UG-I4)
6 Intel D3-S4510 480GB (SSDSC2KB480G8) SATA/6Gbps
1 SFT-OOB-LIC license for advanced IPMI management features

The cluster will be used for a few virtual machines and their data; the main goal is high availability, not extreme performance.
Do you see any problems with this configuration?

Thanks
 
Without knowing the details of your "few virtual machines and data" it's difficult to give configuration recommendations. But in my opinion, you could increase the RAM to 96 or 128GB - it won't go to waste.
 
It will be 6 VMs to start, 4 Windows and 2 Linux, with no more than 40GB used by the VMs. Your recommendation to increase the RAM is valid, but that's easily upgradeable later. My main concern is the SATA SSDs - has anyone used these for Ceph?
 
It will be 6 VMs to start, 4 Windows and 2 Linux, with no more than 40GB used by the VMs.
Look at this topic - https://forum.proxmox.com/threads/memory-usage-on-empty-node-with-ceph.50760/

You have 6 disks, so most likely you will run 5-6 OSDs on each node - roughly 20-24GB of RAM will be used by Ceph alone.
I can add that I have a Proxmox 5.3 test cluster with 3 virtual machines on each node (2 Linux and 1 Windows, 2GB allocated each, using about 1.3GB in practice) and 6 Ceph OSDs - the total memory load on each node is about 32GB.
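
If RAM ever gets tight, the BlueStore OSD memory target can be lowered from its 4GB default. A minimal ceph.conf sketch, assuming a BlueStore setup on Luminous or newer (the 3GB value is just an example, and lowering it costs OSD cache performance):

    # ceph.conf (on Proxmox this lives in /etc/pve/ceph.conf)
    [osd]
        # cap each BlueStore OSD at roughly 3GB instead of the 4GB default
        osd_memory_target = 3221225472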
 
Hi Stefano,

I have just bought the same configuration (Supermicro Twin 2029TP-HC0R) with the intention of running a Proxmox+Ceph cluster. Have you already installed on the new hardware, and does everything work fine? Did you ask Supermicro to set the LSI 3008 to IT Mode?

Thanks
 
Hi,
I still haven't bought it; I just got the OK from the customer and I think we'll have it in a couple of weeks.

I haven't read anything about needing the LSI flashed to IT mode/passthrough, since it's not a real RAID card. Do you have a link about that?
Please let's both keep this thread updated, I think it will be useful.

What's your hardware? Mine will be this:

SuperMicro TwinPro (2029TP-HC0R), with 3 nodes, each with:
1 Xeon Scalable CPU (P4X-SKL3106-SR3GL)
64GB RAM DDR4-2666 (MEM-DR432L-CL01-ER26)
4-port 10Gb SIOM (AOC-MTG-I4TM-O) for Ceph traffic (mesh)
4-port 1Gb (AOC-UG-I4)
2 Intel D3-S4510 240GB (SSDSC2KB240G8) SATA/6Gbps (RAID1, system)
4 Intel D3-S4510 960GB (SSDSC2KB960G8) SATA/6Gbps (JBOD, Ceph)
1 SFT-OOB-LIC license for advanced IPMI management features

Stefano
 
Stefano, I bought mine yesterday :)

2 SuperMicro TwinPro (2029TP-HC0R), with 4 nodes, each with:
2 Xeon 4114 CPUs
192GB RAM
4-port 10Gb SFP+
2 128GB SATA DOM SSDs for the OS
6 Intel D3-S4510 960GB for Ceph

Why do you also have the 4-port 1Gb card? My intention is to use 2x 10Gbit for Ceph and 2x 10Gbit for Internet. I will also buy 2 low-latency switches for the network.

I asked Supermicro to ship the LSI 3008 in IT Mode, since it can also be configured for (software) RAID:

https://www.supermicro.com/en/products/storage/cards

I don't know if it is really necessary.
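
Once the nodes arrive, the firmware mode of the on-board SAS3008 can be checked from the OS with Broadcom's sas3flash utility (assuming it is installed; it is not part of a stock Proxmox install):

    # list all SAS3008 controllers; the firmware product ID / version
    # string reported here shows whether IT or IR firmware is running
    sas3flash -listall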
 
Just FYI, 3 nodes is considered OK for a lab but not for production for a Ceph cluster. The reason is simple: taking one node out will render your cluster read-only. At MINIMUM you should have 4 for production.
 
I did not select the SATA DOM for the OS, because I read this: https://www.supermicro.com/datasheet/datasheet_SuperDOM.pdf (see the "Use Cases Not Recommended" section).

I don't use a switch for the 10Gb Ceph and corosync traffic; I run it as a mesh, with every connection made of two bonded cables for higher reliability (a rough config sketch is below). That is:
NODE1-P1/2==>NODE2-P3/4
NODE2-P1/2==>NODE3-P3/4
NODE3-P1/2==>NODE1-P3/4

The 4 1Gb ports are for VM traffic and backup, connected to two stacked switches (two cables each).
This is the theory; I hope that trunking across the stacked switches works as expected.
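
As a rough sketch, one way to wire this up on Proxmox is the broadcast-bond variant of a full mesh: all four 10Gb mesh ports go into a single broadcast bond and each node gets one IP on a shared subnet. Interface names and addresses below are only placeholders:

    # /etc/network/interfaces on node 1 (mesh portion only)
    auto bond0
    iface bond0 inet static
        address 10.15.15.1
        netmask 255.255.255.0
        bond-slaves enp1s0f0 enp1s0f1 enp1s0f2 enp1s0f3
        bond-miimon 100
        bond-mode broadcast
    # node 2 and node 3 use 10.15.15.2 and 10.15.15.3 on the same bond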
 
alexskysilk,
that's not what I've read and experienced. My lab consisted of two Ceph nodes, with one copy of the data on each; the pool was 2/1. When I shut down one node, the VMs were migrated (if in HA) and everything kept working. Recovery was then a pain in the ass (old hardware, only two 1Gb NICs and three 7200rpm SATA disks per node).

With three nodes and one data copy on each, it can keep working (both Ceph and corosync) with one node down.
If I set the pool to 3/1, the cluster could keep working even with TWO nodes down. That's not advisable, though; it could be a last resort to bring the VMs back up manually if a disaster occurs.

With four nodes, I need to keep three up for quorum.
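
For reference, the size/min_size of an existing pool can be checked and changed at any time with the standard Ceph commands (the pool name "rbd" below is just an example):

    # show the current replication settings
    ceph osd pool get rbd size
    ceph osd pool get rbd min_size
    # 3/2: keep three copies, allow I/O while at least two are available
    ceph osd pool set rbd size 3
    ceph osd pool set rbd min_size 2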
 
3 nodes is considered OK for a lab but not for production for a Ceph cluster. The reason is simple: taking one node out will render your cluster read-only

No, why?
 
By my understanding, Ceph requires quorum to be authoritative down to the PG layer. Consequently, the minimum number of members for a valid, authoritative PG is 3. With only three nodes it is not possible to have 3 active copies in the PGs when only 2 nodes are active, and a default replication rule of 3/2 will disable write access.
 
In a default three-node Ceph cluster, you have all the data on all three nodes, so if one host is down you still have 2 replicas, which is enough for a 3/2 pool to keep serving writes.

2 hosts represent 66.66% of the cluster (more than 50%), so quorum is also maintained.
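
Both things can be checked directly on a running cluster with the usual status commands, e.g.:

    # overall health, including which monitors are in quorum
    ceph -s
    # detailed monitor quorum information
    ceph quorum_status --format json-pretty
    # summary of PG states (active+clean, degraded, ...)
    ceph pg stat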
 
Fair enough, although there is the outside chance of a delta between the two remaining copies in a PG, and absent a third copy to arbitrate, the PG will be taken offline. Fine for a lab, but an unnecessary risk for production, since the relative cost of another node is low enough that skipping it makes the deployment irresponsible.
 
Thank you for the heads-up on this problem. I didn't find any documentation on it; if you have a link it would be appreciated.
Calling a three-node deployment "irresponsible" still seems a bit strong to me, anyway: the normal situation is with all three nodes up, and thus the PG quorum is respected.

It's like saying that deploying a server with a three-disk RAID-5 is irresponsible: if it is, I live in a world of irresponsible people...
 
Calling a three-node deployment "irresponsible" still seems a bit strong to me, anyway: the normal situation is with all three nodes up, and thus the PG quorum is respected.

The issue isn't with the number of nodes; the issue is with exposure when degraded. The RAID5 analogy is apt here because it suffers from the exact same issue: yes, the pool continues to function when a disk drops, BUT YOU NO LONGER HAVE PARITY, and will not until the pool has been restored to healthy. That means any media, bus, or host error cannot be trapped, and your data is suspect, as it can be corrupt or broken; in the case of a simple RAID5 you wouldn't even know.

The "irresponsible" adjective is a function of your role in the design of the system. Since your customer is paying you to put this together, and you knowingly design a system with a hole in it you are being irresponsible by definition. Your customer can decide they'd rather not invest on curing the defect and live with the risk, but that is his decision to make.

It's like saying that deploying a server with a three-disk RAID-5 is irresponsible: if it is, I live in a world of irresponsible people...

To quote my mom: just because Billy jumps off a cliff doesn't mean you should too. RAID5 is absolutely irresponsible for anyone who cares about their data, since the cost of removing the risk is ONE DRIVE (RAID6).
 