Proxmox server setup for a business - 3-node cluster - suggested storage type

Sarlis Dimitris

Well-Known Member
Oct 19, 2018
Good day to all,

I am setting up a 3-node cluster in my company for HA and redundancy. My experience with Proxmox is quite OK, but the major issue now is which storage type to use.

My requirements are as simple as it gets: I need the storage to be expandable, with the best possible setup for migration and restoring (in case of failure).

In a previous project, I built the 3-node setup with Ceph/OSD to be able to use HA and migration, but it did not work as I wanted.
The machines were set up correctly, the VM disks were located in the Ceph pool, and the monitor was OK, but then one server (HP DL360 G8) had a fan failure that was flagged as critical, so the server never came up (it kept rebooting). The VMs were transferred automatically to the 2nd node, but they could not load properly. Even after rebooting the VMs, they were not operational.
I had to restore my backups to local LVM to get working again.

Anyhow, what I am trying to say is that I would like your opinion on the best-case scenario regarding storage and the ability to expand easily.

My setup uses hardware RAID: the first array is RAID1 for the Proxmox OS, and the second array is RAID5 with 4x 1.92 TB Kingston DC600 drives.
Now to the point:
If I wish to add 2 more disks and expand the RAID5 in my PVE, how is this possible? What type of data storage should I use? Should I go with OSDs?

Maybe it is better to use network storage from the very beginning, without Ceph/OSD? Like iSCSI or NFS. Will this reduce overall speed?

Any other ideas?

I will be really happy to answer any questions you might have about this post, to help me make the best possible decision.

Thank you all
 
Ceph is the way to go, but it has to be done "right". Get professional support. The behavior you describe is not normal for Ceph.
 
Hi @Sarlis Dimitris ,
It’s reassuring to hear “this is what you should do,” but reality is rarely that simple. There are companies running multi-petabyte Ceph clusters without issues, while others have had their "weekends ruined" by Ceph problems. Some successfully use iSCSI or NFS, while even high-end vendor SANs have caused billion-dollar companies to lose millions due to downtime.

You shouldn't use RAID with Ceph, as its primary feature is built-in data protection.

Consider your budget, rack space, power capacity, and support capabilities. Assess your business’s tolerance for data unavailability and whether management is willing to invest in proper support.

All the technologies you mentioned are valid choices - given the right investment of time and money.

Cheers.


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
@bbgeek17 thanks for your answer, as well as @itNGO.

So regarding the point you mentioned about not using RAID with Ceph, what should I do? Add individual disks to the system?
If I understand correctly, Ceph with OSDs is like RAID but with servers instead of disks, am I right?

Can you please describe a possible setup with Ceph and 3 nodes?
In case we need to stick with only 3 nodes without adding servers, is OSD data expansion a possible solution? Because I presume that in this case I must already have my disks added, sized and calculated for the upcoming years...

Lastly, is the fault tolerance of Ceph/OSD higher than RAID's? Do I just need to power off the server, add the disk, and then build it into Ceph?

@itNGO, by professional support do you mean a good subscription for the Proxmox servers? Or handing my setup over to a company to help me out?
Because in Greece there are not many companies with Proxmox knowledge.
 
These links may be helpful:

https://www.ibm.com/docs/en/storage-ceph/7?topic=hardware-avoid-using-raid-san-solutions
https://www.youtube.com/watch?v=7BcSnUz_2zQ
https://docs.ceph.com/en/reef/rados/operations/add-or-rm-osds/
https://www.reddit.com/r/Proxmox/comments/187h33f/how_many_nodes_can_fail_in_a_ceph_cluster/


 
@itNGO, by professional support do you mean a good subscription for the Proxmox servers? Or handing my setup over to a company to help me out?
Because in Greece there are not many companies with Proxmox knowledge.
Ceph needs the disks directly attached, without any RAID ("JBOD"). The redundancy level is then configured inside Ceph (3 copies, with at least 2 online, in a 3-node cluster).
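To put rough numbers on that: with size=3 replication, Ceph keeps one copy per node, so usable capacity is about one third of raw. A quick sketch using the 4x 1.92 TB drives per node mentioned in this thread (real-world numbers will be lower, since Ceph has overhead and pools should stay well below full):

```shell
# Rough capacity estimate: 3 nodes, 4x 1.92 TB OSDs each, replicated size=3.
# Ceph warns at "nearfull" (85% by default), so plan to use even less.
raw_tb=$(awk 'BEGIN{printf "%.2f", 3 * 4 * 1.92}')
usable_tb=$(awk 'BEGIN{printf "%.2f", (3 * 4 * 1.92) / 3}')
echo "raw: ${raw_tb} TB, usable (size=3): ${usable_tb} TB"
```

So roughly 23 TB raw gives under 8 TB of usable space before overhead, which is worth knowing before comparing against a RAID5 layout.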

About support... there are companies outside Greece that do this, and everything is possible online and remotely. Having a Proxmox subscription is always a good add-on for when you need support in case of a failure.
 
Good morning to all,
Thank you all for your posts.

Still, I have this question:
How easy will it be if I need to expand our storage using Ceph? Do I need to "build" another node with extra storage and add it to Ceph?

I do understand and endorse the Ceph model as far as reliability, tolerance & redundancy are concerned, but what about storage expansion?

There was a previous post from bbgeek17 mentioning not to use RAID with Ceph. Why is that? Because we already have fault tolerance built into Ceph?

If I decide to go with SAN storage instead, what is the preferred setup?
 
There are two ways:
1. adding more disks to the current nodes.
2. adding more nodes with more disks.

The second one is always better.
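For option 1, Proxmox wraps the Ceph tooling, so adding a disk is short. A sketch, assuming the new disk shows up as /dev/sdd (the device name is a placeholder) and is presented raw, not behind the RAID controller:

```shell
# On the node that received the new disk (run as root, needs a live cluster):
pveceph osd create /dev/sdd   # create a new OSD on the raw, empty disk
ceph osd df tree              # watch placement groups rebalance onto it
ceph -s                       # overall cluster health while backfill runs
```

Ceph starts rebalancing automatically once the OSD comes up; no downtime or reboot is needed for hot-swap bays.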
 
How easy will it be if I need to expand our storage using Ceph? Do I need to "build" another node with extra storage and add it to Ceph?
I do understand and endorse the Ceph model as far as reliability, tolerance & redundancy are concerned, but what about storage expansion?

Basically, Ceph works like a software RAID controller spanning several nodes. That means no extra layer (hardware RAID) should sit between the drives and the OS.

Ceph automatically uses all OSDs for data and redundancy. As soon as you add more OSDs, Ceph will rebalance the data onto them for optimal redundancy, depending on the rules you have set up (the minimum number of OSDs/nodes needed to operate). For example, you can have 5 nodes but need only 3 to operate. If the number of online nodes falls to 2, the system switches to read-only mode. The minimum is 3 nodes with 2 nodes alive, similar to a traditional RAID5 setup.
 
I built the 3-node setup with Ceph/OSD to be able to use HA and migration, but it did not work as I wanted.
Probably you hit some of the problematic pitfalls?

 
So if I understand correctly, overall and for my build, to keep all data safe and have a healthy system I need to:

- Build a cluster with at least (min) 4 servers
- Add them to the cluster & Ceph, of course
- Use 10 Gbit network cards
- Have at least 38 GB of RAM in each server for Ceph
- Have the initial disks I will use as OSDs, and then add extra disks to each server to expand capacity (any size, but preferably the same type, i.e. SSDs), and of course without adding them to hardware RAID.

Any additional points for this setup?
 
So if I understand correctly, overall and for my build, to keep all data safe and have a healthy system I need to:
One advantage of Ceph is its flexibility. The goal of my "FabU" post was to mention some aspects and pitfalls, nothing more.
- Have at least 38 GB of RAM in each server for Ceph
That "38" is the sum of the RAM of my example cluster. My point was that each and every daemon - be it OSD/MON/MGR or MDS - needs RAM for its own use.
- Have the initial disks I will use as OSDs, and then add extra disks to each server to expand capacity (any size, but preferably the same type, i.e. SSDs), and of course without adding them to hardware RAID.
Yes. The operating system itself is independent, and I prefer ZFS in a mirrored setup. How many OSDs are installed on each node is completely arbitrary - starting from zero :-)

Any additional points for this setup?
Well... I would recommend treating the first install intentionally as a test. Perhaps you'll find some aspects suboptimal for your use case. Plan time to tear everything down and start from scratch.

Both Ceph and ZFS are really, really great tools. Both are complex as soon as you look under the hood...
 
So if I understand correctly, overall and for my build, to keep all data safe and have a healthy system I need to:

- Build a cluster with at least (min) 4 servers
- Add them to the cluster & Ceph, of course
- Use 10 Gbit network cards
- Have at least 38 GB of RAM in each server for Ceph
- Have the initial disks I will use as OSDs, and then add extra disks to each server to expand capacity (any size, but preferably the same type, i.e. SSDs), and of course without adding them to hardware RAID.

Any additional points for this setup?

Carefully choose your SSDs. We had a case where non-enterprise SSDs had to be replaced to guarantee a stable setup.

In addition: give Ceph its own separate network to avoid problems during backups or other high-load situations.
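For reference, that split is configured in /etc/pve/ceph.conf; the subnets below are example placeholders, not values from this thread:

```ini
# /etc/pve/ceph.conf (fragment) -- example subnets, adjust to your network
[global]
    public_network  = 10.10.10.0/24   # client/monitor traffic
    cluster_network = 10.10.20.0/24   # OSD replication and heartbeat traffic
```

Keeping the cluster_network on its own physical links (or at least its own VLAN) means a backup saturating the public side does not starve OSD replication.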
 
Carefully choose your SSDs. We had a case where non-enterprise SSDs had to be replaced to guarantee a stable setup.

In addition: give Ceph its own separate network to avoid problems during backups or other high-load situations.
Yeap, great point about the "independent" Ceph network.

Regarding the SSDs, I am working with the Kingston DC600M series.

@UdoB yes, I totally understand; the numbers are theoretical, based on the setup. I am just using them as a reference...
 
You need at least 3 nodes, but you can attach as many as you like.

Enterprise SSDs with *real* Power Loss Protection are a must-have! For example Micron 7400 or Samsung PM9A3. PLP is important for latency - non-PLP drives will have 10-30x higher latency. You can use U.2-to-PCIe adapter cards.

Important: bind interfaces via MAC to a static name. Otherwise you will mess up the network config, as interface names are generated dynamically in the order the devices are found. That means if a network interface is named "enp36s0" or similar and you add a PCI(e) device, it may come up as "enp40s0" on the next reboot, while the config in /etc/network/interfaces still has the old name.
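One way to do that binding, as a sketch: a systemd .link file matches the NIC by MAC and fixes its name. The MAC address, interface name, and filename below are placeholders; the file belongs in /etc/systemd/network/ and takes effect on the next reboot (rebuilding the initramfs may also be needed if it includes network config):

```shell
# Sketch: write a systemd .link file that pins the NIC name to its MAC.
# MAC address and interface name are placeholders -- substitute your own.
linkfile="${LINKFILE:-10-ceph-nic.link}"  # install as /etc/systemd/network/10-ceph-nic.link
cat > "$linkfile" <<'EOF'
[Match]
MACAddress=aa:bb:cc:dd:ee:ff

[Link]
Name=enp36s0
EOF
echo "wrote $linkfile"
```

With this in place, /etc/network/interfaces can keep referring to the pinned name even after PCIe devices are added.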

Furthermore, it is smart to create separate VLANs/bridges for cluster communication; that makes it easier to keep cluster-only traffic inside the cluster.
 