My first Proxmox (and ceph) design and setup

Pura-Vida

New Member
Feb 11, 2024
Hi,

I plan to set up a 2-node cluster, with each node having exactly the same configuration: 1 CPU with 12 cores, 64 GB RAM, 5 x GbE, 2 TB NVMe SSD, and 500 GB SATA-6 SSD.

Network:
  • One GbE will be dedicated to Proxmox Virtual Environment (PVE) administration (on a dedicated VLAN) and connection to the internet for updating the PVE cluster (on another dedicated VLAN).
  • Two GbE ports will be LACP aggregated and dedicated to VM applications/user workflows (VLANs: users, applications, printers, internet, user NAS, VM administration, sandbox, DMZ) and will never be used by physical nodes.
  • Two GbE ports will be LACP aggregated and dedicated to Ceph storage, VM live migration, and dedicated PVE usage. I assume that these VLANs do not need to be accessible by VMs.
  • The network configuration is already up and running on Layer 3 switches and routers:
    • VLANs <1000 are dedicated to physical components of the infrastructure (servers, routers, switches, physical firewalls),
    • VLANs 1000-1999 for virtualization infrastructure needs (Ceph storage, VM live migration, heartbeat),
    • VLANs 2000-3999 for standard VMs,
    • VLANs 4000-4094 for homelab/POC.

Storage:
  • The PVE system will be installed on the 500 GB SSD, while VMs will use Ceph storage hosted by the NVMe storage.
Others:
  • A cluster of dedicated servers (not included in this thread) will host a FreeIPA instance.

Questions:
  1. Is this design okay?
  2. What changes would you suggest?
  3. Would you recommend using Ceph storage even if there are only 2 nodes to clusterize? (In July, 5 other nodes with 25 GbE SFP+ instead of GbE will join this cluster; the low-cost cluster will host non-critical VMs, while the new nodes will host production and network-intensive tasks.)
  4. How should the network be configured to fulfill this specification?


    Sincerely,

    Fred.
 
The design is not OK; it sounds like you should wait until July, when you can build with all available nodes. Seven nodes is an OK number to start.

Three hosts are the absolute minimum for any kind of cluster. In my opinion the minimum for production Ceph is 5 nodes. The fault tolerance offered by a system is directly related to the size of the system; 3 is OK for testing, but too small for production. 2 nodes is not a cluster at all, don't waste your time.

You should never use SATA SSDs or 1 GbE network interfaces. Even if you had 100 of them aggregated together, it would not matter; 10 GbE is the bare minimum for Ceph. 25 GbE is frankly pretty low too. I would go for 100 GbE at this point.

If you are on a budget, secondhand 40 GbE equipment is very cheap, costs about the same as 10 GbE.

Ceph overhead increases sharply as cluster size decreases. People who are new to Ceph do not understand how high it is. You only get good performance with a wide cluster, and even then, the overhead is high.

You can't compromise on cluster width, you can't compromise on storage, you can't compromise on network, and you can't compromise on rep size and min_size. Overhead and physical requirements are very high to meet peoples' expectations of how Ceph should behave and perform.

For instance, on a 26-node 2x 40 GbE cluster with NVMe storage, it takes 7-8 GB/s of throughput at the physical layer to deliver 2 GB/s of throughput at the production layer.
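That replication math can be sketched in a couple of lines of shell (illustrative numbers only; this assumes a replicated pool of size 3 and ignores journaling, metadata, and recovery traffic, which push the real figure higher):

```shell
# Every client write is persisted rep_size times across the OSDs,
# so the backend must absorb at least rep_size x the client throughput.
rep_size=3
client_mbs=2000                         # target client-side writes, MB/s (2 GB/s)
backend_mbs=$((client_mbs * rep_size))
echo "${backend_mbs} MB/s of backend writes to deliver ${client_mbs} MB/s to clients"
```

Replication alone already puts you at 6 GB/s; add journaling and metadata and you land in the 7-8 GB/s range quoted above.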
 
Thanks a lot,

We are a non-profit association, and servers, routers, and equipment are donated by enterprises. Therefore, our budget will never permit us to purchase extra servers or components to achieve the configuration of either a high-performance or medium-sized cluster.

Our main mission is to provide computing infrastructure to high-potential pre-engineering students who do not have the budget to realize their projects.

We have CNC, 3D printing, and laser engraving/cutting, and we work with forged carbon, aluminum, wood modeling, etc.
We are in the process of migrating our low-cost servers to virtualization for better management and efficiency.

Given our specifications and context, do you suggest that we cancel the project, switch to something else, or at least freeze it until July?
I understand that a 2-ultra-light-node cluster is not adequate to start our configuration, training, and migration of our physical infrastructure. But it is always better than nothing, unless 2 nodes is below the mandatory prerequisites.

If you have a better solution that lets us move forward rather than backward, I would appreciate your assistance.

Sincerely.
 
I understand that a 2-ultra-light-node cluster is not adequate to start our configuration, training, and migration of our physical infrastructure. But it is always better than nothing, unless 2 nodes is below the mandatory prerequisites.
Ceph cannot run with 2 nodes, so that's out (and with 3 you don't have redundancy). A two-node cluster is possible with a QDevice for the third vote: https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_qdevice_technical_overview . ZFS allows for quick replication between the two nodes but maybe you don't even need that as you can also move VMs/containers between nodes without it. Maybe just try it and find out if it matches your expectations?
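For reference, the QDevice setup described in that manual section boils down to a few commands (a sketch only; it assumes a Debian-based third machine reachable from both nodes, and the placeholder address is for you to fill in):

```shell
# On the external third machine (any small always-on box):
apt install corosync-qnetd

# On every PVE cluster node:
apt install corosync-qdevice

# On one PVE node, register the external vote:
pvecm qdevice setup <QDEVICE-IP>

# Verify the extra vote is counted:
pvecm status
```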
 
I would not do any kind of cluster based on what you describe.

Let your engineering students handle an infrastructure of disaggregate, heterogeneous, borderline junk hosts. This kind of environment will give them the best preparation for many of the situations they will encounter in the early stages of their IT career.
 
Thank you very much @leesteken, I'm going to read your documentation.

So, to check my understanding: the network segmentation I had previously designed, with 2 groups of LACP aggregation, may no longer be necessary now that I have decided to abandon Ceph for storage. Should I update my L2/L3 network configuration to a single 4 x GbE LACP aggregation per Proxmox server instead?

I have some very low-end SBCs that I was using for a witness appliance. Should I repurpose one of these for such a role in a Proxmox configuration?
 
Thank you very much @leesteken, I'm going to read your documentation.
It's not mine, it's the official manual, which can also be reached locally on a Proxmox using the Help button.
So, to check my understanding: the network segmentation I had previously designed, with 2 groups of LACP aggregation, may no longer be necessary now that I have decided to abandon Ceph for storage. Should I update my L2/L3 network configuration to a single 4 x GbE LACP aggregation per Proxmox server instead?

I have some very low-end SBCs that I was using for a witness appliance. Should I repurpose one of these for such a role in a Proxmox configuration?
Sorry, but I don't know most of those technical terms. I'm just a stranger on the internet who runs Proxmox as a hobby and can't really advise on serious redundant (and performant) enterprise setups. It sounds like you'll be experimenting and tinkering with it a lot (instead of outsourcing it to another company), and I think that's also the way to start getting experience with it.
 
The cluster with 5 nodes at 25 GbE should be OK for Ceph, assuming they are not heavy on I/O except when deploying new VMs, and Ceph is on its own ports, separate from the cluster traffic. It doesn't sound like traffic will be heavy very often.
 
Best practices for a hyperconverged configuration (VLAN numbers are just for separation, you can use whatever you like):

v0: management
v1: facing your external network (can be combined with v0)
v2: corosync ring0 (low latency)
v3: corosync ring1 (low latency)
v4: ceph private network (access to all ceph service nodes, high bandwidth)
v5: ceph public network (access to all ceph guests, high bandwidth)

IDEALLY, all networks except v2 and v3 should have link redundancy: LACP if you have switch support, active/backup if you don't. Each link in a pair should be connected to separate switches, with v2 and v3 on different switches from each other as well.
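As a concrete sketch, an LACP pair on a PVE node looks like this in /etc/network/interfaces (the interface names eno1/eno2, the address, and the VLAN-aware bridge are assumptions to adapt to your hardware):

```
auto bond0
iface bond0 inet manual
    bond-slaves eno1 eno2
    bond-miimon 100
    bond-mode 802.3ad
    bond-xmit-hash-policy layer3+4

auto vmbr0
iface vmbr0 inet static
    address 192.0.2.10/24
    gateway 192.0.2.1
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 2-4094
```

For active/backup instead of LACP, swap `bond-mode 802.3ad` for `bond-mode active-backup` and drop the hash policy line.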

Yes, you may combine functions logically on the same pair of physical links, but bear in mind that when the links become loaded you can lose functionality due to contention and, worse, have the node fenced. Combining v4 and v5 is also commonly done, but realize you'd be cutting your bandwidth in half when doing so, since OSD traffic will be duplicated (once to the guests, once to assure service function). Make sure that at LEAST one of your corosync interfaces is not competing with any other function.
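The two corosync rings from the list above are just two link addresses per node in /etc/pve/corosync.conf (a sketch; the node names and 10.10.x.x subnets are assumptions):

```
nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.2.1   # v2, dedicated low-latency network
    ring1_addr: 10.10.3.1   # v3, on a different switch
  }
  # further nodes analogous: ring0_addr 10.10.2.x, ring1_addr 10.10.3.x
}
```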

Now for the big question: how much bandwidth do I need? That's really a question for your "clients." Having a lot of bandwidth doesn't guarantee a speedy user experience, but having too little guarantees you won't get one. You need to draw out what your needs are and define an "absolute minimum" performance floor, and that's in terms of IOPS, not MB/s. As a rule of thumb, you want, at minimum, 2 links of 10 Gb/s+ for adequate Ceph performance (4 with redundancy); if you will have a lot of fast OSDs, that requirement increases in order to get any benefit.
 
Don't really want to hijack the thread, but have there been any actual tests comparing 10 dedicated links against 10+ VLANs bonded over 10 links (or another even split, like 4 vs 4), especially if you set up custom tc priorities on the bonds for low latency? Clearly dedicated links can make diagnosing bottlenecks easier, but bonding should allow heavy consumers like Ceph to use additional bandwidth, and if you set up traffic control you should be able to keep it from slowing down corosync.
 
Oops! Thanks @alexskysilk,
I understand your naming convention.
Someone yesterday wrote that the Ceph implementation has to be postponed until July, since I will only be able to reach 5 nodes in my cluster at that point.
So, as I have only 2 nodes to enroll into the cluster configuration, with 5 x GbE interfaces each (10/25 GbE interfaces will land in July), what do you suggest I do?
This is exactly the reason we are doing a POC on 2 tiny servers (to be honest, "1U noname" servers) before moving our manufacturing hardware management infrastructure to PVE in July. We are a non-profit organization, just an association of professionals giving part of their free time and energy to students.
We deliver services, resources, and experience in manufacturing, design, dev, CAD, AI, maths, ...
 
So, as I have only 2 nodes to enroll into the cluster configuration, with 5 x GbE interfaces each (10/25 GbE interfaces will land in July), what do you suggest I do?
With only two nodes, you need to consider that you need a third vote to provide quorum, and something to provide shared storage services. This would not make a good POC for a Ceph cluster; you need to define what you intend to accomplish with this interim step. If you just want to see how PVE works, one node with one interface would do.
 
I manage a three-node cluster with Ceph (10 GbE) in a production environment, and it works very well for us. But our Ceph is not used for general data storage, only VM disks for high availability. As others have stated, it's all about use-case. Our deployment is not designed for user data storage, so we make do just fine with 10 GbE. We run a host of domain services, remote connection brokers, SFTP, network monitoring tools, domain controllers, and quite a few license servers. While many of these are quite modest VMs, some see decent compute and bandwidth, such as our brokers. It isn't designed for many virtual desktops, but we do have a handful in operation for various tasks.

So don't be scared away by other people's caution against a modest deployment! Again, it's all about use-case and expectations.

Also of note: you can use SATA SSDs, but they need to be enterprise-grade with onboard power-loss protection to be of any use in Ceph. We learned this the hard way... I have a stack of nearly dead consumer Crucials that lasted less than a year. Replaced with Kingston DC500/600s, and they're still at 0-1% wear after months (the Crucials would probably have been around 30% wear as a point of comparison).
 
Are your VMs critical? How does your environment behave when you are down 1 node? Have you ever tested running that way for a week or two? Have you ever tested a whole node recovery?
 
How does your environment behave when you are down 1 node? Have you ever tested running that way for a week or two?
I can't speak to the OP's experience, but I can tell you Ceph operates completely fine (normally) with a node missing, even in a 3-node environment. The issue with that on a "long term" basis is that you're operating without a safety net: if a remaining node gets fenced, or an OSD goes out, you'll end up with a read-only filesystem at best, and at worst stuck in-flight operations that may be lost. Whole-node recovery works fine as well; it's just that you may need to fine-tune your OSD rebuild variables to not impact your host side too much. That really depends on what type of drives you're using and what type of connectivity you have for your Ceph public and private interfaces.
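The rebuild variables in question are the recovery/backfill throttles. A common way to soften recovery impact on client I/O looks like this (a sketch only; these option names apply to recent Ceph releases, and the right values depend on your drives and network):

```shell
# Throttle backfill/recovery so client I/O keeps priority during a rebuild
ceph config set osd osd_max_backfills 1
ceph config set osd osd_recovery_max_active 1
ceph config set osd osd_recovery_sleep 0.1   # seconds to pause between recovery ops
```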
 
Are your VMs critical? How does your environment behave when you are down 1 node? Have you ever tested running that way for a week or two? Have you ever tested a whole node recovery?

@alexskysilk is correct. I have not intentionally run the cluster a node down for an extended period, but during maintenance periods, Ceph operates without issue when a node is offline. You do need to learn the proper procedure for shutting down a Ceph node, with the applicable flags set to disable auto-recovery functions if it's only a quick reboot, say, after kernel/BIOS updates or OS disk replacements. Otherwise, no problem for our use-case. And, yes, all mission-critical services run on our HA cluster.
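For a quick reboot, the flags referred to above are typically set like this (a sketch of the common pattern):

```shell
ceph osd set noout        # don't mark OSDs out and start re-replicating
ceph osd set norebalance  # hold off data movement while the node is down
# ...reboot the node and wait for it to rejoin...
ceph osd unset norebalance
ceph osd unset noout
```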

We haven't experienced any major faults - knock on wood! - that caused an entire node to be offline for an extended period. The worst I dealt with was a faulty HBA cable that shorted and caused HBA pins to fuse together. It wasn't enough damage to break the system, but enough to cause significant I/O errors. During the extensive diagnostic period and eventual hardware replacement, I kept Ceph up and running on this node but migrated off all guest VMs to the other two members in the cluster. Fortunately, we had just enough local CPU/RAM resources to take the additional load (after shutting down some non-essential guests).

I have not had a catastrophic whole node failure - knock on wood! - that required a full rebuild. I have had OS disk failures, ceph osd replacements, RAM upgrades and firmware updates in addition to the above hardware failure. Ceph was flexible and resilient throughout every maintenance period.

I also highly recommend taking out a support license. Prox support is excellent.
 
